Implementing an Upper Ontology for semantic support in a Biological Wiki
László van den Hoek (User:Laszlo)
September 2007
PLEASE FEEL FREE TO EDIT! Questions and remarks can be posted to the discussion page (see tab above).
Introduction
WikiProteins is an on-line terminology system with a wiki interface. Several important databases have already been imported, among them the UMLS MetaThesaurus. Each concept in the MetaThesaurus is tagged as belonging to at least one of 135 Semantic Types (ST; see Table 1), a simplified, hierarchical categorization system covering the breadth of UMLS.
Using the UMLS Semantic Types as an upper ontology for WikiProteins imposes some obvious limitations because of the inconsistencies and ambiguities they contain. To ensure extensibility and interoperability of the databases contained within WikiProteins, and to reduce ambiguity, we seek to implement a new upper ontology based on Description Logic (DL) in WikiProteins, and at the same time map the STs to it.
Choice of ontology
Upper ontologies are, by definition, not domain-specific. Because we want to implement an upper ontology and at the same time extend it downwards into the biomedical domain, the upper ontology chosen must allow a mid-level “glue” ontology to be attached seamlessly.
To avoid duplicating existing efforts, several existing authoritative upper ontologies were considered:
(Open)Cyc
Originally a research project started in 1984, Cyc is currently maintained by Cycorp, Inc., for commercial AI applications such as call-center transcript analysis, text mining and game design. A publicly available, non-commercial version, OpenCyc, was released in 2001, with subsequent updates, but the number of assertions is much lower than in the commercial version. A research version, ResearchCyc, also exists, but it requires a non-disclosure agreement, which precludes its use in a public project like WikiProteins. Cycorp has stated its intention to port all non-proprietary information from ResearchCyc to OpenCyc, but more than a year after this announcement, this has not yet happened. Assertions in Cyc are in first-order logic, with extensions for modal operators and higher-order quantification. A complicating factor is the widespread presence of reification (a construct to allow the presence of contradictory statements). Given these conditions, it is unlikely that a DL representation of Cyc could be generated. In spite of its extensive subject area, Cyc has paid very little specific attention to life sciences and medicine. Implementing the UMLS Semantic Network in Cyc would therefore likely be a difficult task.
SUMO
Like Cyc, SUMO is part of the work of the IEEE Standard Upper Ontology Working Group. The ontology is developed by Teknowledge Corp. in SUO-KIF, a variant of the Knowledge Interchange Format (KIF). While it may be possible to convert KIF in general to DL, the available OWL interpretation of SUMO is in OWL Full, not OWL DL. MILO is a copyright-encumbered mid-level ontology connecting SUMO to commercial Teknowledge domain ontologies, but “any ontology you create based on MILO or our domain ontologies is your property”. No general scientific domain ontologies are available, only two very specific ones (“biological virii” and “atomic elements”). Even if it is possible to express the Semantic Network in (SUO-)KIF, the current level of detail in MILO with regard to the biomedical domain goes no further than BiologicalProcess, which has two children: CausingHappiness and CausingUnhappiness. If this arbitrary division is representative of the orientation of SUMO/MILO, then it would appear to be unsuited to our needs.
BioTop
BioTop is an upper-to-middle ontology which, as the name suggests, is geared towards the biomedical domain. It is developed at the Universities of Freiburg and Jena, and is entirely in DL. The ontology (Fig. 1) is divided into two parts: at the upper level is the Basic Formal Ontology (BFO), authored by Barry Smith, among others, which gives it a philosophical foundation. Added below that is a relatively deep network of biological classes, intended to provide an interface to the domain ontologies contained in the OBO Foundry. Some classes appear to be a straight match with Semantic Types, some even being eponymous, like “Animal”, “Fungus” and “Virus”. In other cases, an ST may not be exactly identical to a BioTop class, but creating a new BioTop class would be unlikely to result in a better ontology; for instance, the “Biologic Function” ST is described as “A state, activity or process of the body or one of its systems or parts”, while the BioTop class “BiologicalProcess” makes no mention of bodies. UMLS probably does not consider biological processes in non-human organisms, but creating a “BiologicalProcessInHuman” class does not make much sense.
Considering the sound and compatible structure of BioTop, and the willingness of the BioTop developers to cooperate in this project, the choice to use it as a base for our upper ontology is clear. Although other biological upper ontologies (Simple Bio Upper Ontology, GFO-Bio, UBO) exist, and BioTop itself is neither complete nor free of issues, none matches its quality and extent.
Practical approach
Mapping the Semantic Types
To begin, we will try to map each Semantic Type to BioTop or, if that is impossible, determine where BioTop must be extended. A preliminary effort is under way; a scheme of the results so far is available. The mapping is facilitated by the availability of textual definitions of all ST classes (web view) and most BioTop classes. An outline of the mapping will be circulated among domain experts, soliciting comments where necessary. BioTop class membership can then be assigned to individual MetaThesaurus entities inside WikiProteins, using the same mechanism by which STs are assigned now.
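As a rough illustration of what such a mapping could look like in practice, the Python sketch below stores one record per Semantic Type and derives BioTop class membership for a concept from its STs. The record layout, the action labels ("match", "approximate", "extend") and the function names are assumptions made for this example, not part of WikiProteins or BioTop. The first four rows reflect pairings mentioned in this plan; the last row is a purely hypothetical example of a type with no counterpart yet.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mapping:
    semantic_type: str
    biotop_class: Optional[str]  # None: no suitable BioTop class exists yet
    action: str                  # assumed labels: "match", "approximate", "extend"


# Illustrative rows only. The last row is hypothetical.
MAPPINGS = [
    Mapping("Animal", "Animal", "match"),
    Mapping("Fungus", "Fungus", "match"),
    Mapping("Virus", "Virus", "match"),
    Mapping("Biologic Function", "BiologicalProcess", "approximate"),
    Mapping("Regulation or Law", None, "extend"),
]

BY_TYPE = {m.semantic_type: m for m in MAPPINGS}


def biotop_classes(semantic_types):
    """Translate a concept's STs into BioTop classes, skipping unmapped types."""
    return {
        BY_TYPE[st].biotop_class
        for st in semantic_types
        if st in BY_TYPE and BY_TYPE[st].biotop_class is not None
    }


def types_needing_extension():
    """List the STs for which BioTop would have to be extended."""
    return sorted(m.semantic_type for m in MAPPINGS if m.action == "extend")
```

A concept tagged with several STs simply receives the union of the mapped classes, and the output of `types_needing_extension()` corresponds to the rows that would be flagged for the domain experts.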
Implementation in OWL
After the Semantic Types have been mapped, creating an implementation in OWL should be possible. The most challenging part will likely be mapping the semantic relations from the frame-based UMLS Semantic Network (SN) to a DL, which is unambiguous but less expressive. The SN defines a list of permissible (and a few forbidden) relations between Semantic Types, as well as the inheritance (or non-inheritance) of these relations from a generic parent type to its more specific children. Given such a list, and a classification of concepts by ST, it can be inferred which types of relations are allowed between which concepts. A second issue might arise from the fact that concepts can belong to multiple Semantic Types and therefore inherit from multiple classes, which might lead to inconsistencies.
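The inference described here can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the hierarchy is a heavily abbreviated toy fragment rather than the real SN tree, and the data layout and function names are invented for illustration. A relation asserted between two parent types is inherited by all their descendants unless a more specific pair is explicitly blocked.

```python
# Toy fragment of a type hierarchy: child -> parent (None marks a root).
# Abbreviated for illustration; the real SN has more intermediate levels.
PARENT = {
    "Organism": None,
    "Plant": "Organism",
    "Animal": "Organism",
    "Biologic Function": None,
    "Physiologic Function": "Biologic Function",
    "Mental Process": "Physiologic Function",
}

# Relations asserted at a generic level, and specific pairs that are forbidden.
ASSERTED = {("Biologic Function", "process_of", "Organism")}
BLOCKED = {("Mental Process", "process_of", "Plant")}


def ancestors(st):
    """Yield st itself and then each of its ancestors up to the root."""
    while st is not None:
        yield st
        st = PARENT[st]


def is_permitted(subject, relation, obj):
    """True if the relation is inherited by (subject, obj) and not blocked."""
    subj_line = list(ancestors(subject))
    obj_line = list(ancestors(obj))
    # A block on the pair or on any of its ancestors overrides inheritance.
    if any((s, relation, o) in BLOCKED for s in subj_line for o in obj_line):
        return False
    return any((s, relation, o) in ASSERTED for s in subj_line for o in obj_line)
```

In this toy data, "Mental Process process_of Animal" is inherited from the generic assertion, while "Mental Process process_of Plant" is rejected by the block. For a concept carrying several STs, the same check could be run for each pairing of types; pairings that disagree are candidates for the inconsistencies mentioned above.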
Explore incompatibilities
Because the STs lack ontological rigor, concepts that differ in fundamental ways (e.g., classes versus instances) may still belong to the same ST. After the mapping to a DL version, a relation that the SN allows between two STs may therefore still hold for one pair of concepts of those types, but not for another pair of the same types, because what the ST classification scheme did not distinguish turns out to be distinct in the DL-based representation. Seeking out such cases will expose inconsistencies in UMLS, which may then be corrected.
Extend WikiData for editing support
When this topic is sufficiently explored, the WikiProteins software will be improved to allow resources such as the SN to provide feedback to wiki editors: given the knowledge of which relation types are allowed between entries marked with which ST, it should be possible to limit the number of options presented to the user when adding relations between concepts. Once this functionality is in place, similar resources such as the SwissProt annotations could be merged in, further assisting the user. From this point onwards, cases where valid relations are not suggested or cannot be added by users should be investigated and resolved by increasing the resolution and completeness of the ruleset in tandem with the upper ontology used.
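To make the intended editing support more concrete, the following Python sketch filters the relation types offered to an editor by the STs of the two entries involved. Everything in it is hypothetical: the rule table, the ST assignments and the function name are invented for illustration and do not describe the existing WikiData code or the actual contents of the SN.

```python
# Hypothetical, already-expanded rule table: (subject ST, relation, object ST).
PERMITTED = {
    ("Bacterium", "causes", "Disease or Syndrome"),
    ("Pharmacologic Substance", "treats", "Disease or Syndrome"),
    ("Mental Process", "process_of", "Mammal"),
}

# Hypothetical ST assignments for a few wiki entries.
ENTRY_TYPES = {
    "Streptococcus pneumoniae": {"Bacterium"},
    "Pneumonia": {"Disease or Syndrome"},
    "Penicillin": {"Pharmacologic Substance", "Antibiotic"},
    "Fern": {"Plant"},
    "Depression": {"Mental or Behavioral Dysfunction"},
}


def suggest_relations(subject, obj):
    """Return the sorted relation names an editor could be offered for this pair."""
    s_types = ENTRY_TYPES.get(subject, set())
    o_types = ENTRY_TYPES.get(obj, set())
    return sorted(
        {rel for (s, rel, o) in PERMITTED if s in s_types and o in o_types}
    )
```

With this toy data, an editor linking "Streptococcus pneumoniae" to "Pneumonia" would only be offered "causes", and no relation at all would be offered between "Fern" and "Depression". An empty result for a pair that editors consider valid would be exactly the kind of case that signals a gap in the ruleset.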
Summary
The tasks to be completed, and the milestones in which they result, are listed below in order:
- Map Semantic Types to BioTop classes, expand the latter to accommodate the former. This will result in a spreadsheet describing the actions to take for each Semantic Type.
- Verification by domain experts (Christine Chichester, Barend Mons, Olivier Bodenreider, Stephan Schulz, Elena Beisswanger, Ronald Cornet) of the mapping; incorporate feedback. This will be repeated until written consensus is reached on the action to take for each Semantic Type.
- When satisfied, create an OWL implementation, and import the adapted BioTop classes into WikiProteins using the existing class membership mechanism.
- Investigate inconsistencies created through the DL mapping (“Barry Smith” examples). A few examples might include:
  - "Bacteria / causes / Infection": All Bacteria cause {each/only/some} infection(s) / Some Bacteria cause {all/some} infections
  - foo
  - bar
  - suggestions are welcome
- Extend the WikiProteins software to enable suggestions/restrictions on possible relation types when adding relations to the wiki. This should prevent the addition of statements like "Ferns suffer from depression" (plants do not possess mental processes). The deliverable code, extending the WikiData extension, will be released under a GPL license.
- Explore and expand the ruleset currently defined by the UMLS Semantic Network, and merge other sources of rules where possible.
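One way to make the quantifier readings of the "Bacteria / causes / Infection" example explicit is to write them out formally. The block below is a sketch for illustration only; the class and role names are chosen here and are not taken from UMLS or BioTop. The first two readings are written as DL subclass axioms. The remaining three are given in first-order logic because they do not take the simple form of a subclass axiom between the two classes.

```latex
\begin{align*}
\text{All bacteria cause some infection:}\quad
  & \mathit{Bacterium} \sqsubseteq \exists\,\mathit{causes}.\mathit{Infection} \\
\text{All bacteria cause only infections:}\quad
  & \mathit{Bacterium} \sqsubseteq \forall\,\mathit{causes}.\mathit{Infection} \\
\text{All bacteria cause each infection:}\quad
  & \forall x\,\forall y\,\bigl(\mathit{Bacterium}(x) \land \mathit{Infection}(y)
      \rightarrow \mathit{causes}(x,y)\bigr) \\
\text{Some bacteria cause some infection:}\quad
  & \exists x\,\exists y\,\bigl(\mathit{Bacterium}(x) \land \mathit{Infection}(y)
      \land \mathit{causes}(x,y)\bigr) \\
\text{Some bacteria cause all infections:}\quad
  & \exists x\,\bigl(\mathit{Bacterium}(x) \land \forall y\,(\mathit{Infection}(y)
      \rightarrow \mathit{causes}(x,y))\bigr)
\end{align*}
```

The SN triple by itself does not say which of these readings is meant, which is why each relation has to be examined when it is carried over to a DL.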
Provisional Time Plan
| when | what |
|---|---|
| Done | evaluate compatibility of existing ST/BioTop classes |
| until 20-09-2007 | vacation |
| 24-09 – 5-10 | complete first mapping of all types, including the positions where classes are to be created |
| October | circulate to domain experts and iteratively revise according to comments |
| November | when satisfied with mapping, implement OWL version |
| December | Investigate inconsistencies |
| January-February | MediaWiki Programming |
| Up to April | evaluate functionality of ruleset-based edit support, improve ruleset |
| April | end, write report |
Literature
Aranguren, M.A. (2004) Improving the structure of the Gene Ontology.
Bodenreider, O. and Stevens, R. (2006) Bio-ontologies: current trends and future directions. Briefings in Bioinformatics 7:3, 256-274
Bodenreider, O. (2003) Strength in Numbers: Exploring Redundancy in Hierarchical Relations across Biomedical Terminologies. AMIA 2003 Symposium Proceedings, p. 101 - 105
Burger, A., Davidson, D., Baldock, R. (2003) Formalization of mouse embryo anatomy. Bioinformatics 20(2):259-267
Burgun, A., Bodenreider, O. (2001) Mapping the UMLS Semantic Network into General Ontologies. Proc. AMIA Ann. Symp. 2001:86-90
Cimino, J.J. (1998) Auditing the Unified Medical Language System with Semantic Methods. JAMIA 5:41-51
Cimino, J.J., Min, H., Perl, Y. (2003) Consistency across the hierarchies of the UMLS Semantic Network and Metathesaurus. J. Biomed. Inf. 36:450-461
Dameron, Rubin, Musen (2005) Challenges in Converting Frame-Based Ontology into OWL: the Foundational Model of Anatomy Case-Study. AMIA 2005 Symposium Proceedings, 181-185
Giles, J. (2007) Key biology databases go wiki. Nature 445, p. 691
Kashyap, V., Borgida, A. (2003) Representing the UMLS Semantic Network using OWL. Proc. 2nd Intl. Semantic Web Conference.
McCray, A.T. (2003) An upper-level ontology for the biomedical domain. Comp. Funct. Genom. 4:80-84
Van Mulligen, E.M., Möller, E., Roes, P., Weeber, M., Meijssen, G., Chichester, C., Mons, B. (2006) An on-line ontology: WiktionaryZ. Proc. 2nd Int. Workshop on Formal Biomedical Knowledge Representation: “Biomedical Ontology in Action” (KR-MED 2006). pp. 31-36.
Stefan Schulz, Elena Beisswanger, Udo Hahn, Joachim Wermter, Anand Kumar, Holger Stenzhorn (2006) From GENIA to BioTop: Towards a top-level ontology for biology. Proc. Int. Conf. FOIS2006.
Stefan Schulz, Elena Beisswanger, Joachim Wermter, Udo Hahn (2006) Towards an Upper Level Ontology for Molecular Biology. Proc. AMIA2006.
Schulze-Kremer, S., Smith, B., Kumar, A. (2004) Revising the UMLS Semantic Network. Proc. Medinfo 2004.
Wroe, C.J., Stevens, R., Goble, C.A., Ashburner, M. (2003) A Methodology to Migrate the Gene Ontology to a Description Logic Environment Using DAML+OIL. Pacific Symposium on Biocomputing 8:624-635
Attachments
Figure 1: The structure of the latest release of BioTop, version 1.0 (does not display on the wikiproteins.org server; if you cannot see it, use http://www.omegawiki.org/User:Laszlo/Project_plan instead)
Table 1: Distribution of Semantic Types within (part of) the UMLS MetaThesaurus
SQL: SELECT sty, COUNT(*) num FROM mrsty GROUP BY sty ORDER BY num DESC;
| Semantic Type | Occurrences |
|---|---|
| Clinical Drug | 156849 |
| Organic Chemical | 125148 |
| Amino Acid, Peptide, or Protein | 97504 |
| Pharmacologic Substance | 91643 |
| Plant | 65929 |
| Invertebrate | 65298 |
| Body Part, Organ, or Organ Component | 55945 |
| Biologically Active Substance | 47978 |
| Bacterium | 47642 |
| Clinical Attribute | 45493 |
| Fungus | 25554 |
| Gene or Genome | 23526 |
| Enzyme | 22017 |
| Disease or Syndrome | 20697 |
| Therapeutic or Preventive Procedure | 16350 |
| Molecular Function | 13383 |
| Medical Device | 13147 |
| Immunologic Factor | 12268 |
| Finding | 11935 |
| Fish | 11032 |
| Neoplastic Process | 10431 |
| Body Location or Region | 10067 |
| Carbohydrate | 9023 |
| Steroid | 8822 |
| Drug Delivery Device | 8162 |
| Indicator, Reagent, or Diagnostic Aid | 8142 |
| Injury or Poisoning | 7746 |
| Nucleic Acid, Nucleoside, or Nucleotide | 7735 |
| Intellectual Product | 6534 |
| Bird | 6496 |
| Mammal | 5487 |
| Lipid | 5473 |
| Body Space or Junction | 5356 |
| Virus | 4964 |
| Alga | 4768 |
| Hazardous or Poisonous Substance | 4604 |
| Cell Function | 4312 |
| Reptile | 4173 |
| Receptor | 4101 |
| Inorganic Chemical | 4034 |
| Laboratory Procedure | 4001 |
| Antibiotic | 3473 |
| Idea or Concept | 3344 |
| Diagnostic Procedure | 3018 |
| Biomedical or Dental Material | 3005 |
| Pathologic Function | 2860 |
| Amphibian | 2798 |
| Cell Component | 2527 |
| Cell | 2525 |
| Quantitative Concept | 2492 |
| Sign or Symptom | 2418 |
| Qualitative Concept | 2383 |
| Manufactured Object | 2274 |
| Organophosphorus Compound | 2160 |
| Organ or Tissue Function | 2145 |
| Spatial Concept | 2137 |
| Population Group | 2086 |
| Health Care Activity | 2075 |
| Hormone | 2021 |
| Mental or Behavioral Dysfunction | 1993 |
| Congenital Abnormality | 1668 |
| Laboratory or Test Result | 1666 |
| Professional or Occupational Group | 1654 |
| Temporal Concept | 1593 |
| Anatomical Abnormality | 1516 |
| Functional Concept | 1457 |
| Tissue | 1437 |
| Organism | 1339 |
| Genetic Function | 1305 |
| Organism Attribute | 1235 |
| Organism Function | 1223 |
| Food | 1199 |
| Archaeon | 1185 |
| Eicosanoid | 1141 |
| Nucleotide Sequence | 1072 |
| Research Activity | 1040 |
| Health Care Related Organization | 938 |
| Rickettsia or Chlamydia | 919 |
| Cell or Molecular Dysfunction | 826 |
| Biomedical Occupation or Discipline | 822 |
| Acquired Abnormality | 820 |
| Social Behavior | 792 |
| Phenomenon or Process | 788 |
| Embryonic Structure | 743 |
| Body Substance | 731 |
| Vitamin | 705 |
| Mental Process | 688 |
| Element, Ion, or Isotope | 673 |
| Individual Behavior | 660 |
| Geographic Area | 633 |
| Neuroreactive Substance or Biogenic Amine | 586 |
| Physiologic Function | 552 |
| Natural Phenomenon or Process | 533 |
| Classification | 517 |
| Occupational Activity | 488 |
| Human-caused Phenomenon or Process | 478 |
| Language | 478 |
| Governmental or Regulatory Activity | 476 |
| Occupation or Discipline | 470 |
| Conceptual Entity | 378 |
| Body System | 358 |
| Educational Activity | 337 |
| Organization | 280 |
| Chemical Viewed Structurally | 271 |
| Regulation or Law | 265 |
| Biologic Function | 264 |
| Substance | 262 |
| Molecular Biology Research Technique | 224 |
| Daily or Recreational Activity | 176 |
| Family Group | 157 |
| Amino Acid Sequence | 157 |
| Activity | 156 |
| Patient or Disabled Group | 147 |
| Group Attribute | 124 |
| Chemical Viewed Functionally | 114 |
| Machine Activity | 82 |
| Experimental Model of Disease | 71 |
| Age Group | 67 |
| Research Device | 65 |
| Anatomical Structure | 55 |
| Environmental Effect of Humans | 54 |
| Animal | 43 |
| Self-help or Relief Organization | 42 |
| Professional Society | 37 |
| Human | 36 |
| Physical Object | 24 |
| Group | 24 |
| Behavior | 23 |
| Vertebrate | 18 |
| Event | 18 |
| Chemical | 17 |
| Fully Formed Anatomical Structure | 6 |
| Entity | 6 |
| Molecular Sequence | 4 |
| Carbohydrate Sequence | 2 |

