User:Laszlo/UMLS-Biotop/Plan

Implementing an Upper Ontology for semantic support in a Biological Wiki

László van den Hoek (User:Laszlo)

September 2007

PLEASE FEEL FREE TO EDIT! Questions and remarks can be posted to the discussion page (see tab above).

Introduction

WikiProteins is an on-line terminology system with a wiki interface. Several important databases have already been imported, among them the UMLS MetaThesaurus. Each concept in the MetaThesaurus is tagged as belonging to at least one of 135 Semantic Types (ST; see Table 1), a simplified, hierarchical categorization system covering the breadth of UMLS.

Using the UMLS Semantic Types as an upper ontology for WikiProteins imposes some obvious limitations because of the inconsistencies and ambiguities they contain. To ensure extensibility and interoperability of the databases contained within WikiProteins, and to reduce ambiguity, we seek to implement a new upper ontology based on Description Logic (DL) in WikiProteins, simultaneously mapping the STs to it.

Choice of ontology

By definition, upper ontologies are not domain-specific. Since we are looking to simultaneously implement an upper ontology and extend it downwards into the biomedical domain, it is essential that a mid-level, “glue” ontology can be seamlessly attached to the upper ontology chosen.

To avoid duplicating existing efforts, several existing authoritative upper ontologies were considered:

(Open)Cyc

Originally a research project started in 1984, Cyc is currently maintained by Cycorp, Inc., for commercial AI applications such as call center transcript analysis, text mining and game design. A publicly available, non-commercial version, OpenCyc, was released in 2001, with subsequent updates, but the number of assertions is much lower than in the commercial version. A research version, ResearchCyc, also exists, but it requires a non-disclosure agreement, which precludes use on a public project like WikiProteins. Cycorp has stated its intention to port all non-proprietary information from ResearchCyc to OpenCyc, but more than a year after this announcement, this has not yet happened. Assertions in Cyc are in first-order logic, with extensions for modal operators and higher-order quantification. A complicating factor is the widespread presence of reification (a construct to allow the presence of contradictory statements). Given these conditions, it is unlikely that a DL representation of Cyc could be generated. Despite its extensive subject area, Cyc has paid very little specific attention to life sciences and medicine. Implementing the Semantic Network in it would likely be a difficult task.

SUMO

Like Cyc, SUMO is part of the IEEE Standard Upper Ontology Working Group. It is developed by Teknowledge Corp. and written in SUO-KIF, a variant of the Knowledge Interchange Format (KIF). While it may be possible to convert KIF in general to DL, the available OWL interpretation of SUMO is in OWL Full, not OWL DL. MILO is a copyright-encumbered mid-level ontology connecting SUMO to commercial Teknowledge domain ontologies, but “any ontology you create based on MILO or our domain ontologies is your property”. No scientific domain ontologies are available apart from two very specific ones (“biological virii” and “atomic elements”). Even if it is possible to express the Semantic Network in (SUO-)KIF, the current level of detail in MILO with regard to the biomedical domain goes no further than BiologicalProcess, which has two children: CausingHappiness and CausingUnhappiness. If this arbitrary division is considered representative of the orientation of SUMO/MILO, then it would appear to be unsuited to our needs.

BioTop

BioTop is an upper-to-middle ontology which, as the name suggests, is geared towards the biomedical domain. It is developed at the Universities of Freiburg and Jena, and is entirely in DL. The ontology (Fig. 1) is divided into two parts: at the upper level is the Basic Formal Ontology (BFO), authored by Barry Smith, among others, which gives it a philosophical grounding. Added below that is a relatively deep network of biological classes, intended to provide an interface to the domain ontologies contained in the OBO Foundry. Some classes appear to be a straight match with semantic types, some even being eponymous, like “Animal”, “Fungus” and “Virus”. In other cases, an ST may not be exactly identical to a BioTop class, but creating a new BioTop class would not likely result in a better ontology; for instance, the “Biologic Function” ST is described as “A state, activity or process of the body or one of its systems or parts”, while the BioTop class “BiologicalProcess” makes no mention of bodies. UMLS probably does not consider biological processes in non-human organisms, but creating a “BiologicalProcessInHuman” class does not make much sense.

Considering the sound and compatible structure of BioTop, and the willingness of the BioTop developers to cooperate in this project, the choice to use it as a base for our upper ontology is clear. Whereas other biological upper ontologies (Simple Bio Upper Ontology, GFO-Bio, UBO) do exist, and BioTop is neither complete nor without issues itself, none match its quality and extent.

Practical approach

Mapping the Semantic Types

To begin, we will try to map each Semantic Type to BioTop, or if that is impossible, determine where the latter must be extended. A preliminary effort is being undertaken; a scheme of the results so far is available. This is facilitated by the availability of textual definitions of all ST classes (web view) and most BioTop classes. An outline of the mapping will be circulated among domain experts, soliciting comments where necessary. BioTop class membership can then be assigned to individual MetaThesaurus entities inside WikiProteins, using the same mechanism with which STs are assigned now.
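As a rough illustration of what one row of such a mapping might record, the sketch below uses a small Python structure. The field names, the action labels "map" and "extend", and the last row are assumptions made for this example; only the first four pairs are taken from the BioTop section above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mapping:
    semantic_type: str           # UMLS Semantic Type name
    action: str                  # "map" = use an existing class, "extend" = new class needed
    biotop_class: Optional[str]  # target BioTop class, if one has been chosen
    note: str = ""


# Illustrative rows only. The first four pairs are mentioned in the text;
# the last row is a hypothetical placeholder for a type without a match.
MAPPINGS = [
    Mapping("Animal", "map", "Animal"),
    Mapping("Fungus", "map", "Fungus"),
    Mapping("Virus", "map", "Virus"),
    Mapping("Biologic Function", "map", "BiologicalProcess",
            "ST definition mentions the body; the BioTop class does not"),
    Mapping("Clinical Drug", "extend", None, "hypothetical: no decision recorded"),
]


def needs_extension(mappings):
    """Semantic Types for which BioTop would have to be extended."""
    return [m.semantic_type for m in mappings if m.action == "extend"]


def biotop_class_for(mappings, semantic_type):
    """Return the mapped BioTop class, or None if the type is not mapped yet."""
    for m in mappings:
        if m.semantic_type == semantic_type and m.action == "map":
            return m.biotop_class
    return None
```

A list like this could be exported to the spreadsheet that is circulated among the domain experts, with the note column carrying their comments.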

Implementation in OWL

After the semantic types have been mapped, creating an implementation in OWL should be possible. The most challenging part will likely be the mapping of the semantic relations from the frame-based UMLS Semantic Network (SN) to a DL, which has no ambiguity but less expressivity. The SN defines a list of permissible (and a few forbidden) relations between Semantic Types, as well as the (non-)inheritance of these relations from a generic parent type to its more specific children. Given such a list, and a classification of concepts under the STs, it can be inferred which types of relations are allowed between which concepts. A second issue might arise from the fact that concepts can belong to multiple semantic types and therefore inherit from multiple classes. This might lead to inconsistencies.
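The inference described above can be sketched in a few lines of Python. This is a simplified model, not the SN's actual data or algorithm: the hierarchy is a flattened, illustrative fragment, the relation name and the blocked triple are chosen to mirror the plants-and-mental-processes example, and a blocked link is assumed to apply to all descendants of the blocked pair.

```python
# Illustrative, flattened fragment of a type hierarchy: child -> parent.
PARENT = {
    "Mental Process": "Biologic Function",
    "Plant": "Organism",
    "Animal": "Organism",
}

# Relations asserted between generic types: (subject, relation, object).
ASSERTED = {("Biologic Function", "process_of", "Organism")}

# Links whose inheritance is explicitly forbidden (assumed to cover descendants too).
BLOCKED = {("Mental Process", "process_of", "Plant")}


def lineage(sem_type):
    """Return the type followed by all of its ancestors, nearest first."""
    chain = [sem_type]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def _covered(triples, subj, rel, obj):
    """True if a triple exists for subj/obj or for any pair of their ancestors."""
    return any((s, rel, o) in triples for s in lineage(subj) for o in lineage(obj))


def is_allowed(subj, rel, obj):
    """A relation is allowed if it is inherited from an assertion and not blocked."""
    if _covered(BLOCKED, subj, rel, obj):
        return False
    return _covered(ASSERTED, subj, rel, obj)


def verdicts(subj_types, rel, obj_types):
    """All verdicts for concepts carrying several types; {True, False} signals a conflict."""
    return {is_allowed(s, rel, o) for s in subj_types for o in obj_types}
```

With these toy rules, `is_allowed("Mental Process", "process_of", "Animal")` holds while the same relation towards "Plant" does not, and a concept typed as both "Plant" and "Animal" yields conflicting verdicts, which is the kind of inconsistency multiple typing can introduce.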

Explore incompatibilities

Because of the lack of ontological rigor in the STs, concepts that differ in fundamental ways (e.g., classes versus instances) may still belong to the same ST. After the mapping to a DL version, a relation that is permitted between two semantic types may therefore turn out to hold for one pair of concepts of those types but not for another pair, because a difference that the ST classification scheme did not capture becomes explicit in the DL-based representation. Seeking out these kinds of errors will bring out inconsistencies in UMLS, which may then possibly be corrected.

Extend WikiData for editing support

When this topic is sufficiently explored, the WikiProteins software will be improved to allow the use of resources such as the SN to provide feedback to wiki editors: given the knowledge of which relation types are allowed between entries marked with which STs, it should be possible to limit the number of options available to the user when adding relations between concepts. Once this functionality is in place, similar resources like the SwissProt annotations could be merged, further assisting the user. From this point onwards, cases where certain valid relations are not suggested and/or cannot be added by users should be investigated and resolved by increasing the resolution and completeness of the ruleset in tandem with the upper ontology used.
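A minimal sketch of such edit support, assuming a ruleset of permitted (subject type, relation, object type) triples, might look as follows. The entries, their type assignments, the triples and the function names are all hypothetical and serve only to show the filtering step; they are not taken from the WikiData code or from the SN itself.

```python
# Hypothetical wiki entries and the Semantic Types assigned to them.
ENTRY_TYPES = {
    "Depression": {"Mental or Behavioral Dysfunction"},
    "Human": {"Human"},
    "Fern": {"Plant"},
    "Fluoxetine": {"Pharmacologic Substance", "Organic Chemical"},
}

# Hypothetical ruleset: permitted (subject type, relation, object type) triples.
ALLOWED = {
    ("Mental or Behavioral Dysfunction", "process_of", "Human"),
    ("Pharmacologic Substance", "treats", "Mental or Behavioral Dysfunction"),
    ("Organic Chemical", "interacts_with", "Organic Chemical"),
}


def suggest_relations(subject, obj):
    """Relation types to offer the editor for this pair of entries."""
    subj_types = ENTRY_TYPES.get(subject, set())
    obj_types = ENTRY_TYPES.get(obj, set())
    return sorted(
        {rel for (s, rel, o) in ALLOWED if s in subj_types and o in obj_types}
    )


def can_add(subject, relation, obj):
    """True if the ruleset permits this statement to be added."""
    return relation in suggest_relations(subject, obj)
```

Here an entry with several types is offered every relation permitted for at least one of its types. That is a permissive design choice; a stricter variant could require agreement across all types, at the cost of rejecting more valid statements.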

Summary

The tasks to be completed and the milestones in which they result, in order:

  1. Map Semantic Types to BioTop classes, expanding the latter where needed to accommodate the former. This will result in a spreadsheet describing the actions to take for each Semantic Type.
  2. Verification by domain experts (Christine Chichester, Barend Mons, Olivier Bodenreider, Stephan Schulz, Elena Beisswanger, Ronald Cornet) of the mapping; incorporate feedback. This will be repeated until written consensus is reached on the action to take for each Semantic Type.
  3. When satisfied, create an OWL implementation, and import the adapted BioTop classes into WikiProteins using the existing class membership mechanism.
  4. Investigate inconsistencies created through the DL mapping (“Barry Smith” examples). A few examples might include:
    • "Bacteria / causes / Infection": All Bacteria cause {each/only/some} infection(s) / Some Bacteria cause {all/some} infections
    • foo
    • bar
    • suggestions are welcome
  5. Extend the WikiProteins software to enable suggestions/restrictions on possible relation types when adding relations to the wiki. This should prevent the addition of statements like "Ferns suffer from depression" (plants do not possess mental processes). The deliverable code, extending the WikiData extension, will be released under a GPL license.
  6. Explore and expand the ruleset currently defined by the UMLS Semantic Network, and merge other sources of rules where possible.
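
The alternative readings of "Bacteria / causes / Infection" listed under step 4 can be written out in DL notation. The following is only a sketch, with class and role names assumed for illustration. It shows the two readings that correspond directly to subsumption axioms; the "some Bacteria cause ..." readings are existential statements about the domain and are not plain subsumption axioms of this kind.

```latex
\begin{align*}
\mathit{Bacterium} &\sqsubseteq \exists\,\mathit{causes}.\mathit{Infection}
  && \text{every bacterium causes at least one infection} \\
\mathit{Bacterium} &\sqsubseteq \forall\,\mathit{causes}.\mathit{Infection}
  && \text{whatever a bacterium causes is an infection}
\end{align*}
```

The SN triple itself does not say which of these is intended, which is why the choice has to be made explicitly during the DL mapping.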

Provisional Time Plan

When                 What
Done                 evaluate compatibility of existing ST/BioTop classes
until 20-09-2007     vacation
24-09 – 5-10         complete first mapping of all types, including the positions where classes are to be created
October              circulate to domain experts and iteratively revise according to comments
November             when satisfied with mapping, implement OWL version
December             investigate inconsistencies
January-February     MediaWiki programming
Up to April          evaluate functionality of ruleset-based edit support, improve ruleset
April                end, write report

Literature

Aranguren, M.A. (2004) Improving the structure of the Gene Ontology.

Bodenreider, O. and Stevens, R. (2006) Bio-ontologies: current trends and future directions. Briefings in Bioinformatics 7:3, 256-274

Bodenreider, O. (2003) Strength in Numbers: Exploring Redundancy in Hierarchical Relations across Biomedical Terminologies. AMIA 2003 Symposium Proceedings, p. 101 - 105

Burger, A., Davidson, D., Baldock, R. (2003) Formalization of mouse embryo anatomy. Bioinformatics 20(2):259-267

Burgun, A., Bodenreider, O. (2001) Mapping the UMLS Semantic Network into General Ontologies. Proc. AMIA Ann. Symp. 2001:86-90

Cimino, J.J. (1998) Auditing the Unified Medical Language System with Semantic Methods. JAMIA 5:41-51

Cimino, J.J., Min, H., Perl, Y. (2003) Consistency across the hierarchies of the UMLS Semantic Network and Metathesaurus. J. Biomed. Inf. 36:450-461

Dameron, Rubin, Musen (2005) Challenges in Converting Frame-Based Ontology into OWL: the Foundational Model of Anatomy Case-Study. AMIA 2005 Symposium Proceedings, 181-185

Giles, J. (2007) Key biology databases go wiki. Nature 445, p. 691

Kashyap, V., Borgida, A. (2003) Representing the UMLS Semantic Network using OWL. Proc. 2nd Intl. Semantic Web Conference.

McCray, A.T. (2003) An upper-level ontology for the biomedical domain. Comp. Funct. Genom. 4:80-84

Van Mulligen, E.M., Möller, E., Roes, P., Weeber, M., Meijssen, G., Chichester, C., Mons, B. (2006) An on-line ontology: WiktionaryZ. Proc. 2nd Int. Workshop on Formal Biomedical Knowledge Representation: “Biomedical Ontology in Action” (KR-MED 2006). pp. 31-36.

Schulz, S., Beisswanger, E., Hahn, U., Wermter, J., Kumar, A., Stenzhorn, H. (2006) From GENIA to BioTop: Towards a top-level ontology for biology. Proc. Int. Conf. FOIS2006.

Schulz, S., Beisswanger, E., Wermter, J., Hahn, U. (2006) Towards an Upper Level Ontology for Molecular Biology. Proc. AMIA2006.

Schulze-Kremer, S., Smith, B., Kumar, A. (2004) Revising the UMLS Semantic Network. Proc. Medinfo 2004.

Wroe, C.J., Stevens, R., Goble, C.A., Ashburner, M. (2003) A Methodology to Migrate the Gene Ontology to a Description Logic Environment Using DAML+OIL. Pacific Symposium on Biocomputing 8:624-635

Attachments

Figure 1 (biotop structure.png): The structure of the latest release of BioTop, version 1.0. If the image does not display on the wikiproteins.org server, use http://www.omegawiki.org/User:Laszlo/Project_plan instead.


Table 1: Distribution of Semantic Types within (part of) the UMLS MetaThesaurus.
SQL: SELECT sty, COUNT(*) num FROM mrsty GROUP BY sty ORDER BY num DESC;

Semantic Type                          Occurrences
Clinical Drug                               156849
Organic Chemical                            125148
Amino Acid, Peptide, or Protein              97504
Pharmacologic Substance                      91643
Plant                                        65929
Invertebrate                                 65298
Body Part, Organ, or Organ Component         55945
Biologically Active Substance                47978
Bacterium                                    47642
Clinical Attribute                           45493
Fungus                                       25554
Gene or Genome                               23526
Enzyme                                       22017
Disease or Syndrome                          20697
Therapeutic or Preventive Procedure          16350
Molecular Function                           13383
Medical Device                               13147
Immunologic Factor                           12268
Finding                                      11935
Fish                                         11032
Neoplastic Process                           10431
Body Location or Region                      10067
Carbohydrate                                  9023
Steroid                                       8822
Drug Delivery Device                          8162
Indicator, Reagent, or Diagnostic Aid         8142
Injury or Poisoning                           7746
Nucleic Acid, Nucleoside, or Nucleotide       7735
Intellectual Product                          6534
Bird                                          6496
Mammal                                        5487
Lipid                                         5473
Body Space or Junction                        5356
Virus                                         4964
Alga                                          4768
Hazardous or Poisonous Substance              4604
Cell Function                                 4312
Reptile                                       4173
Receptor                                      4101
Inorganic Chemical                            4034
Laboratory Procedure                          4001
Antibiotic                                    3473
Idea or Concept                               3344
Diagnostic Procedure                          3018
Biomedical or Dental Material                 3005
Pathologic Function                           2860
Amphibian                                     2798
Cell Component                                2527
Cell                                          2525
Quantitative Concept                          2492
Sign or Symptom                               2418
Qualitative Concept                           2383
Manufactured Object                           2274
Organophosphorus Compound                     2160
Organ or Tissue Function                      2145
Spatial Concept                               2137
Population Group                              2086
Health Care Activity                          2075
Hormone                                       2021
Mental or Behavioral Dysfunction              1993
Congenital Abnormality                        1668
Laboratory or Test Result                     1666
Professional or Occupational Group            1654
Temporal Concept                              1593
Anatomical Abnormality                        1516
Functional Concept                            1457
Tissue                                        1437
Organism                                      1339
Genetic Function                              1305
Organism Attribute                            1235
Organism Function                             1223
Food                                          1199
Archaeon                                      1185
Eicosanoid                                    1141
Nucleotide Sequence                           1072
Research Activity                             1040
Health Care Related Organization               938
Rickettsia or Chlamydia                        919
Cell or Molecular Dysfunction                  826
Biomedical Occupation or Discipline            822
Acquired Abnormality                           820
Social Behavior                                792
Phenomenon or Process                          788
Embryonic Structure                            743
Body Substance                                 731
Vitamin                                        705
Mental Process                                 688
Element, Ion, or Isotope                       673
Individual Behavior                            660
Geographic Area                                633
Neuroreactive Substance or Biogenic Amine      586
Physiologic Function                           552
Natural Phenomenon or Process                  533
Classification                                 517
Occupational Activity                          488
Human-caused Phenomenon or Process             478
Language                                       478
Governmental or Regulatory Activity            476
Occupation or Discipline                       470
Conceptual Entity                              378
Body System                                    358
Educational Activity                           337
Organization                                   280
Chemical Viewed Structurally                   271
Regulation or Law                              265
Biologic Function                              264
Substance                                      262
Molecular Biology Research Technique           224
Daily or Recreational Activity                 176
Family Group                                   157
Amino Acid Sequence                            157
Activity                                       156
Patient or Disabled Group                      147
Group Attribute                                124
Chemical Viewed Functionally                   114
Machine Activity                                82
Experimental Model of Disease                   71
Age Group                                       67
Research Device                                 65
Anatomical Structure                            55
Environmental Effect of Humans                  54
Animal                                          43
Self-help or Relief Organization                42
Professional Society                            37
Human                                           36
Physical Object                                 24
Group                                           24
Behavior                                        23
Vertebrate                                      18
Event                                           18
Chemical                                        17
Fully Formed Anatomical Structure                6
Entity                                           6
Molecular Sequence                               4
Carbohydrate Sequence                            2
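
The query given with Table 1 can be tried out on a toy table. The sketch below runs it with Python's built-in sqlite3 module against an in-memory database. The two-column table layout and the concept identifiers are made up for the example; the real mrsty table has more columns and was presumably queried on a different database server.

```python
import sqlite3

# Made-up (concept id, semantic type) rows; the last concept carries two types.
TOY_ROWS = [
    ("C0000001", "Plant"),
    ("C0000002", "Plant"),
    ("C0000003", "Plant"),
    ("C0000004", "Virus"),
    ("C0000005", "Virus"),
    ("C0000005", "Organism"),
]


def count_semantic_types(rows):
    """Return [(sty, num), ...] ordered by descending number of occurrences."""
    conn = sqlite3.connect(":memory:")
    try:
        # Simplified stand-in for the mrsty table: only the columns used here.
        conn.execute("CREATE TABLE mrsty (cui TEXT, sty TEXT)")
        conn.executemany("INSERT INTO mrsty (cui, sty) VALUES (?, ?)", rows)
        # Same query as in the caption of Table 1.
        cursor = conn.execute(
            "SELECT sty, COUNT(*) num FROM mrsty GROUP BY sty ORDER BY num DESC"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for sty, num in count_semantic_types(TOY_ROWS):
        print(f"{sty:<12}{num:>6}")
```

Because a concept can carry more than one Semantic Type, the counts are occurrences of type assignments rather than numbers of distinct concepts.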