dcGO - A comprehensive domain-centric ontology resource for post-genomic research on functions, phenotypes, diseases and more   
  
  

Algorithm behind dcGO

Domain2GO

GO annotations for individual domains

Summary

This tab contains information on the algorithm behind GO associations to individual domains. Domains are defined by the Structural Classification of Proteins database (Andreeva, et al., 2008). Our 2011 publication describes the algorithm, which takes manually derived GO annotations from UniProtKB and comprehensive domain annotations from SUPERFAMILY to statistically infer domain-centric GO annotations (de Lima Morais, et al., 2011). The key elements are summarized as follows:

  • The statistical inference is based on the principle that: if a GO term tends to be attached to proteins in UniProtKB containing a certain domain, then we can infer that the functional/GO signal should be associated with that domain.
  • Taking into account the directed acyclic graph (DAG) of GO, and using the domain composition of proteins in UniProt determined by SUPERFAMILY (See Pipeline tab for details), we have generated GO associations for domains at the SCOP family level and at the SCOP superfamily level.
  • We have also generated a user friendly trimmed-down 'slim' version of domain-centric GO. In the slim version, instead of the whole GO hierarchy, only selected GO terms are included so as to be representative and comprehensive at broadly different levels of the hierarchy. We call this: Structural Domain Functional Ontology (SDFO), containing GO terms with four levels of increasing granularity (see the SDFO tab for details).
References

  • Andreeva, A., Howorth, D., Chandonia, J.M., Brenner, S.E., Hubbard, T.J., Chothia, C. and Murzin, A.G. (2008) Data growth and its impact on the SCOP database: new developments, Nucleic Acids Res, 36, D419-425.
  • de Lima Morais DA, Fang H, Rackham OJ, Wilson D, Pethica R, Chothia C, Gough J. (2011) SUPERFAMILY 1.75 including a domain-centric gene ontology method. Nucleic Acids Res 39: D427-434.

Pipeline

The motivation behind this pipeline comes from two points: (i) from a biological point of view, structural domains constitute the functional units of proteins, and thus in many cases GO terms can be associated with the domain more precisely than with the whole protein; (ii) from a methodological point of view, such inherent GO associations for a domain can be inferred in reverse if the number of GO-annotated proteins containing the domain is significantly higher than would be expected by chance. Figure 1 illustrates the procedure used to generate the domain-centric GO annotations from individual protein-level annotations.

Figure 1. Flowchart for inferring domain-centric GO annotations using the UniProtKB-GOA database and domain assignments in the SUPERFAMILY database.

Data Source

  • Protein-level GO annotations are taken from UniProtKB-GOA. To reduce false-positives and avoid data circularity from InterPro and Pfam, we only consider those annotations supported by experimental or manual evidence codes (see GO Evidence Codes).

  • The domain compositions of sequences in UniProt are taken from the SUPERFAMILY database at SCOP family and superfamily levels.

  • We term those UniProt sequences annotated with at least one GO term and containing at least one domain our analyzable UniProt sequence space. Notably, the large UniProt sequence space in this study ensures that statistical inference has adequate power to reveal significant associations between a GO term and a domain from protein-level GO annotations.

SCOP2GO Correspondence Matrix

  • The correspondence matrix between domains and GO terms contains the observed number of UniProt proteins that contain a given domain (columns) and are annotated with a given GO term (rows).

  • Two sets of UniProt sequences (i.e., the whole of UniProt, including multidomain sequences, and the subset of single-domain sequences) are used to support Domain2GO associations.

  • The correspondence matrix for single-domain sequences is guaranteed to be domain-centric, but its statistical power is limited by the number of single-domain sequences with GO annotation available in UniProt.

  • The correspondence matrix for all sequences (including multidomain UniProt sequences) has greater statistical power and so enables more associations, but a resulting association is not guaranteed to be precisely domain-centric, or may be only dominant (rather than complete) in its association.

  • For users interested in genome-wide functional annotation, we recommend Domain2GO supported by all of UniProt (high-coverage version). For more specific studies, Domain2GO supported by both sets (high-quality version) should be used in the first instance.
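
As a minimal sketch of the correspondence matrix described above, the following counts, for every (GO term, domain) pair, the proteins that carry both. The function name, the input dictionaries and the toy identifiers are illustrative assumptions rather than part of dcGO, and "single-domain" is approximated here as a protein whose domain list has exactly one entry.

```python
def build_correspondence(protein_domains, protein_terms):
    """Count proteins per (GO term, domain) pair.

    protein_domains: dict protein id -> list of domain ids (hypothetical input)
    protein_terms:   dict protein id -> set of GO term ids (hypothetical input)
    Only proteins with at least one domain and one term are analyzable.
    """
    counts = {}
    analyzable = [p for p, doms in protein_domains.items()
                  if doms and protein_terms.get(p)]
    for p in analyzable:
        for term in protein_terms[p]:
            for dom in set(protein_domains[p]):
                counts[(term, dom)] = counts.get((term, dom), 0) + 1
    return counts, len(analyzable)


# Toy, made-up data: P2 is a two-domain protein, P4 has no domain assignment.
protein_domains = {"P1": ["d1"], "P2": ["d1", "d2"], "P3": ["d2"], "P4": []}
protein_terms = {"P1": {"GO:a"}, "P2": {"GO:a", "GO:b"},
                 "P3": {"GO:b"}, "P4": {"GO:a"}}

# Matrix supported by all sequences (including multidomain ones).
all_counts, n_all = build_correspondence(protein_domains, protein_terms)

# Matrix supported by single-domain sequences only.
single = {p: d for p, d in protein_domains.items() if len(d) == 1}
single_counts, n_single = build_correspondence(single, protein_terms)
```

In this toy case the pair ("GO:a", "d2") appears only in the all-sequence matrix, because its sole supporting protein is multidomain; this is the kind of association that cannot be confirmed as domain-centric from single-domain sequences.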

Statistical Analysis

  • Given a SCOP2GO correspondence matrix, we use the hypergeometric distribution as the null model and perform a statistical test (equivalent to Fisher’s exact test) to infer the possible associations between a GO term and a protein domain.

  • To respect the DAG structure and true-path rule, two types of inference are performed to infer the overall and relative associations between a domain and a GO term (Figure 2):

    Reasons:

    • The hierarchical structure of GO is organized as a directed acyclic graph (DAG) by viewing an individual term as a node and its relations to parental terms (allowing for multiple parents) as directed edges.
    • GO follows the 'true-path rule', that is, a protein annotated to a term should also be annotated to all of its parent terms.

    Steps:

    • First, we calculate an overall P-value and the corresponding overall hypergeometric score (a standard score, or z-score: the observed count minus the expected count, divided by the standard deviation under the hypergeometric distribution), using all analyzable UniProt proteins (i.e., those annotated to the root term of GO after applying the true-path rule) as the background.
    • Second, we calculate a relative P-value (and the corresponding relative hypergeometric score) using as the background only those UniProt proteins annotated to all direct parental GO terms.
    • The purpose of the second background is this: if a GO term and its direct parental term are both significantly associated with a domain according to the first background, and the term is not substantially different from the parental term (i.e., not significant using the second background), then it is sufficient to report only the parental term. Under these dual constraints, only the most significant GO terms are retained as associations with domains.
    • Significance of association is measured by false discovery rate (FDR; <0.001), while the strength of association is measured by the hypergeometric distribution-based score.
    • For a domain, the associated GO terms (i.e., direct annotations) are propagated to all ancestor terms (i.e., inherited annotations); both together constitute the complete GO annotation profile.
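
The dual constraints and the FDR cutoff can be sketched as follows. This is an illustration under stated assumptions, not the dcGO implementation: it assumes the FDR is obtained with the Benjamini-Hochberg procedure applied to the larger of the overall and relative P-values for each domain-term pair, and the function names and toy P-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values (FDR) in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_pairs(tests, fdr_cutoff=0.001):
    """tests: dict (domain, term) -> (overall P-value, relative P-value).

    A pair must pass under both backgrounds, so the maximal P-value is used.
    """
    keys = list(tests)
    p_max = [max(tests[k]) for k in keys]
    fdr = benjamini_hochberg(p_max)
    return {k: q for k, q in zip(keys, fdr) if q < fdr_cutoff}


# Made-up P-values: the child term is significant overall but not
# relative to its parent, so only the parent term is retained.
tests = {
    ("d1", "GO:parent"): (1e-8, 1e-6),
    ("d1", "GO:child"): (1e-7, 0.2),
    ("d2", "GO:x"): (1e-5, 1e-5),
}
kept = significant_pairs(tests)
```

Taking the maximum of the two P-values is one simple way to require significance under both backgrounds at once.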

Figure 2. The statistical significance of inference is assessed based on the hypergeometric distribution, generating overall over-representation in terms of the whole annotations (left panel) and relative over-representation in terms of all direct parents (middle panel). Based on the maximal P-values, the statistical significance of domain-GO term associations is assessed by FDR, accounting for multiple hypothesis tests (right panel).

    Given a GO term (say t) and a domain (say d), the parameters in the left and middle panels are:
  • (Left panel) N is the number of UniProt proteins annotated with at least one GO term and containing at least one domain, M is the number of UniProt proteins containing the domain d, K is the number of UniProt proteins annotated with the GO term t, X is the observed number of UniProt proteins both annotated with the GO term t and containing the domain d, and Pwhole is the probability of observing X or more such proteins under the hypergeometric distribution.
  • (Middle panel) Npa is the number of UniProt proteins annotated with all direct parents of the GO term t in the DAG, Mpa is the number of UniProt proteins containing the domain d after intersecting with those in Npa, K is the number of UniProt proteins annotated with the GO term t, X is the observed number of UniProt proteins both annotated with the GO term t and containing the domain d, and Prel is the probability of observing X or more such proteins under the hypergeometric distribution.
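
With the parameters N, M, K and X as defined above, the tail probability and the z-score can be computed directly from the hypergeometric distribution. The sketch below uses only the Python standard library; the function names and the toy counts are illustrative assumptions. The relative test would reuse the same functions with Npa and Mpa in place of N and M.

```python
from math import comb, sqrt


def hypergeom_tail(N, M, K, X):
    """Probability of observing X or more domain-containing proteins among
    the K proteins annotated with the term, given M of N contain the domain."""
    upper = min(M, K)
    total = comb(N, K)
    hits = sum(comb(M, x) * comb(N - M, K - x) for x in range(X, upper + 1))
    return hits / total


def hypergeom_zscore(N, M, K, X):
    """Observed minus expected, divided by the hypergeometric standard deviation."""
    expected = K * M / N
    variance = K * (M / N) * (1 - M / N) * (N - K) / (N - 1)
    return (X - expected) / sqrt(variance)


# Toy counts: 20 analyzable proteins, 5 with the domain, 4 with the term,
# 2 with both.
p_whole = hypergeom_tail(20, 5, 4, 2)    # 1205 / 4845, about 0.249
z_whole = hypergeom_zscore(20, 5, 4, 2)  # about 1.26
```

Exact integer binomials keep the sketch dependency-free; for UniProt-scale counts a log-space or library implementation would be the more practical choice.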

Domain2GO Mappings

  • Using the algorithm described in the preceding tabs, dcGO provides two versions of mappings between domains and GO.

  • High-quality mappings are those that are supported no matter whether only single-domain proteins or all proteins (including multi-domain proteins) are used.

  • High-coverage mappings are those that are supported when all proteins are used but not when only single-domain proteins are considered.

  • The high-quality mappings are more reliably domain-centric, but high-coverage mappings are still useful for large-scale studies, particularly when some accuracy can be traded for coverage.

  • Since GO depicts three complementary biological concepts, Biological Process (BP), Molecular Function (MF) and Cellular Component (CC), and SCOP classifies evolutionarily related domains at the superfamily level and the family level, we have accordingly generated domain-centric GO annotations for each of the three concepts at the two domain levels.

SDFO

Based on high-quality Domain2GO mappings, we have also generated a trimmed-down (or slim) version and refer to this as Structural Domain Functional Ontology (SDFO; Figure 3).

Figure 3. Flowchart of how structural domain functional ontology (SDFO) is created using an information theory-based analysis of Domain2GO annotation profiles.

    First, we define the information content (IC) of a GO term as the negative log10-transformation of the frequency of observing domains annotated to that term. Because of the dependencies among GO terms (the true-path rule), a domain directly annotated to a specific GO term (a direct annotation) is also implicitly annotated to its parental terms (inherited annotations). The GO annotations generated as described above are considered direct annotations. For any domain, its direct and inherited annotations together constitute a domain-GO annotation profile in the DAG, and these complete annotations are used to calculate the IC for all GO terms. N.B. GO terms with similar IC can represent a partition of the DAG in terms of Domain2GO.
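
The IC definition can be illustrated with a small sketch. It assumes that the frequency of a term is the fraction of annotated domains whose complete profile (direct plus inherited annotations) contains that term; the function names, the toy DAG and the toy annotations are hypothetical.

```python
from math import log10


def ancestors(term, parents):
    """All ancestors of a term; parents maps a term to its direct parents."""
    seen = set()
    stack = [term]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def information_content(direct, parents):
    """direct: dict domain -> set of directly annotated terms. Returns term -> IC."""
    counts = {}
    for terms in direct.values():
        # Complete profile: direct annotations plus those inherited
        # through the true-path rule.
        profile = set(terms)
        for t in terms:
            profile |= ancestors(t, parents)
        for t in profile:
            counts[t] = counts.get(t, 0) + 1
    n = len(direct)
    return {t: -log10(c / n) for t, c in counts.items()}


# Toy DAG: A is the root, B and C are its children, D has both B and C as parents.
parents = {"B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
direct = {"d1": {"D"}, "d2": {"B"}, "d3": {"C"}, "d4": {"A"}}
ic = information_content(direct, parents)
```

Here the root term is in every profile and so has an IC of 0, while D, annotated to one domain out of four, has the highest IC (about 0.60).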

    Second, given a predefined IC (say 1) as a seed and a corresponding range (say, [0.75 1.25]), the algorithm starts with all GO terms unmarked and iteratively identifies the unmarked GO terms whose IC is closest to the predefined IC until all GO terms are marked (Figure 4). To make sure that one and only one GO term can be identified per path in the DAG, the following constraints should be met: if multiple GO terms with identical IC are identified in the same path, the parental terms are filtered out; once a GO term is identified, all terms in the path in which that term is located are marked and ignored in any further search.

    Last, the outputs are those identified GO terms with IC falling within the range. We run the algorithm with four seed ICs (i.e., 0.5, 1, 1.5 and 2) to create SDFO, with GO terms corresponding to the four levels (least informative, moderately informative, informative, highly informative).

Figure 4. Illustration of the algorithm used to iteratively create structural domain functional ontology (SDFO). I). Initially, all GO terms in the DAG are unmarked (open circles); II). Identify those unmarked GO terms (filled in pink) with IC closest to a predefined IC (e.g., 1); III). Filter out those parental GO terms from identified GO terms in Step II. IV). Mark GO terms identified as well as all of their ancestors and descendants. V-VI). Continue the Steps II-IV to iteratively identify unmarked GO terms until all GO terms are marked. VII). Output only those identified GO terms with IC falling in the range (e.g., [0.75 1.25]) as SDFO.
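
Steps I to VII of Figure 4 can be sketched roughly as below. This is one reading of the caption, not the dcGO code: ties are detected by exact equality of the distance to the seed, and the function names, toy DAG and toy IC values are assumptions made for illustration.

```python
def closure(term, edges):
    """All terms reachable from `term` by following `edges` (term -> set of terms)."""
    seen, stack = set(), [term]
    while stack:
        for nxt in edges.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def slim_terms(ic, parents, seed, low, high):
    """Select at most one term per path with IC closest to `seed`."""
    children = {}
    for child, pars in parents.items():
        for p in pars:
            children.setdefault(p, set()).add(child)
    anc = {t: closure(t, parents) for t in ic}
    desc = {t: closure(t, children) for t in ic}

    unmarked = set(ic)          # Step I: all terms start unmarked
    identified = set()
    while unmarked:             # Steps V-VI: repeat until all terms are marked
        # Step II: unmarked terms whose IC is closest to the seed
        best = min(abs(ic[t] - seed) for t in unmarked)
        hits = {t for t in unmarked if abs(ic[t] - seed) == best}
        # Step III: drop hits that are ancestors of another hit
        hits = {t for t in hits if not (desc[t] & hits)}
        identified |= hits
        # Step IV: mark the hits together with their ancestors and descendants
        for t in hits:
            unmarked -= {t} | anc[t] | desc[t]
    # Step VII: keep only identified terms whose IC lies within the range
    return sorted(t for t in identified if low <= ic[t] <= high)


# Toy DAG rooted at A; C and E lie on the same path and share an IC of 1.0.
parents = {"B": {"A"}, "C": {"A"}, "G": {"A"},
           "D": {"B"}, "E": {"C"}, "F": {"D"}}
ic = {"A": 0.0, "B": 0.4, "C": 1.0, "D": 1.1, "E": 1.0, "F": 2.0, "G": 0.3}
slim = slim_terms(ic, parents, seed=1.0, low=0.75, high=1.25)
```

In the toy run, C is filtered out in favour of its descendant E, D is picked on the second pass, and G is identified last but discarded because its IC falls outside the range, leaving D and E as the slim terms.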

Supra-domain2GO

GO annotations for supra-domains ('SP')

Summary

This tab explains the algorithm behind GO annotations for supra-domains. The Structural Classification of Proteins database (Andreeva, et al., 2008) defines and classifies domains as globular structural units that are also the smallest units of evolution. When it comes to function as described by the Gene Ontology (GO), however, we are accustomed to considering whole proteins, despite the fact that very often the domain is not only the structural and evolutionary unit but also the functional unit. For this reason we previously presented a domain-centric GO method (de Lima Morais, et al., 2011). Here, we extend that framework to capture GO terms suitable for supra-domains in addition to individual superfamilies (see the Pipeline and SPFO tabs for details). The motivation is summarized as follows:

  • Although domain-centric ontology annotation has great value in describing functionally independent domains, domains often do not function alone. There is a need to understand how domain combinations contribute to functional diversity.
  • In multi-domain proteins, individual domains may be combined together to form distinct domain architectures, thus exerting neo-functions or more specific functions (Chothia and Gough, 2009).
  • The recombination of existing domains is considered one of the major driving forces for gaining functions in multi-domain proteins. In particular, certain pair-wise domain combinations (or triplets or more) may occur in diverse domain architectures and thus can be viewed as larger evolutionary units (termed supra-domains).
  • Although supra-domains are clearly of evolutionary importance, their functions remain uncharacterized. In practice, they are far more difficult than individual domains to curate by manually examining the functions of multi-domain proteins they reside in.
  • The core principle of this framework is that if a GO term tends to annotate proteins containing a supra-domain, then this term should also carry functional/GO signal for that supra-domain.
References

  • Andreeva, A., Howorth, D., Chandonia, J.M., Brenner, S.E., Hubbard, T.J., Chothia, C. and Murzin, A.G. (2008) Data growth and its impact on the SCOP database: new developments, Nucleic Acids Res, 36, D419-425.
  • Chothia C, Gough J. (2009) Genomic and structural aspects of protein evolution, Biochem J 419: 15-28.
  • de Lima Morais DA, Fang H, Rackham OJ, Wilson D, Pethica R, Chothia C, Gough J. (2011) SUPERFAMILY 1.75 including a domain-centric gene ontology method. Nucleic Acids Res 39: D427-434.

Pipeline

The implementation of this framework starts from protein domain architectures and GO annotations for UniProt proteins, available respectively from the SUPERFAMILY database and the UniProtKB-GOA database (Figure 1). We respect the hierarchical structure of GO, which is organized as a directed acyclic graph (DAG) by viewing an individual term as a node and its relations to parental terms (allowing for multiple parents) as directed edges. Accordingly, two types of inference between a supra-domain (or individual superfamily) and a GO term are performed: one relative to the root and one relative to the direct parental GO terms (Figure 2). These dual constraints ensure that only the most relevant GO terms are retained.

Figure 1. A general framework for inferring GO annotations for SCOP supra-domains using domain architectures and GO annotations for UniProt proteins (obtained from SUPERFAMILY and UniProtKB-GOA, respectively).

Figure 2. The statistical significance of inference is assessed based on the hypergeometric distribution, generating overall over-representation in terms of the whole annotations (left panel) and relative over-representation in terms of all direct parents (middle panel). Based on the maximal P-values, the statistical significance of SP-GO term associations is assessed by FDR, accounting for multiple hypothesis tests (right panel). SP denotes both supra-domains and individual superfamilies.

    Given a GO term (say t) and a supra-domain (say d), the parameters in the left and middle panels are:
  • (Left panel) N is the number of UniProt proteins annotated with at least one GO term and containing at least one supra-domain, M is the number of UniProt proteins containing the supra-domain d, K is the number of UniProt proteins annotated with the GO term t, X is the observed number of UniProt proteins both annotated with the GO term t and containing the supra-domain d, and Pwhole is the probability of observing X or more such proteins under the hypergeometric distribution.
  • (Middle panel) Npa is the number of UniProt proteins annotated with all direct parents of the GO term t in the DAG, Mpa is the number of UniProt proteins containing the supra-domain d after intersecting with those in Npa, K is the number of UniProt proteins annotated with the GO term t, X is the observed number of UniProt proteins both annotated with the GO term t and containing the supra-domain d, and Prel is the probability of observing X or more such proteins under the hypergeometric distribution.

SPFO

Based on predicted SP2GO annotations, we have also generated a trimmed-down version of GO that is the most informative for annotating supra-domains (including individual superfamilies), and refer to this slim version as SuPra-domain Functional Ontology (SPFO; Figure 3).

Figure 3. Flowchart of creating supra-domain functional ontology (SPFO) based on information theoretic analysis of SP2GO annotation profiles.

    First, we apply information theory to define the information content (IC) of a GO term as the negative log10-transformation of the frequency of observing SP annotated to that term. Because of the dependencies among GO terms (the true-path rule), an SP directly annotated to a specific GO term (a direct annotation) is also implicitly annotated to its parental terms (inherited annotations). The GO annotations generated above are considered direct annotations. For any SP, its direct and inherited annotations together constitute an SP-GO annotation profile in the DAG, and these complete annotations are used to calculate the IC for all GO terms. Of note, GO terms with similar IC can represent a partition of the DAG in terms of SP2GO.

    Second, given a predefined IC (say 1) as a seed and its corresponding range (say, [0.75 1.25]), the algorithm starts with all GO terms unmarked and iteratively identifies the unmarked GO terms whose IC is closest to the predefined IC until all GO terms are marked (Figure 4). To make sure that one and only one GO term can be identified per path in the DAG, the following constraints should be met: if multiple GO terms with identical IC are identified in the same path, the parental terms are filtered out; once a GO term is identified, all terms in the path in which that term is located are marked and excluded from any further search.

    Last, the outputs are those identified GO terms with IC falling within the range. We run the algorithm with each of four seed ICs (i.e., 0.5, 1, 1.5 and 2) to create SPFO, corresponding respectively to GO terms at four levels (least informative, moderately informative, informative, highly informative). In summary, we provide a meta-GO as a proxy for annotating both supra-domains and individual superfamilies across the three sub-ontologies: Biological Process (BP), Molecular Function (MF) and Cellular Component (CC).

Figure 4. Illustration of the algorithm used to iteratively create supra-domain functional ontology (SPFO). I). Initially, all GO terms in the DAG are unmarked (open circles); II). Identify those unmarked GO terms (filled in pink) with IC closest to a predefined IC (e.g., 1); III). Filter out those parental GO terms from identified GO terms in Step II. IV). Mark GO terms identified as well as all of their ancestors and descendants. V-VI). Continue the Steps II-IV to iteratively identify unmarked GO terms until all GO terms are marked. VII). Output only those identified GO terms with IC falling in the range (e.g., [0.75 1.25]) as SPFO.

Domain2BO

BO annotations for individual domains

Summary

This tab explains the algorithm behind Biomedical Ontologies (BO) for individual domains that are classified in the Structural Classification of Proteins database (Andreeva, et al., 2008). In dcGO, the 'BO' generally refers to all other Biomedical Ontologies that are not GO. They mainly consist of phenotype ontologies that have been developed to classify and organize phenotypic information related to model organisms and to human.

  • Like GO, these other ontologies are hierarchical, going from very general terms at the top to more specific terms at the bottom. As with domain-centric GO, dcGO provides mappings of BO terms to individual domains (see the Pipeline tab for details); each ontology has its own slim version at four levels of increasing granularity based on information content (see the SDBO tab for details).
  • Unlike GO, the BOs do not have a high-quality version of the mappings. This is largely due to an insufficient number of single-domain proteins for statistical inference, in particular in the case of species-specific annotations.

dcGO now contains a panel of biomedical ontologies from a wide variety of contexts:

  • Disease Ontology (DO) is a standardized ontology for human disease that semantically integrates disease and medical vocabularies through extensive cross mapping of DO terms to MeSH, ICD, NCI's thesaurus, SNOMED and OMIM (Schriml, et al., 2012). Also available are their mappings onto the human genome (Osborne, et al., 2009).
  • Human Phenotype Ontology (HP) captures phenotypic abnormalities that are described in OMIM, along with the corresponding disease-causing genes (Robinson, et al., 2008). It includes three complementary biological concepts: Mode of Inheritance (MI), ONset and clinical course (ON), and Phenotypic Abnormality (PA).
  • Mammalian/Mouse Phenotype Ontology (MP) describes phenotypes of the mouse after a specific gene is genetically disrupted (Smith, et al., 2009). Using it, Mouse Genome Informatics (MGI) provides high-coverage gene-level phenotypes for the mouse.
  • Worm Phenotype Ontology (WP) classifies and organizes phenotype descriptions for C. elegans and other nematodes (Schindelman, et al., 2011). Using it, WormBase provides the primary resource for phenotype annotations for C. elegans.
  • Yeast Phenotype Ontology (YP) is the major contributor to the 'Ascomycete phenotype ontology'. Using it, the Saccharomyces Genome Database (SGD) provides single mutant phenotypes for every gene in the yeast genome (Engel, et al., 2010).
  • Fly Phenotype Ontology (FP) refers to the FlyBase controlled vocabulary. Specifically, a structured controlled vocabulary is used for the annotation of alleles (for their mutagen etc) in FlyBase (Grumbling, et al., 2006).
  • Fly Anatomy Ontology (FA) is a structured controlled vocabulary of the anatomy of Drosophila melanogaster, used for the description of phenotypes and where a gene is expressed (Grumbling, et al., 2006).
  • Zebrafish Anatomy Ontology (ZA) displays anatomical terms of the zebrafish using standard anatomical nomenclature, together with affected genes (Bradford, et al., 2011).
  • Xenopus Anatomy Ontology (XA) represents the lineage of tissues and the timing of development for frogs (Xenopus laevis and Xenopus tropicalis). It is used to annotate Xenopus gene expression patterns and mutant and morphant phenotypes (Bowes, et al., 2009).
  • Arabidopsis Plant Ontology (AP) is a major contributor to the Plant Ontology which describes plant ANatomical and morphological structures (PAN) and growth and DEvelopmental stages (PDE). The Arabidopsis Information Resource (TAIR) provides arabidopsis plant ontology annotations for the model higher plant Arabidopsis thaliana (Ilic, et al., 2006; Pujar, et al., 2006).
  • Enzyme Commission (EC) is a resource focused on enzyme nomenclature, a system of naming enzymes (protein catalysts), with cross-references to UniProt sequences (Fleischmann et al., 2004). It uses four-digit EC numbers: the first three digits define the reaction catalysed, and the fourth is a unique identifier (serial number).
  • DrugBank ATC code (DB) classifies drugs at five different levels according to the organ or system on which they act (1st level, anatomical main group) and their therapeutic (2nd level, therapeutic subgroup), pharmacological (3rd level, pharmacological subgroup) and chemical properties (4th level, chemical subgroup; 5th level, chemical substance). Only drugs that are in DrugBank and have an Anatomical Therapeutic Chemical (ATC) classification are considered (Knox et al., 2011).
  • UniProtKB KeyWords (KW) is a controlled vocabulary that provides a summary of the entry content and is used to index UniProtKB/Swiss-Prot entries based on 10 categories (the category "Technical term" being excluded here). Each keyword is attributed manually to UniProtKB/Swiss-Prot entries and automatically to UniProtKB/TrEMBL entries (according to specific annotation rules) (Bairoch et al., 2005).
  • UniProtKB UniPathway (UP) is a fully manually curated resource for the representation and annotation of metabolic pathways, used as a controlled vocabulary for pathway annotation in UniProtKB (Morgat et al., 2012).
  • CTD Diseases (CD) is a MEDIC disease vocabulary (adapted from "Diseases" [C] branch of MeSH along with OMIM) that is used by CTD to annotate disease-related genes (Davis et al., 2012).
  • CTD Chemicals (CC) is a chemical vocabulary adapted by CTD from the "Chemicals and Drugs" category and Supplementary Concept Records of MeSH (Davis et al., 2012).
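
To illustrate how EC numbers and ATC codes form hierarchies, the sketch below expands a code into its successively more specific levels. The ATC prefix lengths (1, 3, 4, 5 and 7 characters) follow the standard ATC coding scheme rather than anything stated on this page, and the function names and example codes are illustrative assumptions, not dcGO identifiers.

```python
# Lengths of the code prefix at each of the five ATC levels.
ATC_LEVEL_LENGTHS = (1, 3, 4, 5, 7)


def atc_levels(code):
    """Split a full 7-character ATC code into its five hierarchical levels."""
    if len(code) != 7:
        raise ValueError("expected a full 7-character ATC code")
    return [code[:n] for n in ATC_LEVEL_LENGTHS]


def ec_levels(ec_number):
    """Split a four-part EC number into its four hierarchical levels."""
    parts = ec_number.split(".")
    if len(parts) != 4:
        raise ValueError("expected an EC number with four parts")
    return [".".join(parts[:i]) for i in range(1, 5)]


atc_path = atc_levels("A10BA02")
ec_path = ec_levels("1.1.1.1")
```

Each earlier element of the returned list can be read as a parent of the next, which is how such codes can be treated as a simple tree-shaped ontology.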
References

  • Andreeva, A., Howorth, D., Chandonia, J.M., Brenner, S.E., Hubbard, T.J., Chothia, C. and Murzin, A.G. (2008) Data growth and its impact on the SCOP database: new developments, Nucleic Acids Res, 36, D419-425.
  • Bairoch, A., Apweiler, R., et al. (2005) The Universal Protein Resource (UniProt), Nucleic Acids Res, 33, D154-9.
  • Benjamini, Y. and Hochberg, Y. (1995) Controlling the False Discovery Rate - a Practical and Powerful Approach to Multiple Testing, Journal of the Royal Statistical Society Series B-Methodological, 57, 289-300.
  • Bowes, J. B., Snyder, K. A., Segerdell, E., Jarabek, C. J., Azam, K., Zorn, A. M., and Vize, P. D. (2009) Xenbase: gene expression and improved integration, Nucleic Acids Res, 38, D607-12.
  • Bradford, Y., Conlin, T., Dunn, N., et al. (2011) ZFIN: enhancements and updates to the Zebrafish Model Organism Database, Nucleic Acids Res, 39, D822-9.
  • Davis, A.P., Murphy, C.G., et al. (2012) The Comparative Toxicogenomics Database: update 2013. Nucleic Acids Res.
  • Engel, S. R., Balakrishnan, R., Binkley, G., et al. (2010) Saccharomyces Genome Database provides mutant phenotype data, Nucleic Acids Res, 38, D433-6.
  • Fleischmann, A., Darsow, M., Degtyarenko, K., Fleischmann, W., Boyce, S., Axelsen, K.B., Bairoch, A., Schomburg, D., Tipton, K.F. and Apweiler, R. (2004) IntEnz, the integrated relational enzyme database, Nucleic Acids Res, 32, D434-7.
  • Gough, J. (2006) Genomic scale sub-family assignment of protein domains, Nucleic Acids Res, 34, 3625-3633.
  • Grumbling, G. and Strelets, V. (2006) FlyBase: anatomical data, images and queries, Nucleic Acids Res, 34, D484-8.
  • Ilic, K., Kellogg, E. A., Jaiswal, P., et al. (2006) The plant structure ontology, a unified vocabulary of anatomy and morphology of a flowering plant, Plant Physiol, 143, 587-99.
  • Morgat, A., Coissac, E., et al. (2012) UniPathway: a resource for the exploration and annotation of metabolic pathways, Nucleic Acids Res, 40, D761-9.
  • Osborne, J.D., Flatow, J., Holko, M., Lin, S.M., Kibbe, W.A., Zhu, L.J., Danila, M.I., Feng, G. and Chisholm, R.L. (2009) Annotating the human genome with Disease Ontology. BMC Genomics, 10, S1–S6.
  • Pujar, A., Jaiswal, P., Kellogg, E. A., et al. (2006) Whole-plant growth stage ontology for angiosperms and its application in plant biology, Plant Physiol, 142, 414-28.
  • Robinson, P.N., Kohler, S., Bauer, S., Seelow, D., Horn, D. and Mundlos, S. (2008) The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease. Am J Hum Genet, 83, 610-615.
  • Knox, C., Law, V., et al. (2011) DrugBank 3.0: a comprehensive resource for 'omics' research on drugs. Nucleic Acids Res, 39, D1035-41.
  • Schindelman, G., Fernandes, J. S., Bastiani, C. A., Yook, K. and Sternberg, P. W. (2011) Worm Phenotype Ontology: integrating phenotype data within and beyond the C. elegans community, BMC Bioinformatics, 12:32.
  • Schriml LM, Arze C, Nadendla S, et al. (2012) Disease Ontology: A backbone for disease semantic integration. Nucleic Acids Res, 40, D940-D946.
  • Smith, C.L. and Eppig, J.T. (2009) The Mammalian Phenotype Ontology: enabling robust annotation and comparative analysis, Wiley Interdiscip Rev Syst Biol Med, 1, 390-399.

Pipeline

The pipeline is based on the following principle: if a BO term tends to annotate proteins containing a certain domain, then that term should also be associated with the domain. Such inherent signals for domains can be inferred in reverse if the number of BO-annotated proteins containing the domain is significantly higher than would be expected by chance. Figure 1 illustrates the procedure used to generate domain-centric BO associations from individual protein/gene-level annotations in human.

Figure 1. Flowchart for inferring domain-centric BO annotations using protein/gene-level BO annotations and domain assignments from the SUPERFAMILY database.

Data Source

  • Protein/gene-level BO annotations are taken from the relevant ontology. We only consider the longest transcript for each gene to ensure the one-gene-one-protein mapping is valid, as these species-specific annotations are gene-orientated rather than protein-based.

  • Unlike Domain2GO, associations between domains and ontological terms are only supported using a background made up of all proteins (i.e. the SCOP2BO correspondence matrix). This is due to the lack of statistical power arising from an insufficient number of single-domain proteins in the genome of a single species.

  • The SCOP2BO correspondence matrix between domains and BO terms consists of the observed number of proteins/genes that contain a given domain (columns) and are annotated with a specific BO term (rows).
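
The one-gene-one-protein mapping described above can be sketched as a simple selection of the longest transcript per gene. The function name, the tuple layout and the identifiers are hypothetical; ties are resolved here by keeping the first transcript seen, which is an arbitrary choice for illustration.

```python
def longest_transcript_per_gene(transcripts):
    """Pick one representative transcript per gene.

    transcripts: iterable of (gene id, transcript id, length) tuples.
    Returns dict gene id -> transcript id of the longest transcript.
    """
    best = {}
    for gene, transcript, length in transcripts:
        if gene not in best or length > best[gene][1]:
            best[gene] = (transcript, length)
    return {gene: transcript for gene, (transcript, _) in best.items()}


# Made-up records: gene g1 has two transcripts, gene g2 has one.
records = [("g1", "t1", 300), ("g1", "t2", 450), ("g2", "t3", 120)]
representative = longest_transcript_per_gene(records)
```

Gene-level BO annotations can then be attached to the domains found in each representative protein before the correspondence matrix is counted.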

Statistical Analysis

  • Given a SCOP2BO correspondence matrix, we use the hypergeometric distribution as the null model and perform a statistical test (equivalent to Fisher’s exact test) to infer the possible associations between a BO term and a protein domain.

  • To make use of the DAG structure and true-path rule, two types of inference are performed to infer the overall and relative associations between a domain and a BO term (Figure 2):

    Reasons:

    • The hierarchical structure of BO is organized as a directed acyclic graph (DAG) by considering an individual term as a node and its relations to parental terms (allowing for multiple parents) as directed edges.
    • The BOs follow the 'true-path rule', that is, a protein annotated to a term should also be annotated to all of its parent terms.

    Steps:

    • First, we calculate an overall P-value (and the corresponding overall hypergeometric score, that is, a standard score or Z-score, calculated as the observed count minus the expected count, divided by the standard deviation under the hypergeometric distribution) using all analyzable proteins/genes (i.e., those annotated to the root term of the BO after applying the true-path rule) as the background.
    • Second, we calculate a relative P-value (and the corresponding relative hypergeometric score) using the background of proteins/genes annotated to the direct parental BO terms.
    • The purpose of the second background is as follows: if a BO term and its direct parental term are both significantly associated with a domain according to the first background, and the term is not substantially different from the parental term (i.e., not significant using the second background), then it is sufficient to retain only the parental term. Thus, only the most relevant BO terms will be retained and given associations to domains.
    • The significance of association is measured by false discovery rate (FDR), while the strength of association is measured using the hypergeometric distribution-based score.
    • For a domain, the associated BO terms (i.e., direct annotations) are propagated to all ancestor terms (i.e., inherited annotations); together they constitute a complete BO annotation.
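The true-path rule and the first (overall) background can be sketched as follows. This is only an illustration under stated assumptions: the DAG is assumed to be given as a mapping from each term to its direct parents, and the names `ancestors`, `apply_true_path` and the toy terms are hypothetical.

```python
def ancestors(term, parents):
    """Return all ancestors of `term` in a DAG given as term -> direct parents."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def apply_true_path(gene_terms, parents):
    """Extend each gene's direct annotations with all inherited (ancestor) terms."""
    complete = {}
    for gene, terms in gene_terms.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term, parents)
        complete[gene] = full
    return complete


# Toy DAG: term "c" has two direct parents, "a" and "b".
parents = {"root": set(), "a": {"root"}, "b": {"root"}, "c": {"a", "b"}}
gene_terms = {"g1": {"c"}, "g2": {"a"}}
full = apply_true_path(gene_terms, parents)

# Background for the overall test: genes annotated to the root after propagation.
overall_background = {gene for gene, terms in full.items() if "root" in terms}
print(sorted(full["g1"]), sorted(overall_background))
```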

Figure 2. The statistical significance of inference is assessed based on the hypergeometric distribution, generating an overall over-representation in terms of whole annotations (left panel) and a relative over-representation in terms of all direct parents (middle panel). Based on the maximal P-values, statistical significance of Domain2BO associations can be assessed by the method of FDR accounting for multiple hypothesis tests (right panel).

    Given a BO term (say t) and a domain (say d), the parameters in the left and middle panels are defined as follows:
  • (Left panel) N is the number of genes annotated with at least one BO term and containing at least one domain, M is the number of genes containing the domain d, K is the number of genes annotated with the BO term t, X is the observed number of genes both annotated with the BO term t and containing the domain d, and Pwhole is the probability of observing X or more such genes under the hypergeometric distribution.
  • (Middle panel) Npa is the number of genes annotated with all direct parents of that BO term t in the DAG, Mpa is the number of genes containing the domain d after intersecting with those genes in Npa, K is the number of genes annotated with the BO term t, X is the observed number of genes both annotated with the BO term t and containing the domain d, and Prel is the probability of observing X or more such genes under the hypergeometric distribution.
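The parameters above map directly onto a hypergeometric upper-tail probability and a Z-score. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the variance formula is the textbook one for the hypergeometric distribution rather than something taken from the dcGO code.

```python
from math import comb, sqrt


def hypergeom_tail(N, M, K, X):
    """Probability of observing X or more term-annotated genes among the M
    domain-containing genes, when K of the N background genes carry the term."""
    upper = min(M, K)
    total = comb(N, M)
    favourable = sum(comb(K, i) * comb(N - K, M - i) for i in range(X, upper + 1))
    return favourable / total


def hypergeom_zscore(N, M, K, X):
    """Observed minus expected, divided by the hypergeometric standard deviation."""
    expected = M * K / N
    variance = M * (K / N) * (1 - K / N) * (N - M) / (N - 1)
    return (X - expected) / sqrt(variance)


# Toy numbers: 10 background genes, 4 with the domain, 5 with the term, 3 with both.
p_whole = hypergeom_tail(10, 4, 5, 3)
z_whole = hypergeom_zscore(10, 4, 5, 3)
print(round(p_whole, 4), round(z_whole, 4))
```

The relative P-value would be obtained from the same function by passing Npa and Mpa in place of N and M.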

Domain2BO Mappings

  • The criterion for identifying domain-centric BO associations is a stringent FDR threshold (FDR < 0.001).

  • For a domain, the associated BO terms (i.e., direct annotations) are propagated to all ancestor terms (i.e., inherited annotations); together these constitute a complete BO annotation.

  • Since SCOP classifies evolutionarily related domains at both the superfamily level and the family level, we have accordingly generated the domain-centric BO annotations at each of these two classification levels in the SCOP hierarchy.
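The FDR criterion can be illustrated with the sketch below. The text does not name the specific FDR procedure, so the Benjamini-Hochberg adjustment used here is an assumption, as are the function name and the made-up P-values. Following the caption of Figure 2, the larger of the overall and relative P-values is taken for each domain-term pair before adjustment.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted values, in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical (overall, relative) P-values for three domain-term pairs.
pairs = {
    ("d1", "t1"): (1e-6, 1e-5),
    ("d1", "t2"): (1e-8, 0.2),   # significant overall, but not relative to parents
    ("d2", "t1"): (0.03, 0.01),
}
keys = list(pairs)
max_p = [max(pairs[key]) for key in keys]
fdr = benjamini_hochberg(max_p)
kept = [key for key, q in zip(keys, fdr) if q < 0.001]
print(kept)
```

Taking the maximum means a pair survives only if it is significant against both backgrounds, which is what removes terms that add nothing beyond their parents.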

SDBO

Based on domain-centric BO annotations, we have also generated a trimmed-down version of BO that is the most informative for annotating individual domains. We refer to this slim version as Structural Domain Biomedical Ontology (SDBO; Figure 3). Notably, SDBO is ontology-specific; for example, it should be interpreted as SDDO when it comes to the Disease Ontology.

Figure 3. Flowchart for creating the structural domain biomedical ontology (SDBO) based on principles of information theory applied to Domain2BO annotation profiles.

    First, we apply information theory to define the information content (IC) of a BO term: the negative log10-transformation of the frequency of observing domains annotated to that term. For any domain, the BO terms annotated to that domain constitute a domain-BO annotation profile in the DAG, including direct annotations as well as inherited annotations according to the true-path rule. Considering the dependencies among BO terms (the so-called true-path rule), a domain/protein directly annotated to a specific BO term (termed direct annotations) should also be annotated by inheritance to its parental terms (termed inherited annotations). The BO annotations generated above can be considered as direct annotations. The complete BO annotations (direct and inherited) are used to calculate the IC for all BO terms. Of note, those BO terms with similar IC can represent a partition of the DAG in terms of Domain2BO.
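    The IC definition can be sketched as below. The denominator of the frequency is not spelled out in the text; this sketch assumes it is the total number of annotated domains. The function name and the toy profiles are hypothetical.

```python
from math import log10


def information_content(domain_terms):
    """Compute IC per term as -log10(fraction of domains annotated to the term).

    domain_terms: domain -> set of terms, direct plus inherited (assumed input).
    """
    n_domains = len(domain_terms)
    counts = {}
    for terms in domain_terms.values():
        for term in terms:
            counts[term] = counts.get(term, 0) + 1
    return {term: -log10(count / n_domains) for term, count in counts.items()}


# Toy complete annotation profiles for four domains.
domain_terms = {
    "d1": {"root", "a"},
    "d2": {"root", "a", "c"},
    "d3": {"root"},
    "d4": {"root"},
}
ic = information_content(domain_terms)
print({term: round(value, 3) for term, value in sorted(ic.items())})
```

    Terms annotating every domain (such as the root) have an IC of zero, and the IC grows as a term becomes rarer among domains.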

    Second, given a predefined IC (say 1) as a seed and a corresponding range (say, [0.75 1.25]), the proposed algorithm starts with all BO terms unmarked, and iteratively identifies the unmarked BO terms closest to a predefined IC until all BO terms are marked (Figure 4). To make sure that one and only one BO term can be identified per path in the DAG, the following constraints should be met: If multiple BO terms with identical IC are identified in the same path, those parental terms are filtered out; once a BO term is identified, all terms in the path in which that term is located will be marked to be excluded from any further search.

    Last, the output is those BO terms with an IC falling within the range. We run the algorithm using four seed ICs (i.e., 0.5, 1, 1.5 and 2) to create SDBO, respectively corresponding to BO terms with four levels (least informative, moderately informative, informative, highly informative).
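    The three steps above can be sketched as follows. This is one possible reading of the procedure, not the dcGO implementation: the function names, the representation of the DAG as a term-to-direct-parents mapping, and the toy IC values are all assumptions.

```python
def reachable(term, edges):
    """All terms reachable from `term` by following `edges` (term -> neighbours)."""
    seen, stack = set(), list(edges.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(edges.get(current, ()))
    return seen


def build_slim(ic, parents, seed, low, high):
    """Pick at most one term per path, closest to `seed`, then keep those in range."""
    children = {}
    for child, direct_parents in parents.items():
        for parent in direct_parents:
            children.setdefault(parent, set()).add(child)
    unmarked = set(ic)
    identified = set()
    while unmarked:
        best = min(abs(ic[t] - seed) for t in unmarked)
        closest = {t for t in unmarked if abs(ic[t] - seed) == best}
        # Filter out parental terms that have a descendant among the closest terms.
        closest = {t for t in closest if not (reachable(t, children) & closest)}
        identified |= closest
        # Mark each identified term together with its ancestors and descendants.
        for t in closest:
            unmarked -= {t} | reachable(t, parents) | reachable(t, children)
    return sorted(t for t in identified if low <= ic[t] <= high)


# Toy DAG with two paths: root -> a -> c -> e and root -> b -> f.
parents = {"a": {"root"}, "b": {"root"}, "c": {"a"}, "e": {"c"}, "f": {"b"}}
ic = {"root": 0.0, "a": 0.4, "b": 1.0, "c": 0.9, "e": 1.6, "f": 2.0}
print(build_slim(ic, parents, 1.0, 0.75, 1.25))
print(build_slim(ic, parents, 2.0, 1.75, 2.25))
```

    In the second call the term "e" is identified on its path but is dropped at the end because its IC lies outside the range, which mirrors the final output step.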

Figure 4. Illustration of the algorithm for iteratively creating the structural domain biomedical ontology (SDBO). I) Initially, all BO terms in the DAG are unmarked (open circles). II) Identify those unmarked BO terms (filled in pink) with IC closest to a predefined IC (e.g., 1). III) Filter out parental BO terms from the BO terms identified in Step II. IV) Mark the identified BO terms as well as all of their ancestors and descendants. V-VI) Repeat Steps II-IV to iteratively identify unmarked BO terms until all BO terms are marked. VII) Output only those identified BO terms with IC falling in the range (e.g., [0.75 1.25]) as SDBO.

Supra-domain2BO

BO annotations for supra-domains ('SP')

Summary

This tab explains the algorithm behind BO annotations of supra-domains. The Structural Classification of Proteins database (Andreeva, et al., 2008) classifies domains as globular structural units, defined such that they are the smallest units of evolution. In multidomain proteins, certain domains often tend to co-occur/co-evolve with other specific domains. We define commonly occurring combinations of two or more successive domains as 'supra-domains'. The domain architecture is a modular view of a protein sequence; in the SUPERFAMILY database (de Lima Morais et al, 2011), it is represented as the sequential order of SCOP domains (at the superfamily level) or gaps (estimated to be one or more unknown domains). As with domain-centric GO, dcGO provides mappings of BO terms to supra-domains (see the Pipeline tab for details); each ontology has its own slim version at four levels of increasing granularity based on information content (see the SPBO tab for details). The core principle of this framework is that, if a BO term tends to annotate proteins containing a supra-domain, then this term should also confer BO signals for that supra-domain.
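To make the notion of successive domain combinations concrete, the sketch below enumerates candidate supra-domains from a single domain architecture. It rests on assumptions not stated in the text: that a combination may not span a gap, that the gap is encoded by a placeholder string, and that deciding which candidates are "commonly occurring" is a separate step not shown here. The function name and domain labels are hypothetical.

```python
def candidate_supra_domains(architecture, gap="_gap_"):
    """Return every run of two or more successive domains that contains no gap.

    architecture: ordered list of superfamily identifiers and gap placeholders.
    """
    found = set()
    n = len(architecture)
    for start in range(n):
        for end in range(start + 2, n + 1):
            window = tuple(architecture[start:end])
            if gap in window:
                break  # longer windows from this start would also include the gap
            found.add(window)
    return found


# Toy architecture: two domains, a gap, then three domains.
architecture = ["a", "b", "_gap_", "c", "d", "e"]
for combo in sorted(candidate_supra_domains(architecture)):
    print(combo)
```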

References

  • Andreeva, A., Howorth, D., Chandonia, J.M., Brenner, S.E., Hubbard, T.J., Chothia, C. and Murzin, A.G. (2008) Data growth and its impact on the SCOP database: new developments, Nucleic Acids Res, 36, D419-425. Abstract [ PubMed ]  
  • de Lima Morais DA, Fang H, Rackham OJ, Wilson D, Pethica R, Chothia C, Gough J. (2011) SUPERFAMILY 1.75 including a domain-centric gene ontology method, Nucleic Acids Res 39: D427-434. Abstract [ PubMed ]  

Pipeline

The implementation of this framework starts from protein domain architectures and protein/gene-level BO annotations (Figure 1). We respect the hierarchical structure of BO, which is organized as a directed acyclic graph (DAG) by viewing an individual term as a node and its relations to parental terms (allowing for multiple parents) as directed edges. Accordingly, two types of inference between a supra-domain (or an individual superfamily) and a BO term are performed: one relative to the root and one relative to the direct parental BO terms (Figure 2). These dual constraints ensure that only the most relevant BO terms are retained.

Figure 1. A general framework for inferring BO annotations for SCOP supra-domains using protein/gene-level BO annotations and domain architectures in SUPERFAMILY database.

Figure 2. The statistical significance of inference is assessed based on the hypergeometric distribution, generating an overall over-representation in terms of whole annotations (left panel) and a relative over-representation in terms of all direct parents (middle panel). Based on the maximal P-values, statistical significance of SP-BO term associations can be assessed by the method of FDR accounting for multiple hypothesis tests (right panel). SP denotes both supra-domains and individual superfamilies.

    Given a BO term (say t) and a supra-domain (say d), the parameters in the left and middle panels are defined as follows:
  • (Left panel) N is the number of genes annotated with at least one BO term and containing at least one supra-domain, M is the number of genes containing the supra-domain d, K is the number of genes annotated with the BO term t, X is the observed number of genes both annotated with the BO term t and containing the supra-domain d, and Pwhole is the probability of observing X or more such genes under the hypergeometric distribution.
  • (Middle panel) Npa is the number of genes annotated with all direct parents of that BO term t in the DAG, Mpa is the number of genes containing the supra-domain d after intersecting with those genes in Npa, K is the number of genes annotated with the BO term t, X is the observed number of genes both annotated with the BO term t and containing the supra-domain d, and Prel is the probability of observing X or more such genes under the hypergeometric distribution.
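For reference, the two tail probabilities can be written out explicitly. The expressions below are the standard upper-tail form of the hypergeometric distribution using the symbols defined for Figure 2; they are offered as a reading aid rather than as a transcription of the figure, and the summation index i is introduced here for illustration.

```latex
P_{\mathrm{whole}} = \sum_{i=X}^{\min(M,\,K)}
  \frac{\binom{K}{i}\binom{N-K}{M-i}}{\binom{N}{M}},
\qquad
P_{\mathrm{rel}} = \sum_{i=X}^{\min(M_{pa},\,K)}
  \frac{\binom{K}{i}\binom{N_{pa}-K}{M_{pa}-i}}{\binom{N_{pa}}{M_{pa}}}
```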

SPBO

Based on predicted SP2BO annotations, we have also generated a trimmed-down version of BO that is the most informative for annotating supra-domains (including individual superfamilies). We refer to this slim version as SuPra-domain Biomedical Ontology (SPBO; Figure 3). Notably, SPBO is ontology-specific; for example, it should be interpreted as SPDO when it comes to the Disease Ontology.

Figure 3. Flowchart for creating the supra-domain biomedical ontology (SPBO) based on information-theoretic analysis of SP2BO annotation profiles.

    First, we apply information theory to define the information content (IC) of a BO term: the negative log10-transformation of the frequency of observing SPs annotated to that term. For any SP, the BO terms annotated to that SP constitute an SP-BO annotation profile in the DAG, including direct annotations as well as inherited annotations according to the true-path rule. Considering the dependencies among BO terms (the so-called true-path rule), an SP directly annotated to a specific BO term (termed direct annotations) should also be annotated by inheritance to its parental terms (termed inherited annotations). The BO annotations generated above can be considered as direct annotations. The complete BO annotations (direct and inherited) are used to calculate the IC for all BO terms. Of note, those BO terms with similar IC can represent a partition of the DAG in terms of SP2BO.

    Second, given a predefined IC (say 1) as a seed and a corresponding range (say, [0.75 1.25]), the proposed algorithm starts with all BO terms unmarked, and iteratively identifies the unmarked BO terms closest to the predefined IC until all BO terms are marked (Figure 4). To make sure that one and only one BO term can be identified per path in the DAG, the following constraints should be met: if multiple BO terms with identical IC are identified in the same path, the parental terms are filtered out; once a BO term is identified, all terms in the path in which that term is located are marked and excluded from any further search.

    Last, the outputs are those identified BO terms with IC falling in the range. We run the algorithm using each of four seed ICs (i.e., 0.5, 1, 1.5 and 2) to create SPBO, respectively corresponding to BO terms with four levels (least informative, moderately informative, informative, highly informative).

Figure 4. Illustration of the algorithm for iteratively creating the supra-domain biomedical ontology (SPBO). I) Initially, all BO terms in the DAG are unmarked (open circles). II) Identify those unmarked BO terms (filled in pink) with IC closest to a predefined IC (e.g., 1). III) Filter out parental BO terms from the BO terms identified in Step II. IV) Mark the identified BO terms as well as all of their ancestors and descendants. V-VI) Repeat Steps II-IV to iteratively identify unmarked BO terms until all BO terms are marked. VII) Output only those identified BO terms with IC falling in the range (e.g., [0.75 1.25]) as SPBO.