SUPERFAMILY 1.75 HMM library and genome assignments server

Domain-centric Gene Ontology (GO) Annotations and Structural Domain Functional Ontology

Jump to [ Top · Domain2GO · SDFO · Data availability ]

This document explains the details behind the GO annotations of structural domains classified in the Structural Classification of Proteins (SCOP) database (Andreeva, et al., 2008). Domain-level functional annotation is motivated by the following factors.

    First, there is a growing gap between the number of known protein sequences and the number of proteins whose functions are known. Accordingly, there is an urgent need for an automated procedure that predicts function from sequence on a large scale.

    Second, domain assignments for a protein can be made routinely. For example, the SUPERFAMILY database (Gough, 2006) provides high-coverage domain assignments for proteins in all sequenced genomes and in sequence collections such as UniProtKB.

    Third, protein-level functional annotations are conventionally curated in a labor-intensive manner that generally ignores the context of the structural domains. For instance, the Gene Ontology Annotation (GOA) project offers high-quality GO annotations directly associated with proteins in UniProtKB over a wide spectrum of species (Barrell, et al., 2009).

To sum up, the wealth of data that these projects (genome sequencing, structural genomics, functional genomics) have already generated makes it feasible to create domain-centric GO annotations from individual protein-level annotations. Recently, we have taken advantage of manually derived GOA annotations and comprehensive domain assignments for proteins in UniProtKB to statistically infer domain-centric GO annotations (de Lima Morais et al., 2011). This statistical inference rests on one assumption: if a GO term tends to annotate proteins containing a domain, then that term should also carry a functional signal for that domain. Respecting the hierarchical structure of GO as well as the domain composition of proteins (see below for details), we have generated the first GO annotations for evolutionarily close domains (at the SCOP family level) and distant domains (at the SCOP superfamily level). Moreover, we have initialized a trimmed-down version of GO that is the most informative for annotating domains. This resource represents an ongoing effort to develop a Structural Domain Functional Ontology (SDFO). We expect that domain-centric GO annotations, together with other resources and tools in the SUPERFAMILY web server, will greatly facilitate our understanding of functional genomics across the tree of life. Together with a reference species tree of (sequenced) life, this resource can be used to examine how the sets of domains annotated by any chosen GO term are distributed over the course of species evolution.


The pipeline of building domain-centric GO annotations

The motivations are twofold: (i) from the biological point of view, structural domains constitute the functional units of proteins, and thus their functions are inherent in the protein-level GO annotations; (ii) from the methodological point of view, such inherent GO annotations for a domain can be inferred in reverse if the number of GO-annotated proteins containing that domain is significantly higher than would be expected by chance. Figure 1 summarizes the procedure for generating domain-centric GO annotations from individual protein-level annotations.

Figure 1. Flowchart of inferring domain-centric GO annotations using the UniProtKB-GOA database and domain assignments in the SUPERFAMILY database.

    Data Source. Protein-level GO annotations are taken from UniProtKB-GOA. To reduce false positives and avoid data circularity from InterPro and Pfam, we only consider annotations supported by experimental or manual evidence codes (see GO Evidence Codes). The high-coverage domain compositions of these UniProts are taken from the SUPERFAMILY database at the SCOP family and superfamily levels. We term the UniProt sequences that are annotated with at least one GO term and contain at least one domain our analyzable UniProt sequence space. Notably, the large size of this sequence space gives the statistical inference adequate power to reveal significant associations between a GO term and a domain from protein-oriented GO annotations.

    UniProt2GO Matrix. Two sets of UniProts (i.e., singleton-domain UniProts, and all UniProts including multidomain UniProts) are used to support Domain2GO associations. The UniProt2GO mapping matrix for singleton-domain UniProts yields associations that are truly domain-centric; its drawback is the limited number of singleton-domain UniProts available for statistical testing (inadequate inference). In contrast, the UniProt2GO mapping matrix for all UniProts, including multidomain UniProts, yields a sufficient number of associations, but it requires an independence assumption. The contribution of each domain in a multidomain protein to the protein's functions is known to range from dominant to trivial. Domain2GO associations resulting from the UniProt2GO mapping matrix for all UniProts can be considered dominant. Users who are interested in genome-wide functional annotation, and who may not care whether annotations are truly domain-centric, are recommended to use the high-coverage version of Domain2GO resulting from the UniProt2GO mapping matrix for all UniProts. Otherwise, the high-quality Domain2GO supported by both should take priority. Except in special cases, we do not recommend using results supported only by singleton-domain UniProts. In other words, being truly domain-centric also means being dominant.

    Statistical Analysis. For a UniProt2GO mapping matrix, two types of enrichment analysis are performed to infer the overall and relative associations between a domain and a GO term (Figure 2). The hierarchical structure of GO is organized as a directed acyclic graph (DAG), viewing an individual term as a node and its relations to parental terms (allowing for multiple parents) as directed edges. Statistical inference of a possible association between a GO term (say t) and a domain (say d) is performed not only against our analyzable UniProt space, but also against those UniProts annotated to all direct parents of that GO term. These dual constraints ensure that only the most informative GO terms are retained. Because many hypotheses are tested simultaneously, the statistical significance of domain-GO term associations is assessed using the false discovery rate (FDR) (Benjamini and Hochberg, 1995).

    Domain2GO. High-quality domain-GO associations are identified by a stringent FDR threshold (<0.001) and must be supported both by singleton-domain UniProts and by all UniProts. Since GO depicts three complementary biological concepts, Biological Process (BP), Molecular Function (MF) and Cellular Component (CC), and SCOP classifies evolutionarily related domains at the superfamily and family levels, we have accordingly generated domain-centric GO annotations for each of the three concepts at the two domain levels.

Figure 2. The statistical significance of inference is assessed based on the hypergeometric distribution, generating overall over-representation in terms of the whole set of annotations (left panel) and relative over-representation in terms of all direct parents (middle panel). Based on the maximal P-values, the statistical significance of domain-GO term associations is assessed by the FDR method, which accounts for multiple hypothesis tests (right panel).
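The two enrichment tests and the FDR step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the function names (`hypergeom_sf`, `association_p`, `benjamini_hochberg`) are hypothetical, the test is taken to be a one-sided upper-tail hypergeometric test, and the set of UniProts "annotated to all direct parents" is assumed to be precomputed by the caller.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """Upper-tail probability P(X >= k) when n items are drawn without
    replacement from a population of N that contains K successes."""
    total = comb(N, n)
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total

def association_p(domain_prots, term_prots, parent_prots, universe):
    """Maximal P-value of the overall and relative enrichment tests
    for one domain-GO term pair (all arguments are sets of protein ids)."""
    d = domain_prots & universe
    t = term_prots & universe
    # Overall test: background is the whole analyzable UniProt space.
    overall = hypergeom_sf(len(d & t), len(universe), len(t), len(d))
    # Relative test: background is restricted to proteins annotated to the
    # direct parents of the term (assumed to be supplied by the caller).
    bg = parent_prots & universe
    d_bg, t_bg = d & bg, t & bg
    relative = hypergeom_sf(len(d_bg & t_bg), len(bg), len(t_bg), len(d_bg))
    return max(overall, relative)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Under these assumptions, a domain-GO term pair would be retained when its adjusted value falls below the 0.001 threshold quoted above.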


Initializing structural domain functional ontology

Based on the high-quality Domain2GO, we have also initialized a trimmed-down version of GO that is the most informative for annotating structural domains (Figure 3).

Figure 3. Flowchart of creating structural domains functional ontology (SDFO) based on information theoretic analysis of Domain2GO annotation profiles.

    First, we apply information theory to define the information content (IC) of a GO term as the negative log10-transformation of the frequency of observing domains annotated to that term. For any domain, the GO terms annotated to that domain constitute a domain-GO annotation profile in the DAG, including direct annotations as well as inherited annotations according to the true-path rule. Given the dependencies among GO terms (the so-called true-path rule), a domain/protein directly annotated to a specific GO term (a direct annotation) is also annotated, by inheritance, to its parental terms (inherited annotations). The GO annotations generated above can be considered direct annotations. The complete GO annotations (direct and inherited) are used to calculate the IC for all GO terms. Of note, GO terms with similar IC can represent a partition of the DAG in terms of Domain2GO.

    Second, given a predefined IC (say 1) as a seed and its corresponding range (say, [0.75 1.25]), the proposed algorithm starts with all GO terms unmarked, and iteratively identifies the unmarked GO terms closest to the predefined IC until all GO terms are marked (Figure 4). To make sure that one and only one GO term is identified per path in the DAG, the following constraints must be met: if multiple GO terms with identical IC are identified in the same path, the parental terms are filtered out; once a GO term is identified, all terms in the path in which that term is located are marked and excluded from further search.

    Last, the outputs are the identified GO terms with IC falling in the range. We run the algorithm using each of four seed ICs (i.e., 0.5, 1, 1.5 and 2) to create SDFO, corresponding respectively to GO terms at four levels (least informative, moderately informative, informative, highly informative).
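As an illustration of the IC definition in the first step, the following Python sketch propagates direct annotations to all ancestors (the true-path rule) and then takes the negative log10 of each term's frequency. The names `ancestors` and `information_content` are hypothetical, and the denominator of the frequency is assumed here to be the total number of annotated domains; the text above does not specify it.

```python
from math import log10

def ancestors(term, parents):
    """All ancestors of a term in the DAG; `parents` maps term -> direct parents."""
    seen, stack = set(), [term]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def information_content(direct, parents):
    """IC per GO term from direct annotations.

    `direct` maps each domain to the set of GO terms it is directly
    annotated to. Assumption: frequency = annotated domains / all domains.
    """
    counts = {}
    for terms in direct.values():
        # Complete profile = direct terms plus inherited (ancestor) terms.
        profile = set(terms)
        for t in terms:
            profile |= ancestors(t, parents)
        for t in profile:
            counts[t] = counts.get(t, 0) + 1
    n_domains = len(direct)
    return {t: -log10(c / n_domains) for t, c in counts.items()}
```

With this definition, a root term that annotates every domain has an IC of 0, and the IC grows as a term annotates fewer domains.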

Figure 4. Illustration of the algorithm for iteratively creating the structural domain functional ontology (SDFO). I) Initially, all GO terms in the DAG are unmarked (open circles). II) Identify the unmarked GO terms (filled in pink) with IC closest to a predefined IC (e.g., 1). III) Filter out the parental GO terms from the GO terms identified in Step II. IV) Mark the identified GO terms as well as all of their ancestors and descendants. V-VI) Repeat Steps II-IV to iteratively identify unmarked GO terms until all GO terms are marked. VII) Output only those identified GO terms with IC falling in the range (e.g., [0.75 1.25]) as SDFO.
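Steps I-VII can be sketched in Python as follows. This is one possible reading of the algorithm rather than the implementation used to build SDFO: the function names are hypothetical, "closest" is taken to mean the smallest absolute difference from the seed IC, and ties on the same path are resolved by keeping the most specific term.

```python
def ancestors(term, parents):
    """All ancestors of a term; `parents` maps term -> direct parents."""
    seen, stack = set(), [term]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def select_sdfo(ic, parents, seed=1.0, low=0.75, high=1.25):
    """Return the GO terms chosen for one seed IC, sorted by name."""
    anc = {t: ancestors(t, parents) & set(ic) for t in ic}
    desc = {t: set() for t in ic}
    for t, ups in anc.items():
        for a in ups:
            desc[a].add(t)
    unmarked = set(ic)          # Step I: everything starts unmarked.
    identified = []
    while unmarked:
        # Step II: unmarked terms whose IC is closest to the seed.
        best = min(abs(ic[t] - seed) for t in unmarked)
        hits = {t for t in unmarked if abs(ic[t] - seed) == best}
        # Step III: drop a hit if one of its descendants is also a hit.
        hits = {t for t in hits if not (desc[t] & hits)}
        identified.extend(hits)
        # Step IV: mark each hit with all its ancestors and descendants.
        for t in hits:
            unmarked -= {t} | anc[t] | desc[t]
    # Step VII: keep only identified terms whose IC lies in the range.
    return sorted(t for t in identified if low <= ic[t] <= high)
```

Running the function once per seed (0.5, 1, 1.5 and 2, each with its own range) would give the four SDFO levels described above.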


Data Availability

In addition to two hierarchies (SCOP-Hierarchy and GO-Hierarchy) for browsing, we also provide two Domain2GO mapping results (i.e., Domain2GO_supported_by_both and Domain2GO_supported_only_by_all) in two parsable formats (i.e., plain files and MySQL tables). Which one to use is the user's decision. In our experience, Domain2GO_supported_by_both favors small-scale studies (e.g., obtaining high-quality, truly domain-centric lists), and Domain2GO_supported_only_by_all favors large-scale studies (e.g., GO enrichment analysis). Although we also offer Domain2GO at the SCOP fold and class levels, these should be treated with particular caution because they are of no use in terms of evolutionary relevance.

Domain2GO supported by both

  • High-quality truly domain-centric GO annotations supported by singleton domain UniProts and all UniProts (including multidomain UniProts) are available in the Domain2GO_supported_by_both.txt file.

  • Statistics for the Domain2GO annotations are summarized in two forms: 1) SCOP hierarchy with the number of GO terms (direct and inherited; three GO sub-ontologies: BP, MF and CC), available in the Domain2GO_SCOP.both.obo file. 2) GO hierarchy with the number of domains (direct and inherited; four SCOP levels: FA, SF, CF and CL), available in the Domain2GO_GO.both.obo file. With the help of OBO-Edit, it is easy to browse these two obo format files.

  • GO terms regarded as SDFO (four levels: least informative, moderately informative, informative, and highly informative) can be found in the SDFO.both.txt file. We strongly recommend that users use these GO terms and their annotating domains from Domain2GO_supported_by_both.txt. Unlike the whole GO hierarchy, these GO terms at different granularities are representative and comprehensive in terms of their relevance to domains (not proteins). Keep in mind that SDFO is defined for each of the three GO sub-ontologies (i.e., BP, MF, and CC) at each of the four SCOP domain levels (i.e., FA, SF, CF, and CL).

Domain2GO supported only by all

  • High-coverage domain-centric GO annotations supported only by all UniProts (including multidomain UniProts) are available in the Domain2GO_supported_only_by_all.txt file.

  • Statistics for the Domain2GO annotations are summarized in two forms: 1) SCOP hierarchy with the number of GO terms (direct and inherited; three GO sub-ontologies: BP, MF and CC), available in the Domain2GO_SCOP.all.obo file. 2) GO hierarchy with the number of domains (direct and inherited; four SCOP levels: FA, SF, CF and CL), available in the Domain2GO_GO.all.obo file. With the help of OBO-Edit, it is easy to browse these two obo format files.

  • GO terms regarded as SDFO (four levels: least informative, moderately informative, informative, and highly informative) can be found in the SDFO.all.txt file. We strongly recommend that users use these GO terms and their annotating domains from Domain2GO_supported_only_by_all.txt. Unlike the whole GO hierarchy, these GO terms at different granularities are representative and comprehensive in terms of their relevance to domains (not proteins). Keep in mind that SDFO is defined for each of the three GO sub-ontologies (i.e., BP, MF, and CC) at each of the four SCOP domain levels (i.e., FA, SF, CF, and CL).

Domain2GO MySQL tables

    The four tables below (Domain2GO.sql.gz) store the information described above (i.e., Domain2GO supported by both, and Domain2GO supported only by all):

    GO_info: containing info about GO terms.
        > DESC GO_info;
        +------------+-----------------------------------------------------------------------------+------+-----+---------+-------+
        | Field      | Type                                                                        | Null | Key | Default | Extra |
        +------------+-----------------------------------------------------------------------------+------+-----+---------+-------+
        | go         | int(7) unsigned zerofill                                                    | NO   | PRI | NULL    |       | 
        | namespace  | enum('biological_process','molecular_function','cellular_component')        | NO   | MUL | NULL    |       | 
        | name       | varchar(255)                                                                | NO   |     | NULL    |       | 
        | synonym    | text                                                                        | YES  |     | NULL    |       | 
        | definition | text                                                                        | YES  |     | NULL    |       | 
        | distance   | tinyint(3) unsigned                                                         | NO   |     | NULL    |       | 
        +------------+-----------------------------------------------------------------------------+------+-----+---------+-------+
        
    • The go column is the numeric part of GO id. It is browsable via GO-Hierarchy.
    • The namespace column can be one of three GO sub-ontologies.
    • The name column shows the full name of GO terms.
    • The synonym column lists the synonyms of GO terms.
    • The definition column is the definition of GO terms.
    • The distance column shows the distance of GO terms to the root of the corresponding sub-ontology.

    GO_hie: containing info about GO hierarchy.
        > DESC GO_hie;
        +----------+--------------------------+------+-----+---------+-------+
        | Field    | Type                     | Null | Key | Default | Extra |
        +----------+--------------------------+------+-----+---------+-------+
        | parent   | int(7) unsigned zerofill | NO   | PRI | NULL    |       | 
        | child    | int(7) unsigned zerofill | NO   | PRI | NULL    |       | 
        | distance | tinyint(3) unsigned      | NO   | PRI | NULL    |       | 
        +----------+--------------------------+------+-----+---------+-------+
        
    • The parent column is the numeric part of parental GO id.
    • The child column is the numeric part of child GO id.
    • The distance column shows the distance from the parental GO id to the child GO id: 1 for direct parent-child relationships; other values indicate the existence of a path between them (reachable but indirect). Notably, each edge in the GO DAG can be one of three relationships: 'is_a', 'part_of', and 'regulates'. Here, we only consider the first two (i.e., 'is_a' and 'part_of') and treat them equally.
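    As a hedged illustration of how GO_hie can be used, the Python sketch below retrieves the direct parents (distance = 1) and all ancestors of a term. To stay self-contained it uses an in-memory SQLite table with the same three columns instead of the real MySQL table; the rows, the ids 1 to 3, and the helper names are made up for the example. Because SQLite has no zerofill attribute, the 7-digit GO id is rebuilt in Python.

```python
import sqlite3

def go_id(number):
    """Format the numeric part of a GO id as a full identifier."""
    return f"GO:{number:07d}"

# Stand-in for the MySQL GO_hie table (toy rows, not real GO terms).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE GO_hie (parent INTEGER, child INTEGER, distance INTEGER, "
    "PRIMARY KEY (parent, child, distance))"
)
conn.executemany(
    "INSERT INTO GO_hie VALUES (?, ?, ?)",
    [(1, 2, 1), (2, 3, 1), (1, 3, 2)],
)

def direct_parents(conn, child):
    """Parents linked to `child` by a single is_a/part_of edge."""
    cur = conn.execute(
        "SELECT parent FROM GO_hie WHERE child = ? AND distance = 1 "
        "ORDER BY parent",
        (child,),
    )
    return [go_id(row[0]) for row in cur]

def all_ancestors(conn, child):
    """Every term from which `child` is reachable, direct or indirect."""
    cur = conn.execute(
        "SELECT DISTINCT parent FROM GO_hie WHERE child = ? ORDER BY parent",
        (child,),
    )
    return [go_id(row[0]) for row in cur]
```

    The same two SELECT statements should carry over to the MySQL table, with the placeholder style adapted to the MySQL driver in use.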

    GO_mapping: containing info about Domain2GO annotations.
        > DESC GO_mapping;
        +--------------------+---------------------------+------+-----+---------+-------+
        | Field              | Type                      | Null | Key | Default | Extra |
        +--------------------+---------------------------+------+-----+---------+-------+
        | id                 | mediumint(8) unsigned     | NO   | PRI | NULL    |       |
        | level              | enum('cl','cf','sf','fa') | NO   |     | NULL    |       |
        | go                 | int(7) unsigned zerofill  | NO   | PRI | NULL    |       |
        | single_score       | double                    | NO   |     | 1       |       |
        | all_score          | double                    | NO   |     | 1       |       |
        | inherited_from     | text                      | YES  |     | NULL    |       |
        | inherited_from_all | text                      | YES  |     | NULL    |       |
        +--------------------+---------------------------+------+-----+---------+-------+
        
    • The id column is the SCOP unique identifier, sunid. It is browsable via SCOP-Hierarchy.
    • The level column is the level in the SCOP hierarchy. It can be one of 'cl' for class, 'cf' for fold, 'sf' for superfamily, 'fa' for family.
    • The go column is the numeric part of GO id.
    • The single_score column is the FDR supported by singleton domain UniProts.
    • The all_score column is the FDR supported by all UniProts (including multidomain UniProts).
    • The inherited_from column marks the status of Domain2GO predicted annotations supported by both. 1) If it is marked 'directed' (i.e., the columns 'single_score'<0.001 and 'all_score'<0.001), Domain2GO is significantly supported both by singleton domain UniProts and by all UniProts (including multidomain UniProts). 2) If it is a comma-separated list of GO ids (numeric part; the columns 'single_score' and 'all_score' are not both less than 0.001), Domain2GO is inherited from descendant GO terms (significantly associated) by applying the true-path rule in the DAG. 3) It is empty otherwise. Hence, the list of Domain2GO supported by both can be obtained by selecting the rows whose 'inherited_from' column is not empty.
    • The inherited_from_all column marks the status of Domain2GO predicted annotations supported by all. 1) If it is marked 'directed' (i.e., 'all_score'<0.001), Domain2GO is significantly supported only by all UniProts (including multidomain UniProts). 2) If it is a comma-separated list of GO ids (numeric part; the column 'all_score' is not less than 0.001), Domain2GO is inherited from descendant GO terms (significantly associated) by applying the true-path rule in the DAG. 3) It is empty otherwise. Hence, the list of Domain2GO supported only by all can be obtained by selecting the rows whose 'inherited_from_all' column is not empty.
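    The selection rule for the two status columns can be sketched as follows. This is an illustrative example only: it runs against an in-memory SQLite copy of the GO_mapping layout filled with invented rows, the helper name `supported` is hypothetical, and "empty" is assumed to mean either NULL or an empty string, since the text does not say which the dump uses.

```python
import sqlite3

# Stand-in for the MySQL GO_mapping table (toy rows, not real annotations).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE GO_mapping (id INTEGER, level TEXT, go INTEGER, "
    "single_score REAL DEFAULT 1, all_score REAL DEFAULT 1, "
    "inherited_from TEXT, inherited_from_all TEXT, PRIMARY KEY (id, go))"
)
conn.executemany(
    "INSERT INTO GO_mapping VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (100, "sf", 3, 1e-5, 1e-6, "directed", "directed"),
        (100, "sf", 2, 0.2, 0.1, "0000003", "0000003"),
        (200, "sf", 3, 0.5, 1e-4, None, "directed"),
        (200, "sf", 1, 1.0, 1.0, "", ""),
    ],
)

def supported(conn, column, level):
    """Rows whose status column is not empty, as (sunid, GO id, is_direct)."""
    # The column name is interpolated, so restrict it to the two known names.
    if column not in ("inherited_from", "inherited_from_all"):
        raise ValueError(f"unexpected column: {column}")
    sql = (
        f"SELECT id, go, {column} = 'directed' FROM GO_mapping "
        f"WHERE level = ? AND {column} IS NOT NULL AND {column} <> '' "
        "ORDER BY id, go"
    )
    return [
        (sunid, f"GO:{go:07d}", bool(direct))
        for sunid, go, direct in conn.execute(sql, (level,))
    ]
```

    Calling `supported(conn, "inherited_from", "sf")` corresponds to Domain2GO supported by both at the superfamily level, and passing "inherited_from_all" corresponds to Domain2GO supported only by all.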

    GO_ic: containing info about SDFO.
        > DESC GO_ic;
        +---------+---------------------------------------------------------------+------+-----+---------+-------+
        | Field   | Type                                                          | Null | Key | Default | Extra |
        +---------+---------------------------------------------------------------+------+-----+---------+-------+
        | level   | enum('cl','cf','sf','fa','cl_all','cf_all','sf_all','fa_all') | NO   | PRI | NULL    |       |
        | go      | int(7) unsigned zerofill                                      | NO   | PRI | NULL    |       |
        | ic      | double                                                        | YES  |     | NULL    |       |
        | include | tinyint(2)                                                    | YES  | MUL | NULL    |       |
        +---------+---------------------------------------------------------------+------+-----+---------+-------+
        
    • The level column is the level in the SCOP hierarchy. Since this table stores both results (SDFO from Domain2GO supported by both, and SDFO from Domain2GO supported only by all), the level for the former SDFO can be one of 'cl' for class, 'cf' for fold, 'sf' for superfamily, 'fa' for family, and the level for the latter SDFO can be one of 'cl_all' for class, 'cf_all' for fold, 'sf_all' for superfamily, 'fa_all' for family.
    • The go column is the numeric part of GO id.
    • The ic column shows the information content of the GO term.
    • The include column indicates whether or not the GO term belongs to the SDFO. If the column is set to '0', the term is not a member of SDFO. Otherwise, '1' means least informative (i.e., the most general), '2' moderately informative, '3' informative, and '4' highly informative (i.e., the most specific).
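    For readers parsing GO_ic rows in their own scripts, the encoding of the level and include columns can be unpacked with a small helper such as the one below. The function name and the returned labels are illustrative choices, not part of the distributed data.

```python
SCOP_LEVELS = {"cl": "class", "cf": "fold", "sf": "superfamily", "fa": "family"}
SDFO_LEVELS = {
    1: "least informative",
    2: "moderately informative",
    3: "informative",
    4: "highly informative",
}

def decode_go_ic(level, include):
    """Translate GO_ic `level` and `include` values into readable labels.

    Returns (SCOP level, Domain2GO source, SDFO level); the SDFO level is
    None when `include` is 0 or NULL, i.e. the term is not part of SDFO.
    """
    base, _, suffix = level.partition("_")
    if base not in SCOP_LEVELS or suffix not in ("", "all"):
        raise ValueError(f"unexpected level: {level}")
    source = "supported only by all" if suffix == "all" else "supported by both"
    sdfo = SDFO_LEVELS.get(include) if include else None
    return SCOP_LEVELS[base], source, sdfo
```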


References

    Andreeva, A., Howorth, D., Chandonia, J.M., Brenner, S.E., Hubbard, T.J., Chothia, C. and Murzin, A.G. (2008) Data growth and its impact on the SCOP database: new developments, Nucleic Acids Res, 36, D419-425.
    Barrell, D., Dimmer, E., Huntley, R.P., Binns, D., O'Donovan, C. and Apweiler, R. (2009) The GOA database in 2009--an integrated Gene Ontology Annotation resource, Nucleic Acids Res, 37, D396-403.
    Benjamini, Y. and Hochberg, Y. (1995) Controlling the False Discovery Rate - a Practical and Powerful Approach to Multiple Testing, Journal of the Royal Statistical Society Series B-Methodological, 57, 289-300.
    de Lima Morais, D.A., Fang, H., Rackham, O.J., Wilson, D., Pethica, R., Chothia, C. and Gough, J. (2011) SUPERFAMILY 1.75 including a domain-centric gene ontology method, Nucleic Acids Res, 39, D427-434.
    Gough, J. (2006) Genomic scale sub-family assignment of protein domains, Nucleic Acids Res, 34, 3625-3633.