Background on dcGO
- Proteins are modular in design, and protein domains contribute broadly to the understanding of proteomic data: structurally, evolutionarily and functionally. Thus, instead of associating ontological terms only with full-length proteins, it sometimes makes more sense to associate terms with protein domains.
- Moreover, the operational unit responsible for a function is often more than one domain, e.g. domains acting together or at an interface between domains. Therefore, it is also useful to associate ontological terms with pairs of domains, triplets and longer supra-domains.
- To explore this possibility, we have developed a general method to detect functional, phenotypic and other relevant signals in gene/protein-level annotations that can be explained at the level of protein domains and their combinations.
- The algorithm behind the dcGO was initially published as an improvement to the SUPERFAMILY database in 2010.
- The dcGO ranked amongst the top functional predictors in the 2011 CAFA competition when applied to Gene Ontology.
- We have extended the dcGO application to major phenotype/anatomy ontologies and other hierarchically structured Biomedical Ontologies (collectively denoted as 'BO'), and for each, we generate the domain-centric mappings and the corresponding slim version.
- In addition to domains defined by SCOP, GO annotations to Pfam families are also being added to extend the dcGO concept.
- The dcGO database was officially released in October 2012 and, together with the website, is due for publication in the NAR 2013 database issue.
- The domain classifications used in dcGO are taken from the Structural Classification Of Proteins (SCOP) at both the superfamily and family levels.
- SCOP groups domains at the superfamily level if there is structure, sequence and function evidence for a common evolutionary ancestor.
- Based on SCOP, the SUPERFAMILY database uses hidden Markov models to detect and classify SCOP domains at the superfamily level; subsequently, each protein sequence may be represented as a string of SCOP domains, called a domain architecture.
- Some superfamilies are sub-classified into families, whose members often share higher sequence similarity and a similar function.
- In multidomain proteins, a given domain tends to co-occur/co-evolve with other domains. We define combinations of two or more successive domains as supra-domains if such combinations are found in more than one distinct domain architecture. The domain architecture is a modular view of a protein sequence; in the SUPERFAMILY database, it is represented as a sequential order of SCOP domains (at the superfamily level) or gaps (estimated to be one or more unknown domains). To avoid this uncertainty, combinations containing gaps are excluded from supra-domains.
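The supra-domain definition above can be illustrated with a minimal sketch. It is not the dcGO or SUPERFAMILY implementation: the function name `find_supra_domains`, the `GAP` marker and the domain labels (`sf_a`, `sf_b`, ...) are hypothetical and chosen only for illustration. The sketch assumes that each domain architecture is given as an ordered sequence of superfamily labels, with gaps marked explicitly.

```python
from collections import defaultdict

GAP = "_gap_"  # hypothetical marker for a region of unknown domain(s)


def find_supra_domains(architectures):
    """Return combinations of two or more successive domains that occur
    in more than one distinct domain architecture, skipping any
    combination that contains a gap."""
    seen_in = defaultdict(set)  # combination -> architectures containing it
    # Deduplicate first: only *distinct* architectures count.
    for arch in set(map(tuple, architectures)):
        n = len(arch)
        for start in range(n):
            for end in range(start + 2, n + 1):
                combo = arch[start:end]
                if GAP in combo:
                    # Every longer window from this start also has the gap.
                    break
                seen_in[combo].add(arch)
    return {combo for combo, archs in seen_in.items() if len(archs) > 1}


# Toy architectures (labels are made up, not real SCOP identifiers).
architectures = [
    ("sf_a", "sf_b", "sf_c"),
    ("sf_a", "sf_b", "sf_d"),
    ("sf_a", GAP, "sf_b", "sf_c"),
    ("sf_a", "sf_b", "sf_c"),  # duplicate of the first, counted once
]
print(sorted(find_supra_domains(architectures)))
# [('sf_a', 'sf_b'), ('sf_b', 'sf_c')]
```

In this toy input, `sf_a, sf_b, sf_c` is not reported as a supra-domain because it appears in only one distinct architecture, and `sf_a, _gap_, sf_b` is never considered because it spans a gap.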
Gene Ontology (GO)
- The GO is designed to annotate full-length proteins in a species-independent manner so as to maximize reuse. It covers three complementary biological concepts: Biological Process (BP), Molecular Function (MF) and Cellular Component (CC).
- The most comprehensive, high-quality protein-level annotations over a wide spectrum of species are maintained by the Gene Ontology Annotation (GOA) project.
Biomedical Ontologies (BO)
- In dcGO, the BO refers to all other Biomedical Ontologies that are not GO.
- Like GO, they are hierarchical, going from very general terms at the top to more specific terms at the bottom.
- They mainly consist of phenotype ontologies that have been developed to classify and organize phenotypic information for model organisms and human.
- The dcGO database now contains a panel of ontologies from a variety of contexts:
Disease Ontology (DO) is a standardized ontology for human disease that semantically integrates disease and medical vocabularies through extensive cross-mapping of DO terms to MeSH, ICD, NCI, SNOMED and OMIM. Their mappings onto the human genome are also available.
Human Phenotype Ontology (HP) captures phenotypic abnormalities that are described in OMIM, along with the corresponding disease-causing genes. It includes three complementary biological concepts: Mode of Inheritance (MI), ONset and clinical course (ON), and Phenotypic Abnormality (PA).
Mammalian/Mouse Phenotype Ontology (MP) describes phenotypes of the mouse after a specific gene is genetically disrupted. Using it, Mouse Genome Informatics (MGI) provides high-coverage gene-level phenotypes for the mouse.
Worm Phenotype Ontology (WP) classifies and organizes phenotype descriptions for C. elegans and other nematodes. Using it, WormBase provides the primary resource for phenotype annotations for C. elegans.
Yeast Phenotype Ontology (YP) is the major contributor to the Ascomycete phenotype ontology. Using it, Saccharomyces Genome Database (SGD) provides single mutant phenotypes for every gene in the yeast genome.
Fly Phenotype Ontology (FP) refers to FlyBase controlled vocabulary. Specifically, a structured controlled vocabulary is used for the annotation of alleles (for their mutagen etc) in FlyBase.
Fly Anatomy Ontology (FA) is a structured controlled vocabulary of the anatomy of Drosophila melanogaster, used for the description of phenotypes and where a gene is expressed.
Zebrafish Anatomy Ontology (ZA) displays anatomical terms of the zebrafish using standard anatomical nomenclature, together with affected genes.
Xenopus Anatomy Ontology (XA) represents the lineage of tissues and the timing of development for frogs (Xenopus laevis and Xenopus tropicalis). It is used to annotate Xenopus gene expression patterns and mutant and morphant phenotypes.
Arabidopsis Plant Ontology (AP) is a major contributor to Plant Ontology which describes plant ANatomical and morphological structures (PAN) and growth and DEvelopmental stages (PDE). The Arabidopsis Information Resource (TAIR) provides arabidopsis plant ontology annotations for the model higher plant Arabidopsis thaliana.
Enzyme Commission (EC) is a resource focused on enzyme nomenclature, a system for naming enzymes (protein catalysts), with cross-references to UniProt entries. It uses a four-digit EC number: the first three digits define the reaction catalysed, and the fourth is a unique identifier (serial number).
DrugBank ATC code (DB) classifies drugs at five different levels according to the organ or system (1st level, anatomical main group) on which they act and their therapeutic (2nd level, therapeutic subgroup), pharmacological (3rd level, pharmacological subgroup) and chemical properties (4th level, chemical subgroup; 5th level, chemical substance). Only drugs that are in DrugBank and have an Anatomical Therapeutic Chemical (ATC) classification are considered.
The UniProtKB KeyWords (KW) controlled vocabulary provides a summary of the entry content and is used to index UniProtKB/Swiss-Prot entries based on 10 categories (the category "Technical term" being excluded here). Each keyword is attributed manually to UniProtKB/Swiss-Prot entries and automatically to UniProtKB/TrEMBL entries (according to specific annotation rules).
UniProtKB UniPathway (UP) is a fully manually curated resource for the representation and annotation of metabolic pathways, being used as controlled vocabulary for pathway annotation in UniProtKB.
CTD Diseases (CD) is a MEDIC disease vocabulary (adapted from "Diseases" [C] branch of MeSH along with OMIM) that is used by CTD to annotate disease-related genes.
CTD Chemicals (CC) is a chemical vocabulary adapted by CTD from the "Chemicals and Drugs" category and Supplementary Concept Records of MeSH.
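Two of the vocabularies listed above, EC numbers and ATC codes, encode their hierarchy directly in the identifier. The sketch below shows how such an identifier might be expanded into its chain of parent levels. It is illustrative only and not part of dcGO: the function names `ec_levels` and `atc_levels` are hypothetical, and the ATC prefix lengths (1, 3, 4, 5 and 7 characters) are an assumption based on the standard ATC code layout rather than something stated on this page.

```python
def ec_levels(ec_number):
    """Expand a four-digit EC number into its levels, from the most
    general (first digit) to the full number (with serial number)."""
    parts = ec_number.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated fields: {ec_number!r}")
    return [".".join(parts[:i]) for i in range(1, 5)]


# Assumed standard ATC layout: letter, 2 digits, letter, letter, 2 digits.
ATC_PREFIX_LENGTHS = (1, 3, 4, 5, 7)


def atc_levels(atc_code):
    """Expand a 7-character ATC code into its five levels:
    anatomical main group, therapeutic subgroup, pharmacological
    subgroup, chemical subgroup and chemical substance."""
    if len(atc_code) != 7:
        raise ValueError(f"expected a 7-character ATC code: {atc_code!r}")
    return [atc_code[:n] for n in ATC_PREFIX_LENGTHS]


print(ec_levels("1.1.1.1"))
# ['1', '1.1', '1.1.1', '1.1.1.1']
print(atc_levels("A10BA02"))
# ['A', 'A10', 'A10B', 'A10BA', 'A10BA02']
```

Expanding identifiers this way lets an annotation made at the most specific level be read at every more general level as well, which is how a hierarchical vocabulary is normally interpreted.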
- BACKGROUND: A brief introduction, including the motivation behind dcGO, the history of dcGO development, the two main subjects that dcGO deals with (i.e., protein domains and ontologies), and the sitemap of the dcGO website.
- MINING: Bioinformatics facilities include: exploring functions over the species tree of life; cross-linking similar phenotypes; and predicting functions, phenotypes and diseases for over 80 million sequences including more than 2,000 genomes, UniProt and hundreds of meta-genomes. A full-text query via 'faceted search' allows the user to easily mine the dcGO resource.
- Faceted Search: A quick entrance to start mining the dcGO resource with keywords of interest.
- Phenotype Similarity Network (PSnet): Cross-linking phenotypes and other ontologies based on shared domain-centric annotations. It can be accessed via Faceted Search or BO Hierarchy.
- Species Tree Of Life (sTOL): Adds an evolutionary dimension for both domains and functions to the dcGO resource utility. The sTOL can be used to display the evolutionary history of individual domains (or lists of domains associated to a GO/BO term of interest via the Faceted Search), and to infer enriched GO terms of extant and ancestral genomes.
- dcGO Predictor: A sequence submission utility to predict function, phenotype and disease. Pre-computed results are available for over 80 million sequences, which can be accessed either via a Single Query from Faceted Search, or via a Batch Query of up to 1000 sequences. The prediction results for all sequences in a query are available for download, and are also summarized to give an overview of the content.
- dcGO Enrichment: Identifies the functions, phenotypes and diseases that are enriched/over-represented within a submitted list of protein domains.
- dcGO Pevo: Explores the architecture plasticity potential of dcGO terms during eukaryotic evolution.
- BROWSE: SCOP Hierarchy, GO Hierarchy, BO Hierarchy and PFAM Hierarchy, wherein mappings among domains, functional and phenotypic terms can be browsed hierarchically.
- ALGORITHM: Domain2GO, Supra-domain2GO, Domain2BO and Supra-domain2BO, wherein detailed explanations, illustrated with figures, describe the underlying pipeline.
- DOWNLOAD: Annotations based on SCOP domains and supra-domains (including Domain2GO, Supra-domain2GO, Domain2BO and Supra-domain2BO) and annotations for PFAM domains and supra-domains (PFAM2GO). For each, both flat files and MySQL tables are available together with detailed documentation.
- CITATIONS: A list of scientific publications related to the dcGO resource.