SUPERFAMILY HMM library and genome assignments server

SUPERFAMILY Description

SUPERFAMILY is a database of structural and functional protein annotations for all completely sequenced organisms.

A domain is the smallest unit of evolution; a large protein can be split into smaller domains. Domains can occur by themselves or in combination with other domains. A superfamily groups together domains of different families which have a common evolutionary ancestor based on structural, functional and evolutionary data.
The SUPERFAMILY web site and database provide protein domain assignments, at the SCOP 'superfamily' and 'family' levels, for the predicted protein sequences in over 900 organisms (plus sequence collections such as UniProt). Please contact us if you think we have missed any organisms.
SUPERFAMILY domain assignments are generated using an expert-curated set of profile hidden Markov models. All models and structural assignments are available for browsing and download. Sophisticated tools are provided for the analysis of superfamily (and family) domain assignments.

SUPERFAMILY is a member of the InterPro consortium of protein annotation databases, and has been integrated into the Ensembl eukaryotic genome project and The Arabidopsis Information Resource. To date, the SUPERFAMILY publications have been cited over 400 times. SUPERFAMILY has been used in structural, functional, evolutionary and phylogenetic research projects.

Purpose

The purpose of this server is to provide structural (and hence implied functional) assignments to protein sequences, primarily at the SCOP superfamily level. A superfamily contains all proteins for which there is structural evidence of a common evolutionary ancestor. This service offers sophisticated and expertly chosen remote homology detection. It does not offer an improvement in speed, nor can it assign superfamilies for which no structure is known.

There is a facility to compute assignments for your own DNA or protein sequences, and there is access to genome assignments and to multiple sequence alignments of SCOP superfamilies. If you have an interest in running large numbers of sequences, then please don't hesitate to contact us via superfamily@mrc-lmb.cam.ac.uk.

The web site includes services such as domain architectures and alignment details for all protein assignments, searchable domain combinations, domain occurrence network visualization, detection of over- or under-represented superfamilies for a given genome by comparison with other genomes, assignment of manually submitted sequences and keyword searches.

Sequence Search Description

The sequence search method uses a library (covering all proteins of known structure) consisting of 1539 SCOP superfamilies from classes a to g. Each superfamily is represented by a group of hidden Markov models. Each query sequence is scored against every model with an E-value, and the significant hits are returned. A sequence may well hit a superfamily more than once, because there are several overlapping models for each superfamily; it is the hit to the superfamily that is meaningful. Each model is created from a seed sequence which is aligned to many superfamily homologues. The model is built from the alignment (please see the SAM website for a detailed explanation). A hit to a model is not a hit to the seed but a hit to the superfamily which the model represents. You may view the sequences aligned to each model; together they give a view of the superfamily, although it may be biased towards the seed. You may also see the genome assignments for each superfamily or view alignments of the genome sequences.
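Because several overlapping models represent each superfamily, a list of raw model hits usually has to be collapsed to one result per superfamily. The following is a minimal sketch of that idea, not the server's actual procedure. The ModelHit record, the identifiers and the E-value cutoff of 1e-4 are all assumptions made for illustration, and the sketch ignores the position of each hit in the query, so it would merge two separate domains of the same superfamily in one protein.

```python
from typing import Dict, Iterable, NamedTuple


class ModelHit(NamedTuple):
    """One query-versus-model result (hypothetical record layout)."""
    model_id: str
    superfamily_id: str
    evalue: float


def best_hit_per_superfamily(
    hits: Iterable[ModelHit],
    evalue_cutoff: float = 1e-4,  # assumed threshold, not taken from the server
) -> Dict[str, ModelHit]:
    """Keep the lowest E-value hit for each superfamily.

    Hits whose E-value is above the cutoff are treated as not significant.
    """
    best: Dict[str, ModelHit] = {}
    for hit in hits:
        if hit.evalue > evalue_cutoff:
            continue  # not significant under the assumed cutoff
        current = best.get(hit.superfamily_id)
        if current is None or hit.evalue < current.evalue:
            best[hit.superfamily_id] = hit
    return best


if __name__ == "__main__":
    example_hits = [
        ModelHit("model_1", "sf_A", 1e-10),
        ModelHit("model_2", "sf_A", 1e-20),  # second model of the same superfamily
        ModelHit("model_3", "sf_B", 0.5),    # too weak to report
    ]
    for sf, hit in best_hit_per_superfamily(example_hits).items():
        print(sf, hit.model_id, hit.evalue)
```

The result is keyed by superfamily rather than by model, which reflects the point that the model is only a route to the superfamily assignment.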

The SUPERFAMILY server is based upon release 1.69 of the SCOP structural classification of proteins, the corresponding sequences from ASTRAL, and the SAM hidden Markov model software.

Comparative Genomics Tools

The SUPERFAMILY web site provides a number of comparative genomics tools for the analysis of superfamily and family domains from across the tree of life. These tools include: lists of unusual (over- and under-represented) superfamilies and families, adjacent domain pair lists and graphs, unique domain pairs, domain combinations, domain architecture co-occurrence networks and domain distribution across taxonomic kingdoms for each organism. A detailed description of what these tools can do and how to use them can be found on the comparative genomics page.
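To make the terms "adjacent domain pair" and "unique domain pair" concrete, here is a small sketch under stated assumptions; it is not the site's implementation. It assumes a domain architecture is given as the ordered list of superfamily identifiers along one protein, and it takes "unique" to mean a pair seen in one genome and in none of the genomes it is compared with. The function names and identifiers are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Set, Tuple

Pair = Tuple[str, str]


def adjacent_domain_pairs(architecture: Sequence[str]) -> List[Pair]:
    """Return consecutive (N-terminal, C-terminal) domain pairs of one protein."""
    return list(zip(architecture, architecture[1:]))


def count_adjacent_pairs(architectures: Iterable[Sequence[str]]) -> Counter:
    """Count how often each ordered adjacent pair occurs in a set of proteins."""
    counts: Counter = Counter()
    for architecture in architectures:
        counts.update(adjacent_domain_pairs(architecture))
    return counts


def unique_pairs(
    genome: Iterable[Sequence[str]],
    other_genomes: Iterable[Iterable[Sequence[str]]],
) -> Set[Pair]:
    """Pairs present in `genome` but absent from every genome in `other_genomes`."""
    seen_elsewhere: Set[Pair] = set()
    for architectures in other_genomes:
        seen_elsewhere.update(count_adjacent_pairs(architectures))
    return set(count_adjacent_pairs(genome)) - seen_elsewhere


if __name__ == "__main__":
    genome_a = [["sf_1", "sf_2", "sf_3"], ["sf_1", "sf_2"]]
    genome_b = [["sf_1", "sf_2"], ["sf_3", "sf_1"]]
    print(count_adjacent_pairs(genome_a))
    print(unique_pairs(genome_a, [genome_b]))
```

Pairs are kept ordered here, so (sf_1, sf_2) and (sf_2, sf_1) count as different combinations; whether order should matter is a design choice the text above does not settle.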

Downloads

Downloads are available immediately upon application for a free license. The model library, genome assignments and some software are available. Genome assignments are updated weekly. There is a low-traffic announcement mailing list for notification of updates and changes.

Citation

Groups using results derived from this project for publication are asked to cite:

Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure.

Gough J, Karplus K, Hughey R, Chothia C.

J Mol Biol. 2001 Nov 2;313(4):903-19.

A detailed list of the SUPERFAMILY publications can be found here.