SUPERFAMILY 1.75 HMM library and genome assignments server


Explanation of Domain Architectures Similarity Function

Links to the similar domain architectures can be found on any of the gene pages or the domain combination pages.

The similarity function compares two domain architectures: the domain architecture of interest and a "query" domain architecture. Every other domain architecture in the SUPERFAMILY database takes the role of the query in turn, and the 10 domain architectures that are most similar to the architecture of interest are selected for display.
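The selection step described above can be sketched as follows. This is only an illustration: the function name `top_similar` and the `similarity` callable are hypothetical stand-ins, not part of the SUPERFAMILY code base, and the real scoring function is the one built up in the sections below.

```python
import heapq
from typing import Callable, Iterable, List, Tuple


def top_similar(
    target: str,
    candidates: Iterable[str],
    similarity: Callable[[str, str], float],
    k: int = 10,
) -> List[Tuple[str, float]]:
    """Score every candidate architecture against `target` and keep the k best.

    `similarity` is a hypothetical callable returning a larger value for
    more similar architectures.
    """
    # Skip the architecture of interest itself, score everything else.
    scored = (
        (query, similarity(target, query))
        for query in candidates
        if query != target
    )
    # Keep the k highest-scoring (architecture, score) pairs, best first.
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])
```

With `k=10` this mirrors the "10 most similar" display rule; any scoring function with the same call signature could be plugged in.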

There are three main components in the similarity function.

1: Domain architecture copy numbers

The first component of the similarity function focuses on the set of genomes which share both the domain architecture of interest and the "query" domain architecture. The number of copies of each domain architecture in each genome is taken into account. A domain architecture with only one copy in a genome and another domain architecture with 100 copies in the same genome may both contribute to the metabolism of that organism, but most probably to different degrees. Hence, for every shared genome i containing both domain architectures, this component measures copy number similarity as the ratio of the smaller copy number to the larger copy number,

    min(Ai, Bi) / max(Ai, Bi)

where Ai and Bi are the copy numbers of domain architectures A and B in genome i.
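As a minimal sketch of this component, the ratio can be computed per shared genome as below. The function name and the dictionary layout (genome identifier mapped to copy number) are assumptions made for illustration only.

```python
from typing import Dict


def copy_number_ratios(
    copies_a: Dict[str, int], copies_b: Dict[str, int]
) -> Dict[str, float]:
    """Return min(Ai, Bi) / max(Ai, Bi) for every genome containing both A and B."""
    ratios = {}
    # Only genomes present in both phylogenetic profiles are compared.
    for genome in copies_a.keys() & copies_b.keys():
        a, b = copies_a[genome], copies_b[genome]
        if a > 0 and b > 0:  # a zero copy number means the genome is not shared
            ratios[genome] = min(a, b) / max(a, b)
    return ratios


# Hypothetical copy numbers, keyed by made-up genome codes.
example = copy_number_ratios(
    {"hs": 1, "ec": 4, "sc": 2},
    {"hs": 100, "ec": 4, "mm": 7},
)
```

Here `example` holds a ratio of 0.01 for genome "hs" (1 copy versus 100) and 1.0 for genome "ec" (equal copy numbers); "sc" and "mm" are ignored because only one of the two architectures occurs there.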

2: Information content of genomes

This component of the similarity function accounts for the relative significance of genomes. A genome which contains a small number of domain architecture copies with similar phylogenetic distribution is less likely to be present in the phylogenetic profile of a randomly chosen domain architecture than a genome with a large number of domain architecture copies. The former genome is not only more genetically specific, but also more statistically significant. Therefore, information content is used as a weighting measure for the relative significance of genomes. The information content of genome g is calculated as,

where Sg is the sum of copy numbers of all domain architectures in genome g, and S is the sum of protein copy numbers across all genomes in the SUPERFAMILY database. Sg is calculated as

    Sg = C1 + C2 + ... + Cn

where Ci is the copy number of the ith domain architecture in genome g and n is the number of unique domain architectures in genome g.

S is calculated as

    S = S1 + S2 + ... + SN

where N is the number of genomes in the SUPERFAMILY database.
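The sums Sg and S follow directly from the definitions above and can be sketched as below. The information content formula itself is not reproduced on this page, so the sketch assumes the common form -log2(Sg / S); both that form and the base of the logarithm are assumptions, as are the function names and the nested-dictionary layout.

```python
import math
from typing import Dict


def genome_totals(copy_numbers: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Sg: sum of the copy numbers of all domain architectures in genome g."""
    return {g: sum(archs.values()) for g, archs in copy_numbers.items()}


def information_content(copy_numbers: Dict[str, Dict[str, int]]) -> Dict[str, float]:
    """Assumed form: -log2(Sg / S), where S is the sum of Sg over all genomes."""
    totals = genome_totals(copy_numbers)
    grand_total = sum(totals.values())  # S
    return {
        g: -math.log2(s_g / grand_total)
        for g, s_g in totals.items()
        if s_g > 0  # empty genomes carry no defined information content
    }


# Hypothetical toy database: genome -> {domain architecture -> copy number}.
toy = {"small": {"a": 1, "b": 1}, "large": {"a": 3, "b": 3}}
weights = information_content(toy)
```

Under this assumption the genome with fewer domain architecture copies ("small", Sg = 2 out of S = 8) receives the larger weight, which matches the reasoning that such genomes are the more significant ones.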

3: Phylogenetic diversity of genomes

This final component of the similarity function factors in the phylogenetic diversity of genomes. If two domain architectures share genomes which are phylogenetically distant from each other, this suggests a stronger relationship between the domain architectures than if the common genomes are all phylogenetically close to each other. To measure phylogenetic diversity, we use the phylogenetic distance between the genomes to assign a weight to each genome according to its relative distance to the other shared genomes.

The relative distance data for all genomes was calculated using a neighbour-joining algorithm based on domain presence/absence data from the SUPERFAMILY database. The distance between any two genomes is defined as the average of their distances to the nearest common ancestor. The distance weighting factor Dg for genome g is determined by the average distance to all other shared genomes, i.e.

where NS is the number of shared genomes and dig is the distance between genome i and genome g from the neighbour-joining tree.
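A minimal sketch of the distance weighting follows. The exact normalisation is not reproduced on this page; the sketch assumes the sum of distances is divided by NS - 1 (the number of other shared genomes), which is an assumption and may differ from the original formula. The function name and the `frozenset`-keyed distance table are likewise illustrative choices.

```python
from typing import Dict, FrozenSet, Iterable


def distance_weights(
    shared_genomes: Iterable[str],
    distance: Dict[FrozenSet[str], float],
) -> Dict[str, float]:
    """Dg: average tree distance from genome g to every other shared genome.

    `distance` maps an unordered pair of genomes to their distance dig.
    Dividing by NS - 1 is an assumption about the normalisation.
    """
    genomes = sorted(set(shared_genomes))
    if len(genomes) < 2:
        # With a single shared genome there is no other genome to measure against.
        return {g: 0.0 for g in genomes}
    weights = {}
    for g in genomes:
        total = sum(distance[frozenset((i, g))] for i in genomes if i != g)
        weights[g] = total / (len(genomes) - 1)
    return weights


# Hypothetical pairwise distances between three shared genomes.
toy_distances = {
    frozenset(("a", "b")): 1.0,
    frozenset(("a", "c")): 3.0,
    frozenset(("b", "c")): 2.0,
}
d_weights = distance_weights(["a", "b", "c"], toy_distances)
```

In this toy case genome "c" is the furthest from the others on average and therefore gets the largest weight.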

Similarity scoring function

The complete scoring function to assess similarity between domain architecture A and domain architecture B is

where NL is the larger of the two genome counts in the phylogenetic profiles of domain architectures A and B, and NS is the number of shared genomes.

Description adapted from:
Genomic Distribution Pattern Matcher for Protein Structural Domain Architectures
Yiduo Zhou, Masters Thesis, Dept. of Computer Science, University of Bristol, 2008.