SUPERFAMILY 1.75 HMM library and genome assignments server

Domain architectures with similar genomic distributions to architecture: _gap_,52540,_gap_,158702,_gap_

The selected domain combination is the occurrence of the following superfamily domains in N- to C-Terminal order:

P-loop containing nucleoside triphosphate hydrolases
Sec63 N-terminal domain-like



Top 10 similar domain architectures

The method used to find similar domain architectures is explained below.

The following are the suggested domain architectures ordered by the level of genomic distribution similarity:

0: _gap_,52283,51735,_gap_

Similarity Level:

52283   Formate/glycerate dehydrogenase catalytic domain-like
51735   NAD(P)-binding Rossmann-fold domains

1: _gap_,57535,_gap_

Similarity Level:

57535   Complement control module/SCR domain

2: _gap_,90002,_gap_

Similarity Level:

90002   Hypothetical protein YjiA, C-terminal domain

3: 64484,64484,_gap_

Similarity Level:

64484   beta and beta-prime subunits of DNA dependent RNA-polymerase
64484   beta and beta-prime subunits of DNA dependent RNA-polymerase

4: _gap_,53448,53448,_gap_

Similarity Level:

53448   Nucleotide-diphospho-sugar transferases
53448   Nucleotide-diphospho-sugar transferases

5: 109779,_gap_

Similarity Level:

109779   Domain from hypothetical 2610208m17rik protein

6: 51294

Similarity Level:

51294   Hedgehog/intein (Hint) domain

7: 54913,_gap_

Similarity Level:

54913   GlnB-like

8: 143555,_gap_

Similarity Level:

143555   FwdE-like

9: 158372

Similarity Level:

158372   AF1782-like



Explanation of domain architectures similarity function

The similarity function compares two domain architectures: the domain architecture of interest and a "query" domain architecture. The domain architecture of interest is in turn compared with all the other domain architectures in the SUPERFAMILY database. The 10 domain architectures which are most similar to the architecture of interest are selected for display.
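The selection step described above can be sketched as a ranking over all candidate architectures. This is a minimal illustration only: `top_similar` and the `score` callable are hypothetical names, and the scoring function is passed in as a parameter rather than implemented here.

```python
import heapq

def top_similar(target, candidates, score, k=10):
    """Return the k candidates with the highest score against `target`.

    `score(target, candidate)` is a caller-supplied similarity function
    (hypothetical here); higher values mean more similar.
    """
    scored = ((score(target, c), c) for c in candidates if c != target)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])

if __name__ == "__main__":
    # Toy stand-in: integer ids, similarity = closeness of the ids.
    toy_score = lambda a, b: -abs(a - b)
    for s, arch in top_similar(5, range(10), toy_score, k=3):
        print(arch, s)
```

Using `heapq.nlargest` avoids sorting the whole candidate list when only the top few entries are needed.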

There are three main components in the similarity function.

1: Domain architecture copy numbers

The first component of the similarity function focuses on the set of genomes which share both the domain architecture of interest and the "query" domain architecture. The number of copies of each domain architecture in each genome is taken into account. A domain architecture with only one copy in a genome, and another domain architecture with 100 copies in the same genome may both contribute to the metabolism of that organism, but most probably to different degrees. Hence, for every shared genome containing both domain architectures this component compares the copy number similarity by taking the ratio of the smaller copy number over the larger copy number,

$$\frac{\min(A_i, B_i)}{\max(A_i, B_i)}$$

where $A_i$ and $B_i$ are the copy numbers of domain architectures A and B in genome i.
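As a small illustration of this component, the sketch below computes the smaller-over-larger copy number ratio for every genome that two architectures share. The function names and the dictionary layout (genome name mapped to copy number) are assumptions made for the example, not part of the SUPERFAMILY code.

```python
def copy_number_ratio(a, b):
    """Ratio of the smaller copy number to the larger one for one shared genome."""
    if a <= 0 or b <= 0:
        raise ValueError("both architectures must occur in a shared genome")
    return min(a, b) / max(a, b)

def shared_genome_ratios(profile_a, profile_b):
    """Map each genome present in both profiles to its copy number ratio.

    Each profile is a dict {genome_name: copy_number} (assumed layout).
    """
    shared = profile_a.keys() & profile_b.keys()
    return {g: copy_number_ratio(profile_a[g], profile_b[g]) for g in shared}

if __name__ == "__main__":
    profile_a = {"genome_1": 1, "genome_2": 4}
    profile_b = {"genome_1": 100, "genome_3": 2}
    print(shared_genome_ratios(profile_a, profile_b))
```

The ratio is 1 when both architectures have the same copy number in a genome and approaches 0 as the copy numbers diverge, as in the 1 versus 100 copies case mentioned above.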

2: Information content of genomes

This component of the similarity function accounts for the relative significance of genomes. A genome which contains a small number of domain architecture copies with similar phylogenetic distribution is less likely to be present in the phylogenetic profile of a randomly chosen domain architecture than a genome with a large number of domain architecture copies. The former genome is not only more genetically specific, but also more statistically significant. Therefore, information content is used as a weighting measure for the relative significance of genomes. The information content of genome g is calculated as,

$$I_g = -\log\frac{S_g}{S}$$

where $S_g$ is the sum of copy numbers of all domain architectures in genome g, and $S$ is the sum of protein copy numbers across all genomes in the SUPERFAMILY database. $S_g$ is calculated as,

$$S_g = \sum_{i=1}^{n} C_i$$

where $C_i$ is the copy number of the $i$th domain architecture in genome g and $n$ is the number of unique domain architectures in genome g.

S is calculated as,

$$S = \sum_{g=1}^{N} S_g$$

where $N$ is the number of genomes in the SUPERFAMILY database.
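The genome weighting can be sketched as follows. The sums per genome and over the whole database follow the definitions above. The sketch assumes the usual definition of information content as the negative logarithm of a probability, with the genome's share of all copies taken as that probability; the choice of base 2 and the nested-dictionary layout are further assumptions for this example.

```python
import math

def genome_totals(copy_numbers):
    """Sum of copy numbers of all domain architectures, per genome.

    `copy_numbers` is {genome: {architecture: copy_number}} (assumed layout).
    """
    return {g: sum(archs.values()) for g, archs in copy_numbers.items()}

def information_content(copy_numbers):
    """Assumed form: -log2(genome total / database total) for each genome."""
    totals = genome_totals(copy_numbers)
    grand_total = sum(totals.values())  # sum over all genomes
    # Genomes with no assigned architectures are skipped (log of 0 is undefined).
    return {g: -math.log2(t / grand_total) for g, t in totals.items() if t > 0}

if __name__ == "__main__":
    db = {
        "small_genome": {"arch_x": 1, "arch_y": 1},
        "large_genome": {"arch_x": 4, "arch_y": 2},
    }
    print(information_content(db))
```

Under this assumption, a genome holding few architecture copies receives a higher weight than one holding many, which matches the reasoning given for this component.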

3: Phylogenetic diversity of genomes

This final component of the similarity function factors in the phylogenetic diversity of genomes. If two domain architectures share genomes that are phylogenetically distant from each other, this suggests a stronger relationship between the domain architectures than if the shared genomes are all phylogenetically close to each other. To measure phylogenetic diversity, we use the phylogenetic distance between genomes to assign each genome a weight according to its relative distance from the other shared genomes.

The relative distance data for all genomes was calculated using a neighbour-joining algorithm based on domain presence/absence data from the SUPERFAMILY database. The distance between any two genomes is defined as the average of their distances to the nearest common ancestor. The distance weighting factor Dg for genome g is determined by the average distance to all other shared genomes, i.e.

where NS is the number of shared genomes and dig is the distance between genome i and genome g from the neighbour-joining tree.
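A minimal sketch of the distance weighting is given below. It assumes that pairwise tree distances are already available in a lookup keyed by unordered genome pairs, and that the average is taken over the other shared genomes only (dividing by the number of shared genomes minus one). Both the data layout and that choice of denominator are assumptions for illustration.

```python
def distance_weights(shared, dist):
    """Mean distance from each shared genome to the other shared genomes.

    `shared` is a list of genome names; `dist` maps frozenset({g, h}) to the
    tree distance between genomes g and h (assumed layout).
    """
    weights = {}
    for g in shared:
        others = [h for h in shared if h != g]
        if not others:
            # Assumption: a lone shared genome carries no diversity signal.
            weights[g] = 0.0
            continue
        total = sum(dist[frozenset((g, h))] for h in others)
        weights[g] = total / len(others)
    return weights

if __name__ == "__main__":
    dist = {
        frozenset(("A", "B")): 0.1,  # A and B are close relatives
        frozenset(("A", "C")): 0.9,
        frozenset(("B", "C")): 0.8,
    }
    print(distance_weights(["A", "B", "C"], dist))
```

In the toy data, genome C is far from both A and B and therefore receives the largest weight.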

Similarity scoring function

The complete scoring function to assess similarity between domain architecture A and domain architecture B, is

where $N_L$ is the larger of the numbers of genomes in the phylogenetic profiles of domain architectures A and B, and $N_S$ is the number of shared genomes.
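For orientation only, the sketch below shows one hypothetical way the three components could be combined: the per-genome product of copy number ratio, information content and distance weight, summed over the shared genomes and divided by the larger profile size. This particular combination is an assumption made for the example and is not taken from the cited thesis; the actual scoring function may differ.

```python
def similarity_score(ratios, info, dist_w, n_a, n_b):
    """Hypothetical combination of the three components (an assumption).

    ratios, info, dist_w: dicts keyed by shared genome, holding the copy
    number ratio, information content and distance weight respectively.
    n_a, n_b: number of genomes in the profiles of architectures A and B.
    """
    n_l = max(n_a, n_b)  # larger profile size
    if n_l == 0:
        return 0.0
    total = sum(ratios[g] * info[g] * dist_w[g] for g in ratios)
    return total / n_l

if __name__ == "__main__":
    ratios = {"g1": 0.5, "g2": 1.0}
    info = {"g1": 2.0, "g2": 1.0}
    dist_w = {"g1": 0.5, "g2": 0.25}
    print(similarity_score(ratios, info, dist_w, n_a=2, n_b=4))
```

Dividing by the larger profile size penalises pairs in which one architecture occurs in many genomes that the other does not share.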

Description adapted from:
Genomic Distribution Pattern Matcher for Protein Structural Domain Architectures
Yiduo Zhou, Masters Thesis, Dept. of Computer Science, University of Bristol, 2008.
