SUPERFAMILY 1.75 HMM library and genome assignments server

Domain architectures with similar genomic distributions to architecture: _gap_,52540,_gap_,158702,_gap_

The selected domain combination is the occurrence of the following superfamily domains in N- to C-Terminal order:

P-loop containing nucleoside triphosphate hydrolases
Sec63 N-terminal domain-like



Top 10 similar domain architectures

The method used to find similar domain architectures is explained below.

The following are the suggested domain architectures ordered by the level of genomic distribution similarity:

1: _gap_,52283,51735,_gap_

Similarity Level:

52283   Formate/glycerate dehydrogenase catalytic domain-like
51735   NAD(P)-binding Rossmann-fold domains

2: 48403,_gap_,48403,_gap_

Similarity Level:

48403   Ankyrin repeat
48403   Ankyrin repeat

3: 48173,_gap_

Similarity Level:

48173   Cryptochrome/photolyase FAD-binding domain

4: _gap_,53067,53067,100920,_gap_

Similarity Level:

53067   Actin-like ATPase domain
53067   Actin-like ATPase domain
100920   Heat shock protein 70kD (HSP70), peptide-binding domain

5: _gap_,141255

Similarity Level:

141255   YccV-like

6: _gap_,53335,_gap_,53335

Similarity Level:

53335   S-adenosyl-L-methionine-dependent methyltransferases
53335   S-adenosyl-L-methionine-dependent methyltransferases

7: _gap_,54364,55200,_gap_

Similarity Level:

54364   Translation initiation factor IF3, N-terminal domain
55200   Translation initiation factor IF3, C-terminal domain

8: 50985,50985,_gap_

Similarity Level:

50985   RCC1/BLIP-II
50985   RCC1/BLIP-II

9: 50978,48452

Similarity Level:

50978   WD40 repeat-like
48452   TPR-like

10: _gap_,46689,51905,54373,_gap_

Similarity Level:

46689   Homeodomain-like
51905   FAD/NAD(P)-binding domain
54373   FAD-linked reductases, C-terminal domain



Explanation of domain architectures similarity function

The similarity function compares two domain architectures: the domain architecture of interest and a "query" domain architecture. The domain architecture of interest is in turn compared with all the other domain architectures in the SUPERFAMILY database. The 10 domain architectures which are most similar to the architecture of interest are selected for display.
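The selection step described above can be sketched in Python. This is a minimal illustration rather than the SUPERFAMILY implementation: the function name `top_similar`, the `score` callable and the representation of an architecture as any comparable value are assumptions made for the example.

```python
import heapq


def top_similar(target, architectures, score, k=10):
    """Return the k architectures most similar to `target`, best first.

    `score(target, other)` is a hypothetical similarity function that
    returns a number; higher means more similar.
    """
    # Exclude the architecture of interest itself from the candidates.
    candidates = (arch for arch in architectures if arch != target)
    # heapq.nlargest avoids sorting the whole database when k is small.
    return heapq.nlargest(k, candidates, key=lambda arch: score(target, arch))
```

Any scoring function with that two-argument shape can be passed in, so the ranking logic stays independent of how similarity is actually computed.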

There are three main components in the similarity function.

1: Domain architecture copy numbers

The first component of the similarity function focuses on the set of genomes which share both the domain architecture of interest and the "query" domain architecture. The number of copies of each domain architecture in each genome is taken into account. A domain architecture with only one copy in a genome and another domain architecture with 100 copies in the same genome may both contribute to the metabolism of that organism, but most probably to different degrees. Hence, for every shared genome containing both domain architectures, this component measures copy number similarity as the ratio of the smaller copy number to the larger copy number,

\[ \frac{\min(A_i, B_i)}{\max(A_i, B_i)} \]

where A_i and B_i are the copy numbers of domain architectures A and B in genome i.
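The copy number comparison can be sketched in Python as follows. This is a minimal illustration, not the SUPERFAMILY code: the function name `copy_number_ratios` and the representation of a phylogenetic profile as a dictionary from genome identifier to copy number are assumptions made for the example.

```python
def copy_number_ratios(profile_a, profile_b):
    """Return {genome: smaller copy number / larger copy number}.

    Each profile maps a genome identifier to the copy number of one
    domain architecture in that genome (hypothetical representation).
    Only genomes present in both profiles are compared.
    """
    shared = profile_a.keys() & profile_b.keys()
    ratios = {}
    for genome in shared:
        a, b = profile_a[genome], profile_b[genome]
        if a <= 0 or b <= 0:
            continue  # treat a non-positive count as "architecture absent"
        ratios[genome] = min(a, b) / max(a, b)
    return ratios
```

The ratio is 1 when both architectures have the same copy number in a genome and approaches 0 as the copy numbers diverge, as in the 1 versus 100 copies example above.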

2: Information content of genomes

This component of the similarity function accounts for the relative significance of genomes. A genome which contains a small number of domain architecture copies with similar phylogenetic distribution is less likely to be present in the phylogenetic profile of a randomly chosen domain architecture than a genome with a large number of domain architecture copies. The former genome is not only more genetically specific, but also more statistically significant. Therefore, information content is used as a weighting measure for the relative significance of genomes. The information content of genome g is calculated as,

where S_g is the sum of copy numbers of all domain architectures in genome g, and S is the sum of protein copy numbers across all genomes in the SUPERFAMILY database. S_g is calculated as,

\[ S_g = \sum_{i=1}^{n} C_i \]

where C_i is the copy number of the ith domain architecture in genome g and n is the number of unique domain architectures in genome g.

S is calculated as,

\[ S = \sum_{g=1}^{N} S_g \]

where N is the number of genomes in the SUPERFAMILY database.
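The genome weighting can be sketched in Python. The per-genome and database-wide totals follow the definitions above. The information content itself is assumed here to take the usual form of a negative logarithm of the genome's share of all copies, with base 2 chosen arbitrarily; both the form and the base are assumptions, as are the function names and the nested-dictionary layout.

```python
import math


def genome_totals(copy_numbers):
    """Sum of architecture copy numbers per genome.

    `copy_numbers` maps genome -> {architecture: copy number}
    (hypothetical layout).
    """
    return {g: sum(archs.values()) for g, archs in copy_numbers.items()}


def information_content(copy_numbers):
    """Assumed form: -log2(genome total / total over all genomes)."""
    totals = genome_totals(copy_numbers)
    grand_total = sum(totals.values())
    return {
        g: -math.log2(total / grand_total)
        for g, total in totals.items()
        if total > 0  # a genome with no copies has no defined weight here
    }
```

Under this assumption a genome with few domain architecture copies receives a larger weight than one with many, which matches the reasoning given in the text.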

3: Phylogenetic diversity of genomes

This final component of the similarity function factors in the phylogenetic diversity of genomes. If two domain architectures share genomes which are phylogenetically distant from each other, this suggests a stronger relationship between the domain architectures than if the shared genomes are all phylogenetically close to each other. To measure phylogenetic diversity, we use the phylogenetic distance between genomes to assign each genome a weight according to its relative distance to the other shared genomes.

The relative distance data for all genomes was calculated using a neighbour-joining algorithm based on domain presence/absence data from the SUPERFAMILY database. The distance between any two genomes is defined as the average of their distances to the nearest common ancestor. The distance weighting factor D_g for genome g is determined by the average distance to all other shared genomes, i.e.

where N_S is the number of shared genomes and d_ig is the distance between genome i and genome g from the neighbour-joining tree.
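The distance weighting can be sketched in Python. The text defines the weight as an average distance to the other shared genomes; dividing by the number of other shared genomes (one fewer than the number of shared genomes) is an assumption about the normalisation, and the function name and the `distance` callable are hypothetical.

```python
def distance_weights(shared_genomes, distance):
    """Mean tree distance from each shared genome to the other shared genomes.

    `distance(i, g)` is a hypothetical lookup into the neighbour-joining
    tree distances. Dividing by (number of shared genomes - 1) is an
    assumed normalisation.
    """
    genomes = list(shared_genomes)
    n_shared = len(genomes)
    if n_shared < 2:
        # No other shared genome to measure against.
        return {g: 0.0 for g in genomes}
    weights = {}
    for g in genomes:
        total = sum(distance(i, g) for i in genomes if i != g)
        weights[g] = total / (n_shared - 1)
    return weights
```

A genome that sits far from the rest of the shared set receives a larger weight than one in a tight cluster of close relatives.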

Similarity scoring function

The complete scoring function for assessing the similarity between domain architecture A and domain architecture B is

where N_L is the larger number of genomes in the phylogenetic profiles of domain architecture A or B, and NsN.
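One way the three components could be combined is sketched below in Python. The combination rule is an assumption, not the formula from the thesis: for each shared genome the copy number ratio is multiplied by that genome's information content and distance weight, the products are summed, and the sum is divided by the larger of the two profile sizes. The function name and the dictionary inputs are hypothetical.

```python
def similarity_score(profile_a, profile_b, info, dist_weight):
    """Assumed combination of the three components for two architectures.

    profile_a, profile_b: genome -> copy number (positive integers).
    info: genome -> information content weight.
    dist_weight: genome -> distance weight among the shared genomes.
    """
    shared = profile_a.keys() & profile_b.keys()
    n_larger = max(len(profile_a), len(profile_b))  # larger profile size
    if n_larger == 0:
        return 0.0
    total = 0.0
    for g in shared:
        a, b = profile_a[g], profile_b[g]
        ratio = min(a, b) / max(a, b)  # copy number component
        total += ratio * info[g] * dist_weight[g]
    return total / n_larger
```

Normalising by the larger profile size penalises pairs in which one architecture occurs in many genomes that the other does not share.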

Description adapted from:
Genomic Distribution Pattern Matcher for Protein Structural Domain Architectures
Yiduo Zhou, Masters Thesis, Dept. of Computer Science, University of Bristol, 2008.
