SUPERFAMILY 1.75 HMM library and genome assignments server

Domain architectures with similar genomic distributions to architecture: 52540

The selected domain combination is the occurrence of the following superfamily domains in N- to C-Terminal order:

P-loop containing nucleoside triphosphate hydrolases

Top 10 similar domain architectures

The method used to find similar domain architectures is explained below.

The following are the suggested domain architectures ordered by the level of genomic distribution similarity:

1: 48726
   48726   Immunoglobulin

2: 51735
   51735   NAD(P)-binding Rossmann-fold domains

3: 81321
   81321   Family A G protein-coupled receptor-like

4: 53335
   53335   S-adenosyl-L-methionine-dependent methyltransferases

5: 48726,_gap_
   48726   Immunoglobulin

6: 53474
   53474   alpha/beta-Hydrolases

7: _gap_,56112
   56112   Protein kinase-like (PK-like)

8: 103473
   103473   MFS general substrate transporter

9: 56112
   56112   Protein kinase-like (PK-like)

10: 50978
   50978   WD40 repeat-like

Explanation of domain architectures similarity function

The similarity function compares two domain architectures: the domain architecture of interest and a "query" domain architecture. The domain architecture of interest is in turn compared with all the other domain architectures in the SUPERFAMILY database. The 10 domain architectures which are most similar to the architecture of interest are selected for display.
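The selection step described above can be sketched as a simple ranking. This is a minimal illustration only: the function `top_similar`, the profile layout (a dictionary mapping genome identifier to copy number), the architecture and genome names, and the placeholder `toy_score` are all assumptions made for the example, not part of the SUPERFAMILY server.

```python
import heapq


def top_similar(interest, candidates, score, k=10):
    """Return the k (name, score) pairs with the highest similarity to `interest`.

    `candidates` maps an architecture name to its phylogenetic profile,
    and `score` is any function taking two profiles and returning a number.
    """
    scored = ((name, score(interest, profile)) for name, profile in candidates.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


def toy_score(profile_a, profile_b):
    """Placeholder score (hypothetical): fraction of genomes that both profiles share."""
    union = profile_a.keys() | profile_b.keys()
    if not union:
        return 0.0
    return len(profile_a.keys() & profile_b.keys()) / len(union)


# Invented profiles: genome identifier -> copy number.
interest = {"g1": 3, "g2": 1}
candidates = {
    "arch_x": {"g1": 2, "g2": 5},
    "arch_y": {"g1": 1},
    "arch_z": {"g3": 4},
}
ranked = top_similar(interest, candidates, toy_score, k=2)
```

Any scoring function with the same two-profile signature could be passed in place of `toy_score`.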

There are three main components in the similarity function.

1: Domain architecture copy numbers

The first component of the similarity function considers the set of genomes that contain both the domain architecture of interest and the "query" domain architecture, taking into account the number of copies of each architecture in each genome. A domain architecture with only one copy in a genome and another with 100 copies in the same genome may both contribute to the metabolism of that organism, but most probably to different degrees. Hence, for every shared genome, this component measures copy number similarity as the ratio of the smaller copy number to the larger,

$$\frac{\min(A_i, B_i)}{\max(A_i, B_i)}$$

where A_i and B_i are the copy numbers of domain architectures A and B in genome i.
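As a minimal sketch of this component, assuming each phylogenetic profile is held as a dictionary mapping a genome identifier to a copy number (a hypothetical representation, not the SUPERFAMILY schema), the per-genome ratio could be computed as follows. The function name and the genome labels are invented for illustration.

```python
def copy_number_ratios(profile_a, profile_b):
    """Return {genome: smaller copy number / larger copy number} for shared genomes."""
    shared = profile_a.keys() & profile_b.keys()
    ratios = {}
    for genome in shared:
        a, b = profile_a[genome], profile_b[genome]
        if a <= 0 or b <= 0:
            # Assumption: a non-positive count means the architecture is absent.
            continue
        ratios[genome] = min(a, b) / max(a, b)
    return ratios


# Invented example profiles.
profile_a = {"genome_1": 1, "genome_2": 4, "genome_3": 10}
profile_b = {"genome_1": 100, "genome_2": 4, "genome_4": 7}
ratios = copy_number_ratios(profile_a, profile_b)
```

The ratio always lies in (0, 1], reaching 1 only when both architectures have the same copy number in a genome.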

2: Information content of genomes

This component of the similarity function accounts for the relative significance of genomes. A genome which contains a small number of domain architecture copies with similar phylogenetic distribution is less likely to be present in the phylogenetic profile of a randomly chosen domain architecture than a genome with a large number of domain architecture copies. The former genome is not only more genetically specific, but also more statistically significant. Therefore, information content is used to weight the relative significance of genomes. The information content of genome g is calculated from S_g and S, where S_g is the sum of copy numbers of all domain architectures in genome g, and S is the sum of these copy numbers across all genomes in the SUPERFAMILY database. S_g is calculated as

$$S_g = \sum_{i=1}^{n} C_i$$

where C_i is the copy number of the ith domain architecture in genome g and n is the number of unique domain architectures in genome g.

S is calculated as

$$S = \sum_{g=1}^{N} S_g$$

where N is the number of genomes in the SUPERFAMILY database.
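The sums S_g and S follow directly from the definitions above. The exact expression for the information content is not given in this text, so the sketch below assumes the standard self-information form, -log(S_g / S), with a natural logarithm. That form, the function names, and the example data are all assumptions for illustration.

```python
import math


def genome_totals(copy_numbers):
    """S_g for each genome: the sum of copy numbers C_i over its architectures."""
    return {genome: sum(archs.values()) for genome, archs in copy_numbers.items()}


def information_content(copy_numbers):
    """Per-genome weight, ASSUMING the self-information form -log(S_g / S)."""
    totals = genome_totals(copy_numbers)
    grand_total = sum(totals.values())  # S: sum of S_g over all genomes
    return {
        genome: -math.log(s_g / grand_total)
        for genome, s_g in totals.items()
        if s_g > 0  # genomes with no assigned architectures are skipped
    }


# Invented data: genome -> {architecture identifier: copy number}.
copy_numbers = {
    "small_genome": {"52540": 2, "48726": 2},
    "large_genome": {"52540": 8, "48726": 4},
}
info = information_content(copy_numbers)
```

Under this assumption a genome holding a small share of all copies receives a larger weight than one holding a large share, which matches the reasoning given above.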

3: Phylogenetic diversity of genomes

This final component of the similarity function factors in the phylogenetic diversity of genomes. If two domain architectures share genomes that are phylogenetically distant from each other, this suggests a stronger relationship between the architectures than if the shared genomes are all phylogenetically close. To measure phylogenetic diversity, each genome is assigned a weight according to its phylogenetic distance from the other shared genomes.

The relative distance data for all genomes was calculated using a neighbour-joining algorithm based on domain presence/absence data from the SUPERFAMILY database. The distance between any two genomes is defined as the average of their distances to the nearest common ancestor. The distance weighting factor D_g for genome g is the average distance to all other shared genomes, i.e.

$$D_g = \frac{1}{N_S - 1} \sum_{i \neq g} d_{ig}$$

where N_S is the number of shared genomes and d_ig is the distance between genome i and genome g from the neighbour-joining tree.
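The weighting can be sketched as a mean over pairwise distances. The code below assumes the distances are already available in a lookup keyed by unordered genome pairs. The function name, the genome labels, and the distance values are invented for illustration, and the handling of a single shared genome is an assumption.

```python
def distance_weights(shared_genomes, distance):
    """Return {g: mean distance from g to every other shared genome}."""
    weights = {}
    for g in shared_genomes:
        others = [h for h in shared_genomes if h != g]
        if not others:
            # Assumption: a lone shared genome carries no diversity information.
            weights[g] = 0.0
            continue
        total = sum(distance[frozenset((g, h))] for h in others)
        weights[g] = total / len(others)
    return weights


# Invented pairwise distances; frozenset keys make the lookup order-independent.
distance = {
    frozenset(("human", "mouse")): 0.1,
    frozenset(("human", "e_coli")): 0.9,
    frozenset(("mouse", "e_coli")): 0.8,
}
weights = distance_weights(["human", "mouse", "e_coli"], distance)
```

In this toy example the phylogenetically isolated genome receives the largest weight, which reflects the idea that distant shared genomes are stronger evidence of a relationship.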

Similarity scoring function

The complete scoring function assesses the similarity between domain architecture A and domain architecture B by combining the three components above over their shared genomes. In it, N_L is the larger of the numbers of genomes in the phylogenetic profiles of domain architectures A and B, and N_S ≤ N_L.
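The expression for the complete scoring function is not reproduced in this text. The sketch below shows one plausible way the pieces might fit together, and every part of the combination is an assumption: the three components are multiplied per shared genome, summed, and divided by the larger profile size. The function name, argument layout, and example values are hypothetical.

```python
def similarity_score(profile_a, profile_b, info, dist_weight):
    """Hypothetical combination of the three components, normalised by N_L.

    profile_a, profile_b: genome -> positive copy number
    info:                 genome -> information content weight
    dist_weight:          genome -> phylogenetic distance weight
    """
    shared = [g for g in profile_a if g in profile_b]
    n_l = max(len(profile_a), len(profile_b))  # larger profile size
    if n_l == 0:
        return 0.0
    total = 0.0
    for g in shared:
        a, b = profile_a[g], profile_b[g]
        ratio = min(a, b) / max(a, b)  # copy number component
        # ASSUMPTION: the components combine by multiplication.
        total += ratio * info[g] * dist_weight[g]
    return total / n_l


# Invented inputs.
profile_a = {"g1": 2, "g2": 4}
profile_b = {"g1": 4, "g2": 4, "g3": 1}
info = {"g1": 1.0, "g2": 2.0, "g3": 1.0}
dist_weight = {"g1": 0.5, "g2": 0.5, "g3": 0.5}
score = similarity_score(profile_a, profile_b, info, dist_weight)
```

Dividing by the larger profile size means that an architecture found in many genomes does not score highly merely by overlapping a small profile, but whether the original function normalises in exactly this way is not stated here.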

Description adapted from:
Genomic Distribution Pattern Matcher for Protein Structural Domain Architectures
Yiduo Zhou, Masters Thesis, Dept. of Computer Science, University of Bristol, 2008.
