Explanation of the domain architecture similarity function

The similarity function compares two domain architectures:
the domain architecture of interest and a "query" domain architecture.
The domain architecture of interest is compared in turn with every other domain
architecture in the SUPERFAMILY database, and the 10 domain architectures which are
most similar to the architecture of interest are selected for display.
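
The selection step above can be sketched in Python. This is a minimal illustration, not the SUPERFAMILY implementation: the function name top_matches, the representation of a phylogenetic profile as a dict from genome id to copy number, and the similarity callable are all assumptions made for the example.

```python
import heapq
from typing import Callable, Dict, List, Tuple

# Assumed representation: genome id -> copy number of the architecture.
Profile = Dict[str, int]


def top_matches(
    target: Profile,
    candidates: Dict[str, Profile],
    similarity: Callable[[Profile, Profile], float],
    k: int = 10,
) -> List[Tuple[str, float]]:
    """Score every candidate against the target and keep the k best."""
    scored = (
        (name, similarity(target, profile))
        for name, profile in candidates.items()
    )
    # nlargest returns the k highest-scoring pairs, best first.
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])
```

Any scoring function with the same two-profile signature can be passed as similarity, so the ranking logic stays separate from the scoring details described in the sections that follow.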

There are three main components in the similarity function.

1: Domain architecture copy numbers

The first component of the similarity function focuses on the
set of genomes which contain both the domain architecture of interest and the "query"
domain architecture. The number of copies of each domain architecture in each
genome is taken into account. A domain architecture with only one copy in a
genome and another domain architecture with 100 copies in the same genome may
both contribute to the metabolism of that organism, but most probably to
different degrees. Hence, for every genome containing both domain
architectures, this component measures copy number similarity as the ratio
of the smaller copy number to the larger copy number,

    \frac{\min(A_{i}, B_{i})}{\max(A_{i}, B_{i})}

where A_{i} and B_{i} are the copy numbers of domain architectures
A and B in genome i.
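
As a minimal sketch of this component, the ratio can be computed per shared genome as below. The function names and the dict-based profiles (genome id to copy number) are assumptions made for illustration.

```python
from typing import Dict


def copy_number_ratio(a: int, b: int) -> float:
    """Ratio of the smaller copy number to the larger, in (0, 1]."""
    if a <= 0 or b <= 0:
        raise ValueError("copy numbers must be positive in a shared genome")
    return min(a, b) / max(a, b)


def shared_ratios(profile_a: Dict[str, int], profile_b: Dict[str, int]) -> Dict[str, float]:
    """Copy number ratio for every genome present in both profiles."""
    shared = profile_a.keys() & profile_b.keys()
    return {g: copy_number_ratio(profile_a[g], profile_b[g]) for g in shared}
```

With the example from the text, copy numbers of 1 and 100 in the same genome give a ratio of 0.01, while equal copy numbers give 1.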

2: Information content of genomes

This component of the similarity function accounts for the
relative significance of genomes. A genome which contains a small number of
domain architecture copies with similar phylogenetic distribution is less likely to
be present in the phylogenetic profile of a randomly chosen domain architecture
than a genome with a large number of domain architecture copies. The
former genome is not only more genetically specific, but also more statistically
significant. Therefore, information content is used as a weighting measure for
the relative significance of genomes. The information content of genome g is
calculated as,

    I_{g} = -\log\left(\frac{S_{g}}{S}\right)

where S_{g} is the sum of copy numbers of all
domain architectures in genome g, and S is the sum of these copy numbers across
all genomes in the SUPERFAMILY database. S_{g} is calculated as,

    S_{g} = \sum_{i=1}^{n} C_{i}

where C_{i} is the copy number of the
i^{th} domain architecture in genome g and n is
the number of unique domain architectures in genome g.

S is calculated as,

    S = \sum_{g=1}^{N} S_{g}

where N is the number of genomes in the SUPERFAMILY database.
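
The quantities S_{g} and S can be sketched as follows. The information content line is an assumption: it uses the standard definition -log2(S_{g} / S), and the base of the logarithm is chosen here only for illustration. The function name and the nested-dict input are also hypothetical.

```python
import math
from typing import Dict


def information_content(genomes: Dict[str, Dict[str, int]]) -> Dict[str, float]:
    """genomes maps genome id -> {architecture id -> copy number}.

    Assumes every genome holds at least one architecture copy.
    """
    # S_g: total copy number of all architectures in genome g.
    totals = {g: sum(archs.values()) for g, archs in genomes.items()}
    # S: total copy number across all genomes.
    grand_total = sum(totals.values())
    # Assumed form: I_g = -log2(S_g / S); base 2 is an illustrative choice.
    return {g: -math.log2(s_g / grand_total) for g, s_g in totals.items()}
```

Under this assumption a genome with few architecture copies receives a larger weight than one with many, which matches the reasoning given above.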

3: Phylogenetic diversity of genomes

This final component of the similarity function factors in
the phylogenetic diversity of genomes. If two domain architectures share genomes
which are phylogenetically distant from each other, this suggests a stronger
relationship between the domain architectures than if the shared genomes are all
phylogenetically close to each other. To measure phylogenetic diversity we use the
phylogenetic distance between genomes to assign a weight to each genome
according to its relative distance to the other shared genomes.

The relative distance data for all genomes was calculated
using a neighbour-joining algorithm based on domain presence/absence data from the
SUPERFAMILY database. The
distance between any two genomes is defined as the average of their distances to
the nearest common ancestor. The distance weighting factor D_{g} for
genome g is determined by the average distance to all other shared genomes,
i.e.

    D_{g} = \frac{1}{N_{S} - 1} \sum_{i=1, i \neq g}^{N_{S}} d_{ig}

where N_{S} is the number of shared genomes and
d_{ig} is the distance between genome i and genome g from the
neighbour-joining tree.
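
A minimal sketch of the distance weighting follows, assuming the average is taken over the other shared genomes only. The function name, the lookup table keyed by unordered genome pairs, and the rejection of a single shared genome are assumptions for illustration; the distances themselves would come from the neighbour-joining tree.

```python
from typing import Dict, FrozenSet, Iterable


def distance_weights(
    shared: Iterable[str],
    dist: Dict[FrozenSet[str], float],
) -> Dict[str, float]:
    """Average tree distance from each shared genome to the other shared genomes."""
    genomes = list(shared)
    if len(genomes) < 2:
        # Assumption: with fewer than two shared genomes there is no
        # "other" genome to average over, so the weight is undefined here.
        raise ValueError("need at least two shared genomes")
    weights = {}
    for g in genomes:
        total = sum(dist[frozenset((i, g))] for i in genomes if i != g)
        weights[g] = total / (len(genomes) - 1)
    return weights
```

Keying the table by frozenset means each pairwise distance is stored once and can be looked up in either order.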

Similarity scoring function

The complete scoring function to assess similarity
between domain architecture A and domain architecture B, is

where N_{L} is the larger of the numbers of genomes in the
phylogenetic profiles of domain architectures A and B, and
N_{S} ≤ N.
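
How the three components combine is sketched below under explicit assumptions: the copy number ratio, information content and distance weight are multiplied for each shared genome, the products are summed, and the sum is divided by N_{L}. This combination is an illustrative guess consistent with the definitions above, not a confirmed form of the scoring function. The function name and the precomputed weight dicts are hypothetical.

```python
from typing import Dict


def similarity_score(
    profile_a: Dict[str, int],
    profile_b: Dict[str, int],
    info: Dict[str, float],
    dist_weight: Dict[str, float],
) -> float:
    """Assumed form: (1 / N_L) * sum over shared genomes of ratio * I_g * D_g."""
    shared = profile_a.keys() & profile_b.keys()
    # N_L: size of the larger of the two phylogenetic profiles.
    n_l = max(len(profile_a), len(profile_b))
    if n_l == 0:
        return 0.0
    total = 0.0
    for g in shared:
        a, b = profile_a[g], profile_b[g]
        ratio = min(a, b) / max(a, b)  # copy number component
        total += ratio * info[g] * dist_weight[g]  # assumed product of components
    return total / n_l
```

Dividing by N_{L} rather than by the number of shared genomes means that two architectures sharing only a small fraction of their genomes score lower than two with largely overlapping profiles.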

Description adapted from: Genomic Distribution Pattern Matcher for Protein Structural Domain Architectures
Yiduo Zhou, Masters Thesis, Dept. of Computer Science, University of Bristol, 2008.