|
|
Generating the SUPERFAMILY Domain Architectures 'comb' Table
(Christine Vogel, Mary Pacold, and Julian Gough)
This page describes the creation of the 'comb' table in the SQL database. The comb table has an entry for
each protein in SUPERFAMILY. The protein is represented as a sequence of SCOP domains rather than amino
acids. This condenses the information into a very convenient format for analysis. We call the organisation of
domains the 'architecture' of the protein, and the composition and interaction of the domains of the
architecture determine the function of the protein.
This table is at the heart of the engine for various tools in SUPERFAMILY which compare protein architectures
between whole genomes, and also for individual sequences.
Summary
The individual assignments of SCOP domains to genome sequences were converted
into a string of superfamily domains for each protein; the order is defined by the sequence of the
centre-points of each domain. The assignments of known structural domains do not always cover the entire
sequence. So as well as the ordering of the assigned domains, it was necessary to determine the presence of
unknown (not assigned, possibly of unknown structure) domains before, after and in betweeen the assigned
domains. It is impossible to determine the number of unknown domains in any given unassigned region, so each
region was simply labelled as a gap containing one or more unknown domains. Both the presence of gaps and the
process of forcing domains into a simple string (when some domains may actually be interlaced or inserted
within each other), means that the string will not always be a complete description of the architecture.
However, more importantly, the procedure is unambiguous in that each domain architecture will be assigned a
unique string.
The figure below outlines the procedure.

Overview of the algorithm to convert domain assignments into strings.
The algorithm
For the SUPERFAMILY assignments to be converted into a string of domains,
first the order must be determined. There are at least two factors which complicate this: the assignments are
very good at detecting the correct number of domains in the correct place yet (in the case of very distant
homologues) not always good at accurately predicting domain boundaries. Furthermore some domains are not
linear in sequence and are interlaced or inserted within each other. See the following example, there are
only two domains present:

How is the order decided? It is more important for relating proteins to each
other using strings of domains describing domain architectures that the assignment algorithm is unambiguous
rather than that each individual protein is fully described. I.e. each string should be unique. In order to
do that, the centre-point of each domain is picked, and the order is calculated from the sequence of the
domain centre-points. Calculating the positions of the domain centre-points along the sequence is non-trivial
since the assignments are sometimes incomplete and only recognise a short conserved region which may lie at
the far end of the sequence. (As is the case for the third domain in the overview figure.) The assigned
region is aligned to the model which it has hit, and the position of the centre-point of the model is chosen.
Should the centre of the model be outside the bounds of the assignment then its position is projected along
to a theoretical position on the sequence outside the area of assignment. In that case we assume that the
gene prediction is incomplete and the whole domain exists in the protein.
There is a further wrinkle in all this, and one with a less simple solution:
Proteins consist of a string of domains joined by usually short (less than 30 amino acids) linker sequences.
The structural domain assignments do not cover all of the sequence leaving unassigned regions longer than 30
amino acids (see here for the exact coverage). In those unassigned
regions, we expect domains of unknown structure (no SUPERFAMILY assignment) and length in the unassigned
regions. We need to decide whether or not an unassigned domain lies between any two domains, or a domain and
the N or C terminus of the sequence. For example, in the following sequence it is estimated that there is a
missing (unassigned) domain in the region between the domains labelled 1 and 5 and another between 4 and 2,
yet there is none between 5 and 4 or between the N terminal and the domain labelled 1:

Before we describe the algorithm to decide whether an unassigned region
contains an unassigned domain or is merely a short linker region, we look at some statistics on the domain
superfamilies in SCOP. There are roughly 1000 SCOP superfamilies (see here for up to date numbers), and each superfamily has
one or more members of solved 3-D structure. The mean and standard deviation of the lengths of every domain
were calculated. Some members of a superfamily are very similar to each other but others are very different.
To correct for this unevenness in the database the mean and standard deviation were calculated using a
weighting scheme much in the same way as pseudo-counts are used in sequence profile methods. The standard
deviation describes the variability of the domain length. The higher the variability in length of a domain
is, the higher is its standard deviation of the length, and the more likely it is for this domain to be
longer than its average length and cover an unassigned region rather than to have a gap in the unassigned
region. The smallest domain is 27 amino acids long; most of the domains are about 110 amino acids long
(geometric mean), and two thirds of all domains are between 50 and 350 aa long.

Figure A. Length Distribution of SCOP domains. The arithmetic mean is 175 aa +- 119 aa.
The geometric mean is 110 aa
The maximum length of an unassigned region found was 11 000 aa, but
unassigned regions longer than 250 aa occur very infrequently.

Figure B. Length Distribution of Unassigned Regions. Plotted with 0 and 1 standard
deviation added. Most unassigned regions are smaller than 30 aa. Unassigned regions larger than 250 aa only
have frequencies close to 1. The maximum length of an unassigned region is 11 000 aa.
We assessed the number of gaps found depending on the number of standard
deviations added to the domain length and the threshold length taken for a gap to be present. The number of
standard deviations added to the domain length is a measurement for the probability that the two domains are
adjacent to each other with no gap in between. The total number of gaps that occur is less sensitive to the
number of standard deviations (from 0 to 1) added to each domain length than to the threshold gap length.
Hence the threshold length of a gap is the crucial parameter.

Figure C. Number of Gaps dependent on Standard Deviation and Threshold Gap DomainLength.
Plotted is the number of gaps in the complete data set with 0 and 1 standard deviation added to each domain
length and a range of threshold gap lengths from 0 to 600.
The centres which were calculated earlier for the ordering of assigned
domains come in useful now. The centre is used as an anchor for the domain, and from there it is possible to
use the mean and standard deviation of the length of the domain to give a distribution to the possible domain
boundary (see overview figure). Looking at the distributions between any two domains it is possible to
estimate whether or not there may be an unknown domain between the two assigned ones. If the gap between any
two assigned domains (or an assigned domain and an N or C terminus) is longer than a certain threshold length
when both boundaries are calculated at one standard deviation from the normal, then it is deemed to be
unlikely that the domain boundaries meet, and therefore there must be another domain in between them.
The two parameters that influence this estimate are the number of added standard deviations and the threshold
length of assignment of a gap. The relationship of those two parameters and their influence on the number of
gaps has been described above.
In order to derive a reasonable threshold length for a gap, we estimated the
maximum expected number of unknown domains in the whole data set: Our data set comprises assignments of 262
730 domains to 167 843 sequences. All domains taken together cover 70% of the length of all sequences; that
leaves 30 % unassigned regions with gaps. Under the assumption that the length distribution of unknown
domains is the same as the length distribution of known domains, we expect a maximum of 113 878 domains in
the 30 % unassigned regions. Since every unassigned region only counts once as on gap irrespective whether
several unknown domains are in this region, the actual number of gaps will be even lower. Knowing that, we
can calculate the standard deviation and threshold gap length values needed to obtain fewer than 113 878
gaps. We see, that at, for example, 1 standard deviation, a threshold gap length of at least 30 aa is needed
to obtain a maximum of 113 878 gaps. For less added standard deviation or a smaller threshold gap length, the
number of gaps would be over-estimated.

Figure D. Standard deviation and Threshold gap length at a fixed number of gaps (110 000
to 113878).
Testing
| Parameter Set |
A |
B |
C |
| Standard Deviation Factor |
0 |
1 |
2 |
| Gap Threshold Length |
60 |
30 |
12 |
| Number of Gaps in 71 genomes |
112184 |
112814 |
110421 |
Table 1. Parameter sets: A: stv 0 gapl 60, B: stdv 1 gapl 30, C: stv 2 gapl 12.
Three different possible parameter sets (Table 1: set A: stv 0 gapl 60, set
B: stdv 1 gapl 30, set C: stv 2 gapl 12) were tested:
I) For all domains, in four genomes (dm, ec, mg, ce) two sample sequences were taken and the number of gaps
compared. In cases, where in one set a gap had been predicted but not in the other, the domain assignment was
inspected by eye and judged whether a gap could fit in. In each set A and B, six gaps have been correctly
predicted but were absent in the other set. For set B, further four gaps were ambiguous, one gap was
over-predicted. No clear conclusion from that about whether A or B is better.
II) Thirteen well-characterised Ig proteins from D.melanogaster and C.elegans have been inspected; preferably
sequences with many different domains have been chosen. Out of these sequences, nine had exactly the same
prediction of gaps in all three parameter sets, and were considered correct. In two sequences, set A had one
over-predicted gap. In three sequences, set C had one over-predicted gap. Overall, all three sets performed
well in predicting no gap between consecutive (Ig) domains and in predicting gaps in longer unassigned
regions. Most of the differences occurred in very short unassigned regions, mainly at the end of the
sequences (only one neighbouring domain -> only one stdv'shortening' the region).
III) Two well-characterised Cadherin proteins from C. elegans (W02B9.1 and F15B9.7) have been inspected. In
all three parameter sets, the prediction of gaps was identical. All three sets performed well in predicting
no gap between Cadherin domains and gaps in unassigned regions.
All three parameter sets did reasonably well. Gaps tend to be 'over-predicted' at the end of sequences.
However, this is less crucial if, later on, one queries for a particular combination of domains. Generally,
hardly any false-positive predictions of gaps were found. Set C performed worse than A and B.
From these tests, one reasonable set of parameters to estimate the presence of gaps is one standard
deviation added to the domain lengths and a threshold gap length of 30 aa.
|