Hidden Markov models and their application to genome analysis in the context of protein structure

Julian Gough (Sidney Sussex College)

 

Click here for PDF file.

Summary

The bulk of the thesis is concerned with the application of hidden Markov models (HMMs) to remote protein homology detection. The thesis both addresses how best to utilise HMMs, and then uses them to analyse all completely sequenced genomes. There is a structural perspective to the work, and a section on three-dimensional protein structure analysis is included.

The Structural Classification of Proteins (SCOP) database forms the basis of the structural perspective. SCOP is a hierarchical database of protein domains classified by their structure, sequence and function. The main aim of the work is to use HMMs to classify all protein sequences produced by the genome projects (including human) into their constituent structural domains. To do this HMMs were built to represent all proteins of known structure; the thesis examines ways in which to best do this and describes the construction of a library of models.

Structural domain assignments to the genome sequences were generated by scoring the model library against the genomes, and then selecting the most probable domain architecture for each sequence. The genome assignments (within the framework of the evolutionary-based SCOP classification) provide the capability to study evolution at a molecular level, as well as at the level of the whole organism.

The model library and genome assignments have been made public via the SUPERFAMILY database. The data and services provided have been used for a growing number of projects several of which have already led to publication.

 







`But I don't want to go among mad people,' Alice remarked.

`Oh, you can't help that,' said the Cat: `we're all mad here. I'm mad. You're mad.'

`How do you know I'm mad?' said Alice.

`You must be,' said the Cat, `or you wouldn't have come here.'


LEWIS CARROL.




 


Contents

1. Introduction

When I embarked on this work three years ago I had no previous training in the field. The last biology and chemistry I had done was seven years previously when I was taking my GCSE level exams at the age of sixteen. I thought a protein was something you get when you eat meat, and that an amino acid would probably dissolve plastic or burn your skin. I had never had a computer before and had never heard of Perl or Linux which are the two tools which I have most heavily relied upon for my daily work. So this thesis really describes what a PhD should be, which is a journey of learning; I have also been lucky that my learning has resulted in some interesting science. To preserve the flow of the learning process the work in this thesis is arranged in roughly chronological order.

Sequence homology

``Well over 80% of the biological information concerning protein sequences reported in the published literature has been inferred by homology'' [1].

Proteins are the fundamental building blocks of life, and they are completely specified by a simple linear DNA sequence code. Since the invention of dideoxy sequencing [2] it has been very easy to determine these sequences, and millions have already been determined. The rate at which new sequences are being determined is rising faster than that dictated by Moore's law, and more importantly the complete sequences of whole genomes, including the human, are now available.

The sequence alone does not tell us anything about the protein which it codes for. Further information can be obtained by experimentation, but it is very expensive and time consuming. The number of proteins for which there is experimental information is very small relative to the number of proteins for which the sequence is now known. However almost all sequences are homologous to some other sequences, and this homology can be detected using computer algorithms which are very cheap and fast. Therefore information about a large number of sequences can be inferred from knowledge about a small number.

Sequence homology provides reliable information about the large number of sequences being generated by the sequencing projects. The value of this information makes homology methods the most important computational tool in genome biology.

The power of structure

Although sequence analysis easily reveals homology between closely related proteins, structural analysis can be used to detect homology between far more distantly related proteins. Although variation in sequences may have allowed two related proteins to mutate almost all common residues, the three-dimensional core of the protein retains the same main-chain conformation. The conservation of the structure of the core is a requirement for the protein to maintain a stable fold, and hence also a requirement for the protein to remain functional.

It was suggested some time ago that there is a limited repertoire of approximately 1000 protein families [3,4], which has been backed up with stronger evidence more recently [5]. By combining sequence and structure information it should be possible in the future to classify all of the sequences in all of the genomes into this limited number of families. Genome projects have so far supplied us with complete sequence sets, and structural genomics will hopefully soon provide us with structural representatives for every family.

Evolution

``Nothing in biology makes sense except in the light of evolution'', Theodosius Dobzhansky.

No biological thesis would be complete without mentioning evolution. Having started out with very little biological knowledge I would like to think that I would have independently arrived at the same conclusion as Theodosius Dobzhansky. Although few general conclusions are drawn about evolution in this thesis, the goal of the work from the beginning is to provide necessary information to allow new studies of evolution to be carried out. As a consequence the direction of the thesis is guided by `domains' as minimum evolutionary units, and `superfamilies' as groups of domains sharing a common evolutionary ancestor [6].

The logical progression from this work is to use the results presented here to learn more about the nature of evolution. This is touched upon, but it is really where this thesis ends and future work takes off from [7,8,9].

A journey of discovery

The first chapter describes the structural analysis of a single protein family. It was important to do this to fully understand the role of structural information in determining the evolutionary relationships of protein domains.

The second chapter is the bulk of the work and investigates the use of hidden Markov model (HMM) technology[10,11,12] for remote protein homology detection. Several key ideas are expressed in this chapter and it sets out the understanding acquired for carrying out the work in the rest of the thesis. It was important to learn about the behaviour of HMMs before they were used to build a model library. Chapter three describes the building of the library of models representing all proteins of known structure [13].

Chapter four takes a sidestep and describes a comparison of different HMM software packages [14,15]. A deeper understanding of HMMs was gained by analysing the performance of the two most commonly used packages.

In chapter five the model library built in chapter three is finally applied; it used to carry out structural domain assignments at the superfamily level to all sequences in the complete genomes. Finally chapter six describes the public server for the model library and genomes assignments.

A historical overview of sequence comparison methods

Pair-wise sequence comparison methods

A simplistic yet systematic approach to sequence homology detection was first developed by Fitch in 1966[16]. This method used a substitution matrix based on the number of nucleotide mutations required to translate between amino acids. This matrix was used to calculate similarity scores for all possible ungapped alignments of a pair of sequences, and hence the significance of the homology. The first systematic sequence comparison procedure which allowed for gaps and insertions in the sequences was published by Needleman and Wunsch in 1970[17]. This work was a major breakthrough at the time, offering the capability to automatically detect homology between much more divergent sequences than was previously possible. Through the use of a dynamic programming algorithm the method was also very efficient, and could be used to compare large numbers of sequences. It was however still very slow compared to the methods subsequently developed. In 1978 major improvements were made to the substitution matrix by Dayhoff and Schwartz[18].

The next major advance was the result of work by Smith and Waterman in 1980[19] which introduced local scoring and included affine gap penalties. An extension to the dynamic programming algorithm developed by Needleman and Wunsch allowed the detection of local similarities, i.e. between subsequences of the pair of whole sequences rather than across the full length of both. This allows for a match between individual domains (subsequences) belonging to larger multi-domain sequences.

The dynamic programming algorithm used by Smith and Waterman guarantees an optimal local alignment, and hence the maximum score, for a given pair of sequences and a substitution matrix. It is however possible to increase the efficiency of the algorithm by adding heuristics which consider a window of residues along the sequence. The optimal alignment is no longer guaranteed but something very close is usually achieved. In 1988 Pearson and Lipman introduced a heuristic method called FASTA[20], and in 1992 another by Altschul and Lipman et al. called BLAST[21]; these offered an increase in efficiency with some loss of sensitivity. Further improvements were made available by Henikoff and Henikoff in 1990, introducing the BLOSSUM substitution matrices[22].

The BLAST and FASTA programs are still in common use, although some continuing work has been done to improve the BLAST algorithm, and the loss in sensitivity due to the heuristics is now far less.

Profile sequence comparison methods

The problem with pair-wise sequence comparison methods is that each position is treated with equal importance and they use the same substitution matrix for every position in the sequence. In reality some positions are far more important than others. For example sites which are in the buried hydrophobic core of a protein are less variable than sites in loop regions on the edge of the protein, and so a difference between a pair of proteins in core residues should be more heavily penalised than at a site on the surface. Furthermore some positions are conserved for different reasons than others, so certain types of substitution should be more heavily penalised than the same type of substitution at a different site.

By considering the 3-dimensional structure of a protein, some of the properties of the residues at each site can be characterised. Information can also be gained by creating a multiple sequence alignment of related sequences, and observing the frequency of the occurance of different residues at each position. These two sources of information can be used together or separately to characterise the features of the sequences at different positions. A position specific scoring matrix (or model) representing these characteristics can be constructed. This is called a profile, and can be used to match and align sequences of far more distantly related proteins than simple pair-wise sequence homology methods.

An early attempt at using a profile to identify protein sequence homology was made by Taylor[23] in 1986. He demonstrates that the use of a profile-based procedure using what he calls ``template fingerprints'' can be used to identify conserved features in immunoglobulin sequences, and attempts to define a general algorithm for the recognition of sequence patterns in globular proteins. The approach used large multiple sequence alignments of variable and constant domains to generate templates representing the conserved features. It was observed that the conserved features were different for immunoglobulins and other beta-sheet proteins. Taylor concludes that ``Matching a new sequence against a data bank of fingerprint templates is potentially a better approach to sequence identification than the current practice of comparing an entire sequence against every other sequence in the databanks ...''. Further template generation and matching was subsequently carried out on the globins by Bashford et al. in 1987[24], but the focus of this work was to identify the sequence features which determine a protein fold, rather than developing a general procedure for sequence alignment and searching. As the conserved features were found to be common to all globins, they were able to distinguish globins from any other sequences.

Growing out of this work in 1987, Gribskov and Eisenberg et al. wrote a program called PROFMAKE[25] and used it to build profiles for globins and immunoglobulins, and search them against a sequence database. This method saw the introduction of position specific gap penalties, and uses a dynamic programming algorithm analogous to that introduced by Needleman and Wunsch for pair-wise sequence comparison. Gribskov and Eisenberg then went on to formally develop programs[26], which Bowie and Eisenberg extended and applied to more families[27].

There still did not exist a robust profile method which could be widely used, until Krogh and Haussler et al. introduced hidden Markov models to computational biology in 1994[10]. Hidden Markov models put the profile methods on firmer mathematical ground by casting the problem within a tight probabilistic framework. This resolved many of the problems associated with profile analysis, improved performance, and by 1996 there were two independent (freely available) software packages being applied to large-scale problems[11]. Without the probabilistic framework provided by the hidden Markov model theory, previous methods had been assigning the position specific scores and gap penalties in a relatively ad hoc manner. The early profile methods could be applied to specific families by tuning the parameters by trial and error, but were unable to achieve the automatic and reliable construction of profiles necessary for them to be applied on a large scale.

Iterative procedures

The most recent development in the field of profile-based sequence searching which has made a big difference is the use of iterative procedures. In 1997 Altschul and Lipman et al. produced PSI-BLAST[28] which is currently the most commonly used profile-based method. Position Specific Iterated (PSI) BLAST is an improvement, and more importantly, an extension of the original BLAST algorithm. The procedure requires a large database of known protein sequences which it uses to iteratively search for homologues to build a profile from. The initial query sequence is searched against the large database for close homologues, which are then aligned. Position specific score matrices are constructed from the alignment and used to search the database for further homologues not detectable by a pair-wise comparison to the initial query sequence using BLAST. The procedure repeats itself a number of times improving the profile by adding more homologous sequences from the large database to the alignment. The resulting automatically generated profile can be used to search for sequences related to the initial query sequence.

Although the position specific score matrices used by PSI-BLAST are in essence very similar to hidden Markov models, they are somewhat optimised for speed, and lack features such as position specific gap penalties and insertion score matrices. They also fail to make use of the Dirichlet mixtures[29] and null models used by hidden Markov model methods. In 1999 Karplus and Hughey et al. published[14] an iterative hidden Markov model procedure similar to PSI-BLAST, but using hidden Markov models instead of position specific score matrices. There have subsequently been improvements made to PSI-BLAST[30], but a comparison in this thesis, finds the performance of Karplus and Hughey's hidden Markov model procedure SAM-T99 to be superior.

Profile versus multiple pair-wise searches

It is worth noting that there have been alternative approaches to profile methods which use multiple pair-wise sequence searches. In 1997 Park and Chothia et al. produced an intermediate sequence search procedure[31] (ISS). The principle is to detect distant relationships using the fact that sequence homology is transitive. Pair-wise methods may be unable to detect a relationship between two divergent proteins, but if there exists an intermediate sequence to which both may be individually linked, then homology between the two divergent proteins is inferred. This gave a significant improvement over the simple pair-wise methods, but was less successful than the profile methods[32].

More recently in 1999 Retief and Pearson et al. developed a visual method[33] for identifying novel homology. This included a multiple sequence search procedure for finding new members of an existing protein family starting from a phylogenetic tree. 20-50 query sequences are selected from different branches of the tree, searched individually with FASTA, and the results collated. The authors liken their procedure to profile-based methods but it is in essence quite different, although there is some analogy between this procedure and the use in this thesis of multiple models to represent a single family.

2. Protein structure analysis

This chapter briefly presents a structural analysis of the cupredoxin family of proteins. This work is not directly linked to the other chapters of this thesis, but embodies the structural theme which runs throughout. Supplementary information for this chapter may be found in the appendix.

2.1 Cupredoxins

The object of the analysis of the cupredoxin family of proteins is to determine the structural and sequence features which define the family. In particular to identify the core of the protein which is common to all members of the family, and to examine the properties which those residues in the core possess. These properties represent the evolutionary constraints on the amino acids required to maintain the stable fold of the protein.

2.1.1 The family

The cupredoxin family was chosen to extend previous work on the relationship between sequence and structure of beta-sandwich folds. The previous work by Chothia et al.[34] examined the variable domains of the immunoglobulins. The cupredoxins however are more divergent from each other than the immunoglobulins and there is far less structural and sequence information available; this makes the analysis less straightforward.

2.1.1.1 Single domain cupredoxins

There are multi-domain and single domain cupredoxins. Only the family of single domain cupredoxins is considered here. There are seven types of cupredoxin for which a structure has been experimentally solved: plastocyanin [35], amicyanin [36], pseudoazurin[37], plantacyanin [38], azurin [39], rustacyanin [40], and stellacyanin [41].The cupredoxins are a superfamily of two-sheet beta sandwiches, of which the plastocyanin/azurin-like family comprises all of the monomeric proteins. The highest resolution structure available for each of the seven domains contained within the family are considered. Plastocyanin is used as a standard for the family, to which the others are compared. The members of the family are single copper (or blue) binding proteins involved in electron transport, particularly in systems of photosynthesis and respiration. Although it is a very diverse family (all sequences less than 20% identical), there is nevertheless a conserved beta-sheet sandwich, with a common core, and the strongly conserved ligand binding site. The two sheets are made up of seven parallel and anti-parallel strands, with a variable alpha-helix at one edge, and the binding site mostly atop one sheet. They occur mostly in plants but also in some bacteria.

2.1.1.2 Previous work

Previous work has been carried out on the analysis of the structures of the cupredoxin family [42] but it differs significantly from the work presented here. Their approach is quite different to that here, and in addition to the monomeric cupredoxins they analyse the multi-domain cupredoxins which are not considered here. In particular they are concerned with a sequence alignment for the whole family. The work shown here is all novel; in particular the definition of the core of the family, the identification and characterisation of the key residues, analysis of the binding sites and sheet movements are all new.

2.1.2 Structures

The basic structure in figure 2.1 is common to all, except for the alpha helix and loop regions which are variable. Although the two sheets are conserved amongst the structures, the positions of the sheets relative to each other may vary. There are two beta sheets with the strands labeled alphabetically from the N to C terminus, and a copper atom binding site supported by the sheets.

Figure 2.1: The structure of plastocyanin viewed with the second beta sheet behind the first. Note the copper atom bound atop strands G and F of the first sheet, and the alpha helix and irregular loop region to the edge of the two sheets.
\resizebox*{0.9\textwidth}{!}{\includegraphics{figures/cupredoxin.eps}}

2.1.2.1 Hydrogen bonds

The first task was to calculate all of the backbone hydrogen bonds which join the strands into parallel and anti-parallel beta sheets. This was done for all seven structures, and diagrams were drawn detailing the bonds in a similar way to figures 2.3 and 2.4 (later). The pattern of hydrogen bonds identifies the strands in the two sheets and suggests an initial set of equivalent positions between structures. The conservation of hydrogen bonds also gives an initial suggestion as to the extent of conservation. Later on, figures 2.3 and 2.4 show the hydrogen bonds which are conserved between structures.

When comparing two hydrogen bond patterns, a spatial shift of two residues either up or down, in the plane of the sheet produces an alternative solution. As a consequence the hydrogen bond diagrams do not always give an unambiguous positional equivalence of strands, and the correct solution must be determined by doing three-dimensional superpositions.

2.1.2.2 Superpositions

Once the backbone hydrogen bond patterns have been used to suggest an initial structural alignment, full superpositions can be carried out. Figure 2.2 shows the result of full superpositions between three structures; the conserved strands are clearly aligned whereas the loops and helices are not. Plastocyanin was chosen as a `master' structure and the other six structures were each in turn aligned to it.

The procedure for structural alignment made use of a program called `PINQ' for three-dimensional analysis written by Arthur Lesk and is as follows. The superpositions were all pair-wise between plastocyanin and another. Although there is very little difference in the structures between strands in a sheet, there can be more variation in the relative position of the two sheets to each other (see the end of the section). The initial work on hydrogen bond patterns was used to define an equivalence of a few residues on each strand of a sheet between the two structures. These positions alone were used to fit the two structures to each other by minimising the r.m.s deviation between the backbone atoms. The region of each strand which was superposed was extended by attempting to include one more residue along the chain, then re-fitting the structures. Any residue position with c-alpha backbone atoms in the two structures deviating by more than three angstroms from each other was rejected, but a better fit meant that the new position was added to those already included in the structural alignment. In this way all strands were gradually extended at either end by adding one residue at a time until no more positions could be added without producing a c-alpha atom difference of more than three angstroms.

Once the two sheets of the structures had been independently aligned to their plastocyanin counterparts, the positions from both sheets were used to fit the whole structure minimising the r.m.s deviations of both sheets at once; this gave the final structural alignment. The results of the structural alignments are shown in table 2.1.

Figure 2.2: This figure shows the conserved and variable regions of the cupredoxin family. Three cupredoxin structures have been superposed. The yellow regions are aligned, and the bound copper atom is green. The light purple regions show the variation in structure of the non-conserved regions
\resizebox*{0.9\textwidth}{!}{\includegraphics{figures/mult_align.eps}}


Table 2.1: The structural alignment of seven structures to plastocyanin (1plc). For each structure the PDB code is shown, followed by the number of residues of the domain, the number of those which align to the structure of plastocyanin, the number of those which are identical in sequence, the identical residues expressed as a percentage of those aligned, and the root mean square deviation in angstroms between the aligned parts of the two structures.
Name PDB Residues Aligned Identical % identity R.M.S. fit
Amicyanin 1aaj 105 73 20 27 1.077
Azurin 1aiz 129 63 13 21 1.979
Cucumber protein 2cbp 96 55 15 27 1.452
Stellacyanin 1jer 109 58 13 22 1.300
Pseudo-azurin 1paz 123 81 21 26 1.063
Rustacyanin 1rcy 155 59 13 22 2.021
Plastocyanin 1plc 99 99 99 100 0


Once the pair-wise alignments have been made, a comparison of the regions of plastocyanin which align to the other structures can be used to find the regions which are common to all. This effectively gives the multiple structural alignment between all of the cupredoxins. The equivalent regions of all seven structures are listed here using the PDB residue numbering in the `ATOM' records:

It became clear during the analysis that the cupredoxins fall into two sub-families with rustacyanin (1rcy) and stellacyanin (1jer) being more similar to each other than to the rest, and vice versa.

2.1.2.3 Contacts

It is possible to calculate the contacts made by residues in the protein, that is atoms of different residues whose distance apart is less than a specified threshold. Although contacts were mostly used to identify the residues involved in the binding site (see later), they were also used to help decide the alignment of strands of beta sheet between two structures when there was any uncertainty. Contacts between peptides in the tightly packed core of the structure tend to be conserved, and hence give information in addition to the hydrogen bonds.

2.1.2.4 Accessible surface area

The accessible surface area was calculated using PINQ for every residue aligned in the structural alignments. The accessible surface area is the area of the peptide which can make a van der Waals contact with the surface of a water molecule on the surface of the protein. The accessible surface area indicates which residues are on the surface of the protein and which are buried and to what extent. All of the analysis so far is brought together in figure 2.4 for the rustacyanin-stellacyanin sub-family, and in figure 2.3 for the superfamily including plastocyanin and the others. The hydrogen bonds, structurally conserved residues, secondary structure, contacts, and accessible surface area all go toward defining the common features of the family, and most importantly the inner core of the protein (see later).

Figure 2.3: The conserved structure of the plastocyanin-like members of the cupredoxin family. The dashed lines represent hydrogen bonds, the solid lines the backbone of the chain, with the thicker sections for beta strands. Each circle is a peptide with the darkness representing the surface accessible area. The darker the circle the less accessible it is, going from maximum exposure (white) to minimum exposure (black).
\resizebox*{0.9\textwidth}{!}{\includegraphics{figures/schem1.eps}}

Figure 2.4: The conserved structure of the rustacyanin-like members of the cupredoxin family. Note the short section of alpha-helix on the far right, and the disruption in strand `A' in the centre. The dashed lines represent hydrogen bonds, the solid lines the backbone of the chain, with the thicker sections for beta strands. Each circle is a peptide with the darkness representing the surface accessible area. The darker the circle the less accessible it is, going from maximum exposure (white) to minimum exposure (black).
\resizebox*{0.9\textwidth}{!}{\includegraphics{figures/schem2.eps}}

2.1.3 The buried inner core

The information gathered in the previous section is necessary to define the common core, which consists of the binding site and the inward-facing residues of the two sheets which support it. The part of the core of the cupredoxin family responsible for maintaining the fold is shown from different angles in figures 2.5, 2.6, and 2.7. The residues in the inner core of a protein are generally buried and pack tightly together. As a consequence there are very strong structural constraints on the residues in the core, and only a limited variation is possible without disrupting the fold of the protein and de-stabilising it. These evolutionary structural constraints are observable in the sequences of the proteins of the family; the residues identified in the core are strongly conserved, whereas the other regions can be extremely variable in sequence. The identification of these key residues in the core which are responsible for maintaining the fold constitute the minimum requirements of any member of the family. If the properties of these residues can be ascertained then this information is a very powerful and can be used to determine whether a sequence of unknown structure could be a member of the family, with key residues with the same properties, and the same conserved core, and hence overall three-dimensional structure, and ultimately the same (or related) function.

Figure 2.5: The core structure (side view) of the cupredoxin family (plastocyanin shown) rendered in space-fill representation. Atoms which are part of sheet one are shown in yellow, sheet two in red. The binding site residues are shown in green.
\resizebox*{0.9\textwidth}{!}{\includegraphics{figures/pcs3_white.eps}}

Figure 2.6: The core structure (front view) of the cupredoxin family (plastocyanin shown) rendered in space-fill representation. Atoms which are part of sheet one are shown in yellow, sheet two in red. The binding site residues are shown in green.
\resizebox*{0.9\textwidth}{!}{\includegraphics{figures/pcs2_white.eps}}

Figure 2.7: The core structure (back view) of the cupredoxin family (plastocyanin shown) rendered in space-fill representation. Atoms which are part of sheet one are shown in yellow, sheet two in red. The binding site residues are shown in green.
\resizebox*{0.9\textwidth}{!}{\includegraphics{figures/pcs1_white.eps}}

2.1.4 Sequences

To investigate the properties of the residues in the core of the protein, additional sequences of unknown structure are used to provide more information. By aligning other sequences to those of known structure the possible variations in the residues in core positions can be observed. It was mentioned at the end of the last sub-section that the key residue information can be used to identify homologues, but it is still possible to detect some homologues using sequence information alone.

2.1.4.1 Homologues

The sequence of each of the structures was compared to a non-redundant database[43]. Fasta[20] searches were performed using the sequence of each structure in the analysis individually. Sequences that were found by more than one of the searches were assigned to that with the strongest similarity. For each search, partial sequences were eliminated, then those remaining were aligned to the sequence of the structure used for the search using clustalw[44]. The result was seven sequence alignments of non-redundant homologues from a sequence database. The accurate structure-based sequence alignment of the seven proteins was used to align the seven alignments to each other, creating one large alignment of 84 sequences. This large alignment was carefully studied and adjusted using the results of the structural analysis to correct the alignment of the homologous sequences. Where necessary further investigation of the structures was carried out to solve specific uncertainties in the alignment. The final alignment is in the appendix.

2.1.4.2 Residue analysis

The alignment of 84 sequences was analysed to see what variations occurred in the residues of the conserved core and binding site. The data referred to in the rest of this sub-section is in the appendix. The occurrence of every amino acid in each column was counted. From this the number of residues were added up of each of the types: surface, neutral, and buried. These types are simply defined by grouping the residues based on characteristics such as hydrophobicity and can be seen in the appendix. The same grouping as in the previous immunoglobulin work [34] was used.

A simple program was written to extract the salient features at each site based on the frequency at which the residues and types are observed. The result of the following residue analysis for the conserved positions in the large cupredoxin alignment is in the appendix. A score was calculated for every combination of residues, and for every combination of types (surface, neutral, or buried). The highest scoring combination of types and the top-scoring three amino acid combinations were displayed along with the percentage of all sequences satisfying the combination. The score for any given feature (combination of allowed residues or types) was calculated as the negative log odds probability of getting the observed frequency of the amino acids in the feature. The background distribution of amino acids was taken from the numbers observed in the whole alignment. Thus the program picks and scores the most significant features of any column in an alignment.

As an example the position corresponding to residue 14 in plastocyanin has 49 phenylalanine, 20 tryptophan, 7 tyrosine, and 1 aspartic acid residue in the alignment. It is obvious that a good score would be given to the feature ``phenylalanine or tryptophan'' , but the calculation shows that a better score is given to ``phenylalanine, tryptophan, or tyrosine'' and that a worse score is given to ``phenylalanine, tryptophan, tyrosine, or aspartic acid''. This might not be obvious by inspection.

2.1.5 The family fingerprint

Combining the structural and sequence analysis leads to the final characterisation of the family. Figures 2.8 and 2.9 show the structural and sequence determinants of the two beta sheets making up the proteins in the cupredoxin family. This information summarises and concludes the main part of the analysis.

Figure 2.8: Sheet one. The key features of the conserved core of the cupredoxin family. The conserved hydrogen bonds are shown by dashed lines. The boxes represent key residues with their possible amino acids and observed accessible surface area inside. The circles represent other conserved amino acids which are not in the buried inner core.
\resizebox*{17cm}{!}{\includegraphics{figures/core1.eps}}

Figure 2.9: Sheet two. The key features of the conserved core of the cupredoxin family. The conserved hydrogen bonds are shown by dashed lines. The boxes represent key residues with their possible amino acids and observed accessible surface area inside. The circles represent other conserved amino acids which are not in the buried inner core.
\includegraphics{figures/core2.eps}

2.1.6 The binding site

In addition to the main part of the structure, the binding site is an important and well-conserved feature worthy of further examination.

In plastocyanin the copper ligand is completely buried and has bonds with five residues, which form a tetrahedral shape. At the ends of strands G and F of sheet one there are a methionine and a cysteine which both form sulphur bonds with the copper. This is the base of the tetrahedron. Two histidines in loop regions form charged aromatic ring bonds. These are the sides of the tetrahedron. One of the histidines is in the mid-loop between the G and F strands of sheet 1, and the other in the loop between the C and D strands joining the two sheets. The second histidine has an important adjacent alanine residue preceding it which is involved in maintaining the position of the histidine. Following the second histidine is the fifth and final binding residue which is the weakest of the five and forms a bond via a backbone oxygen atom with the copper (fairly weakly at just less than four angstroms). The binding site of plastocyanin is shown in figure 2.10 and the binding sites of all of the other six structures in the analysis are in the appendix.

The binding sites of the other structures in the family follow the same pattern but the fifth bond with the backbone is often weaker. The fifth binding residue (glycine) is less conserved because it is the backbone which binds, putting less of a constraint on the residue which is a proline or a leucine in some other structures. The contribution of the fifth residue to the binding site would not be identifiable if it were not for the collective analysis of all structures showing that although it is very weak it is conserved. In fact the publications of the structures do not identify the backbone as being involved in the binding site at all. The greatest variation of the binding site is in stellacyanin where the cysteine in strand F is a glutamine which makes the site more exposed and flatter.

Figure 2.10: The copper binding site of plastocyanin (1plc). The copper atom is shown as a sphere in the centre with the wire-frame representation peptides around. The five bonds are shown with dotted lines and the distances are marked in angstroms.
\resizebox*{0.9\textwidth}{!}{\includegraphics{figures/plc_bind.ps}}

2.1.7 Sheet movements

Table 2.2 shows the relative shifts of the sheets in the structures, compared to the positions of the sheets in plastocyanin (1plc).


Table 2.2: The movement of the sheets relative to plastocyanin. The numbers correspond to the size of the transformation required to move the sheets relative to each other to obtain the same relative positioning as plastocyanin.
Structure Translation (angstroms) Rotation (degrees)
1plc 0 0
1paz 0.275 1.57
1aaj 0.789 5.14
2cbp 1.866 6.43
1jer 1.961 10.61
1aiz 3.062 15.66
1rcy 3.767 9.30


The procedure for calculating the shifts is as follows. First the master structure (plastocyanin) is moved to a chosen orientation about the z-axis. Then the master structure is moved such that the backbone atoms of the conserved residues in the second sheet have a minimum r.m.s. deviation from the x-y plane. The master structure is then moved in the plane such that the centre of mass of the backbone atoms of conserved residues in the second sheet lies at the origin. The second structure which is to be compared to the master is then placed such that the backbone atoms of the conserved residues in the first sheets of both structures have a minimum r.m.s. deviation from each other. Finally the translations and rotations necessary to move the second sheet of the master structure to the position of the second sheet of the second structure are calculated.

The movements of the sheets relative to each other are large compared to the movements within sheets. This means that although there are very strong structural constraints holding the strands in their positions in the sheets, there is much less of a constraint on the whole sheets to move relative to each other.

2.1.8 Structural and functional constraints on the sequence

Unlike the immunoglobulin variable domains which make use of the variability of their loops in their function, the cupredoxin family has a strong constraint on the binding site required to maintain the function of binding the copper atom for electron transport. To maintain the tetrahedral binding there is strong conservation of the residues in the loops which actually bind the copper atom. Residues in the loops surrounding the binding residues are also important to hold them in the right position and are also conserved. The binding site is supported by the rest of the structure. The binding site sits mainly atop sheet 1 consisting of residues in the loops between strands of the sheet. As a consequence the sheet supporting the binding site is also required to maintain the function, and hence the tight-packing, hydrophobic, inward-facing residues of the sheet have a limited number of allowed sequence variations. Although sheet 2 provides the other half of the sandwich and inward-facing residues of the buried core of the structure, it's only direct involvement in the binding site is the central strand C which leads into a loop joining sheet 1 which contains one of the binding residues. As a consequence, unlike in the immunoglobulins, there are fewer constraints on the sequence of sheet 2 whose position relative to sheet 1 and the binding site is variable.

2.2 Conclusion

This chapter describes the detailed structural analysis of the family of monodomain cupredoxins. The buried tightly packing core of the structure is defined. The key residues which are in the core and thus determine the fold of the protein are determined. The structurally constrained key residues have their characteristics determined which are required to maintain the structure of the protein common to the family.

The work in this chapter can be considered independently from the other chapters, but the structural viewpoint emerging from this detailed analysis of a single protein family, is carried through the other chapters, which move toward less detailed work on the broader set of all known structures.

3. Hidden Markov models

3.1 A brief description of hidden Markov models

3.1.0.1 General hidden Markov models

Hidden Markov models (HMMs)[10,11,12] have more uses than those described in this thesis, most notably in speech recognition [45]. HMMs can be used to detect distant relationships between proteins based on their amino acid sequences, and this is what is described here. There are other applications within computational biology, such as recognition of trans-membrane helices [46], CpG islands, and signal peptides [47]. Some of the principles in this brief description also apply to other applications, but it is by no means general.

3.1.1 The model

The model itself is not of primary interest and may be most easily thought of as a profile (in this case of a group of proteins) which embodies certain characteristics of protein sequences. What follows is a description of the model constructed from a set of sequences, whose function is to emit random sequences consistent with patterns of residues found in the sequences used to construct the model. The relevance of random sequence generation is not explained until the next sub-section.

3.1.1.1 The states

The model is made up of states and transitions between these states. The most basic state is a match state, whose function is to emit a single amino acid residue. The state is usually more likely to emit some residues than others. The match state has a probability of emission for each amino acid, and the 20 probabilities sum to one. A linear chain of these states joined by transitions between them from one to the next is a very basic model which is capable of emitting a random sequence, with one residue emitted by each state.

Since in nature the sequences have different lengths additional states are needed. A delete state is a state which simply emits nothing. The transitions to and from this state may allow it to bypass an emission state.

There are also insert states which are effectively the same as emission states because they emit a single residue and have a probability for each one. These however appear between match states and have a transition from themselves to themselves allowing the insertion of one or more residues between two match states. Match states vary a great deal in their probabilities but insert states are usually more generic.

3.1.1.2 The architecture

The architecture is the way in which the states are linked by transitions to make a complete model. See figure 3.1 drawn by Martin Madera.

Figure 3.1: This is the architecture of a hidden Markov model with only four segments, for real proteins there are a similar number of segments to amino acids in the protein. Match states are squares, insert states diamonds, and delete states circles. The colouring distinguishes the four segments, and the begin and end states. Transitions are indicated by arrows.
\resizebox*{0.9\textwidth}{!}{\includegraphics{figures/figure1b.eps}}

There are two special states to begin and end the sequence. Not only are there emission probabilities in the insert and match states, but there are probabilities for all transitions between the states, making deletions and insertions more or less likely in different places. A sequence is generated by starting in the begin state, then following a transition to another state, where a residue may be emitted, then on to the next state and so on until the end state is reached. There are many possible paths through the model.

3.1.1.3 A sequence generator

Random transitions through the model and emissions from the states are guided by the probabilities, and a sequence is generated. Since the output from the model is a sequence, the states are `hidden' because they cannot be seen by looking at the output sequence. The sequences generated will on average have the characteristics which the model embodies via the variation in transition and emission probabilities. If a hidden Markov model is designed to represent a group of proteins, then the sequences it emits should be potential members of the group (in reality these generated sequences are unlikely to be biologically functional).

3.1.2 The uses of a sequence generator

The ability to generate sequences is of limited use, but the model lends itself to more useful applications.

3.1.2.1 Sequence scoring

The most important application for the work in this thesis is that of remote homology detection. Pair-wise sequence comparison methods ask the question ``are two sequences related or not?'', but this is limited since some related sequences have very little similarity[48]. Profile based methods ask the question ``given a group of related sequences, is another sequence related to that group?'' and are far more effective than pair-wise methods[32]. An HMM is such a profile representing a group of proteins.

It is possible to calculate the probability of a sequence having been generated by a model, and hence a score relating the sequence to the group represented by the HMM. A database of protein sequences can be searched by a model for members of the group by scoring all sequences in the database and selecting those with a significant score.

The detailed method by which the emission probability of a given sequence from the model is calculated is quite complicated, and two algorithms are widely used: the Viterbi algorithm calculates the probability of the sequence generated from the most likely path through the model and the forward algorithm calculates across all paths. Probabilities obtained in this way are actually a bad statistic in practice, so a null model is introduced to correct for sequence length. There are a variety of null models to choose from. A very simple null model would be a model of the same length with emission probabilities corresponding to the frequencies observed in nature, thus mimicking a random sequence with no preference. A better null model has the emission probabilities derived from the geometric mean of the probabilities in the model, thus reducing the importance given to similarity in sequence composition. Possibly the best null model is the reverse null model. This takes the reversed model as the null, simultaneously addressing the problems of sequence composition and low complexity sequence. Another advantage of the reverse null model is that it can be used to generate E-values (see later).

For more detailed information on scoring algorithms see `Biological sequence analysis' [49].

3.1.2.2 Model building

If a model is to represent a group of proteins, the emission and transition probabilities must be generated in such a way as to capture the characteristics of the group, for example figure 3.2. A model is generated from a multiple sequence alignment. The residues in the columns of the alignment are used to generate the match state probabilities, and the insertions and deletions in the alignment to generate the transition probabilities. Sequences in the alignment are weighted such that their contribution is related to their similarity. In this way if there are a large number of sequences with high similarity and a few with distant similarity the model will not be over-dominated by the numerous similar sequences.

Figure 3.2: At the top is a part of a model built from an alignment of homologues to plastocyanin which is a cupredoxin protein. Below is part of the sequence alignment from which it was built. The bars are divided into letters, the size of which are proportional to their probabilities. The overall size of the bar is the importance of the state to the score. This is calculated as the sum of the standard deviations of the log probabilities from the geometric mean for the model.
\resizebox*{0.9\textwidth}{!}{\includegraphics{figures/model_align.eps}}

The algorithm used to estimate the probabilities from the (weighted) amino-acid frequencies in the alignment is quite involved. Since the observed residues in the alignment only represent a sample of the possibilities, one would not want a zero probability for a residue which does not appear in the column of the alignment. This is avoided by mixing the observed residue frequencies with some fixed set of probabilities. These fixed sets are called Dirichlet distributions [29] and were obtained by examining the observed frequencies in a very large number of naturally occurring sequences to give a background probability (Dirichlet distribution). One final twist is that there is not just one Dirichlet distribution to use on a given column in the alignment; there is a set of Dirichlet distributions called a Dirichlet mixture (or library of priors) with different characteristics representing different residue types.

3.1.2.3 Multiple sequence alignment

Also relevant to this thesis is the use of HMMs for sequence alignments. As was explained earlier, it is possible to calculate the most likely path through the model for a given sequence. This path gives an alignment to the model; every match state which the path goes through gives a residue of the sequence aligned to that segment (column) of the model, and insertion emissions fall unaligned between states, with delete states leaving a column in the alignment blank. This alignment alone is of limited interest, but if other sequences are also aligned to the model then this gives a pseudo-multiple alignment of sequences. It is not a true multiple alignment because the alignment of every sequence is unaffected by the others, however it is more than a collection of pair-wise alignments because the model to which they are aligned is a profile representing a group of sequences. Figure 3.1 shows the alignment of sequences in figure 3.2.


Table 3.1: This is the alignment of the sequences to the plastocyanin model above. The numbering at the top corresponds to the segment number in the model. Residues passing through match states are in upper case, passage through a delete state is shown with the `-' character, and residues emitted from insert states are in lower case. The `.' characters merely pad the alignment to maintain the columns.
\begin{table}{\centering\resizebox*{0.9\textwidth}{!}{\includegraphics{tables/cupred_align.eps}}\par }
\par\end{table}


3.1.3 Expectation values

Expectation values (E-values) are a general value not specific to HMM methods, but they are used extensively in this work. In theory E-values from any method should be comparable.

3.1.3.1 What E-values mean

Since HMMs are a probabilistic method, statistically one expects occasional chance errors. E-values are the expectation of the number of errors per query. If an HMM is used to score a database of sequences (a query), then the expected number of sequences unrelated to the model (errors), which have the same score or better than any sequence, is given by its E-value. It is important to note that an E-value is only true in the context of the query, for example a query on a large database will have more errors than a query on a small database searched with the same model. E-values can be used to decide what error rate you are prepared to allow in your results, but they say nothing about the success rate.

3.1.3.2 How E-values are calculated

The errors produced by HMM scores follow an extreme value distribution. One way of estimating E-values is to score the model against a large number of random sequences (assumed unrelated to the model, and therefore all errors) and plot the distribution. An extreme value distribution can be fitted to the results and used to estimate the E-values of scores produced by the model. Another approach is to score the reverse of the sequence as well as the sequence itself; this is the reverse null model mentioned earlier.

Since the reverse sequence has the same error distribution as the forward sequence, subtracting the reverse score from the forward score cancels out the component of `error' from the distribution. Both forward and reverse scores are expected to be drawn from the same Gumbel distribution which has two parameters, however the difference of two variables drawn from the same Gumbel distribution follows a simpler one-parameter distribution. It appears to be the case empirically that the parameter is the same for all models and equal to one, so the distribution of the differences in forward and reverse scores is known. The E-value can then be directly estimated from the query conditions (database size), without the need for ad hoc model calibration requiring the scoring of random sequences and fitting the distribution.

3.2 HMMs for detecting domains in protein sequences

There are numerous methods for detecting homology between protein sequences, but they fall into two basic categories: pair-wise methods (such as Blast[21] and FASTA[20]) and profile methods. As mentioned earlier, HMMs are a profile method and although pair-wise methods are much faster, profile methods are able to detect much more distant relationships[32]. Probably the most commonly used profile method is Position Specific Iterative Blast (PSI-Blast)[28] which is comparable in performance (see later), but HMMs are also widely used for remote homology detection.

3.2.0.1 Software packages

There are two commonly used software packages for using HMMs for protein homology detection: Sequence Alignment and Modeling system (SAM http://www.cse.ucsc.edu/research/compbio/sam.html) and HMMER (http://hmmer.wustl.edu). The two methods are examined in more detail later but a key point is that to build a model, a multiple sequence alignment of homologous proteins is required, and the SAM package includes a procedure for automatically generating them, whereas HMMER does not. This represents a significant advantage of the SAM package.

3.2.1 The SAM T98 method

The SAM package comes with a procedure called `T98'[14] which was later improved and renamed `T99'. This procedure creates multiple alignments of homologous sequences for the purpose of model building. The procedure begins with a seed, which can be a single sequence or a set of sequences, aligned or not. The procedure then uses a large database of sequences from which to extract homologues by an iterative process to add to a growing alignment.

3.2.1.1 The method

Figure 3.3: The SAM T98 procedure starting with a single seed sequence.
\resizebox*{0.9\textwidth}{!}{\includegraphics{figures/T99.eps}}

The default T98 procedure follows these steps (see figure 3.3):

The large non-redundant database of sequences used by the T98 procedure throughout this thesis is NRDB90 [43] which contained over 200000 sequences at the start of this work, but now has almost 400000. All sequences in this database have less than 90% of the residues in their sequences identical.

3.2.2 A comparison of different model building procedures for five protein families

3.2.2.1 Previous procedures employed

Previous practice by the MRC bioinformatics group[50] and others[51] has been to build one HMM for each protein family or superfamily using an accurate alignment of the sequences of diverse family members. However, there are two problems with this approach: one practical and one theoretical. The practical problem is that producing accurate sequence alignments is not a trivial problem and for very diverse sequences requires expert human intervention. The theoretical problem is that it has not been demonstrated that using one model built from a good alignment of selected diverse sequences produces better results than using multiple models built from different single seed sequences and their homologues (as described above). To investigate this second problem various methods of modeling five chosen superfamilies were compared.

3.2.2.2 A comparison of four procedures

Five superfamilies were selected because detailed structural and sequence analyses were available, along with the expert knowledge acquired from the analysis (previous work, published[52] and unpublished). This provides both accurate structure-based hand built sequence alignments, and the means with which to verify the results of homology searches. The different models were built for each superfamily and searched against NRDB90, and then checked and compared. All of the models were built using the SAM T98 iterative procedure described above. The following four variations on the method were investigated:


Table 3.2: The number of homologous sequences found by the four procedures.
Protein family: Cupredoxins CytochromeC Flavodoxins Globins I set IGs
procedure 1 94 22 121 492 8440
procedure 2 63 21 121 489 8431
procedure 3 82 22 119 492 8298
procedure 4 106 22 130 505 9687


The results are shown in table 3.2. The homology criteria were as follows: any hit with a `reverse' score lower (better) than -15 was taken as a homologue, hence the comparisons presented above depend on the scores produced by the different methods being roughly equivalent. This value (-15) was the score found to produce a 1% rate of errors per query when using T98 to score all of PDB versus PDB[32]. The `reverse' score is the forward score offset against the score of the sequence in reverse, which filters out low-complexity matches. The target sequences were checked by hand for false homologues using a combination of annotation, alignments, structural knowledge and further sequence searching. A small number of unannotated potential false homologues were found to be in the search results, and only a couple of certain false homologues. The immunoglobulins were not checked because the homologues were too many, and in the case of the flavodoxins the nitric oxide reductases were not counted as false because they are a well known case of sequence similarity between different proteins[53]. The results indicate that the use of multiple models where each starts with a single sequence produces the best results. When sequence alignments are used the addition of the constraints described in (2.) have little effect. The Cupredoxin hand alignment included only members of 1 of the 3 families within the superfamily and in this case procedure 2 did badly. The data also shows that there is some loss in performance using automatic alignments for seeding the T99 procedure, but not a great deal. Further analysis of this data (tables 3.3,3.4, and 3.5) shows that not only did the multiple models procedure (4) provide more hits but it found essentially everything found by the others.


Table 3.3: A comparison of the hits made by procedures 4 and 1.
Protein family: Cupredoxins CytochromeC Flavodoxins Globins I set IGs
unique to 1 0 1 0 0 3
unique to 4 12 1 9 13 1250
intersection 94 21 121 492 8437



Table 3.4: A comparison of the hits made by procedures 4 and 2.
Protein family: Cupredoxins CytochromeC Flavodoxins Globins I set IGs
unique to 2 0 0 0 0 7
unique to 4 43 1 9 16 1263
intersection 63 21 121 489 8424



Table 3.5: A comparison of the hits made by procedures 4 and 3.
Protein family: Cupredoxins CytochromeC Flavodoxins Globins I set IGs
unique to 3 0 1 0 0 5
unique to 4 24 1 11 13 1394
intersection 82 21 119 492 8293


3.2.2.3 A breakdown of the models

The multiple model method can be seen as modeling a family of proteins by several multiple overlapping models rather than one single model. Because a single model is too general to contain all of the features and variations in a family, the multiple models together can better represent more complex and diverse sequence variation. See figure 3.4 for a representation.

Figure 3.4: This picture shows a theoretical sequence space, populated by family members (blue) and non-family members (red). On the left it is shown how multiple overlapping models are more specific than the single general model on the right.
\resizebox*{0.9\textwidth}{!}{\includegraphics{figures/models.eps}}


Cupredoxins

Seed sequence only this model only other models this and others model sequences
1plc 1 59 46 49
1paz 0 60 46 48
1aaj 2 59 45 49
1aiz 14 86 6 21
1rcy 2 95 9 6
1jer 0 69 37 39
2cbp 1 68 37 40

CytochromeC-prime
 
Seed sequence only this model only other models this and others model sequences
256b 2 20 0 21
2ccy 20 2 0 4

Flavodoxins
 
Seed sequence only this model only other models this and others model sequences
5ull 20 5 105 73
1ofv 5 20 105 67

Globins
 
Seed sequence only this model only other models this and others model sequences
1bab 0 77 428 419
1mbd 0 55 450 442
1eca 0 53 452 196
1lhl 4 20 481 266
2lhb 1 22 482 491
1mba 0 13 492 329
1flp 1 21 483 265
3sdh 0 39 466 388
1hbg 1 29 475 386
1ith 0 53 452 215
1hlb 0 25 480 440
1ash 1 408 96 36




Table 3.6: A breakdown of homologies found by individual models. The numbers in the columns refer to: the number of homologues found by one model alone, the number not found by the model but found by other models, the number found by the model and also found by others, and the number of sequences from which the model was built.
Immunoglobulin
(I set)
Seed sequence only this model only other models this and others model sequences
TEL 84 519 9084 2853
HNCAM1D1 4 6061 3622 1666
Twitchin18 1 6382 3304 1239
TitinM5 103 453 9193 2759
VCAMD1 15 5549 4123 1109
ICAM1D1 0 9669 18 20
ICAM2D1 0 9668 19 20
H1 16 6442 3229 1436
H2 1 9001 685 660
H3 46 1619 8022 1682
H4 101 1392 8194 2684




The complete breakdown of models is in table 3.6. The individual models in this procedure mostly found the same hits, which implies that a successful model built with T99 represents the family, and not the seed as is often misunderstood. This important point is discussed in more detail later. Some models were completely redundant with respect to others and some found outliers which the others did not find (see later for a more detailed analysis of model redundancy). These results solve the theoretical problem of whether one model or multiple models are most effective, and hence remove the need for solving the practical problem of accurately aligning distantly related sequences for the purpose of generating good hidden Markov models.

3.3 Large scale assessment

The examination of a small number of families was necessary because of the lack of good structure-based hand alignments, but when investigating the multiple models procedure further, it is possible to carry out large scale assessments which will give more accurate results.

3.3.1 The SCOP all against all test

3.3.1.1 The necessity for the test

The test procedure used in the previous section relied on a hand analysis of the results to verify the success or failure of a model. This is far too time consuming to carry out on a large scale, or to repeat for many conditions. It is desirable to increase the size of the data set which tests are carried out on (to make the results more significant), and to find a data set which spans more diverse families of proteins (to make the results more universal). To do this a large set of protein sequences is needed, with reliable independent family information. The SCOP[6] database provides such a set.

3.3.1.2 Structural Classification of Proteins (SCOP)

The Structural Classification of Proteins database contains all proteins of known structure with coordinates deposited in the PDB[54], and is usually kept up to date to within six months. SCOP is a domain based classification where genes may be comprised of one or more SCOP structural domains. A SCOP domain is defined as a minimum evolutionary unit. These are globular units, but unlike other databases, combinations of globular units are only separated if they are observed separately elsewhere in nature, i.e. in combination with different partners or in isolation. This means that some pairs (or more) of globular units are classified as a single domain because they are only observed together, and hence the pair (or more) is the minimum evolutionary unit.

SCOP is a hierarchical classification with several levels:

The classification is carried out by Alexey Murzin. He examines every new protein structure individually using a variety of techniques which assures an accurate and reliable classification, which makes the database ideal for use as a benchmark for comparisons.

3.3.1.3 Description of the test

As was alluded to above, the test is carried out at the superfamily level. Since the structural and functional evidence used to classify the domains is able to link more remotely related domains than sequence information alone, the ability of sequence methods to detect SCOP superfamily relationships is a sufficiently hard test. The test could be used to assess the performance of any method, but the T99 method is used as an example in this description.

The sequences of all of the SCOP domains filtered to 95% sequence identity (to eliminate redundancy) were used as seeds for the T99 procedure, creating an HMM for each one. These models are scored against the sequence database from which they were built. The SCOP classification is then used to decide for every hit made by every model whether it is a true or a false homologue. A model built from any given seed is supposed to represent the superfamily which the seed belongs to, and any hit to a sequence in the same superfamily is considered a true hit. See figure 3.5 for a diagram of the test procedure.

Figure 3.5: This diagram shows how a set of models is searched against PDB and the SCOP classification used to decide whether the hits are true or false. The dash-dotted lines represent domain sequences.
\resizebox*{0.9\textwidth}{!}{\includegraphics{figures/scop_test.eps}}

This test has been used in previous work [48,32]. The same test was used to compare profile and pair-wise methods as mentioned above. The data produced by the test carried out on the HMMs provides the framework for the analysis in following sections.

3.3.1.4 Exceptions

When considering the hits made by any method on the SCOP test some exceptions to the superfamily classification should be taken into account. Since the classification is conservative it is possible that some superfamilies are classified apart when there could be a relationship between them. This can be dealt with by allowing hits outside the superfamily but within the same fold to be ambiguous, and are not taken as either true or false. There are a few known inter-superfamily relationships which are considered acceptable, most notably amongst TIM barrels[55], and between Rossmann folds[56].

3.4 Analysis of hidden Markov model behaviour

The test described above has been used to investigate various aspects of the behaviour of HMMs built with the T99 procedure.

3.4.1 Early investigations

The early tests were carried out on SCOP version 1.48, using all domain sequences filtered to 95% sequence identity, producing 4951 models. The sequences were obtained from the ASTRAL database[57], which is discussed in more detail later. The default T99 procedure was used with no modifications.

3.4.1.1 SCOP classes

The first thing examined was the ratio of true to false hits for any given E-value threshold (figure 3.6). The different SCOP classes were compared and it was found that their behaviour differed significantly. The same results were calculated taking hits to the same fold but a different superfamily as false, rather than ambiguous as described above. This showed no qualitative difference, and only a very minor quantitative change. This means that differences in fold variations in the different classes are not responsible for the differing results in figure 3.6.

\resizebox*{0.9\textwidth}{!}{\includegraphics{figures/graph1.eps}}

Figure 3.6: The ratio of true to false positives for all SCOP classes. Hits to the same fold but a different superfamily were considered ambiguous here (neither true nor false). The SCOP classes are as follows:
Class Description
1 All alpha proteins
2 All beta proteins
3 Alpha and beta proteins (a/b)
4 Alpha and beta proteins (a+b)
5 Multi-domain proteins (alpha and beta)
6 Membrane and cell surface proteins and peptides
7 Small proteins

3.4.1.2 Test size

Since the classes vary a great deal in size, different sized sub-sets of the SCOP domains were examined (figure 3.7). The sub-sets were created by filtering the seeds to different percentages of sequence identity. The classes have different numbers of folds and superfamilies; they are not homogeneous with many superfamilies over-represented, under-represented, or not yet represented at all by known structures.

Figure 3.7: The ratio of true to false hits for sets of models filtered to different percentage sequence identity (given by the `pdb' number in the legend).
\resizebox*{0.9\textwidth}{!}{\includegraphics{figures/graph2.eps}}

There is an obvious trend in the graph showing an increase in the ratio of false hits to true as the set size decreases. This is what one would expect as at lower percentage sequence identity the closer homologues which are easier to detect are fewer with respect to the more distant ones. The behaviour is unclear below 40% sequence identity because below this level, percentage sequence identity is a very poor measure of similarity.

To observe the effect of size alone, the same experiment was carried out except the seeds were filtered randomly to produce sets the same size as those filtered on percentage sequence identity (figure 3.8). It is clear from these results that there is no significant difference between the sets of different sizes, and so this cannot explain the observed differences between SCOP classes.

Figure 3.8: The ratio of true to false hits for sets of models randomly filtered to sizes equivalent to those sets produced by percentage sequence identity filtering. The numbers in the legend correspond to the percentage sequence identity-filtered set with the same number of members.
\resizebox*{0.9\textwidth}{!}{\includegraphics{figures/graph3.eps}}

3.4.1.3 Inter-superfamily relationships

From the graphs above it can be seen that class 3 has by far the greatest ratio of false hits to true hits, so an experiment was devised to test the hypothesis that this is an artifact of classification. If for some reason proteins in class 3 which are related have been classified in different folds (the treatment of same-fold hits as ambiguous rules out merely trans-superfamily relationships), then these can be detected by low-scoring (good) inter-superfamily relationships. Any inter-superfamily hits below an E-value threshold of 0.001 define an `allowed' inter-superfamily relationship. These relationships then become true hits rather than false hits.

Figure 3.9: The ratio of true to false hits for all SCOP classes where inter-superfamily hits below an E-value of 0.001 define `allowed' relationships.
\resizebox*{0.9\textwidth}{!}{\includegraphics{figures/graph4.eps}}

Figure 3.9 shows that the difference between class 3 and the others is due to inter-superfamily relationships which are being detected by the HMMs, which may be true relationships or errors in the HMMs. Classes 5 and 6 still produce more false hits to true hits than the others, but they are both quite small and the data is insufficient to draw conclusions from as can be seen by the `stepping' nature of the graph. What is also interesting is that relatively few inter-superfamily relationships are detected below 0.001 and that they appear to account for most of the false hits. To confirm that the improvement observed in class 3 by allowing certain relationships is across all scores, and is not merely explained by the exclusion of false hits below 0.001, the same graph was plotted simply making all hits below 0.001 true (figure 3.10).

Figure 3.10: The ratio of true to false hits for all SCOP classes where hits below an E-value of 0.001 are considered true.
\resizebox*{0.9\textwidth}{!}{\includegraphics{figures/graph5.eps}}

Although the errors are reduced by considering all hits below an E-value of 0.001 as true, they only account for a partial contribution and the same differences as in figure 3.6 between classes are observed. It is thus safe to conclude that the differences between classes are due to discrepancies between the SCOP classification and the relationships detected by the HMMs, and that for some reason there are many more of these in class 3 than any other. Later on it is discussed whether these are failures of the HMMs or genuine relationships.

3.4.1.4 A change in behaviour

Since an E-value of 0.001 as a threshold is a somewhat arbitrary but reasonable value, the number of inter-superfamily relationships detected by the HMMs was plotted for a range of E-values to investigate what a suitable threshold might be (figure 3.11).

Figure 3.11: The number of inter-superfamily relationships detected by models from all SCOP classes across a range of E-value thresholds.
\resizebox*{0.9\textwidth}{!}{\includegraphics{figures/graph6.eps}}

Most classes seem to detect a similar number of relationships (although class 3 has by far the greatest number of hits between them), with the exception of class 4 which detects far fewer. This is still unexplained. Examining the two extremes of the graph shows that at very high E-values relationships increase linearly implying totally random behaviour, and at very low E-values the number of relationships is almost constant implying a fixed number of non-random relationships.

To optimise a model library assembled from models such as have been created for the test, it is desirous to eliminate as many false hits as possible by removing models which are the cause of the false hits, yet at the same time keeping as many models as possible. To look at this issue the proportion of models making false hits was plotted for different E-value thresholds (figure 3.12). This is another approach to the same question addressed in the previous graph.

Figure 3.12: The proportion of models from all SCOP classes making false hits against the E-value threshold.
\resizebox*{0.9\textwidth}{!}{\includegraphics{figures/graph7.eps}}

A detailed analysis of this data revealed a change in behaviour at some point in all of the curves. This change in behaviour can also be observed in the previous graph but is far less marked. The observation of this change in behaviour lead to an important conclusion which explains all of the data presented in this section. In retrospect it is obvious and the reader may have already been guided to the same conclusion, but the implication is very important for the work which follows.

Errors observed in HMMs are the result of two very distinct causes:

At low E-values almost all of the errors will come from model-building because the E-value being low means the expected number of scoring errors is very small. At high E-values the errors will be completely dominated by scoring errors because the number of scoring errors rises much faster than the model-building errors. The change in behaviour in the graphs is at the point at which the errors swap from being dominated by model-building errors to scoring errors. The importance of this conclusion is that although scoring errors are inevitable, there is the possibility of improving a model library by eliminating model-building errors.

3.4.1.5 Excluding models

The point at which the change in behaviour was observed in the previous graphs was used as a threshold to identify models which had errors in them from the building process. This amounted to 4% of all models, and these 187 models were found to be producing 90% of the false hits. As mentioned earlier these `false' hits could be occurring for two reasons: either genuine inter-superfamily relationships are being detected, or the model has been poisoned during the model-building stage (I call these bad models). On close examination of the 187 models roughly half turned out to be bad models. The other half were explained by reasonable inter-superfamily relationships, a large proportion of which were Rossmann-like folds in class 3 with very distant similarities. There are some Rossmann folds which are obviously related and these were mentioned in the description of the test, but there are many more less obvious relationships which are plausible. This explains the observation of differences between class 3 and the others.

Comparing the errors produced by the model library before and after the bad models are excluded shows that the data fits the theoretical value predicted by the E-value better afterward than before, especially in the critical region (figures 3.13 and 3.14). At the bottom of the graph of the modified library no scoring errors are observed where there should be some; this is an artifact of using the E-value to select the bad models, and ideally an independent measure would be preferable. This result supports the explanation of scoring errors and model-building errors, with a small number of bad models generating most of the errors.

Figure 3.13: The errors per query versus E-value for all SCOP classes. The theoretical prediction is that the E-value should equal the errors per query.
\resizebox*{0.9\textwidth}{!}{\includegraphics{figures/graph8.eps}}

Figure 3.14: The errors per query versus E-value for all SCOP classes where bad models have been excluded based on inter-superfamily hits. The theoretical line is the diagonal.
\resizebox*{0.9\textwidth}{!}{\includegraphics{figures/graph9.eps}}

3.4.1.6 A validated threshold

Since there is some discrepancy between the theoretical E-value and the observed errors per query it is possible to choose an E-value threshold which corresponds to the desired real error rate. Figure 3.15 is focused on the critical region of the modified library, and shows that the theoretical E-value is very close to the observed error rate.

Figure 3.15: The E-value versus observed errors per query in the critical region for a model library with all detected bad models removed. The dotted line is the ideal curve.
\resizebox*{0.9\textwidth}{!}{\includegraphics{figures/graph10.eps}}

3.5 Comparison to PDB-ISL

A comparison was made to another method for assigning structural domains to protein sequences which has previously been shown to perform well[58].

3.5.1 PDB-ISL

3.5.1.1 Intermediate sequences

Sequence homology is distributive, so that if two sequences `a' and `b' are related, and if `b' and `c' are also related, we may infer that `a' and `c' are related. It may be that we are only able to detect homology between `a' and `b' and between `b' and `c'. In this case we say that `b' is an intermediate sequence for detecting homology between `a' and `c'. This is a powerful technique as very distant relationships can be detected via a number of intermediates.

3.5.1.2 PDB-ISL

The Protein Data Bank Intermediate Sequence Library (PDB-ISL) is a library of sequences homologous to proteins in the PDB which can be used as intermediates for detecting remote homologues to proteins of known structure. The intermediates were found using PSI-Blast[28], and the performance of PDB-ISL is comparable to PSI-Blast alone, the advantage being that PDB-ISL is faster to search and hence applicable on a genomic scale. More information on PDB-ISL can be obtained from [58].

3.5.2 Genome assignments

3.5.2.1 Comparing coverage

Work by Jong Park and Sarah Teichmann used PDB-ISL to assign structural domains to 6 complete genomes, so the model library was also used to carry out assignments to the same genomes so that the coverage could be compared. The results (figures 3.16 and 3.17) show that a significant improvement in coverage has been obtained over PDB-ISL. Both methods aim to achieve an error rate of less than 1%, but this can only be verified on PDB sequences.

Figure 3.16: The percentage of sequences with an assignment for selected genomes: af (Archaeoglobus fulgidus), ec (Escherichia coli), ph (Pyrococcus horikoshii), rp (<