SUPERFAMILY 1.73 HMM library and genome assignments server

Generate SCOP Domain Assignments using the SUPERFAMILY Models

This page describes how to produce SCOP protein domain assignments using the SUPERFAMILY hidden Markov models (HMMs) and associated scripts.

Introduction

The process involves running a set of FASTA-formatted sequences against the models using the scripts from the FTP site. The result is a set of domain assignments at the SCOP superfamily level and, optionally, at the family level.

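This page is divided into three main sections, followed by a section on using hmmer as an alternative to SAM:

   1: Setup models and scripts
   2: Use scripts to produce domain assignments
   3: Domain assignment output formats
   4: Using SUPERFAMILY models with hmmer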

Setting up the models and scripts is a multi-step process, and some combinations of machines and operating systems may cause problems. If you have read this document and the relevant sections of the SAM or hmmer documentation and are still having a problem, please contact us by email (superfamily@mrc-lmb.cam.ac.uk) or through the feedback form.

Alternatively, we can produce domain assignments for your sequences. All we require is a set of protein sequences in FASTA format.


1: Setup models and scripts

The scripts are written in Perl; any recent version of Perl should work. Around 500 MB of hard disk space is required. We assume you are using a Linux/Unix environment.

1.1 Register for a SUPERFAMILY license (free for academic and commercial use).
Download the SUPERFAMILY models and scripts:

   ftp supfam.org
   cd models
   get model.tab.gz
   get sam_1.73.tar.gz
   get self_hits.tar.gz
   cd ../scripts
   mget *
   bye
When logging into the SUPERFAMILY FTP server you will be prompted for a password. Use the password you receive after registering for a license.

1.2 The hmmscore program from the SAM software package is recommended [ PubMed12364612 ] for scoring sequences against the SUPERFAMILY models.

Register for a SAM license (free for academic use). Download SAM and follow the installation instructions that come with it. The scripts for running the models require the hmmscore program to be in a directory listed in your PATH environment variable.
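
A quick way to confirm the PATH requirement is sketched below. The function name check_on_path is made up for this example and is not part of the SUPERFAMILY or SAM distributions; it only wraps the standard shell built-in "command -v".

```shell
# Report whether a scoring program can be found on PATH.
# check_on_path is a hypothetical helper name, not a SUPERFAMILY script.
# Prints the resolved location on success; returns non-zero otherwise.
check_on_path() {
    prog="$1"
    if location=$(command -v "$prog" 2>/dev/null); then
        echo "$prog found: $location"
    else
        echo "$prog not found on PATH" >&2
        return 1
    fi
}
```

After defining the function, "check_on_path hmmscore" should print the location of the SAM binary. The same check applies to hmmpfam if hmmer is used instead.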

Please note that the scripts also support the hmmer HMM software as an alternative to SAM. For clarity, the use of hmmer is described separately in section 4: Using SUPERFAMILY models with hmmer.

1.3 Download the SCOP 1.73 dir.des.scop.txt and dir.cla.scop.txt files:

   wget http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.des.scop.txt_1.73
   wget http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.cla.scop.txt_1.73
   mv dir.des.scop.txt_1.73 dir.des.scop.txt
   mv dir.cla.scop.txt_1.73 dir.cla.scop.txt
These files are required for the family-level classification [ PubMed16877569 ].

1.4 Set up the infrastructure required by the scripts:

   gunzip model.tab.gz
   tar xvfz sam_1.73.tar.gz
   tar xvfz self_hits.tar.gz
   mkdir scratch scripts
   chmod u+x *.pl
   cp a2m2selex.pl familyassignment.pl assignment.pl fasta_checker.pl scripts
Leave the superfamily.pl script in the current working directory.
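
Before moving on, it may help to confirm that the working directory looks the way steps 1.1 to 1.4 leave it. The sketch below checks only the file and directory names mentioned on this page; check_setup is a hypothetical helper name, and the contents of the extracted model archive are deliberately not checked because this page does not list them.

```shell
# List anything from steps 1.1-1.4 that is missing under the given
# directory (default: current directory).
# check_setup is a hypothetical helper, not part of the distribution.
# Prints one line per missing item; returns non-zero if anything is missing.
check_setup() {
    dir="${1:-.}"
    missing=0
    for f in superfamily.pl model.tab dir.des.scop.txt dir.cla.scop.txt \
             scripts/a2m2selex.pl scripts/familyassignment.pl \
             scripts/assignment.pl scripts/fasta_checker.pl; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing file: $f"
            missing=1
        fi
    done
    if [ ! -d "$dir/scratch" ]; then
        echo "missing directory: scratch"
        missing=1
    fi
    return "$missing"
}
```

Running "check_setup" in the working directory prints nothing when every expected item is present.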


2: Use scripts to produce domain assignments

Run superfamily.pl to produce the domain assignments:

   # Simple
   ./superfamily.pl human.fa

   # Advanced
   ./superfamily.pl human.fa human.ass scratch binmodlib model.tab hmmscore 0.0001 n scripts y

An explanation of the command line options in the Advanced example follows:

   human.fa    Sequence file in FASTA format
   human.ass   File which will contain the domain assignment results
   scratch     Directory where temporary files will be created
   binmodlib   Directory/file containing the HMMs
   model.tab   Contains information on the SCOP superfamilies and the 
               HMMs used to represent them
   hmmscore    Name of the HMM scoring program to use
   0.0001      E-value threshold; domains with a greater E-value are excluded
   n           Do not remove temporary files in the scratch directory
   scripts     Directory containing helper scripts 
   y           Generate SCOP family level assignments in addition to  
               superfamily level assignments
The options are positional, so they must be given in exactly the order shown.
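
Because a misplaced argument is easy to overlook, one option is to assemble the command from named settings and print it for inspection before running it. The sketch below does only that. The function name superfamily_cmd and its variable names are made up for this example; the argument order and default values are taken from the Advanced example above. File names containing spaces are not handled.

```shell
# Print the superfamily.pl command line assembled from named settings.
# superfamily_cmd and the variable names are illustrative only.
# Usage: superfamily_cmd FASTA [OUTPUT] [MODEL_LIB] [SCORER]
superfamily_cmd() {
    fasta="$1"                       # sequence file in FASTA format
    output="${2:-${fasta%.*}.ass}"   # results file, default <name>.ass
    model_lib="${3:-binmodlib}"      # directory/file containing the HMMs
    scorer="${4:-hmmscore}"          # HMM scoring program
    scratch_dir="scratch"            # directory for temporary files
    model_tab="model.tab"            # superfamily/model information
    evalue="0.0001"                  # E-value threshold
    remove_temp="n"                  # n = keep temporary files
    script_dir="scripts"             # directory with helper scripts
    family="y"                       # y = also assign SCOP families
    echo ./superfamily.pl "$fasta" "$output" "$scratch_dir" "$model_lib" \
        "$model_tab" "$scorer" "$evalue" "$remove_temp" "$script_dir" "$family"
}
```

For example, "superfamily_cmd human.fa" prints the Advanced command shown above, which can then be checked and pasted into the shell.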


3: Domain assignment output formats

Output is a tab-delimited file of domains, one domain per line.
There can be more than one domain per sequence, and there may be sequences for which there is no domain assignment.
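
Since sequences without an assignment simply do not appear in the output, a small comparison against the input file can list them. The sketch below is one way to do this with awk. It assumes that the sequence ID in the first output column is the first whitespace-delimited word of the FASTA header line, which is not stated on this page and should be checked against your own output. The function name unassigned_ids is made up for this example.

```shell
# Print the IDs of FASTA sequences that have no line in the assignment file.
# Assumption: column 1 of the assignment file equals the first word of the
# FASTA header (the text after ">"). unassigned_ids is a hypothetical name.
unassigned_ids() {
    fasta="$1"
    assignments="$2"
    awk -F '\t' '
        # First file: remember every sequence ID that has an assignment.
        FILENAME == ARGV[1] { seen[$1] = 1; next }
        # Second file: report header IDs that were never seen.
        /^>/ {
            split(substr($0, 2), parts, /[ \t]/)
            if (!(parts[1] in seen)) print parts[1]
        }
    ' "$assignments" "$fasta"
}
```

A call such as "unassigned_ids human.fa human.ass" prints one unassigned sequence ID per line.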

The columns, for superfamily and family assignment (the default):

   1   Sequence ID
   2   SCOP superfamily ID
   3   SUPERFAMILY model ID
   4   Match region
   5   E-value
   6   Alignment to model
   7   SCOP superfamily description
   8   Family E-value
   9   Family
   10  Closest structure

The columns, for superfamily assignment:

   1   Sequence ID
   2   SCOP superfamily ID
   3   SUPERFAMILY model ID
   4   Match region
   5   E-value
   6   Alignment to model
   7   SCOP superfamily description
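
Because the output is tab-delimited with one domain per line, standard tools such as awk can summarise it. The sketch below counts domains per sequence, optionally applying a stricter E-value cutoff than the one used for the run. It relies only on the column order listed above (sequence ID in column 1, E-value in column 5) and works for both output formats. The function name domains_per_sequence is made up for this example.

```shell
# Count assigned domains per sequence, keeping only rows whose E-value
# (column 5) is at or below a cutoff (default 0.0001).
# Output: "<sequence ID><TAB><count>", sorted by sequence ID.
# domains_per_sequence is a hypothetical helper name.
domains_per_sequence() {
    assignments="$1"
    cutoff="${2:-0.0001}"
    awk -F '\t' -v cutoff="$cutoff" '
        ($5 + 0) <= (cutoff + 0) { count[$1]++ }
        END { for (id in count) printf "%s\t%d\n", id, count[id] }
    ' "$assignments" | sort
}
```

For instance, "domains_per_sequence human.ass 1e-10" reports only the domains that pass the stricter threshold. Sequences with no qualifying domain are omitted from the summary.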

Please contact us if you have further questions.


4: Using SUPERFAMILY models with hmmer

Some additional steps must be taken to use the SUPERFAMILY models with hmmer. We assume that you have followed all the steps in section 1: Setup models and scripts.

4.1 Download hmmer, and follow the installation instructions that come with it (no license required, free for academic and commercial use).

4.2 Download the hmmer formatted models from the models directory on the SUPERFAMILY ftp site:

   ftp supfam.org
   cd models
   get hmmer_1.73.tar.gz
   bye

4.3 Extract the models for hmmer from the compressed file:

   tar xvfz hmmer_1.73.tar.gz
The hmmer formatted models will be placed in a directory called hmmermodlib.

4.4 Join all hmmer models together into one file:

   cd hmmermodlib
   ls | xargs cat >> ../hmmermodlib.hmm
   cd ..
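The hmmpfam program, from the hmmer package, requires all HMMs to be in a single file. Because the command above appends (>>) to ../hmmermodlib.hmm, delete any existing copy of that file before repeating this step; otherwise every model will appear in it twice.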
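
A simple sanity check on the combined file is to compare the number of model files with the number of models in the result. The sketch below assumes that each file in hmmermodlib holds exactly one model and that each model ends with a line containing only "//", as in the hmmer 2 text format used by hmmpfam. Both are assumptions about the downloaded files and should be verified. The function name check_model_count is made up for this example.

```shell
# Compare the number of files in the model directory with the number of
# end-of-model markers ("//" on a line by itself) in the combined file.
# Assumes one model per file; check_model_count is a hypothetical name.
# Returns zero only when the two counts agree.
check_model_count() {
    model_dir="$1"
    combined="$2"
    n_files=$(ls "$model_dir" | wc -l | tr -d ' ')
    # grep -c exits non-zero when the count is 0, so ignore its status.
    n_models=$(grep -c '^//$' "$combined" || true)
    echo "files: $n_files, models in combined file: $n_models"
    [ "$n_files" -eq "$n_models" ]
}
```

Calling "check_model_count hmmermodlib hmmermodlib.hmm" from the working directory reports both counts. A mismatch may indicate that the concatenation was run more than once or was interrupted.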

4.5 Run superfamily.pl, using hmmer, to produce the domain assignments:

   ./superfamily.pl human.fa human.ass scratch hmmermodlib.hmm model.tab hmmpfam 0.0001 n scripts y


If you have further questions, suggestions or comments, please contact us using the feedback form or by email at superfamily@mrc-lmb.cam.ac.uk.