Generate SCOP Domain Assignments using the SUPERFAMILY Models
This page describes how to produce SCOP protein domain assignments
using the SUPERFAMILY hidden Markov models (HMMs) and associated scripts.
Introduction
The process involves running a set of FASTA-formatted sequences
against the models using the scripts from the ftp site. The result is a set of SCOP
domain assignments at the superfamily level and, optionally, the family level.
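Before running anything against the models, it can help to confirm that the input really is FASTA. The following is a minimal sketch, not part of the SUPERFAMILY scripts: the function names count_fasta_records and fasta_looks_valid are hypothetical, and the check only looks at header lines (lines starting with ">").

```sh
# Hypothetical helpers for a quick look at a FASTA file.

# count_fasta_records FILE: print the number of header lines (one per sequence).
count_fasta_records() {
    grep -c '^>' "$1"
}

# fasta_looks_valid FILE: succeed if the first non-blank line is a header.
fasta_looks_valid() {
    first=$(awk 'NF { print; exit }' "$1")
    case "$first" in
        '>'*) return 0 ;;
        *)    return 1 ;;
    esac
}
```

For example, `fasta_looks_valid human.fa && count_fasta_records human.fa` prints the number of sequences only if the file starts with a header line.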
This page is divided into three main sections:
1: Setup models and scripts
explains how to download and set up the
SUPERFAMILY models and scripts. The additional programs and data files that are
required are also described.
2: Use scripts to produce domain assignments
gives a simple and an advanced example of how to obtain domain assignments by running
the scripts, and illustrates the command line options of the scripts.
3: Domain assignment output formats
lists the data obtained in the domain assignment output file.
The default output file contains both superfamily and
family level domain assignments.
Setting up the models and scripts is a multi-step process.
There may be issues for some combinations of machines and operating systems.
If you have read this document and the relevant sections of the SAM or hmmer
documentation and are still having a problem, please contact us by email at
superfamily@mrc-lmb.cam.ac.uk or through the feedback form.
Alternatively, we can produce domain assignments for your
sequences. All we require is a set of protein sequences in FASTA format.
1: Setup models and scripts
The scripts are written in Perl; any recent version of Perl
should work. Around 500 MB of hard disk space is required. We assume you are
using a Linux/Unix environment.
1.1
Register for a SUPERFAMILY license (free for academic and commercial use).
Download the SUPERFAMILY models and scripts:
ftp supfam.org
cd models
get model.tab.gz
get sam_1.69.tar.gz
get self_hits.tar.gz
cd ../scripts
mget *
bye
When logging into the SUPERFAMILY ftp server you will be prompted for a password.
Use the password you receive after registering for a license.
1.2
The hmmscore program from the SAM software package is recommended [12364612]
for scoring sequences against the SUPERFAMILY models.
Register for a SAM license (free for academic use).
Download SAM and follow the installation instructions that come with it.
The scripts that run the models require the hmmscore program to be on your PATH (the command search path environment variable).
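Whether hmmscore is reachable can be checked before starting a long run. This is a minimal sketch; the function name require_on_path is hypothetical and is not part of the SUPERFAMILY or SAM distributions.

```sh
# require_on_path NAME: succeed if NAME resolves to a command via PATH,
# otherwise print a message on stderr and fail.
require_on_path() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "error: '$1' not found on PATH" >&2
    return 1
}
```

A typical use would be `require_on_path hmmscore && ./superfamily.pl human.fa`. If the check fails, add the directory that holds the SAM binaries to PATH; its location depends on where you installed SAM.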
Please note that the scripts also support the hmmer HMM software
as an alternative to SAM. For clarity, the use of hmmer is described separately in section
4: Using SUPERFAMILY models with hmmer.
1.3
Download the SCOP 1.69 dir.des.scop.txt and dir.cla.scop.txt files:
wget http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.des.scop.txt_1.69
wget http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.cla.scop.txt_1.69
mv dir.des.scop.txt_1.69 dir.des.scop.txt
mv dir.cla.scop.txt_1.69 dir.cla.scop.txt
These files are required for the family level classification [16877569].
1.4
Set up the infrastructure required by the scripts:
gunzip model.tab.gz
tar xvfz sam_1.69.tar.gz
tar xvfz self_hits.tar.gz
mkdir scratch scripts
chmod u+x *.pl
cp a2m2selex.pl familyassignment.pl assignment.pl fasta_checker.pl scripts
Leave the superfamily.pl script in the current working directory.
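The result of steps 1.1 to 1.4 can be sanity-checked with a short script. The sketch below assumes that model.tab, the two SCOP files and superfamily.pl all sit in the same working directory, as the commands above suggest; the function name check_layout is hypothetical. It does not check the unpacked model directories, because their names depend on the contents of the archives.

```sh
# check_layout [DIR]: report files and directories that the steps above
# should have produced but which are missing. Returns 0 if nothing is missing.
check_layout() {
    dir=${1:-.}
    missing=0
    for f in superfamily.pl model.tab dir.des.scop.txt dir.cla.scop.txt \
             scripts/a2m2selex.pl scripts/familyassignment.pl \
             scripts/assignment.pl scripts/fasta_checker.pl; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing file: $f" >&2
            missing=1
        fi
    done
    for d in scratch scripts; do
        if [ ! -d "$dir/$d" ]; then
            echo "missing directory: $d" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Running `check_layout` with no argument checks the current directory.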
2: Use scripts to produce domain assignments
Run superfamily.pl to produce the domain assignments:
# Simple
./superfamily.pl human.fa
# Advanced
./superfamily.pl human.fa human.ass scratch binmodlib model.tab hmmscore 0.0001 n scripts y
An explanation of the command line options in the Advanced example follows:
human.fa     Sequence file in FASTA format
human.ass    File which will contain the domain assignment results
scratch      Directory where temporary files will be created
binmodlib    Directory/file containing the HMMs
model.tab    Contains information on the SCOP superfamilies and the
             HMMs used to represent them
hmmscore     Name of the HMM scoring program to use
0.0001       Evalue threshold; domains with a greater evalue are excluded
n            Do not remove temporary files in the scratch directory
scripts      Directory containing helper scripts
y            Generate SCOP family level assignments in addition to
             superfamily level assignments
The order of command line options is important.
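Because the options are positional, one way to avoid mistakes is to keep each value in a named variable and assemble the command line in a fixed order. The following is only a sketch: the function build_superfamily_cmd and the SF_* variable names are hypothetical, and the defaults are simply the values from the Advanced example.

```sh
# build_superfamily_cmd: print a superfamily.pl command line with the
# options in the order shown in the Advanced example. Each value can be
# overridden through a (hypothetical) SF_* environment variable.
build_superfamily_cmd() {
    fasta=${SF_FASTA:-human.fa}       # 1. sequence file in FASTA format
    out=${SF_OUT:-human.ass}          # 2. domain assignment results file
    scratch=${SF_SCRATCH:-scratch}    # 3. directory for temporary files
    models=${SF_MODELS:-binmodlib}    # 4. directory/file containing the HMMs
    tab=${SF_TAB:-model.tab}          # 5. superfamily and model information
    prog=${SF_PROG:-hmmscore}         # 6. HMM scoring program
    evalue=${SF_EVALUE:-0.0001}       # 7. evalue threshold
    cleanup=${SF_CLEANUP:-n}          # 8. n = keep temporary files
    helpers=${SF_HELPERS:-scripts}    # 9. directory containing helper scripts
    family=${SF_FAMILY:-y}            # 10. y = also assign SCOP families
    echo "./superfamily.pl $fasta $out $scratch $models $tab $prog $evalue $cleanup $helpers $family"
}
```

Calling `build_superfamily_cmd` with no overrides prints the Advanced example unchanged; `SF_FASTA=mouse.fa SF_OUT=mouse.ass build_superfamily_cmd` changes only the first two options (mouse.fa is an illustrative file name). The printed line can be reviewed and then pasted into the shell. Values containing spaces are not handled by this sketch.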
3: Domain assignment output formats
Output is a tab-delimited file of domains, one domain per line.
There can be more than one domain per sequence, and there may be sequences for which there
is no domain assignment.
The columns for superfamily and family assignment (the default) are:
Sequence ID
SCOP superfamily ID
SUPERFAMILY model ID
Match region
Evalue score
Alignment to model
SCOP superfamily description
Family evalue
Family
Closest structure
The columns for superfamily assignment only are:
Sequence ID
SCOP superfamily ID
SUPERFAMILY model ID
Match region
Evalue score
Alignment to model
SCOP superfamily description
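Since the output is tab-delimited with the sequence ID in column 1 and the evalue in column 5, it can be summarised with standard tools. The sketch below is an illustration under those assumptions only; the function names domains_per_sequence and filter_by_evalue are hypothetical and are not part of the SUPERFAMILY scripts.

```sh
# domains_per_sequence FILE: print each sequence ID (column 1) with the
# number of domain lines assigned to it, sorted by ID.
domains_per_sequence() {
    awk -F '\t' '{ n[$1]++ } END { for (id in n) print id "\t" n[id] }' "$1" | sort
}

# filter_by_evalue FILE THRESHOLD: keep only lines whose evalue (column 5)
# is less than or equal to THRESHOLD.
filter_by_evalue() {
    awk -F '\t' -v t="$2" '($5 + 0) <= (t + 0)' "$1"
}
```

For example, `filter_by_evalue human.ass 1e-10 > human.strict.ass` would keep only the more confident assignments (the output file name here is illustrative). Sequences with no domain assignment do not appear in the file, so they are absent from the counts as well.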
Please contact us if you have further questions.
4: Using SUPERFAMILY models with hmmer
Some additional steps must be taken to use the
SUPERFAMILY models with hmmer. We assume that you have followed all the steps
in section 1: Setup models and scripts.
4.1
Download hmmer and follow the installation instructions that come with it
(no license required, free for academic and commercial use).
4.2
Download the hmmer formatted models from the models directory on the SUPERFAMILY ftp site:
ftp supfam.org
cd models
get hmmer_1.69.tar.gz
bye
4.3
Extract the models for hmmer from the compressed file:
tar xvfz hmmer_1.69.tar.gz
The hmmer formatted models will be placed in a directory called hmmermodlib.
4.4
Join all hmmer models together into one file:
cd hmmermodlib
ls | xargs cat >> ../hmmermodlib.hmm
cd ..
The hmmpfam program, from the hmmer package, requires all HMMs in a single file.
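Note that the command in step 4.4 appends (>>) to hmmermodlib.hmm, so running it a second time would add every model twice. A variant that can be re-run safely is sketched below; the function name join_hmmer_models is hypothetical, and the destination file is assumed to live outside the source directory.

```sh
# join_hmmer_models SRC_DIR DEST_FILE: concatenate every file under SRC_DIR
# into DEST_FILE, in sorted order, replacing any previous DEST_FILE.
join_hmmer_models() {
    src=$1
    dest=$2
    : > "$dest"    # start from an empty file so reruns do not duplicate models
    find "$src" -type f | sort | while IFS= read -r model; do
        cat "$model" >> "$dest"
    done
}
```

From the working directory this would be called as `join_hmmer_models hmmermodlib hmmermodlib.hmm`.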
4.5
Run superfamily.pl, using hmmer, to produce the domain assignments:
./superfamily.pl human.fa human.ass scratch hmmermodlib.hmm model.tab hmmpfam 0.0001 n scripts y
If you have further questions, suggestions or comments, please contact
us using the feedback form or by email at superfamily@mrc-lmb.cam.ac.uk.