SUPERFAMILY 1.75 HMM library and genome assignments server

Superfamily is undergoing a server migration - you are now browsing on the new server. Please contact us if you experience any problems.

Install and Query the SUPERFAMILY Database

This page describes how to download, install and query the SUPERFAMILY database.

Introduction

The SUPERFAMILY relational database dump contains sequences, domain assignments, alignments, domain architectures and information on each genome for all completely sequenced organisms. Statistics on domain assignments for each genome, and separately for each family/superfamily are included. Information on the models used to produce the domain assignments are also provided.

This page is divided into four main sections:

Setting up the database is a multi-step process. There may be issues for some combinations of machines and operating systems. If you have read this document and the relevant sections of the database server documentation and are still having a problem, then please contact us: superfamily@cs.bris.ac.uk, feedback form.

System requirements

The database dump is in MySQL format. MySQL is a free open source database. We assume MySQL is installed and running. Any recent version of MySQL should work. You must have a MySQL user account with permissions to create databases and add indexes. We use the MySQL 'root' user in this documentation. To install the entire database dump will require 50 GB of disk space. We assume you are using a linux/unix command line environment. Users unfamiliar with the command line may wish to explore the free MySQL GUI tools.


1: Download database

Download the SUPERFAMILY database.

1.1
Register for a SUPERFAMILY license (free for academic and commercial use).

1.2 Download the database from from the Downloads Section

When logging into the SUPERFAMILY server you will be prompted for a user name and password. Use the login details you receive after registering for a license.

Database dumps are created every week. The filename of each database table contains the date it was created on, for example, align_22-Jun-2008.sql.gz.


Jump to [ Top of page · 1: Download database · 3: Database schema · 4: Query the database ]


2: Install database

Install the SUPERFAMILY database on your machine.

2.1 Uncompress the database tables:
   gunzip *.sql.gz
2.2 Create the database and load the database tables:
   mysql -p -u root -e "CREATE DATABASE superfamily;"
   cat *<date>.sql | mysql -p -u root superfamily
This step should take approximately one hour.

2.3 Login and check the database:
   mysql -p -u root superfamily
   > SHOW TABLES;

You should see a list of 16 tables, which are described in section 3: Database schema.

MySQL configuration tips - for indexing the database
  • Some of the default MySQL configuration settings are not optimal for the SUPERFAMILY database
  • It is possible to achieve improved performance by modifying them
  • If necessary edit your configuration file, in /etc/mysql/my.cnf or /etc/my.cnf
  • Set the key_buffer size in the [isamchk] section of your MySQL configuration file to approximately 25 % of your RAM

This will speed up index creation. See the MySQL reference manual for details.


Jump to [ Top of page · 1: Download database · 2: Install database · 4: Query the database ]


3: Database schema

An explanation of how the tables in the SUPERFAMLY database are organised, including an entity-relationship model (ERM) diagram and a detailed description of each table.

ERM diagram

SUPERFAMILY database Entity Relationship Model

Diagram showing the database schema used in the SUPERFAMILY database dump.

The 16 tables can be divided into 4 groups:
  • Genomes and sequences (red) - genome, protein, genome_sequence
  • Domain statistics (purple) - info, len, len_comb, len_family
  • Protein domains (blue) - ass, family, align, comb
  • SCOP and models (green) - des, cla, hie, model, pdb_sequence

The ERM diagram shows the relationships between tables within each group (solid lines), the main relationships between tables from different groups are related through the hie id values matching px,cl,cf,sf,fa,dm and sp fields in all other tables.

Table descriptions

The SUPERFAMILY database dump consists of the following 16 tables:
> SHOW TABLES;
+-----------------------+
| Tables_in_superfamily |
+-----------------------+
| align                 | 
| ass                   | 
| cla                   | 
| comb                  | 
| des                   | 
| family                | 
| genome                | 
| genome_sequence       | 
| hie                   | 
| info                  | 
| len                   | 
| len_comb              | 
| len_family            | 
| model                 | 
| pdb_sequence          | 
| protein               |
+-----------------------+

Each of these tables is described in turn. Tables are denoted in bold, fields or columns are underlined and examples of data from the database are either in 'single quotes' or labelled with e.g.

align (protein domain)
Table containing alignments between the sequences and hidden Markov models. This is the largest table.
> DESC align;
+-----------+----------------------+------+-----+---------+-------+
| Field     | Type                 | Null | Key | Default | Extra |
+-----------+----------------------+------+-----+---------+-------+
| auto      | int(11) unsigned     | NO   | PRI | NULL    |       | 
| alignment | text                 | NO   |     | NULL    |       | 
| modstart  | smallint(6) unsigned | NO   |     | NULL    |       | 
+-----------+----------------------+------+-----+---------+-------+
  • The auto column is an index which links this table to the ass table.
  • The alignment column shows the sequence alignment for the entry in the ass table. The whole sequence is always shown, some residues in lower case, some in upper case. Upper case residues are matching, and lower case residues are not matching, and thus not involved in the assignment.
  • The modstart column contains the position that the model starts from. This can be linked to a given model via the ass table joined with model table via the sf field.
ass (protein domain)
The domain assignments table containing the superfamily level classification for each domain. It is a relatively large, but fundamental table.
> DESC ass;
+---------+-------------------------------+------+-----+---------+----------------+
| Field   | Type                          | Null | Key | Default | Extra          |
+---------+-------------------------------+------+-----+---------+----------------+
| protein | int(11) unsigned              | NO   | MUL | NULL    |                | 
| model   | smallint(7) unsigned zerofill | NO   | MUL | NULL    |                | 
| region  | varchar(160)                  | NO   |     |         |                | 
| evalue  | double                        | NO   |     | 0       |                | 
| sf      | mediumint(8) unsigned         | NO   |     | 0       |                | 
| auto    | int(11) unsigned              | NO   | PRI | NULL    | auto_increment | 
+---------+-------------------------------+------+-----+---------+----------------+
  • The protein column is the protein id used in the protein table and elsewhere. This gives you access to the genome project specific seqid column previously found in this table in prior SUPERFAMILY releases.
  • The model column is the model responsible for this assignment, e.g. 0038290. It can be linked to the data in the model table. This can also be joined with the pdb_sequence to get the seqid
  • The region column is a comma-separated list of regions. The two numbers separated with a hyphen are the begin and end positions in the sequence of the assignment region. The list of regions together make up the single domain represented by this entry, e.g. 65-84.
  • The evalue column is the E-value of the assignment. The table includes hits with E-values up to a value of 1.
  • The sf column is the SCOP identifier of the superfamily which the assignment has been made to. This relates to the sf column in the cla table, and the id column in the des table.
  • The auto column is for indexing purposes. The auto number in this table refers to the auto number in the align and family tables. Look up the entry in the align or family table with the same value as auto to get the relevant alignment or family details for this domain, respectively.
cla (SCOP models)
Table containing the SCOP domain definitions and classifications. Documentation for the information in the SCOP tables (cla, des and hie) is available here.
> DESC cla;
+----------------+-----------------------+------+-----+---------+-------+
| Field          | Type                  | Null | Key | Default | Extra |
+----------------+-----------------------+------+-----+---------+-------+
| scopid         | char(7)               | NO   | MUL |         |       | 
| pdb            | char(4)               | NO   | MUL |         |       | 
| region         | varchar(32)           | NO   |     |         |       | 
| classification | varchar(16)           | NO   |     |         |       | 
| px             | mediumint(8) unsigned | NO   | PRI | 0       |       | 
| cl             | mediumint(8) unsigned | NO   | MUL | 0       |       | 
| cf             | mediumint(8) unsigned | NO   | MUL | 0       |       | 
| sf             | mediumint(8) unsigned | NO   | MUL | 0       |       | 
| fa             | mediumint(8) unsigned | NO   | MUL | 0       |       | 
| dm             | mediumint(8) unsigned | NO   | MUL | 0       |       | 
| sp             | mediumint(8) unsigned | NO   | MUL | 0       |       | 
+----------------+-----------------------+------+-----+---------+-------+
  • The scopid contains the old SCOP sid style identifier, e.g. d1ffve2
  • The pdb field is the 4 alphanumeric character code used by the PDB to refer to their records describing the 3D structure of proteins. e.g. 1ffv
  • The region is the begin and end coordinates (from the ATOM record of the PDB file) of the human curated SCOP domain definition. Whole chain domains are indicated by the letter of the chain followed by a colon (e.g. E:147-803). This field should not be confused with the region field in the ass table.
  • The classification contains the SCOP concise classification string, sccs. The classification identifier includes only the class (alphabetical), fold, superfamily, and family (all numerical) to which each domain belongs to, e.g. d.133.1.1.
  • The px represents the SCOP domain entry.
  • The cl represents the SCOP class.
  • The cf represents the SCOP fold.
  • The sf represents the SCOP superfamily.
  • The fa represents the SCOP family.
  • The dm represents the SCOP protein domain.
  • The sp represents the species identifier (from the NCBI taxonomy) for the protein being classified.
comb (protein domain)
The production of the domain architectures contained in the comb table is described in detail here. A relatively large table.
> DESC comb;
+---------+------------------+------+-----+---------+-------+
| Field   | Type             | Null | Key | Default | Extra |
+---------+------------------+------+-----+---------+-------+
| protein | int(11) unsigned | NO   | PRI | NULL    |       | 
| comb    | text             | NO   |     | NULL    |       | 
| gaps    | text             | NO   |     | NULL    |       | 
| auto    | text             | NO   |     | NULL    |       | 
+---------+------------------+------+-----+---------+-------+
  • The protein column is the protein id used throughout the database. Joining on the protein table will allow you to retrieve the genome project seqid.
  • The comb column is a comma-separated list of SCOP superfamily IDs corresponding to the domains assigned to that sequence. The domains are in the order in which they are found in the sequence (N to C terminal). Gaps in between domains are shown by '_gap_', instead of a superfamily ID. e.g. 52540,_gap_,48295
  • The gaps column is a comma-separated list of lengths. There are as many entries in this column as there are '_gap_' entries in the comb column. Each one (in order) shows the length of the gap. For more information on how the gap is determined see here.
  • The auto column is a comma-separated list of auto IDs which correspond to rows (domains) in the ass table. Like the gaps column each one (in order) corresponds to the domains in the comb column.
des (SCOP and models)
Table containing the description for each SCOP domain. Documentation for the information in the SCOP tables (cla, des and hie) is available here.
> DESC des;
+----------------+-----------------------------------------------+------+-----+---------+-------+
| Field          | Type                                          | Null | Key | Default | Extra |
+----------------+-----------------------------------------------+------+-----+---------+-------+
| id             | mediumint(8) unsigned                         | NO   | PRI | 0       |       | 
| level          | enum('--','px','cl','cf','sf','fa','dm','sp') | NO   | MUL | --      |       | 
| classification | varchar(10)                                   | NO   |     |         |       | 
| name           | varchar(7)                                    | NO   |     |         |       | 
| description    | varchar(128)                                  | NO   |     |         |       | 
+----------------+-----------------------------------------------+------+-----+---------+-------+
  • The id is the SCOP unique identifier, sunid. It refers to the id in the hie table and the scopid in the cla table, e.g. 52540 for the 'P-loop containing nucleoside triphosphate hydrolases' superfamily.
  • The level in the SCOP hierarchy. Can be one of 'cl' for class, 'cf' for fold, 'sf' for superfamily, 'fa' for family, 'dm' for protein domain, 'px' for domain entry, 'sp' for species or '--' for the root node.
  • The classification contains the SCOP concise classification string sunid. The classification identifier includes only the class (alphabetical), fold, superfamily, and family (all numerical) to which each domain belongs to, e.g. 'a.1.1.1' for the 'truncated hemoglobin' family.
  • The name contains the old SCOP sid style identifier when the level equals 'px' (domain entry), and a '-' otherwise.
  • The description is a short explanation describing the SCOP entry. When the level field equals 'px', it gives the domain definition in terms of ATOM coordinates in PDB records (e.g. 1gvh A:1-146).
family (protein domain)
The family level classification for each domain, which is contingent upon the superfamily level classification in the 'ass' table. The family level sub-classification method is described here.
> DESC family;
+--------+-----------------------+------+-----+---------+-------+
| Field  | Type                  | Null | Key | Default | Extra |
+--------+-----------------------+------+-----+---------+-------+
| auto   | int(11) unsigned      | NO   | PRI | NULL    |       | 
| evalue | double                | YES  |     | 999     |       | 
| px     | mediumint(8) unsigned | NO   |     | 0       |       | 
| fa     | mediumint(8) unsigned | NO   |     | 0       |       | 
+--------+-----------------------+------+-----+---------+-------+
  • The auto column is the index which links this table to the ass table.
  • The evalue column is the score for the suggested family conditional on the superfamily membership. The evalue of the superfamily assignment is in the ass table.
  • The px column is the SCOP identifier for the domain which has the closest structure to the sequence. There may be closer structures in SCOP but they will be more than 95% identical in sequence to the one displayed here. The value of this column corresponds to the px column in the cla and des tables.
  • The fa column shows the SCOP family identifier for the domain entry in the ass table. The value of this column corresponds to the fa column in the cla table and the id column in the des table.
genome (genomes and sequences)
Some general information on each genome; including genome name, taxonomy and download details.
> DESC genome;
+---------------+-----------------------------------+------+-----+------------+-------+
| Field         | Type                              | Null | Key | Default    | Extra |
+---------------+-----------------------------------+------+-----+------------+-------+
| genome        | char(3)                           | NO   | PRI | NULL       |       | 
| name          | varchar(150)                      | NO   | MUL | NULL       |       | 
| include       | enum('y','s','1','n','m')         | NO   | MUL | n          |       | 
| excuse        | varchar(15)                       | NO   |     |            |       | 
| domain        | enum('E','B','A','P','C','V','-') | NO   | MUL | -          |       | 
| comment       | varchar(255)                      | NO   |     |            |       | 
| taxonomy      | varchar(512)                      | NO   |     |            |       | 
| taxon_id      | mediumint(8) unsigned             | YES  | MUL | NULL       |       | 
| download_link | varchar(512)                      | NO   |     |            |       | 
| download_date | date                              | NO   |     | 1970-01-01 |       | 
| homepage      | varchar(255)                      | NO   |     |            |       | 
+---------------+-----------------------------------+------+-----+------------+-------+
  • The genome column is a two or three letter character code used throughout the database, e.g. 'dm' for Drosophila melanogaster (fruit fly). The codes are not always as intuitive as this.
  • The name column is the full name of the genome, and may contain HTML tags.
  • The include column indicates whether or not the genome is a completely sequenced genome of a single organism. If the column is set to 'y' then it is a completely sequenced genome. 's' indicates we have classified the genome as a strain of an already existing model organism. '1' is used to denote the "longest transcript per gene" genomes which have had alternative splice forms removed. If the column is set to 'n', then this entry is peculiar in some way. For example early draft proteomes, or sequence collections like UniProt which contain sequences that do not belong to a single organism.
  • The excuse column indicates very briefly in which way entries marked with a 'n' in the include column are not completely sequenced genomes of a single organism.
  • The domain column indicates what kind of proteome the sequence collection is: 'E' for Eukaryota, 'B' for Bacteria, 'A' for Archaea, 'P' for Plasmid, 'C' for Chloroplast, and '-' for anything unclassifiable.
  • The comment column describes in more detail the genome, where necessary.
  • The taxonomy column gives the abbreviated taxonomy of the genome obtained from the NCBI taxonomy e.g. Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta; Pterygota; Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha; Ephydroidea; Drosophilidae; Drosophila
  • The taxon_id column gives the identifier used by the taxonomy browser at NCBI for this genome. This field is not unique. We include the same genome from multiple projects, e.g. Drososphila melanogaster from both FlyBase 'dm' and Ensembl 'dd'.
  • The download_link column shows the location where the genome sequences were downloaded from.
  • The download_date column indicates the date when the genome was downloaded.
  • The homepage column is a link to the homepage of the genome project, if one exists.
genome_sequence (genomes and sequences)
A table containing the sequences for all genomes. The second largest table.
> DESC genome_sequence;
+----------+-----------------------+------+-----+---------+----------------+
| Field    | Type                  | Null | Key | Default | Extra          |
+----------+-----------------------+------+-----+---------+----------------+
| protein  | int(11) unsigned      | NO   | PRI | NULL    | auto_increment | 
| sequence | text                  | YES  | MUL | NULL    |                | 
| length   | mediumint(9) unsigned | NO   | MUL | 0       |                |
+----------+-----------------------+------+-----+---------+----------------+
  • The protein column is protein id used throughout the database. This can be used to get the genome project seqid.
  • The sequence is the amino acid sequence of the protein identified by the seqid field in protein.
  • The length column is simply the length of the sequence in number of amino acids.
hie (SCOP and models)
The SCOP database uses a hierarchical classification scheme. This table contains the parent and children nodes of each SCOP entry. Note that information on the level of each SCOP identifier is contained in the 'des' table. The levels from highest to lowest: root, class, fold, superfamily, family, protein domain, species and domain entry. Documentation for the information in the SCOP tables (cla, des and hie) is available here.
> DESC hie;
+--------------+-----------------------+------+-----+---------+-------+
| Field        | Type                  | Null | Key | Default | Extra |
+--------------+-----------------------+------+-----+---------+-------+
| id           | mediumint(8) unsigned | NO   | PRI | 0       |       | 
| parent       | mediumint(8) unsigned | NO   | MUL | 0       |       | 
| children     | text                  | YES  |     | NULL    |       | 
| num_children | smallint(5) unsigned  | NO   |     | 0       |       | 
+--------------+-----------------------+------+-----+---------+-------+
  • The id is the SCOP identifier which refers to the id field in the des table.
  • The parent field is the SCOP identifier representing the parent node of the id field.
  • The children is a comma separated list of SCOP identifiers for the children of the id field.
  • The num_children is the number of SCOP identifiers in the children field.
info (domain statistics)
Domain assignment statistics for each genome.
> DESC info;
+----------------------+-----------------------+------+-----+---------+-------+
| Field                | Type                  | Null | Key | Default | Extra |
+----------------------+-----------------------+------+-----+---------+-------+
| genome               | char(3)               | NO   | PRI | NULL    |       | 
| genes                | int(10) unsigned      | NO   | MUL | 0       |       | 
| matches              | int(10) unsigned      | NO   |     | 0       |       | 
| percent              | tinyint(3) unsigned   | NO   | MUL | 0       |       | 
| coverage             | tinyint(3) unsigned   | NO   |     | 0       |       | 
| domains              | int(10) unsigned      | NO   |     | 0       |       | 
| superfamilies        | smallint(4) unsigned  | NO   |     | 0       |       | 
| average_family_size  | float                 | NO   |     | 0       |       | 
| percent_duplication  | tinyint(3) unsigned   | NO   |     | 0       |       | 
| average_length       | smallint(5) unsigned  | NO   |     | 0       |       | 
| average_hit_length   | smallint(5) unsigned  | NO   |     | 0       |       | 
| domain_combinations  | mediumint(8) unsigned | NO   |     | 0       |       | 
| families             | smallint(5) unsigned  | NO   |     | 0       |       | 
| unique_architectures | mediumint(8) unsigned | NO   |     | 0       |       | 
+----------------------+-----------------------+------+-----+---------+-------+
  • The genome column is the two letter code used throughout the database.
  • The genes column is the number of unique sequences in the genome as provided by the genome project.
  • The matches column is the number of sequences in the genome with at least one assignment.
  • The percent column is the percentage of sequences in the genome with at least one assignment.
  • The coverage column is the percentage of all residues from all sequences in the genome covered by assignments.
  • The domains column is the number of domains assigned to sequences in the genome.
  • The superfamilies column is the number of distinct superfamilies with at least one domain assigned to a sequence in the genome.
  • The average_family_size column is the average number of domains assigned per superfamily in the genome.
  • The percent_duplication column is the number of domains assigned, minus the number of superfamilies, expressed as a percentage of all domains assigned.
  • The average_length column shows the average length of a sequence in the genome.
  • The average_hit_length column shows the average length of the domain assignments made to the genome.
  • The domain_combinations column is the number of unique domain combinations in the genome.
  • The families column is the number of SCOP families with at least one domain assigned to a sequence in the genome.
  • The unique_architectures column is the number of unique domain architectures found in the genome compared with all others in SUPERFAMILY.
len (domain statistics)
Occurrence of SCOP superfamilies in each genome.
> DESC len;
+---------------+-----------------------+------+-----+---------+-------+
| Field         | Type                  | Null | Key | Default | Extra |
+---------------+-----------------------+------+-----+---------+-------+
| genome        | char(3)               | NO   | PRI | NULL    |       | 
| sf            | mediumint(8) unsigned | NO   | PRI | 0       |       | 
| number        | mediumint(8) unsigned | NO   | MUL | 0       |       | 
| prot_number   | mediumint(8) unsigned | NO   |     | 0       |       | 
| family_number | smallint(5) unsigned  | NO   |     | 0       |       | 
+---------------+-----------------------+------+-----+---------+-------+
  • The genome column contains the two letter code used for the genomes throughout the database, and relates to the genome column in the genome table.
  • The sf column is the SCOP identifier of the superfamily which the assignment has been made to. This relates to the sf column in the cla table, and the id column in the des table.
  • The number column is the number of domains assigned to the superfamily in the sf column for the genome in the genome column.
  • The prot_number column is the number of sequences assigned to the superfamily in the sf column for the genome in the genome column.
  • The family_number column is the number of significant family assignments within the superfamily in the sf column for the genome in the genome column.
len_comb (domain combination statistics)
Occurrence of domain combinations in each genome.
> DESC len_comb;
+------------+-----------------------+------+-----+---------+----------------+
| Field      | Type                  | Null | Key | Default | Extra          |
+------------+-----------------------+------+-----+---------+----------------+
| genome     | char(3)               | NO   | MUL | NULL    |                | 
| comb       | text                  | NO   | MUL | NULL    |                | 
| number     | mediumint(8) unsigned | NO   | MUL | 0       |                | 
| comb_index | int(10) unsigned      | NO   | PRI | NULL    | auto_increment | 
+------------+-----------------------+------+-----+---------+----------------+
  • The genome column contains the two-three letter code used for the genomes throughout the database, and relates to the genome column in the genome table.
  • The comb column contains a specific domain combination in question. This relates to the comb column in the comb table, and the id column in the hie table if you take each domain id individually from the comma delimited list.
  • The number column is the number of occurences of the combination in the specified genome in the genome column.
  • The comb_index column is here purely for database indexing purposes.
len_family (domain statistics)
Occurrence of SCOP families in each genome.
> DESC len_family;
+-------------+-----------------------+------+-----+---------+-------+
| Field       | Type                  | Null | Key | Default | Extra |
+-------------+-----------------------+------+-----+---------+-------+
| genome      | char(3)               | NO   | PRI | NULL    |       | 
| fa          | mediumint(8) unsigned | NO   | PRI | 0       |       | 
| number      | mediumint(8) unsigned | NO   | MUL | 0       |       | 
| prot_number | mediumint(8) unsigned | NO   |     | 0       |       | 
+-------------+-----------------------+------+-----+---------+-------+
  • The genome column contains the two letter code used for the genomes throughout the database, and relates to the genome column in the genome table.
  • The fa column is the SCOP identifier of the family which the assignment has been made to. This relates to the fa column in the cla table, and the id column in the des table.
  • The number column is the number of domains assigned to the family in the fa column for the genome in the genome column.
  • The prot_number column is the number of sequences assigned to the family in the fa column for the genome in the genome column.
model (SCOP and models)
Details of which superfamilies the models represent, and the seed sequences (from SCOP) which were used to build the models.
> DESC model;
+-------------+----------------------------------------------------------------------+------+-----+---------+-------+
| Field       | Type                                                                 | Null | Key | Default | Extra |
+-------------+----------------------------------------------------------------------+------+-----+---------+-------+
| model       | smallint(7) unsigned zerofill                                        | NO   | PRI | NULL    |       | 
| seed        | mediumint(8) unsigned                                                | NO   | MUL | 0       |       | 
| sf          | mediumint(8) unsigned                                                | NO   | MUL | 0       |       | 
| date        | date                                                                 | NO   |     | NULL    |       | 
+-------------+----------------------------------------------------------------------+------+-----+---------+-------+
  • The model column is the identifier for the hidden Markov model, and corresponds to the filename of each model, e.g. 0034782.
  • The seed column is the SCOP identifier for the protein seed sequence, which relates to px in the cla and pdb_sequence tables, plus id in the des table.
  • The sf column is the SCOP identifier of the superfamily which the model represents, and relates to sf in the cla table, and id in the des table.
  • The date column shows the date the model was built.
pdb_sequence (SCOP and models)
The filtered sequences from ASTRAL which were used to build the models.
> DESC pdb_sequence;
+----------+-----------------------+------+-----+---------+-------+
| Field    | Type                  | Null | Key | Default | Extra |
+----------+-----------------------+------+-----+---------+-------+
| px       | mediumint(8) unsigned | NO   | PRI | 0       |       | 
| seqid    | char(7)               | NO   | MUL |         |       | 
| filter   | tinyint(4) unsigned   | NO   | MUL | 0       |       | 
| comment  | varchar(255)          | NO   |     |         |       | 
| sequence | text                  | NO   |     | NULL    |       | 
| length   | smallint(5) unsigned  | NO   | MUL | 0       |       | 
+----------+-----------------------+------+-----+---------+-------+
  • The px column is the SCOP identifier of the sequence, and relates to the px column in the cla table, and the id column in the des table.
  • The seqid column is the sequence identifier, from ASTRAL. These are the old SCOP sids and correspond to the scopid column in the cla table and the name column in the des table, e.g. d1t95a3.
  • The filter column is used for selecting what sequence identity the sequence is filtered to. To get a set of sequences filtered to a given percentage, select all sequences which have a value in the filter column which is less than or equal to your chosen percentage identity.
  • The comment column is the comment taken from the ASTRAL files, and contains the SCOP sccs classification.
  • The sequence column is the sequence itself.
  • The length column is just the sequence length.
protein (genomes and sequences)
Allow a protein level abstraction of sequence being specific to a genome, this allows for the UniProt genome to use the same sequences in the database as other genomes. This likely affects many previous queries you may have used prior to SUPERFAMILY version 1.75 onwards.
> DESC protein;
+---------+------------------+------+-----+---------+-------+
| Field   | Type             | Null | Key | Default | Extra |
+---------+------------------+------+-----+---------+-------+
| genome  | char(3)          | NO   | PRI | NULL    |       | 
| seqid   | varchar(100)     | NO   | PRI |         |       | 
| protein | int(11) unsigned | NO   | MUL | NULL    |       | 
| comment | text             | YES  |     | NULL    |       | 
+---------+------------------+------+-----+---------+-------+
  • The genome column is a two or three letter character code used throughout the database to identify a genome.
  • The seqid column is the sequence identifier given by the genome project.
  • The protein column contains the unique protein ID.
  • The comment column holds a description of the protein or, any comments from the genome project.

Additional tables

Users may wish to consider adding database tables from other projects, such as

Jump to [ Top of page · 1: Download database · 2: Install database · 3: Database schema ]


4: Query the database

Some examples of common database queries. If you are unfamiliar with SQL (Structured Query Language) we suggest you work through the
MySQL tutorial before looking at the example queries.

Examples of common database queries

We present some examples for each of the 16 database tables, as well as cross-table queries.

genome and genome_sequence (updated to use protein)
Select all sequences from the Homo sapiens (human) genome which contain the string 'ELVIS':
SELECT genome_sequence.* FROM genome_sequence, protein WHERE sequence LIKE '%ELVIS%' AND genome_sequence.protein = protein.protein AND protein.genome = 'hs';
Select all sequences from the Saccharomyces cerevisiae (Baker's yeast) genome:
SELECT genome_sequence.* FROM genome_sequence, protein WHERE genome_sequence.protein = protein.protein AND protein.genome = 'sc';
Get a list of all species names and their 2 character genome codes, NCBI taxonomy identifiers and the kingdom of life the species belongs to:
SELECT genome, taxon_id, domain, name FROM genome ORDER BY domain DESC, name;

Results are ordered by kingdom of life (Eukaryotes, Bacteria, Archaea, unclaffified) and alphabetically by species name.

genome and info
Select domain assignment statistics at a genomic level for the hominids (two legged primates):
SELECT genome.name, info.*  
FROM genome, info 
WHERE genome.genome = info.genome AND 
      genome.taxonomy LIKE '%Hominid%' AND 
      genome.include = 'y';
ass
Select all superfamily level domain assignments for Escherichia coli K12:
SELECT * FROM ass WHERE genome = 'ec';
Select only significant superfamily level assignments for Escherichia coli K12:
SELECT * FROM ass WHERE genome = 'ec' AND evalue <= 0.0001;
Select domain assignments for the regulatory 'KRAB domain (Kruppel-associated box)' superfamily from the Mus musculus (mouse) genome:
SELECT * FROM ass WHERE genome = 'mm' AND sf = 109640;
ass and family
Select all superfamily and family level assignments for Oryza sativa (rice):
SELECT ass.genome, ass.seqid, ass.sf, ass.model, ass.region, ass.evalue, family.fa, family.evalue
FROM ass, family 
WHERE family.auto = ass.auto AND
      ass.genome = 'os';
Same as above, but with superfamily and family descriptions from SCOP, plus the most similar structure again from SCOP:
SELECT ass.genome, ass.seqid, ass.sf, des_sf.description, ass.model, ass.region, ass.evalue, 
       family.fa, family.evalue, des_fa.description, family.px
FROM ass, family, des AS des_sf, des AS des_fa 
WHERE family.auto = ass.auto AND 
      des_sf.id = ass.sf AND 
      des_fa.id = family.fa AND
      ass.genome = 'os';

This query uses the des table twice, aliased as des_sf and des_fa. The descriptions for the SCOP family (family.fa) and SCOP superfamily (ass.sf) are retrieved from the des_fa and des_sf aliased tables respectively.

ass and align
Select all significant alignments for the 'ISP domain' superfamily from the FlyBase version of the Drosophila melanogaster (fruit fly) genome:
SELECT ass.genome, ass.seqid, ass.sf, ass.model, ass.region, ass.evalue, align.alignment 
FROM ass, align 
WHERE ass.auto = align.auto AND 
      ass.genome = 'dm' AND 
      ass.sf = 50022 AND 
      ass.evalue <= 0.0001;
[The ISP (iron-sulfur protein) domain is involved in electron transport]

ass and comb
Select all domain architectures for the 'Nuclear receptor ligand-binding domain' superfamily from the Caenorhabditis elegans (nematode) genome:
SELECT DISTINCT comb.genome, comb.seqid, comb.comb 
FROM comb, ass 
WHERE ass.genid = comb.genid AND 
      ass.genome = 'ce' AND 
      ass.evalue <= 0.0001 AND 
      ass.sf = 48508;
[Nuclear receptors are important in medical research into metabolic diseases]

Select the 10 most frequently occurring domain architectures in Caenorhabditis elegans:
SELECT COUNT(comb) AS comb_count, comb 
FROM comb 
WHERE genome = 'ce' 
GROUP BY comb 
ORDER BY comb_count DESC 
LIMIT 10;
len and len_family
Select family level domain assignment statistics for Arabidopsis thaliana (thale cress):
SELECT len_family.genome, len_family.fa, des.description, len_family.number, len_family.prot_number 
FROM len_family, des 
WHERE len_family.fa = des.id AND  
      len_family.genome = 'at' AND 
      len_family.number != 0;
Select the top 10 superfamily domains which are repeated on the same protein for Homo sapiens (human):
SELECT len.genome, len.sf, des.description, len.number, len.prot_number, len.number - len.prot_number AS number_diff 
FROM len, des 
WHERE len.sf = des.id AND 
      len.number != 0 AND 
      len.number != len.prot_number AND 
      len.genome = 'hs'
ORDER BY number_diff DESC 
LIMIT 10;
MySQL configuration tips - for querying the database
Suggested MySQL configuration settings for querying the database:
  • Some of the default MySQL configuration settings are not optimal for the superfamily database
  • It is possible to achieve improved performance by modifying them
  • If necessary edit your configuration file, in /etc/mysql/my.cnf or /etc/my.cnf
The main variables to consider modifying are:

The first variable is shared between all MySQL threads. The final two variables are allocated per thread, so care must be taken not to allocate more memory than is available on the machine you are using.

If you have further questions, suggestions or comments then please contact us using the feedback form, or via email superfamily@cs.bris.ac.uk.


Jump to [ Top of page · 1: Download database · 2: Install database · 3: Database schema · 4: Query the database ]