The reference species tree of (sequenced) life
This tree is fully resolved (bifurcating, with estimated branch lengths) and takes full advantage of both protein structural evolutionary information and existing taxonomic information. It is reconstructed with RAxML under the NCBI taxonomy constraint using structural phylogenomic information: SCOP domains (at both the superfamily and family levels) and supradomains (at the superfamily level only). After reconstruction, the tree is annotated with the NCBI taxonomy so that each internal node is either mapped to a unique taxon_id or left empty (assumed to be a hypothetical unknown ancestor).
The NCBI taxonomy
The NCBI taxonomy incorporates phylogenetic and taxonomic knowledge from a variety of sources into a less-resolved common species tree. Topologically, it is multifurcating, with some nodes having more than two descendants. Its branch lengths are uniform, so it carries no quantitative information for measuring divergence. Notably, it also includes taxonomic ranks; the commonly known ones are, from high to low, Superkingdom, Kingdom, Phylum, Class, Order, Family, Genus and Species, alongside nodes with no rank. Keep in mind that it is not binary: a node may have only one child or more than two children, in addition to exactly two.
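The difference between a fully resolved tree and a taxonomy-like tree can be checked mechanically by counting direct children per node. The sketch below is illustrative only: it assumes the tree is available as a plain mapping from each node to its list of children, and the function names and toy node labels are hypothetical rather than part of any SUPERFAMILY or NCBI interface.

```python
from collections import Counter


def child_count_profile(children):
    """Count nodes by their number of direct children.

    `children` maps a node label to the list of its direct children;
    nodes that never appear as a key are treated as leaves.
    """
    nodes = set(children)
    for kids in children.values():
        nodes.update(kids)

    profile = Counter()
    for node in nodes:
        n = len(children.get(node, []))
        if n == 0:
            profile["leaf"] += 1
        elif n == 1:
            profile["unary"] += 1
        elif n == 2:
            profile["binary"] += 1
        else:
            profile["multifurcating"] += 1
    return profile


def is_fully_resolved(children):
    """True when every internal node has exactly two children."""
    profile = child_count_profile(children)
    return profile["unary"] == 0 and profile["multifurcating"] == 0


# Toy taxonomy-like tree: one node with three children, one with a single child.
toy_taxonomy = {"root": ["A", "B", "C"], "A": ["A1"], "B": ["B1", "B2"]}
# Toy resolved tree over the same leaves: every internal node is binary.
toy_resolved = {"root": ["X", "C"], "X": ["A1", "B"], "B": ["B1", "B2"]}

print(dict(child_count_profile(toy_taxonomy)))
print(is_fully_resolved(toy_taxonomy), is_fully_resolved(toy_resolved))
```

On the toy inputs, the first tree is reported as not fully resolved because it contains both a single-child node and a three-child node, while the second passes the check.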
Gateway to tree browsing
To navigate this reference species tree of (sequenced) life, we display the path leading upwards from a given node to its ancestor at the superkingdom level (i.e., all ancestral nodes of the current node of interest, in sequential order). Its direct children are also listed. For each clade along the path, we use TreeVector for visualization and provide the tree in Newick format for download, with nodes labelled by Codes (i.e., the 2-letter genome identifiers used by the SUPERFAMILY database), TaxIDs (i.e., NCBI taxonomy IDs), or Names (full names).
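The same kind of navigation can be reproduced offline from a downloaded Newick file. The following is a minimal sketch, not the site's own code: it assumes a well-formed Newick string in which every node (leaf and internal) carries a unique, unquoted label, and the labels `aa`, `bb`, `cc`, `n1` and `root` are made-up placeholders rather than real genome codes.

```python
def parse_newick(text):
    """Parse a simple Newick string into (parent, length) dictionaries.

    Assumes well-formed input with unique, unquoted labels on all nodes
    and no comments. `parent[root]` is None.
    """
    text = text.strip().rstrip(";")
    parent, length = {}, {}
    pos = 0

    def read_node():
        nonlocal pos
        kids = []
        if pos < len(text) and text[pos] == "(":
            pos += 1                          # skip "("
            kids.append(read_node())
            while text[pos] == ",":
                pos += 1                      # skip ","
                kids.append(read_node())
            pos += 1                          # skip ")"
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        label = text[start:pos]
        if pos < len(text) and text[pos] == ":":
            pos += 1                          # skip ":"
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length[label] = float(text[start:pos])
        for kid in kids:
            parent[kid] = label
        return label

    root = read_node()
    parent[root] = None
    return parent, length


def path_to_root(parent, node):
    """List the node followed by all of its ancestors, in upward order."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def direct_children(parent, node):
    """List the direct children of a node, sorted by label."""
    return sorted(kid for kid, up in parent.items() if up == node)


parent, length = parse_newick("((aa:0.1,bb:0.2)n1:0.3,cc:0.4)root;")
print(path_to_root(parent, "aa"))     # ['aa', 'n1', 'root']
print(direct_children(parent, "n1"))  # ['aa', 'bb']
```

Real downloads may contain unlabelled internal nodes (those left empty in the annotation), which this sketch does not handle; an established Newick library would be a safer choice for production use.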
Applications in small-scale studies
The tree or its derived subtrees can be used to display the distribution of: 1) a specific domain, such as the Nuclear receptor ligand-binding domain (sunid=48508), over the path leading from human upwards to Eukaryota; or 2) a whole set of domains annotated to a specific GO term, such as the domains annotated to stem cell maintenance (GO:0019827), over the same path from human upwards to Eukaryota.
Applications in large-scale studies
The more promising applications annotate whole extant/ancestral domain repertoires. As demonstrated here, we first apply Dollo parsimony to infer ancestral superfamily domain repertoires at the major branching points in eukaryotic evolution, obtaining for each point the domains that were present, as well as those gained and lost relative to its direct parent. Then, we use domain-centric GO annotations to perform enrichment analysis of the present/gained/lost ancestral domain repertoires at each of the superfamily and family levels. The ancestral GO terms inferred in eukaryotes by this enrichment analysis indicate functional implications during eukaryotic evolution.
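To make the Dollo step concrete, here is a small sketch of one common formulation for presence/absence characters: each domain is gained exactly once, at the last common ancestor of all genomes that carry it, and is then present at every node below that point that still has at least one carrying descendant; everything else counts as a loss. This is an illustration under that assumption, not necessarily the exact procedure used here, and the tree, node labels and domain names (`x`, `y`, `z`) are invented.

```python
def dollo_ancestral_states(children, root, leaf_sets):
    """Infer the set of characters present at each node under Dollo parsimony.

    `children` maps a node to its list of children, `leaf_sets` maps each
    leaf to the set of characters (e.g. domains) observed in it.
    """
    # Build an order in which every node comes after all of its descendants.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))
    order.reverse()

    characters = set().union(*leaf_sets.values())
    present = {node: set() for node in order}
    for ch in characters:
        # Number of carrying leaves below (or at) each node.
        count = {}
        for node in order:
            kids = children.get(node, [])
            if kids:
                count[node] = sum(count[k] for k in kids)
            else:
                count[node] = 1 if ch in leaf_sets.get(node, set()) else 0
        total = count[root]
        if total == 0:
            continue  # character not carried by any leaf of this tree
        for node in order:
            kids = children.get(node, [])
            # Deepest node covering all carriers: the single gain point.
            is_gain_node = count[node] == total and all(
                count[k] < total for k in kids
            )
            if 0 < count[node] < total or is_gain_node:
                present[node].add(ch)
    return present


def gains_and_losses(children, present):
    """Compare each node with its direct parent."""
    changes = {}
    for up, kids in children.items():
        for kid in kids:
            changes[kid] = {
                "gained": present[kid] - present[up],
                "lost": present[up] - present[kid],
            }
    return changes


toy_tree = {"root": ["n1", "D"], "n1": ["n2", "C"], "n2": ["A", "B"]}
toy_leaves = {"A": {"x", "y"}, "B": {"y"}, "C": {"x"}, "D": {"z"}}

present = dollo_ancestral_states(toy_tree, "root", toy_leaves)
changes = gains_and_losses(toy_tree, present)
for node in ("n1", "n2", "B"):
    print(node, sorted(present[node]), changes[node])
```

In the toy example, `x` is carried by `A` and `C`, so it is gained once at their common ancestor `n1` and subsequently lost on the branch to `B`. The resulting present/gained/lost sets per node are the kind of input a downstream GO enrichment analysis would take.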