Cluster Overview of Archaeal Life

  • New to COAL? Please check the Help section!
  • Can't see the applet? Make sure applets are enabled in your browser

 


COAL documentation

  1. What is COAL?
  2. COAL clusters
  3. Integrating phylogeny
  4. Integrating metadata
  5. Exporting clusters to IMG
  6. FAQ
  7. References

1. What is COAL?

COAL is an acronym for Cluster Overview of Archaeal Life. The purpose of COAL is to visualize protein orthology and to relate it to additional information derived from the genomes that encode the proteins. This information includes phylogeny, ecotype, metabolism, thermal preference and aerobicity. The protein orthology networks are also subclustered, when possible, using a bipartitioning approach based on spectral clustering. Subclustering is useful because protein families differ widely in sequence plasticity: some families, e.g. ABC transporters, are quite flexible, while others, e.g. ribosomal subunits, are very tightly conserved. Given this heterogeneity, a single hard clustering cutoff is unlikely to produce biologically relevant clusters for all classes of proteins. The soft approach used here lets the user stop subclustering at a point that makes biological sense.

2. COAL clusters

COAL clusters are of three types: root, stem and leaf. Root clusters are at the top level of the cluster hierarchy and are identified by integer cluster numbers, such as 41 or 8. Any number of the form 41.1 or 1144.0.1 denotes a subcluster. For example, 41.1 is one of the two subclusters of root cluster 41. A subcluster that has subclusters of its own is a stem cluster; a subcluster that has none is a leaf cluster.
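
The numbering scheme above can be read mechanically. The following Python sketch is only an illustration and not part of COAL: the function names and the example ID set are made up, and it assumes that a subcluster ID is always its parent's ID followed by a dot and one more number.

```python
from typing import Optional, Set


def parent_id(cluster_id: str) -> Optional[str]:
    """Return the ID of the enclosing cluster, or None for a root cluster."""
    head, sep, _ = cluster_id.rpartition(".")
    return head if sep else None


def cluster_type(cluster_id: str, all_ids: Set[str]) -> str:
    """Classify an ID as 'root', 'stem' or 'leaf' using only the set of known IDs."""
    if parent_id(cluster_id) is None:
        return "root"  # plain integers such as 41 or 8
    # A subcluster is a stem if any known ID names it as its parent.
    has_children = any(parent_id(other) == cluster_id for other in all_ids)
    return "stem" if has_children else "leaf"


# Hypothetical ID set, for illustration only.
EXAMPLE_IDS = {"41", "41.0", "41.1", "41.1.0", "41.1.1"}

for cid in sorted(EXAMPLE_IDS):
    print(cid, cluster_type(cid, EXAMPLE_IDS))
```

In this made-up set, 41 is the root, 41.1 is a stem because it has subclusters of its own, and 41.0, 41.1.0 and 41.1.1 are leaves.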

Subclustering of root clusters is performed using spectral clustering (see References). We attempt to divide each cluster into two subclusters at a time. The separation is accepted if the second eigenvalue of the Markov transition matrix exceeds a threshold. Note that this is a threshold set on the normalized transitions and not on protein orthology itself. It therefore depends more on the topology of the network than on the orthologies themselves.
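
To make the acceptance test concrete, here is a minimal NumPy sketch of one bipartition attempt. It is not COAL's implementation. The function name, the threshold value of 0.8, the splitting rule (sign of the second eigenvector) and the example matrix are all assumptions chosen for illustration; the only part taken from the description above is that a split is accepted when the second eigenvalue of the Markov transition matrix exceeds a threshold.

```python
import numpy as np


def bipartition(similarity, threshold=0.8):
    """Attempt one two-way split of a cluster.

    similarity: symmetric, non-negative matrix of edge weights in which
                every node has at least one edge.
    threshold:  hypothetical cutoff on the second eigenvalue.
    Returns (accepted, labels, second_eigenvalue).
    """
    w = np.asarray(similarity, dtype=float)
    degree = w.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    # D^-1/2 W D^-1/2 is symmetric and has the same eigenvalues as the
    # row-normalized Markov transition matrix P = D^-1 W.
    s = w * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(s)  # eigenvalues in ascending order
    lambda2 = float(eigvals[-2])          # second-largest eigenvalue
    # Rescale to the matching right eigenvector of P; its signs give two groups.
    v2 = d_inv_sqrt * eigvecs[:, -2]
    labels = (v2 > 0).astype(int)
    return lambda2 > threshold, labels, lambda2


# Two tight pairs of proteins joined by one weak edge (made-up weights).
two_blocks = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.05, 0.0],
    [0.0, 0.05, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
accepted, labels, lambda2 = bipartition(two_blocks)
print(accepted, labels, round(lambda2, 3))
```

Because the weights are normalized by node degree before the eigenvalues are computed, multiplying every weight by the same factor leaves the result unchanged. This is one way to see why the criterion reflects network topology rather than absolute similarity.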

To select a cluster and load it into the applet, enter a cluster number into the Cluster box on the Main page and click Update. The cluster appears as a network of nodes (proteins) connected by edges representing orthology. Information about the cluster is loaded below the applet, along with information about the individual proteins. Initially, networks are not colored. To get more information about a protein, either shift-click its node in the applet or follow its Gene OID link in the list below the applet. Either action takes you to the IMG entry for that protein.

3. Integrating phylogeny

Proteins can be highlighted according to the phylogenetic placement of their genomes. To color nodes in the applet and proteins in the list, select the level of phylogeny from the drop-down list on the Main page and click Phylogeny. The nodes are colored, and the list is sorted and set to display the Phylum, Class and Species levels of taxonomy. You can view other metadata by clicking one of the buttons above the list; the phylogenetic ordering and coloring are maintained.

Note that if your selection returns a large number of categories, the coloring will fail.

4. Integrating metadata

Currently, there are four categories of metadata in COAL: oxygen usage (e.g. aerobe, anaerobe), metabolism (e.g. chemoorganoheterotroph, chemolithoautotroph), thermal preference (e.g. hyperthermophile, mesophile) and ecotype (e.g. marine, aquatic). These data were taken from the GOLD database, where more detailed information can be found. In addition, COG, PFAM and arCOG annotations can be used to color nodes and proteins.

Nodes and proteins are colored in the same way as described for phylogeny in the previous section.

5. Exporting clusters to IMG

All genes that are included in a cluster can be exported to the IMG gene cart by clicking the IMG button in the left column. They can then be analyzed using IMG as normal. You can also go to the gene page directly by either clicking the link in the members table or shift-clicking a node in the graph.

Medusa can be found at SourceForge.

6. FAQ

Is the length of an edge any indicator of the strength of similarity between proteins?

No, the edge length depends on the layout only. Edge lengths cannot faithfully represent similarity, since we are reducing a multidimensional object to two dimensions. However, the relative strength of orthology is visualized as the opacity of the edge: weak similarities are drawn more translucent, and strong similarities more opaque.
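
As an illustration of the idea, a similarity score can be mapped to an edge opacity along the following lines. This is a sketch under assumptions, not the applet's actual code: the function name, the linear mapping and the minimum opacity of 0.15 are all hypothetical.

```python
def edge_alpha(score, min_score, max_score, floor=0.15):
    """Map a similarity score linearly to an opacity between floor and 1.0.

    floor is a hypothetical minimum opacity that keeps the weakest
    edges faintly visible instead of fully transparent.
    """
    if max_score <= min_score:
        return 1.0  # all edges equally strong: draw them fully opaque
    t = (score - min_score) / (max_score - min_score)
    t = min(1.0, max(0.0, t))  # clamp scores outside the expected range
    return floor + (1.0 - floor) * t


# Made-up scores: the weakest, a middling and the strongest edge in a cluster.
for score in (10, 60, 110):
    print(score, edge_alpha(score, min_score=10, max_score=110))
```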

Got a question? Please contact shooper /at/ lbl.gov

7. References

  • Brewer, M.L., Development of a spectral clustering method for the analysis of molecular data sets. J Chem Inf Model, 2007. 47(5): p. 1727-33.
  • Paccanaro, A., J.A. Casbon, and M.A. Saqi, Spectral clustering of protein sequences. Nucleic Acids Res, 2006. 34(5): p. 1571-80.
  • Markowitz, V.M., et al., The integrated microbial genomes (IMG) system in 2007: data content and analysis tool extensions. Nucleic Acids Res, 2007.
  • Liolios, K., et al., The Genomes On Line Database (GOLD) in 2007: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res, 2007.