Johannes Freudenberg, Vineet Joshi, Zhen Hu, Mario Medvedovic
Laboratory for Statistical Genomics and Systems Biology
Department of Environmental Health,
University of Cincinnati College of Medicine,
3223 Eden Av. ML 56, Cincinnati OH 45267-0056
Freudenberg JM, Joshi VK, Hu Z, Medvedovic M. CLEAN: CLustering Enrichment ANalysis.
BMC Bioinformatics (2009) 10:234. Pubmed.
Poster presented at OCCBIO 2008.
Background: Integration of biological knowledge encoded in various lists of functionally related genes
has become one of the most important aspects of analyzing genome-wide functional genomics data.
In the context of cluster analysis, the functional coherence of clusters established through such
analyses has been used to identify biologically meaningful clusters, compare clustering
algorithms, and identify biological pathways associated with the biological process under investigation.
Results: We developed a computational framework for analytically and visually integrating
knowledge-based functional categories with the cluster analysis of genomics data.
The framework is based on the simple, conceptually appealing, and biologically interpretable
gene-specific functional coherence score (CLEAN score). The score is derived by correlating
the clustering structure as a whole with functional categories of interest. We directly
demonstrate that integrating biological knowledge in this way improves the reproducibility
of conclusions derived from cluster analysis. The CLEAN score differentiates between the
levels of functional coherence for genes within the same cluster based on their membership
in enriched functional categories. We show that this aspect results in higher reproducibility
across independent datasets and produces more informative genes for distinguishing different
sample types than the scores based on the traditional cluster-wide analysis. We also demonstrate
the utility of the CLEAN framework in comparing clusterings produced by different algorithms.
CLEAN was implemented as an add-on R package and can be downloaded at http://Clusteranalysis.org.
The package integrates routines for calculating gene-specific functional coherence scores and
the open source interactive Java-based viewer Functional TreeView (FTreeView).
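The idea of a gene-specific coherence score can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the CLEAN package's API; the function names (hypergeom_tail, coherence_scores) and the toy data are hypothetical. The sketch assumes that every cluster in the clustering structure is tested against every functional category with a one-sided hypergeometric test, and that a gene receives the best -log10 p-value among the (cluster, category) pairs that contain it. No multiple-testing adjustment is applied here.

```python
from math import comb, log10


def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K belong to the category."""
    upper = min(n, K)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def coherence_scores(clusters, categories, universe):
    """Per-gene score: best -log10 enrichment p-value over all
    (cluster, category) pairs that both contain the gene (assumed definition)."""
    N = len(universe)
    scores = {gene: 0.0 for gene in universe}
    for cluster in clusters:
        for members in categories.values():
            members = members & universe
            overlap = cluster & members
            if not overlap:
                continue
            p = hypergeom_tail(len(overlap), len(cluster), len(members), N)
            s = -log10(p)
            # Only genes that are in the cluster AND in the category get credit.
            for gene in overlap:
                scores[gene] = max(scores[gene], s)
    return scores


# Toy data (hypothetical): nested clusters as produced by a hierarchical clustering.
universe = {"g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"}
clusters = [{"g1", "g2"}, {"g1", "g2", "g3"}, {"g4", "g5"}]
categories = {"catA": {"g1", "g2", "g3"}}

scores = coherence_scores(clusters, categories, universe)
```

In this toy example, g1, g2, and g3 share the score of the cluster that matches catA exactly, while g4 and g5 score zero even though they form a cluster, because they belong to no enriched category.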
Conclusion: Our results indicate that using the gene-specific functional coherence score
improves the reproducibility of the conclusions made about clusters of co-expressed genes
over using the traditional cluster-wide scores. Using gene-specific coherence scores also
simplifies the comparisons of clusterings produced by different clustering algorithms and
provides a simple tool for selecting genes with a "functionally coherent" expression profile.
The CLEAN R package contains R functions to perform the Clustering
Enrichment Analysis. In addition, it provides a number of tools to import and export files in
TreeView format (i.e. .cdt, .gtr, and .atr files), and to match gene
identifiers across species using HomoloGene.
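As an illustration of what reading a TreeView tree file involves, the following minimal Python sketch turns the rows of a .gtr file into the set of genes under each tree node. It is not the CLEAN import routine. It assumes the usual tab-delimited row layout (node ID, left child, right child, similarity) and that child nodes are listed before their parents; the function name clusters_from_gtr and the sample rows are hypothetical.

```python
def clusters_from_gtr(lines):
    """Map each internal node ID to the set of leaf (gene) IDs below it.

    Assumes rows of the form: node <TAB> left <TAB> right <TAB> similarity,
    with children appearing before the nodes that merge them.
    """
    leaves = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        node, left, right = fields[0], fields[1], fields[2]
        # A child not seen as a node so far is treated as a leaf (gene ID).
        leaves[node] = leaves.get(left, {left}) | leaves.get(right, {right})
    return leaves


# Hypothetical .gtr content with three genes.
example_rows = [
    "NODE1X\tGENE0X\tGENE1X\t0.95\n",
    "NODE2X\tNODE1X\tGENE2X\t0.80\n",
]
node_genes = clusters_from_gtr(example_rows)
```

Each value in node_genes is one cluster of the hierarchy, which is the kind of input a cluster-by-category enrichment analysis needs.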
R package CLEAN download (Linux, Windows)
CLEAN annotation R packages for Human, Mouse, and Rat (Linux, Windows)
FTreeView clustering browser (Run
The LRpath function has been incorporated within the R package CLEAN.
The function has been updated to use the new Bioconductor library formats for the functional
annotations (i.e. GO and KEGG) as well as to use other built-in and external functional
categories accessible through CLEAN annotation packages.
Functions to compute the Random Set statistic and the
Generalized Random Set statistic have been incorporated.
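For orientation, a random-set style statistic can be sketched as a z-score that compares the average gene score inside a category with the average expected when the same number of genes is drawn at random without replacement. The Python sketch below is an assumption about the general form of such a statistic, not the implementation in the CLEAN package; the function name random_set_z and the toy scores are hypothetical.

```python
from math import sqrt


def random_set_z(scores, category):
    """Z-score of the mean score in `category` against random gene sets of equal size."""
    genes = [g for g in category if g in scores]
    G, m = len(scores), len(genes)
    if m == 0 or m >= G:
        raise ValueError("category must contain some, but not all, scored genes")
    mu = sum(scores.values()) / G
    var = sum((s - mu) ** 2 for s in scores.values()) / G  # population variance
    if var == 0:
        raise ValueError("all gene scores are identical")
    set_mean = sum(scores[g] for g in genes) / m
    # Standard deviation of the mean of m scores sampled without replacement from G.
    sd = sqrt(var / m * (G - m) / (G - 1))
    return (set_mean - mu) / sd


# Toy gene scores (hypothetical), e.g. differential expression statistics.
toy_scores = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0}
z = random_set_z(toy_scores, {"c", "d"})
```

A large positive z indicates that the genes in the category score higher on average than a random gene set of the same size would.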
Supplemental Materials for the
FTreeView display of genes with statistically
significant CLEAN scores in all four breast cancer
datasets (Figure 8).
analysis for different breast cancer datasets (GSE3494,
Top functionally coherent genes
from analysis of different breast cancer datasets (fTreeView)
additional results can be accessed through our