Laboratory for Statistical Genomics and Systems Biology

CLEAN: CLustering Enrichment ANalysis

Johannes Freudenberg, Vineet Joshi, Zhen Hu, Mario Medvedovic

Laboratory for Statistical Genomics and Systems Biology

Department of Environmental Health,

University of Cincinnati College of Medicine,

3223 Eden Ave., ML 56, Cincinnati, OH 45267-0056

Paper accepted for publication in BMC Bioinformatics

Poster presented at OCCBIO 2008

Abstract

Integrating biological knowledge encoded in lists of functionally related genes has become one of the most important aspects of analyzing genome-wide functional genomics data. We developed a novel computational framework for analytically and visually integrating knowledge-based functional categories with the cluster analysis of such data. We demonstrate significant improvements in the reproducibility of conclusions and show the utility of the new framework in comparison with currently used methods.

Background and Significance

Cluster analysis, which identifies groups of co-expressed and co-regulated genes together with their expression and regulation patterns, has been used successfully to elucidate affected biological pathways, decipher transcriptional regulatory mechanisms, and identify relevant sample sub-classes. Biological knowledge is most commonly integrated into such analyses by assessing the enrichment of clusters with genes from pre-defined, functionally coherent gene lists (“functional categories”). Introducing biological knowledge through this kind of post-hoc analysis has been important for interpreting results and for separating reproducible, biologically meaningful gene clusters from clusters that may have resulted from random fluctuations in the data. For both objectives, the reproducibility of the conclusions is of utmost importance.

Methods

We developed a novel framework and a flexible computational infrastructure for integrating knowledge-based functional categories into the cluster analysis of gene expression data. The framework is built around a novel, biologically interpretable, gene-specific functional coherence score (the CLEAN score), derived by correlating the clustering structure as a whole with the functional categories of interest. The statistical significance of coherence scores is established by comparing them to the empirical null distribution obtained by randomly permuting gene identifiers. The definition of the functional coherence score allows multiple sets of functional categories to be integrated (e.g. joint analysis of Gene Ontologies, KEGG pathways, and transcription factor regulatory targets).

The corresponding computational infrastructure consists of an open-source R package for the data analysis and open-source Java software for visually integrating and analyzing expression data and the associated knowledge-based functional categories. The integrative visualization tool provides an intuitive interface for conveying analytical results.

We establish the reproducibility of the functional coherence score across related datasets and its utility in selecting biologically meaningful genes and clusters of genes. We also demonstrate the validity of our procedure for selecting genes with statistically significant coherence scores, as well as its utility in comparing the results of different clustering procedures. The gene-specific CLEAN score, which differentiates between levels of functional coherence for genes within the same cluster, achieves significantly higher reproducibility than currently used cluster-wide scores. Genes selected based on the CLEAN score also produced more precise sample groupings than genes selected using the cluster-wide score. Along with the improved reproducibility, these results suggest that the CLEAN framework is an effective tool for prioritizing gene targets.
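
The scoring and permutation steps described above can be illustrated with a small Python sketch. It is not the CLEAN R package and does not reproduce its exact definitions. As assumptions made only for illustration, the gene-level score is taken to be the best -log10 hypergeometric enrichment p-value over every cluster that contains the gene and every functional category, and the null distribution pools the scores obtained after shuffling gene identifiers. All function names, gene names, and the toy data are hypothetical.

```python
import math
import random


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn from N, of which K are in the category."""
    total = math.comb(N, n)
    hits = sum(math.comb(K, i) * math.comb(N - K, n - i)
               for i in range(k, min(K, n) + 1))
    return hits / total


def coherence_scores(clusters, categories, universe):
    """Assumed gene-level score: max -log10(p) over clusters containing the gene."""
    N = len(universe)
    scores = {g: 0.0 for g in universe}
    for cluster in clusters:
        for members in categories.values():
            members = members & universe
            k = len(cluster & members)
            if k == 0:
                continue
            p = hypergeom_upper_tail(k, len(members), len(cluster), N)
            s = -math.log10(p)
            for g in cluster:
                scores[g] = max(scores[g], s)
    return scores


def permutation_null(clusters, categories, universe, n_perm=200, seed=0):
    """Pool gene-level scores obtained after randomly permuting gene identifiers."""
    rng = random.Random(seed)
    genes = sorted(universe)
    null = []
    for _ in range(n_perm):
        shuffled = genes[:]
        rng.shuffle(shuffled)
        relabel = dict(zip(genes, shuffled))
        permuted = [{relabel[g] for g in c} for c in clusters]
        null.extend(coherence_scores(permuted, categories, universe).values())
    return null


def empirical_p(score, null):
    """Empirical p-value with the usual +1 correction."""
    return (1 + sum(s >= score for s in null)) / (1 + len(null))


# Toy data: 12 hypothetical genes, three nested clusters, one category.
universe = {f"g{i}" for i in range(1, 13)}
clusters = [
    {"g1", "g2", "g3"},
    {"g1", "g2", "g3", "g4", "g5", "g6"},
    set(universe),
]
categories = {"catA": {"g1", "g2", "g3"}}

scores = coherence_scores(clusters, categories, universe)
null = permutation_null(clusters, categories, universe)
p_values = {g: empirical_p(s, null) for g, s in scores.items()}
```

In this toy example, genes g1 to g3 receive a higher score than g4 to g6 even though all six share the larger cluster. This is the kind of within-cluster distinction that a gene-specific score allows and a single cluster-wide score does not.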

 

Software

Supplemental Materials for the paper

  • FTreeView display of genes with statistically significant CLEAN scores in all four breast cancer datasets (Figure 6).

  • CLEAN analysis for different breast cancer datasets (GSE3494, GSE7390)

  • Many additional results can be accessed through our Genomics Portals

Contact

mario.medvedovic@uc.edu or johannes.freudenberg@uc.edu