Breast Cancer Genomics

This is the support web page for breast cancer related genomics initiatives and research projects centered around the Cincinnati Breast Cancer and Environment Research Center (BCERC) and the Center for Environmental Genetics (CEG). The web page serves as the portal for locally generated breast cancer related microarray data and as the analysis portal for publicly available datasets that we have downloaded from public repositories, processed and analyzed. Our goal is to eventually process all publicly available genomics datasets related to breast cancer. In addition to offering downloads and querying of the data, we aim to perform comprehensive analysis of the data using our Context-Specific Infinite Mixture computational algorithms and to provide unrestricted access to the results of these analyses.

Hardware and Computational Infrastructure

The basic hardware infrastructure supporting this web portal consists of a data server (dual Opteron, 6.4 TB raw disk capacity, 16 GB RAM), a computing server (two dual-core Opterons, 8 GB RAM) and a web server (dual Xeon, 3.0 GHz). The relational database storing all microarray data resides on the data server, and computational analysis and data mining are performed on the computing server. All servers run the open source SUSE Linux operating system.

The relational database storing all microarray data generated by CEG researchers and processed by the Bioinformatics Core is based on MaxD (http://www.bioinf.man.ac.uk/microarray/maxd/index.html), the MySQL version of ArrayExpress (Brazma et al. 2003). The database is MAGE-OM compliant (Spellman et al. 2002), and the tools associated with the MaxD system allow the automatic creation of MAGE-ML and MIAME compliant XML representations of the data. We have developed a suite of R procedures for creating, maintaining and querying databases based on the MaxD schema using the RMySQL package. The web page that supports querying of the breast cancer genomics data relies in part on the CGIwithR package. To download data from the Gene Expression Omnibus (GEO) database, we use the Bioconductor package GEOquery.
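
The two data paths described above (download from GEO, query the local MySQL database from R) can be sketched roughly as follows. This is a minimal illustration, not the portal's actual procedures: the function names fetch_geo_expression and query_database are hypothetical, the connection settings are placeholders, and no table or column names from the MaxD schema are assumed. Only getGEO (GEOquery), exprs (Biobase), and dbConnect, dbGetQuery and dbDisconnect (DBI/RMySQL) are existing package functions.

```r
# Sketch only. Function names and all arguments below are illustrative
# assumptions, not the portal's real code or configuration.
# Packages are called with namespace prefixes, so they only need to be
# installed when the functions are actually run.

# Download one GEO series and return the expression matrix
# (rows = probes, columns = samples) of its first platform.
fetch_geo_expression <- function(accession) {
  # GEO series accessions have the form "GSE" followed by digits.
  stopifnot(is.character(accession), length(accession) == 1,
            grepl("^GSE[0-9]+$", accession))
  # With GSEMatrix = TRUE, getGEO returns a list of ExpressionSet objects,
  # one per platform used in the series.
  gse_list <- GEOquery::getGEO(accession, GSEMatrix = TRUE)
  Biobase::exprs(gse_list[[1]])
}

# Run one SQL statement against a MySQL database and return a data frame.
# The SQL text is supplied by the caller; no schema is assumed here.
query_database <- function(sql, dbname, host, user, password) {
  con <- DBI::dbConnect(RMySQL::MySQL(), dbname = dbname, host = host,
                        user = user, password = password)
  # Close the connection even if the query fails.
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  DBI::dbGetQuery(con, sql)
}
```

A call such as fetch_geo_expression("GSE...") with a real series accession would return a numeric matrix ready for further processing in R, and query_database("SHOW TABLES", ...) would list the tables of whatever database the supplied credentials point to.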

Data Analysis

All data are processed and analyzed in R using a variety of procedures. Large-scale analysis of all data in the database will be performed using the Context-Specific Infinite Mixture Model and related tools developed in-house.

People Involved and Role in this Project

Mario Medvedovic, PhD - leading the effort.

Kaustubh Shinde, MS - database administrator and data analyst. Constructs and maintains all relational databases and
designs the web page that facilitates querying of the data.

Xiangdong Liu, MS - system administrator and Bayesian data mining specialist (PhD candidate in Computer Science).
Maintains all servers and performs complex data mining tasks.

Maureen Sartor, MS - research associate in charge of analyzing all microarray data generated by the UC Genomics core.

Prachi Kothiyal - PhD student in bioinformatics. Downloads, processes and analyzes public datasets.