Publicly available gene expression data can easily be accessed from online repositories such as the Gene Expression Omnibus (GEO). Analysis of these datasets by the wider scientific community can provide novel insights into the molecular biology of tissue development and function. However, researchers with limited bioinformatics knowledge or experience are often unable to analyse these datasets effectively.
To overcome this, we developed an Excel-based macro that is simple to use and offers a range of sorting and grouping capabilities familiar to most researchers. Users download the desired data from the GEO repository and condense it into a single spreadsheet for input into the macro. The minimum information required is a unique gene identifier (e.g., Affymetrix ID) and an associated expression measure (e.g., present/absent calls) from each microarray sample to be analysed. The macro, written in Visual Basic, sorts the rows to rank genes from highest to lowest expression (or by present/absent calls) across the replicate arrays being examined. Testing showed that the macro can accurately analyse GEO datasets of up to at least 64 microarrays, containing over 45,000 rows and 33 columns of data.
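The ranking step can be illustrated with a short sketch. It is written in Python rather than Visual Basic and is not the macro's own code. It assumes each spreadsheet row holds a probe identifier followed by one (expression value, detection call) pair per replicate array, with "P" marking a present call. The probe names, the function name `rank_genes`, and the tie-breaking rule (mean expression among genes with equal present-call counts) are hypothetical choices made for illustration.

```python
from statistics import mean


def rank_genes(rows):
    """Rank genes by present-call count, then by mean expression.

    rows: list of (probe_id, samples), where samples is a list of
    (expression_value, detection_call) pairs, one per replicate array.
    Returns (probe_id, present_count, mean_expression) tuples, highest first.
    """
    ranked = []
    for probe_id, samples in rows:
        values = [value for value, _ in samples]
        # Count replicate arrays in which the probe was called present ("P").
        present = sum(1 for _, call in samples if call == "P")
        ranked.append((probe_id, present, mean(values)))
    # Assumed ordering: most present calls first, then highest mean expression.
    ranked.sort(key=lambda r: (-r[1], -r[2]))
    return ranked


# Hypothetical three-array dataset; identifiers and values are made up.
example = [
    ("probe_A", [(120.5, "P"), (98.2, "P"), (110.0, "P")]),
    ("probe_B", [(15.1, "A"), (30.4, "P"), (12.9, "A")]),
    ("probe_C", [(850.0, "P"), (910.3, "P"), (790.6, "P")]),
]

for probe_id, present, avg in rank_genes(example):
    print(f"{probe_id}\tpresent={present}\tmean={avg:.1f}")
```

Sorting on the present-call count first keeps consistently detected genes at the top of the list, so a low-confidence probe with one high reading does not outrank them.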
To determine whether the macro can facilitate the development of novel molecular hypotheses, we incorporated its output into a pipeline of similarly simple, publicly available tools that assess gene ontology, transcriptional regulation, and embryonic gene expression patterns (e.g., DAVID Gene Ontology and PASTAA promoter analyses; GenePaint in situ hybridisation data). Analysis of lens and gut GEO datasets using the macro and associated pipeline identified candidate regulatory receptors and transcription factors not previously described in each tissue type, which were subsequently validated in the laboratory. These results demonstrate that the macro and the described pipeline provide a simple tool for non-bioinformaticians to discover new biology from readily accessible public gene expression datasets.