Gene function inference from expression using probabilistic topic models

Tercan, Bahar
The main aim of this study is to develop a probabilistic biclustering approach that helps address the following question: "Can we determine the biological context of a sample (tissue, condition, etc.) from expression data, and associate these contexts with annotation databases such as Gene Ontology, KEGG and HUGE in order to discover annotations (such as cell division, metabolic process or illness) for them?" We applied a nonparametric probabilistic topic model, the Hierarchical Dirichlet Process (HDP), which was originally developed in text mining to extract an unknown number of latent topics from documents, to the analysis of gene expression data. In this analogy, the mRNA transcript corresponds to the word, the biological context to the topic, and the sample to the document. This study builds on previous work that has, to varying extents, applied topic models to the problem of differential expression, and improves on the current state of the art with a comprehensive and integrative method that enhances HDP with prior information. The main proposed improvements are the preprocessing of gene expression data for topic models and the introduction of informed priors to the HDP model. Experimental results showed that the prior-improved HDP successfully reveals the hidden biclusters in gene expression data, with higher robustness to changes in sparsity level (number of samples) and prior strength (η).
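
The transcript/word and sample/document analogy can be illustrated with a minimal sketch. It is not the preprocessing developed in this study; it only assumes, for illustration, that non-negative expression values are rounded to integer "word counts" so that each sample becomes a bag of transcripts that a topic model such as HDP could consume. The function name `sample_to_document`, the `scale` parameter, and the gene identifiers are all hypothetical.

```python
from collections import Counter


def sample_to_document(expression, scale=1.0):
    """Turn one sample's expression values into a bag of transcript 'words'.

    expression: dict mapping transcript ID -> non-negative expression value.
    scale: hypothetical factor controlling how expression maps to counts.
    Rounding to integers is an assumption made for this sketch only.
    """
    doc = Counter()
    for transcript, value in expression.items():
        count = int(round(value * scale))
        if count > 0:  # transcripts that round to zero do not occur in the document
            doc[transcript] = count
    return doc


# Toy data: two samples (documents) over three transcripts (vocabulary).
samples = {
    "sample_A": {"GENE1": 3.2, "GENE2": 0.1, "GENE3": 5.0},
    "sample_B": {"GENE1": 0.0, "GENE2": 4.6, "GENE3": 1.4},
}

# The corpus maps each sample name to its bag of transcript counts.
corpus = {name: sample_to_document(expr) for name, expr in samples.items()}
```

In this representation, a topic inferred by the model is a distribution over transcripts, and each sample is described as a mixture of such topics, which is what allows a topic to be read as a biological context.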