NEXT-GENERATION CELL TYPE ANNOTATION: INTEGRATING NLP AND ML TECHNIQUES FOR ENHANCED SCRNA CLASSIFICATION

2024-9-4
Tandoğan, Orcun Sami
Applying machine learning in molecular biology research is essential for exploring the biomolecular cosmos efficiently. Our research aims to contribute to biotechnology by developing a methodology that optimizes automated cell-type annotation in single-cell RNA sequencing (scRNA-seq) data. In this thesis, we develop a novel approach that combines natural language processing (NLP) and machine learning methods. In the first part of our methodology, we use tokenizers from advanced language models such as BERT, GPT-2, and GPT-3 to create text embeddings of gene symbols. We then reduce the dimensionality of these embeddings using the encoder parts of autoencoders. We combine the reduced embeddings with gene expression data and train prediction models using machine learning methods. We evaluate our method on the PBMC dataset from the Human Cell Atlas. Our results show that our methodology significantly improves cell-type annotation accuracy compared to standard approaches. This study potentially advances our understanding of cellular diversity and function by providing a new computational tool for biotechnology.
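The pipeline described in the abstract (embed gene symbols, compress the embeddings, combine them with expression, classify) can be illustrated with a minimal sketch. It is not the thesis implementation, and every name in it is hypothetical. To stay dependency-free, it swaps in simpler techniques: hash-seeded vectors stand in for tokenizer-derived embeddings, a PCA projection (which spans the same subspace as an optimal linear autoencoder) stands in for a trained autoencoder encoder, and a nearest-centroid rule stands in for the prediction model. The abstract does not say how embeddings and expression are combined, so the expression-weighted average used here is an assumption, and the expression matrix is synthetic toy data.

```python
import hashlib
import numpy as np

def symbol_embedding(symbol, dim=32):
    # Hypothetical stand-in for a language-model embedding of a gene symbol:
    # a deterministic pseudo-random vector seeded by a hash of the symbol.
    seed = int.from_bytes(hashlib.sha256(symbol.encode()).digest()[:4], "little")
    return np.random.default_rng(seed).standard_normal(dim)

def fit_linear_encoder(X, k):
    # Stand-in for the encoder half of an autoencoder: project centered data
    # onto the top-k principal directions obtained from an SVD.
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    W = vt[:k].T
    return lambda Z: (Z - mean) @ W

def cell_features(expr, gene_codes):
    # Assumed combination scheme: concatenate each cell's expression vector
    # with the expression-weighted average of the compressed gene codes.
    weights = expr / np.clip(expr.sum(axis=1, keepdims=True), 1e-12, None)
    return np.hstack([expr, weights @ gene_codes])

def fit_centroids(X, y):
    # Nearest-centroid classifier: one mean feature vector per cell type.
    classes = sorted(set(y))
    y = np.asarray(y)
    return classes, np.vstack([X[y == c].mean(axis=0) for c in classes])

def predict(X, classes, centroids):
    # Assign each cell to the cell type with the closest centroid.
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return [classes[i] for i in dist.argmin(axis=1)]

# Synthetic toy data: 6 cells x 6 marker genes (values are illustrative only).
genes = ["CD3E", "CD19", "MS4A1", "CD14", "LYZ", "NKG7"]
expr = np.array([
    [9, 0, 0, 0, 1, 1],
    [8, 1, 0, 0, 0, 2],
    [0, 7, 9, 0, 0, 0],
    [1, 8, 8, 0, 1, 0],
    [0, 0, 0, 9, 8, 0],
    [0, 1, 0, 8, 9, 1],
], dtype=float)
cell_types = ["T", "T", "B", "B", "Mono", "Mono"]

embeddings = np.vstack([symbol_embedding(g) for g in genes])  # (6 genes, 32)
encode = fit_linear_encoder(embeddings, k=2)
gene_codes = encode(embeddings)                               # (6 genes, 2)
features = cell_features(expr, gene_codes)                    # (6 cells, 8)
classes, centroids = fit_centroids(features, cell_types)
predicted = predict(features, classes, centroids)
```

In a real setting, `symbol_embedding` would be replaced by embeddings derived from an actual tokenizer and language model, `fit_linear_encoder` by a trained nonlinear autoencoder, and the classifier would be evaluated on held-out cells rather than on the cells used to fit it.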
Citation Formats
O. S. Tandoğan, “NEXT-GENERATION CELL TYPE ANNOTATION: INTEGRATING NLP AND ML TECHNIQUES FOR ENHANCED SCRNA CLASSIFICATION,” M.S. - Master of Science, Middle East Technical University, 2024.