Similarity search and analysis of protein sequences and structures: a residue contacts based approach

Saçan, Ahmet
The advent of high-throughput sequencing and structure determination techniques has had a tremendous impact on our quest to crack the language of life. Genomic and protein data are now accumulating at a phenomenal rate, motivated by the goal of deriving insights into the function, mechanism, and evolution of biomolecules through analysis of their similarities, differences, and interactions. The rapid growth of biomolecular databases, however, calls for the development of new computational methods for sensitive and efficient management and analysis of this information. In this thesis, we propose and implement several approaches for accurate and highly efficient comparison and retrieval of protein sequences and structures. We exploit the observation that corresponding residues in related proteins share similar inter-residue contacts to derive a new set of biologically sensitive, metric amino acid substitution matrices, which yield accurate alignment and comparison of proteins. Because these matrices are metric, they allow efficient indexing and retrieval of both protein sequences and structures. We develop a landmark-guided embedding of protein sequences that represents subsequences in a vector space for approximate but extremely fast spatial indexing and similarity search. Whereas protein structure comparison and search have hitherto been handled as separate tasks, we propose an integrated approach that serves both and performs comparably to or better than other available methods. Our approach hinges on the identification of similar residue contacts using distance-based indexing and provides the best of both worlds: the accuracy of detailed structure alignment algorithms at a speed comparable to that of structure retrieval algorithms.
We expect that the methods and tools developed in this study will find use in a wide range of application areas including annotation of new proteins, discovery of functional motifs, discerning evolutionary relationships among genes and species, and drug design and targeting.
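
To illustrate why metricity matters for indexing, the following minimal sketch shows a generic landmark (Lipschitz-style) embedding, not the thesis's own implementation. It assumes Hamming distance as a stand-in metric on equal-length subsequences; the function names, the toy database, and the choice of landmarks are all hypothetical. Each subsequence is mapped to its vector of distances to a few landmarks, and the triangle inequality guarantees that the Chebyshev distance between two such vectors never exceeds the true distance, so vector comparisons can safely prune candidates before the exact distance is computed.

```python
def hamming(a: str, b: str) -> int:
    """Stand-in metric (an assumption): number of mismatched positions."""
    if len(a) != len(b):
        raise ValueError("subsequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def embed(seq, landmarks, dist=hamming):
    """Map a subsequence to its vector of distances to each landmark."""
    return tuple(dist(seq, lm) for lm in landmarks)


def chebyshev(u, v):
    """Largest coordinate-wise difference between two embedded vectors."""
    return max(abs(x - y) for x, y in zip(u, v))


def search(query, database, landmarks, radius, dist=hamming):
    """Range search: prune in the embedded space, then verify exactly."""
    q_vec = embed(query, landmarks, dist)
    hits = []
    for s in database:
        # For a metric, |d(q, l) - d(s, l)| <= d(q, s) for every landmark l,
        # so a Chebyshev distance above the radius rules the candidate out.
        if chebyshev(q_vec, embed(s, landmarks, dist)) > radius:
            continue
        if dist(query, s) <= radius:
            hits.append(s)
    return hits


# Hypothetical toy data: five length-5 subsequences and two landmarks.
database = ["ACDEF", "ACDEG", "WYWYW", "ACDKF", "LLLLL"]
landmarks = ["ACDEF", "WYWYW"]
hits = search("ACDEF", database, landmarks, radius=1)
```

In a practical system the embedded vectors would be precomputed and stored in a spatial index rather than scanned linearly. This sketch keeps an exact verification step, whereas the embedding described in the abstract trades exactness for speed.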