Development and Optimization of a Clustering Process that Utilizes Active Site Features to Identify Functionally Relevant Groups Within Protein Superfamilies

Electronic Theses and Dissertations

Item Details

abstract
The elucidation of protein molecular function lags far behind the rate of high-throughput sequencing technology; thus, the development of accurate and efficient computational methods to define functional relationships is essential. Protein similarity networks have emerged as a simple, high-throughput method for defining protein relationships, but sequence-based techniques often inaccurately define molecular function details. Active site profiling (ASP) was previously developed to identify and compare molecular details of protein functional sites. In this work, protein similarity networks were created using active site similarity, sequence similarity, and structure similarity for four manually curated SFLD superfamilies. Results demonstrate that ASP-based clustering identifies detailed functional relationships more accurately than sequence- and structure-based clustering. Building on this, two iterative pipelines were developed utilizing active site profiling and profile-based protein database searches to cluster protein superfamilies into distinct, functionally relevant groups. First, the Two Level Iterative clustering Process (TuLIP) utilizes active site profiling and iterative PDB searches to divisively cluster protein structures into groups sharing functional site features. Across eight superfamilies, TuLIP clusters exhibit high correlation with SFLD functional annotations. Subsequently, the Multi-level Iterative Sequence Searching Technique (MISST) was developed to identify protein sequences sharing active site similarity with each TuLIP group using iterative profile-based GenBank searches. Though research is ongoing to fully parameterize MISST, preliminary results with multiple gold standard superfamilies indicate the ability to identify and cluster a large portion of each superfamily into functionally relevant groups. 
Furthermore, detailed analysis of the functionally relevant clusters generates hypotheses regarding previously unknown functional relationships. Lastly, DASP2 and DASP3 were created to improve the efficiency and accuracy of DASP at identifying proteins sharing active site similarity with a given set. Initial validation with the Prx superfamily indicates DASP3 TuLIP clusters match known functional groups more closely than DASP2, and DASP3 MISST groups are populated in fewer iterations. TuLIP and MISST provide an efficient, accurate way to define functionally relevant groups that can be applied systematically and on a large scale, laying the foundation for functionally relevant clustering of the entire protein universe.
subject
Active Site Profiling
Clustering
Protein function
contributor
Leuthaeuser, Janelle (author)
Fetrow, Jacquelyn S (committee chair)
Salsbury, Freddie R (committee member)
Lively, Mark O (committee member)
Turkett, William H (committee member)
Nelson, Kimberly J (committee member)
date
2016-01-11T09:35:19Z (accessioned)
2017-01-10T09:30:13Z (available)
2015 (issued)
degree
Molecular Genetics & Genomics (discipline)
embargo
2017-01-10 (terms)
identifier
http://hdl.handle.net/10339/57422 (uri)
language
en (iso)
publisher
Wake Forest University
title
Development and Optimization of a Clustering Process that Utilizes Active Site Features to Identify Functionally Relevant Groups Within Protein Superfamilies
type
Dissertation