Machine Learning for Biostatisticians: A Hypothesis Driven Approach

Electronic Theses and Dissertations

Item Details

title: Machine Learning for Biostatisticians: A Hypothesis Driven Approach
author: Guy, Richard
abstract: Advances in microchip technology have enabled genome-wide association studies that attempt to find regions of variation that associate with disease in large-scale case/control studies. Traditional statistical methods have demonstrated success, but they have failed to make the transition to identifying large sets of interacting variations. Among the reasons are computational complexity and the curse of dimensionality as complicated regression models are deployed. Recently, a great amount of attention has been paid to machine learning methods for detecting interacting sets of single nucleotide polymorphisms (SNPs) that associate with genetic disorders. Many of the algorithms that have been applied suffer from two problems. First, each algorithm has been developed and presented on data sets on which it is designed to perform well. Second, they require exhaustive search on high-order sets of SNPs and are computationally infeasible for modern sample sizes. Comprehensive studies of algorithm effectiveness have failed to further investigate the claims of the algorithm designers, particularly with respect to type I error. In this thesis, we present a study of several recently published algorithms for the detection of interacting sets of SNPs, including Multi-factor Dimensionality Reduction, Support Vector Machines and Random Forests. We compare their power and type I error to those of logistic regression, which is a popular statistical method for single-SNP tests or small interactions. All three machine learning methods are demonstrated to have elevated type I error of detecting an interaction between two SNPs. In addition, power is shown to differ from logistic regression only when a penetrance model exists that does not fit the regression model used. We also investigate the use of Alternating Decision Trees and Bagging for the detection of interacting and associated SNPs.
We introduce a novel interpretation of bagged ADTrees that allows for detection of sets of SNPs that associate with disease with high positive predictive value. Simulated testing shows that ADTrees have power comparable to other machine learning algorithms, with less elevated type I error of detecting an interaction, provided that a marginal effect exists in at least one SNP involved in an interaction. Another advantage of ADTrees and ADTrees using Bagging is complexity that is linear in the number of SNPs in the sample, unlike all other methods considered for the detection of pairs of interacting SNPs. Finally, we present a software tool called SNPdoc that was developed to assist in genome-wide association studies by enabling fast aggregation of statistical and genomic information. This will be published as presented in this thesis.
subject: Computer Science; Statistical Genetics
contributor: Langefeld, Carl (committee chair); Santago, Peter (committee member); Turkett, William (committee member); John, David (committee member)
date: 2010-05-07T18:56:53Z (accessioned); 2010-06-18T18:57:47Z (accessioned); 2010-05-07T18:56:53Z (available); 2010-06-18T18:57:47Z (available); 2010-05-07T18:56:53Z (issued)
degree: Computer Science (discipline)
identifier: http://hdl.handle.net/10339/14717 (uri)
language: en_US (iso)
publisher: Wake Forest University
rights: Release the entire work for access only to the Wake Forest University system for one year from the date below. After one year, release the entire work for access worldwide. (accessRights)
type: Thesis
