Dimensionality Reduction using Backward Peeling in Random Jungle

Kristin K. Nicodemus, PhD, MPH
University of Oxford
Venue: 
AM S2
Time: 
12:00
Date: 
1 December 2009

Given the rapidly expanding size of genome-wide association data, dimensionality reduction (DR) is increasingly important for the interpretability of machine learning results. DR creates computationally more tractable models, allows the best use of resources in follow-up studies, may increase prediction accuracy by removing noise predictors, and can remove redundant predictors. Random Jungle (RJ; Schwarz et al., in press), a computationally efficient implementation of the random forest algorithm, provides options for a backward 'peeling' approach to DR: users specify a target number of predictors to be retained in the final set (p*), and at each iteration the algorithm peels off 50% of the predictors, 33% of the predictors, or all predictors with negative variable importance measures, until the number of retained predictors is < p*. However, how should p* be selected when it is not known a priori? I show that using the out-of-bag error rates during peeling to select p* is strongly influenced by the “large p, small n” problem: when the number of predictors is much larger than the number of observations (as is common in genome-wide association studies), this approach may lead to an improper estimate of p*. Further, the peeling approach may inflate some of the resulting variable importance measures, and this inflation is also more pronounced when the number of predictors is larger than the number of observations. For the setting in which the number of predictors is much greater than the number of observations, I propose solutions to these issues, including cross-validation for the estimation of p* and the use of independent test sets for the estimation of variable importances and prediction error.
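
As a rough illustration of the peeling loop described above, the following Python sketch shows how the retained predictor set could shrink from one iteration to the next. It is not Random Jungle's code or interface: the function backward_peel, its parameters, and the toy importance scores are all hypothetical names introduced here. In a real analysis the importance function would refit the forest on the retained predictors at every iteration; here it simply returns fixed scores so the example is self-contained.

```python
import math
from typing import Callable, Dict, List, Union


def backward_peel(
    predictors: List[str],
    importance_fn: Callable[[List[str]], Dict[str, float]],
    p_star: int,
    drop: Union[float, str] = 0.5,
) -> List[List[str]]:
    """Return the predictor set retained after each peeling iteration.

    drop is a fraction to peel off (e.g. 0.5 or 1/3) or the string
    "negative" to peel off every predictor with negative importance.
    """
    current = list(predictors)
    history = [current]
    # Stop once fewer than p_star predictors remain, following the
    # "< p*" wording of the abstract (an assumption about the exact rule).
    while len(current) >= p_star:
        scores = importance_fn(current)
        ranked = sorted(current, key=lambda name: scores[name], reverse=True)
        if drop == "negative":
            kept = [name for name in ranked if scores[name] >= 0]
        else:
            n_drop = max(1, math.floor(len(ranked) * drop))
            kept = ranked[: len(ranked) - n_drop]
        if len(kept) == len(current):
            break  # nothing left to peel off
        current = kept
        history.append(current)
    return history


# Toy importances (hypothetical): snp0 is the most important, snp15 the least.
names = [f"snp{i}" for i in range(16)]
fixed_scores = {name: float(16 - i) for i, name in enumerate(names)}


def toy_importance(subset: List[str]) -> Dict[str, float]:
    # Stand-in for refitting a forest and reading off variable importances.
    return {name: fixed_scores[name] for name in subset}


history = backward_peel(names, toy_importance, p_star=3, drop=0.5)
print([len(step) for step in history])  # [16, 8, 4, 2]
print(history[-1])                      # ['snp0', 'snp1']
```

Recording the out-of-bag error at each entry of the returned history is what would allow p* to be read off the peeling path. As the abstract argues, that estimate can be unreliable when predictors greatly outnumber observations, so in that setting the error at each step would be better estimated by cross-validation or on an independent test set.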