





The case-only approach is more powerful for detecting gene-environment interactions (GxE) than the case-control approach if the assumption of gene-environment (G-E) independence is valid. This assumption may, however, be violated in the presence of population stratification. The empirical-Bayes (EB) procedure tests for interaction by exploiting the G-E independence assumption without relying on it (Mukherjee and Chatterjee, Biometrics, 2008, 64(3):685-94). Therefore, the EB method can have increased power compared to the case-control approach, while its type I error is smaller than that of the case-only approach if the independence assumption is violated (Mukherjee et al., 2008).
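The relationship between the three estimators can be illustrated for a binary genotype and a binary exposure. The following is a minimal sketch, not the authors' implementation: the count layout, the function names, and the simplified shrinkage weight (squared G-E log odds ratio in controls relative to the variance of the case-control estimate) are assumptions made for illustration.

```python
import math


def log_or(a, b, c, d):
    """Log odds ratio and Woolf variance for the 2x2 table [[a, b], [c, d]]."""
    return math.log(a * d / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d


def gxe_estimates(cases, controls):
    """Interaction log odds ratios from G-by-E counts.

    cases, controls: tuples (G1E1, G1E0, G0E1, G0E0); this layout is an
    assumption of the sketch, not something fixed by the abstract.
    """
    # Case-only estimate: G-E association among cases.
    beta_co, var_co = log_or(*cases)
    # G-E association among controls; zero under G-E independence.
    theta_ge, var_ge = log_or(*controls)
    # Case-control estimate: difference of the two log odds ratios.
    beta_cc = beta_co - theta_ge
    var_cc = var_co + var_ge
    # Assumed simplified shrinkage weight: the stronger the evidence of
    # G-E dependence in controls, the more weight moves to case-control.
    w = theta_ge ** 2 / (theta_ge ** 2 + var_cc)
    beta_eb = w * beta_cc + (1 - w) * beta_co
    return {"case_only": beta_co, "case_control": beta_cc, "eb": beta_eb, "weight_cc": w}


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(gxe_estimates(cases=(40, 60, 60, 140), controls=(30, 70, 70, 130)))
```

When the controls show no G-E association, the weight is zero and the sketch returns the case-only estimate; as the association grows, the estimate moves toward the case-control value.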
Systematic reviews of diagnostic tests are undertaken for the same reasons as systematic reviews of therapeutic interventions: to produce estimates of performance based on all available evidence, to evaluate the quality of published studies, and to account for the variation in findings between studies.
Projects performed by GATC show that no single next-generation sequencing technology delivers the best results for all projects. Rather, a combination of two or three technologies provides a more complete, cost-effective analysis. In addition to sequencing, bioinformatic analysis is critically important for gaining an in-depth understanding of the biological significance of the sequence data. The combination, analysis and visualisation of these data are key challenges to the successful application of next-generation sequencing technologies.
Given the rapidly expanding size of genome-wide association data, dimensionality reduction (DR) is increasingly important for the interpretability of machine learning results. DR creates computationally more tractable models, allows for the best use of resources in follow-up studies, may increase prediction accuracy by removing noise predictors, and can remove redundant predictors. Random Jungle (RJ; Schwarz et al., in press), a computationally efficient implementation of the random forest algorithm, provides options for a backward 'peeling' approach to DR: users specify a target number of predictors to be retained in the final set (p*), and at each iteration the algorithm peels off 50% of the predictors, 33% of the predictors, or all predictors with negative variable importance measures until the final set is < p*. However, how should p* be selected when it is not known a priori? I show that the use of out-of-bag error rates during peeling to select p* is strongly influenced by the “large p small n” problem. In other words, when the number of predictors is much larger than the number of observations (as is common in genome-wide association studies), the out-of-bag error rates may lead to an improper estimate of p*. Further, the peeling approach may inflate some of the resulting variable importance measures, and this inflation is also more pronounced when the number of predictors is larger than the number of observations. For settings in which the number of predictors is much greater than the number of observations, I propose solutions to these issues, including cross-validation for the estimation of p* and the use of independent test sets for the estimation of variable importances and prediction error.
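The backward peeling loop described above can be sketched generically. This is an illustrative sketch and not Random Jungle's code: `peel`, `importance_fn`, and `drop_fraction` are hypothetical names, the importance callback stands in for one forest fit, and the loop stops as soon as at most `p_star` predictors remain, which is a simplification of the stopping rule.

```python
def peel(predictors, importance_fn, p_star, drop_fraction=0.5):
    """Backward peeling: repeatedly drop the least important predictors.

    importance_fn(names) must return a dict mapping each name to an
    importance score; it is a hypothetical stand-in for refitting a forest.
    Returns the retained predictors and the set size after each iteration.
    """
    current = list(predictors)
    sizes = [len(current)]
    while len(current) > p_star:
        scores = importance_fn(current)
        # Rank from most to least important.
        ranked = sorted(current, key=lambda name: scores[name], reverse=True)
        # Drop the chosen fraction, at least one, but never go below p_star.
        n_drop = max(1, int(len(ranked) * drop_fraction))
        n_keep = max(p_star, len(ranked) - n_drop)
        current = ranked[:n_keep]
        sizes.append(len(current))
    return current, sizes


# Toy importance scores (fabricated for the demo only): x0 is the most
# important predictor, x15 the least.
TOY_SCORES = {f"x{i}": 16 - i for i in range(16)}


def toy_importance(names):
    return {name: TOY_SCORES[name] for name in names}


if __name__ == "__main__":
    kept, sizes = peel(list(TOY_SCORES), toy_importance, p_star=3)
    print(kept, sizes)
```

In line with the recommendation above, `importance_fn` would ideally score predictors on data not used for fitting, and p* would be chosen by cross-validating the whole peeling procedure rather than from the out-of-bag error of a single run.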
In its social and cultural context, medical and biological genetics is more than just a natural science. It contains something more, namely a practical anthropology. This raises anew the question of the nature of genetic information. What can genetic information mean if one assumes that the genome does not function as a "program" for the human being?

