My research lies at the interface between statistics and machine learning, developing new statistical methods and software for large-scale data analysis. Most of my work is on highly structured, high-dimensional Bayesian models for statistical genomics and genetic epidemiology.
My background is a Maths degree from Cambridge University and a PhD in Cosmology from Imperial College London. I taught myself Bayesian statistics during my PhD and moved into Biostatistics straight after, working briefly on spatial data analysis in the Small Area Health Statistics Unit at Imperial College London. This was followed by several years working on new statistical methods for gene expression data and other 'omics' data in the Department of Epidemiology and Biostatistics at Imperial and in the Maths department at Brunel University. I joined LSHTM in 2018 to work with people across the School on methods and applications using 'omics' data in epidemiology.
Affiliations
Department of Medical Statistics
Faculty of Epidemiology and Population Health
Centres
Centre for Data and Statistical Science for Health
Teaching
I lead the Machine Learning module on the MSc in Health Data Science (with Pierre Masselot from the Faculty of Public Health and Policy) and the Bayesian modules on the MSc in Medical Statistics (with Tim Russell from the Centre for Mathematical Modelling).
This section under development ...
Research
High-throughput molecular biology
I have worked for several years on Bayesian integrative models in molecular epidemiology. I have been involved in the development of new statistical methodology for several different types of high-throughput molecular biology data, including gene expression microarrays (Lewin et al. 2006; Lewin et al. 2007; Turro et al. 2010), RNA-seq (Turro et al. 2011), proteomics (Kirk et al. 2013), metabolomics (Lewin et al. 2015, Bottolo et al. 2021, Scott et al. 2023) and microbiome (Scott et al. 2023). The emphasis in all of this work is on integrative modelling, using fully Bayesian models to account for the complex correlation structures in the data and propagate uncertainty on model estimates.
Multi-trait analysis in genetics and genomics
Quantitative Trait Loci (QTLs) are genetic variants which are statistically associated with a phenotype of interest. In molecular biology, high-throughput technologies have enabled us to find QTLs for multivariate molecular phenotypes (for example multivariate gene expression (eQTLs), proteomics (pQTLs) and metabolomics (mQTLs)).
Traditional analysis approaches consider each molecular variable separately, despite these data showing extremely high correlations. We have developed Bayesian models for detecting QTLs for multivariate molecular outcomes, and have used these to detect eQTLs and mQTLs. This work includes joint modelling of genomics and metabolomics data for eQTL/mQTL detection (Lewin et al. 2015, Bottolo et al. 2021), joint modelling of microbiome and metabolomics data (Scott et al. 2023) and multi-omics data integration in drug-resistance studies (Zhao et al. 2021, 2023).
An extension of this work into causal modelling is Verena Zuber's paper on Mendelian Randomisation, which introduces multi-response Mendelian randomization (MR2), an MR method specifically designed for multiple outcomes. MR2 aims to identify exposures that cause more than one outcome or, conversely, exposures that exert their effect on distinct responses (Zuber et al. 2023).
Darren Scott recently completed his PhD with me, working on models linking multivariate molecular outcomes with microbiome data. Microbiome data are compositional, meaning that features are expressed as proportions of a whole. Standard supervised learning models cannot be used for compositional data as they treat features as independent. We have developed models for univariate (Scott et al. 2023) and multivariate outcomes (manuscript in progress) using microbiome data as compositional predictors.
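To illustrate what "compositional" means in practice, here is a minimal Python sketch. It is not the model from Scott et al. 2023. It only shows the sum-to-one constraint and the centred log-ratio (CLR) transform, a standard device for compositional data. The count table and the function names `closure` and `clr` are made up for illustration, and the pseudocount value is an arbitrary assumption.

```python
import numpy as np


def closure(counts):
    """Rescale each row of non-negative counts so that it sums to 1."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)


def clr(proportions, pseudocount=1e-6):
    """Centred log-ratio transform of each row.

    A small pseudocount (an assumption of this sketch) avoids log(0);
    the rows are re-closed afterwards so they still sum to 1.
    """
    p = np.asarray(proportions, dtype=float) + pseudocount
    p = p / p.sum(axis=1, keepdims=True)
    log_p = np.log(p)
    # Subtracting the row mean of the logs removes the arbitrary total.
    return log_p - log_p.mean(axis=1, keepdims=True)


# Hypothetical counts: 3 samples (rows) by 4 taxa (columns).
counts = np.array([
    [10, 20, 30, 40],
    [5, 5, 5, 85],
    [25, 25, 25, 25],
])

props = closure(counts)  # each row sums to 1, so the columns cannot vary independently
z = clr(props)           # each row sums to 0 after the transform

print(props.sum(axis=1))
print(np.round(z, 3))
```

Because every row of `props` sums to one, an increase in one taxon forces a decrease in the others, which is why treating the features as independent predictors is problematic.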
Machine Learning in Health Data Research
I am currently co-investigator on InflAIM, an NIHR-funded project led by the University of East Anglia to investigate multimorbidity using AI methods. We will be using multi-state models and Bayesian networks to study links and risk factors for multimorbidity.
I recently wrote a "Lessons Learnt" paper for the Centre for Impact Evaluation on machine learning methods used in causal inference for impact evaluation, in particular with respect to investigating heterogeneous treatment effects and mechanisms (Lewin et al. 2023).
I am currently supervising an MSc dissertation on interpretable AI methods. We are investigating the reliability of explanation methods for complex black-box models in the context of observational epidemiology. I am also co-supervising a PhD student surveying the use of machine learning methods for large-scale disease surveillance using online social media data.
Causal Inference Methodology
I am supervising two NIHR pre-doctoral Fellows (Lauren Rengger and Jenni Banks) working on causal mediation analysis. We are investigating the causal mechanisms underlying the association between eczema and cardiovascular outcomes, using recently developed methods for causal inference.
Mendelian Randomisation for multivariate outcomes: we have introduced multi-response Mendelian randomization (MR2), an MR method specifically designed for multiple outcomes to identify exposures that cause more than one outcome or, conversely, exposures that exert their effect on distinct responses (Zuber et al. 2023).
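For readers unfamiliar with Mendelian randomisation, the sketch below shows the basic single-outcome building block: the standard fixed-effect inverse-variance weighted (IVW) estimate computed from summary statistics. It is not MR2 and does not handle multiple outcomes. The function name `ivw_estimate` and all numbers are hypothetical and chosen purely for illustration.

```python
import numpy as np


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance weighted MR estimate.

    Equivalent to a weighted regression through the origin of the
    variant-outcome effects on the variant-exposure effects, with
    weights 1 / se_outcome**2.
    """
    bx = np.asarray(beta_exposure, dtype=float)
    by = np.asarray(beta_outcome, dtype=float)
    w = 1.0 / np.asarray(se_outcome, dtype=float) ** 2
    estimate = np.sum(w * bx * by) / np.sum(w * bx ** 2)
    std_error = np.sqrt(1.0 / np.sum(w * bx ** 2))
    return estimate, std_error


# Hypothetical summary statistics for three genetic variants.
beta_x = np.array([0.10, 0.20, 0.30])     # variant-exposure associations
beta_y = 0.5 * beta_x                     # variant-outcome associations (true effect 0.5, no noise)
se_y = np.array([0.010, 0.020, 0.015])    # standard errors of beta_y

causal_effect, causal_se = ivw_estimate(beta_x, beta_y, se_y)
print(causal_effect, causal_se)
```

With noise-free toy data the estimate recovers the effect of 0.5 used to generate `beta_y`; real analyses would also need to address pleiotropy and weak instruments, which this sketch ignores.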
Bayesian Evidence Synthesis
I am working with Darren Scott (AstraZeneca) on Bayesian models that use historical data to improve the efficiency of randomised trial analyses (Scott and Lewin 2024, arXiv preprint).
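As a rough illustration of how historical data can sharpen a trial analysis, the sketch below uses a power prior with a conjugate beta-binomial model. This is one well-known borrowing approach and is assumed here only for illustration; it is not necessarily the method in Scott and Lewin 2024. The binary outcome, the counts, the discount parameter `a0` and the function names are all hypothetical.

```python
def power_prior_posterior(y, n, y0, n0, a0, prior_a=1.0, prior_b=1.0):
    """Beta posterior for a response rate under a power prior.

    The historical likelihood (y0 responders out of n0) is raised to
    the power a0 in [0, 1]: a0 = 0 ignores the historical data and
    a0 = 1 pools it fully with the current data (y out of n).
    """
    if not 0.0 <= a0 <= 1.0:
        raise ValueError("a0 must lie between 0 and 1")
    post_a = prior_a + a0 * y0 + y
    post_b = prior_b + a0 * (n0 - y0) + (n - y)
    return post_a, post_b


def beta_mean_var(a, b):
    """Mean and variance of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var


# Hypothetical control arms: 12/40 responders now, 30/100 historically.
no_borrowing = power_prior_posterior(y=12, n=40, y0=30, n0=100, a0=0.0)
half_borrowing = power_prior_posterior(y=12, n=40, y0=30, n0=100, a0=0.5)

print(beta_mean_var(*no_borrowing))
print(beta_mean_var(*half_borrowing))
```

In this toy setting, borrowing half of the historical information reduces the posterior variance of the response rate, which is the sense in which historical data can improve efficiency.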
I work with Joy Lawn's group in LSHTM MARCH (Centre for Maternal, Adolescent, Reproductive, & Child Health) advising on Bayesian evidence synthesis methods used to produce global and country-specific estimates of disease burden and adverse birth outcomes (Gonçalves et al. 2021, Gonçalves et al. 2022, Ohuma et al. 2023, Okwaraji et al. 2023).
Research Area
Bayesian Analysis
Statistical methods
Genetic epidemiology
Life-course epidemiology
Selected Publications
2023, American Journal of Human Genetics
2023, BMC Bioinformatics
2022, The Lancet Global Health
2022, BMC Medicine
2021, PLoS Computational Biology
2021, Journal of the Royal Statistical Society: Series C (Applied Statistics)
2021, Journal of Statistical Software
2020, British Journal of Dermatology
2019, Science Advances
2019, Wiley