Master's projects

For most of the projects outlined here, you are expected to have taken Probability, (Mathematical) Statistics and Computational Statistics by the time you start the project.

Theory

T1) What ails the King?

Concentration areas: Bayesian Statistics.

In The King Must Die, Dan Simpson argues that the Bayesian least absolute shrinkage and selection operator (LASSO) cannot possibly work because its usual prior structure (the Laplace prior) cannot simultaneously accommodate sparsity and large signals. It is also argued that strategies such as the 'Finnish' horseshoe (Piironen and Vehtari, 2017) can remedy this issue. In this project, the student will give a rigorous and comprehensive account of sparsity-inducing priors for Bayesian variable selection and will investigate how the claims about the Bayesian LASSO hold up in the correlated case, where covariates are collinear.
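A minimal sketch (base R, with purely illustrative parameter values not taken from the references) of the tail behaviour at stake: the Laplace prior has a single scale that must trade sparsity off against room for large signals, whereas the horseshoe's half-Cauchy local scales give it far heavier tails.

```r
## Compare tail mass of a Laplace prior and a horseshoe prior on a coefficient.
## All scales are illustrative choices.
set.seed(42)
n <- 1e6
b <- 1                                    # Laplace scale (illustrative)
laplace <- ifelse(runif(n) < 0.5, 1, -1) * rexp(n, rate = 1 / b)
lambda <- abs(rcauchy(n))                 # horseshoe local scales (tau = 1)
horseshoe <- rnorm(n, 0, lambda)
## Tail mass beyond |beta| = 10: the horseshoe retains far more, so it can
## accommodate large signals without inflating its global scale.
mean(abs(laplace) > 10)
mean(abs(horseshoe) > 10)
```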

References:


T2) Combining densities under non-invertible transforms

Concentration areas: Theoretical Statistics, Functional Analysis, Bayesian Statistics.

One of the questions left open by Carvalho et al. (2022) is: "what happens when your experts give you opinions about a quantity in the space X but you're interested in g(X) where g is non-invertible?". In this project we seek to investigate how one would go about combining probability densities in this situation. One approach is to minimise the KL divergence in the transformed space, but this needs theoretical grounding, as well as a study of the conditions under which the pushforward measure admits a density. The student will be required to prepare a thorough treatment of pushforward probability measures and then prove a few results concerning the interaction of measurable transformations and the logarithmic pooling operator.
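A toy numerical sketch (base R, illustrative expert opinions and pooling weight) of the objects involved: two Gaussian opinions on X are pooled logarithmically, and the pooled density is pushed through the non-invertible map g(x) = x², summing the change-of-variables formula over the two branches x = ±√y.

```r
## Logarithmic pooling followed by a pushforward under g(x) = x^2.
x <- seq(-10, 10, length.out = 4001)
dx <- diff(x)[1]
f1 <- dnorm(x, mean = 1, sd = 1)          # expert 1 (illustrative)
f2 <- dnorm(x, mean = 3, sd = 2)          # expert 2 (illustrative)
w <- 0.5                                   # pooling weight (illustrative)
pool_unnorm <- f1^w * f2^(1 - w)           # logarithmic pool, up to a constant
pool <- pool_unnorm / sum(pool_unnorm * dx)
pool_fun <- approxfun(x, pool, yleft = 0, yright = 0)
## f_Y(y) = [f_X(sqrt(y)) + f_X(-sqrt(y))] / (2 sqrt(y)) for y > 0
y <- seq(0.01, 50, length.out = 2000)
fY <- (pool_fun(sqrt(y)) + pool_fun(-sqrt(y))) / (2 * sqrt(y))
sum(fY * diff(y)[1])  # approx. 1, up to discretisation error near y = 0
```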

References:


T3) Mixing time in quasi-lumpable Markov chains

Many real-world applications of Markov chain Monte Carlo (MCMC) techniques involve very high-dimensional (discrete) spaces. A way to mitigate this "curse of dimensionality" is to project the original space onto one of much smaller dimension. This projection might, however, lead to the loss of the Markov property enjoyed by the original process. The concept of lumpability (Buchholz, 2016) formalises the notion of preserving Markovianity under a given partition of the state space. The question remains, however, as to how a lower-dimensional projection might affect the mixing time of the projected chain. In this project the student will investigate how to relate the mixing times of the original and lumped chains, and how lumping can lead (or not) to faster convergence.
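A small sketch (base R, illustrative 4-state chain and partition) of the central objects: a chain is (strongly) lumpable with respect to a partition if every state in a block has the same total transition probability into each other block, and mixing can then be compared through the second-largest eigenvalue modulus.

```r
## Check strong lumpability of a 4-state chain under a 2-block partition,
## then compare mixing of the original and lumped chains.
P <- matrix(c(0.5, 0.2, 0.2, 0.1,
              0.3, 0.4, 0.1, 0.2,
              0.1, 0.2, 0.4, 0.3,
              0.2, 0.1, 0.3, 0.4), nrow = 4, byrow = TRUE)
blocks <- list(A = 1:2, B = 3:4)           # illustrative partition
## total transition mass from each state into each block
mass <- sapply(blocks, function(B) rowSums(P[, B, drop = FALSE]))
lumpable <- all(sapply(blocks, function(B)
  apply(mass[B, , drop = FALSE], 2, function(col) diff(range(col)) < 1e-12)))
lumpable
## If lumpable, the lumped transition matrix is the common block-to-block mass;
## compare convergence via the second-largest eigenvalue modulus.
Q <- rbind(mass[blocks$A[1], ], mass[blocks$B[1], ])
sort(abs(eigen(P)$values), decreasing = TRUE)[2]   # original chain
sort(abs(eigen(Q)$values), decreasing = TRUE)[2]   # lumped chain
```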


Applications

A1) Prevalence estimation and causal inference through regression models with uncertain outcomes

Concentration areas: Biostatistics, Epidemiology.

As the COVID-19 pandemic took the planet by storm, it became apparent that mass testing would be required in order to understand the extent of the virus's spread in the population. However, since diagnostic tests are imperfect, the outcome data (infected/not infected) come with observation error, which must be correctly modelled. Whilst the general problem of prevalence estimation under outcome uncertainty has been studied for at least four decades, its interface with regression models saw a recent revival, in no small part due to COVID-19 (see Gelman & Carpenter (2020)). There are, however, many open regression problems, such as the re-testing of positive cases, now a routine procedure. The student will be expected to implement and extend the models in Section 7.3 of Bastos, Carvalho & Gomes (2021) using both simulated and real-world data.
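A minimal sketch (base R, illustrative test characteristics) of the classical Rogan-Gladen correction, which is the starting point the regression models above build on: the apparent prevalence is a known linear function of the true prevalence, sensitivity and specificity, and can be inverted.

```r
## Rogan-Gladen correction for an imperfect diagnostic test.
set.seed(1)
true_prev <- 0.10
sens <- 0.85; spec <- 0.95                 # illustrative sensitivity/specificity
n <- 5000
infected <- rbinom(n, 1, true_prev)
## imperfect test: detects positives with prob sens,
## misclassifies negatives with prob 1 - spec
test_pos <- rbinom(n, 1, ifelse(infected == 1, sens, 1 - spec))
apparent <- mean(test_pos)
## invert apparent = prev * sens + (1 - prev) * (1 - spec)
corrected <- (apparent + spec - 1) / (sens + spec - 1)
c(apparent = apparent, corrected = corrected)  # corrected is near 0.10
```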

Skills to be developed: Stan/NIMBLE, Multilevel modelling.

References:

  • In addition to the references already given, Lucas Moschen's honours thesis is a great resource.
  • This repository might come in handy.

A2) Simultaneous nowcasting and Rt estimation for epidemic surveillance

Concentration areas: Biostatistics, Epidemiology.

Timely inference on the short-term behaviour of epidemics is of crucial importance for effective decision-making. Many statistical approaches have been developed for predicting COVID-19 cases, hospitalisations and deaths. Disease reporting data in general, and COVID-19 data in particular, present a number of methodological challenges due to reporting issues such as underreporting and reporting delays. These need to be statistically corrected in order to give a better picture of the actual number of cases at any given moment (Bastos, Carvalho & Gomes (2021)). Another aspect of epidemic surveillance is tracking the effective reproduction number (Rt) of the disease through time, as a measure of the risk of (exponential) disease spread. In this project, the student will couple the delay-correction nowcasting model of Bastos et al. (2019) with the Rt estimation methods in the R package EpiEstim to create a unified framework for accurate Rt estimation that explicitly models data misreporting.
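A rough sketch (R, requires the EpiEstim package) of the pipeline to be improved upon, under illustrative assumptions: recent counts are corrected with a crude plug-in multiplication factor standing in for the Bastos et al. (2019) nowcasting model, and the corrected series is then fed to EpiEstim. The reporting fractions and serial-interval values are illustrative, not disease-specific estimates.

```r
## Naive two-stage pipeline: plug-in nowcast, then Rt estimation.
library(EpiEstim)
set.seed(2)
true_cases <- round(50 * exp(0.03 * (1:60)))                  # illustrative epidemic
p_reported <- c(rep(1, 50), seq(0.9, 0.4, length.out = 10))   # recent counts incomplete
observed <- rbinom(60, true_cases, p_reported)
nowcast <- round(observed / p_reported)    # plug-in delay correction (no uncertainty!)
fit <- estimate_R(incid = nowcast,
                  method = "parametric_si",
                  config = make_config(list(mean_si = 5, std_si = 3)))
tail(fit$R[, c("t_end", "Mean(R)")])
## The project replaces the plug-in step with a joint model, so that
## nowcasting uncertainty propagates into the Rt estimates.
```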

Skills to be developed: INLA, R, Applied Bayesian Statistics. This is joint work with Drs Leo Bastos and Marcelo Gomes.


A3) Parsimonious models of pathogen spatial spread: prior modelling and variable selection for phylogeography

Concentration areas: Biostatistics, Statistical Phylogenetics.

Since its introduction in Lemey et al. (2009), Bayesian phylogeography has become a major tool for assessing the drivers of pathogen spatial spread. In this project the student will be expected to complete two tasks:

  1. Compose a thorough review of the literature on Bayesian phylogeography, along with a curated collection of data sets to be used in further analyses; and
  2. Conduct experiments investigating the effect of prior stringency on the resulting inferences, including the set of selected ('significant') covariates (see the sketch after this list).
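A small sketch (base R, illustrative numbers) of why prior stringency matters in the stochastic-search variable selection used in phylogeographic GLMs: the Bayes factor for a covariate's inclusion is the ratio of posterior to prior inclusion odds, so the same posterior inclusion probability carries very different evidence under different priors.

```r
## Bayes factor for inclusion of a covariate from indicator probabilities.
inclusion_bf <- function(post_p, prior_p) {
  (post_p / (1 - post_p)) / (prior_p / (1 - prior_p))
}
post_p <- 0.5                              # posterior inclusion probability
## a lax prior (inclusion prob 1/2) versus a stringent one
## (prior expects roughly 1 active covariate out of 10)
inclusion_bf(post_p, prior_p = 0.5)        # BF = 1: no evidence
inclusion_bf(post_p, prior_p = 0.1)        # BF = 9: moderate support
```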

References:


A4) Large effects in a sea of irrelevance: novel techniques for Bayesian sparse regression

Concentration areas: Bayesian statistics, variable selection.

In many practical applications, especially in Biology and Medicine, one is confronted with the task of identifying relevant predictors for an outcome of interest out of a set of thousands or even hundreds of thousands of possible predictors. Sparsity-inducing priors are the Bayesian tools that allow us to identify the predictors that have large effects on, or associations with, the outcome of interest without sacrificing predictive ability. The goal of this project is to extend the review and experiments of van Erp, Oberski & Mulder (2019) to incorporate novel computational improvements and also to specifically study the issue of multicollinearity. When predictors are correlated in non-trivial ways, the task of identifying a small set that parsimoniously predicts the outcome is made even more difficult.

The project will work along two axes: (i) realistically simulating sparse-but-correlated design matrices and (ii) developing memory-efficient implementations that scale with dimension.
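A minimal sketch (base R) for axis (i), under one simple, illustrative choice among many: an AR(1) correlation structure among predictors, with a handful of large effects in a sea of exact zeros.

```r
## Simulate a sparse-but-correlated regression problem.
set.seed(3)
n <- 200; p <- 1000; rho <- 0.7            # illustrative sizes
Sigma <- rho^abs(outer(1:p, 1:p, "-"))     # AR(1) correlation matrix
X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)  # rows have covariance Sigma
beta <- numeric(p)
beta[sample(p, 10)] <- rnorm(10, 0, 5)     # 10 large effects, rest exactly zero
y <- drop(X %*% beta) + rnorm(n)
## note: forming Sigma explicitly costs O(p^2) memory -- axis (ii) is about
## avoiding exactly this kind of cost as p grows.
```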

Skills to be developed: Stan and C++ programming, Bayesian statistics, Variable selection. This is joint work with Aki Vehtari.

References:


A5) General inference of natural selection with multilevel models

The question of identifying genes under natural selection is a central one in Evolutionary Biology. Eilertson, Booth & Bustamante (2012) propose a method that uses genome-wide polymorphism data to infer selection via a multilevel logistic model fitted to tens of thousands of (2×2) MK matrices. The estimated coefficients can then be mapped onto the parameters of a Poisson random field (PRF) that has a direct evolutionary interpretation. In this project we will explore this framework in order to generalise the approach to other evolutionary models. The student will:

  1. study how to fit other evolutionary models using the SnIPRE approach;
  2. simulate evolutionary trajectories (and data) using the SLiM simulator in order to understand how robust the SnIPRE approach is to violations of its assumptions;
  3. study how to use extra information on the genes (e.g. whether they are associated with immunity or saliva) to detect patterns in the random-effects scatterplots;
  4. try to discern chromosome-wide effects: do sex chromosomes evolve faster?
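A hedged sketch (R with the lme4 package as a quick stand-in; the project itself will use Stan, and this is the general flavour rather than the actual SnIPRE implementation): each gene contributes counts that are modelled with a binomial GLMM whose gene-level random effects flag candidate genes. All simulated quantities are illustrative.

```r
## Gene-level random effects in a binomial GLMM, SnIPRE-flavoured.
library(lme4)
set.seed(4)
G <- 500                                   # number of genes (illustrative)
u <- rnorm(G, 0, 0.5)                      # gene-level ("selection") effects
n_sites <- rpois(G, 80) + 20               # sites per gene (illustrative)
p_repl <- plogis(-0.5 + u)                 # prob. a site is a replacement change
repl <- rbinom(G, n_sites, p_repl)
dat <- data.frame(gene = factor(1:G), repl = repl, syn = n_sites - repl)
fit <- glmer(cbind(repl, syn) ~ 1 + (1 | gene), data = dat, family = binomial)
## genes with large positive random effects are candidates for selection
head(sort(ranef(fit)$gene[, 1], decreasing = TRUE))
```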

All of the development will be done in Stan and follow modern Bayesian techniques such as non-centering, simulation-based calibration and predictive checking.

Skills to be developed: Stan and C++ programming, Bayesian statistics, Statistical Genetics. Co-supervised with Professor Claudio Struchiner.

References: