Skip to content

Add Subsampling to Aster Models

Shannon Sequeira edited this page Apr 9, 2019 · 7 revisions

Aster Models for Life History Analysis

Aster models for life history analysis are generalized generalized linear models (GGLM). They also generalize discrete-time survival analysis, life table analysis, zero-inflated Poisson regression, and are a special case of graphical models. Aster models are like GLM except that in aster models components of the response vector

  • can be dependent, with dependence specified by a graphical model, and

  • can have different "families" (some Bernoulli, some Poisson, some zero-truncated Poisson, some normal).

CRAN packages aster and aster2 provide statistical inference for these models. This project only works on R package aster. There is no public git repo for this package. To look at its current state download the tarball from https://cran.r-project.org/package=aster and unpack. The student accepted to work on this project will get commit access to the github.com private repo.

The basic papers on aster models are cited on the help page for R generic function aster in R package aster. There is also a web site for a special topics course on aster models http://www.stat.umn.edu/geyer/8931aster/. Slide decks one and two of the course slides provide sufficient background for this project.

Aster models have not become popular with statisticians despite their possible applications in many areas. They are widely used by biologists.

Related work

No other R packages (except the aforementioned aster2) have aster functionality.

Details of this GSOC Coding Project

This work is about the aster model subsampling problem, which is important because many biologists do their experiments with subsampling. The subsampling problem is thoroughly discussed in the PDF file http://users.stat.umn.edu/~geyer/q.pdf. This file was edited 2019-04-04 at 7:30 CDT to correct a bad off-by-one error in the derivative formulas.

This project will create no new R functions. The task is to fix R package aster so that it does subsampling as described in above mentioned PDF file (q.pdf). This requires many changes to the C code underlying R function mlogl in that package

The reason for this is that unconditional aster models without subsampling are regular full exponential families so one can calculate log likelihood derivatives two ways:

  • use calculus (the sum rule, product rule, chain rule, etc.)

  • use the theory of regular full exponential families:

    • the first derivative of the log likelihood is "observed minus expected" yE(Y) where y is the canonical statistic, and

    • the second derivative of the log likelihood is "minus observed Fisher information" − var(Y)

All of the code in the current version uses the second method, which is much simpler.

But aster models with subsampling are not regular full exponential families, so we have to use the first method (calculus). This requires a lot of changes to the code. The equations for the aster model log likelihood with subsampling are given in Theorems 1, 3, and 6 in q.pdf.

So the log likelihood and its first and second derivatives will have to be recoded to do subsampling right. We also need observed and expected Fisher information done right (Section 2.20 of q.pdf). That is the main job.

We also need a new "family". See the R help page families in R package aster. This will be fam.sample() or perhaps fam.sample(p) which will tell us which arrows are the subsampling arrows. So I guess saying there will be no new R functions was a lie. But fam.sample is a trivial function to write. Its underlying C code may not be trivial. There also may be user-visible changes for R function aster if the user wants to use expected Fisher information, in which case the subsampling probabilities will have to be specified somehow, perhaps through the argument to fam.sample, perhaps through a new argument to R function aster.

What exactly to do:

  • design the changes

  • code the changes

  • test the changes with code in the tests directory of the package

  • modify the help pages as necessary

  • a subsampling vignette would be nice but is not absolutely necessary.

Expected Impact

Many biologists have already produced data that has subsampling and would be suitable for aster analysis if R package aster had subsampling.

Mentors

Charles J. Geyer (charlie@stat.umn.edu)

  • maintainer for 12 CRAN packages

  • Mentor for 2 previous GSOC

  • statistical computing teacher

Ruth G. Shaw (shawx016@umn.edu)

  • scientific originator of aster models

  • also the advocate for aster models in evolutionary genetics

Tests

Students, please do one or more of the following tests before contacting the mentors above.

  • Easy: install R package aster and do the examples on the the help page for R function aster.

  • Medium: write an R implementation of the log likelihood for the aster model specified by (4) in q.pdf, where y5 and y6 are conditionally Poisson given their predecessors, where y1 and y4 are conditionally Bernoulli given their predecessors, and where (as indicated in the graph in the PDF) y2 and y3 are subsampling arrows. So this graph has 4 unknown parameters per individual (one for each non-subsampling arrow). Assume individuals are independent and identically distributed so there are only 4 unknown parameters no matter how many individuals there are. Write a test that shows your function calculates the log likelihood correctly (for this particular graph). Write R functions that calculate the gradient vector and hessian matrix of the log likelihood. Test them using your log likelihood function and R package numDeriv. You will have to make up some data to test your function.

  • Hard: figure out the "call graph" of R function mlogl. How does it do its job? It calls multiple C functions but also calls a non-public R function mloglhelper that does most of the work. But what C functions does that function call? And then what do they call and so forth?

    It may be that you find the current C code is almost useless (since it does derivatives without calculus but rather with probability as explained above). In that case go to the bottom and figure out how families are implemented in this package. And so figure out how the existing family functions can be used to implement the formulas in q.pdf.

Students, please post a link to your test results here.

Clone this wiki locally