-
Notifications
You must be signed in to change notification settings - Fork 8
Add Subsampling to Aster Models
Aster models for life history analysis are generalized generalized linear models (GGLM). They also generalize discrete-time survival analysis, life table analysis, zero-inflated Poisson regression, and are a special case of graphical models. Aster models are like GLM except that in aster models components of the response vector
-
can be dependent, with dependence specified by a graphical model, and
-
can have different "families" (some Bernoulli, some Poisson, some zero-truncated Poisson, some normal).
CRAN packages aster
and aster2
provide statistical
inference for these models. This project only works on R package aster
.
There is no public git repo for this package. To look at its current state
download the tarball from https://cran.r-project.org/package=aster and unpack.
The student accepted to work on this project will get commit access to
the github.com private repo.
The basic papers on aster models are cited on the help page for R generic
function aster
in R package aster
. There is also a web site for
a special topics course on aster models
http://www.stat.umn.edu/geyer/8931aster/.
Slide decks one and two of the course slides provide sufficient background
for this project.
Aster models have not become popular with statisticians despite their possible applications in many areas. They are widely used by biologists.
No other R packages (except the aforementioned aster2
) have aster
functionality.
This work is about the aster model subsampling problem, which is important because many biologists do their experiments with subsampling. The subsampling problem is thoroughly discussed in the PDF file http://users.stat.umn.edu/~geyer/q.pdf. This file was edited 2019-04-04 at 7:30 CDT to correct a bad off-by-one error in the derivative formulas.
This project will create no new R functions. The task is to fix R package
aster
so that it does subsampling as described in above mentioned PDF file
(q.pdf). This requires many changes to the C code underlying R function
mlogl
in that package
The reason for this is that unconditional aster models without subsampling are regular full exponential families so one can calculate log likelihood derivatives two ways:
-
use calculus (the sum rule, product rule, chain rule, etc.)
-
use the theory of regular full exponential families:
-
the first derivative of the log likelihood is "observed minus expected" y − E(Y) where y is the canonical statistic, and
-
the second derivative of the log likelihood is "minus observed Fisher information" − var(Y)
-
All of the code in the current version uses the second method, which is much simpler.
But aster models with subsampling are not regular full exponential families, so we have to use the first method (calculus). This requires a lot of changes to the code. The equations for the aster model log likelihood with subsampling are given in Theorems 1, 3, and 6 in q.pdf.
So the log likelihood and its first and second derivatives will have to be recoded to do subsampling right. We also need observed and expected Fisher information done right (Section 2.20 of q.pdf). That is the main job.
We also need a new "family". See the R help page families
in R package aster
. This will
be fam.sample()
or perhaps fam.sample(p)
which will tell us which arrows
are the subsampling arrows. So I guess saying there will be no new R functions
was a lie. But fam.sample
is a trivial function to write. Its underlying
C code may not be trivial. There also may be user-visible changes for
R function aster
if the user wants to use expected Fisher information,
in which case the subsampling probabilities will have to be specified somehow,
perhaps through the argument to fam.sample
, perhaps through a new argument
to R function aster
.
-
design the changes
-
code the changes
-
test the changes with code in the
tests
directory of the package -
modify the help pages as necessary
-
a subsampling vignette would be nice but is not absolutely necessary.
Many biologists have already produced data that has subsampling and would be suitable for aster analysis if R package aster had subsampling.
Charles J. Geyer (charlie@stat.umn.edu)
-
maintainer for 12 CRAN packages
-
Mentor for 2 previous GSOC
-
statistical computing teacher
Ruth G. Shaw (shawx016@umn.edu)
-
scientific originator of aster models
-
also the advocate for aster models in evolutionary genetics
Students, please do one or more of the following tests before contacting the mentors above.
-
Easy: install R package
aster
and do the examples on the the help page for R functionaster
. -
Medium: write an R implementation of the log likelihood for the aster model specified by (4) in q.pdf, where y5 and y6 are conditionally Poisson given their predecessors, where y1 and y4 are conditionally Bernoulli given their predecessors, and where (as indicated in the graph in the PDF) y2 and y3 are subsampling arrows. So this graph has 4 unknown parameters per individual (one for each non-subsampling arrow). Assume individuals are independent and identically distributed so there are only 4 unknown parameters no matter how many individuals there are. Write a test that shows your function calculates the log likelihood correctly (for this particular graph). Write R functions that calculate the gradient vector and hessian matrix of the log likelihood. Test them using your log likelihood function and R package
numDeriv
. You will have to make up some data to test your function. -
Hard: figure out the "call graph" of R function
mlogl
. How does it do its job? It calls multiple C functions but also calls a non-public R functionmloglhelper
that does most of the work. But what C functions does that function call? And then what do they call and so forth?It may be that you find the current C code is almost useless (since it does derivatives without calculus but rather with probability as explained above). In that case go to the bottom and figure out how families are implemented in this package. And so figure out how the existing family functions can be used to implement the formulas in q.pdf.
Students, please post a link to your test results here.
-
Name: Dominic Mills-Howell
Email: dominic.millshowell@uwimona.edu.jm
University: Univeristy of the West Indies, Mona Campus
Link to solutions: Easy, Medium - coming soon -
Shannon Sequeira (sequeira@umn.edu)
University of Minnesota
Easy solution