-
Notifications
You must be signed in to change notification settings - Fork 31
Constrained changepoint detection
There are many R packages for changepoint detection, which is an important problem in many application domains (genomics, finance, etc). A new changepoint model with up-down constraints has shown state-of-the-art peak detection accuracy in genomic data (Hocking et al 2015). However, the Constrained Dynamic Programming Algorithm (CDPA) proposed in that paper is quadratic time (too slow for big data) and heuristic (not guaranteed to recover the optimal solution).
The goal of this project is to implement an optimal log-linear time algorithm for such constrained models, using the functional pruning technique (Maidstone et al 2016) that has been recently shown to work for constrained models (Hocking et al 2017). In particular we are interested in implementing several loss functions (Poisson, Normal, Binomial), and several constraint functions between adjacent segment means (no constraint, non-decreasing, non-increasing), for the optimal partitioning model (compute the best single segmentation model for a given penalty parameter).
The table below summarizes related R packages. The top half of the table is for algorithms that solve the segment neighborhood problem, which computes all models from 1 to K segments. The bottom half of the table is for algorithms that solve the optimal partitioning problem, which computes the best single model with K segments that corresponds to a given penalty parameter (and it does not compute the models from 1 to K-1 segments). Time complexity is for a sequence of N data.
function | models | time | optimal | constraints | loss |
---|---|---|---|---|---|
Segmentor | K | O(K N log N) | yes | none | several |
cDPA | K | O(K N^2) | no | up-down | Poisson |
PeakSegPDPA | K | O(K N log N) | yes | up-down | Poisson |
Fpop | 1 | O(N log N) | yes | none | Normal |
PeakSegFPOP | 1 | O(N log N) | yes | up-down | Poisson |
this project | 1 | O(N log N) | yes | several | several |
The table above shows that previous work has implemented several loss functions for the unconstrained model (Segmentor), and the Poisson loss function for the up-down constrained model (PeakSegFPOP). The goal of this project is to implement a solver for several loss and constraint functions.
Implement a C++ function which implements the Generalized Functional Pruning Optimal Partitioning (GFPOP) algorithm described in the last section (B.5) of our arXiv paper. There should be an R interface for solving changepoint problems with several kinds of loss and constraint functions. The main
R solver function should be the GFPOP
function:
GFPOP(model, data.vec, weight.vec, penalty)
data.vec
is a numeric vector of data to segment (with
corresponding weights in weight.vec
). The penalty
is a
non-negative numeric scalar which is added to the cost for each
changepoint. The model
specifies the loss and constraint
functions. The constraints can be described in terms of a finite
number of states and possible changes. For example the unconstrained
model (same as fpop package) could be specified as one state with one
possible unconstrained change:
library(data.table)
state <- function(state.name){
structure(data.table(state.name), type="state")
}
change <- function(from, to, constraint, penalty=1){
structure(data.table(from, to, penalty, constraint), type="change")
}
loss <- function(loss.name){
structure(data.table(loss.name), type="loss")
}
unconstrained.Normal <- list(
state("anything"),
change("anything", "anything", "no.constraint"),
loss("Normal"))
The reduced isotonic regression problem also has one state, but constrains all changes to be non-decreasing:
isotonic.Normal <- list(
state("anything"),
change("anything", "anything", "non.decreasing"),
loss("Normal"))
The PeakSegFPOP solver can be described using two states, peak (after an up change) and background (after a down change). The model is also constrained to start and end in the background state:
start <- function(...){
structure(data.table(state.name=c(...)), type="start")
}
end <- function(...){
structure(data.table(state.name=c(...)), type="end")
}
PeakSeg.Poisson <- list(
loss("Poisson"),
state("peak"),
state("background"),
change("background", "peak", "non.decreasing"),
change("peak", "background", "non.increasing", penalty=0),
start("background"),
end("background"))
The idea is that any of these models (or others) could be used as the
model
argument to GFPOP
. This R code that describes the model
should be converted into a format that can be passed to C++ code,
along with a data set and penalty e.g.
class Change {//represents an edge in the constraint graph
int from;
int to;
//EITHER (easy to implement based on current code)
int constraint;//1=non-decreasing 2=non-decreasing 3=unconstrained
//OR (harder to implement, more general)
double a,b,c;//a*m_t + b*m_{t+1} + c \leq 0.
double penalty;
};
void GFPOP(
int n_data,
double *data_vec, //array[n_data]
double *weight_vec, //array[n_data]
double penalty,
int loss, //1=Normal 2=Poisson 3=Binomial, etc
int n_states, //number of vertices in constraint graph
int n_changes, //number of edges in constraint graph
Change *change_vec, //array[n_changes]
int n_starts, //number of states possible at first data point
int *start_state_vec, //array[n_starts] possible states at first data point
int n_ends, //number of states possible at last data point
int *end_state_vec, //array[n_ends] possible states at last data point
//output below:
int *optimal_state_vec, //vector of optimal state/vertex for each segment
int *optimal_change_vec, //vector of optimal changes/edges
double *optimal_mean_vec, //vector of optimal segment means
double *optimal_cost_vec, //maybe just the last point?
int *optimal_segment_end_vec, //last data point of each segment
int *intervals_mat //number of intervals or cost function pieces
);
The ideal student will also write tests, documentation, vignettes and maybe even a blog.
This GSOC project will provide log-linear time code for solving a large class of constrained change-point problems. Because of the log-linear time, this code will be useful for large data sets. Because of the different constraints and loss functions, it will be useful in many different application domains.
Please get in touch with Toby Dylan Hocking <toby.hocking@r-project.org> and Guillem Rigaill <guillem.rigaill@inra.fr> after completing at least one of the tests below.
NOTE: this is a very difficult GSOC project, and so the ideal student really should do all tests below. If we can’t find a student that can do all these tests, we should probably not fund this GSOC project.
Easy: write two R functions checkModel
and plotModel
which input
one of the models described above. checkModel
should stop with an
informative error if the model is invalid (e.g. if change/start/end
reference an undefined state, or if an unrecognized loss/constraint
function is used). Write a package and use testthat
to write some
test cases that make sure checkModel
is working correctly, both for
positive examples (valid models should not cause errors) and negative
examples (invalid models should cause errors). plotModel
should draw
a node-and-edge graph that describes the model states and changes.
Medium: the Segmentor3IsBack package implements the segment neighborhood model (best models in
1,…,K segments) for the
Poisson loss and no constraints, but there is no implementation available for the
optimal partitioning model (best K-segment model for a single penalty
parameter, without computing the models with 1,…,K-1 segments). The
goal of this test is to modify the code in the coseg
package, in
order to implement a solver for the optimal partitioning problem with
Poisson loss and no constraints. Begin by studying PeakSegFPOPLog.cpp
which implements the optimal partitioning model for the Poisson loss
and the up-down constraints. There are two states in this model, up
and down. Since the up-down constrained solver has two states, there
are N x 2 optimal cost functions to compute (cost_model_mat
is of
dimension data_count*2
). The cost of being in the up state at
data_i
is cost_model_mat[data_i]
and the cost of being in the down
state is cost_model_mat[data_i+data_count]
. The
min_prev_cost.set_to_min_less_of(down_cost_prev)
method enforces a
non-decreasing constraint between adjacent segment means, for the
state change down->up. Analogously, the
PiecewisePoissonLossLog::set_to_min_more_of
method enforces a
non-increasing constraint for the state change up->down. To implement
the unconstrained solver, you just need to implement a new
PiecewisePoissonLossLog::set_to_unconstrained_min_of
method that
computes the minimum constant cost function (one PoissonLossPieceLog
object), and then uses that to compute the N x 1 array of optimal cost
functions (cost_model_mat
). Read about the FPOP algorithm in
Maidstone et al 2016 for more info. When you are done with your
implementation, check your work by comparing with the output of
Segmentor3IsBack::Segmentor(model=1)
. Perform model selection
yourself for a range of penalty parameters. Using testthat
, write
some test cases which make sure your function gives the same exact
model as the corresponding selected Segmentor model.
Hard: there is not yet an regularized isotonic regression solver for
the normal loss (issue), and your goal in this test is to implement
one. Like the unconstrained model described the Medium test, the
regularized isotonic regression model also has only one state. So you
can start by modifying the Medium test code, which should have a
cost_model_mat
which is N x 1. However the isotonic regression
constraint means that all changes are non-decreasing, so you should
use set_to_min_less_of
instead of set_to_unconstrained_min_of
. Now
the difficult part: the existing code in the coseg
package
implements the Poisson loss via class PoissonLossPieceLog
, but you
need to implement another class for the Normal loss,
NormalLossPiece
. This class will need to declare different
coefficients Constant
, Linear
, Quadratic
for a function
f(mean)=Constant + Linear*mean + Quadratic*mean^2. You will need to
provide implementations for get_smaller_root
and get_larger_root
by using the sqrt
function in #include <math.h>
. It will be judged
even better if you can get PoissonLossPieceLog
and NormalLossPiece
to inherit from the same base class with shared methods (that is the
approach that the Segmentor package uses to implement several loss
functions, and that is the approach that will be recommended for this
GSOC project). Check your work by writing a testthat
unit test to
make sure that the model returned by your function with penalty=0 is
the same as the model returned by the isoreg
function (PAVA
algorithm). Write another test that checks that the output model
is the same as Fpop
(when all changes are non-decreasing).
- Students, please post a link to your test results here.