-
-
Notifications
You must be signed in to change notification settings - Fork 138
The mechanism of optimization
FSRS is based on the DSR (Difficulty, Stability, Retrievability) memory model, training with Back Propagation Through Time and Maximum Likelihood Estimation.
The DSR model uses thirteen parameters and six equations to describe memory dynamics during spaced repetition practice (For details, see Free Spaced Repetition Scheduler).
Here is a brief introduction to the training process of FSRS.
The database schema of Anki's review logs is as follows (Copy from Database Structure):
-- revlog is a review history; it has a row for every review you've ever done!
CREATE TABLE revlog (
id integer primary key,
-- epoch-milliseconds timestamp of when you did the review
cid integer not null,
-- cards.id
usn integer not null,
-- update sequence number: for finding diffs when syncing.
-- See the description in the cards table for more info
ease integer not null,
-- which button you pushed to score your recall.
-- review: 1(wrong), 2(hard), 3(ok), 4(easy)
-- learn/relearn: 1(wrong), 2(ok), 3(easy)
ivl integer not null,
-- interval (i.e. as in the card table)
lastIvl integer not null,
-- last interval (i.e. the last value of ivl. Note that this value is not necessarily equal to the actual interval between this review and the preceding review)
factor integer not null,
-- factor
time integer not null,
-- how many milliseconds your review took, up to 60000 (60s)
type integer not null
-- 0=learn, 1=review, 2=relearn, 3=cram
);
A revlog records card cid reviewed at time id with rating ease. It is easy to trace the entire review history of a card. The memory state of a card depends on its review history. Group the revlogs by card, sort them by time and concatenate the ratings and intervals between each adjacent review; it can generate the history of ratings and history of intervals for each card. Here are the samples after pre-processing:
In machine learning terms, the histories of ratings and intervals are time-series features. I want to use this time-series data to train the DSR model. The inputs are the r_history and t_history, and the outputs are Stability and Difficulty. I built a time-series model via PyTorch to handle this data type, which could automatically apply Backpropagation through time (BPTT) to train this model.
There is another problem with training the model from the time-series data. One main goal of FSRS is to predict the memory state after a sequence of reviews. However, time-series data only records ratings instead of Stability or Difficulty. Stability is a statistical variable of a group of reviews with the same review history. Is it possible to estimate the stability of memory with the raw data? Maximum likelihood estimation (MLE) fits in this case.
According to the formula of the forgetting curve, the possibility of recall (
According to the MLE method,
Set
which is the loss function for updating
At this point, we then replace
$$P(r=1,t,\vec{r}{i-1},\vec{t}{i-1}|\vec\theta) = e^{-\frac{t}{DSR(\vec{r}{i-1},\vec{t}{i-1})}},$$
where
With MLE and BPTT, we can then train the parameters of DSR directly from the time-series data.
To get optimal retention, it needs to define "optimal". In my paper, the scheduling optimization is formulated to improve memory stability to the target level with minimum cost. For example, I need five reviews to stabilize a card to
In a simplified case, assume that algorithms are only based on retention. It is possible to enumerate each retention and calculate the expected cost, then pick the retention corresponding to the minimum cost. For details, please see the following pseudo-code:
The matrix
My representative paper at ACMKDD: A Stochastic Shortest Path Algorithm for Optimizing Spaced Repetition Scheduling
My fantastic research experience on spaced repetition algorithm: How did I publish a paper in ACMKDD as an undergraduate?
The largest open-source dataset on spaced repetition with time-series features: open-spaced-repetition/FSRS-Anki-20k