FSRS - Research suggests the minimum limit can be reduced to 16 #3094
@Expertium @user1823, what do you think? I will reproduce the result later.
We need to check the stability of the parameters too, i.e., the parameters shouldn't fluctuate too much when re-optimized after doing some more reviews.
I notice that the test set is only 1/4 the size of the train set. That's rare in practice, because users wouldn't optimize their parameters that frequently.
Bad news: I found a severe pitfall in the benchmark code while designing the experiment. I have to re-benchmark all models this week. Briefly, the benchmark code has a data-leakage problem: the test set is used to pick the best parameters during the training process.
Or do you mean optimizing the first 4 parameters? Is that what you mean by "pretrain"? I was thinking of the kind of pretrain that you did for neural networks.
Regarding the stability of parameters, we could introduce L1 or L2 regularization if n(reviews) < 1000 (or some other threshold). It's easy to do with PyTorch.
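To make the idea concrete, here is a minimal sketch assuming a PyTorch-style training loop. `model`, `batch`, the threshold, and the choice to penalize deviation from the default parameters (rather than from zero) are all illustrative assumptions, not the actual optimizer code:

```python
import torch
import torch.nn.functional as F

REG_THRESHOLD = 1000  # assumed cutoff: only regularize when n(reviews) is below this
L1_LAMBDA = 0.01      # penalty strength; would need tuning

def regularized_loss(model, batch, n_reviews, default_params):
    pred = model(batch["inputs"])
    loss = F.binary_cross_entropy(pred, batch["labels"])
    if n_reviews < REG_THRESHOLD:
        # Penalize deviation from the default parameters, so a small dataset
        # keeps the fit close to the known-good defaults instead of drifting.
        current = torch.cat([p.flatten() for p in model.parameters()])
        loss = loss + L1_LAMBDA * (current - default_params).abs().sum()
    return loss
```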
@L-M-Sherlock @user1823 I think there are 2 options:
Btw, I think 16 reviews is too little; I would recommend setting 100 as the new required minimum.
@Expertium @user1823 It also might be best to update the confidence interval calculation, if anyone knows the best way to do it mathematically. I'm currently taking a random 10% of the data and finding the mean, repeating that 200 times, and plotting the IQR (which shows the confidence interval of the mean; the IQR of the raw values has a different meaning). Interesting point about stability. Do you think it matters if the parameters change significantly, as long as doing so achieves a better prediction?
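For reference, a minimal reproduction of the resampling procedure described above (names and defaults are illustrative, not the benchmark's code):

```python
import numpy as np

def subsample_mean_iqr(values, frac=0.10, n_iter=200, seed=None):
    """Draw a random 10% subsample, take its mean, repeat 200 times,
    and return the IQR of those means as an interval for the mean."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    k = max(1, int(len(values) * frac))
    means = [rng.choice(values, size=k, replace=False).mean() for _ in range(n_iter)]
    return np.percentile(means, 25), np.percentile(means, 75)
```

A more conventional alternative is the percentile bootstrap: resample the full dataset with replacement and take, e.g., the 2.5th and 97.5th percentiles of the bootstrap means for a 95% interval.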
The specialized tool for calculating accurate confidence intervals is
The problem is that if the sample size is very small (and 16 is small no matter how you look at it, especially since FSRS has 17 parameters), it may lead to overfitting. The algorithm would perform really well on those 16 reviews, and very poorly in practice.
Gotcha, thanks, I'll give that research a try and see what the results are. We can confirm whether it overfits significantly with small review sizes regardless of the default FSRS loss, and what the minimum review count we can use is. I think pretrain has potential here, since it only optimises 4 parameters.
Optimised FSRS seems to perform better than the default values.
Personally, I do not think this needs changing. Lower sample sizes are more likely to be noisy, and users who are new to Anki may take a while to be consistent in their grading. Slightly increasing performance for the first few hundred reviews does not seem worth it.
Pretrain is robust even on small sample sizes. Plus, a lot of users have complained about the limit. @L-M-Sherlock what do you think? According to Rlz, pretrain is always better than using default parameters, and the L1 penalty ensures that parameters won't become too big/too small on small sample sizes.
@dae As an example I came across, there was a deck which did not perform well with the default parameters. The default initial stability was far too high, so the user was forgetting a large number of cards before the next review, and the parameters not changing caused this issue to continue for every card they made. Also, it's important to note that pretrain factors in the amount of data there is. If the user has not learnt many cards, it will account for this and make smaller changes to the parameters. That's the main deciding factor for us recommending pretrain for the first 1000 reviews, and why it performs so well in our analyses. It's very careful when adjusting the parameters with low sample sizes, yet it can bring significant improvements, especially in the worst case scenario where the assumptions of the default parameters do not hold.
Thanks for elaborating. The fact that it factors in the sample size is good. I think you've sold me.
@dae I think it's best if you give us some more time to figure out the details. While it's clear that pretrain can be helpful for n(reviews) < 1000, it's not clear whether there should still be a hard limit - such that below it, default parameters are used and no optimization happens at all - or whether the hard limit should be removed entirely.
Sorry, I meant I was withdrawing my objection, not suggesting we rush this change in.
@dae disregard what I said above; we found a rule that performs strictly better than using default parameters for any sample size. And your next phrase will be, "I am not fond of the amount of complexity this creates for a relatively small gain".
This rule is complicated, but it guarantees three things:
It's unlikely that this will be changed further, so unless you or @L-M-Sherlock have any objections, I recommend implementing this rule in the current version of Anki. Sorry that it's just barely before the deadline, but Rlz and I don't see a good reason to postpone it to the next release.
It's unnecessary to evaluate the default parameters. For all new users, the
Which loss? RMSE or log loss?
RMSE.
But that is true only for the very first optimization.
@Expertium, this conclusion is based on the RMSE. But, what about overfitting? Can we be sure that overfitting would not be a significant problem when reviews > 32? Just comparing the RMSE would not help in this case because RMSE would be low when overfitting occurs. A good way to test overfitting would be to split large datasets based on time and then see how the parameters trained from earlier revlogs perform when applied to later revlogs.
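As a rough illustration of that check, here is a hedged sketch; `optimize` and `evaluate_rmse` are placeholders for whatever routines the benchmark actually uses, and the revlog field name is an assumption:

```python
def forward_rmse(revlog, k, lookahead, optimize, evaluate_rmse):
    """Fit parameters on the earliest k reviews, then measure the error on the
    reviews that follow. A low training error combined with a high forward
    error would indicate overfitting."""
    revlog = sorted(revlog, key=lambda r: r["review_time"])  # chronological order (assumed field)
    train, test = revlog[:k], revlog[k:k + lookahead]
    params = optimize(train)
    return evaluate_rmse(params, test)
```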
https://github.com/open-spaced-repetition/srs-benchmark/blob/main/notebook/minimum_limit.ipynb
Conclusion: if the user optimizes parameters on 32 reviews, they will work well for the next 10 reviews, and even for the next 100 reviews. But not for the next 1000 reviews. In order to get a good test error on the next 1000 reviews, the dataset must be 100-200 reviews.
Whatever. Stats is not my field of expertise. But, I hope you got the idea.
I came to a similar conclusion. However, I think that the number should be less than 5n. Maybe 3n or even 2n?
That sounds good. Optimise every time your number of reviews has doubled/tripled, perhaps.
Ok, nevermind, those are, indeed, confidence intervals. My bad.
Sounds great, we can use pretrain up to 64 reviews and fully optimised thereafter. A nice feature of optimising every 2x reviews is that, as the deck becomes larger, we optimise less frequently.
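As an illustration only (the field names and thresholds here are hypothetical, not Anki's scheduler API; the `minimum` of 64 reflects the "pretrain up to 64, full optimization thereafter" proposal above), the "every 2x" rule could look like this:

```python
def should_reoptimize(current_reviews: int, reviews_at_last_optimization: int,
                      factor: float = 2.0, minimum: int = 64) -> bool:
    """Re-optimize once the review count has grown by `factor` since the last
    optimization; `minimum` is the proposed floor for full optimization."""
    if current_reviews < minimum:
        return False
    return current_reviews >= factor * reviews_at_last_optimization
```

Because the trigger is multiplicative, optimizations naturally become rarer as the collection grows.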
Looking at the graph (lookahead = 100), even pretrain should not be used before 64 reviews.
In the graph with lookahead = 100, the curves and the bands for
I agree. But, then this is valid for full optimization also. Edit:
The problem with full optimization is that, in theory, it's less stable than pretrain due to a lack of regularization, aka a penalty for making the parameters too large/too small. Though it doesn't actually show up in data, it seems.
Sounds good. In which case, do we agree that it's best to:
It sounds good. I would add that we shouldn't even allow pre-train before 32 reviews. The graph with lookahead = 25 shows that about 32 reviews are required to say with 95% certainty that the pretrain parameters would be better than the default ones for the next 25 reviews. This means that below 32 reviews, the user would need to optimize even more frequently than every 2x reviews. One question:
So 32 reviews for pretrain and 64 reviews for the full optimization?
Ok. We can remove the limits for pre-train and see how it goes. Because we are talking about a very small number of reviews, people will actually be able to share their experience with us during the beta cycle.
I'd rather not. FSRS is already confusing for most people. We should decrease the cognitive burden for the user, not increase it.
Pretrain outperforms default for a low number of reviews.
Maybe we are misunderstanding each other. I meant that there shouldn't be a special window, like "Are you sure you want to run the optimizer when the number of reviews is so low?"
Ah yes, I completely agree with you on that.
@user1823 here's the answer to your earlier question:
They don't add up to 7586 because Rlz used … Three conclusions from looking at the file:
@RlzHi, would you be interested in doing another analysis? Currently, we know that it is good to re-optimize the parameters after 2n reviews. But, is more frequent optimization worth it? For example, if I optimize the parameters on 100k reviews, should I think about re-optimizing them only after I have 200k reviews? Here is how you can go about it:
If the RMSE goes up significantly between n and 2n, we would know that it would be better to re-optimize the parameters in between.
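A rough sketch of that analysis might look like the following; `optimize` and `evaluate_rmse` are stand-ins for the benchmark's own routines, and the window size is an assumption:

```python
def rmse_between_n_and_2n(revlog, n, step, optimize, evaluate_rmse):
    """Optimize on the first n reviews, then track the error on successive
    windows of reviews between n and 2n. A steep rise before 2n would suggest
    re-optimizing more often than every 2x."""
    params = optimize(revlog[:n])
    points = []
    for start in range(n, 2 * n, step):
        window = revlog[start:start + step]
        points.append((start, evaluate_rmse(params, window)))
    return points
```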
@user1823 Sure, it may take some time to create this one, as higher values of n take longer to process. Will tag you when it's finished. Update: It's taking a very long time to complete; it seems to be because the optimisations I've added don't work when creating this graph (especially for large n). From what I'm seeing so far, re-optimising should be done at least every 3x, which seems in line with the previous graphs. Every 2x seems ideal (i.e. if the last optimisation was at 100 cards, it doesn't need to be optimised for the next 200), which has the nice effect of less training being needed as the number of reviews grows. Let me know if automatic optimisation is being considered and I'll see what I can do to generate these graphs.
Hey everyone,
We currently have a minimum of 1000 reviews for FSRS to be optimised, which I believe is now changing to 400. After some research and discussions with @L-M-Sherlock it seems we can lower this as far as 16, or remove it completely.
I've forked and updated srs-benchmark which contains the benchmarking code for this:
https://github.com/RlzHi/srs-benchmark/blob/main/notebook/metric_over_size.ipynb
The dry run is FSRS 4.5 with the default parameters (no optimisation), and after approximately 16 reviews, optimised FSRS performs better than the defaults.
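For readers who don't want to open the notebook, the comparison it makes is roughly of this shape (a hedged sketch; all names here are placeholders, not the notebook's actual code):

```python
def metric_over_size(revlog, sizes, holdout, optimize, evaluate_rmse, default_params):
    """For growing prefixes of a user's review history, compare the error of the
    default parameters (the "dry run") against freshly optimized ones, both
    evaluated on the reviews that follow the prefix."""
    results = []
    for n in sizes:
        train, test = revlog[:n], revlog[n:n + holdout]
        results.append((n,
                        evaluate_rmse(default_params, test),    # dry run
                        evaluate_rmse(optimize(train), test)))  # optimized
    return results
```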