[Other] Let's use the FSRS Simulator to figure out how to implement Load Balancer #1
Honestly, I was suggesting what I suggested because it was an easier implementation 🍃
I implemented the simplest load balancer in the simulator. Here are my initial results. Simulator: https://github.com/open-spaced-repetition/load-balance-simulator/blob/main/notebook.ipynb (Plots: review counts per day with Load Balance enabled vs. disabled.) The real workload is smoother, but the true retention also drops.
Doing a weighted random seems to improve the retention, but the review count is a bit more erratic (though not as much as with no balancing). I'll try implementing this in Anki tomorrow and test it out for a bit. I ended up wasting the day failing to convince Anki that a plugin to do simulations was a reasonable thing it should be able to do.
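For reference, a minimal sketch of what such a weighted random pick could look like (hypothetical names; the actual notebook and Anki code may differ):

```python
import random

def weighted_random_balance(fuzz_range, due_counts):
    """Pick a day from the fuzz range, favoring days with fewer due cards."""
    # Weight each candidate day by the inverse of its due count
    # (+1 avoids division by zero on empty days).
    weights = [1.0 / (due_counts.get(day, 0) + 1) for day in fuzz_range]
    return random.choices(fuzz_range, weights=weights, k=1)[0]
```

For example, `weighted_random_balance([23, 24, 25, 26, 27], {23: 40, 24: 10, 25: 35, 26: 12, 27: 50})` would most often return day 24 or 26.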
@jakeprobst, thanks for your code! I added it to the notebook. To make our analyses more accurate, I quantified the volatility and average retention. Here is the result:
How I calculate the volatility:

```python
volatility = np.std(np.diff(review_card_per_day) / review_card_per_day[1:], ddof=1)
```
@L-M-Sherlock, can you test my idea as well? With your code, days in range is the full fuzz range. In my version, we sort the days by due count and pick among only the N days with the lowest counts.
Also, I don't think your method of calculating volatility is good. According to your method, [100, 125, 150] has lower volatility than [100, 115, 110]. I suggest replacing it with a metric that reflects the magnitude of the day-to-day changes, not just their consistency. And one more thing: please make a table with 4 scenarios: no Load Balancer, Simple Load Balancer, Weighted Random Load Balancer, and Restricted Weighted Random Load Balancer; the code for the last one is in my comment above.
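To make the criticism concrete, here is the quoted volatility formula applied to both example series (a quick check, not code from the notebook):

```python
import numpy as np

def volatility(review_card_per_day):
    review_card_per_day = np.asarray(review_card_per_day, dtype=float)
    return np.std(np.diff(review_card_per_day) / review_card_per_day[1:], ddof=1)

print(volatility([100, 125, 150]))  # ~0.024: steady growth is scored as "calm"
print(volatility([100, 115, 110]))  # ~0.124: mild fluctuation is scored as "volatile"
```

The metric measures how consistent the day-to-day changes are, not how large they are, which is why the steadily growing series wins.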
It seems that there may be an error (or bias) in this value. I can't think of any reason why the Restricted Weighted Random Load Balancer would have a lower volatility than the Simple Load Balancer.
Might be noise. @L-M-Sherlock, try increasing
I think we need to run more simulations with different random seeds and use a t-test.
Then run 20 of each (so that's 20 × 4 = 80 runs), record retention and volatility values, and post them here.
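A sketch of what that comparison could look like (the volatility values below are placeholders for illustration, not results from this thread):

```python
import numpy as np
from scipy.stats import ttest_ind

# Suppose these are volatility values recorded from 20 runs per variant
# (placeholder numbers, not actual measurements).
vol_simple = np.random.default_rng(0).normal(0.10, 0.01, 20)
vol_weighted = np.random.default_rng(1).normal(0.12, 0.01, 20)

# Two-sample t-test: is the difference in mean volatility significant?
t_stat, p_value = ttest_ind(vol_simple, vol_weighted)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")  # scientific notation for small p-values
```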
You should probably switch to scientific notation if p-values are so small. And you should add average retention values as well, not just average volatility.
I like it. Moreover, it will perform better at breaking up clusters of related cards than the simple version.
I don't like that it can schedule a card on a day with the highest due count of all days. That goes against the spirit of this feature.
I ran the simulations on my own, but with a tweak to my formula: I used ceil instead of floor.
Ranking based on volatility (from best to worst): Simple > Restricted (floor) > Restricted (ceil) > Weighted > None. You may have noticed that the rankings for volatility and retention are exact opposites of each other: you can obtain one just by reversing the order of the other. So it doesn't seem like we can get both desirable properties - same retention but low volatility - simultaneously. I would suggest using Restricted (ceil), simply because it's right in the middle of both rankings. A compromise.
Are we trying to solve the retention problem or the clustering problem, or both at the same time?
Ideally, we want retention to be unaffected and the load to be very consistent. But we can't have both.
Can we not just schedule newer cards differently, i.e., a bit earlier, to account for this?
Well, say hello to Double Weighted Random Load Balancer:
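The notebook code isn't reproduced here, but based on the description that follows, a sketch could look like this (assuming a hypothetical `retrievability(card, day)` helper; all names are illustrative):

```python
import random

def double_weighted_balance(card, fuzz_range, due_counts, retrievability):
    # Weight each day by (1 / due count) * retrievability on that day.
    # Shorter intervals mean higher retrievability, hence higher weight.
    weights = [
        retrievability(card, day) / (due_counts.get(day, 0) + 1)
        for day in fuzz_range
    ]
    return random.choices(fuzz_range, weights=weights, k=1)[0]
```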
This makes it so that the probabilities depend not just on the number of due cards, but also on interval lengths. This will systematically drag retention up, since shorter intervals result in higher probabilities. Results:
Ok, not gonna lie, I was NOT expecting this to actually work.
Seems that the Double Weighted LB is one of the best compromises yet. I encourage you to try even more tweaks.
Rather than just the interval itself (which is going to bias a bit too hard toward earlier days?), why not use the distance from the target interval?
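One way to read that suggestion, as a sketch (hypothetical code, not anyone's confirmed implementation):

```python
import random

def distance_weighted_balance(fuzz_range, due_counts, target_day):
    # Favor days close to the target interval and days with fewer cards due.
    weights = [
        1.0 / ((abs(day - target_day) + 1) * (due_counts.get(day, 0) + 1))
        for day in fuzz_range
    ]
    return random.choices(fuzz_range, weights=weights, k=1)[0]
```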
Ok, here's one final tweak: I raised r to the power of 2.
The output is getting too long, so I'm not including p-values, except for two.
The retention is still higher with Double Weighted LB with r squared than with no LB, 88.7% vs 88.6%; meanwhile volatility is definitely lower. In fact, volatility is almost as low as with my restricted method. Edit: here are all modes, sorted by retention (higher is better):
And by volatility (lower is better):
Double Weighted Random Load Balancer with r raised to the power of 2 is among the top 3 modes in both lists.
I'd like to see the numbers for
Maybe pow(2) it as well, for good measure. Double weighted load balance does look pretty solid though. I was also experimenting with larger powers, like 10, to sort of accentuate the actual load-balance aspect of all this. Unsure how I feel about this currently.
You need to add an extra condition for when
Since @L-M-Sherlock appears to be busy, I have implemented the idea above myself. Here are the results. Lower clustering score = better. A score of 1 means that all intervals are exactly the same; the closer the value is to 0, the more unique intervals there are. For example, for [1, 1, 1, 2, 2, 3], the clustering score is 1/3, since there are 3 unique values.
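From that description, the score appears to be the reciprocal of the number of unique interval lengths (a reconstruction, not necessarily the notebook's exact code):

```python
def clustering_score(intervals):
    # 1 if all intervals are identical; approaches 0 as more intervals are unique.
    return 1 / len(set(intervals))

print(clustering_score([1, 1, 1, 2, 2, 3]))  # 1/3, since there are 3 unique values
```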
For clustering, all p-values are > 0.05, meaning that all variants perform about the same.
Maybe we need a different way to estimate the amount of clustering. I used ChatGPT to get some ideas. They seem to be worth looking into. ChatGPT output:

To get a single value that summarizes how often any two cards are reviewed on the same date, you can use aggregate measures that capture the overall tendency of event co-occurrence. Here are a few approaches to compute such a value:

1. Co-occurrence Ratio. Calculate a single co-occurrence ratio that summarizes the frequency with which any two events occur together relative to their individual frequencies.

2. Jaccard Similarity Index. The Jaccard similarity index measures the similarity between two sets and can be adapted for events: J(A, B) = |A ∩ B| / |A ∪ B|, where A and B are the sets of dates on which the two events occur. For multiple events, you can generalize this by averaging the pairwise Jaccard indices.

3. Dice Coefficient. The Dice coefficient is another measure of similarity that can be used for co-occurrence: D(A, B) = 2|A ∩ B| / (|A| + |B|).

4. Average Pairwise Co-occurrence. Calculate the average number of co-occurrences across all pairs of events: sum the co-occurrence counts over all pairs, then divide by the number of pairs.

5. Correlation of Event Presence. Compute the average pairwise correlation between the events' presence/absence indicators.
Example using Python. Here's an example of how to compute the Average Pairwise Co-occurrence:

```python
import numpy as np
import pandas as pd

# Sample data: one row per date, one 0/1 column per event
data = {
    'Date': ['2024-08-01', '2024-08-02', '2024-08-03'],
    'Event 1': [1, 1, 0],
    'Event 2': [0, 1, 1],
    'Event 3': [1, 0, 1]
}
df = pd.DataFrame(data)

# Co-occurrence calculation: for each pair of events, count the dates
# on which both occurred
events = ['Event 1', 'Event 2', 'Event 3']
co_occurrence_matrix = pd.DataFrame(index=events, columns=events, data=0)
for event1 in co_occurrence_matrix.index:
    for event2 in co_occurrence_matrix.columns:
        if event1 != event2:
            co_occurrence_matrix.loc[event1, event2] = ((df[event1] == 1) & (df[event2] == 1)).sum()

# Average Pairwise Co-occurrence: sum the upper triangle of the matrix
# and divide by the number of distinct pairs
num_pairs = len(co_occurrence_matrix.index) * (len(co_occurrence_matrix.index) - 1) / 2
total_co_occurrences = co_occurrence_matrix.where(np.triu(np.ones_like(co_occurrence_matrix, dtype=bool), k=1)).stack().sum()
average_pairwise_co_occurrence = total_co_occurrences / num_pairs

print("Average Pairwise Co-occurrence:", average_pairwise_co_occurrence)
```

In this script, each event column marks the dates on which that event occurred, the matrix counts shared dates for every pair of events, and the final value averages those counts over all pairs.
Choose the method that best fits your data and goals. Each method provides a different perspective on how often events co-occur.
How is retention higher for double weighted? You said this might be some error; is it an error? Also, about clustering: it seems you're counting the number of unique intervals among the cards due on a particular date - for example, among all the cards scheduled for today, how many have unique intervals. How about "among all the cards rated today, how many have unique intervals"? Edit: ngl, ChatGPT can be a boon to creativity.
Later, once I have a good metric for clustering, I'll re-run everything with 150 samples instead of 20, to get a definitive answer. For now, I'm thinking about how to quantify clustering.
Pairwise comparisons sound like a huge pain to implement, not to mention the computational complexity. If there are 10,000 reviews, that's (10,000 × 9,999) / 2 = 49,995,000 pairs. Even if I could implement it, it would likely slow the simulation down. But I'll see what I can do.
Ok, I figured out a way of doing this. I'm not saying I figured out a good way, though.
The matrix gets updated after a new day starts + after each review. It contains information about every review of every card on every day. And then, after the simulation is done, there is this monstrosity. It's a for loop inside a for loop inside a for loop:
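The loop itself isn't shown here, but from the description (a cards × days review matrix, scanned pairwise after the simulation), it presumably resembled this sketch (toy data and hypothetical names):

```python
import numpy as np

# review_matrix[c, d] = 1 if card c was reviewed on day d (toy data for illustration)
rng = np.random.default_rng(42)
review_matrix = (rng.random((50, 90)) < 0.05).astype(int)
n_cards, n_days = review_matrix.shape

# Count, for every pair of cards, the days on which both were reviewed
total_co_occurrences = 0
for card1 in range(n_cards):                  # loop 1: first card
    for card2 in range(card1 + 1, n_cards):   # loop 2: second card
        for day in range(n_days):             # loop 3: every day
            if review_matrix[card1, day] and review_matrix[card2, day]:
                total_co_occurrences += 1

num_pairs = n_cards * (n_cards - 1) / 2
print(total_co_occurrences / num_pairs)
```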
This loop takes longer than the simulation itself. This is likely the most inefficient method possible, but oh well. Here are the results:
The amount of clustering is lower with the simple LB than with no LB (p-value = 4.03E-16). I don't have an explanation for that. And while the difference is statistically significant, it's not practically significant: all numbers are close to 0.01.
I ran it again, but limited the calculation to the first 15 days instead of all 90.
Same results again. Somehow, simple LB results in lower clustering than no LB. So both methods - based on the number of unique interval lengths and based on counting co-occurrences - tell us that clustering is the same across variants, and whether I limit the window to 15 days or not has no effect either.
I also tried the Jaccard Similarity Coefficient.
It's similar to counting co-occurrences; the big difference is that, unlike last time, "no review of card 1 and no review of card 2" on the same day doesn't count towards the similarity - Jaccard ignores days on which neither card was reviewed.
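For reference, a per-pair Jaccard over the same kind of day-indexed review data could look like this (a sketch, not the notebook's code):

```python
import numpy as np

def jaccard_similarity(a, b):
    # a, b: 0/1 arrays, one entry per day, marking when each card was reviewed
    a, b = np.asarray(a), np.asarray(b)
    both = np.sum((a == 1) & (b == 1))    # days on which both cards were reviewed
    either = np.sum((a == 1) | (b == 1))  # days on which at least one was reviewed
    return both / either if either else 0.0
```

Days on which neither card was reviewed contribute to neither the numerator nor the denominator.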
With all 3 methods - unique intervals, co-occurrences, Jaccard - the averages of the different variants of LB are very close, and without a statistical significance test, it's not obvious whether the tiny differences can be attributed to noise or not. With the first method, the differences are not statistically significant. With the latter two, most differences are significant, but the results are counter-intuitive.
It seems that intervals shorter than 2.5 days bypass the load balancer entirely:

```python
if delta_t < 2.5 or not enable_load_balance:
    return delta_t
```

But this doesn't completely explain the results. For example,
You're right. I'll fix that later.
This assigns equal probabilities to each interval. I also renamed "No Load Balancer" to "No Load Balancer (fuzz)", for the sake of clarity. For clustering, I'm using Jaccard, lower = better.
The results are almost the same as previously.
@jakeprobst @L-M-Sherlock sorry for the trouble, but I advise you to read all comments made within the last 2 days.
The simple load balancer doing the best with clustering is very surprising, to the point where I feel something is wrong somewhere? I'm not a math guy, so I can't add any meaningful input beyond saying that it just feels intuitively wrong.
The problem is that all 3 methods of estimating clustering output very similar average values for every variant of LB, and the differences are very small.
With the first method, all values are very close to 0.15. With the second method, all values are very close to 0.011. With the third method, all values are very close to 0.055. So either all methods of measuring clustering are bad, or clustering is not an issue and we are worried for no reason.
Can you do this on a real collection? Because it was vaibhav who brought it up first.
It would take months.
He specifically talked about new cards, so he excluded cards that went to relearning, and he was talking about a rather small time frame, probably 7 days.
I don't really see a reason to do so when we have simulations. We just don't know how to measure what we want. Or we're worried about a non-existent problem, which is also possible.
I agree with you, but in that comment I was saying clustering might decrease because you were considering a 15-day time interval in the simulation. @dae, you should look at this; it's possible clustering is a non-existent problem.
To clear up any confusion: the simulation is always 90 days, and I tried measuring clustering both across all days and across just the first 15 days; the results were pretty much the same - all variants of LB perform almost the same.
I think we should dive into the details. I drew the histograms of last dates from day 1 to day 30:
What's your interpretation?
@L-M-Sherlock, did you add my line?
Because if not, then this isn't a proper comparison.
I ran each variant of LB with 150 samples, to determine how much they affect retention. Here are the results:
All p-values are significant; no variant performs exactly the same as another. So yeah, double weighted makes retention ever so slightly lower. But while the difference is statistically significant, in practice nobody will be able to tell 88.8% and 88.7% apart based solely on personal experience. Next, I also tried a fourth method of estimating clustering: simply record every single interval length of every card throughout the simulation, calculate the standard deviation, and take the inverse (1/x) of it, so that lower = better. I also removed all intervals equal to 0.
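A sketch of that fourth metric (a reconstruction of the description above, not the notebook's actual code):

```python
import numpy as np

def inverse_interval_std(all_intervals):
    # Record every interval length over the whole simulation, drop zeros,
    # then invert the standard deviation so that lower = better (more spread).
    intervals = np.asarray([ivl for ivl in all_intervals if ivl != 0], dtype=float)
    return 1.0 / np.std(intervals)
```

Here are the results, 20 samples: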
As you can see, it's the same picture again, for the fourth time: all values are very close, and no variant of LB is very different from any other based on this metric. If we sort by these 4 different metrics, we get very different rankings, so we can't say "Well, let's just pick whichever variant of LB results in the lowest value of clustering according to all 4 metrics". Calculating these metrics based on all 90 days or only 15 days makes no difference either. In any case, all variants of LB are practically the same, regardless of which metric you are looking at.
Squared count seems to work better, both in terms of making retention closer to fuzz and in terms of reducing volatility. Based on retention and volatility, Double Weighted is the best.
So, I think it would be better to see how it works in the real world. The next Anki update will likely be a big one with a long beta cycle, so beta testers will be able to share their experiences before the stable release. Let's go with the Double Weighted one and add an option to disable the Load Balancer (even if just in the Debug Console) for people who run into problems.
Can I close this issue? I think the conclusion has been drawn. Btw, I wrote a notebook to simulate Easy Days here: https://github.com/open-spaced-repetition/easy-days-simulator/blob/main/notebook.ipynb I recommend opening a new issue to discuss it.
Yeah, close it.
@L-M-Sherlock @jakeprobst @user1823 @brishtibheja
Right now, there are two "competing" ideas regarding Load Balancing. Everyone agrees that making the probability of scheduling a card on a specific day inversely proportional to how many due cards that day has is a good idea. However, Jake and I disagree on the details, and it's not clear whose idea is better.
Jake proposes N = days in range, where N is the number of candidate days with the lowest due counts. In other words, his idea is to allow the Load Balancer to schedule a card on any day within the fuzz range, just with different non-zero probabilities.
My idea: N = max(floor(days in range / 2), 2). This ensures that cards can only be scheduled on days with a relatively low due count and cannot be scheduled on a day with the highest (or one of the highest) due count.
It may not be immediately clear what the difference is, so here's a visualization:
Suppose the fuzz range for a card includes the following days: [23, 24, 25, 26, 27], so days in range = 5. Jake's range and the fuzz range are the same; the difference is that in Jake's version, the probability of a card falling on a particular day isn't the same for every day. My range is narrower: in this example, it only includes the two days with the lowest due counts. The probabilities for the other 3 days are 0. This avoids clustering while staying true to the spirit of this feature, which is "do not schedule a card on a day that already has a lot of cards". See the sketch below.
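As an illustration (hypothetical code, not from either implementation), the two ideas differ only in which candidate days keep a non-zero weight:

```python
import math
import random

def pick_day(fuzz_range, due_counts, restricted=False):
    days = sorted(fuzz_range, key=lambda d: due_counts.get(d, 0))
    if restricted:
        # My idea: keep only the N days with the lowest due counts.
        n = max(math.floor(len(days) / 2), 2)
        days = days[:n]
    # Jake's idea (and the shared part): weight inversely by due count.
    weights = [1.0 / (due_counts.get(d, 0) + 1) for d in days]
    return random.choices(days, weights=weights, k=1)[0]

# With due counts {23: 40, 24: 10, 25: 35, 26: 12, 27: 50}, the restricted
# version only ever picks day 24 or 26; Jake's version can pick any of the five.
```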
In order to test which idea is better, we could ask Jake to run both on his collection, but that would take months. Instead, Jarrett, I would like you to collaborate with Jake to incorporate the Load Balancer into your Simulator, so we can test these ideas in a simulation.