
[Other] Let's use the FSRS Simulator to figure out how to implement Load Balancer #1

Closed
Expertium opened this issue Aug 6, 2024 · 59 comments


Expertium commented Aug 6, 2024

@L-M-Sherlock @jakeprobst @user1823 @brishtibheja
Right now, there are two "competing" ideas regarding Load Balancing. Everyone agrees that making the probability of scheduling a card on a specific day inversely proportional to that day's due count is a good idea. However, Jake and I disagree on the details, and it's not clear whose idea is better.
Jake proposes N = days in range, where N is the number of days that remain eligible for scheduling. In other words, his idea is to allow Load Balancer to schedule a card on any day within the fuzz range, just with different non-zero probabilities.
My idea: N = max(floor(days in range / 2), 2). This ensures that cards can only be scheduled on days with a relatively low due count and cannot be scheduled on a day with the highest (or one of the highest) due count.
It may not be immediately clear what the difference is, so here's a visualization:
(image: visualization of the two Load Balancer proposals)

Suppose the fuzz range for a card includes the following days: [23, 24, 25, 26, 27], so days in range = 5. Jake's range and the fuzz range are the same; the difference is that in Jake's version the probability of a card falling on a particular day isn't the same for every day. My range is narrower: in this example it only includes the two days with the lowest due counts, and the probabilities for the other 3 days are 0. This avoids clustering while staying true to the spirit of this feature, which is "do not schedule a card on a day that already has a lot of cards".
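To make this concrete, here's a small sketch of the probabilities each idea assigns across that fuzz range (the due counts are made up for illustration):

import numpy as np

possible_intervals = np.asarray([23, 24, 25, 26, 27])  # the fuzz range
review_counts = np.asarray([30, 12, 25, 28, 14])       # hypothetical due counts

# Jake's idea: every day in the range stays eligible, weighted by 1/due_count
jake_weights = 1.0 / review_counts
jake_probs = jake_weights / jake_weights.sum()

# My idea: only the N days with the lowest due counts stay eligible
N = max(int(np.floor(len(possible_intervals) / 2)), 2)  # N = 2 here
keep = np.argsort(review_counts)[:N]                    # days 24 and 27
restricted_probs = np.zeros(len(possible_intervals))
restricted_probs[keep] = jake_weights[keep] / jake_weights[keep].sum()

print(np.round(jake_probs, 3))        # non-zero everywhere, biased toward light days
print(np.round(restricted_probs, 3))  # zero everywhere except days 24 and 27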

In order to test which idea is better, we could ask Jake to run both on his collection, but that would take months. Instead, Jarrett, I would like you to collaborate with Jake to incorporate Load Balancer into your Simulator, so we can test these ideas in simulation.

@jakeprobst

Honestly I was suggesting what I suggested cause it was an easier implementation 🍃

@L-M-Sherlock

I implemented the simplest load balancer in the simulator. Here are my initial results.

Simulator: https://github.com/open-spaced-repetition/load-balance-simulator/blob/main/notebook.ipynb

Load Balance: enabled

(images: simulated review count and retention per day)

Load Balance: disabled

(images: simulated review count and retention per day)

The workload is smoother, but the true retention also drops.


jakeprobst commented Aug 8, 2024

-    delta_t = possible_intervals[np.argmin(review_cnts)]
+    inv = [1 if r == 0 else 1/r for r in review_cnts]
+    delta_t = random.choices(possible_intervals, weights=inv)[0]

(plots: weighted_review_per_day, weighted_retention_per_day)

Doing a weighted random seems to improve the retention, but the review count is a bit more erratic (though not as much as with no balancing).

I'll try implementing this in anki tomorrow and trying it out for a bit. I ended up wasting the day failing to convince anki that a plugin to do simulations was a reasonable thing it should be able to do.


L-M-Sherlock commented Aug 8, 2024

@jakeprobst, thanks for your code! I added it to the notebook.

To make our analyses more accurate, I quantified the volatility and the average retention.

Here is the result:

|  | No Load Balancer | Simple Load Balancer | Weighted Random Load Balancer |
| --- | --- | --- | --- |
| Retention | 0.886 | 0.876 | 0.885 |
| Volatility | 0.200 | 0.152 | 0.166 |

How I calculate the volatility:

volatility = np.std(np.diff(review_card_per_day) / review_card_per_day[1:], ddof=1)


Expertium commented Aug 8, 2024

@L-M-Sherlock can you test my idea as well?
N = max(floor(days in range / 2), 2)

With your code, days in range is possible_intervals. Here's an example of what I expect the code to look like:

import random
import numpy as np

possible_intervals = np.asarray([18, 19, 20, 21, 22])
review_counts = np.asarray([29, 25, 26, 23, 24])

N = max(int(np.floor(len(possible_intervals) / 2)), 2)  # N days with the lowest due count
sorter = np.argsort(review_counts)  # sort by review counts
sorted_counts = review_counts[sorter]
sorted_intervals = possible_intervals[sorter]
possible_intervals = sorted_intervals[:N]
inv = [1 if r == 0 else 1/r for r in sorted_counts[:N]]
delta_t = random.choices(possible_intervals, weights=inv)[0]

We sort possible_intervals and review_counts by the number of reviews, and then change possible_intervals so that it's a slice of sorted_intervals, with only N days being selected. And then we calculate inv based only on the selected days. In my example, only intervals 21 and 22 are possible, since they have the lowest number of due cards, so we randomly select between those days.


Expertium commented Aug 8, 2024

Also, I don't think your method of calculating volatility is good. According to your method, [100, 125, 150] has lower volatility than [100, 115, 110]. I suggest replacing np.std with np.mean and calculating the absolute difference.
volatility = np.mean(np.abs(np.diff(review_card_per_day)) / review_card_per_day[1:])
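For illustration, here are both definitions applied to the two example series (a quick sketch):

import numpy as np

def volatility_std(x):
    x = np.asarray(x, dtype=float)
    return np.std(np.diff(x) / x[1:], ddof=1)

def volatility_mean_abs(x):
    x = np.asarray(x, dtype=float)
    return np.mean(np.abs(np.diff(x)) / x[1:])

print(volatility_std([100, 125, 150]), volatility_std([100, 115, 110]))            # ~0.024 vs ~0.124
print(volatility_mean_abs([100, 125, 150]), volatility_mean_abs([100, 115, 110]))  # ~0.183 vs ~0.088

The std-based metric only measures how variable the day-to-day changes are, so a steadily growing workload looks perfectly calm; the mean-absolute version penalizes the size of the changes themselves.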

And one more thing: please make a table with 4 scenarios: no Load Balancer, Simple Load Balancer, Weighted Random Load Balancer, and Restricted Weighted Random Load Balancer, the code for the last one is in my comment above.

@L-M-Sherlock

|  | No Load Balancer | Simple Load Balancer | Weighted Random Load Balancer | Restricted Weighted Random Load Balancer |
| --- | --- | --- | --- | --- |
| Retention | 0.884 | 0.874 | 0.884 | 0.878 |
| Volatility | 0.153 | 0.094 | 0.118 | 0.091 |


user1823 commented Aug 8, 2024

> 0.091

It seems that there may be an error (or bias) in this value. I can't think of any reason why the Restricted Weighted Random Load Balancer would have a lower volatility than the Simple Load Balancer.


Expertium commented Aug 8, 2024

Might be noise. @L-M-Sherlock try increasing deck_size to 20000, and set review_limits to 9999.

@L-M-Sherlock

I think we need to run more simulations with different random seeds and use a t-test.


Expertium commented Aug 8, 2024

Then run 20 of each (so that's 20*4=80 runs), record retention and volatility values, and post them here.
Alternatively, give me the code (.py, not .ipynb, please), and I'll do it myself


L-M-Sherlock commented Aug 8, 2024

volatility mean and std:
                                             mean       std
mode                                                       
No Load Balancer                          0.13395  0.009254
Restricted Weighted Random Load Balancer  0.09020  0.007736
Simple Load Balancer                      0.08130  0.005526
Weighted Random Load Balancer             0.10735  0.009281

t-test:
No Load Balancer vs Simple Load Balancer:
t-statistic: 21.85
p-value: 4.10E-23

No Load Balancer vs Weighted Random Load Balancer:
t-statistic: 9.08
p-value: 4.69E-11

No Load Balancer vs Restricted Weighted Random Load Balancer:
t-statistic: 16.22
p-value: 1.14E-18

Simple Load Balancer vs Weighted Random Load Balancer:
t-statistic: -10.79
p-value: 4.00E-13

Simple Load Balancer vs Restricted Weighted Random Load Balancer:
t-statistic: -4.19
p-value: 1.62E-04

Weighted Random Load Balancer vs Restricted Weighted Random Load Balancer:
t-statistic: 6.35
p-value: 1.90E-07


retention mean and std:
                                             mean       std
mode                                                       
No Load Balancer                          0.88610  0.000852
Restricted Weighted Random Load Balancer  0.87950  0.001051
Simple Load Balancer                      0.87565  0.000875
Weighted Random Load Balancer             0.88515  0.000745

t-test:
No Load Balancer vs Simple Load Balancer:
t-statistic: 38.26
p-value: 5.96E-32

No Load Balancer vs Weighted Random Load Balancer:
t-statistic: 3.75
p-value: 5.83E-04

No Load Balancer vs Restricted Weighted Random Load Balancer:
t-statistic: 21.81
p-value: 4.35E-23

Simple Load Balancer vs Weighted Random Load Balancer:
t-statistic: -36.96
p-value: 2.13E-31

Simple Load Balancer vs Restricted Weighted Random Load Balancer:
t-statistic: -12.59
p-value: 3.96E-15

Weighted Random Load Balancer vs Restricted Weighted Random Load Balancer:
t-statistic: 19.61
p-value: 1.79E-21


Expertium commented Aug 8, 2024

You should probably switch to scientific notation if p-values are so small. And you should add average retention values as well, not just average volatility.
EDIT: ok, I think that's all the info we need. So here's the ranking based on average volatility: Simple > Restricted > Weighted > None. And based on average retention, None > Weighted > Restricted > Simple.
So we have to choose either Restricted or Weighted.


user1823 commented Aug 8, 2024

I like Weighted Random Load Balancer more because it produces almost the same retention as no load balancing (88.6 vs 88.5%) while also significantly decreasing the volatility.

Moreover, it will perform better in breaking up clusters of related cards than the Restricted one.

@Expertium

I don't like that it can schedule a card on a day with the highest due count of all days. That goes against the spirit of this feature.

@Expertium

I ran the simulations on my own, but with a tweak to my formula: I used np.ceil instead of np.floor, which increases N by 1, making the restricted load balancer less... well, restrictive.
Here are the results:

volatility mean and std:
                                                     mean       std
mode                                                               
No Load Balancer                                  0.13395  0.009254
Restricted Weighted Random Load Balancer (ceil)   0.09440  0.008042
Restricted Weighted Random Load Balancer (floor)  0.09020  0.007736
Simple Load Balancer                              0.08130  0.005526
Weighted Random Load Balancer                     0.10735  0.009281

t-test:
Restricted Weighted Random Load Balancer (ceil) vs No Load Balancer:
t-statistic: -14.43
p-value: 5.33E-17

Restricted Weighted Random Load Balancer (ceil) vs Simple Load Balancer:
t-statistic: 6.00
p-value: 5.62E-07

Restricted Weighted Random Load Balancer (ceil) vs Weighted Random Load Balancer:
t-statistic: -4.72
p-value: 3.21E-05

Restricted Weighted Random Load Balancer (ceil) vs Restricted Weighted Random Load Balancer (floor):
t-statistic: 1.68
p-value: 1.01E-01

No Load Balancer vs Simple Load Balancer:
t-statistic: 21.85
p-value: 4.10E-23

No Load Balancer vs Weighted Random Load Balancer:
t-statistic: 9.08
p-value: 4.69E-11

No Load Balancer vs Restricted Weighted Random Load Balancer (floor):
t-statistic: 16.22
p-value: 1.14E-18

Simple Load Balancer vs Weighted Random Load Balancer:
t-statistic: -10.79
p-value: 4.00E-13

Simple Load Balancer vs Restricted Weighted Random Load Balancer (floor):
t-statistic: -4.19
p-value: 1.62E-04

Weighted Random Load Balancer vs Restricted Weighted Random Load Balancer (floor):
t-statistic: 6.35
p-value: 1.90E-07

retention mean and std:
                                                     mean       std
mode                                                               
No Load Balancer                                  0.88610  0.000852
Restricted Weighted Random Load Balancer (ceil)   0.88140  0.000995
Restricted Weighted Random Load Balancer (floor)  0.87950  0.001051
Simple Load Balancer                              0.87565  0.000875
Weighted Random Load Balancer                     0.88515  0.000745

t-test:
Restricted Weighted Random Load Balancer (ceil) vs No Load Balancer:
t-statistic: -16.05
p-value: 1.63E-18

Restricted Weighted Random Load Balancer (ceil) vs Simple Load Balancer:
t-statistic: 19.41
p-value: 2.56E-21

Restricted Weighted Random Load Balancer (ceil) vs Weighted Random Load Balancer:
t-statistic: -13.49
p-value: 4.52E-16

Restricted Weighted Random Load Balancer (ceil) vs Restricted Weighted Random Load Balancer (floor):
t-statistic: 5.87
p-value: 8.56E-07

No Load Balancer vs Simple Load Balancer:
t-statistic: 38.26
p-value: 5.96E-32

No Load Balancer vs Weighted Random Load Balancer:
t-statistic: 3.75
p-value: 5.83E-04

No Load Balancer vs Restricted Weighted Random Load Balancer (floor):
t-statistic: 21.81
p-value: 4.35E-23

Simple Load Balancer vs Weighted Random Load Balancer:
t-statistic: -36.96
p-value: 2.13E-31

Simple Load Balancer vs Restricted Weighted Random Load Balancer (floor):
t-statistic: -12.59
p-value: 3.96E-15

Ranking based on volatility (from best to worst): Simple > Restricted (floor) > Restricted (ceil) > Weighted > None
Ranking based on retention (from best to worst): None > Weighted > Restricted (ceil) > Restricted (floor) > Simple

You may have noticed that the rankings for volatility and retention are the exact opposites of each other. You can obtain one just by reversing the order of the other one. So it doesn't seem like we can get both desirable properties - same retention but low volatility - simultaneously. I would suggest using Restricted (ceil) simply because it's right in the middle of both rankings. A compromise.
I know that choosing something simply because it's right in the middle isn't the best kind of reasoning, but again, based on these results, it seems like we can't have both good properties, so a compromise is necessary.


brishtibheja commented Aug 8, 2024

Are we trying to solve the retention problem or the clustering problem or both at the same time?


Expertium commented Aug 8, 2024

Ideally, we want retention to be unaffected and the load to be very consistent. But we can't have both.

@brishtibheja

Can we not just schedule newer cards differently, i.e., a bit earlier, to account for this?


Expertium commented Aug 8, 2024

Well, say hello to Double Weighted Random Load Balancer:

            # schedule on days with lower load AND shorter intervals
            inv = [1 if r == 0 else (1 / r) * (1 / delta_t) for r, delta_t in zip(review_cnts, possible_intervals)]
            delta_t = random.choices(possible_intervals, weights=inv)[0]

This makes it so that the probabilities depend not just on the number of due cards, but also on interval lengths. This will systematically drag retention up, since shorter intervals result in higher probabilities.
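As a toy illustration (made-up numbers; the load is equal on every day, so only the interval factor matters):

import numpy as np

possible_intervals = np.asarray([23, 24, 25, 26, 27])
review_cnts = np.asarray([20, 20, 20, 20, 20])  # equal load on every day

inv = [1 if r == 0 else (1 / r) * (1 / d) for r, d in zip(review_cnts, possible_intervals)]
probs = np.asarray(inv) / np.sum(inv)
print(probs.round(3))  # [0.217 0.208 0.199 0.192 0.185] -> shorter intervals favored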

Results:

volatility mean and std:
                                                     mean       std
mode                                                               
Double Weighted Random Load Balancer              0.10330  0.007901
No Load Balancer                                  0.13395  0.009254
Restricted Weighted Random Load Balancer (ceil)   0.09440  0.008042
Restricted Weighted Random Load Balancer (floor)  0.09020  0.007736
Simple Load Balancer                              0.08130  0.005526
Weighted Random Load Balancer                     0.10735  0.009281

t-test:
Restricted Weighted Random Load Balancer (ceil) vs No Load Balancer:
t-statistic: -14.43
p-value: 5.33E-17

Restricted Weighted Random Load Balancer (ceil) vs Simple Load Balancer:
t-statistic: 6.00
p-value: 5.62E-07

Restricted Weighted Random Load Balancer (ceil) vs Weighted Random Load Balancer:
t-statistic: -4.72
p-value: 3.21E-05

Restricted Weighted Random Load Balancer (ceil) vs Restricted Weighted Random Load Balancer (floor):
t-statistic: 1.68
p-value: 1.01E-01

Restricted Weighted Random Load Balancer (ceil) vs Double Weighted Random Load Balancer:
t-statistic: -3.53
p-value: 1.11E-03

No Load Balancer vs Simple Load Balancer:
t-statistic: 21.85
p-value: 4.10E-23

No Load Balancer vs Weighted Random Load Balancer:
t-statistic: 9.08
p-value: 4.69E-11

No Load Balancer vs Restricted Weighted Random Load Balancer (floor):
t-statistic: 16.22
p-value: 1.14E-18

No Load Balancer vs Double Weighted Random Load Balancer:
t-statistic: 11.26
p-value: 1.13E-13

Simple Load Balancer vs Weighted Random Load Balancer:
t-statistic: -10.79
p-value: 4.00E-13

Simple Load Balancer vs Restricted Weighted Random Load Balancer (floor):
t-statistic: -4.19
p-value: 1.62E-04

Simple Load Balancer vs Double Weighted Random Load Balancer:
t-statistic: -10.20
p-value: 1.94E-12

Weighted Random Load Balancer vs Restricted Weighted Random Load Balancer (floor):
t-statistic: 6.35
p-value: 1.90E-07

Weighted Random Load Balancer vs Double Weighted Random Load Balancer:
t-statistic: 1.49
p-value: 1.46E-01

Restricted Weighted Random Load Balancer (floor) vs Double Weighted Random Load Balancer:
t-statistic: -5.30
p-value: 5.22E-06

retention mean and std:
                                                     mean       std
mode                                                               
Double Weighted Random Load Balancer              0.88880  0.000768
No Load Balancer                                  0.88610  0.000852
Restricted Weighted Random Load Balancer (ceil)   0.88140  0.000995
Restricted Weighted Random Load Balancer (floor)  0.87950  0.001051
Simple Load Balancer                              0.87565  0.000875
Weighted Random Load Balancer                     0.88515  0.000745

t-test:
Restricted Weighted Random Load Balancer (ceil) vs No Load Balancer:
t-statistic: -16.05
p-value: 1.63E-18

Restricted Weighted Random Load Balancer (ceil) vs Simple Load Balancer:
t-statistic: 19.41
p-value: 2.56E-21

Restricted Weighted Random Load Balancer (ceil) vs Weighted Random Load Balancer:
t-statistic: -13.49
p-value: 4.52E-16

Restricted Weighted Random Load Balancer (ceil) vs Restricted Weighted Random Load Balancer (floor):
t-statistic: 5.87
p-value: 8.56E-07

Restricted Weighted Random Load Balancer (ceil) vs Double Weighted Random Load Balancer:
t-statistic: -26.34
p-value: 5.18E-26

No Load Balancer vs Simple Load Balancer:
t-statistic: 38.26
p-value: 5.96E-32

No Load Balancer vs Weighted Random Load Balancer:
t-statistic: 3.75
p-value: 5.83E-04

No Load Balancer vs Restricted Weighted Random Load Balancer (floor):
t-statistic: 21.81
p-value: 4.35E-23

No Load Balancer vs Double Weighted Random Load Balancer:
t-statistic: -10.53
p-value: 8.05E-13

Simple Load Balancer vs Weighted Random Load Balancer:
t-statistic: -36.96
p-value: 2.13E-31

Simple Load Balancer vs Restricted Weighted Random Load Balancer (floor):
t-statistic: -12.59
p-value: 3.96E-15

Simple Load Balancer vs Double Weighted Random Load Balancer:
t-statistic: -50.52
p-value: 1.89E-36

Weighted Random Load Balancer vs Restricted Weighted Random Load Balancer (floor):
t-statistic: 19.61
p-value: 1.79E-21

Weighted Random Load Balancer vs Double Weighted Random Load Balancer:
t-statistic: -15.26
p-value: 8.65E-18

Restricted Weighted Random Load Balancer (floor) vs Double Weighted Random Load Balancer:
t-statistic: -31.95
p-value: 4.59E-29

Ok, not gonna lie, I was NOT expecting this to actually work.
Double weighted LB achieves higher retention than no LB, 88.8% vs 88.6%. And the volatility is lower than with no LB. Also, Double Weighted is strictly better than Weighted.
@jakeprobst @L-M-Sherlock


user1823 commented Aug 8, 2024

Seems that the Double Weighted LB is one of the best compromises yet. I encourage you to try even more tweaks.

PS:
When posting the results, don't include the list of p-values if they are highly significant. Frankly speaking, they are currently just adding to the clutter.

@jakeprobst

rather than just the interval itself (which is going to bias a bit too hard to earlier days?) why not use the distance from the target interval?


Expertium commented Aug 8, 2024

Ok, here's one final tweak. I raised r to the power of 2, and also to the power of 1.5:

            inv = [1 if r == 0 else (1 / np.power(r, 2)) * (1 / delta_t) for r, delta_t in zip(review_cnts, possible_intervals)]
            delta_t = random.choices(possible_intervals, weights=inv)[0]

The output is getting too long, so I'm not including p-values, except for two.

volatility mean and std:
                                                     mean       std
mode                                                               
Double Weighted Random Load Balancer              0.10330  0.007901
Double Weighted Random Load Balancer (1.5)        0.09225  0.007152
Double Weighted Random Load Balancer (2)          0.09130  0.008430
No Load Balancer                                  0.13395  0.009254
Restricted Weighted Random Load Balancer (ceil)   0.09440  0.008042
Restricted Weighted Random Load Balancer (floor)  0.09020  0.007736
Simple Load Balancer                              0.08130  0.005526
Weighted Random Load Balancer                     0.10735  0.009281

No Load Balancer vs Double Weighted Random Load Balancer (2):
p-value: 9.01E-18

retention mean and std:
                                                     mean       std
mode                                                               
Double Weighted Random Load Balancer              0.88880  0.000768
Double Weighted Random Load Balancer (1.5)        0.88795  0.000759
Double Weighted Random Load Balancer (2)          0.88700  0.000795
No Load Balancer                                  0.88610  0.000852
Restricted Weighted Random Load Balancer (ceil)   0.88140  0.000995
Restricted Weighted Random Load Balancer (floor)  0.87950  0.001051
Simple Load Balancer                              0.87565  0.000875
Weighted Random Load Balancer                     0.88515  0.000745

No Load Balancer vs Double Weighted Random Load Balancer (2):
p-value: 1.37E-03

The retention is still higher with Double Weighted LB with r squared than with no LB, 88.7% vs 88.6%; meanwhile volatility is definitely lower. In fact, volatility is almost as low as with my restricted method.
I think this is enough tweaking. Jake, LMSherlock - your final thoughts, please.

Edit: here are all modes, sorted by retention (higher is better):

Double Weighted Random Load Balancer              0.8888
Double Weighted Random Load Balancer (1.5)        0.8880
Double Weighted Random Load Balancer (2)          0.8870
No Load Balancer                                  0.8861
Weighted Random Load Balancer                     0.8852
Restricted Weighted Random Load Balancer (ceil)   0.8814
Restricted Weighted Random Load Balancer (floor)  0.8800
Simple Load Balancer                              0.8757

And by volatility (lower is better):

Simple Load Balancer                              0.0813
Restricted Weighted Random Load Balancer (floor)  0.0902
Double Weighted Random Load Balancer (2)          0.0913
Double Weighted Random Load Balancer (1.5)        0.0923
Restricted Weighted Random Load Balancer (ceil)   0.0944
Double Weighted Random Load Balancer              0.1033
Weighted Random Load Balancer                     0.1074
No Load Balancer                                  0.1340

Double Weighted Random Load Balancer with r raised to the power of 2 is within the top 3 modes in both lists.


jakeprobst commented Aug 8, 2024

I'd like to see the numbers for

inv = [1 if r == 0 else (1 / np.power(r, 2)) * (1 / abs(delta_t - possible_delta_t)) for r, possible_delta_t in zip(review_cnts, possible_intervals)]

maybe pow(2) it as well for good measure.

double weight load balance does look pretty solid though.

I was also experimenting with larger powers, like 10, to sort of accentuate the actual load-balance aspect of all this. Unsure how I feel about this currently.


Expertium commented Aug 8, 2024

You need to add an extra condition for when abs(delta_t - possible_delta_t) is 0. Or use a different function altogether.
EDIT: I'll try np.exp(-abs(delta_t - possible_delta_t))
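A sketch of what that could look like (the target of 25 is made up and stands in for the unfuzzed delta_t; exp(0) = 1 at the target, so no special case is needed):

import random
import numpy as np

possible_intervals = np.asarray([23, 24, 25, 26, 27])  # fuzz range (made up)
review_cnts = np.asarray([30, 12, 25, 28, 14])         # due counts (made up)
target = 25                                            # unfuzzed target interval

# exp(-distance) decays smoothly on both sides of the target, no division by zero
inv = [1 if r == 0 else (1 / np.power(r, 2)) * np.exp(-abs(target - d))
       for r, d in zip(review_cnts, possible_intervals)]
delta_t = random.choices(possible_intervals, weights=inv)[0]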


Expertium commented Aug 10, 2024

Since @L-M-Sherlock appears to be busy, I have implemented the idea above myself.
However, I soon found out that measuring clustering across the entire 90-day period gives pretty much the same results for all variants.
I changed it so that it only measures clustering on the first 15 days, but even then the results are pretty much the same for all variants. So either my idea of measuring clustering based on how many intervals are unique is bad, or clustering isn't an issue in the first place.

Lower clustering score = better. A score of 1 means that all intervals are exactly the same; the closer the value is to 0, the more intervals are unique. For example, for [1, 1, 1, 2, 2, 3], the clustering score is 1/3, since there are 3 unique values.
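In code, that score is just the reciprocal of the number of unique interval lengths (a minimal sketch of the metric as described):

import numpy as np

def clustering_score(intervals):
    # 1.0 when all intervals are identical; approaches 0 as they diversify
    return 1 / len(np.unique(intervals))

print(clustering_score([1, 1, 1, 2, 2, 3]))  # 3 unique values -> 0.333...

Here are the results: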

                                                  volatility (mean)
Simple Load Balancer                              0.0847
Double Weighted Random Load Balancer              0.0914
Restricted Weighted Random Load Balancer (floor)  0.0925
Restricted Weighted Random Load Balancer (ceil)   0.0943
Weighted Random Load Balancer                     0.0993
No Load Balancer                                  0.1250

                                                  retention (mean)                           
Double Weighted Random Load Balancer              0.8870
No Load Balancer                                  0.8868
Weighted Random Load Balancer                     0.8861
Restricted Weighted Random Load Balancer (ceil)   0.8814
Restricted Weighted Random Load Balancer (floor)  0.8809
Simple Load Balancer                              0.8775

                                                  clustering (mean)
No Load Balancer                                  0.1481
Simple Load Balancer                              0.1495
Double Weighted Random Load Balancer              0.1498
Restricted Weighted Random Load Balancer (ceil)   0.1500
Weighted Random Load Balancer                     0.1500
Restricted Weighted Random Load Balancer (floor)  0.1501

For clustering, all p-values are >0.05, meaning that all variants perform about the same.

@user1823

Maybe we need a different way to estimate the amount of clustering. I used ChatGPT to get some ideas. They seem to be worth looking into.

ChatGPT output:


To get a single value that summarizes how often any two cards are reviewed on the same date, you can use aggregate measures that capture the overall tendency of event co-occurrence. Here are a few approaches to compute such a value:

1. Co-occurrence Ratio

Calculate a single co-occurrence ratio that summarizes the frequency with which any two events occur together relative to their individual frequencies.

Steps:

  1. Create a Co-occurrence Matrix: For each pair of events, count how often they occur on the same date.
  2. Compute Individual Frequencies: Count how often each event occurs.
  3. Calculate the Co-occurrence Ratio: For each pair of events, divide the co-occurrence count by the product of their individual frequencies.

Formula:

$$\text{Co-occurrence Ratio} = \frac{\sum_{i \neq j} \text{Co-occurrence}_{ij}}{\sum_{i} \text{Occurrences}_{i}}$$

where $\text{Co-occurrence}_{ij}$ is the count of times events $i$ and $j$ co-occur, and $\text{Occurrences}_{i}$ is the total count of event $i$ across the time series.

2. Jaccard Similarity Index

The Jaccard similarity index measures the similarity between two sets and can be adapted for events.

Formula:

$$\text{Jaccard Index} = \frac{|A \cap B|}{|A \cup B|}$$

where $A$ and $B$ are the sets of dates on which events occur.

For multiple events, you can generalize this by averaging the pairwise Jaccard indices.

3. Dice Coefficient

The Dice coefficient is another measure of similarity that can be used for co-occurrence.

Formula:

$$\text{Dice Coefficient} = \frac{2 \cdot |A \cap B|}{|A| + |B|}$$

4. Average Pairwise Co-occurrence

Calculate the average number of co-occurrences across all pairs of events.

Steps:

  1. Create a Co-occurrence Matrix: For all pairs of events.
  2. Compute the Average Co-occurrence: Sum the co-occurrences of all pairs and divide by the number of pairs.

Formula:

$$\text{Average Pairwise Co-occurrence} = \frac{\sum_{i \neq j} \text{Co-occurrence}_{ij}}{\text{Number of Pairs}}$$

5. Correlation of Event Presence

Compute the average pairwise correlation between events.

Steps:

  1. Binary Presence Matrix: Create a binary matrix where rows represent dates and columns represent events.
  2. Compute Pairwise Correlations: Calculate the Pearson or Spearman correlation between each pair of event columns.
  3. Average Correlations: Compute the average of these correlations.

Example Using Python

Here’s an example of how to compute the Average Pairwise Co-occurrence:

import numpy as np
import pandas as pd

# Sample data
data = {
    'Date': ['2024-08-01', '2024-08-02', '2024-08-03'],
    'Event 1': [1, 1, 0],
    'Event 2': [0, 1, 1],
    'Event 3': [1, 0, 1]
}

df = pd.DataFrame(data)

# Co-occurrence calculation
co_occurrence_matrix = pd.DataFrame(index=['Event 1', 'Event 2', 'Event 3'], columns=['Event 1', 'Event 2', 'Event 3'], data=0)

for event1 in co_occurrence_matrix.index:
    for event2 in co_occurrence_matrix.columns:
        if event1 != event2:
            co_occurrence_matrix.loc[event1, event2] = ((df[event1] == 1) & (df[event2] == 1)).sum()

# Average Pairwise Co-occurrence
num_pairs = len(co_occurrence_matrix.index) * (len(co_occurrence_matrix.index) - 1) / 2
total_co_occurrences = co_occurrence_matrix.where(np.triu(np.ones_like(co_occurrence_matrix, dtype=bool), k=1)).stack().sum()

average_pairwise_co_occurrence = total_co_occurrences / num_pairs

print("Average Pairwise Co-occurrence:", average_pairwise_co_occurrence)

In this script:

  • co_occurrence_matrix is used to count the co-occurrence of each event pair.
  • average_pairwise_co_occurrence is computed by dividing the total co-occurrence by the number of pairs.

Choose the method that best fits your data and goals. Each method provides a different perspective on how often events co-occur.


brishtibheja commented Aug 10, 2024

How is retention higher for double weighted? You said this might be some error, is it an error?

Also, about clustering: it seems you're counting the number of unique intervals among the cards due on a particular date, i.e., among all the cards scheduled for today, how many have unique intervals. How about "among all the cards rated today, how many have unique intervals"?

edit: ngl chatgpt can be a boon to creativity.


Expertium commented Aug 10, 2024

> How is retention higher for double weighted? You said this might be some error, is it an error?

Later, once I have a good metric for clustering, I'll re-run everything with 150 samples instead of 20, to get a definitive answer. For now, I'm thinking about quantifying clustering.

@Expertium

Pairwise comparisons sound like a huge pain to implement, not to mention the computational complexity. If there are 10 000 reviews, that's (10 000 * 9 999) / 2 = 49 995 000 pairs. Even if I could implement it, it would likely slow the simulation down. But I'll see what I can do.


Expertium commented Aug 10, 2024

Ok, I figured out a way of doing this. I'm not saying I figured out a good way, though.

            matrix = [[-1 for n in range(learn_days)] for id in range(min(learn_days * new_cards_limits, deck_size))]
            # -1 = not introduced yet, 0 = not reviewed, 1 = reviewed

The matrix gets updated after a new day starts + after each review. It contains information about every review of every card on every day. And then, after the simulation is done, there is this monstrosity. It's a for loop inside a for loop inside a for loop:

            n = 0
            co_occurred = 0
            for day in range(learn_days):
                for card1 in range(min(learn_days * new_cards_limits, deck_size)):
                    for card2 in range(card1 + 1, min(learn_days * new_cards_limits, deck_size)):
                        # don't count cards that haven't been introduced yet
                        if (matrix[card1][day] > -1) and (matrix[card2][day] > -1):
                            # 0-0 also counts
                            n += 1
                        if (matrix[card1][day] == 1) and (matrix[card2][day] == 1):
                            co_occurred += 1

This loop takes longer than the simulation itself. This is likely the most inefficient method possible, but oh well.
Anyway, the final score is clustering_score = co_occurred / n. It's a ratio of times when two cards happened to be reviewed on the same day, to all reviews of all cards on all days. For example, a value of 0.01 means that there is a 1% chance of two cards both being reviewed on the same randomly selected day. Lower = better.
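Incidentally, the pairwise loop isn't strictly necessary: on each day, v introduced cards form v*(v-1)/2 valid pairs and r reviewed cards form r*(r-1)/2 co-occurring pairs, so the same score can be computed from per-day counts alone. A vectorized sketch, assuming matrix is the list of lists built above:

import numpy as np

M = np.asarray(matrix)                  # (num_cards, learn_days), values -1/0/1
v = (M > -1).sum(axis=0)                # introduced cards per day
r = (M == 1).sum(axis=0)                # reviewed cards per day
n = (v * (v - 1) // 2).sum()            # all valid pairs, summed over days
co_occurred = (r * (r - 1) // 2).sum()  # co-reviewed pairs, summed over days
clustering_score = co_occurred / n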

Here are the results:

                                                  volatility (mean)
Simple Load Balancer                              0.0847
Double Weighted Random Load Balancer              0.0914
Restricted Weighted Random Load Balancer (floor)  0.0925
Restricted Weighted Random Load Balancer (ceil)   0.0943
Weighted Random Load Balancer                     0.0993
No Load Balancer                                  0.1250

                                                  retention (mean)
Double Weighted Random Load Balancer              0.8870
No Load Balancer                                  0.8868
Weighted Random Load Balancer                     0.8861
Restricted Weighted Random Load Balancer (ceil)   0.8814
Restricted Weighted Random Load Balancer (floor)  0.8809
Simple Load Balancer                              0.8775

                                                  clustering (mean)                                                
Simple Load Balancer                              0.0106
Restricted Weighted Random Load Balancer (floor)  0.0110
Restricted Weighted Random Load Balancer (ceil)   0.0112
No Load Balancer                                  0.0117
Weighted Random Load Balancer                     0.0118
Double Weighted Random Load Balancer              0.0120

The amount of clustering is lower with the simple LB than with no LB (p-value=4.03E-16). I don't have an explanation for that. Although, while the difference is statistically significant, it's not practically significant. All numbers are close to 0.01.


Expertium commented Aug 10, 2024

I ran it again, but replaced for day in range(learn_days): with for day in range(15): in the loop; in other words, this time clustering is calculated based only on the first 15 days.

                                                  clustering (mean)      
Simple Load Balancer                              0.0847
Restricted Weighted Random Load Balancer (ceil)   0.0853
Restricted Weighted Random Load Balancer (floor)  0.0854
Weighted Random Load Balancer                     0.0878
Double Weighted Random Load Balancer              0.0880
No Load Balancer                                  0.0905

Same results again. Somehow simple LB results in lower clustering than no LB. So both methods - based on the number of unique interval lengths and based on counting co-occurrences - tell us that clustering is about the same, and whether I limit it to 15 days or not has no effect either.


Expertium commented Aug 11, 2024

I also tried the Jaccard Similarity Coefficient.

            jaccard = []
            for card1 in range(min(learn_days * new_cards_limits, deck_size)):
                for card2 in range(card1 + 1, min(learn_days * new_cards_limits, deck_size)):
                    intersection = 0
                    union = 0
                    for day in range(learn_days):
                        # don't count cards that haven't been introduced yet
                        if (matrix[card1][day] > -1) and (matrix[card2][day] > -1):
                            # 0-0 doesn't count, only 0-1, 1-0 and 1-1
                            if matrix[card1][day] + matrix[card2][day] > 0:
                                union += 1
                        if (matrix[card1][day] == 1) and (matrix[card2][day] == 1):
                            intersection += 1
                    jaccard.append(intersection/union)

            clustering_score = np.mean(jaccard)

It's similar to counting co-occurrences; the big difference is that, unlike last time, "no review of card 1 - no review of card 2" counted towards n (the total number of all occurrences) there, whereas here it doesn't: "union" means "only A, or only B, or both", but not "neither A nor B". The structure of the loop is also a little different.
As you might have guessed, the results are similar. Simple LB produces less clustering than no LB, p-value=2.79E-13.

                                                  clustering (mean)   
Simple Load Balancer                              0.0547
Restricted Weighted Random Load Balancer (floor)  0.0553
Restricted Weighted Random Load Balancer (ceil)   0.0557
No Load Balancer                                  0.0566
Weighted Random Load Balancer                     0.0571
Double Weighted Random Load Balancer              0.0573

With all 3 methods - unique intervals, co-occurrences, Jaccard - the averages of different variants of LB are very close, and without a statistical significance test, it's not obvious whether the tiny differences can be attributed to noise or not. With the first method, the differences are not statistically significant. With the latter two, most differences are significant, but the results are counter-intuitive.


user1823 commented Aug 11, 2024

It seems that No Load Balance doesn't apply fuzz in this simulator.

if delta_t < 2.5 or not enable_load_balance:
    return delta_t

But this doesn't completely explain the results. For example, Restricted Weighted Load Balancer should logically have more clustering than Weighted Random Load Balancer but the results show the opposite.

@Expertium

> It seems that No Load Balance doesn't apply fuzz in this simulator.

You're right. I'll fix that later.

@Expertium

Expertium commented Aug 11, 2024

    if not enable_load_balance:
        # fuzz
        delta_t = random.choices(possible_intervals, weights=list(np.ones_like(possible_intervals)))[0]

This assigns equal probabilities to each interval. I also renamed "No Load Balancer" to "No Load Balancer (fuzz)", for the sake of clarity. For clustering, I'm using Jaccard, lower = better.

                                                  volatility (mean)                                             
Simple Load Balancer                              0.0847
Double Weighted Random Load Balancer              0.0914
Restricted Weighted Random Load Balancer (floor)  0.0925
Restricted Weighted Random Load Balancer (ceil)   0.0943
Weighted Random Load Balancer                     0.0993
No Load Balancer (fuzz)                           0.1252

                                                  retention (mean)                                               
No Load Balancer (fuzz)                           0.8881
Double Weighted Random Load Balancer              0.8870
Weighted Random Load Balancer                     0.8861
Restricted Weighted Random Load Balancer (ceil)   0.8814
Restricted Weighted Random Load Balancer (floor)  0.8809
Simple Load Balancer                              0.8775

                                                  clustering (mean)                                            
Simple Load Balancer                              0.0547
Restricted Weighted Random Load Balancer (floor)  0.0553
Restricted Weighted Random Load Balancer (ceil)   0.0557
No Load Balancer (fuzz)                           0.0567
Weighted Random Load Balancer                     0.0571
Double Weighted Random Load Balancer              0.0573

The results are almost the same as previously.


Expertium commented Aug 11, 2024

@jakeprobst @L-M-Sherlock sorry for the trouble, but I advise you to read all comments made within the last 2 days.
TLDR: I tried 3 different methods of measuring clustering, and with all 3 methods the difference between different variants of LB is minuscule.
Meanwhile I will run the simulations with 150 samples, to accurately measure the impact on retention.

@jakeprobst

the simple load balancer doing the best with clustering is very surprising, to the point where I feel something is wrong somewhere? I'm not a math guy so I can't add any meaningful input beyond that it just feels intuitively wrong?


Expertium commented Aug 11, 2024

The problem is that all 3 methods of estimating clustering output very similar average values for every variant of LB, and the differences are very small.

                                                  uniqueness of intervals, 15 days (mean)
No Load Balancer                                  0.1481
Simple Load Balancer                              0.1495
Double Weighted Random Load Balancer              0.1498
Restricted Weighted Random Load Balancer (ceil)   0.1500
Weighted Random Load Balancer                     0.1500
Restricted Weighted Random Load Balancer (floor)  0.1501

                                                  co-occurrences, 90 days (mean)                                                
Simple Load Balancer                              0.0106
Restricted Weighted Random Load Balancer (floor)  0.0110
Restricted Weighted Random Load Balancer (ceil)   0.0112
No Load Balancer                                  0.0117
Weighted Random Load Balancer                     0.0118
Double Weighted Random Load Balancer              0.0120

                                                  Jaccard, 90 days (mean)                                            
Simple Load Balancer                              0.0547
Restricted Weighted Random Load Balancer (floor)  0.0553
Restricted Weighted Random Load Balancer (ceil)   0.0557
No Load Balancer (fuzz)                           0.0567
Weighted Random Load Balancer                     0.0571
Double Weighted Random Load Balancer              0.0573

With the first method, all values are very close to 0.15. With the second method, all values are very close to 0.011. With the third method, all values are very close to 0.055. So either all methods of measuring clustering are bad, or clustering is not an issue and we are worried for no reason.

@brishtibheja

Can you do this on a real collection? Because it was vaibhav who brought it up first.

@Expertium

It would take months.

@brishtibheja

He specifically talked about new cards, so he excluded cards that went to relearning and was talking about a rather small time frame, probably 7 days.

@Expertium

I don't really see a reason to do so when we have simulations. We just don't know how to measure what we want. Or we're worried about a non-existent problem, which is also possible.

@brishtibheja

I agree with you but in that comment I was saying clustering might decrease because you were considering a 15 day time interval in the simulation.

@dae you should look at this, it's possible clustering is a non-existent problem.


Expertium commented Aug 11, 2024

> clustering might decrease because you were considering a 15 day time interval in the simulation.

To clear any confusion: the simulation is always 90 days, and I tried measuring clustering both across all days and across the first 15 days, the results were pretty much the same - all variants of LB perform almost the same.
And I don't think you should ping Dae for this, he's not a math guy, even less so than Jake (and I thought Jake was a math guy, but he says otherwise).

@L-M-Sherlock

I think we should dive into the details.

I drew the histograms of last review dates from day 1 to day 30:

hist.zip

@user1823

> I think we should dive into the details. I drew the histograms of last review dates from day 1 to day 30:

What's your interpretation?


Expertium commented Aug 12, 2024

@L-M-Sherlock did you add my line?

if not enable_load_balance:
    # fuzz
    delta_t = random.choices(possible_intervals, weights=list(np.ones_like(possible_intervals)))[0]

Because if not, then this isn't a proper comparison.
Also, the fact that the range of values on the X axis is different even for the same day makes it hard to compare. For any day N, the range of values on the X axis should be identical, to make it visually easier to spot the difference.


Expertium commented Aug 12, 2024

I ran each variant of LB with 150 samples, to determine how much they affect retention. Here are the results:

                                                  retention (mean)
No Load Balancer (fuzz)                           0.8879
Double Weighted Random Load Balancer              0.8870
Weighted Random Load Balancer                     0.8862
Restricted Weighted Random Load Balancer (ceil)   0.8814
Restricted Weighted Random Load Balancer (floor)  0.8807
Simple Load Balancer                              0.8775

All p-values are significant, no variant performs exactly the same as another one. So yeah, double weighted makes retention ever so slightly lower. But while the difference is statistically significant, in practice nobody will be able to tell 88.8% and 88.7% apart based solely on personal experience.

Next, I also tried a fourth method of estimating clustering: simply record every single interval length of every card throughout the simulation, calculate the standard deviation, and take the inverse (1/x) of it, so that lower = better. I also removed all intervals equal to 0. Here are the results, 20 samples:

                                                  clustering (mean)                                             
Double Weighted Random Load Balancer              0.0468
No Load Balancer (fuzz)                           0.0468
Weighted Random Load Balancer                     0.0468
Restricted Weighted Random Load Balancer (ceil)   0.0469
Restricted Weighted Random Load Balancer (floor)  0.0469
Simple Load Balancer                              0.0470

As you can see, it's the same picture again, for the fourth time: all values are very close, and no variant of LB is very different from any other based on this metric.
I give up. Either clustering is a non-issue or measuring it is too difficult, and I would have to dedicate an insane amount of time and effortful thinking to it, and I'd rather we get done with Load Balancing and move on to Easy Days.
Finally, here's a table with 4 different measures of clustering:
(image: table of the four clustering metrics for each LB variant)

If we sort by these 4 different metrics, we get very different rankings, so we can't say "Well, let's just pick whichever variant of LB results in the lowest value of clustering according to all 4 metrics". Calculating these metrics based on all 90 days or only 15 days makes no difference either. In any case, all variants of LB are practically the same, regardless of which metric you are looking at.
@jakeprobst @L-M-Sherlock I suggest we choose Double Weighted LB and move on. Here's how it's calculated:

inv = [1 if r == 0 else (1 / np.power(r, 2)) * (1 / delta_t) for r, delta_t in zip(review_cnts, possible_intervals)]
delta_t = random.choices(possible_intervals, weights=inv)[0]

Squared count seems to work better both in terms of making retention closer to fuzz, and in terms of reducing volatility. Based on retention and volatility, Double Weighted is the best.
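For reference, here's the whole selection step wrapped as a function (a sketch; the function name is mine, not the simulator's):

import random
import numpy as np

def double_weighted_interval(possible_intervals, review_cnts):
    # favor days with few due cards (inverse-square) and shorter intervals (inverse)
    inv = [1 if r == 0 else (1 / np.power(r, 2)) * (1 / d)
           for r, d in zip(review_cnts, possible_intervals)]
    return random.choices(possible_intervals, weights=inv)[0]

# e.g. double_weighted_interval([23, 24, 25, 26, 27], [30, 12, 25, 28, 14])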


user1823 commented Aug 13, 2024

> Either clustering is a non-issue or measuring it is too difficult, and I would have to dedicate an insane amount of time and effortful thinking to it

So, I think it would be better to see how it works in the real world. The next Anki update will likely be a big update with a long beta cycle. So, beta testers will be able to share their experiences before the stable release.

Let's go with the Double Weighted one and add an option to disable the Load Balancer (even if just in the Debug Console) for the people who face problems.

Just a note:
Theoretically, increasing the weights (adding new weights or increasing power of the existing weights) causes the scheduling to become less random and more predictable, increasing the chances of clustering.


L-M-Sherlock commented Aug 16, 2024

Can I close this issue? I think the conclusion has been drawn.

Btw, I wrote a notebook to simulate Easy Days here: https://github.com/open-spaced-repetition/easy-days-simulator/blob/main/notebook.ipynb

I recommend opening a new issue to discuss it.

@jakeprobst

yeah close it

L-M-Sherlock transferred this issue from open-spaced-repetition/fsrs4anki on Sep 19, 2024