[Other] Let's use the FSRS Simulator to figure out how to implement Load Balancer #1
Honestly, I was suggesting what I suggested because it was an easier implementation 🍃
I implemented the simplest load balancer in the simulator. Here are my initial results. Simulator: https://github.com/open-spaced-repetition/load-balance-simulator/blob/main/notebook.ipynb (Plots: review counts per day with Load Balance enabled vs. disabled.) The real workload is smoother, but the true retention also drops.
Doing a weighted random seems to improve the retention, but the review count is a bit more erratic (though not as much as with no balancing). I'll try implementing this in Anki tomorrow and test it out for a bit. I ended up wasting the day failing to convince Anki that a plugin to do simulations was a reasonable thing it should be able to do.
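For reference, a minimal sketch of what such a weighted random pick could look like (hypothetical names; the actual notebook and Anki code may differ):

```python
import random

def weighted_random_balance(fuzz_range, due_counts):
    """Pick a day from the fuzz range, favoring days with fewer due cards."""
    # Weight each candidate day by the inverse of its due count
    # (+1 avoids division by zero on empty days).
    weights = [1.0 / (due_counts.get(day, 0) + 1) for day in fuzz_range]
    return random.choices(fuzz_range, weights=weights, k=1)[0]
```

For example, `weighted_random_balance([23, 24, 25, 26, 27], {23: 40, 24: 10, 25: 35, 26: 12, 27: 50})` would most often return day 24 or 26.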
@jakeprobst, thanks for your code! I added it to the notebook. To make our analyses more accurate, I quantified the volatility and average retention. Here is the result:
How I calculate the volatility:

```python
volatility = np.std(np.diff(review_card_per_day) / review_card_per_day[1:], ddof=1)
```
@L-M-Sherlock, can you test my idea as well? With your code, days in range is the full fuzz range. In my version, we sort the days by due count and pick among only the N days with the lowest counts.
Also, I don't think your method of calculating volatility is good. According to your method, [100, 125, 150] has lower volatility than [100, 115, 110]. I suggest replacing it with a metric that reflects the magnitude of the day-to-day changes, not just their consistency. And one more thing: please make a table with 4 scenarios: no Load Balancer, Simple Load Balancer, Weighted Random Load Balancer, and Restricted Weighted Random Load Balancer; the code for the last one is in my comment above.
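To make the criticism concrete, here is the quoted volatility formula applied to both example series (a quick check, not code from the notebook):

```python
import numpy as np

def volatility(review_card_per_day):
    review_card_per_day = np.asarray(review_card_per_day, dtype=float)
    return np.std(np.diff(review_card_per_day) / review_card_per_day[1:], ddof=1)

print(volatility([100, 125, 150]))  # ~0.024: steady growth is scored as "calm"
print(volatility([100, 115, 110]))  # ~0.124: mild fluctuation is scored as "volatile"
```

The metric measures how consistent the day-to-day changes are, not how large they are, which is why the steadily growing series wins.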
It seems that there may be an error (or bias) in this value. I can't think of any reason why the Restricted Weighted Random Load Balancer would have a lower volatility than the Simple Load Balancer.
Might be noise. @L-M-Sherlock, try increasing
I think we need to run more simulations with different random seeds and use a t-test.
Then run 20 of each (so that's 20 × 4 = 80 runs), record retention and volatility values, and post them here.
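A sketch of what that comparison could look like (the volatility values below are placeholders for illustration, not results from this thread):

```python
import numpy as np
from scipy.stats import ttest_ind

# Suppose these are volatility values recorded from 20 runs per variant
# (placeholder numbers, not actual measurements).
vol_simple = np.random.default_rng(0).normal(0.10, 0.01, 20)
vol_weighted = np.random.default_rng(1).normal(0.12, 0.01, 20)

# Two-sample t-test: is the difference in mean volatility significant?
t_stat, p_value = ttest_ind(vol_simple, vol_weighted)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")  # scientific notation for small p-values
```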
You should probably switch to scientific notation if p-values are so small. And you should add average retention values as well, not just average volatility.
I like it. Moreover, it will perform better at breaking up clusters of related cards than the simple version.
I don't like that it can schedule a card on a day with the highest due count of all days. That goes against the spirit of this feature.
I ran the simulations on my own, but with a tweak to my formula: I used ceil instead of floor.
Ranking based on volatility (from best to worst): Simple > Restricted (floor) > Restricted (ceil) > Weighted > None. You may have noticed that the rankings for volatility and retention are exact opposites of each other: you can obtain one just by reversing the order of the other. So it doesn't seem like we can get both desirable properties - same retention but low volatility - simultaneously. I would suggest using Restricted (ceil), simply because it's right in the middle of both rankings. A compromise.
Are we trying to solve the retention problem or the clustering problem, or both at the same time?
Ideally, we want retention to be unaffected and the load to be very consistent. But we can't have both.
Can we not just schedule newer cards differently, i.e., a bit earlier, to account for this?
Well, say hello to Double Weighted Random Load Balancer:
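The notebook code isn't reproduced here, but based on the description that follows, a sketch could look like this (assuming a hypothetical `retrievability(card, day)` helper; all names are illustrative):

```python
import random

def double_weighted_balance(card, fuzz_range, due_counts, retrievability):
    # Weight each day by (1 / due count) * retrievability on that day.
    # Shorter intervals mean higher retrievability, hence higher weight.
    weights = [
        retrievability(card, day) / (due_counts.get(day, 0) + 1)
        for day in fuzz_range
    ]
    return random.choices(fuzz_range, weights=weights, k=1)[0]
```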
This makes it so that the probabilities depend not just on the number of due cards, but also on interval lengths. This will systematically drag retention up, since shorter intervals result in higher probabilities. Results:
Ok, not gonna lie, I was NOT expecting this to actually work.
Seems that the Double Weighted LB is one of the best compromises yet. I encourage you to try even more tweaks.
Rather than just the interval itself (which is going to bias a bit too hard toward earlier days?), why not use the distance from the target interval?
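One way to read that suggestion, as a sketch (hypothetical code, not anyone's confirmed implementation):

```python
import random

def distance_weighted_balance(fuzz_range, due_counts, target_day):
    # Favor days close to the target interval and days with fewer cards due.
    weights = [
        1.0 / ((abs(day - target_day) + 1) * (due_counts.get(day, 0) + 1))
        for day in fuzz_range
    ]
    return random.choices(fuzz_range, weights=weights, k=1)[0]
```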
Ok, here's one final tweak: I raised r to the power of 2.
The output is getting too long, so I'm not including p-values, except for two.
The retention is still higher with Double Weighted LB with r squared than with no LB, 88.7% vs 88.6%; meanwhile volatility is definitely lower. In fact, volatility is almost as low as with my restricted method. Edit: here are all modes, sorted by retention (higher is better):
And by volatility (lower is better):
Double Weighted Random Load Balancer with r raised to the power of 2 is among the top 3 modes in both lists.
I'd like to see the numbers for
Maybe pow(2) it as well, for good measure. Double weighted load balance does look pretty solid though. I was also experimenting with larger powers, like 10, to sort of accentuate the actual load-balance aspect of all this. Unsure how I feel about this currently.
You need to add an extra condition for when
Since @L-M-Sherlock appears to be busy, I have implemented the idea above myself. Here are the results. Lower clustering score = better. A score of 1 means that all intervals are exactly the same; the closer the value is to 0, the more unique intervals there are. For example, for [1, 1, 1, 2, 2, 3], the clustering score is 1/3, since there are 3 unique values.
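From that description, the score appears to be the reciprocal of the number of unique interval lengths (a reconstruction, not necessarily the notebook's exact code):

```python
def clustering_score(intervals):
    # 1 if all intervals are identical; approaches 0 as more intervals are unique.
    return 1 / len(set(intervals))

print(clustering_score([1, 1, 1, 2, 2, 3]))  # 1/3, since there are 3 unique values
```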
For clustering, all p-values are > 0.05, meaning that all variants perform about the same.
Maybe we need a different way to estimate the amount of clustering. I used ChatGPT to get some ideas. They seem to be worth looking into. ChatGPT output:

To get a single value that summarizes how often any two cards are reviewed on the same date, you can use aggregate measures that capture the overall tendency of event co-occurrence. Here are a few approaches to compute such a value:

1. Co-occurrence Ratio. Calculate a single co-occurrence ratio that summarizes the frequency with which any two events occur together relative to their individual frequencies.

2. Jaccard Similarity Index. The Jaccard similarity index measures the similarity between two sets and can be adapted for events: J(A, B) = |A ∩ B| / |A ∪ B|, where A and B are the sets of dates on which the two events occur. For multiple events, you can generalize this by averaging the pairwise Jaccard indices.

3. Dice Coefficient. The Dice coefficient is another measure of similarity that can be used for co-occurrence: D(A, B) = 2|A ∩ B| / (|A| + |B|).

4. Average Pairwise Co-occurrence. Calculate the average number of co-occurrences across all pairs of events: sum the co-occurrence counts over all pairs, then divide by the number of pairs.

5. Correlation of Event Presence. Compute the average pairwise correlation between the events' presence/absence indicators.
Example using Python. Here's an example of how to compute the Average Pairwise Co-occurrence:

```python
import numpy as np
import pandas as pd

# Sample data: one row per date, one 0/1 column per event
data = {
    'Date': ['2024-08-01', '2024-08-02', '2024-08-03'],
    'Event 1': [1, 1, 0],
    'Event 2': [0, 1, 1],
    'Event 3': [1, 0, 1]
}
df = pd.DataFrame(data)

# Co-occurrence calculation: for each pair of events, count the dates
# on which both occurred
events = ['Event 1', 'Event 2', 'Event 3']
co_occurrence_matrix = pd.DataFrame(index=events, columns=events, data=0)
for event1 in co_occurrence_matrix.index:
    for event2 in co_occurrence_matrix.columns:
        if event1 != event2:
            co_occurrence_matrix.loc[event1, event2] = ((df[event1] == 1) & (df[event2] == 1)).sum()

# Average Pairwise Co-occurrence: sum the upper triangle of the matrix
# and divide by the number of distinct pairs
num_pairs = len(co_occurrence_matrix.index) * (len(co_occurrence_matrix.index) - 1) / 2
total_co_occurrences = co_occurrence_matrix.where(np.triu(np.ones_like(co_occurrence_matrix, dtype=bool), k=1)).stack().sum()
average_pairwise_co_occurrence = total_co_occurrences / num_pairs

print("Average Pairwise Co-occurrence:", average_pairwise_co_occurrence)
```

In this script, each event column marks the dates on which that event occurred, the matrix counts shared dates for every pair of events, and the final value averages those counts over all pairs.
Choose the method that best fits your data and goals. Each method provides a different perspective on how often events co-occur.
How is retention higher for double weighted? You said this might be some error; is it an error? Also, about clustering: it seems you're counting the number of unique intervals among the cards due on a particular date - for example, among all the cards scheduled for today, how many have unique intervals. How about "among all the cards rated today, how many have unique intervals"? Edit: ngl, ChatGPT can be a boon to creativity.
Later, once I have a good metric for clustering, I'll re-run everything with 150 samples instead of 20, to get a definitive answer. For now, I'm thinking about how to quantify clustering.
Pairwise comparisons sound like a huge pain to implement, not to mention the computational complexity. If there are 10,000 reviews, that's (10,000 × 9,999) / 2 = 49,995,000 pairs. Even if I could implement it, it would likely slow the simulation down. But I'll see what I can do.
Ok, I figured out a way of doing this. I'm not saying I figured out a good way, though.
The matrix gets updated after a new day starts + after each review. It contains information about every review of every card on every day. And then, after the simulation is done, there is this monstrosity. It's a for loop inside a for loop inside a for loop:
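The loop itself isn't shown here, but from the description (a cards × days review matrix, scanned pairwise after the simulation), it presumably resembled this sketch (toy data and hypothetical names):

```python
import numpy as np

# review_matrix[c, d] = 1 if card c was reviewed on day d (toy data for illustration)
rng = np.random.default_rng(42)
review_matrix = (rng.random((50, 90)) < 0.05).astype(int)
n_cards, n_days = review_matrix.shape

# Count, for every pair of cards, the days on which both were reviewed
total_co_occurrences = 0
for card1 in range(n_cards):                  # loop 1: first card
    for card2 in range(card1 + 1, n_cards):   # loop 2: second card
        for day in range(n_days):             # loop 3: every day
            if review_matrix[card1, day] and review_matrix[card2, day]:
                total_co_occurrences += 1

num_pairs = n_cards * (n_cards - 1) / 2
print(total_co_occurrences / num_pairs)
```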
This loop takes longer than the simulation itself. This is likely the most inefficient method possible, but oh well. Here are the results:
The amount of clustering is lower with the simple LB than with no LB (p-value = 4.03E-16). I don't have an explanation for that. And while the difference is statistically significant, it's not practically significant: all numbers are close to 0.01.
I ran it again, but limited the calculation to the first 15 days instead of all 90.
Same results again. Somehow, simple LB results in lower clustering than no LB. So both methods - based on the number of unique interval lengths and based on counting co-occurrences - tell us that clustering is the same across variants, and whether I limit the window to 15 days or not has no effect either.
I also tried the Jaccard Similarity Coefficient.
It's similar to counting co-occurrences; the big difference is that, unlike last time, "no review of card 1 and no review of card 2" on the same day doesn't count towards the similarity - Jaccard ignores days on which neither card was reviewed.
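For reference, a per-pair Jaccard over the same kind of day-indexed review data could look like this (a sketch, not the notebook's code):

```python
import numpy as np

def jaccard_similarity(a, b):
    # a, b: 0/1 arrays, one entry per day, marking when each card was reviewed
    a, b = np.asarray(a), np.asarray(b)
    both = np.sum((a == 1) & (b == 1))    # days on which both cards were reviewed
    either = np.sum((a == 1) | (b == 1))  # days on which at least one was reviewed
    return both / either if either else 0.0
```

Days on which neither card was reviewed contribute to neither the numerator nor the denominator.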
With all 3 methods - unique intervals, co-occurrences, Jaccard - the averages of the different variants of LB are very close, and without a statistical significance test, it's not obvious whether the tiny differences can be attributed to noise or not. With the first method, the differences are not statistically significant. With the latter two, most differences are significant, but the results are counter-intuitive.
It seems that intervals shorter than 2.5 days bypass the load balancer entirely:

```python
if delta_t < 2.5 or not enable_load_balance:
    return delta_t
```

But this doesn't completely explain the results. For example,
You're right. I'll fix that later.
This assigns equal probabilities to each interval. I also renamed "No Load Balancer" to "No Load Balancer (fuzz)", for the sake of clarity. For clustering, I'm using Jaccard, lower = better.
The results are almost the same as previously.
@jakeprobst @L-M-Sherlock sorry for the trouble, but I advise you to read all comments made within the last 2 days.
The simple load balancer doing the best with clustering is very surprising, to the point where I feel something is wrong somewhere? I'm not a math guy, so I can't add any meaningful input beyond saying that it just feels intuitively wrong.
The problem is that all 3 methods of estimating clustering output very similar average values for every variant of LB, and the differences are very small.
With the first method, all values are very close to 0.15. With the second method, all values are very close to 0.011. With the third method, all values are very close to 0.055. So either all methods of measuring clustering are bad, or clustering is not an issue and we are worried for no reason.
Can you do this on a real collection? Because it was vaibhav who brought it up first.
It would take months.
He specifically talked about new cards, so he excluded cards that went to relearning, and he was talking about a rather small time frame, probably 7 days.
I don't really see a reason to do so when we have simulations. We just don't know how to measure what we want. Or we're worried about a non-existent problem, which is also possible.
I agree with you, but in that comment I was saying clustering might decrease because you were considering a 15-day time interval in the simulation. @dae, you should look at this; it's possible clustering is a non-existent problem.
To clear up any confusion: the simulation is always 90 days, and I tried measuring clustering both across all days and across just the first 15 days; the results were pretty much the same - all variants of LB perform almost the same.
I think we should dive into the details. I drew the histograms of last dates from day 1 to day 30:
What's your interpretation?
@L-M-Sherlock, did you add my line?
Because if not, then this isn't a proper comparison.
I ran each variant of LB with 150 samples, to determine how much they affect retention. Here are the results:
All p-values are significant; no variant performs exactly the same as another. So yeah, double weighted makes retention ever so slightly lower. But while the difference is statistically significant, in practice nobody will be able to tell 88.8% and 88.7% apart based solely on personal experience. Next, I also tried a fourth method of estimating clustering: simply record every single interval length of every card throughout the simulation, calculate the standard deviation, and take the inverse (1/x) of it, so that lower = better. I also removed all intervals equal to 0.
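A sketch of that fourth metric (a reconstruction of the description above, not the notebook's actual code):

```python
import numpy as np

def inverse_interval_std(all_intervals):
    # Record every interval length over the whole simulation, drop zeros,
    # then invert the standard deviation so that lower = better (more spread).
    intervals = np.asarray([ivl for ivl in all_intervals if ivl != 0], dtype=float)
    return 1.0 / np.std(intervals)
```

Here are the results, 20 samples: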
As you can see, it's the same picture again, for the fourth time: all values are very close, and no variant of LB is very different from any other based on this metric. If we sort by these 4 different metrics, we get very different rankings, so we can't say "Well, let's just pick whichever variant of LB results in the lowest value of clustering according to all 4 metrics". Calculating these metrics based on all 90 days or only 15 days makes no difference either. In any case, all variants of LB are practically the same, regardless of which metric you are looking at.
Squared count seems to work better, both in terms of making retention closer to fuzz and in terms of reducing volatility. Based on retention and volatility, Double Weighted is the best.
So, I think it would be better to see how it works in the real world. The next Anki update will likely be a big one with a long beta cycle, so beta testers will be able to share their experiences before the stable release. Let's go with the Double Weighted one and add an option to disable the Load Balancer (even if just in the Debug Console) for people who run into problems.
Can I close this issue? I think the conclusion has been drawn. Btw, I wrote a notebook to simulate Easy Days here: https://github.com/open-spaced-repetition/easy-days-simulator/blob/main/notebook.ipynb I recommend opening a new issue to discuss it.
Yeah, close it.
@L-M-Sherlock @jakeprobst @user1823 @brishtibheja
Right now, there are two "competing" ideas regarding Load Balancing. Everyone agrees that making the probability of scheduling a card on a specific day inversely proportional to how many due cards that day has is a good idea. However, Jake and I disagree on the details, and it's not clear whose idea is better.
Jake proposes N = days in range, where N is the number of candidate days with the lowest due counts. In other words, his idea is to allow the Load Balancer to schedule a card on any day within the fuzz range, just with different non-zero probabilities.
My idea: N = max(floor(days in range / 2), 2). This ensures that cards can only be scheduled on days with a relatively low due count and cannot be scheduled on a day with the highest (or one of the highest) due count.
It may not be immediately clear what the difference is, so here's a visualization:
Suppose the fuzz range for a card includes the following days: [23, 24, 25, 26, 27], so days in range = 5. Jake's range and the fuzz range are the same; the difference is that in Jake's version, the probability of a card falling on a particular day isn't the same for every day. My range is narrower: in this example, it only includes the two days with the lowest due counts. The probabilities for the other 3 days are 0. This avoids clustering while staying true to the spirit of this feature, which is "do not schedule a card on a day that already has a lot of cards". See the sketch below.
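As an illustration (hypothetical code, not from either implementation), the two ideas differ only in which candidate days keep a non-zero weight:

```python
import math
import random

def pick_day(fuzz_range, due_counts, restricted=False):
    days = sorted(fuzz_range, key=lambda d: due_counts.get(d, 0))
    if restricted:
        # My idea: keep only the N days with the lowest due counts.
        n = max(math.floor(len(days) / 2), 2)
        days = days[:n]
    # Jake's idea (and the shared part): weight inversely by due count.
    weights = [1.0 / (due_counts.get(d, 0) + 1) for d in days]
    return random.choices(days, weights=weights, k=1)[0]

# With due counts {23: 40, 24: 10, 25: 35, 26: 12, 27: 50}, the restricted
# version only ever picks day 24 or 26; Jake's version can pick any of the five.
```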
In order to test which idea is better, we could ask Jake to run both on his collection, but that would take months. Instead, Jarrett, I would like you to collaborate with Jake to incorporate the Load Balancer into your Simulator, so we can test these ideas in a simulation.