[Feature Request] Train a gradient-boosted decision tree #28

Closed
maxencefrenette opened this issue Jan 3, 2024 · 36 comments

Comments

@maxencefrenette

Although transformers would probably give the best performance with enough training and hyperparameter tuning, I suspect that a gradient-boosted decision tree ensemble might outperform FSRS with very little tweaking, using a methodology similar to this: https://machinelearningmastery.com/xgboost-for-time-series-forecasting/. It would, however, be a much heavier model, with many more parameters than even the LSTM that was attempted.
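
The linked methodology reframes a sequence as a supervised-learning table by using a sliding window of past values as features. A minimal sketch of that framing for review logs follows. It assumes each card's history is a list of (delta_t, rating) pairs and that a rating above 1 counts as a successful recall; the function name `make_lag_rows`, the window size, and the zero padding are illustrative choices, not anything taken from the benchmark code.

```python
from typing import List, Tuple

Review = Tuple[int, int]  # (delta_t in days, rating 1-4)

def make_lag_rows(history: List[Review], n_lags: int = 2):
    """Turn one card's review history into (features, label) rows.

    Features for review i: the (delta_t, rating) of the previous n_lags
    reviews (zero-padded) plus the delta_t of review i itself.
    Label: 1 if review i was recalled (rating > 1, an assumption), else 0.
    """
    rows = []
    for i in range(1, len(history)):
        past = history[max(0, i - n_lags):i]
        # Left-pad with zeros so every row has the same width.
        padded = [(0, 0)] * (n_lags - len(past)) + past
        features = [value for pair in padded for value in pair]
        features.append(history[i][0])  # elapsed days before this review
        label = 1 if history[i][1] > 1 else 0
        rows.append((features, label))
    return rows

# Illustrative history: (delta_t, rating) for four reviews of one card.
card = [(-1, 3), (0, 1), (0, 1), (0, 3)]
rows = make_lag_rows(card)
```

Rows in this shape could then be passed to any gradient-boosted classifier; which library and hyperparameters work best here is an open question.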

This is something I'd be interested in exploring if I could have access to the training data.

@L-M-Sherlock
Member

L-M-Sherlock commented Jan 4, 2024

Here are 10 users' datasets: tiny_dataset.zip

You can use them to test your model. PRs are welcome, and I can help you benchmark the model.

@maxencefrenette
Author

I'll see what sort of results I can get with this. Thanks for the data!

@imrryr

imrryr commented Jan 11, 2024

So, I'm trying to run your script.py with this dataset, and it creates an evaluation directory, but it is empty. (I put the dataset in the dataset directory). Can you help me with the next steps, please? By the way, this is Pavlik, working with Hannah-Joy Simms

@Expertium
Contributor

Not sure if that helps, but I use cmd (Windows) and the following command: set DEV_MODE=1 && python script.py

@imrryr

imrryr commented Jan 11, 2024

That doesn't produce changes. I think the problem is that it may not be finding the data, but I'm not sure how to check for that.

@Expertium
Contributor

Do you have the fsrs-optimizer repo downloaded too? script.py relies on fsrs_optimizer.py.

import os
import sys

if os.environ.get("DEV_MODE"):
    # for local development
    sys.path.insert(0, os.path.abspath("../fsrs-optimizer/src/fsrs_optimizer/"))

from fsrs_optimizer import (
    Optimizer,
    Trainer,
    FSRS,
    Collection,
    power_forgetting_curve,
)
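
To check which copy of fsrs_optimizer a run would pick up, something like the following sketch may help. It mirrors the DEV_MODE check quoted above, but the helper `describe_import_source` and its messages are hypothetical and not part of script.py. Note that Python silently ignores `sys.path` entries that do not exist, so a missing sibling folder would make the import fall back to the pip-installed package without any error.

```python
import os

def describe_import_source(environ, repo_dir):
    """Report where fsrs_optimizer would most likely be imported from.

    Hypothetical helper: it only repeats the DEV_MODE test and checks
    whether the sibling checkout exists next to repo_dir.
    """
    if environ.get("DEV_MODE"):
        local_src = os.path.abspath(
            os.path.join(repo_dir, "..", "fsrs-optimizer", "src", "fsrs_optimizer")
        )
        if os.path.isdir(local_src):
            return "local checkout: " + local_src
        return "DEV_MODE is set but the folder is missing: " + local_src
    return "installed package (DEV_MODE is not set)"

print(describe_import_source(os.environ, os.getcwd()))
```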

@imrryr

imrryr commented Jan 11, 2024

I installed it like this. Is that right?
PS C:\Users\ppavl\Dropbox\Active projects\fsrs-benchmark> python -m pip install fsrs-optimizer
Collecting fsrs-optimizer
Using cached FSRS_Optimizer-4.20.8-py3-none-any.whl.metadata (4.2 kB)
Requirement already satisfied: matplotlib>=3.7.0 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from fsrs-optimizer) (3.8.2)
Requirement already satisfied: numpy>=1.22.4 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from fsrs-optimizer) (1.26.3)
Requirement already satisfied: pandas>=1.5.3 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from fsrs-optimizer) (2.1.4)
Requirement already satisfied: pytz>=2022.7.1 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from fsrs-optimizer) (2023.3.post1)
Requirement already satisfied: scikit-learn>=1.2.2 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from fsrs-optimizer) (1.3.2)
Requirement already satisfied: torch>=1.13.1 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from fsrs-optimizer) (2.1.2)
Collecting tqdm>=4.64.1 (from fsrs-optimizer)
Using cached tqdm-4.66.1-py3-none-any.whl.metadata (57 kB)
Collecting statsmodels>=0.13.5 (from fsrs-optimizer)
Downloading statsmodels-0.14.1-cp311-cp311-win_amd64.whl.metadata (9.8 kB)
Requirement already satisfied: contourpy>=1.0.1 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from matplotlib>=3.7.0->fsrs-optimizer) (1.2.0)
Requirement already satisfied: cycler>=0.10 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from matplotlib>=3.7.0->fsrs-optimizer) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from matplotlib>=3.7.0->fsrs-optimizer) (4.47.2)
Requirement already satisfied: kiwisolver>=1.3.1 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from matplotlib>=3.7.0->fsrs-optimizer) (1.4.5)
Requirement already satisfied: packaging>=20.0 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from matplotlib>=3.7.0->fsrs-optimizer) (23.2)
Requirement already satisfied: pillow>=8 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from matplotlib>=3.7.0->fsrs-optimizer) (10.2.0)
Requirement already satisfied: pyparsing>=2.3.1 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from matplotlib>=3.7.0->fsrs-optimizer) (3.1.1)
Requirement already satisfied: python-dateutil>=2.7 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from matplotlib>=3.7.0->fsrs-optimizer) (2.8.2)
Requirement already satisfied: tzdata>=2022.1 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from pandas>=1.5.3->fsrs-optimizer) (2023.4)
Requirement already satisfied: scipy>=1.5.0 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from scikit-learn>=1.2.2->fsrs-optimizer) (1.11.4)
Requirement already satisfied: joblib>=1.1.1 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from scikit-learn>=1.2.2->fsrs-optimizer) (1.3.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from scikit-learn>=1.2.2->fsrs-optimizer) (3.2.0)
Collecting patsy>=0.5.4 (from statsmodels>=0.13.5->fsrs-optimizer)
Downloading patsy-0.5.6-py2.py3-none-any.whl.metadata (3.5 kB)
Requirement already satisfied: filelock in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from torch>=1.13.1->fsrs-optimizer) (3.13.1)
Requirement already satisfied: typing-extensions in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from torch>=1.13.1->fsrs-optimizer) (4.9.0)
Requirement already satisfied: sympy in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from torch>=1.13.1->fsrs-optimizer) (1.12)
Requirement already satisfied: networkx in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from torch>=1.13.1->fsrs-optimizer) (3.2.1)
Requirement already satisfied: jinja2 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from torch>=1.13.1->fsrs-optimizer) (3.1.3)
Requirement already satisfied: fsspec in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from torch>=1.13.1->fsrs-optimizer) (2023.12.2)
Collecting colorama (from tqdm>=4.64.1->fsrs-optimizer)
Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Requirement already satisfied: six in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from patsy>=0.5.4->statsmodels>=0.13.5->fsrs-optimizer) (1.16.0)
Requirement already satisfied: MarkupSafe>=2.0 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from jinja2->torch>=1.13.1->fsrs-optimizer) (2.1.3)
Requirement already satisfied: mpmath>=0.19 in c:\users\ppavl\appdata\local\programs\python\python311\lib\site-packages (from sympy->torch>=1.13.1->fsrs-optimizer) (1.3.0)
Downloading FSRS_Optimizer-4.20.8-py3-none-any.whl (25 kB)
Downloading statsmodels-0.14.1-cp311-cp311-win_amd64.whl (9.9 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.9/9.9 MB 19.1 MB/s eta 0:00:00
Using cached tqdm-4.66.1-py3-none-any.whl (78 kB)
Downloading patsy-0.5.6-py2.py3-none-any.whl (233 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 233.9/233.9 kB 14.0 MB/s eta 0:00:00
Installing collected packages: patsy, colorama, tqdm, statsmodels, fsrs-optimizer
Successfully installed colorama-0.4.6 fsrs-optimizer-4.20.8 patsy-0.5.6 statsmodels-0.14.1 tqdm-4.66.1

@Expertium
Contributor

Try running this line in cmd again:
set DEV_MODE=1 && python script.py
Also make sure that fsrs-benchmark and fsrs-optimizer have the same parent folder, for example C:\Users\ppavl\Dropbox\Active projects\fsrs-benchmark and C:\Users\ppavl\Dropbox\Active projects\fsrs-optimizer.
If that doesn't work, then idk, you'll have to wait for LMSherlock to respond.

@L-M-Sherlock
Member

L-M-Sherlock commented Jan 12, 2024

So, I'm trying to run your script.py with this dataset, and it creates an evaluation directory, but it is empty. (I put the dataset in the dataset directory). Can you help me with the next steps, please? By the way, this is Pavlik, working with Hannah-Joy Simms

Did you see the result directory?

[Screenshot]

@imrryr

imrryr commented Jan 12, 2024

Yes, it was there from the start. It is unchanged after running the script

@L-M-Sherlock
Member

L-M-Sherlock commented Jan 12, 2024

Could you paste the output of script displayed in the terminal?

@imrryr

imrryr commented Jan 12, 2024

Yes, but it is blank:

PS C:\Users\ppavl\Dropbox\Active projects\fsrs-benchmark> $env:DEV_MODE="1"; python script.py
PS C:\Users\ppavl\Dropbox\Active projects\fsrs-benchmark>

and

PS C:\Users\ppavl\Dropbox\Active projects\fsrs-benchmark> set DEV_MODE=1
PS C:\Users\ppavl\Dropbox\Active projects\fsrs-benchmark> python script.py
PS C:\Users\ppavl\Dropbox\Active projects\fsrs-benchmark>

@L-M-Sherlock
Member

Weird. Nothing happened after the execution? I'm sorry I can't help you, because I don't have a Windows device.

@L-M-Sherlock
Member

Could you check the file path of your dataset?

@imrryr

imrryr commented Jan 12, 2024

You can see it on the left. I wasn't sure of the format, so I offered the tiny dataset as csv, in the folder, and as a zip.
[Screenshot: script.py - fsrs-benchmark - Visual Studio Code, 1/12/2024 9:50:18 AM]

@L-M-Sherlock
Member

It's weird. Could you add print(os.getcwd()) below if __name__ == "__main__":? I guess it's a path related problem.

@imrryr

imrryr commented Jan 13, 2024

It says: C:\Users\ppavl\Dropbox\Active projects\fsrs-benchmark

@L-M-Sherlock
Member

Maybe you can print(unprocessed_files) to check whether the dataset has been read.

@imrryr

imrryr commented Jan 15, 2024

So, in my configuration, the script wasn't overwriting the old results directory that came with the GitHub repo. I renamed that directory to results2, and now it creates the results directory as expected. I'll likely have some questions, so I'll send you an email unless you prefer that I post them here as new issues.
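
The behavior described here (no output while an old results directory was present) would be consistent with a script that skips any dataset file already having a result. The sketch below illustrates that kind of check; it is an assumption about the mechanism, not a copy of script.py, and the directory names, file extensions, and the helper `find_unprocessed` are all hypothetical.

```python
from pathlib import Path
import tempfile

def find_unprocessed(dataset_dir: Path, result_dir: Path):
    """Return dataset files that have no same-named result yet.

    Assumption: one result file per dataset file, matched by file stem.
    """
    done = {p.stem for p in result_dir.glob("*")} if result_dir.is_dir() else set()
    return sorted(p.name for p in dataset_dir.glob("*.csv") if p.stem not in done)

# Demo in a temporary folder: a result for user "1" already exists.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "dataset").mkdir()
    (root / "result").mkdir()
    for name in ("1.csv", "2.csv"):
        (root / "dataset" / name).write_text("card_id,review_th,delta_t,rating\n")
    (root / "result" / "1.json").write_text("{}")
    pending = find_unprocessed(root / "dataset", root / "result")
    print(pending)  # ['2.csv']
```

If every dataset file already has a result, the list is empty and a run would finish without printing anything, which matches the blank terminal output reported earlier.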

@Expertium
Contributor

@imrryr how's the progress?

@imrryr

imrryr commented Jan 20, 2024

Well, pretty good. I'm trying to get some appropriate data to compare this with some of our methods (e.g. https://scholar.google.com/citations?view_op=view_citation&hl=en&user=Ye48zsYAAAAJ&sortby=pubdate&citation_for_view=Ye48zsYAAAAJ:iyewoVqAXLQC ). I contacted Dae and am also looking at the MaiMemo data. I'm a little confused now since I realize I don't know the formal relationship of FSRS 4.5 and SSP-MMC. I'd be happy if someone could explain that... @Expertium

Could one simply use the MaiMemo data with the FSRS 4.5 algorithm? @L-M-Sherlock

@L-M-Sherlock
Member

I'm a little confused now since I realize I don't know the formal relationship of FSRS 4.5 and SSP-MMC

They are both based on the DSR model, but in SSP-MMC the difficulty of cards is predetermined because we have millions of users learning the same set of vocabulary.

Could one simply use the MaiMemo data with the FSRS 4.5 algorithm?

It's hard because the MaiMemo data doesn't contain every user's entire review history.

@imrryr

imrryr commented Jan 21, 2024

@L-M-Sherlock OK, got it. So, DSR = difficulty, stability, retrievability. So when I unpack the SSP-MMC notation in your paper, I will see that it corresponds closely to the FSRS model, except that the difficulties are fixed in the SSP-MMC method? Also, I got the full data, so I may have more questions as I move forward on this with Hannah.
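
For orientation, here is a small sketch of how stability and retrievability relate in a DSR-style model. It assumes the power forgetting curve R = (1 + 19/81 · t/S)^(-0.5), which is the FSRS-4.5 form as far as I recall; the constants should be checked against `power_forgetting_curve` in fsrs_optimizer before relying on them. Difficulty does not appear here because it would only influence how S is updated after each review, which this sketch leaves out.

```python
DECAY = -0.5
FACTOR = 19 / 81  # assumed constant, chosen so that R == 0.9 when t == S

def retrievability(t_days: float, stability: float) -> float:
    """Power forgetting curve: probability of recall after t_days."""
    return (1 + FACTOR * t_days / stability) ** DECAY

r_at_s = retrievability(10, 10)   # elapsed time equal to stability
r_later = retrievability(30, 10)  # three times the stability
print(round(r_at_s, 3), round(r_later, 3))
```

Under these constants, stability is the interval at which the predicted recall probability falls to 90%.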

@Expertium
Contributor

My bad, imrryr. All this time I thought you were the person who is implementing a decision tree algorithm.
@maxencefrenette any progress?

@imrryr

imrryr commented Jan 21, 2024

@L-M-Sherlock I am looking at the revlog format in the data archive. Do you have existing code to convert it to your CSV format? I guess I need to do that.

@L-M-Sherlock
Member

Do you have existing code to convert it to your CSV format? I guess I need to do that.

Do you mean this?

https://github.com/open-spaced-repetition/fsrs-optimizer/blob/8ce183629bdd56cf6a4eced66df121caecaef92e/src/fsrs_optimizer/fsrs_optimizer.py#L476-L693

@imrryr

imrryr commented Jan 22, 2024

@L-M-Sherlock Maybe I do, but the format this code creates is different from the one in the dataset folder. Do you know how to convert it into the format needed for input? E.g.:

card_id,review_th,delta_t,rating
0,1,-1,3
0,2,0,3
0,3,4,3

Can you elaborate on how to get to this final format? I may be able to write the code from what you sent already, but help is appreciated.

Also:
review_th - is this the order in which the cards occurred?
delta_t - is this the time elapsed between a card's successive reviews (with 0 indicating less than a day)?
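
One reading of these two columns that is consistent with the sample rows can be sketched as follows. The interpretation is an assumption that the thread does not confirm: review_th is taken as the 1-based position of a review within the whole collection, and delta_t as the number of whole days since the same card's previous review (-1 for a first review). The function `build_rows`, the input tuple layout, and the naive day arithmetic are all hypothetical and are not taken from revlogs2dataset.py.

```python
from collections import defaultdict

SECONDS_PER_DAY = 86_400

def build_rows(revlog):
    """revlog: list of (card_key, unix_seconds, rating), in any order.

    Returns (card_id, review_th, delta_t, rating) rows grouped by card.
    """
    ordered = sorted(revlog, key=lambda r: r[1])
    card_ids = {}   # card_key -> dense id in first-seen order
    last_day = {}   # card_id -> day index of the previous review
    per_card = defaultdict(list)
    for review_th, (key, ts, rating) in enumerate(ordered, start=1):
        card_id = card_ids.setdefault(key, len(card_ids))
        day = ts // SECONDS_PER_DAY  # naive UTC day; real tools may shift the cutoff
        delta_t = day - last_day[card_id] if card_id in last_day else -1
        last_day[card_id] = day
        per_card[card_id].append((card_id, review_th, delta_t, rating))
    return [row for cid in sorted(per_card) for row in per_card[cid]]

# Made-up timestamps, chosen to reproduce the shape of the sample rows.
revlog = [
    ("a", 0, 3),
    ("a", 600, 3),
    ("a", 4 * SECONDS_PER_DAY + 100, 3),
    ("b", 4 * SECONDS_PER_DAY + 200, 3),
]
rows = build_rows(revlog)
```

The choice of day boundary matters: moving the cutoff by a few hours can change a delta_t by one day without changing anything else.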

@L-M-Sherlock
Member

Can you elaborate on how to get to this final format? I may be able to write the code from what you sent already, but help is appreciated.

The code used to generate data in that format is here: https://github.com/open-spaced-repetition/fsrs-benchmark/blob/main/revlogs2dataset.py

@imrryr

imrryr commented Jan 26, 2024

So, this code seemed to work at first, but it doesn't produce the same results as the tiny dataset had. It's weirdly similar, with the same number of card_id values and the same length, just with corrupted review_th and delta_t. For example, the correct file:
card_id,review_th,delta_t,rating
0,1,-1,3
0,2,0,3
0,3,4,3
0,163,6,4
0,237,1,2
0,380,11,4
1,4,-1,3
1,14,0,1
1,16,0,1
1,21,0,3
1,30,0,3
1,111,2,3
1,160,4,4
1,340,8,3

the output I get from revlogs2dataset.py:
card_id,review_th,delta_t,rating
card_id,review_th,delta_t,rating
0,4863,-1,3
0,4864,0,3
0,4997,4,3
0,5846,5,4
0,6105,2,2
0,6745,10,4
1,4998,-1,3
1,5008,0,1
1,5010,0,1
1,5015,0,3
1,5024,0,3
1,5276,1,3
1,5843,4,4
1,6371,9,3
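
A quick way to see which columns actually disagree is to count mismatches per column. The sketch below does that for the card 0 rows copied from the two listings above; the helper `count_mismatches` is hypothetical.

```python
import csv
import io

EXPECTED = """card_id,review_th,delta_t,rating
0,1,-1,3
0,2,0,3
0,3,4,3
0,163,6,4
0,237,1,2
0,380,11,4
"""

ACTUAL = """card_id,review_th,delta_t,rating
0,4863,-1,3
0,4864,0,3
0,4997,4,3
0,5846,5,4
0,6105,2,2
0,6745,10,4
"""

def count_mismatches(expected_csv, actual_csv):
    """Count, per column, the rows whose values differ."""
    exp = list(csv.DictReader(io.StringIO(expected_csv)))
    act = list(csv.DictReader(io.StringIO(actual_csv)))
    counts = {name: 0 for name in exp[0]}
    for e, a in zip(exp, act):
        for name in counts:
            if e[name] != a[name]:
                counts[name] += 1
    return counts

mismatches = count_mismatches(EXPECTED, ACTUAL)
print(mismatches)
```

For these rows, card_id and rating agree everywhere, every review_th differs, and half of the delta_t values are off by one day. One possible reading, offered as a guess rather than a diagnosis, is that the regenerated file counts reviews that the reference file excluded and uses a different day cutoff.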

@L-M-Sherlock
Member

So, this code seemed to work at first, but doesn't produce the same results as the tiny dataset had.

Please open a new issue to report the details. I hope you can share the revlogs file and your script code.

L-M-Sherlock closed this as not planned on Feb 5, 2024

@Expertium
Contributor

Well that's a bummer. Why did you close it?

@L-M-Sherlock
Member

L-M-Sherlock commented Feb 5, 2024

Because I don't plan to implement the model and I have shared the dataset with the creator of this issue.

@Expertium
Contributor

Yeah, but did the creator of the issue himself say that he's not planning to work on it?

@maxencefrenette
Author

Hi all, I'm still working on this, but progress is slow since I don't have a ton of time to spend on this. I got what I wanted out of this issue, which is a public subset of the data, thanks a lot for that. I'm okay with closing this, I don't need the issue to be open to work on it.

@Expertium
Contributor

@maxencefrenette I think it's best to keep the number of trainable parameters around 500-600, since that's roughly how many parameters our LSTM and Transformer have. Ideally, we want to see how much architecture affects the results. If the number of parameters across different algorithms is similar, then we can clearly see which architecture is superior.
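
What counts as a "trainable parameter" in a tree ensemble is a matter of convention, so any budget comparison needs one stated up front. The sketch below uses one possible convention, chosen here purely for illustration: each internal node contributes one threshold, each leaf contributes one value, and the choice of split feature is not counted. The function name and the example sizes are hypothetical.

```python
def tree_ensemble_param_count(n_trees: int, leaves_per_tree: int) -> int:
    """Count learned numbers under one possible convention.

    A binary tree with L leaves has L - 1 internal nodes. Here each
    internal node contributes one threshold and each leaf one value.
    """
    internal_nodes = leaves_per_tree - 1
    return n_trees * (internal_nodes + leaves_per_tree)

# Example: 40 trees with 8 leaves each.
count = tree_ensemble_param_count(n_trees=40, leaves_per_tree=8)
print(count)  # 600
```

Under this convention, an ensemble in the 500-600 range would be quite small, for instance a few dozen shallow trees.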

@Expertium
Contributor

@maxencefrenette Hello again! LMSherlock and I have redefined RMSE and are almost done benchmarking the algorithms again. If you still want to participate (and I hope you do), now is a good time.
