-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat/export dataset for research #3511
Feat/export dataset for research #3511
Conversation
pylib/anki/collection.py
Outdated
self, min_entries: int = 0, target_path: str | None = None | ||
) -> None: | ||
if target_path is None: | ||
target_path = os.path.join(os.path.dirname(self.path), "dataset.pb") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer target_path was required instead of falling back to dataset.pb. Otherwise looks good 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the advice. I update it.
rslib/src/scheduler/fsrs/weights.rs
Outdated
.filter_map(|deck| { | ||
let parent_id = self | ||
.storage | ||
.parent_decks(&deck) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not very efficient, but I don't know if we have an existing routine you can reuse to look up the parent from a hashmap, so I won't ask you to change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cannot find an existing routine. What about that:
pub(crate) fn immediate_parent_deck(&self, child: &Deck) -> Result<Option<Deck>> {
if let Some(parent_name) = immediate_parent_name(child.name.as_native_str()) {
if let Some(parent_did) = self.get_deck_id(parent_name)? {
return self.get_deck(parent_did);
}
}
Ok(None)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a function named deck_tree
:
Lines 252 to 261 in c1a2b03
pub fn deck_tree(&mut self, timestamp: Option<TimestampSecs>) -> Result<DeckTreeNode> { | |
let names = self.storage.get_all_deck_names()?; | |
let mut tree = deck_names_to_tree(names.into_iter()); | |
let decks_map = self.storage.get_decks_map()?; | |
add_collapsed_and_filtered(&mut tree, &decks_map, timestamp.is_none()); | |
if self.default_deck_is_empty()? { | |
hide_default_deck(&mut tree); | |
} |
Do you mean this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we'd need immediate_parent_deck() and then looking up the parent in a hashset of all deck names. Alternatively, you could use deck_tree() and then walk the tree once to determine the parent of each deck. But this is code that will run infrequently and the delay won't be huge, so I'm ok with not making this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I meant immediate_parent_name() - glad to see you figured that out :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I refactor the code. Now it looks up the parent from the hashmap.
Co-authored-by: Damien Elmes <dae@users.noreply.github.com>
rslib/src/scheduler/fsrs/weights.rs
Outdated
@@ -1,3 +1,4 @@ | |||
use std::collections::HashMap; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing that Jarrett. Just one little tweak - please move this down
Thanks :-) It may take a couple of weeks for me to get this onto AnkiWeb - I still need to catch up with the recent FSRS changes as well. Will let you know. |
Finally got updates working again. Here's a small test export. Please confirm it has everything you want, then I can do a bigger run. https://drive.google.com/file/d/1JmUvu6SELAy9jSXejksjfTBTKmYEQ3mI/view?usp=sharing Is 20k a good number, or do you want more or less? |
20k feels too much. My device will run out of its disk. 10k is enough. Thanks! I will check the test export and give you feedback later. |
Checked! It's really what I want. |
Last time, I think I gathered revlogs with 5k+ reviews. Do you have a preference for min number? |
Is it possible to restrict the limit to exclude users who have thousands of revlog entries generated by rescheduling on change? |
Not currently - export_dataset() only takes min_entries. If you need this, you can submit a PR to change it, and then I will need to deploy a new update to AnkiWeb. Please let me know if you'd like me to generate the entries as-is, or wait for the PR. |
(sorry, just another post in case you didn't see my edit) |
Please wait for that, thanks! I'm afraid that this problem is worse for those users registered in last year because the |
Is your goal to exclude users who have rescheduled on change? Or exclude the rescheduling revlogs for those users? If your goal is the former, won't that bias the results away from early adopters? |
My goal is the latter, because these entries are useless in optimization. |
@dae any chance we will get the new dataset before the release of Anki 24.11? |
@Expertium, I have received the dataset. The benchmark will be updated in next two weeks. |
source: open-spaced-repetition/srs-benchmark#29 (comment)