Tutorial restructure draft #44
At this point I am baffled and I think I need help figuring this out, @julian-risch.
Right now, we look at this dataset to count the number of times tutorial 2 has been run. Now that we're decoupling the distillation part, this line is out. However, at no point in the training step do we have users download the data for `data/squad20` (`train_filename="dev-v2.0.json"`), yet the tutorial works. Does the `train()` function pull it by name from somewhere? Once we merge these tutorial restructures, is it possible to track that? @agnieszka-m, tagging you here for visibility.

That can't work. The only possible explanation is that you already have the `data_dir = "data/squad20"` directory with some data in it, for example because you ran another tutorial before. We also need to `fetch_archive_from_http` the data in tutorial 2, so this needs to be added here again. I could upload a "datasets/documents/squad_small2.json.zip" or something like that, with a different dataset name than before for telemetry.

Ah, I'm wrong.
I just tried running the new tutorial here: https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/tutorial_restructure_draft/tutorials/02_finetune_a_reader.ipynb and it does download the dataset.
The explanation is that the processor has two datasets hardcoded here: https://github.com/deepset-ai/haystack/blob/e266cf6e29f78df751d9dbe7a505886579233aa5/haystack/modeling/data_handler/processor.py#L42 One of them is SQuAD.
Here is the logic that decides whether to download that data: https://github.com/deepset-ai/haystack/blob/e266cf6e29f78df751d9dbe7a505886579233aa5/haystack/modeling/data_handler/processor.py#L2250

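For readers following the thread, here is a minimal sketch of a download-if-missing check of the kind discussed above. It is not the code in `processor.py`: `KNOWN_DATASETS`, `ensure_dataset`, the placeholder URL, and the injected `download` callable are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical registry: dataset directory name -> archive URL.
# The URL is a placeholder, not a real download location.
KNOWN_DATASETS: Dict[str, str] = {
    "squad20": "https://example.com/squad20.tar.gz",
}


def ensure_dataset(train_file: Path, download: Callable[[str, Path], None]) -> bool:
    """Make sure train_file exists, downloading its dataset if it is a known one.

    Returns True if a download was triggered, False if the file was already on disk.
    """
    if train_file.exists():
        # Data left over from an earlier run satisfies the check, so nothing is fetched.
        return False

    # The dataset is identified by the name of the directory holding the file.
    dataset_name = train_file.parent.name
    if dataset_name not in KNOWN_DATASETS:
        raise FileNotFoundError(
            f"{train_file} is missing and '{dataset_name}' is not a known dataset"
        )

    download(KNOWN_DATASETS[dataset_name], train_file.parent)
    return True
```

With a check like this, a training file left over from an earlier tutorial run and a fresh environment that downloads by name both end up with working data, which would explain why the tutorial ran in both cases. It also means no explicit, tutorial-specific download happens that telemetry could count.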
Thanks. In that case, until we decide whether to change or keep the way we do telemetry for tutorials, I will add a `fetch_archive_from_http` call to tutorial 2 as well, as you suggested in the comment above. In the meantime, I will create a list of the datasets used by the tutorials; until (and if) there is a change, it will be healthier to have. I've already decoupled tutorials 1 and 3 into a separate PR, as there's no need for those two to wait while we're fixing datasets etc. for Tutorials 2 and 21.
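As a rough idea of what such a dataset list could look like, the sketch below keeps one mapping from tutorial to dataset archive and reports any archive used by more than one tutorial, since a shared name would make per-tutorial counts ambiguous. The tutorial keys and archive names in the example are placeholders, and `shared_datasets` is a hypothetical helper, not part of Haystack.

```python
from collections import defaultdict
from typing import Dict, List


def shared_datasets(tutorial_datasets: Dict[str, str]) -> Dict[str, List[str]]:
    """Return the datasets used by more than one tutorial, with the tutorials using them."""
    users: Dict[str, List[str]] = defaultdict(list)
    for tutorial, dataset in tutorial_datasets.items():
        users[dataset].append(tutorial)
    # Keep only datasets whose download cannot be attributed to a single tutorial.
    return {
        dataset: sorted(tutorials)
        for dataset, tutorials in users.items()
        if len(tutorials) > 1
    }


# Placeholder entries for illustration only.
example = {
    "tutorial_a": "dataset_one.zip",
    "tutorial_b": "dataset_one.zip",
    "tutorial_c": "dataset_two.zip",
}
print(shared_datasets(example))
```

An empty result would mean every tutorial downloads a uniquely named dataset, which is the property the different dataset name for tutorial 2 is meant to preserve.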