Tutorial restructure draft #44

brandenchan · 2022-10-14T10:59:51Z

Creating a draft of Tutorial 1 in a new tutorial style. To summarize the main design changes:

Adhering to the conventional idea of a tutorial being a hands on and step by step lesson towards creating something concrete
Shifting the large majority of the conceptual explanation from markdown text blocks to links to documentation pages
Making it clearer and cleaner by removing large majority of code comments, and also optional code blocks
Adding opening bullet points on level, time to complete, prerequisites, nodes used and goal

review-notebook-app · 2022-10-14T10:59:55Z

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

ZanSara

Sorry in advance if the comments are too drastic! Let's discuss if there's a way to address them. I won't block this refactoring for them.

tutorials/01_Basic_QA_Pipeline.ipynb

bilgeyucel

I loved how everything is well structured and goes step-by-step! Such an improvement (at least for me 😅)

markdowns/1.md

bglearning · 2022-10-18T10:27:50Z

Minor comment, just thinking out loud: for Tut 1, what is the reason we use ElasticSearch (as opposed to InMemory)? Is it to start off with a good default?

The "QA System" -> "QA System without ES" rather than "QA System (InMemory)" -> "QA System with ES" has always felt a bit odd to me. 😄

Just a hunch: someone not familiar with ES might find all the ES setup intimidating (albeit it's just running the given code/commands).

brandenchan · 2022-10-18T10:47:17Z

Minor comment, just thinking out loud: for Tut 1, what is the reason we use ElasticSearch (as opposed to InMemory)? Is it to start off with a good default?

The "QA System" -> "QA System without ES" rather than "QA System (InMemory)" -> "QA System with ES" has always felt a bit odd to me. smile

Just a hunch: someone not familiar with ES might find all the ES setup intimidating (albeit it's just running the given code/commands).

I actually agree with you on this one! I think ES is a complicating factor this early on. And they aren't working at the scale where they will be getting the benefits of an ES server instead of an InMemoryDS. One thing though is that InMemoryDS will require using the TFIDFRetriever instead of the BM25Retriever which is not a recommended default.

ZanSara · 2022-10-19T12:06:31Z

We could also consider adding support form BM25 in InMemory in the near future if there's a quick and easy way to do that. There might be libraries for it, so we wouldn't need to implement BM25 from scratch (for example https://pypi.org/project/rank-bm25/).

markdowns/1.md

new_tutorials/04_distilling_a_reader.ipynb

tutorials/01_Basic_QA_Pipeline.ipynb

anakin87 · 2022-11-17T10:54:36Z

To give a general opinion, I really appreciate this restructuring of the tutorials: it seems to me that they lower the entry barrier a little, without selling overly pre-packaged solutions. 💪

Just two small comments on the fine-tuning tutorial:

I would like the user to be able to directly experience the positive effect of fine-tuning (some examples, metrics...), but I realize that it is difficult to do so if the evaluation topic is not introduced first
I understand that using the SQuAD dev set for fine-tuning is simple and makes sense for demonstration purposes.
On the other hand, I don't want the user to think this is the norm. (I'm not an expert on tuning QA models, so what I said might be nonsense...)

brandenchan · 2022-11-17T15:33:37Z

deepset-ai/haystack#3561 will add support for BM25Retriever in InMemoryDocumentStore.
I think it is already well advanced and I hope to complete it within a maximum of two days. So maybe we could wait for it to be merged...

Thanks @anakin87 for the heads up! Having BM25 in the first tutorial working with the InMemoryDocumentStore will be a big step up!

brandenchan · 2022-11-17T15:36:52Z

I would like the user to be able to directly experience the positive effect of fine tuning (some examples, metrics...), but I realize that it is difficult to do so if the evaluation topic is not introduced first

I understand that using the SQuAD dev set for fine-tuning is simple and makes sense for demonstration purposes.
On the other hand, I don't want the user to think this is the norm. (I'm not an expert on tuning QA models, soso what I said might be nonsense...)

Thank you very much for taking the time to have a look at the tutorials! Your feedback is definitely really helpful. And I'm glad you picked up on this point. Further training the QA model on SQuAD dev is definitely not ideal because you cannot evaluate it anymore on SQuAD. Theoretically it should be generally better just because it has seen more data. But I will ask internally whether we have any more appropriate datasets to work with in this tutorial

ZanSara · 2022-11-22T08:26:57Z

@brandenchan deepset-ai/haystack#3561 is merged, we can now use BM25 with InMemoryDocumentStore. Let's update the tutorial to do so!

anakin87 · 2022-11-22T15:41:47Z

@brandenchan to build the BM25 representation, the document store must be initialized as follows:
document_store = InMemoryDocumentStore(use_bm25=True)

The BM25 representation is not computed by default: if you just want to use the InMemoryDocumentStore for dense retrieval or for TF-IDF retrieval, it is not needed and would make operations slower.

brandenchan · 2022-11-23T08:20:36Z

@brandenchan to build the BM25 representation, the document store must be initialized as follows:
document_store = InMemoryDocumentStore(use_bm25=True)

Thanks for the heads up! Committing that change now

TuanaCelik · 2022-11-24T15:52:04Z

@brandenchan - the slug structure is now ready for you to use. Including the auto-generated download buttons. You can merge master and we can take it from there.

TuanaCelik · 2022-11-25T15:55:31Z

As I will be on PTO next week I will add this note here just in case it is needed. And @julian-risch tagging you here too as it is about tracking in telemetry and the files in the new s3 bucket as well. fyi: we've spoken about this with docs, just putting it in writing.
3 things:

I think we should have a shift in mentality with regards to the numbers. We might want to start thinking of the number at the front of a tutorial as 'identification' numbers. Because the order is not linked to them. And could change in the future too. Depending on what we decide makes sense as a learning track. Otherwise we risk having to re-number tutorial files a lot and also have to adjust how we look at telemetry data..
If a tutorial is the 'same'(ish) just being restructured, I suggest we keep the number in front of the name the same as before, this way we can look at telemetry, see the tutorial number there, and we know which notebook it corresponds to.
With that, in this PR, let's make sure that the dataset used in the tutorial is the correct one for telemetry to id it properly (I think this is simply a path name check)

And with that, suggestion:
Keep the fine-tuning tutorial id'ed 02. This way we can see the difference of usage from before, for the same tutorial.
Meaning 'Build your first Question Answering System' and 'Build a scalable question answering system' can be 01 and 03 (or 03 and 01, I leave this to you :) )- The only trick is to then set their 'order' properly in index.toml

index.toml

TuanaCelik · 2022-11-30T16:06:21Z

index.toml

 aliases = ["without-elasticsearch"]
+slug = "03_scalable_qa_pipeline"


So now that we no longer have and no longer want to maintain "03_build_a_scalable_question_answering_system" --> where do you want the people looking for this to go to? do you want them to be directed to the 1st tutorial? Or this tutorial? We can add the old slug to one of the aliases accordingly.

the first tutorial, I'd say

TuanaCelik · 2022-11-30T16:33:23Z

@brandenchan here are some preview links for you. Please disregard the url structure in these previews. The slugs in this PR are correct, I just had to make the previews like this because I'm hacking it :) We don't have a proper previewing structure in place yet.
Tutorial 1: https://haystack-home-bpgtoj8a9-deepset-overnice.vercel.app/tutorials/1
Tutorial 2: https://haystack-home-bpgtoj8a9-deepset-overnice.vercel.app/tutorials/2
Tutorial 3: https://haystack-home-bpgtoj8a9-deepset-overnice.vercel.app/tutorials/3
Tutorial 21: https://haystack-home-bpgtoj8a9-deepset-overnice.vercel.app/tutorials/21

anakin87 · 2022-12-01T12:05:14Z

Standing to the preview, BM25 doesn't appear in Tutorial 1.

agnieszka-m · 2022-12-05T18:37:30Z

Standing to the preview, BM25 doesn't appear in Tutorial 1.

I think it was mistakenly updated in the .md file instead of the notebook file. Fixed it now. Thanks for being alert! :)

TuanaCelik · 2022-12-20T07:48:39Z

markdowns/03_Scalable_QA_Pipeline.md

+doc_dir = "data/build_a_scalable_question_answering_system"
+
+fetch_archive_from_http(
+    url="https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt1.zip",


This has to change to _txt3.zip

TuanaCelik · 2022-12-20T07:49:06Z

markdowns/21_distill_a_reader.md

+doc_dir = "data/distil_a_reader"
+squad_dir = doc_dir + "/squad"
+
+s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/squad_small.json.zip"


This dataset can be moved from tutorial 2 to 21 for telemetry @julian-risch ?

Yes can be down. To make this process less error-prone maybe we can have a list with old tutorials dataset URLs and what they become? This is not the only one that needs to change and it would be great if we could do it in one larger batch if possible.

TuanaCelik · 2022-12-20T07:53:46Z

markdowns/02_Finetune_a_model_on_your_data.md


-# Downloading very small dataset to make tutorial faster (please use a bigger dataset for real use cases)
-s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/squad_small.json.zip"


At this point I am baffled and I think I need help figuring this out @julian-risch
Right now, we look at this dataset to count the times tutorial 2 has been run. Now we're decoupling the distillation bit so this line is out. However, we don't at any point for the training step have users download the data/squad20 "train_filename="dev-v2.0.json"? But the tutorial works. Does the train() function pull it by name from somewhere? Is it possible to once we merge these tutorial restructures that we track that? @agnieszka-m tagging you here for visibility

That can't work. Only possible explanation is that you already have the data_dir = "data/squad20"directory with some data in it. For example if you ran another tutorial before. We need to also fetch_archive_from_http the data in tutorial 2. So this needs to be added here again. And I could upload a/ datasets/documents/squad_small2.json.zip" or something like that with a different dataset name than before for telemetry.

Ah, I'm wrong.
I just tried running the new tutorial here: https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/tutorial_restructure_draft/tutorials/02_finetune_a_reader.ipynb and it does download the dataset.
The explanation is that the processor has two datasets hardcoded here: https://github.com/deepset-ai/haystack/blob/e266cf6e29f78df751d9dbe7a505886579233aa5/haystack/modeling/data_handler/processor.py#L42 One is squad

Here is the logic that decides about downloading that data: https://github.com/deepset-ai/haystack/blob/e266cf6e29f78df751d9dbe7a505886579233aa5/haystack/modeling/data_handler/processor.py#L2250

Thanks, in that case for now, until we have any decisions to change or keep the way we do telemetry for tutorials, I will add a fetch_archive_from_http to tutorial 2 as well. As you suggested in the comment above. I will create a list of datasets for tutorials in the meantime. Until and if there is a change, it will be healthier to have.

I've already decoupled tutorial 1 and 3 into a separate PR as there's no need for those 2 to wait while we're fixing datasets etc for Tutorial 2 and 21..

brandenchan added 9 commits October 11, 2022 17:37

Draft tutorial 1 restructure

de0f921

Add title

e6a5693

Add link

ac87643

Fix titles

a2c3f36

Add final message

9d128a3

Fix some links

acad49e

Incorporate reviewer feedback

04470d7

Incorporate reviewer feedback

d1a0524

Regenerate markdown

b558a66

ZanSara reviewed Oct 14, 2022

View reviewed changes

tutorials/01_Basic_QA_Pipeline.ipynb Outdated Show resolved Hide resolved

tutorials/01_Basic_QA_Pipeline.ipynb Outdated Show resolved Hide resolved

tutorials/01_Basic_QA_Pipeline.ipynb Outdated Show resolved Hide resolved

tutorials/01_Basic_QA_Pipeline.ipynb Outdated Show resolved Hide resolved

bilgeyucel reviewed Oct 14, 2022

View reviewed changes

markdowns/1.md Outdated Show resolved Hide resolved

ZanSara mentioned this pull request Oct 21, 2022

Add support for BM25Retriever in InMemoryDocumentStore deepset-ai/haystack#3447

Closed

brandenchan added 11 commits October 25, 2022 16:38

Create second tutorial

b99af07

Clear output

464ba3c

Create finetuning tutorial

1ea723e

Create distillation tutorial

3dd310d

Rename tutorial

8e4cc8b

Test and refine distillation tutorial

080b588

Fix first tutorial

8b8d792

Run and test tutorial 2

fbf252e

Run and test tutorial 3

2411b4e

Run and test tutorial 3

a8bfce2

Run and test tutorial 4

b939038

brandenchan requested review from agnieszka-m and TuanaCelik November 3, 2022 15:44

agnieszka-m requested changes Nov 4, 2022

View reviewed changes

Incorporate reviewer feedback

11b1d33

brandenchan requested a review from agnieszka-m November 22, 2022 15:38

Set use_bm25 argument

f4c77f6

This was referenced Nov 23, 2022

New Tutorial: REST API #61

Merged

Fix formatting on all tutorials #75

Closed

brandenchan added 4 commits November 29, 2022 16:09

Merge main

65fa864

Update naming

1843731

Delete old md and regenerate new md

b785f56

Update index.toml and readme

10d3329

TuanaCelik reviewed Nov 30, 2022

View reviewed changes

minor changes for new tutorial structure

1296fc8

agnieszka-m added 5 commits December 5, 2022 19:37

add bm25 and lg updates

f0f90c8

update with bm25

5776b81

Update links and retriever

fc737de

Update links

0c7bee8

update the gpu links

d31cc90

TuanaCelik suggested changes Dec 20, 2022

View reviewed changes

TuanaCelik mentioned this pull request Dec 29, 2022

adding only tutorial 1 and 3 updates #95

Merged

bilgeyucel closed this Nov 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tutorial restructure draft #44

Tutorial restructure draft #44

brandenchan commented Oct 14, 2022

review-notebook-app bot commented Oct 14, 2022

ZanSara left a comment

bilgeyucel left a comment

bglearning commented Oct 18, 2022

brandenchan commented Oct 18, 2022

ZanSara commented Oct 19, 2022 •

edited

Loading

anakin87 commented Nov 17, 2022 •

edited

Loading

brandenchan commented Nov 17, 2022

brandenchan commented Nov 17, 2022

ZanSara commented Nov 22, 2022

anakin87 commented Nov 22, 2022

brandenchan commented Nov 23, 2022

TuanaCelik commented Nov 24, 2022

TuanaCelik commented Nov 25, 2022

TuanaCelik Nov 30, 2022

agnieszka-m Dec 7, 2022

TuanaCelik commented Nov 30, 2022

anakin87 commented Dec 1, 2022

agnieszka-m commented Dec 5, 2022

TuanaCelik Dec 20, 2022

TuanaCelik Dec 20, 2022

julian-risch Dec 22, 2022 •

edited

Loading

TuanaCelik Dec 20, 2022

julian-risch Dec 22, 2022

julian-risch Dec 22, 2022

TuanaCelik Dec 29, 2022

		aliases = ["without-elasticsearch"]
		slug = "03_scalable_qa_pipeline"


		# Downloading very small dataset to make tutorial faster (please use a bigger dataset for real use cases)
		s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/squad_small.json.zip"

Tutorial restructure draft #44

Tutorial restructure draft #44

Conversation

brandenchan commented Oct 14, 2022

review-notebook-app bot commented Oct 14, 2022

ZanSara left a comment

Choose a reason for hiding this comment

bilgeyucel left a comment

Choose a reason for hiding this comment

bglearning commented Oct 18, 2022

brandenchan commented Oct 18, 2022

ZanSara commented Oct 19, 2022 • edited Loading

anakin87 commented Nov 17, 2022 • edited Loading

brandenchan commented Nov 17, 2022

brandenchan commented Nov 17, 2022

ZanSara commented Nov 22, 2022

anakin87 commented Nov 22, 2022

brandenchan commented Nov 23, 2022

TuanaCelik commented Nov 24, 2022

TuanaCelik commented Nov 25, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TuanaCelik commented Nov 30, 2022

anakin87 commented Dec 1, 2022

agnieszka-m commented Dec 5, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

julian-risch Dec 22, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ZanSara commented Oct 19, 2022 •

edited

Loading

anakin87 commented Nov 17, 2022 •

edited

Loading

julian-risch Dec 22, 2022 •

edited

Loading