Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tutorial restructure draft #44

Closed
wants to merge 51 commits into from
Closed

Conversation

brandenchan
Copy link
Contributor

Creating a draft of Tutorial 1 in a new tutorial style. To summarize the main design changes:

  • Adhering to the conventional idea of a tutorial being a hands on and step by step lesson towards creating something concrete
  • Shifting the large majority of the conceptual explanation from markdown text blocks to links to documentation pages
  • Making it clearer and cleaner by removing large majority of code comments, and also optional code blocks
  • Adding opening bullet points on level, time to complete, prerequisites, nodes used and goal

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

Copy link
Contributor

@ZanSara ZanSara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry in advance if the comments are too drastic! Let's discuss if there's a way to address them. I won't block this refactoring for them.

tutorials/01_Basic_QA_Pipeline.ipynb Outdated Show resolved Hide resolved
tutorials/01_Basic_QA_Pipeline.ipynb Outdated Show resolved Hide resolved
tutorials/01_Basic_QA_Pipeline.ipynb Outdated Show resolved Hide resolved
tutorials/01_Basic_QA_Pipeline.ipynb Outdated Show resolved Hide resolved
Copy link
Contributor

@bilgeyucel bilgeyucel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I loved how everything is well structured and goes step-by-step! Such an improvement (at least for me 😅)

markdowns/1.md Outdated Show resolved Hide resolved
@bglearning
Copy link
Contributor

Minor comment, just thinking out loud: for Tut 1, what is the reason we use ElasticSearch (as opposed to InMemory)? Is it to start off with a good default?

The "QA System" -> "QA System without ES" rather than "QA System (InMemory)" -> "QA System with ES" has always felt a bit odd to me. 😄

Just a hunch: someone not familiar with ES might find all the ES setup intimidating (albeit it's just running the given code/commands).

@brandenchan
Copy link
Contributor Author

Minor comment, just thinking out loud: for Tut 1, what is the reason we use ElasticSearch (as opposed to InMemory)? Is it to start off with a good default?

The "QA System" -> "QA System without ES" rather than "QA System (InMemory)" -> "QA System with ES" has always felt a bit odd to me. smile

Just a hunch: someone not familiar with ES might find all the ES setup intimidating (albeit it's just running the given code/commands).

I actually agree with you on this one! I think ES is a complicating factor this early on. And they aren't working at the scale where they will be getting the benefits of an ES server instead of an InMemoryDS. One thing though is that InMemoryDS will require using the TFIDFRetriever instead of the BM25Retriever which is not a recommended default.

@ZanSara
Copy link
Contributor

ZanSara commented Oct 19, 2022

We could also consider adding support form BM25 in InMemory in the near future if there's a quick and easy way to do that. There might be libraries for it, so we wouldn't need to implement BM25 from scratch (for example https://pypi.org/project/rank-bm25/).

markdowns/1.md Outdated Show resolved Hide resolved
markdowns/1.md Outdated Show resolved Hide resolved
markdowns/1.md Outdated Show resolved Hide resolved
markdowns/1.md Outdated Show resolved Hide resolved
markdowns/1.md Outdated Show resolved Hide resolved
new_tutorials/04_distilling_a_reader.ipynb Outdated Show resolved Hide resolved
new_tutorials/04_distilling_a_reader.ipynb Outdated Show resolved Hide resolved
new_tutorials/04_distilling_a_reader.ipynb Outdated Show resolved Hide resolved
tutorials/01_Basic_QA_Pipeline.ipynb Outdated Show resolved Hide resolved
tutorials/01_Basic_QA_Pipeline.ipynb Outdated Show resolved Hide resolved
@anakin87
Copy link
Member

anakin87 commented Nov 17, 2022

To give a general opinion, I really appreciate this restructuring of the tutorials: it seems to me that they lower the entry barrier a little, without selling overly pre-packaged solutions. 💪

Just two small comments on the fine-tuning tutorial:

  • I would like the user to be able to directly experience the positive effect of fine-tuning (some examples, metrics...), but I realize that it is difficult to do so if the evaluation topic is not introduced first
  • I understand that using the SQuAD dev set for fine-tuning is simple and makes sense for demonstration purposes.
    On the other hand, I don't want the user to think this is the norm. (I'm not an expert on tuning QA models, so what I said might be nonsense...)

@brandenchan
Copy link
Contributor Author

deepset-ai/haystack#3561 will add support for BM25Retriever in InMemoryDocumentStore.
I think it is already well advanced and I hope to complete it within a maximum of two days. So maybe we could wait for it to be merged...

Thanks @anakin87 for the heads up! Having BM25 in the first tutorial working with the InMemoryDocumentStore will be a big step up!

@brandenchan
Copy link
Contributor Author

  • I would like the user to be able to directly experience the positive effect of fine tuning (some examples, metrics...), but I realize that it is difficult to do so if the evaluation topic is not introduced first

  • I understand that using the SQuAD dev set for fine-tuning is simple and makes sense for demonstration purposes.
    On the other hand, I don't want the user to think this is the norm. (I'm not an expert on tuning QA models, soso what I said might be nonsense...)

Thank you very much for taking the time to have a look at the tutorials! Your feedback is definitely really helpful. And I'm glad you picked up on this point. Further training the QA model on SQuAD dev is definitely not ideal because you cannot evaluate it anymore on SQuAD. Theoretically it should be generally better just because it has seen more data. But I will ask internally whether we have any more appropriate datasets to work with in this tutorial

@ZanSara
Copy link
Contributor

ZanSara commented Nov 22, 2022

@brandenchan deepset-ai/haystack#3561 is merged, we can now use BM25 with InMemoryDocumentStore. Let's update the tutorial to do so!

@anakin87
Copy link
Member

@brandenchan to build the BM25 representation, the document store must be initialized as follows:
document_store = InMemoryDocumentStore(use_bm25=True)

The BM25 representation is not computed by default: if you just want to use the InMemoryDocumentStore for dense retrieval or for TF-IDF retrieval, it is not needed and would make operations slower.

@brandenchan
Copy link
Contributor Author

@brandenchan to build the BM25 representation, the document store must be initialized as follows:
document_store = InMemoryDocumentStore(use_bm25=True)

Thanks for the heads up! Committing that change now

This was referenced Nov 23, 2022
@TuanaCelik
Copy link
Contributor

@brandenchan - the slug structure is now ready for you to use. Including the auto-generated download buttons. You can merge master and we can take it from there.

@TuanaCelik
Copy link
Contributor

As I will be on PTO next week I will add this note here just in case it is needed. And @julian-risch tagging you here too as it is about tracking in telemetry and the files in the new s3 bucket as well. fyi: we've spoken about this with docs, just putting it in writing.
3 things:

  • I think we should have a shift in mentality with regards to the numbers. We might want to start thinking of the number at the front of a tutorial as 'identification' numbers. Because the order is not linked to them. And could change in the future too. Depending on what we decide makes sense as a learning track. Otherwise we risk having to re-number tutorial files a lot and also have to adjust how we look at telemetry data..
  • If a tutorial is the 'same'(ish) just being restructured, I suggest we keep the number in front of the name the same as before, this way we can look at telemetry, see the tutorial number there, and we know which notebook it corresponds to.
  • With that, in this PR, let's make sure that the dataset used in the tutorial is the correct one for telemetry to id it properly (I think this is simply a path name check)

And with that, suggestion:
Keep the fine-tuning tutorial id'ed 02. This way we can see the difference of usage from before, for the same tutorial.
Meaning 'Build your first Question Answering System' and 'Build a scalable question answering system' can be 01 and 03 (or 03 and 01, I leave this to you :) )- The only trick is to then set their 'order' properly in index.toml

index.toml Outdated Show resolved Hide resolved
index.toml Outdated
aliases = ["without-elasticsearch"]
slug = "03_scalable_qa_pipeline"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So now that we no longer have and no longer want to maintain "03_build_a_scalable_question_answering_system" --> where do you want the people looking for this to go to? do you want them to be directed to the 1st tutorial? Or this tutorial? We can add the old slug to one of the aliases accordingly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the first tutorial, I'd say

@TuanaCelik
Copy link
Contributor

@brandenchan here are some preview links for you. Please disregard the url structure in these previews. The slugs in this PR are correct, I just had to make the previews like this because I'm hacking it :) We don't have a proper previewing structure in place yet.
Tutorial 1: https://haystack-home-bpgtoj8a9-deepset-overnice.vercel.app/tutorials/1
Tutorial 2: https://haystack-home-bpgtoj8a9-deepset-overnice.vercel.app/tutorials/2
Tutorial 3: https://haystack-home-bpgtoj8a9-deepset-overnice.vercel.app/tutorials/3
Tutorial 21: https://haystack-home-bpgtoj8a9-deepset-overnice.vercel.app/tutorials/21

@anakin87
Copy link
Member

anakin87 commented Dec 1, 2022

Standing to the preview, BM25 doesn't appear in Tutorial 1.

@agnieszka-m
Copy link
Contributor

Standing to the preview, BM25 doesn't appear in Tutorial 1.

I think it was mistakenly updated in the .md file instead of the notebook file. Fixed it now. Thanks for being alert! :)

doc_dir = "data/build_a_scalable_question_answering_system"

fetch_archive_from_http(
url="https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt1.zip",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has to change to _txt3.zip

doc_dir = "data/distil_a_reader"
squad_dir = doc_dir + "/squad"

s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/squad_small.json.zip"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This dataset can be moved from tutorial 2 to 21 for telemetry @julian-risch ?

Copy link
Member

@julian-risch julian-risch Dec 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes can be down. To make this process less error-prone maybe we can have a list with old tutorials dataset URLs and what they become? This is not the only one that needs to change and it would be great if we could do it in one larger batch if possible.


# Downloading very small dataset to make tutorial faster (please use a bigger dataset for real use cases)
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/squad_small.json.zip"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point I am baffled and I think I need help figuring this out @julian-risch
Right now, we look at this dataset to count the times tutorial 2 has been run. Now we're decoupling the distillation bit so this line is out. However, we don't at any point for the training step have users download the data/squad20 "train_filename="dev-v2.0.json"? But the tutorial works. Does the train() function pull it by name from somewhere? Is it possible to once we merge these tutorial restructures that we track that? @agnieszka-m tagging you here for visibility

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That can't work. Only possible explanation is that you already have the data_dir = "data/squad20"directory with some data in it. For example if you ran another tutorial before. We need to also fetch_archive_from_http the data in tutorial 2. So this needs to be added here again. And I could upload a/ datasets/documents/squad_small2.json.zip" or something like that with a different dataset name than before for telemetry.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, in that case for now, until we have any decisions to change or keep the way we do telemetry for tutorials, I will add a fetch_archive_from_http to tutorial 2 as well. As you suggested in the comment above. I will create a list of datasets for tutorials in the meantime. Until and if there is a change, it will be healthier to have.

I've already decoupled tutorial 1 and 3 into a separate PR as there's no need for those 2 to wait while we're fixing datasets etc for Tutorial 2 and 21..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants