**Tutorial restructure draft #44** (Closed)

Wants to merge 51 commits.

**Commits (51):**
- de0f921 Draft tutorial 1 restructure (brandenchan, Oct 11, 2022)
- e6a5693 Add title (brandenchan, Oct 11, 2022)
- ac87643 Add link (brandenchan, Oct 11, 2022)
- a2c3f36 Fix titles (brandenchan, Oct 11, 2022)
- 9d128a3 Add final message (brandenchan, Oct 11, 2022)
- acad49e Fix some links (brandenchan, Oct 13, 2022)
- 04470d7 Incorporate reviewer feedback (brandenchan, Oct 13, 2022)
- d1a0524 Incorporate reviewer feedback (brandenchan, Oct 14, 2022)
- b558a66 Regenerate markdown (brandenchan, Oct 14, 2022)
- b99af07 Create second tutorial (brandenchan, Oct 25, 2022)
- 464ba3c Clear output (brandenchan, Nov 1, 2022)
- 1ea723e Create finetuning tutorial (brandenchan, Nov 2, 2022)
- 3dd310d Create distillation tutorial (brandenchan, Nov 2, 2022)
- 8e4cc8b Rename tutorial (brandenchan, Nov 2, 2022)
- 080b588 Test and refine distillation tutorial (brandenchan, Nov 3, 2022)
- 8b8d792 Fix first tutorial (brandenchan, Nov 3, 2022)
- fbf252e Run and test tutorial 2 (brandenchan, Nov 3, 2022)
- 2411b4e Run and test tutorial 3 (brandenchan, Nov 3, 2022)
- a8bfce2 Run and test tutorial 3 (brandenchan, Nov 3, 2022)
- b939038 Run and test tutorial 4 (brandenchan, Nov 3, 2022)
- f3d24c6 Rename tutorial 4 (brandenchan, Nov 4, 2022)
- f0f57a4 Oxford comma (brandenchan, Nov 4, 2022)
- 0daaaa4 Incorporate reviewer feedback for tutorial 2 (brandenchan, Nov 8, 2022)
- babafeb Merge branch 'tutorial_restructure_draft' of https://github.com/deeps… (brandenchan, Nov 8, 2022)
- 4a8334b Incorporate reviewer feedback for tutorial 3 (brandenchan, Nov 8, 2022)
- e3db636 Incorporate reviewer feedback for tutorial 4 (brandenchan, Nov 8, 2022)
- 405ed89 Move new tutorials into folder (brandenchan, Nov 14, 2022)
- dff7aae Merge main (brandenchan, Nov 14, 2022)
- fba5806 Update index (brandenchan, Nov 14, 2022)
- 0ed9fd7 Remove prereqs (brandenchan, Nov 14, 2022)
- 4c711fa Regenerate markdowns (brandenchan, Nov 14, 2022)
- 0eb9de3 Update index.toml (brandenchan, Nov 16, 2022)
- a7318c5 Update index.toml (brandenchan, Nov 16, 2022)
- 7e7b4f3 Update tutorials/03_finetune_a_reader.ipynb (brandenchan, Nov 16, 2022)
- 7c437c4 Update tutorials/03_finetune_a_reader.ipynb (brandenchan, Nov 16, 2022)
- f04a193 Incorporate reviewer feedback (brandenchan, Nov 16, 2022)
- 627ec52 Regenerate markdown (brandenchan, Nov 16, 2022)
- a52d1bc Edit colab env setup sections (brandenchan, Nov 16, 2022)
- 2dc469a Regenerate MD files (brandenchan, Nov 16, 2022)
- 11b1d33 Incorporate reviewer feedback (brandenchan, Nov 22, 2022)
- f4c77f6 Set use_bm25 argument (brandenchan, Nov 23, 2022)
- 65fa864 Merge main (brandenchan, Nov 29, 2022)
- 1843731 Update naming (brandenchan, Nov 29, 2022)
- b785f56 Delete old md and regenerate new md (brandenchan, Nov 29, 2022)
- 10d3329 Update index.toml and readme (brandenchan, Nov 29, 2022)
- 1296fc8 minor changes for new tutorial structure (TuanaCelik, Nov 30, 2022)
- f0f90c8 add bm25 and lg updates (agnieszka-m, Dec 5, 2022)
- 5776b81 update with bm25 (agnieszka-m, Dec 6, 2022)
- fc737de Update links and retriever (agnieszka-m, Dec 6, 2022)
- 0c7bee8 Update links (agnieszka-m, Dec 6, 2022)
- d31cc90 update the gpu links (agnieszka-m, Dec 7, 2022)
**Files changed:**

**README.md** (43 changes: 22 additions & 21 deletions). Large diff not rendered by default.

**index.toml** (30 changes: 20 additions & 10 deletions):
```diff
@@ -4,29 +4,31 @@ toc = true
 colab = "https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/"
 
 [[tutorial]]
-title = "Build Your First QA System"
+title = "Build Your First Question Answering System"
 description = "Get Started by creating a Retriever Reader pipeline."
 level = "beginner"
 weight = 10
-notebook = "01_Basic_QA_Pipeline.ipynb"
-aliases = ["first-qa-system"]
+notebook = "01_build_your_first_question_answering_system.ipynb"
+aliases = ["first-qa-system", "without-elasticsearch", "03_Basic_QA_Pipeline_without_Elasticsearch"]
+slug = "01_Basic_QA_Pipeline"
 
 [[tutorial]]
-title = "Fine-Tuning a Model on Your Own Data"
+title = "Fine-Tune a Reader"
 description = "Improve the performance of your Reader by performing fine-tuning."
level = "intermediate"
weight = 50
notebook = "02_Finetune_a_model_on_your_data.ipynb"
notebook = "02_finetune_a_reader.ipynb"
aliases = ["fine-tuning-a-model"]
slug = "02_Finetune_a_model_on_your_data"

[[tutorial]]
title = "Build a QA System Without Elasticsearch"
description = "Create a Retriever Reader pipeline that requires no external database dependencies."
title = "Build a Scalable Question Answering System"
description = "Create a scalable Retriever Reader pipeline that uses an ElasticsearchDocumentStore."
level = "beginner"
weight = 15
notebook = "03_Basic_QA_Pipeline_without_Elasticsearch.ipynb"
aliases = ["without-elasticsearch"]
notebook = "03_build_a_scalable_question_answering_system.ipynb"
aliases = []
slug = "03_Scalable_QA_Pipeline"

[[tutorial]]
title = "Utilizing Existing FAQs for Question Answering"
Expand Down Expand Up @@ -154,4 +156,12 @@ description = "Use a MultiModalRetriever to build a cross-modal search pipeline.
level = "intermediate"
weight = 95
notebook = "19_Text_to_Image_search_pipeline_with_MultiModal_Retriever.ipynb"
aliases = ["multimodal"]
aliases = ["multimodal"]

[[tutorial]]
title = "Distill a Reader"
description = "Transfer a Reader's question answering ability to a smaller, more efficient model."
level = "intermediate"
weight = 115
notebook = "21_distill_a_reader.ipynb"
aliases = ["distill-reader"]
**markdowns/01_Basic_QA_Pipeline.md** (285 changes: 76 additions & 209 deletions). Large diff not rendered by default.

**markdowns/02_Finetune_a_model_on_your_data.md** (155 changes: 53 additions & 102 deletions):

````diff
@@ -1,161 +1,113 @@
 ---
 layout: tutorial
-colab: https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/02_Finetune_a_model_on_your_data.ipynb
+colab: https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/02_finetune_a_reader.ipynb
 toc: True
-title: "Fine-Tuning a Model on Your Own Data"
-last_updated: 2022-11-24
+title: "Fine-Tune a Reader"
+last_updated: 2022-11-30
 level: "intermediate"
 weight: 50
 description: Improve the performance of your Reader by performing fine-tuning.
 category: "QA"
 aliases: ['/tutorials/fine-tuning-a-model']
-download: "/downloads/02_Finetune_a_model_on_your_data.ipynb"
+download: "/downloads/02_finetune_a_reader.ipynb"
 ---



-For many use cases it is sufficient to just use one of the existing public models that were trained on SQuAD or other public QA datasets (e.g. Natural Questions).
-However, if you have domain-specific questions, fine-tuning your model on custom examples will very likely boost your performance.
-While this varies by domain, we saw that ~2000 examples can easily increase performance by 5-20%.
+- **Level**: Intermediate
+- **Time to complete**: 20 minutes
+- **Nodes Used**: `FARMReader`
+- **Goal**: Learn how to improve the performance of a DistilBERT Reader model by performing further training on the SQuAD dataset.
 
-This tutorial shows you how to fine-tune a pretrained model on your own dataset.
+## Overview
 
-### Prepare environment
+Fine-tuning can improve your Reader's performance on question answering, especially if you're working with very specific domains. While many of the existing public models trained on public question answering datasets are enough for most use cases, fine-tuning can help your model understand the phrases and terms specific to your field. While this varies for each domain and dataset, we've seen cases where ~2000 examples increased performance by as much as 5-20%. After completing this tutorial, you will have all the tools needed to fine-tune a pretrained model on your own dataset.
 
-#### Colab: Enable the GPU runtime
-Make sure you enable the GPU runtime to experience decent speed in this tutorial.
-**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**
+## Preparing the Colab Environment
 
-<img src="https://raw.githubusercontent.com/deepset-ai/haystack/main/docs/img/colab_gpu_runtime.jpg">
+- [Enable GPU Runtime in Colab](https://docs.haystack.deepset.ai/docs/enable-gpu-runtime-in-colab)
+- [Check if GPU is Enabled](https://docs.haystack.deepset.ai/docs/check-if-gpu-is-enabled)
+- [Set logging level to INFO](https://docs.haystack.deepset.ai/docs/set-the-logging-level)


-```python
-# Make sure you have a GPU running
-!nvidia-smi
-```
+## Installing Haystack
 
+To start, let's install the latest release of Haystack with `pip`:
 
-```python
-# Install the latest release of Haystack in your own environment
-#! pip install farm-haystack
-
-# Install the latest main of Haystack
-!pip install --upgrade pip
-!pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab]
-```
-
-## Logging
-
-We configure how logging messages should be displayed and which log level should be used before importing Haystack.
-Example log message:
-INFO - haystack.utils.preprocessing - Converting data/tutorial1/218_Olenna_Tyrell.txt
-The default log level in basicConfig is WARNING, so the explicit parameter is not necessary but can be changed easily:
-
-```python
-import logging
-
-logging.basicConfig(format="%(levelname)s - %(name)s - %(message)s", level=logging.WARNING)
-logging.getLogger("haystack").setLevel(logging.INFO)
-```
+```bash
+%%bash
+
+pip install --upgrade pip
+pip install farm-haystack[colab]
+```
 
-```python
-from haystack.nodes import FARMReader
-from haystack.utils import fetch_archive_from_http
-```

-## Create Training Data
+## Creating Training Data
 
+To start fine-tuning your Reader model, you need question answering data in the [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) format. One sample from this data should contain a question, a text answer, and the document containing the answer.
````
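To make the expected structure concrete, here is a minimal sketch of a single SQuAD-style training sample, written as a Python dictionary. The field names follow the public SQuAD v2 schema; the question, answer, and context are invented for illustration:

```python
# A minimal SQuAD-style sample (illustrative values; field names follow the SQuAD v2 schema).
squad_sample = {
    "version": "v2.0",
    "data": [
        {
            "title": "Example document",
            "paragraphs": [
                {
                    "context": "Haystack is an open source framework for building search systems.",
                    "qas": [
                        {
                            "id": "example-0",
                            "question": "What is Haystack?",
                            "is_impossible": False,
                            "answers": [
                                {
                                    # answer_start is the character offset of the answer in the context
                                    "text": "an open source framework for building search systems",
                                    "answer_start": 12,
                                }
                            ],
                        }
                    ],
                }
            ],
        }
    ],
}
```

A training file is simply a JSON file with this shape, holding as many paragraphs and question-answer pairs as you have.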

````diff
-There are two ways to generate training data
+You can start generating your own training data using one of the two tools that we offer:
 
-1. **Annotation**: You can use the [annotation tool](https://haystack.deepset.ai/guides/annotation) to label your data, i.e. highlighting answers to your questions in a document. The tool supports structuring your workflow with organizations, projects, and users. The labels can be exported in SQuAD format that is compatible for training with Haystack.
+1. **Annotation Tool**: You can use the deepset [Annotation Tool](https://haystack.deepset.ai/guides/annotation) to write questions and highlight answers in a document. The tool supports structuring your workflow with organizations, projects, and users. You can then export the question-answer pairs in the SQuAD format that is compatible with fine-tuning in Haystack.
 
-![Snapshot of the annotation tool](https://raw.githubusercontent.com/deepset-ai/haystack/main/docs/img/annotation_tool.png)
-
-2. **Feedback**: For production systems, you can collect training data from direct user feedback via Haystack's [REST API interface](https://github.com/deepset-ai/haystack#rest-api). This includes a customizable user feedback API for providing feedback on the answers returned by the API. The API provides a feedback export endpoint for obtaining the feedback data to fine-tune your model further.
+2. **Feedback Mechanism**: In a production system, you can collect users' feedback on model predictions with Haystack's [REST API interface](https://github.com/deepset-ai/haystack#rest-api) and use it as training data. To learn how to interact with the user feedback endpoints, see [User Feedback](https://docs.haystack.deepset.ai/docs/domain_adaptation#user-feedback).
````
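As a sketch of how collected feedback could be pulled back out for fine-tuning, assuming a Haystack REST API running locally and its default `/export-feedback` route (the host, port, and query parameters here are assumptions; check them against your deployment):

```python
import json

import requests

# Export user feedback in SQuAD format from a running Haystack REST API.
# The URL and parameters are assumptions based on a default rest_api setup.
response = requests.get(
    "http://localhost:8000/export-feedback",
    params={"context_size": 100, "only_positive_labels": True},
)
response.raise_for_status()

# Save the exported labels so they can later be passed to FARMReader.train().
with open("feedback_squad.json", "w") as f:
    json.dump(response.json(), f, indent=2)
```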


````diff
-## Fine-tune your model
+## Fine-tuning the Reader
 
-Once you have collected training data, you can fine-tune your base models.
-We initialize a reader as a base model and fine-tune it on our own custom dataset (should be in SQuAD-like format).
-We recommend using a base model that was trained on SQuAD or a similar QA dataset before to benefit from transfer learning effects.
-
-**Recommendation**: Run training on a GPU.
-If you are using Colab: Enable this in the menu "Runtime" > "Change Runtime type" > Select "GPU" in the dropdown.
-Then change the `use_gpu` arguments below to `True`.
+1. Initialize the Reader, supplying the name of the base model you wish to improve.
 
 ```python
+from haystack.nodes import FARMReader
+
 reader = FARMReader(model_name_or_path="distilbert-base-uncased-distilled-squad", use_gpu=True)
-data_dir = "data/squad20"
-# data_dir = "PATH/TO_YOUR/TRAIN_DATA"
-reader.train(data_dir=data_dir, train_filename="dev-v2.0.json", use_gpu=True, n_epochs=1, save_dir="my_model")
 ```
 
+We recommend using a model that was trained on SQuAD or a similar question answering dataset to benefit from transfer learning effects. In this tutorial, we are using [distilbert-base-uncased-distilled-squad](https://huggingface.co/distilbert-base-uncased-distilled-squad), a base-sized DistilBERT model that was trained on SQuAD. To learn more about what model works best for your use case, see [Models](https://haystack.deepset.ai/pipeline_nodes/reader#models).
 
-```python
-# Saving the model happens automatically at the end of training into the `save_dir` you specified
-# However, you could also save a reader manually again via:
-reader.save(directory="my_model")
-```
+2. Provide the SQuAD format training data to the `Reader.train()` method.
 
 ```python
-# If you want to load it at a later point, just do:
-new_reader = FARMReader(model_name_or_path="my_model")
+data_dir = "data/squad20"
+reader.train(
+    data_dir=data_dir,
+    train_filename="dev-v2.0.json",
+    use_gpu=True,
+    n_epochs=1,
+    save_dir="my_model"
+)
 ```
 
-## Distill your model
-In this case, we have used "distilbert-base-uncased" as our base model. This model was trained using a process called distillation. In this process, a bigger model is trained first and is used to train a smaller model, which increases its accuracy. This is why "distilbert-base-uncased" can achieve quite competitive performance while being very small.
-
-Sometimes, however, you can't use an already distilled model and have to distil it yourself. For this case, Haystack has implemented [distillation features](https://haystack.deepset.ai/guides/model-distillation).
-
-### Augmenting your training data
-To get the most out of model distillation, we recommend increasing the size of your training data by using data augmentation. You can do this by running the [`augment_squad.py` script](https://github.com/deepset-ai/haystack/blob/main/haystack/utils/augment_squad.py):
+With the default parameters above, we are starting with a base model trained on the SQuAD training dataset and we are further fine-tuning it on the SQuAD development dataset. To fine-tune the model for your domain, replace `train_filename` with your domain-specific dataset.
 
-```python
-# Downloading script
-!wget https://raw.githubusercontent.com/deepset-ai/haystack/main/haystack/utils/augment_squad.py
+To perform evaluation over the course of fine-tuning, see the [FARMReader.train() API](https://docs.haystack.deepset.ai/reference/reader-api#farmreadertrain) for the relevant arguments.
````
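For example, a run that also tracks quality on a held-out dev set during fine-tuning might look like the sketch below. The `dev_filename` and `evaluate_every` arguments are part of `FARMReader.train()`; the directory and file names are placeholders for your own SQuAD-format data:

```python
reader.train(
    data_dir="data/my_domain",    # placeholder: directory holding your SQuAD-format files
    train_filename="train.json",  # placeholder: your domain-specific training set
    dev_filename="dev.json",      # held-out set evaluated periodically during training
    evaluate_every=100,           # evaluate on the dev set every 100 training steps
    use_gpu=True,
    n_epochs=2,
    save_dir="my_domain_model",
)
```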

````diff
-doc_dir = "data/tutorial2"
+## Saving and Loading
 
-# Downloading smaller glove vector file (only for demonstration purposes)
-glove_url = "https://nlp.stanford.edu/data/glove.6B.zip"
-fetch_archive_from_http(url=glove_url, output_dir=doc_dir)
+The model is automatically saved at the end of fine-tuning in the `save_dir` that you specified.
+However, you can also manually save the Reader again by running:
 
-# Downloading very small dataset to make tutorial faster (please use a bigger dataset for real use cases)
-s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/squad_small.json.zip"
````
> **Contributor:** At this point I am baffled and I think I need help figuring this out, @julian-risch. Right now, we look at this dataset to count the times tutorial 2 has been run. Now we're decoupling the distillation bit, so this line is out. However, at no point in the training step do we have users download the data/squad20 data with `train_filename="dev-v2.0.json"`, yet the tutorial works. Does the `train()` function pull it by name from somewhere? And once we merge these tutorial restructures, is it possible to track that? @agnieszka-m tagging you here for visibility.

> **Member:** That can't work. The only possible explanation is that you already have the `data_dir = "data/squad20"` directory with some data in it, for example if you ran another tutorial before. We need to also `fetch_archive_from_http` the data in tutorial 2, so this needs to be added here again. And I could upload a `datasets/documents/squad_small2.json.zip` or something like that, with a different dataset name than before, for telemetry.

> **Contributor:** Thanks. In that case, for now, until we decide to change or keep the way we do telemetry for tutorials, I will add a `fetch_archive_from_http` to tutorial 2 as well, as you suggested in the comment above. I will create a list of datasets for tutorials in the meantime; until and if there is a change, it will be healthier to have.
>
> I've already decoupled tutorials 1 and 3 into a separate PR, as there's no need for those two to wait while we're fixing datasets etc. for tutorials 2 and 21.

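For reference, the fix agreed on in this thread, fetching the dataset explicitly in tutorial 2, might look like the sketch below. The dataset URL is the hypothetical `squad_small2` archive proposed above, not a confirmed upload:

```python
from haystack.utils import fetch_archive_from_http

# Hypothetical dataset location from the review thread; adjust once the archive actually exists.
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/squad_small2.json.zip"
fetch_archive_from_http(url=s3_url, output_dir="data/tutorial2")
```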
````diff
-fetch_archive_from_http(url=s3_url, output_dir=doc_dir)
-
-# Just replace the path with your dataset and adjust the output (also, please remove the glove path to use the bigger glove vector file)
-!python augment_squad.py --squad_path squad_small.json --output_path augmented_dataset.json --multiplication_factor 2 --glove_path glove.6B.300d.txt
-```
+```python
+reader.save(directory="my_model")
+```
 
-In this case, we use a multiplication factor of 2 to keep this example lightweight. Usually you would use a factor like 20, depending on the size of your training data. Augmenting this small dataset with a multiplication factor of 2 should take about 5 to 10 minutes to run on one V100 GPU.
-
-### Running distillation
-Distillation in Haystack is done in two steps: First, you run intermediate layer distillation on the augmented dataset to ensure the two models behave similarly. After that, you run the prediction layer distillation on the non-augmented dataset to optimize the model for your specific task.
-
-If you want, you can leave out the intermediate layer distillation step and only run the prediction layer distillation. This way you also do not need to perform data augmentation. However, this will make the model significantly less accurate.
+To load a saved model, run:
 
 ```python
-# Loading a fine-tuned model as teacher, e.g. "deepset/bert-base-uncased-squad2"
-teacher = FARMReader(model_name_or_path="my_model", use_gpu=True)
+new_reader = FARMReader(model_name_or_path="my_model")
 ```
````
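To sanity-check the reloaded Reader, you can run it directly on an in-memory document, outside of any pipeline. This usage is a sketch; the question and document text are invented for illustration:

```python
from haystack import Document
from haystack.nodes import FARMReader

new_reader = FARMReader(model_name_or_path="my_model")

# Ask one question against a single in-memory document.
result = new_reader.predict(
    query="What is Haystack?",
    documents=[Document(content="Haystack is an open source framework for building search systems.")],
    top_k=1,
)
print(result["answers"][0].answer)
```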

````diff
-# You can use any pre-trained language model as teacher that uses the same tokenizer as the teacher model.
-# The number of layers in the teacher model also needs to be a multiple of the number of layers in the student.
-student = FARMReader(model_name_or_path="huawei-noah/TinyBERT_General_6L_768D", use_gpu=True)
-
-student.distil_intermediate_layers_from(teacher, data_dir=".", train_filename="augmented_dataset.json", use_gpu=True)
-student.distil_prediction_layer_from(teacher, data_dir="data/squad20", train_filename="dev-v2.0.json", use_gpu=True)
-
-student.save(directory="my_distilled_model")
-```
+## Next Steps
+
+Now that you have a model with improved performance, why not transfer its question answering capabilities into a smaller, faster model? Starting with this new model, you can use model distillation to create a more efficient model with only a slight tradeoff in performance. To learn more, see [Distill a Reader](https://haystack.deepset.ai/tutorials/04_distil_a_reader).
+
+To learn how to measure the performance of these Reader models, see [Evaluate a Reader model](https://haystack.deepset.ai/tutorials/05_evaluate_a_reader).
 
 ## About us
 
@@ -167,9 +119,8 @@ Our focus: Industry specific language models & large scale QA systems.
 Some of our other work:
 - [German BERT](https://deepset.ai/german-bert)
 - [GermanQuAD and GermanDPR](https://deepset.ai/germanquad)
-- [FARM](https://github.com/deepset-ai/FARM)
 
 Get in touch:
-[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Discord](https://haystack.deepset.ai/community/join) | [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) | [Website](https://deepset.ai)
+[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Discord](https://haystack.deepset.ai/community) | [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) | [Website](https://deepset.ai)
 
 By the way: [we're hiring!](https://www.deepset.ai/jobs)
````