Tutorial restructure draft #44

Closed
wants to merge 51 commits into from

Changes from 9 commits

Commits (51):
de0f921
Draft tutorial 1 restructure
brandenchan Oct 11, 2022
e6a5693
Add title
brandenchan Oct 11, 2022
ac87643
Add link
brandenchan Oct 11, 2022
a2c3f36
Fix titles
brandenchan Oct 11, 2022
9d128a3
Add final message
brandenchan Oct 11, 2022
acad49e
Fix some links
brandenchan Oct 13, 2022
04470d7
Incorporate reviewer feedback
brandenchan Oct 13, 2022
d1a0524
Incorporate reviewer feedback
brandenchan Oct 14, 2022
b558a66
Regenerate markdown
brandenchan Oct 14, 2022
b99af07
Create second tutorial
brandenchan Oct 25, 2022
464ba3c
Clear output
brandenchan Nov 1, 2022
1ea723e
Create finetuning tutorial
brandenchan Nov 2, 2022
3dd310d
Create distillation tutorial
brandenchan Nov 2, 2022
8e4cc8b
Rename tutorial
brandenchan Nov 2, 2022
080b588
Test and refine distillation tutorial
brandenchan Nov 3, 2022
8b8d792
Fix first tutorial
brandenchan Nov 3, 2022
fbf252e
Run and test tutorial 2
brandenchan Nov 3, 2022
2411b4e
Run and test tutorial 3
brandenchan Nov 3, 2022
a8bfce2
Run and test tutorial 3
brandenchan Nov 3, 2022
b939038
Run and test tutorial 4
brandenchan Nov 3, 2022
f3d24c6
Rename tutorial 4
brandenchan Nov 4, 2022
f0f57a4
Oxford comma
brandenchan Nov 4, 2022
0daaaa4
Incorporate reviewer feedback for tutorial 2
brandenchan Nov 8, 2022
babafeb
Merge branch 'tutorial_restructure_draft' of https://github.com/deeps…
brandenchan Nov 8, 2022
4a8334b
Incorporate reviewer feedback for tutorial 3
brandenchan Nov 8, 2022
e3db636
Incorporate reviewer feedback for tutorial 4
brandenchan Nov 8, 2022
405ed89
Move new tutorials into folder
brandenchan Nov 14, 2022
dff7aae
Merge main
brandenchan Nov 14, 2022
fba5806
Update index
brandenchan Nov 14, 2022
0ed9fd7
Remove prereqs
brandenchan Nov 14, 2022
4c711fa
Regenerate markdowns
brandenchan Nov 14, 2022
0eb9de3
Update index.toml
brandenchan Nov 16, 2022
a7318c5
Update index.toml
brandenchan Nov 16, 2022
7e7b4f3
Update tutorials/03_finetune_a_reader.ipynb
brandenchan Nov 16, 2022
7c437c4
Update tutorials/03_finetune_a_reader.ipynb
brandenchan Nov 16, 2022
f04a193
Incorporate Reviewer feedback
brandenchan Nov 16, 2022
627ec52
Regenerate markdown
brandenchan Nov 16, 2022
a52d1bc
Edit colab env setup sections
brandenchan Nov 16, 2022
2dc469a
Regenerate MD files
brandenchan Nov 16, 2022
11b1d33
Incorporate reviewer feedback
brandenchan Nov 22, 2022
f4c77f6
Set use_bm25 argument
brandenchan Nov 23, 2022
65fa864
Merge main
brandenchan Nov 29, 2022
1843731
Update naming
brandenchan Nov 29, 2022
b785f56
Delete old md and regenerate new md
brandenchan Nov 29, 2022
10d3329
Update index.toml and readme
brandenchan Nov 29, 2022
1296fc8
minor changes for new tutorial structure
TuanaCelik Nov 30, 2022
f0f90c8
add bm25 and lg updates
agnieszka-m Dec 5, 2022
5776b81
update with bm25
agnieszka-m Dec 6, 2022
fc737de
Update links and retriever
agnieszka-m Dec 6, 2022
0c7bee8
Update links
agnieszka-m Dec 6, 2022
d31cc90
update the gpu links
agnieszka-m Dec 7, 2022
259 changes: 78 additions & 181 deletions markdowns/1.md
@@ -7,39 +7,28 @@ date: "2020-09-03"
id: "tutorial1md"
--->

# Build Your First QA System
# Build Your First Question Answering System

<img style="float: right;" src="https://upload.wikimedia.org/wikipedia/en/d/d8/Game_of_Thrones_title_card.jpg">
- **Level**: Beginner
- **Time to complete**: 20 minutes
- **Prerequisites**: Prepare the Colab environment. See links below.
- **Nodes Used**: `ElasticsearchDocumentStore`, `BM25Retriever`
- **Goal**: After completing this tutorial, you will have built a question answering pipeline that can answer questions about the Game of Thrones series.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/main/tutorials/01_Basic_QA_Pipeline.ipynb)
This tutorial teaches you how to set up a question answering system that can search through complex knowledge bases, such as an internal wiki or a collection of financial reports. We will work with a set of Wikipedia pages about Game of Thrones. Let's build a question answering system and discover more about the marvellous Seven Kingdoms!

Question Answering can be used in a variety of use cases. A very common one is using it to navigate through complex knowledge bases or long documents (a "search setting").

A "knowledge base" could for example be your website, an internal wiki or a collection of financial reports.
In this tutorial we will work on a slightly different domain: "Game of Thrones".

Let's see how we can use a bunch of Wikipedia articles to answer a variety of questions about the
marvellous seven kingdoms.
## Preparing the Colab Environment

- [Enable GPU Runtime in Colab](https://docs.haystack.deepset.ai/v5.2-unstable/docs/enable-gpu-runtime-in-colab)
Contributor:

Minor comment on this section: I really like how this is all clean now and visually it's not crowded anymore. But as they're in the tutorial section on HSH, would it be nicer to have them as dropdowns rather than shoot them back to docs for each? I think it can be done with the `<details>` tag in .md cells but would have to try. But I'd still keep them in docs too, they're useful guides!

Happy to be told no on this though. I'm just thinking in terms of flow while you're going through.

Contributor:

lol, I actually added the dropdown in my comment there by doing that 😆

Contributor (Author):

This would be nice. I see that the `<details>` tag works in Colab, but it doesn't work in Jupyter notebooks, which makes me a bit hesitant to go ahead with this change.
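For reference, a collapsible dropdown in a markdown cell might look like the sketch below (the summary text is just an example; as noted, rendering support varies between Colab and Jupyter):

```html
<details>
<summary>Enable GPU Runtime in Colab</summary>

Runtime -> Change Runtime type -> Hardware accelerator -> GPU

</details>
```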

- [Check if GPU is Enabled](https://docs.haystack.deepset.ai/v5.2-unstable/docs/check-if-gpu-is-enabled)
- [Set logging level to INFO](https://docs.haystack.deepset.ai/v5.2-unstable/docs/set-the-logging-level)

### Prepare environment

#### Colab: Enable the GPU runtime
Make sure you enable the GPU runtime to experience decent speed in this tutorial.
**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**
## Installing Haystack

<img src="https://raw.githubusercontent.com/deepset-ai/haystack/main/docs/img/colab_gpu_runtime.jpg">

You can double-check whether the GPU runtime is enabled with the following command:


```bash
%%bash

nvidia-smi
```

To start, install the latest release of Haystack with `pip`:
To start, let's install the latest release of Haystack with `pip`:


```bash
@@ -49,45 +38,11 @@ pip install --upgrade pip
pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab]
```

## Logging

Before importing Haystack, we configure how logging messages are displayed and which log level is used. An example log message:

`INFO - haystack.utils.preprocessing - Converting data/tutorial1/218_Olenna_Tyrell.txt`

The default log level in `basicConfig` is `WARNING`, so the explicit parameter is not necessary here, but it can be changed easily:


```python
import logging

logging.basicConfig(format="%(levelname)s - %(name)s - %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)
```

## Document Store

Haystack finds answers to queries within the documents stored in a `DocumentStore`. The current implementations of `DocumentStore` include `ElasticsearchDocumentStore`, `FAISSDocumentStore`, `SQLDocumentStore`, and `InMemoryDocumentStore`.

**Here:** We recommend Elasticsearch as it comes preloaded with features like [full-text queries](https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html), [BM25 retrieval](https://www.elastic.co/elasticon/conf/2016/sf/improved-text-scoring-with-bm25), and [vector storage for text embeddings](https://www.elastic.co/guide/en/elasticsearch/reference/7.6/dense-vector.html).
## Initializing the DocumentStore

**Alternatives:** If you are unable to set up an Elasticsearch instance, then follow [Tutorial 3](https://github.com/deepset-ai/haystack-tutorials/blob/main/tutorials/03_Basic_QA_Pipeline_without_Elasticsearch.ipynb) for using SQL/InMemory document stores.
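A minimal sketch of the in-memory alternative, assuming a Haystack version where `InMemoryDocumentStore` supports the `use_bm25` flag:

```python
from haystack.document_stores import InMemoryDocumentStore

# A lightweight store that needs no external service;
# use_bm25=True enables BM25 retrieval directly in memory.
document_store = InMemoryDocumentStore(use_bm25=True)
```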
A DocumentStore stores the documents that the question answering system uses to find answers to your questions. To learn more, see [DocumentStore](https://docs.haystack.deepset.ai/docs/document_store).
Contributor:

I'd also add something like: To use it for search, you first need to initialize it. In this tutorial, you'll use ElasticsearchDocumentStore.


**Hint**: This tutorial creates a new document store instance with Wikipedia articles on Game of Thrones. However, you can configure Haystack to work with your existing document stores.

### Start an Elasticsearch server locally
You can start Elasticsearch on your local machine using Docker. If Docker is not readily available in your environment (e.g. in Colab notebooks), then you can manually download and run Elasticsearch from source.


```python
# Recommended: Start Elasticsearch using Docker via the Haystack utility function
from haystack.utils import launch_es

launch_es()
```

### Start an Elasticsearch server in Colab

If Docker is not readily available in your environment (e.g. in Colab notebooks), then you can manually download and execute Elasticsearch from source.
1. Download, extract, and set the permissions for the Elasticsearch archive:


```bash
@@ -98,24 +53,24 @@ tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
chown -R daemon:daemon elasticsearch-7.9.2
```

2. Start the Elasticsearch Server:


```bash
%%bash --bg

sudo -u daemon -- elasticsearch-7.9.2/bin/elasticsearch
```

### Create the Document Store

The `ElasticsearchDocumentStore` class tries to open a connection in its constructor, so we wait 30 seconds here to make sure Elasticsearch is ready before continuing:


```python
import time
time.sleep(30)
```

Finally, we create the Document Store instance:
If you are working in an environment where Docker is available, you can also start Elasticsearch using Docker. You can do this [manually](https://docs.haystack.deepset.ai/docs/document_store#initialisation), or using our [`launch_es()`](https://docs.haystack.deepset.ai/reference/utils-api) utility function.

3. Initialize the `ElasticsearchDocumentStore` object in Haystack. Note that this only runs successfully if the Elasticsearch server has fully started up and is ready.
Contributor:

How about:
3. Wait until the Elasticsearch Server starts.
4. Initialize the ElasticsearchDocumentStore.
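A minimal sketch of such a wait, polling Elasticsearch until it responds instead of sleeping for a fixed interval (this assumes the server listens on the default port 9200 and that the `requests` package is available):

```python
import time

import requests

# Poll Elasticsearch for up to 30 seconds and stop as soon as it responds.
for _ in range(30):
    try:
        if requests.get("http://localhost:9200/").status_code == 200:
            break
    except requests.exceptions.ConnectionError:
        pass
    time.sleep(1)
```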



```python
@@ -124,127 +79,78 @@ from haystack.document_stores import ElasticsearchDocumentStore

# Get the host where Elasticsearch is running, default to localhost
host = os.environ.get("ELASTICSEARCH_HOST", "localhost")
document_store = ElasticsearchDocumentStore(host=host, username="", password="", index="document")
```

## Preprocessing of documents
document_store = ElasticsearchDocumentStore(
host=host,
username="",
password="",
index="document"
)
```

Haystack provides a customizable pipeline for:
- converting files into texts
- cleaning texts
- splitting texts
- writing them to a Document Store
## Preparing Documents

In this tutorial, we download Wikipedia articles about Game of Thrones, apply a basic cleaning function, and index them in Elasticsearch.
1. Download 517 Wikipedia articles about Game of Thrones. You can find them in `data/tutorial1` as a set of `.txt` files.


```python
from haystack.utils import clean_wiki_text, convert_files_to_docs, fetch_archive_from_http
from haystack.utils import fetch_archive_from_http


# Let's first fetch some documents that we want to query
# Here: 517 Wikipedia articles for Game of Thrones
doc_dir = "data/tutorial1"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt1.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

# Convert files to dicts
# You can optionally supply a cleaning function that is applied to each doc (e.g. to remove footers)
# It must take a str as input, and return a str.
docs = convert_files_to_docs(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)

# We now have a list of dictionaries that we can write to our document store.
# If your texts come from a different source (e.g. a DB), you can of course skip convert_files_to_docs() and create the dictionaries yourself.
# The default format here is:
# {
# 'content': "<DOCUMENT_TEXT_HERE>",
# 'meta': {'name': "<DOCUMENT_NAME_HERE>", ...}
# }
# (Optionally: you can also add more key-value-pairs here, that will be indexed as fields in Elasticsearch and
# can be accessed later for filtering or shown in the responses of the Pipeline)

# Let's have a look at the first 3 entries:
print(docs[:3])

# Now, let's write the dicts containing documents to our DB.
document_store.write_documents(docs)
fetch_archive_from_http(
url="https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt1.zip",
output_dir=doc_dir
)
```

## Initialize Retriever, Reader & Pipeline

### Retriever

Retrievers help narrow down the scope for the Reader to smaller units of text where a given question could be answered.
They use simple but fast algorithms.

**Here:** We use Elasticsearch's default BM25 algorithm

**Alternatives:**

- Customize the `BM25Retriever` with custom queries (e.g. boosting) and filters
- Use `TfidfRetriever` in combination with a SQL or InMemory Document store for simple prototyping and debugging
- Use `EmbeddingRetriever` to find candidate documents based on the similarity of embeddings (e.g. created via Sentence-BERT; see the sketch after this list)
- Use `DensePassageRetriever` to use different embedding models for passage and query (see Tutorial 6)
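As a sketch of the dense alternative mentioned above (the embedding model named here is just one common choice):

```python
from haystack.nodes import EmbeddingRetriever

# Finds candidate documents by embedding similarity rather than keyword overlap.
embedding_retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",
)

# Dense retrieval needs precomputed document embeddings:
document_store.update_embeddings(embedding_retriever)
```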
2. Convert the files you just downloaded into Haystack [Document objects](https://docs.haystack.deepset.ai/docs/documents_answers_labels#document) to write them into the DocumentStore. Apply the `clean_wiki_text` cleaning function to the text.


```python
from haystack.nodes import BM25Retriever

retriever = BM25Retriever(document_store=document_store)
from haystack.utils import clean_wiki_text, convert_files_to_docs
docs = convert_files_to_docs(
dir_path=doc_dir,
clean_func=clean_wiki_text,
split_paragraphs=True
)
```

3. Write these Documents into the DocumentStore.

```python
# Alternative: An in-memory TfidfRetriever based on Pandas dataframes for building quick prototypes with the SQLite document store.

# from haystack.nodes import TfidfRetriever
# retriever = TfidfRetriever(document_store=document_store)
```

```python
# Now, let's write the dicts containing documents to our DB.
document_store.write_documents(docs)
```

### Reader
While the default code in this tutorial uses Game of Thrones data, you can also supply your own. So long as your data adheres to the [input format](https://docs.haystack.deepset.ai/docs/document_store#input-format) or is cast into a [Document object](https://docs.haystack.deepset.ai/docs/documents_answers_labels#document), it can be written into the DocumentStore.
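For instance, a minimal sketch of writing your own data (the strings and file names here are hypothetical):

```python
from haystack import Document

# Cast raw strings into Haystack Document objects, then write them to the store.
my_docs = [
    Document(content="Daenerys Targaryen is the Mother of Dragons.", meta={"name": "my_note.txt"}),
    Document(content="Winterfell is the seat of House Stark.", meta={"name": "my_other_note.txt"}),
]
document_store.write_documents(my_docs)
```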

A Reader scans the texts returned by Retrievers in detail and extracts the k best answers. Readers are based on powerful but slower deep learning models.
## Initializing the Retriever

Haystack currently supports Readers based on the FARM and Transformers frameworks.
With both, you can either load a local model or one from Hugging Face's model hub (https://huggingface.co/models).
Initialize the `BM25Retriever`. For more Retriever options, see [Retriever](https://docs.haystack.deepset.ai/docs/retriever)
Contributor:

Suggested change:
- Initialize the `BM25Retriever`. For more Retriever options, see [Retriever](https://docs.haystack.deepset.ai/docs/retriever)
+ Initialize the `BM25Retriever`. To learn more about Retrievers, see [Retriever](https://docs.haystack.deepset.ai/docs/retriever)

Contributor:

Actually, maybe it makes sense to just briefly say here what the retriever does. Like one sentence: Retrievers sift through the Documents in the DocumentStore and retrieve the ones that best match the query.


**Here:** a medium-sized RoBERTa QA model using a Reader based on FARM (https://huggingface.co/deepset/roberta-base-squad2)

**Alternatives (Reader):** TransformersReader (leveraging the `pipeline` of the Transformers package)
```python
from haystack.nodes import BM25Retriever

**Alternatives (Models):** e.g. "distilbert-base-uncased-distilled-squad" (fast) or "deepset/bert-large-uncased-whole-word-masking-squad2" (good accuracy)
retriever = BM25Retriever(document_store=document_store)
```

**Hint:** You can adjust the model to return "no answer possible" with the `no_ans_boost` parameter. Higher values mean the model prefers "no answer possible".
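For example, a sketch with an illustrative boost value:

```python
from haystack.nodes import FARMReader

# A positive no_ans_boost makes the model more willing to return "no answer possible".
reader = FARMReader(
    model_name_or_path="deepset/roberta-base-squad2",
    return_no_answer=True,
    no_ans_boost=0.5,  # illustrative value, tune for your data
)
```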
## Initializing the Reader

#### FARMReader
Initialize the `FARMReader` with the `deepset/roberta-base-squad2` model. For more Reader options, see [Reader](https://docs.haystack.deepset.ai/docs/reader).
Contributor:

Same here, I think one sentence of explanation makes sense: A Reader takes the Documents it got from the Retriever and uses them to find the best answer. (or something like that)



```python
from haystack.nodes import FARMReader

# Load a local model or any of the QA models on
# Hugging Face's model hub (https://huggingface.co/models)

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)
```

#### TransformersReader

Alternative:


```python
from haystack.nodes import TransformersReader
# reader = TransformersReader(model_name_or_path="distilbert-base-uncased-distilled-squad", tokenizer="distilbert-base-uncased", use_gpu=-1)
```

### Pipeline
## Creating the Retriever-Reader Pipeline

With a Haystack `Pipeline`, you can stick your building blocks together into a search pipeline.
Under the hood, `Pipelines` are Directed Acyclic Graphs (DAGs) that you can easily customize for your own use cases.
To speed things up, Haystack also comes with a few predefined Pipelines. One of them is the `ExtractiveQAPipeline`, which combines a Retriever and a Reader to answer our questions.
You can learn more about `Pipelines` in the [docs](https://haystack.deepset.ai/docs/latest/pipelinesmd).
The `ExtractiveQAPipeline` connects the Reader and Retriever. This makes the system fast because the Reader only processes the Documents that the Retriever has passed on.
Contributor:

shouldn't the code below also have a line for defining the pipe? pipe = Pipeline() ?



```python
@@ -253,59 +159,50 @@ from haystack.pipelines import ExtractiveQAPipeline
pipe = ExtractiveQAPipeline(reader, retriever)
```
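On the question above: `ExtractiveQAPipeline` is a predefined pipeline, so no separate `Pipeline()` call is needed. A sketch of a roughly equivalent hand-built pipeline, using the Retriever and Reader defined earlier:

```python
from haystack.pipelines import Pipeline

# Build the same Retriever -> Reader graph explicitly.
pipe = Pipeline()
pipe.add_node(component=retriever, name="Retriever", inputs=["Query"])
pipe.add_node(component=reader, name="Reader", inputs=["Retriever"])
```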

## Voilà! Ask a question!
## Asking a Question

1. Use the pipeline `run()` method to ask a question. The query argument is where you type your question. Additionally, you can set the number of documents you want the Reader and Retriever to return using the `top_k` parameter. To learn more about setting arguments, see [Arguments](https://docs.haystack.deepset.ai/docs/pipelines#arguments). To understand the importance of the `top_k` parameter, see [Choosing the Right top-k Values](https://docs.haystack.deepset.ai/docs/optimization#choosing-the-right-top-k-values).



```python
# You can configure how many candidates the Reader and Retriever shall return
# The higher the Retriever's top_k, the better (but also slower) your answers.
prediction = pipe.run(
query="Who is the father of Arya Stark?", params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
query="Who is the father of Arya Stark?",
params={
"Retriever": {"top_k": 10},
"Reader": {"top_k": 5}
}
)
```

Here are some questions you could try out:
- Who is the father of Arya Stark?
- Who created the Dothraki vocabulary?
- Who is the sister of Sansa?

```python
# prediction = pipe.run(query="Who created the Dothraki vocabulary?", params={"Reader": {"top_k": 5}})
# prediction = pipe.run(query="Who is the sister of Sansa?", params={"Reader": {"top_k": 5}})
```

Now you can either print the object directly:
2. The answers returned by the pipeline can be printed out directly:
Contributor:

Suggested change:
- 2. The answers returned by the pipeline can be printed out directly:
+ 2. You can also directly print out the answers returned by the pipeline:



```python
from pprint import pprint

pprint(prediction)

# Sample output:
# {
# 'answers': [ <Answer: answer='Eddard', type='extractive', score=0.9919578731060028, offsets_in_document=[{'start': 608, 'end': 615}], offsets_in_context=[{'start': 72, 'end': 79}], document_id='cc75f739897ecbf8c14657b13dda890e', meta={'name': '454_Music_of_Game_of_Thrones.txt'}}, context='...' >,
# <Answer: answer='Ned', type='extractive', score=0.9767240881919861, offsets_in_document=[{'start': 3687, 'end': 3801}], offsets_in_context=[{'start': 18, 'end': 132}], document_id='9acf17ec9083c4022f69eb4a37187080', meta={'name': '454_Music_of_Game_of_Thrones.txt'}}, context='...' >,
# ...
# ]
# 'documents': [ <Document: content_type='text', score=0.8034909798951382, meta={'name': '332_Sansa_Stark.txt'}, embedding=None, id=d1f36ec7170e4c46cde65787fe125dfe', content='\n===\'\'A Game of Thrones\'\'===\nSansa Stark begins the novel by being betrothed to Crown ...'>,
# <Document: content_type='text', score=0.8002150354529785, meta={'name': '191_Gendry.txt'}, embedding=None, id='dd4e070a22896afa81748d6510006d2', 'content='\n===Season 2===\nGendry travels North with Yoren and other Night's Watch recruits, including Arya ...'>,
# ...
# ],
# 'no_ans_gap': 11.688868522644043,
# 'node_id': 'Reader',
# 'params': {'Reader': {'top_k': 5}, 'Retriever': {'top_k': 5}},
# 'query': 'Who is the father of Arya Stark?',
# 'root_node': 'Query'
# }
```

Or use a util to simplify the output:
3. Simplify the printed answers:


```python
from haystack.utils import print_answers

# Change `minimum` to `medium` or `all` to raise the level of detail
print_answers(prediction, details="minimum")
print_answers(
prediction,
    details="minimum"  # Choose from `minimum`, `medium`, and `all`
)
```

And there you have it! Congratulations on building your first machine learning-based question answering system!

## About us

This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany.