New Tutorial: REST API #61
Conversation
Here are my 2 cents. I tried to add a reason for most of my comments, but if anything is unclear, just ping me and I'll be happy to chat.
One thing that's not clear to me is why we needed to first set up the demo files and pipeline if we created a totally new pipeline after that?
Also, I think I'd maybe add more info about the pipeline that they were creating. For example, what params it's using and why (just one sentence like: this pipeline uses the default params that work well for this amount of files, blah blah)?
index.toml
Outdated
[[tutorial]]
title = "Haystack with REST API"
I think here we're aiming at something like: "Tutorial: Using Haystack's REST API."
So always "Tutorial" + the name of the task (starting with either a gerund or an imperative -> please consult with Branden, I'm ok with either).
tutorials/20_REST_API.ipynb
Outdated
"cell_type": "markdown",
"metadata": {},
"source": [
"# Haystack with REST API\n",
See my comment about tutorial titles above
tutorials/20_REST_API.ipynb
Outdated
"\n",
"- **Level**: Intermediate\n",
"- **Time to complete**: 30 minutes\n",
"- **Prerequisites**: N/A\n",
If this is an intermediate tutorial, we surely need them to have some prerequisite knowledge - knowledge of (basic) NLP concepts (maybe there are some specific concepts they need to know for this tutorial)? Python knowledge (we're still debating whether this should be added as a prereq), basic knowledge of Haystack? You can also add a link to any beginner tutorials that they can complete before this one to make them understand the necessary concepts.
Honestly I would suggest making this tutorial condensed to 'using the REST API'. Lmk what you think here as I think this might be a bit controversial:
The usual problems we see are simply technical, in terms of just setting up, using, and interacting with Docker and the REST API. So I would argue understanding the pipeline structure and each component in detail isn't the aim here. Imo the things people should come away with are:
- These .yml files are where I define my config
- They are able to edit the file in a way, maybe with some examples (change the model, reader setting etc.) and use this as the pipeline that they query with the REST API
- For details on each NLP-related component and param I would forward them to the relevant docs (e.g. the full set of options for a preprocessor, or reader, or retriever etc.)
Adding this comment as a reply here because of the mention of 'NLP concepts' which I think could be added as a prerequisite. But mainly I would say the prerequisites are: basic understanding of the YAML syntax (link to resource), basic understanding of Docker.
tutorials/20_REST_API.ipynb
Outdated
"- **Nodes Used**: `ElasticsearchDocumentStore`, `EmbeddingRetriever`\n",
"- **Goal**: After completing this tutorial, you will have learned how you can interact with Haystack through REST API.\n",
"\n",
"This tutorial teaches you how to create your production-ready document search `pipeline.yml` and interact with Haystack through REST API. \n",
This and the next paragraph should go to a Description section (have a look at the recent tutorials that Branden updated).
tutorials/20_REST_API.ipynb
Outdated
"\n",
"This tutorial teaches you how to create your production-ready document search `pipeline.yml` and interact with Haystack through REST API. \n",
"\n",
"First, we are going to set up the environment to run the same question answering pipeline in [Explore the World Demo](https://haystack-demo.deepset.ai/), then create a new pipeline for the new document search system."
"First, we are going to set up the environment to run the same question answering pipeline in [Explore the World Demo](https://haystack-demo.deepset.ai/), then create a new pipeline for the new document search system."
"First, we are going to set up the environment to run the same question answering pipeline as in the [Explore the World Demo](https://haystack-demo.deepset.ai/), then create a new pipeline for the new document search system."
Now that I'm reading it the second time, I'm not sure I understand this. Do you mean they're going to use the same data as in the demo? or the same pipeline as in the demo? if the same pipeline, why?
tutorials/20_REST_API.ipynb
Outdated
"\n",
"You can use REST API to index your files to your document store. This requires an indexing pipeline. Add the indexing pipeline to `document-search.haystack-pipeline.yml`, then, you can use `/file-upload` endpoint to upload your files to Elasticsearch. \n",
"\n",
"Download the same demo files [here](https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/article_txt_countries_and_capitals.zip) and upload them using cURL. Check [this](https://docs.haystack.deepset.ai/docs/rest_api#indexing-documents-in-the-haystack-rest-api-document-store) documentation page for more detail about file indexing.\n",
"Download the same demo files [here](https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/article_txt_countries_and_capitals.zip) and upload them using cURL. Check [this](https://docs.haystack.deepset.ai/docs/rest_api#indexing-documents-in-the-haystack-rest-api-document-store) documentation page for more detail about file indexing.\n",
"Download the same [demo files](https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/article_txt_countries_and_capitals.zip) you used in the first part of this tutorial and upload them using cURL.
To learn more about indexing, see [Indexing Documents](https://docs.haystack.deepset.ai/docs/rest_api#indexing-documents-in-the-haystack-rest-api-document-store).\n",
tutorials/20_REST_API.ipynb
Outdated
"Download the same demo files [here](https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/article_txt_countries_and_capitals.zip) and upload them using cURL. Check [this](https://docs.haystack.deepset.ai/docs/rest_api#indexing-documents-in-the-haystack-rest-api-document-store) documentation page for more detail about file indexing.\n",
"\n",
"<aside>\n",
"⚠️ If you want to index your files directly to Elasticsearch through script, be sure to provide the same indexing pipeline with your `document-search.haystack-pipeline.yml` file for consistency between indexed files.\n",
"⚠️ If you want to index your files directly to Elasticsearch through script, be sure to provide the same indexing pipeline with your `document-search.haystack-pipeline.yml` file for consistency between indexed files.\n",
"⚠️ To index your files directly to Elasticsearch through a script, use the same indexing pipeline with your `document-search.haystack-pipeline.yml` file to make sure the files indexed are consistent.\n",
Our voice: conversational
tutorials/20_REST_API.ipynb
Outdated
"cell_type": "markdown",
"metadata": {},
"source": [
"The indexing pipeline should be as follows:\n",
"The indexing pipeline should be as follows:\n",
"Here's what the indexing pipeline should look like:\n",
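(For readers of this thread: the notebook's actual YAML is elided above. An indexing pipeline of this general kind converts files, preprocesses them, and writes them to the document store. A hedged sketch with illustrative node names, not the tutorial's exact file:)

```yaml
components:
  - name: TextConverter
    type: TextConverter
  - name: Preprocessor
    type: PreProcessor
  - name: Retriever
    type: EmbeddingRetriever
    params:
      document_store: DocumentStore
  - name: DocumentStore
    type: ElasticsearchDocumentStore

pipelines:
  - name: indexing
    nodes:
      - name: TextConverter
        inputs: [File]
      - name: Preprocessor
        inputs: [TextConverter]
      - name: Retriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [Retriever]
```

Each node's `inputs` names the node it consumes from, so the file flows converter → preprocessor → retriever (for embeddings) → store.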
tutorials/20_REST_API.ipynb
Outdated
"cell_type": "markdown",
"metadata": {},
"source": [
"When you restart the Haystack API, REST API is going to load the new pipeline and the pipeline will be ready to use. "
"When you restart the Haystack API, REST API is going to load the new pipeline and the pipeline will be ready to use. "
"When you restart the Haystack API, REST API will load the new pipeline and the pipeline will be ready to use. "
tutorials/20_REST_API.ipynb
Outdated
},
"nbformat": 4,
"nbformat_minor": 2
}
I'd add a paragraph summing up what they have learned/built/achieved in this tutorial. Maybe some links they may want to check, additional info, etc.
Thanks for making a start on this! I think that you have covered many of the start-up steps for the REST API in good detail.
I've made a set of comments that mostly request structural / content changes that I think will simplify the tutorial or provide users with a little bit of conceptual overview. This tutorial covers a really tough topic because we are asking users to interact with a lot of components (e.g. Docker, cURL, YAML pipelines, REST API, Swagger) and I would push to make the experience a bit more straightforward for the user.
More than happy to clarify anything that's unclear, or to have a chat to align!
tutorials/20_REST_API.ipynb
Outdated
"cell_type": "markdown",
"metadata": {},
"source": [
"### With Docker\n",
I see that there is both the option for with and without Docker. In the first tutorials that I've reworked, I've opted not to give options in the tutorial. Let's teach them to do one thing. I think this will make it much more linear and easy for a user to follow.
This also raises the question of the relationship between this tutorial and the guide on the REST API. The way I see it is that the docs page should have small individual sections that explain each task / step that could arise during the creation of the REST API, but they are separate from any specific environment / dataset / pipeline. The tutorial should be a single set of steps that works with a specific environment / dataset / pipeline to create one possible outcome.
So what I'd push for here is to take out the "REST API with Docker" section from this tutorial and turn this specifically into a REST API without Docker tutorial. You can always have a comment saying "If you'd like to initialize with Docker instead see [Initialization with Docker]"
tutorials/20_REST_API.ipynb
Outdated
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setting Up the Environment\n",
Just an overview on how I've been using heading levels so far:
- Header level 1 is reserved for the title of the notebook (e.g. `# Build Your First Question Answering System`)
- Header level 2 is for sections of the tutorial (e.g. `## Preparing the Colab Environment`)
- Header level 3 is not used because it is not easily distinguishable from header level 2 visually, and because it isn't shown in the right ToC
- Within header level 2, you can have numbered lists, e.g. "1. Update/install Docker and Docker Compose, then launch Docker", then "2. Clone Haystack repository"
tutorials/20_REST_API.ipynb
Outdated
"- **Nodes Used**: `ElasticsearchDocumentStore`, `EmbeddingRetriever`\n",
"- **Goal**: After completing this tutorial, you will have learned how you can interact with Haystack through REST API.\n",
"\n",
"This tutorial teaches you how to create your production-ready document search `pipeline.yml` and interact with Haystack through REST API. \n",
I think it might not be clear yet to a user what a `pipeline.yml` is. How about we just call it a search pipeline?
tutorials/20_REST_API.ipynb
Outdated
"source": [
"* **Launch Elasticsearch**\n",
"\n",
"Launching Elasticsearch takes some time, so, be sure to have a `healthy` Elasticsearch container before continue. You can check the health through `docker ps` command. \n",
Those who haven't worked with ES before might not know what a healthy response from ES looks like. If it isn't too cluttered, I wonder if a screenshot might help here.
tutorials/20_REST_API.ipynb
Outdated
"\n",
"Test whether everything is okay by going to Swagger documentation for the Haystack REST API on [`http://127.0.0.1:8000/docs`](http://127.0.0.1:8000/docs) and trying out `/initialized` endpoint or sending a cURL request as `curl -X GET http://127.0.0.1:8000/initialized`. \n",
"\n",
"If everything is alright, you can start asking questions! Wikipedia pages about countries and capital are already indexed to Elasticsearch by the docker image we provided in [`docker-compose.yml`](https://github.com/deepset-ai/haystack/blob/main/docker-compose.yml#L22). To ask questions, you can use `/query` endpoint again via Haystack REST API UI or a cURL request. "
What are you referring to exactly when you say REST API UI? Do you mean sending a request via Swagger or using the actual Haystack UI? Either way I think users need more hand holding and explanation before we can tell them to do this
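One low-cost fix either way: show the request body itself, since that is what both Swagger and cURL ultimately send. A hedged sketch of a `/query` body (the `params`/`Retriever` structure assumes the component name used in this tutorial's pipeline YAML, and the `top_k` override is an illustrative value):

```python
import json

# Illustrative body for a POST to the REST API's /query endpoint.
# "Retriever" must match the retriever's name in the pipeline YAML
# (an assumption here), and top_k is just an example override.
payload = {
    "query": "What is the capital of France?",
    "params": {"Retriever": {"top_k": 5}},
}

# Serialized form you would pass as the request body, e.g. to curl's -d flag.
body = json.dumps(payload)
print(body)
```
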
tutorials/20_REST_API.ipynb
Outdated
"source": [
"# Haystack with REST API\n",
"\n",
"- **Level**: Intermediate\n",
To be honest, I think this might actually be an advanced topic. The user has to work with Docker, requests to REST API, pipeline ymls. They also have to make changes to lots of files which they probably don't fully understand the significance of.
tutorials/20_REST_API.ipynb
Outdated
"cell_type": "markdown",
"metadata": {},
"source": [
"### Indexing pipeline\n",
If we are going to suggest performing indexing via the /file-upload endpoint, I think we need to give a bit more guidance as to what this actually entails. What does the curl command look like?
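For illustration, a sketch of what such a command could look like, assuming the REST API is running locally on port 8000 and `example.txt` is a file in the current directory (both are assumptions; the exact form fields should be verified against the REST API docs, and this cannot run without the server):

```shell
# Assumes a running Haystack REST API at 127.0.0.1:8000 and a local
# file named example.txt; both are illustrative assumptions.
curl -X POST "http://127.0.0.1:8000/file-upload" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "files=@example.txt" \
  -F "meta=null"
```
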
tutorials/20_REST_API.ipynb
Outdated
"Download the same demo files [here](https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/article_txt_countries_and_capitals.zip) and upload them using cURL. Check [this](https://docs.haystack.deepset.ai/docs/rest_api#indexing-documents-in-the-haystack-rest-api-document-store) documentation page for more detail about file indexing.\n",
"\n",
"<aside>\n",
"⚠️ If you want to index your files directly to Elasticsearch through script, be sure to provide the same indexing pipeline with your `document-search.haystack-pipeline.yml` file for consistency between indexed files.\n",
I think I get what this is saying, but it's a bit hard to understand. Do you mean that if someone is doing both index via REST API endpoint and index via script that calls indexing pipeline, that the pipeline in each case is the same so that the indexed files are consistent?
On a formatting level, I think this "aside" block is quite hard to distinguish from regular text on the rendered page and would opt against using it.
tutorials/20_REST_API.ipynb
Outdated
"cell_type": "markdown",
"metadata": {},
"source": [
"## Voilà! Make a new query!\n",
Can we show users what the expected response should look like?
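+1. Even a trimmed, invented response would orient readers. A sketch of roughly what a document search `/query` response contains (all field values below are fabricated for illustration; real responses carry more metadata per document):

```python
import json

# Fabricated, trimmed example of a /query response for a document
# search pipeline; values are illustrative only.
response = {
    "query": "What is the capital of France?",
    "documents": [
        {
            "content": "Paris is the capital and most populous city of France.",
            "score": 0.91,
            "meta": {"name": "wiki_france.txt"},
        }
    ],
}

print(json.dumps(response, indent=2))
```
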
markdowns/20_REST_API.md
Outdated
- **Time to complete**: 30 minutes
- **Prerequisites**: N/A
- **Nodes Used**: `ElasticsearchDocumentStore`, `EmbeddingRetriever`
- **Goal**: After completing this tutorial, you will have learned how you can interact with Haystack through REST API.
'interact with a Haystack pipeline' ?
* Remove the demo setup process
* Pay attention to the language
* Add more description to all steps
PR: #61
Hi @TuanaCelik, @brandenchan, and @agnieszka-m, I restructured the tutorial according to your feedback. The main changes I made:
I think this version is cleaner and easier to follow. I tried to pay attention not to suggest options and tried to stick with one approach only. Let me know what you think! 🙌
tutorials/20_REST_API.ipynb
Outdated
"pip install 'farm-haystack[all]'\n",
"pip install -e rest_api/\n",
"\n",
"brew install xpdf # required for PDFToTextConverter node"
Sorry for the intrusion... by the way, great job!
I think that `brew` is installed by default only in macOS, not in Linux and Windows.
From Haystack workflows for tests, I see that you can install `xpdf` on Linux with the following command:
wget --no-check-certificate https://dl.xpdfreader.com/xpdf-tools-linux-4.04.tar.gz && tar -xvf xpdf-tools-linux-4.04.tar.gz && sudo cp xpdf-tools-linux-4.04/bin64/pdftotext /usr/local/bin
For Windows you can install `xpdf` with `choco install xpdf-utils`, but you need to have Chocolatey.
The structure of this Tutorial is much much clearer! Good work. I have just made a set of language changes.
markdowns/20_REST_API.md
Outdated
## Overview

Haystack enables you to apply the latest NLP technology to your own data and create production-ready applications. Building an end-to-end NLP application requires the combination of multiple concepts. Here are those consepts:
* **DocumentStore** stores the data. You will use Elasticsearch for this tutorial.
* **DocumentStore** stores the data. You will use Elasticsearch for this tutorial.
* **DocumentStore** stores the data. You will use the ElasticsearchDocumentStore for this tutorial.
Here it confuses me to put it like that. I said "You will use Elasticsearch for this tutorial" because they will start an Elasticsearch instance and then, use ElasticsearchDocumentStore to connect to the instance. Would it be better like this? 👇
* **DocumentStore** stores the data. You will use Elasticsearch for this tutorial.
* **DocumentStore** stores the data. You will start an Elasticsearch instance and use `ElasticsearchDocumentStore` to connect to the instance.
Amazing job @bilgeyucel such great effort <3
I have left some minor comments. And one extra one that you could consider (maybe with the help of someone in core engineering): include what OS these commands would run on, or some type of prerequisite (in terms of setup, in addition to the ones you've already provided).
markdowns/20_REST_API.md
Outdated
## Create Pipeline YAML File

YAML files are widely used for confugurations. Haystack enables defining pipelines as YAML files and `load_from_yaml` method loads pipelines from YAML file. In a YAML file, `components` section defines all pipeline nodes and `pipelines` section defines how these nodes are connected to each other to form a pipeline. Let's start with defining query and indexing pipelines.
`load_from_yaml` lets you load the YAML pipeline into your python script or pipeline object, right? If I have this right, could the following make sense?
YAML files are widely used for configurations. Haystack enables defining pipelines as YAML files and the `load_from_yaml` method will even allow you to load a pipeline defined in YAML into a Python object.
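To make the `components` vs. `pipelines` split concrete, a stripped-down sketch of such a file could help (node names and parameter values are illustrative, not the tutorial's exact file):

```yaml
components:
  - name: DocumentStore
    type: ElasticsearchDocumentStore
  - name: Retriever
    type: EmbeddingRetriever
    params:
      document_store: DocumentStore
      top_k: 5

pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
```

Here `components` declares each node once, and `pipelines` wires them together by referring to those names.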
markdowns/20_REST_API.md
Outdated
  type: EmbeddingRetriever
  params:
    document_store: DocumentStore
    top_k: 5
This is up to you, but I think the section just under these YAML definitions is a great opportunity to hint to the learner how they can change settings. It might serve as a good mental map from the API reference they can see in docs to implementing them in a YAML file. `top_k` for example, but you can in a short paragraph maybe hint them toward some other things they could set?
I'll leave this decision up to you and @brandenchan depending on the level of 'optionality' you decide to provide in this tutorial.
Good idea to mention the API reference 💡 I don't want to offer any options such as saying "add `max_seq_len` to the Retriever"; instead, I'd encourage them to play with the pipeline later and refer to the API Reference for that. So, this sentence 👇 might come after sharing the whole YAML file. WDYT? @TuanaCelik @brandenchan
Feel free to play with the pipeline setup later on. Add or remove some nodes, change the parameters or add new ones. Check out Haystack API Reference for more options on nodes and parameters.
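As one concrete illustration of the kind of later experimentation that sentence invites, a retriever tweak is just an edit to its `params` block (the model name and values below are illustrative assumptions, not recommendations):

```yaml
  - name: Retriever
    type: EmbeddingRetriever
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1  # illustrative model
      top_k: 10  # return more candidates than before
```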
* Add more explanation
* Make the language cleaner
Just some language comments. Great job!
tutorials/20_REST_API.ipynb
Outdated
"source": [
"## Create Pipeline YAML File\n",
"\n",
"YAML files are widely used for configurations. Haystack enables defining pipelines as YAML files and the `load_from_yaml()` method loads pipelines from YAML file. In a YAML file, the `components` section defines all pipeline nodes and `pipelines` section defines how these nodes are connected to each other to form a pipeline. Let's start with defining query and indexing pipelines."
"YAML files are widely used for configurations. Haystack enables defining pipelines as YAML files and the `load_from_yaml()` method loads pipelines from YAML file. In a YAML file, the `components` section defines all pipeline nodes and `pipelines` section defines how these nodes are connected to each other to form a pipeline. Let's start with defining query and indexing pipelines."
"YAML files are widely used for configurations. You can define Haystack pipelines as YAML files and then use the `load_from_yaml()` method to load pipelines from the YAML file. In a YAML file, the `components` section defines all pipeline nodes and the `pipelines` section defines how these nodes are connected to each other to form a pipeline. Let's start with defining query and indexing pipelines."
It's recommended not to say your product "enables" people to do something. Instead, you can say they can do something with the product.
Hi @agnieszka-m, here, the user doesn't use the `load_from_yaml()` method, it is used by Haystack. I wanted to explain the process a little bit so that they can see the connection between a YAML file and the Python objects we use in Haystack. What do you think about this 👇 ?
YAML files are widely used for configurations. You can define Haystack pipelines as YAML files and then `load_from_yaml()` method loads the pipeline defined in YAML into a Python object. In a YAML file, the `components` section defines all pipeline nodes and the `pipelines` section defines how these nodes are connected to each other to form a pipeline. Let's start with defining query and indexing pipelines.
I love this tutorial, this is very much needed!
I took a first pass and I stopped at the setup phase, let me know what you think!
* Additional language changes
"cell_type": "markdown",
"metadata": {},
"source": [
"2. Clone Haystack repository.\n",
I was wondering, do we need this at all? Since we let users create a pipeline from scratch, why not do the same for the compose file? I believe the narrative would flow better and the result would be much more rewarding for the user. Inexperienced Docker users could still just copy/paste the YAML code into a compose file if they want to skip this section.
Hi @masci, I agree that they don't need to clone the whole repository only for the compose file, yet, I am not sure how to mitigate this issue. In my opinion, if we make them create the compose file from scratch in the tutorial, we need to explain what each line of code does. I didn't want to do that because:
- Explaining the compose file would make the tutorial longer and more complicated for inexperienced users
- The focus point of this tutorial is the pipeline YAML file and how to edit it, I don't want to shift toward Docker and put it in a spotlight
That's why I think for this tutorial it's more important to provide a compose file that works out of the box. Maybe I can give the GitHub link to the compose file and ask them to copy/paste it instead of making them clone the repo. wdyt?
Good point, they can `curl https://raw.githubusercontent.com/deepset-ai/haystack/main/docker-compose.yml` 👍
"source": [
"1. Create a document search pipeline.\n",
"\n",
"Time to design a document search pipeline from scratch. This will be your query pipeline. Create a new file named `document-search.haystack-pipeline.yml` in the `/pipeline` folder under `/rest_api` in the Haystack code base. Then, create a `PIPELINE_YAML_PATH` variable in the `docker-compose.yml` with the new file name. `PIPELINE_YAML_PATH` variable will tell `rest_api` which YAML file to run. \n",
Back to my previous point, if we skip the `git clone haystack` step, the folder from which users run this tutorial might be much simpler, something like:
tutorial
├── docker-compose.yml
└── document-search.haystack-pipeline.yml
This is a great idea 💯
* No highlight language needs to be given
LGTM! 👍
Looks very good to me!
* Add version info
* Fix comment spacing
In this tutorial, I tried to explain the whole process from setting up an environment to querying the pipeline using REST API.
This tutorial cannot be run in Colab, therefore, the structure is different.
I created a preview link if you want to see how it is displayed on Haystack website.
TODO:
- `colab` attribute to be able to skip Colab button and link #81
- `main` and generate the .md file with the download button and new title convention