New Tutorial: REST API #61

bilgeyucel · 2022-11-04T14:40:41Z

In this tutorial, I tried to explain the whole process from setting up an environment to querying the pipeline using REST API.

This tutorial cannot be run in Colab, therefore, the structure is different.

I created a preview link if you want to see how it is displayed on Haystack website.

TODO:

Update README - waiting for tutorial 19 to be merged
Remove Colab link - this tutorial cannot be run on Google Colab, removing the link is possible after Customize colab attribute to be able to skip Colab button and link #81
Merge main and generate the .md file with the download button and new title convention

review-notebook-app · 2022-11-04T14:40:46Z

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

agnieszka-m

Here are my 2 cents. I tried to add a reason for most of my comments, but if anyting is unclear, just ping me and I'll be happy to chat.
One thing that's not clear to me is why did we need to first set up the demo files and pipeline if we created a totally new pipeline after that?
Also, I think I'd maybe add more info about the pipeline that they were creating. For example, what params it's using and why (just one sentence like: this pipeline uses the default params that work well for htis amount of files blahblah)?

agnieszka-m · 2022-11-07T09:37:28Z

index.toml

+
+[[tutorial]]
+title = "Haystack with REST API"


I think here we're aiming at something like: Tutorial: Using Haystack's REST API."
So always "Tutorial" + the name of the task (starting with either Gerund or imperative -> please consult with Branden, I'm ok with either).

agnieszka-m · 2022-11-07T09:38:32Z

tutorials/20_REST_API.ipynb

+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Haystack with REST API\n",


See my comment about tutorial titles above

agnieszka-m · 2022-11-07T09:40:15Z

tutorials/20_REST_API.ipynb

+    "\n",
+    "- **Level**: Intermediate\n",
+    "- **Time to complete**: 30 minutes\n",
+    "- **Prerequisites**: N/A\n",


If this is an intermediate tutorial, we surely need them to have some prerequisite knowledge - knowledge of (basic) NLP concepts (maybe there are some specific concepts they need to know for this tutorial)? Python knowledge (we're still debating whether this should be added as a prereq), basic knowledge of Haystack? You can also add a link to any beginner tutorials that they can complete before this one to make them understand the necessary concepts.

Honestly I would suggest making this tutorial condensed to 'using the Rest API'. Lmk what you think here as I think this might be a bit controversial:
The usual problems we see is simply technical, in terms of just setting up, using and interacting with docker and rest api. So I would argue understanding in detail the pipeline structure and each component isn't the aim here. Imo the things people should come away with is:

These .yml files are where I define my config

They are able to edit the file in a way, maybe with some examples (change the model, reader setting etc) and use this as the pipeline that they query with rest api

For details on each NLP related component and par I would forward them to the relevant docs (e.g. the full set of options for a preprocessor, or reader, or retriever etc)

Adding this comment as a reply here because of the mention of 'NLP concepts' which I think could be added as a prerequisite. But mainly I would say the prerequisites are: basic understanding of the YAML syntax (link to resource), basic understanding of docker.

agnieszka-m · 2022-11-07T09:42:53Z

tutorials/20_REST_API.ipynb

+    "- **Nodes Used**: `ElasticsearchDocumentStore`, `EmbeddingRetriever`\n",
+    "- **Goal**: After completing this tutorial, you will have learned how you can interact with Haystack through REST API.\n",
+    "\n",
+    "This tutorial teaches you how to create your production-ready document search `pipeline.yml` and interact with Haystack through REST API. \n",


This and the next paragraph should go to a Description section (have a look at the recent tutorials that Branden updated).

agnieszka-m · 2022-11-07T09:44:24Z

tutorials/20_REST_API.ipynb

+    "\n",
+    "This tutorial teaches you how to create your production-ready document search `pipeline.yml` and interact with Haystack through REST API. \n",
+    "\n",
+    "First, we are going to set up the environment to run the same question answering pipeline in [Explore the World Demo](https://haystack-demo.deepset.ai/), then create a new pipeline for the new document search system."


Suggested change

"First, we are going to set up the environment to run the same question answering pipeline in [Explore the World Demo](https://haystack-demo.deepset.ai/), then create a new pipeline for the new document search system."

"First, we are going to set up the environment to run the same question answering pipeline as in the [Explore the World Demo](https://haystack-demo.deepset.ai/), then create a new pipeline for the new document search system."

Now that I'm reading it the second time, I'm not sure I understand this. Do you mean they're going to use the same data as in the demo? or the same pipeline as in the demo? if the same pipeline, why?

agnieszka-m · 2022-11-07T12:20:40Z

tutorials/20_REST_API.ipynb

+    "\n",
+    "You can use REST API to index your files to your document store. This requires an indexing pipeline. Add the indexing pipeline to `document-search.haystack-pipeline.yml`, then, you can use `/file-upload` endpoint to upload your files to Elasticsearch. \n",
+    "\n",
+    "Download the same demo files [here](https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/article_txt_countries_and_capitals.zip) and upload them using cURL. Check [this](https://docs.haystack.deepset.ai/docs/rest_api#indexing-documents-in-the-haystack-rest-api-document-store) documentation page for more detail about file indexing.\n",


Suggested change

"Download the same demo files [here](https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/article_txt_countries_and_capitals.zip) and upload them using cURL. Check [this](https://docs.haystack.deepset.ai/docs/rest_api#indexing-documents-in-the-haystack-rest-api-document-store) documentation page for more detail about file indexing.\n",

"Download the same [demo files](https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/article_txt_countries_and_capitals.zip) you used in the first part of this tutorial and upload them using cURL.

To learn more about indexing, see [Indexing Documents](https://docs.haystack.deepset.ai/docs/rest_api#indexing-documents-in-the-haystack-rest-api-document-store).\n",

agnieszka-m · 2022-11-07T12:21:50Z

tutorials/20_REST_API.ipynb

+    "Download the same demo files [here](https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/article_txt_countries_and_capitals.zip) and upload them using cURL. Check [this](https://docs.haystack.deepset.ai/docs/rest_api#indexing-documents-in-the-haystack-rest-api-document-store) documentation page for more detail about file indexing.\n",
+    "\n",
+    "<aside>\n",
+    "⚠️ If you want to index your files directly to Elasticsearch through script, be sure to provide the same indexing pipeline with your `document-search.haystack-pipeline.yml` file for consistency between indexed files.\n",


Suggested change

"⚠️ If you want to index your files directly to Elasticsearch through script, be sure to provide the same indexing pipeline with your `document-search.haystack-pipeline.yml` file for consistency between indexed files.\n",

"⚠️To index your files directly to Elasticsearch through a script, use the same indexing pipeline with your `document-search.haystack-pipeline.yml` file to make sure the files indexed are consistent.\n",

Our voice: conversational

agnieszka-m · 2022-11-07T12:22:11Z

tutorials/20_REST_API.ipynb

+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The indexing pipeline should be as follows:\n",


Suggested change

"The indexing pipeline should be as follows:\n",

"Here's what the indexing pipeline should look like:\n",

agnieszka-m · 2022-11-07T12:22:55Z

tutorials/20_REST_API.ipynb

+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "When you restart the Haystack API, REST API is going to load the new pipeline and the pipeline will be ready to use. "


Suggested change

"When you restart the Haystack API, REST API is going to load the new pipeline and the pipeline will be ready to use. "

"When you restart the Haystack API, REST API will load the new pipeline and the pipeline will be ready to use. "

agnieszka-m · 2022-11-07T12:24:01Z

tutorials/20_REST_API.ipynb

+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}


I'd add a paragraph summing up what they have learned/built/achieved in this tutorial. Maybe some links they may want to check, additional info, etc.

brandenchan

Thanks for making a start on this! I think that you have covered many of the start up steps for the REST API in good detail.

I've made a set of comments that mostly request structural / content changes that I think will simplify the tutorial or provide users with a little bit of conceptual overview. This tutorial covers a really tough topic because we are asking users to interact a lot of components (e.g. Docker, CURL, Yaml Pipelines, REST API, Swagger) and I would push to make the experience still a bit more straight forward for the user.

More than happy to clarify any thing that's unclear, or to have a chat to align!

brandenchan · 2022-11-08T16:32:11Z

tutorials/20_REST_API.ipynb

+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### With Docker\n",


I see that there is both the option for with and without Docker. In the first tutorials that I've reworked, I've opted not to give options in the tutorial. Let's teach them to do one thing. I think this will make it much more linear and easy for a user to follow.

This also raises the question of the relationship between this tutorial and the guide on the REST API. The way I see it is that the docs page should have small individual sections that explain each task / step that could arise during the creation of the REST API, but they are separate from any specific environment / dataset / pipeline. The tutorial should be a single set of steps that works with a specific environment / dataset / pipeline to create one possible outcome.

So what I'd push for here is to take out the "REST API with Docker" section from this tutorial and turn this specifically into a REST API without Docker tutorial. You can always have a comment saying "If you'd like to initialize with Docker instead see [Initialization with Docker]"

brandenchan · 2022-11-08T16:36:42Z

tutorials/20_REST_API.ipynb

+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Setting Up the Environment\n",


Just an overview on how I've been using heading levels so far:
Header level 1 reserved for title of notebook (e.g. # Build Your First Question Answering System)
Header level 2 is for sections of the tutorial (e.g. ## Preparing the Colab Environment)
Header level 3 is not used because it is not easily distinguishable from Header level 2 visually, and because it isn't shown in the right ToC

Within header level 2, you can have numbered lists
e.g. 1. Update/install Docker and Docker Compose, then launch Docker then 2. Clone Haystack repository

brandenchan · 2022-11-08T16:41:47Z

tutorials/20_REST_API.ipynb

+    "- **Nodes Used**: `ElasticsearchDocumentStore`, `EmbeddingRetriever`\n",
+    "- **Goal**: After completing this tutorial, you will have learned how you can interact with Haystack through REST API.\n",
+    "\n",
+    "This tutorial teaches you how to create your production-ready document search `pipeline.yml` and interact with Haystack through REST API. \n",


I think it might not be clear yet to a user what a pipeline.yml is. How about we just call it a search pipeline?

brandenchan · 2022-11-08T16:45:27Z

tutorials/20_REST_API.ipynb

+   "source": [
+    "* **Launch Elasticsearch**\n",
+    "\n",
+    "Launching Elasticsearch takes some time, so, be sure to have a `healthy` Elasticsearch container before continue. You can check the health through `docker ps` command. \n",


Those who haven't worked with ES before might not know what a healthy response is from ES. If it isn't too cluttered I wonder if a screen shot might help here

brandenchan · 2022-11-08T16:50:54Z

tutorials/20_REST_API.ipynb

+    "\n",
+    "Test whether everything is okay by going to Swagger documentation for the Haystack REST API on [`http://127.0.0.1:8000/docs`](http://127.0.0.1:8000/docs) and trying out `/initialized` endpoint or sending a cURL request as `curl -X GET http://127.0.0.1:8000/initialized`. \n",
+    "\n",
+    "If everything is alright, you can start asking questions! Wikipedia pages about countries and capital are already indexed to Elasticsearch by the docker image we provided in [`docker-compose.yml`](https://github.com/deepset-ai/haystack/blob/main/docker-compose.yml#L22). To ask questions, you can use `/query` endpoint again via Haystack REST API UI or a cURL request.   "


What are you referring to exactly when you say REST API UI? Do you mean sending a request via Swagger or using the actual Haystack UI? Either way I think users need more hand holding and explanation before we can tell them to do this

tutorials/20_REST_API.ipynb

brandenchan · 2022-11-08T17:11:24Z

tutorials/20_REST_API.ipynb

+   "source": [
+    "# Haystack with REST API\n",
+    "\n",
+    "- **Level**: Intermediate\n",


To be honest, I think this might actually be an advanced topic. The user has to work with Docker, requests to REST API, pipeline ymls. They also have to make changes to lots of files which they probably don't fully understand the significance of.

brandenchan · 2022-11-08T17:14:15Z

tutorials/20_REST_API.ipynb

+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Indexing pipeline\n",


If we are going to suggest performing indexing via the /file-upload endpoint, I think we need to give a bit more guidance as to what this actually entails. What does the curl command look like?

brandenchan · 2022-11-08T17:17:34Z

tutorials/20_REST_API.ipynb

+    "Download the same demo files [here](https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/article_txt_countries_and_capitals.zip) and upload them using cURL. Check [this](https://docs.haystack.deepset.ai/docs/rest_api#indexing-documents-in-the-haystack-rest-api-document-store) documentation page for more detail about file indexing.\n",
+    "\n",
+    "<aside>\n",
+    "⚠️ If you want to index your files directly to Elasticsearch through script, be sure to provide the same indexing pipeline with your `document-search.haystack-pipeline.yml` file for consistency between indexed files.\n",


I think I get what this is saying, but it's a bit hard to understand. Do you mean that if someone is doing both index via REST API endpoint and index via script that calls indexing pipeline, that the pipeline in each case is the same so that the indexed files are consistent?

On a formatting level, I think this "aside" block is quite hard to distinguish from regular text on the rendered page and would opt against using it.

brandenchan · 2022-11-08T17:19:14Z

tutorials/20_REST_API.ipynb

+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Voilà! Make a new query!\n",


Can we show users what the expected response should look like?

TuanaCelik · 2022-11-09T15:31:01Z

markdowns/20_REST_API.md

+- **Time to complete**: 30 minutes
+- **Prerequisites**: N/A
+- **Nodes Used**: `ElasticsearchDocumentStore`, `EmbeddingRetriever`
+- **Goal**: After completing this tutorial, you will have learned how you can interact with Haystack through REST API.


'interact with a Haystack pipeline' ?

TuanaCelik · 2022-11-09T15:39:08Z

tutorials/20_REST_API.ipynb

+    "\n",
+    "- **Level**: Intermediate\n",
+    "- **Time to complete**: 30 minutes\n",
+    "- **Prerequisites**: N/A\n",


Honestly I would suggest making this tutorial condensed to 'using the Rest API'. Lmk what you think here as I think this might be a bit controversial:
The usual problems we see is simply technical, in terms of just setting up, using and interacting with docker and rest api. So I would argue understanding in detail the pipeline structure and each component isn't the aim here. Imo the things people should come away with is:

These .yml files are where I define my config

They are able to edit the file in a way, maybe with some examples (change the model, reader setting etc) and use this as the pipeline that they query with rest api

For details on each NLP related component and par I would forward them to the relevant docs (e.g. the full set of options for a preprocessor, or reader, or retriever etc)

Adding this comment as a reply here because of the mention of 'NLP concepts' which I think could be added as a prerequisite. But mainly I would say the prerequisites are: basic understanding of the YAML syntax (link to resource), basic understanding of docker.

* Remove the demo setup process * Pay attention to the language * Add more description to all steps PR: #61

bilgeyucel · 2022-11-23T12:41:28Z

Hi @TuanaCelik, @brandenchan, and @agnieszka-m, I restructured the tutorial according to your feedback. The main changes I made:

Removed the first part where I explained the demo setup. Now I directly focus on creating a new application but with more guidance
Didn't mention Swagger UI at all. Swagger UI is nice to use but not a production-focused tool, so, I removed it completely
Added more explanation to all steps.
Followed the new structure of tutorials on Tutorial restructure draft #44

I think this version is cleaner and easier to follow. I tried to pay attention not to suggest options and tried to stick with one approach only.

Let me know what you think! 🙌

anakin87 · 2022-11-23T14:04:15Z

tutorials/20_REST_API.ipynb

+    "pip install 'farm-haystack[all]'\n",
+    "pip install -e rest_api/\n",
+    "\n",
+    "brew install xpdf # required for PDFToTextConverter node"


Sorry for the intrusion... by the way, great job!

I think that brew is installed by default only in macOS, not in Linux and Windows.

From Haystack workflows for tests, I see that you can install xpdf on Linux with the following command:
wget --no-check-certificate https://dl.xpdfreader.com/xpdf-tools-linux-4.04.tar.gz && tar -xvf xpdf-tools-linux-4.04.tar.gz && sudo cp xpdf-tools-linux-4.04/bin64/pdftotext /usr/local/bin

For Windows you can install xpdf with choco install xpdf-utils, but you need to have Chocolatey.

brandenchan

The structure of this Tutorial is much much clearer! Good work. I have just made a set of language changes.

markdowns/20_REST_API.md

brandenchan · 2022-11-23T14:44:21Z

markdowns/20_REST_API.md

+## Overview
+
+Haystack enables you to apply the latest NLP technology to your own data and create production-ready applications. Building an end-to-end NLP application requires the combination of multiple concepts. Here are those consepts:
+* **DocumentStore** stores the data. You will use Elasticsearch for this tutorial.


Suggested change

* **DocumentStore** stores the data. You will use Elasticsearch for this tutorial.

* **DocumentStore** stores the data. You will use the ElasticsearchDocumentStore for this tutorial.

Here it confuses me to put it like that. I said "You will use Elasticsearch for this tutorial" because they will start an Elasticsearch instance and then, use ElasticsearchDocumentStore to connect to the instance. Would it be better like this? 👇

Suggested change

* **DocumentStore** stores the data. You will use Elasticsearch for this tutorial.

* **DocumentStore** stores the data. You will start an Elasticsearch instance and use `ElasticsearchDocumentStore` to connect to the instance.

markdowns/20_REST_API.md

TuanaCelik

Amazing job @bilgeyucel such great effort <3
I have left some minor comments. And one extra one that you could consider with the help of someone in core-engineering maybe is to include what OS these commands would run on, or some type of pre-req (in terms of setup, additionally to the ones you've already provided)

markdowns/20_REST_API.md

TuanaCelik · 2022-11-24T16:44:39Z

markdowns/20_REST_API.md

+
+## Create Pipeline YAML File
+
+YAML files are widely used for confugurations. Haystack enables defining pipelines as YAML files and `load_from_yaml` method loads pipelines from YAML file. In a YAML file, `components` section defines all pipeline nodes and `pipelines` section defines how these nodes are connected to each other to form a pipeline. Let's start with defining query and indexing pipelines.


load_from_yaml lets you load the YAML pipeline into your python script or pipeline object right? If I have this right could the following make sense?

YAML files are widely used for configurations. Haystack enables defining pipelines as YAML files and the load_from_yaml method will even allow you to load a pipeline defined in YAML into a Python object.

markdowns/20_REST_API.md

TuanaCelik · 2022-11-24T16:51:06Z

markdowns/20_REST_API.md

+    type: EmbeddingRetriever
+    params:
+      document_store: DocumentStore
+      top_k: 5 


This is up to you, but I think the section just under these yaml definitions are a great opportunity to hint to the learner how they can change setting. It might serve as a good mental map from the API reference they can see in docs to implementing them in a YAML file. top_k for example but you can in a short paragraph maybe hint them into some other things they could set?

I'll leave this decision up to you and @brandenchan depending on the level of 'optionality' you decide to provide in this tutorial

Good idea to mention API reference 💡 I don't want to offer any options such as saying that "add max_seq_len to Retriever", instead, encourage them to play with the pipeline later and refer to the API Reference for that. So, this sentence👇 might come after sharing the whole YAML file. WDYT? @TuanaCelik @brandenchan

Feel free to play with the pipeline setup later on. Add or remove some nodes, change the parameters or add new ones. Check out Haystack API Reference for more options on nodes and parameters.

* Add more explanation * Make the language cleaner

agnieszka-m

Just some language comments. Great job!

tutorials/20_REST_API.ipynb

agnieszka-m · 2022-11-28T07:27:19Z

tutorials/20_REST_API.ipynb

+   "source": [
+    "## Create Pipeline YAML File\n",
+    "\n",
+    "YAML files are widely used for configurations. Haystack enables defining pipelines as YAML files and the `load_from_yaml()` method loads pipelines from YAML file. In a YAML file, the `components` section defines all pipeline nodes and `pipelines` section defines how these nodes are connected to each other to form a pipeline. Let's start with defining query and indexing pipelines."


Suggested change

"YAML files are widely used for configurations. Haystack enables defining pipelines as YAML files and the `load_from_yaml()` method loads pipelines from YAML file. In a YAML file, the `components` section defines all pipeline nodes and `pipelines` section defines how these nodes are connected to each other to form a pipeline. Let's start with defining query and indexing pipelines."

"YAML files are widely used for configurations. You can define Haystack pipelines as YAML files and then use the `load_from_yaml()` method to load pipelines from the YAML file. In a YAML file, the `components` section defines all pipeline nodes and the `pipelines` section defines how these nodes are connected to each other to form a pipeline. Let's start with defining query and indexing pipelines."

It's recommended not to say your product "enables" people to do something. Instead, you can say they can do something with the product.

Hi @agnieszka-m, here, the user doesn't use the load_from_yaml() method, it is used by Haystack. I wanted to explain the process a little bit so that they can see the connection between a YAML file and python objects we use in Haystack. What do you think about this 👇 ?

YAML files are widely used for configurations. You can define Haystack pipelines as YAML files and then `load_from_yaml()` method loads the pipeline defined in YAML into a Python object. In a YAML file, the `components` section defines all pipeline nodes and the `pipelines` section defines how these nodes are connected to each other to form a pipeline. Let's start with defining query and indexing pipelines.

tutorials/20_REST_API.ipynb

masci

I love this tutorial, this is very much needed!

I took a first pass and I stopped at the setup phase, let me know what you think!

tutorials/20_REST_API.ipynb

* Additionally language changes

masci · 2022-12-12T18:55:28Z