feat: Add Indexing Pipeline #6424

vblagoje · 2023-11-27T13:20:15Z

Why:

This PR adds a prebuilt indexing pipeline to the Haystack library. The pipeline simplifies indexing documents into a DocumentStore and, when an embedding model is specified, calculating embeddings for these documents.

fixes create ready-made pipelines #5992

What:

The key changes include:

New Functionality in haystack/utils/indexing.py: A new function build_indexing_pipeline has been added. This function creates an indexing pipeline that automatically detects file types (.txt, .pdf, .html) and converts them into Documents. It supports embedding models for document embeddings if specified.
Modifications in haystack/utils/__init__.py: The new build_indexing_pipeline function has been imported and added to the __all__ list for module exports.
Release Notes Update: A release note in releasenotes/notes/add-indexing-ready-made-pipeline-85c1da2f8f910f9d.yaml describes the addition of the indexing ready-made pipeline as a new feature.
Unit Tests: In test/pipelines/test_indexing_pipeline.py, comprehensive tests cover the new pipeline's functionality, including handling different file types and embedding models.
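
To illustrate the file-type detection described above, here is a minimal, library-independent sketch. It is not the code from this PR: `EXTENSION_TO_MIME` and `route_by_mime_type` are hypothetical names, and the mapping only mirrors the three formats listed (.txt, .pdf, .html).

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List

# Hypothetical mapping that mirrors the formats named in the description.
EXTENSION_TO_MIME = {
    ".txt": "text/plain",
    ".pdf": "application/pdf",
    ".html": "text/html",
}


def route_by_mime_type(paths: Iterable[str]) -> Dict[str, List[Path]]:
    """Group paths by MIME type; anything unrecognized goes under 'unclassified'."""
    routed: Dict[str, List[Path]] = defaultdict(list)
    for raw in paths:
        path = Path(raw)
        # Compare suffixes case-insensitively so "b.PDF" is treated like "b.pdf".
        key = EXTENSION_TO_MIME.get(path.suffix.lower(), "unclassified")
        routed[key].append(path)
    return dict(routed)


if __name__ == "__main__":
    for mime, files in route_by_mime_type(["a.txt", "b.PDF", "c.html", "d.docx"]).items():
        print(mime, [str(f) for f in files])
```

Each group could then be handed to the converter responsible for that type, with unclassified files skipped or reported.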

How can it be used:

Developers using the Haystack library can now create an indexing pipeline with a single function call. The pipeline handles the supported file types and optionally computes embeddings, so data can be prepared for search and analysis tasks without assembling the pipeline components manually.

Example usage:

from haystack.utils import build_indexing_pipeline

# my_document_store is assumed to be an already-initialized DocumentStore instance
indexing_pipeline = build_indexing_pipeline(document_store=my_document_store, embedding_model="sentence-transformers/all-mpnet-base-v2")
indexing_pipeline.run(files=["path/to/file1", "path/to/file2"])

How did you test it:

Indexing Without Embeddings: Tested indexing files without embeddings and verified the number of documents written.
Indexing With Embeddings: Integration tests checked the functionality with embedding models, including handling directories and multiple file types.
Validation Tests: Additional tests validated the behavior with invalid input, such as incorrect document stores or embedding models.
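
The shape of the "indexing without embeddings" check can be sketched as follows. This is an illustration only, not the test code from this PR: `FakeDocumentStore` and `index_texts` are hypothetical stand-ins, so the example runs without Haystack installed.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FakeDocumentStore:
    """Hypothetical in-memory stand-in; not Haystack's DocumentStore."""

    documents: List[str] = field(default_factory=list)

    def write_documents(self, docs: List[str]) -> int:
        # Store the documents and report how many were written.
        self.documents.extend(docs)
        return len(docs)


def index_texts(store: FakeDocumentStore, texts: List[str]) -> Dict[str, int]:
    """Hypothetical indexing step: skip blank texts, write the rest, report the count."""
    written = store.write_documents([t for t in texts if t.strip()])
    return {"documents_written": written}


def test_indexing_without_embeddings() -> None:
    store = FakeDocumentStore()
    result = index_texts(store, ["first doc", "second doc", "   "])
    # Two non-blank texts go in, so two documents should be written.
    assert result == {"documents_written": 2}
    assert len(store.documents) == 2


if __name__ == "__main__":
    test_indexing_without_embeddings()
    print("ok")
```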

Notes for the reviewer:

Please review the new function build_indexing_pipeline in indexing.py for its design and integration with existing components.
Pay special attention to the dynamic nature of the pipeline, particularly how it handles different file types and embedding models.
The unit tests in test_indexing_pipeline.py should cover most of the new functionality. Please check if any additional edge cases need testing.
The updates to the __init__.py file are minor but crucial for making the new functionality easily accessible.

Timoeller · 2023-11-28T12:04:30Z

Hey Vlad, this is going in the right direction.

What about customizing the supported file formats?
The issue says:

We want to customize this pipeline on the number of supported file formats. This will make installation easier depending on which file types we want to convert. E.g. we can showcase an indexing pipeline that just converts TXT without additional dependencies.
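
To make the request concrete, here is a minimal sketch of one way to restrict the enabled formats. Everything in it is an assumption for illustration: `select_converters`, the `supported_mime_types` argument, and the two toy converters are hypothetical and do not describe the PR's actual signature. The HTML converter imports its parser lazily, which is the pattern that would keep a TXT-only setup free of extra dependencies.

```python
from typing import Callable, Dict, Iterable, List, Optional


def _convert_txt(data: bytes) -> str:
    """Toy TXT converter: decode the bytes as UTF-8."""
    return data.decode("utf-8")


def _convert_html(data: bytes) -> str:
    """Toy HTML converter: keep only the text nodes."""
    # Imported lazily, so a TXT-only configuration never reaches this import.
    from html.parser import HTMLParser

    class _TextExtractor(HTMLParser):
        def __init__(self) -> None:
            super().__init__()
            self.chunks: List[str] = []

        def handle_data(self, text: str) -> None:
            self.chunks.append(text)

    parser = _TextExtractor()
    parser.feed(data.decode("utf-8"))
    parser.close()
    return "".join(parser.chunks).strip()


# Hypothetical registry of every converter the builder knows about.
ALL_CONVERTERS: Dict[str, Callable[[bytes], str]] = {
    "text/plain": _convert_txt,
    "text/html": _convert_html,
}


def select_converters(
    supported_mime_types: Optional[Iterable[str]] = None,
) -> Dict[str, Callable[[bytes], str]]:
    """Return only the requested converters; None means all of them."""
    if supported_mime_types is None:
        return dict(ALL_CONVERTERS)
    requested = list(supported_mime_types)
    unknown = [m for m in requested if m not in ALL_CONVERTERS]
    if unknown:
        raise ValueError(f"Unsupported MIME types: {unknown}")
    return {m: ALL_CONVERTERS[m] for m in requested}
```

With this shape, `select_converters(["text/plain"])` would yield a pipeline that only converts TXT files.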

vblagoje · 2023-12-04T14:55:31Z

Ok, this seems to be fine now @masci @anakin87 @silvanocerza - thanks

vblagoje requested review from a team as code owners November 27, 2023 13:20

vblagoje requested review from dfokina and masci and removed request for a team November 27, 2023 13:20

github-actions bot added the topic:tests and type:documentation labels Nov 27, 2023

vblagoje marked this pull request as draft November 27, 2023 13:36

vblagoje marked this pull request as ready for review November 27, 2023 15:06

vblagoje force-pushed the indexing_pipeline branch 4 times, most recently from 24df78c to 726c2b3 November 30, 2023 14:32

anakin87 mentioned this pull request Dec 4, 2023

feat: Add RAG pipeline #6461

Merged

vblagoje added 6 commits December 4, 2023 15:25

Add build_indexing_pipeline utils function

de8003e

Pylint fixes

0142cd5

Move into another package to avoid circular deps

9e60ca9

Revert change

e876c8e

Revert haystack/utils/__init__.py change

d66d97a

Add example

93430ef

vblagoje force-pushed the indexing_pipeline branch from c38e099 to 93430ef December 4, 2023 14:29

Use DocumentStore type, remove typing checks

108c66e

masci approved these changes Dec 4, 2023


masci merged commit 008a322 into main Dec 4, 2023
20 checks passed

masci deleted the indexing_pipeline branch December 4, 2023 15:08
