feat: Add Indexing Pipeline #6424

Merged
merged 7 commits into from
Dec 4, 2023

Conversation

vblagoje
Member

@vblagoje vblagoje commented Nov 27, 2023

Why:

This PR adds a prebuilt indexing pipeline to Haystack. The pipeline indexes documents into a DocumentStore and, when an embedding model is specified, calculates embeddings for those documents.

What:

The key changes include:

  1. New Functionality in haystack/utils/indexing.py: A new function build_indexing_pipeline has been added. This function creates an indexing pipeline that automatically detects file types (.txt, .pdf, .html) and converts them into Documents. It supports embedding models for document embeddings if specified.
  2. Modifications in haystack/utils/__init__.py: The new build_indexing_pipeline function has been imported and added to the __all__ list for module exports.
  3. Release Notes Update: A release note in releasenotes/notes/add-indexing-ready-made-pipeline-85c1da2f8f910f9d.yaml describes the addition of the indexing ready-made pipeline as a new feature.
  4. Unit Tests: In test/pipelines/test_indexing_pipeline.py, comprehensive tests cover the new pipeline's functionality, including handling different file types and embedding models.
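The file-type handling described in item 1 can be pictured as a routing step that sends each file to a matching converter. The following is a minimal sketch of that idea using only the standard library; it is not the code in haystack/utils/indexing.py, and `CONVERTERS`, `route_files`, and the converter labels are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from pathlib import Path

# Hypothetical mapping from file extension to a converter label.
# The labels are placeholders, not Haystack component names.
CONVERTERS = {
    ".txt": "text_converter",
    ".pdf": "pdf_converter",
    ".html": "html_converter",
}


def route_files(paths):
    """Group file paths by the converter that would handle them."""
    routed = defaultdict(list)
    for path in paths:
        # Compare extensions case-insensitively so "b.PDF" matches ".pdf".
        suffix = Path(path).suffix.lower()
        routed[CONVERTERS.get(suffix, "unclassified")].append(str(path))
    return dict(routed)


if __name__ == "__main__":
    print(route_files(["a.txt", "docs/b.PDF", "c.html", "d.xyz"]))
```

Files with an unrecognized extension land in an "unclassified" bucket here; how the real pipeline treats such files is not stated in this PR description.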

How can it be used:

Developers using Haystack can now create an indexing pipeline with a single function call. The pipeline handles the supported file types and optionally computes embeddings, so data can be prepared for search without wiring the components together by hand.

Example usage:

from haystack.utils import build_indexing_pipeline

# my_document_store is a DocumentStore instance created beforehand
indexing_pipeline = build_indexing_pipeline(document_store=my_document_store, embedding_model="sentence-transformers/all-mpnet-base-v2")
indexing_pipeline.run(files=["path/to/file1", "path/to/file2"])

How did you test it:

  1. Indexing Without Embeddings: Tested indexing files without embeddings and verified the number of documents written.
  2. Indexing With Embeddings: Integration tests checked the functionality with embedding models, including handling directories and multiple file types.
  3. Validation Tests: Additional tests validated the behavior with invalid input, such as incorrect document stores or embedding models.
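As a rough illustration of the validation tests in item 3, the sketch below checks that unusable arguments are rejected. It does not reproduce test/pipelines/test_indexing_pipeline.py: `FakeDocumentStore`, `check_indexing_args`, and `raises_value_error` are hypothetical, and the use of `ValueError` is an assumption rather than the documented behavior of build_indexing_pipeline.

```python
class FakeDocumentStore:
    """Hypothetical stand-in that only records what is written to it."""

    def __init__(self):
        self.documents = []

    def write_documents(self, documents):
        self.documents.extend(documents)
        return len(documents)


def check_indexing_args(document_store, embedding_model=None):
    """Hypothetical validation: raise ValueError on unusable arguments."""
    if not callable(getattr(document_store, "write_documents", None)):
        raise ValueError("document_store must provide a write_documents method")
    if embedding_model is not None:
        if not isinstance(embedding_model, str) or not embedding_model.strip():
            raise ValueError("embedding_model must be a non-empty string or None")


def raises_value_error(func, *args, **kwargs):
    """Return True if calling func raises ValueError, otherwise False."""
    try:
        func(*args, **kwargs)
    except ValueError:
        return True
    return False


if __name__ == "__main__":
    store = FakeDocumentStore()
    print(raises_value_error(check_indexing_args, store))             # valid store, no model
    print(raises_value_error(check_indexing_args, "not a store"))     # invalid store
    print(raises_value_error(check_indexing_args, store, ""))         # invalid model name
```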

Notes for the reviewer:

  • Please review the new function build_indexing_pipeline in indexing.py for its design and integration with existing components.
  • Pay special attention to the dynamic nature of the pipeline, particularly how it handles different file types and embedding models.
  • The unit tests in test_indexing_pipeline.py should cover most of the new functionality. Please check if any additional edge cases need testing.
  • The updates to the __init__.py file are minor but crucial for making the new functionality easily accessible.

@vblagoje vblagoje requested review from a team as code owners November 27, 2023 13:20
@vblagoje vblagoje requested review from dfokina and masci and removed request for a team November 27, 2023 13:20
@github-actions github-actions bot added topic:tests type:documentation Improvements on the docs labels Nov 27, 2023
@vblagoje vblagoje marked this pull request as draft November 27, 2023 13:36
@vblagoje vblagoje marked this pull request as ready for review November 27, 2023 15:06
@Timoeller
Contributor

Hey Vlad, this is going in the right direction.

What about customizing the supported file formats?
The issue says:

We want to customize this pipeline on the number of supported file formats. This will make installation easier depending on which file types we want to convert. E.g. we can showcase an indexing pipeline that just converts TXT without additional dependencies.
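One way to picture the customization requested here is a parameter that restricts the pipeline to a subset of file formats, so that only the converters for those formats are needed. The sketch below is an assumption about how such a selection could work, not the API that was merged; `ALL_CONVERTERS`, `select_converters`, and `supported_mime_types` are hypothetical names.

```python
# Hypothetical registry of every converter the pipeline could offer,
# keyed by MIME type. The labels are placeholders, not Haystack classes.
ALL_CONVERTERS = {
    "text/plain": "text_converter",
    "application/pdf": "pdf_converter",
    "text/html": "html_converter",
}


def select_converters(supported_mime_types=None):
    """Return only the converters for the requested MIME types.

    Passing None keeps every converter. Unknown MIME types raise
    ValueError so that a typo is caught early.
    """
    if supported_mime_types is None:
        return dict(ALL_CONVERTERS)
    unknown = sorted(set(supported_mime_types) - set(ALL_CONVERTERS))
    if unknown:
        raise ValueError(f"Unsupported MIME types: {unknown}")
    return {mime: ALL_CONVERTERS[mime] for mime in supported_mime_types}


if __name__ == "__main__":
    # A TXT-only pipeline would need just the plain-text converter.
    print(select_converters(["text/plain"]))
```

In a real implementation, the optional dependencies for PDF or HTML conversion would only have to be imported when those formats are selected, which is the installation benefit the issue describes.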

@vblagoje vblagoje force-pushed the indexing_pipeline branch 4 times, most recently from 24df78c to 726c2b3 Compare November 30, 2023 14:32
@anakin87 anakin87 mentioned this pull request Dec 4, 2023
@vblagoje
Member Author

vblagoje commented Dec 4, 2023

Ok, this seems to be fine now @masci @anakin87 @silvanocerza - thanks

@masci masci merged commit 008a322 into main Dec 4, 2023
20 checks passed
@masci masci deleted the indexing_pipeline branch December 4, 2023 15:08
Labels
topic:tests type:documentation Improvements on the docs
Development

Successfully merging this pull request may close these issues.

create ready-made pipelines
3 participants