Why:
The changes in the indexing_pipeline branch aim to enhance the functionality of the Haystack library by introducing a prebuilt indexing pipeline. This pipeline simplifies the process of indexing documents into a DocumentStore and calculating embeddings for these documents.
What:
The key changes include:
- haystack/utils/indexing.py: A new function, build_indexing_pipeline, has been added. This function creates an indexing pipeline that automatically detects file types (.txt, .pdf, .html) and converts them into Documents. It supports embedding models for document embeddings if specified.
- haystack/utils/__init__.py: The new build_indexing_pipeline function has been imported and added to the __all__ list for module exports.
- releasenotes/notes/add-indexing-ready-made-pipeline-85c1da2f8f910f9d.yaml: Describes the addition of the ready-made indexing pipeline as a new feature.
- test/pipelines/test_indexing_pipeline.py: Comprehensive tests cover the new pipeline's functionality, including handling different file types and embedding models.
How can it be used:
Developers using the Haystack library can now easily create an indexing pipeline with a single function call. This pipeline will handle different file types and optionally compute embeddings, significantly simplifying the process of preparing data for search and analysis tasks.
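The file-type detection mentioned above can be illustrated with a small standalone sketch. This is not the code from the PR and does not use Haystack: the function name route_by_mime_type, the SUPPORTED_MIME_TYPES tuple, and the "unclassified" bucket are hypothetical names chosen for illustration, and the mapping of .txt, .pdf and .html to MIME types relies on Python's standard mimetypes module.

```python
import mimetypes
from collections import defaultdict

# Assumption: these MIME types stand in for the .txt, .pdf and .html
# file types named in the PR description.
SUPPORTED_MIME_TYPES = ("text/plain", "application/pdf", "text/html")


def route_by_mime_type(paths, supported=SUPPORTED_MIME_TYPES):
    """Group file paths by guessed MIME type.

    Paths whose type is unknown or not in `supported` are collected
    under the hypothetical "unclassified" key.
    """
    routed = defaultdict(list)
    for path in paths:
        # guess_type inspects only the file name; it does not open the file.
        mime_type, _encoding = mimetypes.guess_type(str(path))
        key = mime_type if mime_type in supported else "unclassified"
        routed[key].append(str(path))
    return dict(routed)


if __name__ == "__main__":
    files = ["notes.txt", "paper.pdf", "page.html", "data.unknownext"]
    for mime_type, group in route_by_mime_type(files).items():
        print(mime_type, group)
```

In a pipeline of the kind described, each group would then be handed to a converter suited to that type before the resulting Documents are written to the store.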
Example usage:
How did you test it:
Notes for the reviewer:
- Please review build_indexing_pipeline in indexing.py for its design and integration with existing components.
- The tests in test_indexing_pipeline.py should cover most of the new functionality. Please check if any additional edge cases need testing.
- The changes to the __init__.py file are minor but crucial for making the new functionality easily accessible.