Chunker component #528

Merged · 9 commits merged into main on Oct 18, 2023
Conversation

PhilippeMoussalli (Contributor):
PR that adds a basic chunking component that chunks text based on length and overlap, as well as an example RAG pipeline.

Taken from https://www.pinecone.io/learn/chunking-strategies/. Different chunking strategies are referenced there as well; it might be interesting to reference them in the CC demo for people to try to implement.

The starting dataset is arbitrary and can be changed later on; I just needed something for testing.
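A minimal sketch of the length-and-overlap chunking described above, using the LangChain RecursiveCharacterTextSplitter that appears later in this PR; the chunk_size and chunk_overlap values are illustrative, not the component's defaults:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Illustrative values; length and overlap are the two knobs of this strategy.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,    # maximum characters per chunk
    chunk_overlap=32,  # characters shared between consecutive chunks
)

document = "..."  # the text of one input document
chunks = text_splitter.split_text(document)  # list of overlapping chunks
```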

RobbeSneyders and others added 2 commits October 17, 2023 18:32
…rsion (#525)

This PR addresses 2 issues:
- The `client_kwargs` argument was not propagated properly and couldn't
be used
- Python 3.11 leads to very long dependency resolution times with KfP
and aiplatform dependencies

---------

Co-authored-by: Philippe Moussalli <philippe.moussalli95@gmail.com>
Review thread on the component spec:

```yaml
      data:
        type: string

produces:
```
Contributor:
What do you think about keeping a reference to the original document? E.g. doc_1 is chunked into doc_1_1, doc_1_2, ... I can imagine that in some use cases it is useful to find the original document of a chunk.

We could also keep the text as it is and add an additional column, chunks, which would be a list of chunks.

PhilippeMoussalli (Contributor, Author):

> What do you think about keeping a reference to the original document? E.g. doc_1 is chunked into doc_1_1, doc_1_2, ... I can imagine that in some use cases it is useful to find the original document of a chunk.

That's currently the case, since I take the original id of the document and append the chunk number to it here.

> We could also keep the text as it is and add an additional column, chunks, which would be a list of chunks.

I think that's a good alternative, although thinking about the next component (embedding), I'd rather have an embedding per row than a list of embeddings. I think it also better matches how things will be stored in the vector database.

Member:
Agree with having one row per chunk. I do think adding original_document_id as a column makes sense, though.

PhilippeMoussalli (Contributor, Author):
Yes, it might be better to link back to the original document. Added!
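To make the agreed-upon layout concrete, a small illustration; the exact column and id naming here is a sketch, not the merged schema. A document doc_1 that splits into two chunks becomes two rows:

```python
import pandas as pd

# Illustrative only: one row per chunk, with a pointer back to the source document.
chunked = pd.DataFrame(
    {
        "original_document_id": ["doc_1", "doc_1"],
        "text": ["first chunk of doc_1 ...", "second chunk of doc_1 ..."],
    },
    index=pd.Index(["doc_1_1", "doc_1_2"], name="id"),
)
```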

Review thread on the component code:

```python
logger = logging.getLogger(__name__)


def chunk_text(row, text_splitter: RecursiveCharacterTextSplitter) -> t.List[t.Tuple]:
```
Member:
I would move this function into the Component, so we can just access the text splitter as self.text_splitter instead of passing it around.
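A rough sketch of what that refactor could look like; the class shape, constructor arguments, transform signature, and column names are illustrative assumptions, not the merged code:

```python
import itertools
import typing as t

import pandas as pd
from langchain.text_splitter import RecursiveCharacterTextSplitter


class ChunkTextComponent:
    """Sketch only: the splitter lives on the component instead of being passed around."""

    def __init__(self, *, chunk_size: int, chunk_overlap: int):
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
        )

    def chunk_text(self, row) -> t.List[t.Tuple]:
        # Access the splitter via self; derive chunk ids from the document id.
        doc_id = row.name
        chunks = self.text_splitter.split_text(row["text"])  # "text" column is illustrative
        return [
            (f"{doc_id}_{i}", doc_id, chunk)
            for i, chunk in enumerate(chunks, start=1)
        ]

    def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        # One output row per chunk, keeping a link to the original document.
        rows = list(
            itertools.chain.from_iterable(
                self.chunk_text(row) for _, row in dataframe.iterrows()
            )
        )
        return pd.DataFrame(
            rows, columns=["id", "original_document_id", "text"]
        ).set_index("id")
```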

@RobbeSneyders (Member) left a comment:
Thanks @PhilippeMoussalli! Looks good in general, left 2 comments.

For the testing, I think we need to get more used to writing tests for the transform method so we can do unit testing instead of integration testing.

@PhilippeMoussalli (Contributor, Author):
> Thanks @PhilippeMoussalli! Looks good in general, left 2 comments.
>
> For the testing, I think we need to get more used to writing tests for the transform method so we can do unit testing instead of integration testing.

I agree :) Added some tests.
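Building on the illustrative ChunkTextComponent sketched above, a unit test for transform could look roughly like this (small inputs, assertions on the per-chunk layout); it is a sketch under those same assumptions, not the tests added in this PR:

```python
import pandas as pd


def test_transform_chunks_each_document():
    # Two documents: a long one that forces multiple chunks, and a short one.
    input_df = pd.DataFrame(
        {"text": ["word " * 50, "a single short document"]},
        index=["doc_1", "doc_2"],
    )

    component = ChunkTextComponent(chunk_size=40, chunk_overlap=10)
    output_df = component.transform(input_df)

    # One row per chunk, each row pointing back at its source document.
    assert len(output_df) > len(input_df)
    assert set(output_df["original_document_id"]) == {"doc_1", "doc_2"}
    assert output_df.index[0] == "doc_1_1"
```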

@RobbeSneyders (Member) left a comment:
Thanks! Looks great!

PhilippeMoussalli merged commit 1dbfec2 into main on Oct 18, 2023.
PhilippeMoussalli deleted the chunker-component branch on October 18, 2023 at 14:39.