Chunker component #528
Conversation
data:
  type: string

produces:
What do you think about keeping a reference to the original document? E.g. doc_1 is chunked to doc_1_1, doc_1_2, ...
I can imagine that in some use cases it is useful to find the original document of a chunk.
We could also keep the text as it is and add an additional column chunks, which will be a list of chunks.
> What do you think about keeping a reference to the original document? E.g. doc_1 is chunked to doc_1_1, doc_1_2, ... I can imagine that in some use cases it is useful to find the original document of a chunk.
That's currently the case, since I take the original id of the document and append the chunk number to it here
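That id scheme (original document id plus chunk number) can be sketched with a small helper; make_chunk_ids is a hypothetical illustration, not the PR's actual code:

```python
def make_chunk_ids(doc_id: str, num_chunks: int) -> list:
    # Hypothetical helper: append 1-based chunk numbers to the original id,
    # e.g. doc_1 -> doc_1_1, doc_1_2, ...
    return [f"{doc_id}_{i}" for i in range(1, num_chunks + 1)]
```

With ids shaped like this, the original document of any chunk can be recovered by stripping the trailing chunk number.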
> We could also keep the text as it is and add an additional column chunks, which will be a list of chunks.

I think that's a good alternative, although thinking about the next component (embedding), I'd rather have one embedding per row than a list of embeddings. I think it also better matches how things will be stored in the vector database.
Agree with having one row per chunk. Adding original_document_id as a column makes sense though, I think.
Yes, it might be better to link back to the original document. Added!
(force-pushed from d6b5480 to 211d372)
components/chunk_text/src/main.py (Outdated)
logger = logging.getLogger(__name__)


def chunk_text(row, text_splitter: RecursiveCharacterTextSplitter) -> t.List[t.Tuple]:
I would move this function into the Component, so we can just access the text splitter as self.text_splitter instead of passing it around.
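The suggested refactor could look roughly like this; the class and method names are illustrative assumptions, not the actual fondant component API, and SimpleSplitter stands in for langchain's RecursiveCharacterTextSplitter:

```python
import typing as t


class SimpleSplitter:
    """Stand-in for RecursiveCharacterTextSplitter: fixed-size windows
    with a character overlap between consecutive chunks."""

    def __init__(self, chunk_size: int, chunk_overlap: int):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def split_text(self, text: str) -> t.List[str]:
        step = self.chunk_size - self.chunk_overlap
        return [text[i:i + self.chunk_size] for i in range(0, len(text), step)]


class ChunkTextComponent:
    """Sketch of the suggestion: the splitter lives on the component,
    so methods use self.text_splitter instead of receiving it as an argument."""

    def __init__(self, chunk_size: int = 500, chunk_overlap: int = 50):
        self.text_splitter = SimpleSplitter(chunk_size, chunk_overlap)

    def chunk_text(self, doc_id: str, text: str) -> t.List[t.Tuple[str, str]]:
        # No text_splitter parameter needed: it is an attribute of the component.
        return [
            (f"{doc_id}_{i}", chunk)
            for i, chunk in enumerate(self.text_splitter.split_text(text), start=1)
        ]
```

This keeps chunk_text's signature free of plumbing and matches how the splitter is configured once per component run.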
Thanks @PhilippeMoussalli! Looks good in general, left 2 comments.
For the testing, I think we need to get more used to writing tests for the transform method so we can do unit testing instead of integration testing.
I agree :) added some tests
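A unit test along those lines could look like this; transform here is a simplified stand-in for the component's actual method, just to show the pattern of testing it directly on a pandas DataFrame (one output row per chunk, with an original_document_id column, mirroring the discussion above):

```python
import pandas as pd


def transform(df: pd.DataFrame, chunk_size: int = 4, chunk_overlap: int = 2) -> pd.DataFrame:
    """Stand-in transform: explode each document into overlapping chunks,
    one row per chunk, keeping a link back to the original document."""
    rows = []
    step = chunk_size - chunk_overlap
    for doc_id, text in zip(df.index, df["text"]):
        for i in range(0, len(text), step):
            rows.append((f"{doc_id}_{i // step + 1}", doc_id, text[i:i + chunk_size]))
    return pd.DataFrame(
        rows, columns=["id", "original_document_id", "text"]
    ).set_index("id")


def test_transform_one_row_per_chunk():
    df = pd.DataFrame({"text": ["abcdefgh"]}, index=["doc_1"])
    out = transform(df)
    assert len(out) == 4  # 8 chars, size 4, overlap 2 -> 4 chunks
    assert list(out.index)[:2] == ["doc_1_1", "doc_1_2"]
    assert (out["original_document_id"] == "doc_1").all()
```

Tests like this exercise the chunking logic in isolation, without spinning up a whole pipeline run.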
(force-pushed from 4792548 to 572c8d9)
(force-pushed from a5f84b1 to fcf8664)
Thanks! Looks great!
PR that adds a basic chunking component that chunks text based on length and overlap, as well as an example RAG pipeline.
Taken from https://www.pinecone.io/learn/chunking-strategies/
Different chunking strategies are referenced there as well; it might be interesting to reference them in the CC demo for people to try to implement.
The starting dataset is arbitrary and can be changed later on; I just needed something for testing.
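The length-and-overlap strategy the component implements can be sketched in a few lines of plain Python (a simplified stand-in for the langchain splitter the component actually uses):

```python
def chunk(text: str, size: int = 500, overlap: int = 50) -> list:
    """Split text into windows of `size` characters, each overlapping
    the previous window by `overlap` characters."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
        start += size - overlap
    return chunks

# e.g. chunk("abcdefghij", size=4, overlap=2) -> ["abcd", "cdef", "efgh", "ghij"]
```

The overlap keeps context that straddles a chunk boundary available in both neighbouring chunks, which tends to help retrieval quality at the cost of some duplicated text.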