Chunker component #528
Conversation
data:
  type: string

produces:
What do you think about keeping a reference to the original document? E.g. doc_1 is chunked to doc_1_1, doc_1_2, ...
I can imagine that in some use cases it is useful to find the original document of a chunk.
We could also keep the text as it is and add an additional column chunks, which will be a list of chunks.
> What do you think about keeping a reference to the original document? E.g. doc_1 is chunked to doc_1_1, doc_1_2, ... I can imagine that in some use cases it is useful to find the original document of a chunk.
That's currently the case, since I take the original id of the document and append the chunk number to it here
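That id scheme (original document id plus chunk number) can be sketched with a small helper; make_chunk_ids is a hypothetical illustration, not the PR's actual code:

```python
def make_chunk_ids(doc_id: str, num_chunks: int) -> list:
    # Hypothetical helper: append 1-based chunk numbers to the original id,
    # e.g. doc_1 -> doc_1_1, doc_1_2, ...
    return [f"{doc_id}_{i}" for i in range(1, num_chunks + 1)]
```

With ids shaped like this, the original document of any chunk can be recovered by stripping the trailing chunk number.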
> We could also keep the text as it is and add an additional column chunks, which will be a list of chunks.

I think that's a good alternative, although thinking about the next component (embedding), I'd rather have one embedding per row than a list of embeddings. I think it also better matches how things will be stored in the vector database.
Agree with having one row per chunk. Adding original_document_id as a column makes sense though, I think.
Yes, it might be better to link back to the original document. Added!
(force-pushed from d6b5480 to 211d372)
components/chunk_text/src/main.py (Outdated)
logger = logging.getLogger(__name__)


def chunk_text(row, text_splitter: RecursiveCharacterTextSplitter) -> t.List[t.Tuple]:
I would move this function into the Component, so we can just access the text splitter as self.text_splitter instead of passing it around.
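The suggested refactor could look roughly like this; the class and method names are illustrative assumptions, not the actual fondant component API, and SimpleSplitter stands in for langchain's RecursiveCharacterTextSplitter:

```python
import typing as t


class SimpleSplitter:
    """Stand-in for RecursiveCharacterTextSplitter: fixed-size windows
    with a character overlap between consecutive chunks."""

    def __init__(self, chunk_size: int, chunk_overlap: int):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def split_text(self, text: str) -> t.List[str]:
        step = self.chunk_size - self.chunk_overlap
        return [text[i:i + self.chunk_size] for i in range(0, len(text), step)]


class ChunkTextComponent:
    """Sketch of the suggestion: the splitter lives on the component,
    so methods use self.text_splitter instead of receiving it as an argument."""

    def __init__(self, chunk_size: int = 500, chunk_overlap: int = 50):
        self.text_splitter = SimpleSplitter(chunk_size, chunk_overlap)

    def chunk_text(self, doc_id: str, text: str) -> t.List[t.Tuple[str, str]]:
        # No text_splitter parameter needed: it is an attribute of the component.
        return [
            (f"{doc_id}_{i}", chunk)
            for i, chunk in enumerate(self.text_splitter.split_text(text), start=1)
        ]
```

This keeps chunk_text's signature free of plumbing and matches how the splitter is configured once per component run.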
Thanks @PhilippeMoussalli! Looks good in general, left 2 comments.
For the testing, I think we need to get more used to writing tests for the transform method so we can do unit testing instead of integration testing.
I agree :) added some tests
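A unit test along those lines could look like this; transform here is a simplified stand-in for the component's actual method, just to show the pattern of testing it directly on a pandas DataFrame (one output row per chunk, with an original_document_id column, mirroring the discussion above):

```python
import pandas as pd


def transform(df: pd.DataFrame, chunk_size: int = 4, chunk_overlap: int = 2) -> pd.DataFrame:
    """Stand-in transform: explode each document into overlapping chunks,
    one row per chunk, keeping a link back to the original document."""
    rows = []
    step = chunk_size - chunk_overlap
    for doc_id, text in zip(df.index, df["text"]):
        for i in range(0, len(text), step):
            rows.append((f"{doc_id}_{i // step + 1}", doc_id, text[i:i + chunk_size]))
    return pd.DataFrame(
        rows, columns=["id", "original_document_id", "text"]
    ).set_index("id")


def test_transform_one_row_per_chunk():
    df = pd.DataFrame({"text": ["abcdefgh"]}, index=["doc_1"])
    out = transform(df)
    assert len(out) == 4  # 8 chars, size 4, overlap 2 -> 4 chunks
    assert list(out.index)[:2] == ["doc_1_1", "doc_1_2"]
    assert (out["original_document_id"] == "doc_1").all()
```

Tests like this exercise the chunking logic in isolation, without spinning up a whole pipeline run.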
(force-pushed from 4792548 to 572c8d9)
(force-pushed from a5f84b1 to fcf8664)
Thanks! Looks great!
PR that adds a basic chunking component that chunks text based on length and overlap, as well as an example RAG pipeline.
Taken from https://www.pinecone.io/learn/chunking-strategies/
Different chunking strategies are referenced there as well; it might be interesting to reference them in the CC demo for people to try to implement.
The starting dataset is arbitrary and can be changed later on; I just needed something for testing.
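The length-and-overlap strategy the component implements can be sketched in a few lines of plain Python (a simplified stand-in for the langchain splitter the component actually uses):

```python
def chunk(text: str, size: int = 500, overlap: int = 50) -> list:
    """Split text into windows of `size` characters, each overlapping
    the previous window by `overlap` characters."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
        start += size - overlap
    return chunks

# e.g. chunk("abcdefghij", size=4, overlap=2) -> ["abcd", "cdef", "efgh", "ghij"]
```

The overlap keeps context that straddles a chunk boundary available in both neighbouring chunks, which tends to help retrieval quality at the cost of some duplicated text.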