Split by Tokens instead of characters: RecursiveCharacterTextSplitter #4678
Comments
The LLMs in langchain have a token count function. It's easier to use that than to bring in another library that doesn't know the specifics of the model. That said, I noticed most of the LLMs don't implement their own and rely on the base LLM class instead, which uses the transformers library to count tokens. I've been using llama.cpp and have been meaning to open a pull request to add its built-in token counting; it's ~3-4 lines of code. TL;DR: I agree the default should be token count rather than character count, but I'd prefer an alternate implementation.
RecursiveCharacterTextSplitter (and others inheriting from TextSplitter) all support custom length functions, and even have convenient constructors for wiring in a tokenizer. The issue I've noticed with those is that there's no way to take advantage of batch tokenization; instead each split is tokenized separately (and then re-tokenized during merge). This is particularly problematic for huggingface "fast" tokenizers, where you can get significant speed improvements from batching. I'm looking at adding a batched length function to TextSplitter; experimentally I've found it can run 5x faster than the current implementation. A rough sketch of the batching idea is below.
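As a minimal illustration of why batching helps, here is a sketch of a batched length function built on a Hugging Face fast tokenizer. The model name and function name are illustrative assumptions, not part of langchain's API:

```python
from transformers import AutoTokenizer

# Any fast tokenizer works; "bert-base-uncased" is just an example choice.
tok = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

def batched_token_lengths(texts):
    # One call tokenizes the whole batch, letting the Rust-backed fast
    # tokenizer work in parallel instead of encoding each split one by one.
    enc = tok(texts, add_special_tokens=False)
    return [len(ids) for ids in enc["input_ids"]]
```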
Some preliminary thoughts here:
@zs-dima
@sayan1999
Or it could be nice to have an option to pass a token count instead of a character count to these functions.
So, the length function determines how chunks are counted for size and overlap. By default the length function counts character by character. In my example, chunk size and overlap are counted in tokens; a sketch of that idea follows.
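A minimal sketch of a token-based length function passed to the splitter; the tiktoken encoding and the chunk sizes here are assumptions for illustration:

```python
import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

enc = tiktoken.get_encoding("cl100k_base")  # pick an encoding matching your model

splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,        # now interpreted as 400 tokens, not 400 characters
    chunk_overlap=40,      # overlap is measured in tokens as well
    length_function=lambda text: len(enc.encode(text)),
)
chunks = splitter.split_text("some long document text ...")
```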
Found an easier alternative; please check https://stackoverflow.com/a/77042212/13717851
I struggled with this too, and decided to subclass the NLTK splitter with a tiktoken length function, then just init the super with that as the length_function kwarg. Works exactly how I expected. You can also do the same with separators. But I'd just plan on using NLTK since it gives you full sentences; much better chunking. A sketch of the subclass is below.
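Something like the following is what that subclassing approach might look like; the class name and encoding choice are illustrative assumptions (requires the nltk package and its punkt data):

```python
import tiktoken
from langchain.text_splitter import NLTKTextSplitter

class TokenLengthNLTKSplitter(NLTKTextSplitter):
    """NLTK sentence-based splitter whose chunk size is measured in tiktoken tokens."""

    def __init__(self, encoding_name: str = "cl100k_base", **kwargs):
        enc = tiktoken.get_encoding(encoding_name)
        # Pass the token-counting length function up to TextSplitter via kwargs.
        super().__init__(length_function=lambda text: len(enc.encode(text)), **kwargs)

splitter = TokenLengthNLTKSplitter(chunk_size=400, chunk_overlap=40)
```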
Really? Of all the solutions, I found this the easiest. Would you like to elaborate on where the issue is?
I believe the issue is still the length function being string length. Does that actually chunk on token length and not character length? That's the crux.
My particular ask is to have it callable as a class method so that each LLM can provide its appropriate tokenizer, since they are not all the same. Also, not all setups and workflows allow for online access to the mentioned tokenizers.
It would be great to be able to combine those benefits. See AutoTokenizer as another class that retrieves that tokenizing knowledge for a particular model.
It does work with any tokenizer; that's why the length function is customizable, e.g. with AutoTokenizer, or with a fully custom tokenizer. A sketch of both options is below.
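A minimal sketch of the two options, with a Hugging Face model name assumed purely for illustration:

```python
from transformers import AutoTokenizer
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Option 1: let langchain wire up a Hugging Face tokenizer for you.
hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer=hf_tokenizer, chunk_size=400, chunk_overlap=40
)

# Option 2: any custom tokenizer, via a plain length function.
def my_token_count(text: str) -> int:
    return len(text.split())  # stand-in for whatever tokenizer you actually use

custom_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400, chunk_overlap=40, length_function=my_token_count
)
```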
Thanks a lot @bhperry, it wasn't clear to me that it worked for the subclasses like RCTS, so I'm very happy to have that confirmed.
@bhperry Are there recommendations for settings we need to apply to the splitter or tokenizer when using from_huggingface_tokenizer? For example, see the example below, which we would expect to divide the input text into probably 2 splits, given the chunk_size of 40 tokens:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from transformers import AutoTokenizer

text = "This is some text. It should be split into chunks that don't exceed a limit of 40. When I say 'chunks', they're not necessarily on word boundaries but are defined by the tokenizer of the embedding model itself."

my_tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-small")
display("Number of tokens in entire input string:", len(my_tokenizer(text).tokens()))

my_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer=my_tokenizer, chunk_size=40, chunk_overlap=0
)
my_documents = [Document(page_content=text)]
my_splits = my_splitter.split_documents(my_documents)

display("Splits:", my_splits)
display("Split lengths in terms of tokens:", list(map(lambda x: len(my_tokenizer(x.page_content).tokens()), my_splits)))
```
There are more splits than that, and they're way shorter than expected. Doing a bit of debugging, I think that's because the special classification tokens the tokenizer adds are also included in the length calculation. For example, the tokenizer takes the first token as 'this', but the full encoding also includes the added special tokens.
Passing add_special_tokens=False when tokenizing inside the length function should do the trick.
Also worth noting, the recursive splitter does tend towards over-splitting: a split of exactly chunk length will be split again if it can be (not sure why that was chosen, but that's the way it is, last I checked).
Ah, nice! Little tweak: you just need to count the list length:

```python
def length_function(text):
    return len(tokenizer.encode(text, add_special_tokens=False))
```

Thank you!
Ah yeah, forgot that part.
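For completeness, here is a sketch of how that corrected length function might be wired into the splitter from the earlier example; it reuses the tokenizer, length_function, and my_documents names defined in the snippets above, with the same illustrative chunk size:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=40,                     # now counted in tokens, special tokens excluded
    chunk_overlap=0,
    length_function=length_function,   # the special-token-free counter defined above
)
splits = splitter.split_documents(my_documents)
```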
@bhperry Thanks for implementing this. I tested it with a script and it seems to work fine. To do this, I checked out the PR #5589 and used the result locally with my script. I haven't measured the speedup, but it is quite noticeable. The example code I used is in this GitHub repo.
@bhperry Sorry, I have to take that back. I repeated the steps with the method of defining the tokenizer via kwargs in RecursiveCharacterTextSplitter and then measured my script's execution duration; the script needs to do a lot of tokenization in parallel because it uses the text splitter heavily. Looking at the results, the method of using the Huggingface AutoTokenizer.from_pretrained(), implementing a custom length function that measures the fragments in units of tokens, and then passing that via the length_function kwarg turned out to be faster for me.
@Wolfsauge Did you mean #5583? Your results are interesting. Are the different runs all on the same document? That doesn't quite make sense if they are, since I would expect the first and third to take roughly the same amount of time (as you noted in your repo, the default batched mode from my PR only works with a fast tokenizer).
If they are different documents, or different sections of the same document, then the durations are not directly comparable: the number of times it needs to tokenize, and the amount of batching that can be done, depend heavily on the splits that get created for a given section of text.
@bhperry Hello, thanks for your reply. Maybe I am not using the text splitter correctly yet, or have not been clear. Yes, I used the same text as input and ran it with different ways of instantiating the RecursiveCharacterTextSplitter. In this experiment, I was trying to find the best way to do my tokenization tasks by running a script which makes heavy use of tokenization indirectly (because it splits a lot) on the same input file (pg84 is Frankenstein; story-0904 was another random text I chose because of its size in KB/tokens).

The fastest results I can get at the moment are when I supply the RecursiveCharacterTextSplitter class with a custom length function based on the HF fast tokenizer. In the length function I just call the tokenizer and count the elements of the input_ids result. The number of elements in that result is roughly the size of the chunk measured in units of the model's tokens, which is the "length" the splitter is looking for. How that works has already been shown in this thread.

When using RecursiveCharacterTextSplitter with tokenizer = AutoTokenizer.from_pretrained(model, use_fast=True) instead, my results weren't great yet. Currently, I see my script.py running as 1 process with ~18 threads. I guess that is my error. Do I need to run many tokenizers in parallel to speed things up, and also make sure each of my RecursiveCharacterTextSplitter instances has 1 CPU to itself? Or how am I supposed to get the most out of this new "batched" feature? Has that been covered in a previous conversation, or is there some example code that makes use of the batching and parallelism? I'd like to understand this better, because the tokenization is so computationally expensive in my script, but it's just a free-time project. Thanks for any hints on the topic. Regards

```python
from icecream import ic
from langchain.text_splitter import RecursiveCharacterTextSplitter, TextSplitter
from transformers import AutoTokenizer, LlamaTokenizerFast


def get_length_of_chunk_in_tokens(my_chunk: str, buck_slip: dict) -> int:
    # Tokenize the chunk and use the number of input ids as its length.
    my_result = buck_slip["tokenizer"](my_chunk)
    input_ids = my_result.input_ids
    length_of_chunk_in_tokens = len(input_ids)
    return length_of_chunk_in_tokens


def get_tokenizer(buck_slip: dict) -> LlamaTokenizerFast:
    tokenizer = AutoTokenizer.from_pretrained(
        buck_slip["model_identifier"], use_fast=buck_slip["use_fast"]
    )
    ic(type(tokenizer))
    ic(tokenizer.is_fast)
    buck_slip["tokenizer.is_fast"] = tokenizer.is_fast

    encoding = tokenizer("My name is Sylvain and I work at Hugging Face in Brooklyn.")
    ic(type(encoding))
    ic(encoding.is_fast)
    buck_slip["encoding.is_fast"] = encoding.is_fast

    return tokenizer


def get_text_splitter(
    buck_slip: dict, custom_chunk_size: int, custom_chunk_overlap: int
) -> TextSplitter:
    batched_tokenization = buck_slip["use_batched_tokenization"]
    if batched_tokenization is True:
        # Batched mode: let the splitter drive the Hugging Face tokenizer directly.
        text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
            tokenizer=buck_slip["tokenizer"],
            chunk_size=custom_chunk_size,
            chunk_overlap=custom_chunk_overlap,
        )
    else:
        # Unbatched mode: measure every candidate split with the custom length function.
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=custom_chunk_size,
            chunk_overlap=custom_chunk_overlap,
            length_function=lambda x: get_length_of_chunk_in_tokens(x, buck_slip),
        )
    ic(type(text_splitter))
    return text_splitter
```
This wasn't an official feature, just something I put together based on my own observations, and since it wasn't getting any traction from reviewers I moved on to other things. As noted in this article, https://huggingface.co/learn/nlp-course/chapter6/3, the fast tokenizer's batching really only starts to shine when it is given large batches, so it's definitely possible that for some texts there are simply not enough chunks per batch to make it worthwhile. I'll revisit the PR at some point and see if I can improve on the original idea.
A question about an error: why does it produce the "maximum sequence length for this model (549 > 512)" error?
That's just a warning from the huggingface tokenizer. It tokenizes the full text in order to determine where to split it, then splits down to chunk size. You can safely ignore that warning when the tokenizer is used in this way; it only matters when you are using the tokens as input to the associated model.
I read this thread but I'm still confused. I have the following code, but it still splits on character count instead of token count. So is this not a viable method? Am I missing some arguments?
I'm less familiar with tiktoken, but looking at the function definition it appears to be doing the right thing (note the token-based length function it sets up). Are you using the right encoding_name? The default may not match your model.
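For reference, a minimal sketch of the tiktoken-based constructor with an explicit encoding name; the encoding and sizes here are illustrative assumptions:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # pick the encoding that matches your model
    chunk_size=400,               # measured in tokens of that encoding
    chunk_overlap=40,
)
chunks = splitter.split_text("some long document text ...")
```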
@alch00001 Hi, I noticed that the sentences were being cut differently than I intended, so I checked the code. It turned out that the implementation was adding to the total length (total) directly, instead of always re-measuring the token length. So I modified that part to let the tokenizer recalculate the length, and then I was able to get the desired results. Could you check whether it's the same issue and try modifying the code on your end? I'd appreciate it if you could also share the results after you've tried.
I think the _merge_splits function within langchain.text_splitter has a problem.
Yes, I tried multiple encodings. I tried other splitters as well, but even the TokenTextSplitter does not give me good results; I might just write my own splitter. Could you be a little more specific about the part you modified and post your own code? Do you mean the _length_function or the _merge_splits?
Looking at the splitter, there's a process where it cuts sentences and checks their lengths before combining them. However, in the existing combination process (_merge_splits), the length of each sentence is measured just once through the tokenizer and then added to the running total of the previous sentences. I believe that after merging the sentences, we should recalculate the length using the tokenizer again, because in some languages the token length can differ when words are combined compared to when they are separate. In other words, the current method is effectively total = len(word1) + len(word2), but I think it should be total = len(word1 + word2), where len is the length measured by the tokenizer. The sketch below illustrates why the two are not always equal.
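A small illustration of that non-additivity, assuming a Hugging Face tokenizer (the model name and example strings are arbitrary choices):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

def token_len(text: str) -> int:
    return len(tok.encode(text, add_special_tokens=False))

a, b = "internation", "alization"
separate = token_len(a) + token_len(b)  # lengths measured on the pieces
merged = token_len(a + b)               # length measured on the combined text
# With subword tokenizers these two counts are not guaranteed to be equal,
# which is why re-measuring after merging can give different chunk sizes.
print(separate, merged)
```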
I'm not sure if this aligns exactly with the problem you're facing, but I encountered an issue where the sentences were being cut differently from the length calculated by the tokenizer: the sentences were being split shorter than when I measured them separately. When I changed it to recalculate the length on the combined sentences rather than summing the separately measured words, the issue was resolved. If the problem you have is different, could you share an example sentence? I've become curious while responding.
@codinseok
Feature request
LLMs usually limit text by tokens.
It may be useful to split a large text into chunks according to the number of tokens rather than the number of characters.
For example, if the LLM allows us to use 8000 tokens and we want to split the text into chunks of up to 4000 tokens, we could call something like RecursiveCharacterTextSplitter(chunk_tokens=4000).
Motivation
If we split a text by number of characters, it is not obvious how many tokens those chunks will contain.
At the same time, if we want to split a text into the largest possible chunks while keeping each chunk under a certain LLM token limit, we cannot operate by number of characters.
Your contribution
As an example, for a RecursiveCharacterTextSplitter(chunk_tokens=...) implementation, a very useful library that helps split text into tokens is https://github.com/openai/tiktoken. A sketch of how it could back such a splitter is below.
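A rough sketch of the idea; the chunk_tokens parameter does not exist in langchain, so this only shows how tiktoken could supply the token counts that such a parameter would need, using today's length_function kwarg:

```python
import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

enc = tiktoken.get_encoding("cl100k_base")  # choose the encoding used by your LLM

# Approximating the proposed chunk_tokens=4000 behaviour with the current API:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=4000,   # interpreted as tokens via the length function below
    chunk_overlap=0,
    length_function=lambda text: len(enc.encode(text)),
)
chunks = splitter.split_text("a very large document ...")
```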