
Split by Tokens instead of characters: RecursiveCharacterTextSplitter #4678

Closed
zs-dima opened this issue May 14, 2023 · 35 comments

@zs-dima commented May 14, 2023

Feature request

LLMs usually limit input by tokens.
It may be useful to split a large text into chunks according to the number of tokens rather than the number of characters.
For example, if the LLM allows us to use 8000 tokens and we want to split the text into chunks of up to 4000 tokens, we could call

text_splitter = RecursiveCharacterTextSplitter(chunk_tokens=4000, ...

Motivation

If we split a text by number of characters, it is not obvious how many tokens each chunk will contain.
At the same time, if we want to split a text into the largest possible chunks while keeping each chunk under a given LLM token limit, we cannot operate on character counts.

Your contribution

As an example for a RecursiveCharacterTextSplitter(chunk_tokens=...) implementation, there is a very useful library that helps split text into tokens:
https://github.com/openai/tiktoken

import tiktoken

def split_large_text(large_text, max_tokens):
    # cl100k_base is the encoding used by gpt-3.5-turbo and gpt-4.
    enc = tiktoken.get_encoding("cl100k_base")
    tokenized_text = enc.encode(large_text)

    chunks = []
    current_chunk = []
    current_length = 0

    # Accumulate tokens until a chunk reaches max_tokens, then decode it back to text.
    for token in tokenized_text:
        current_chunk.append(token)
        current_length += 1

        if current_length >= max_tokens:
            chunks.append(enc.decode(current_chunk).rstrip(' .,;'))
            current_chunk = []
            current_length = 0

    # Flush the final partial chunk.
    if current_chunk:
        chunks.append(enc.decode(current_chunk).rstrip(' .,;'))

    return chunks
@s7726 (Contributor) commented May 27, 2023

The LLMs in LangChain have a token-count function. It's easier to use that than to bring in another library that doesn't know the specifics of the model.

That said, I noticed most of the LLMs don't implement their own and instead rely on the base LLM class, which uses the transformers library to count tokens.

I've been using llama.cpp and have been meaning to open a pull request to add its built-in token counting. It's ~3-4 lines of code.

TL;DR: I agree the default should be token count rather than character count. I'd prefer an alternate implementation.
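For illustration, a minimal sketch of that idea, reusing a LangChain LLM's own get_num_tokens as the splitter's length_function (the OpenAI model, chunk numbers, and text.txt path are placeholders):

from langchain.llms import OpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter

llm = OpenAI()  # any LangChain LLM exposing get_num_tokens(); OPENAI_API_KEY must be set

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=4000,  # interpreted in tokens because of the length_function below
    chunk_overlap=200,
    length_function=llm.get_num_tokens,
)
chunks = text_splitter.split_text(open("text.txt").read())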

@bhperry (Contributor) commented May 30, 2023

RecursiveCharacterTextSplitter (and others inheriting from TextSplitter) all support custom length functions, and even have the convenient from_huggingface_tokenizer/from_tiktoken_encoder classmethods.

One issue I've noticed with those is that there's no way to take advantage of batch tokenization; instead each split is tokenized separately (and then re-tokenized during the merge). That is particularly problematic for Hugging Face "fast" tokenizers, where batching gives significant speed improvements.

I'm looking at adding a batched length function to TextSplitter. Experimentally I've found it can run 5x faster than the current implementation.
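For example, a minimal use of the from_tiktoken_encoder classmethod (the encoding name, chunk sizes, and file path below are placeholders):

from langchain.text_splitter import RecursiveCharacterTextSplitter

# chunk_size and chunk_overlap are measured in tiktoken tokens here, not characters.
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=4000,
    chunk_overlap=200,
)
chunks = text_splitter.split_text(open("text.txt").read())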

@runonthespot

Some preliminary thoughts here:
Splitting recursively on \n\n, \n, etc. for OpenAI (tiktoken), I noticed that there are lots of edge cases to consider:
- \n\n is a single token
- \n is a token
- many words appear both with a leading space and without (different tokens).
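A quick way to inspect these edge cases (the exact token ids depend on the encoding):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
# Compare how separators and leading-space word variants tokenize.
for s in ["\n\n", "\n", "word", " word"]:
    print(repr(s), enc.encode(s))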

@sayan1999 commented Sep 3, 2023

@zs-dima
If I understand your requirement properly, you are looking for a way to use RecursiveCharacterTextSplitter with a restriction on the number of tokens per chunk.
You can do that by using the length_function parameter of the RecursiveCharacterTextSplitter class.
Look at the example below:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-large")
bart_tok_len = lambda x: len(tok(x)["input_ids"])


def chunkify_bart(text):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1024,
        chunk_overlap=50,
        length_function=bart_tok_len,
        is_separator_regex=False,
    )
    chunks = text_splitter.create_documents([text])

    print(" ".join([chunk.page_content for chunk in chunks]))
    print([bart_tok_len(chunk.page_content) for chunk in chunks])


chunkify_bart(open("text.txt").read())

@zs-dima (Author) commented Sep 3, 2023

@sayan1999
Thanks a lot for the answer.

  1. It looks a bit overcomplicated.
  2. chunk_size is in characters, which limits the functionality.
  3. chunk_overlap is in characters.

Or are chunk_size and chunk_overlap not in characters, but defined by the length_function?
It is not clear from the documentation currently:
Recursively split by character
https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter

It would be nice to have an option to pass a token count instead of a character count to these functions.

@sayan1999 commented Sep 3, 2023

So, the length function determines the way chunks are counted for size and overlap.

By default, the length function counts character by character.

In my example, chunk size and overlap will be counted in tokens.

@sayan1999

Found an easier alternative; please check https://stackoverflow.com/a/77042212/13717851

@JackLeick

I struggled with this too, and decided to subclass the NLTK splitter with a tiktoken length function, then just init the super with that as the length_function kwarg. Works exactly how I expected. You can also do the same with separators. But I'd just plan on using NLTK since it gives you full sentences. Much better chunking.
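A minimal sketch of that subclassing approach (the class name here is made up, and NLTK's punkt data must already be downloaded):

import tiktoken
from langchain.text_splitter import NLTKTextSplitter


class TiktokenNLTKTextSplitter(NLTKTextSplitter):
    # NLTK sentence splitting, with lengths measured in tiktoken tokens.
    def __init__(self, encoding_name="cl100k_base", **kwargs):
        enc = tiktoken.get_encoding(encoding_name)
        super().__init__(length_function=lambda text: len(enc.encode(text)), **kwargs)


splitter = TiktokenNLTKTextSplitter(chunk_size=500, chunk_overlap=50)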

@sayan1999 commented Oct 31, 2023

Really? Of all the solutions, I found this the easiest; would you like to elaborate on where the issue is?
I am using this in all my projects now, as it is fundamental for all my LLM projects.

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1024, chunk_overlap=50
)

chunks = text_splitter.split_text(long_text)

@JackLeick

I believe the issue is still that the length function is string length. Does that actually chunk on token length and not character length? That's the crux.

@bhperry (Contributor) commented Nov 6, 2023

The from_tiktoken_encoder and from_huggingface_tokenizer classmethods both use a token-based length_function.

@s7726 (Contributor) commented Nov 6, 2023

The from_tiktoken_encoder and from_huggingface_tokenizer classmethods both use a token-based length_function.

My particular ask is to have it called as a class method so that each LLM can provide its appropriate tokenizer, since they are not all the same. Also, not all setups and workflows allow online access to the mentioned tokenizers.

@ind1go commented Nov 7, 2023

It would be great to be able to mix the benefit of RecursiveCharacterTextSplitter (i.e. splitting by useful things like sentences and paragraphs) with custom tokenisers (i.e. the knowledge that a chunk is making the best use of a particular embedding model or LLM).

See AutoTokenizer as another class that retrieves that tokenising knowledge for a particular model.

@bhperry (Contributor) commented Nov 7, 2023

It does work with any tokenizer. That's why there's a length_function kwarg, so you can define the length of a string in any way that you choose. from_tiktoken_encoder and from_huggingface_tokenizer are just helper classmethods that define that function for you based on the particular type of tokenizer.

e.g. with AutoTokenizer:

tokenizer = AutoTokenizer.from_pretrained("some/model")
splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(tokenizer)

or for a custom tokenizer:

tokenizer = MyCustomTokenizer(...)


def custom_length_function(text):
  return len(tokenizer.encode(text))


splitter = RecursiveCharacterTextSplitter(length_function=custom_length_function)

@ind1go commented Nov 7, 2023

Thanks a lot @bhperry; it wasn't clear to me that it worked for subclasses like RCTS, so I'm very happy to have that confirmed.

@ind1go commented Nov 8, 2023

@bhperry Are there recommendations for settings we need to apply to the splitter or tokenizer when attempting to use RecursiveCharacterTextSplitter with a custom length function? The calculation isn't coming out correctly, and I think it's because of the classification tokens that are included in each token calculation.

For example, see the code below, which we would expect to divide the input text into probably 2 splits, given that the thenlper/gte-small tokenizer splits by word most of the time:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from transformers import AutoTokenizer

text = "This is some text. It should be split into chunks that don't exceed a limit of 40. When I say 'chunks', they're not necessarily on word boundaries but are defined by the tokenizer of the embedding model itself."
my_tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-small")
display("Number of tokens in entire input string:", len(my_tokenizer(text).tokens()))

my_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(tokenizer=my_tokenizer, chunk_size=40, chunk_overlap=0)
my_documents = [Document(page_content=text)]

my_splits = my_splitter.split_documents(my_documents)
display("Splits:", my_splits)
display("Split lengths in terms of tokens:", list(map(lambda x: len(my_tokenizer(x.page_content).tokens()), my_splits)))
Number of tokens in entire input string: 53

Splits:
[Document(page_content='This is some text. It should be split'),
 Document(page_content="into chunks that don't exceed a limit of"),
 Document(page_content="40. When I say 'chunks', they're not"),
 Document(page_content='necessarily on word boundaries but are defined by'),
 Document(page_content='the tokenizer of the embedding model itself.')]

Split lengths in terms of tokens: [11, 12, 15, 10, 13]

There are more splits than expected, and they're way shorter than expected; doing a bit of debugging, I think that's because the classification tokens are also added in the calculations.

For example, the tokenizer takes the first token as 'this', but the full encoding of it is ['[CLS]', 'this', '[SEP]'], which costs 3 towards my per-chunk budget of 40; then the next is ['[CLS]', 'is', '[SEP]'], and again that adds another three, whereas in reality that's not how the sentence as a whole would be embedded.
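A quick way to see that overhead, reusing my_tokenizer from the snippet above:

print(my_tokenizer.encode("this"))                            # ids include [CLS] and [SEP]
print(my_tokenizer.encode("this", add_special_tokens=False))  # just the id(s) for "this"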

@bhperry (Contributor) commented Nov 8, 2023

def length_function(text):
  return len(tokenizer.encode(text, add_special_tokens=False))

splitter = RecursiveCharacterTextSplitter(length_function=length_function, ...)

should do the trick

@bhperry (Contributor) commented Nov 8, 2023

Also worth noting: the recursive splitter does tend towards over-splitting. A split of exactly chunk length will be split again if it can be (not sure why that was chosen, but that's the way it was last I checked).

@ind1go commented Nov 8, 2023

Ah, nice!

Little tweak, just need to count the list length:

def length_function(text):
  return len(tokenizer.encode(text, add_special_tokens=False))

Thank you!

@bhperry (Contributor) commented Nov 8, 2023

ah yeah forgot that part

@Wolfsauge

@bhperry Thanks for implementing this. I tested it with a script and it seems to work fine. To do this, I checked out the PR #5589 and used the result locally with my script. I haven't measured the speedup, but it is quite noticeable. The example code I used is in this GitHub repo.

@Wolfsauge

@bhperry Sorry, I have to take that back. I repeated the steps with the method of defining the tokenizer via kwargs in RecursiveCharacterTextSplitter and then measured my script's execution duration; the script needs to do a lot of tokenization in parallel because it uses the text splitter.

Looking at the results, the method of using the Hugging Face AutoTokenizer.from_pretrained(), implementing a custom length function that measures the fragments in units of tokens, and passing it via the kwarg length_function=lambda x: get_length_of_chunk_in_tokens(x) to the RecursiveCharacterTextSplitter (as shown in posts above mine) gives me faster results than defining my RecursiveCharacterTextSplitter with the tokenizer keyword argument directly. I added the results of my measurements to the bottom of the GitHub page linked above.

@bhperry (Contributor) commented Nov 20, 2023

@Wolfsauge Did you mean #5583?

Your results are interesting. Are the different documents all the same? Doesn't quite make sense if they are, since I would expect the first and third to take roughly the same amount of time (as you noted on your repo, the default batched mode from my PR only works with a fast tokenizer).

story-0904-analysis-jondurbin_airoboros-m-7b-3.1.2-0023.json:
    "use_fast": false,
    "use_batched_tokenization": false,
    "tokenizer.is_fast": false,
    "summarize_duration_seconds": 95.42

story-0904-analysis-jondurbin_airoboros-m-7b-3.1.2-0024.json:
    "use_fast": true,
    "use_batched_tokenization": false,
    "tokenizer.is_fast": true,
    "summarize_duration_seconds": 34.86

story-0904-analysis-jondurbin_airoboros-m-7b-3.1.2-0025.json:
    "use_fast": false,
    "use_batched_tokenization": true,
    "tokenizer.is_fast": false,
    "summarize_duration_seconds": 37.23

story-0904-analysis-jondurbin_airoboros-m-7b-3.1.2-0026.json:
    "use_fast": true,
    "use_batched_tokenization": true,
    "tokenizer.is_fast": true,
    "summarize_duration_seconds": 43.5

If they are different documents, or different sections of the same document, then the durations are not necessarily directly comparable. The number of times it needs to tokenize and the amount of batching that can be done is heavily dependent on the splits that get created for a given section of text.

@Wolfsauge

@bhperry Hello, thanks for your reply.

Maybe I am not using the text splitter correctly yet, or have not been clear. Yes, I've used the same text as input and ran it with different ways of instantiating the RecursiveCharacterTextSplitter.

In this experiment, I was trying to find the best way to do my tokenization tasks by running a script which makes heavy use of tokenization indirectly (because it splits a lot), on the same input file (pg84 is Frankenstein; story-0904 was another random text I chose because of its size in KB/tokens).

The fastest results I can get at the moment are when I supply the RecursiveCharacterTextSplitter class with a custom length function based on the HF fast tokenizer. In the length function I just call the tokenizer and count the elements of the input_ids result. The number of elements in the result is roughly the size of the chunk measured in the model's own tokens, which is the "length" the splitter is looking for. How this works is already shown in this thread.

When using RecursiveCharacterTextSplitter with a tokenizer = AutoTokenizer.from_pretrained(model, use_fast=True) instead, my results weren't great yet.

Currently, I see my script.py running as 1 process with ~18 threads. I guess that is my error. Do I need to run many tokenizers in parallel to speed things up, and also make sure each of my RecursiveCharacterTextSplitter instances has 1 CPU to itself? Or how am I supposed to get the most out of this new "batched" feature? Has that been covered in a previous conversation, or is there some example code that makes use of the batching and parallelism?

I'd like to understand this better, because the tokenization is computationally so expensive in my script, but it's just a free-time project.

Thanks for any hints on the topic.

Regards

from icecream import ic
from langchain.text_splitter import RecursiveCharacterTextSplitter, TextSplitter
from transformers import AutoTokenizer, LlamaTokenizerFast


def get_length_of_chunk_in_tokens(my_chunk: str, buck_slip: dict) -> int:
    # Measure the chunk length in the model's tokens via the tokenizer carried in buck_slip.
    my_result = buck_slip["tokenizer"](my_chunk)
    input_ids = my_result.input_ids
    length_of_chunk_in_tokens = len(input_ids)

    return length_of_chunk_in_tokens


def get_tokenizer(buck_slip: dict) -> LlamaTokenizerFast:
    tokenizer = AutoTokenizer.from_pretrained(
        buck_slip["model_identifier"], use_fast=buck_slip["use_fast"]
    )
    ic(type(tokenizer))
    ic(tokenizer.is_fast)
    buck_slip["tokenizer.is_fast"] = tokenizer.is_fast

    encoding = tokenizer("My name is Sylvain and I work at Hugging Face in Brooklyn.")
    ic(type(encoding))
    ic(encoding.is_fast)
    buck_slip["encoding.is_fast"] = encoding.is_fast

    return tokenizer


def get_text_splitter(
    buck_slip: dict, custom_chunk_size: int, custom_chunk_overlap: int
) -> TextSplitter:
    batched_tokenization = buck_slip["use_batched_tokenization"]

    if batched_tokenization is True:
        text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
            tokenizer=buck_slip["tokenizer"],
            chunk_size=custom_chunk_size,
            chunk_overlap=custom_chunk_overlap,
        )
    else:
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=custom_chunk_size,
            chunk_overlap=custom_chunk_overlap,
            length_function=lambda x: get_length_of_chunk_in_tokens(x, buck_slip),
        )
    ic(type(text_splitter))

    return text_splitter

@bhperry (Contributor) commented Nov 21, 2023

This wasn't an official feature, just something I put together based on my own observations. And since it wasn't getting any traction from reviewers, I moved on to other things.

As noted in this article https://huggingface.co/learn/nlp-course/chapter6/3, the fast tokenizer's batching really only starts to shine when it is given large batches. So it's definitely possible that for some texts there simply aren't enough chunks per batch to make it worthwhile. I'll revisit this PR at some point and see if I can improve upon the original idea.

@codinseok

import pandas as pd
from transformers import AutoTokenizer
from langchain.text_splitter import RecursiveCharacterTextSplitter

tokenizer = AutoTokenizer.from_pretrained('intfloat/multilingual-e5-large')

def length_function(text):
    return len(tokenizer.encode(text, add_special_tokens=False))

text_splitter = RecursiveCharacterTextSplitter(
    length_function=length_function,
    chunk_size=250,
    chunk_overlap=0
)


This produces the following warning and error:

Token indices sequence length is longer than the specified maximum sequence length for this model (549 > 512). Running this sequence through the model will result in indexing errors
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3801             try:
-> 3802                 return self._engine.get_loc(casted_key)
   3803             except KeyError as err:

10 frames
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'para'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
KeyError: 'para'

During handling of the above exception, another exception occurred:

KeyboardInterrupt                         Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py in __getitem__(self, key)
   5320             return getitem(key)
   5321 
-> 5322         if isinstance(key, slice):
   5323             # This case is separated from the conditional above to avoid
   5324             # pessimization com.is_bool_indexer and ndim checks.

KeyboardInterrupt:


Why does it raise the "maximum sequence length for this model (549 > 512)" error?
I set chunk_size = 250.

@bhperry (Contributor) commented Jan 4, 2024

That's just a warning from the Hugging Face tokenizer. It tokenizes the full text in order to determine where to split it, then splits it down to chunk size. You can safely ignore that warning when the tokenizer is used in this way; it only matters when you are using the tokens as input to the associated model.
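If the warning is noisy, one way to silence it (assuming it is emitted through the transformers logger, which it normally is) is to lower the library's log verbosity before splitting:

from transformers.utils import logging as hf_logging

# Note: this hides all transformers warnings, not only the sequence-length one.
hf_logging.set_verbosity_error()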

@alch00001 commented Mar 11, 2024

I read this thread, but I'm still confused. I have the following code:

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(separators=["\n\n", "\n", " "], chunk_size=500, chunk_overlap=50)
texts = text_splitter.split_documents(documents)

But it still seems to split on character count instead of token count. So is this not a viable method? Am I missing some arguments?

@bhperry (Contributor) commented Mar 11, 2024

Less familiar with tiktoken, but looking at the function def it appears to be doing the right thing (note the _tiktoken_encoder function that gets passed into length_function for the splitter).

Are you using the right encoding_name? The default is gpt2, whose tokens may land closer to a character-level split than the longer tokens a newer encoding would give for your documents.
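For example, one way to check whether the split really is token-based (here documents stands in for the list loaded in the snippet above, and cl100k_base is just an example encoding):

import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # default is "gpt2"; pick the encoding that matches your model
    separators=["\n\n", "\n", " "],
    chunk_size=500,
    chunk_overlap=50,
)
texts = text_splitter.split_documents(documents)

# Sanity check: measure each resulting chunk in tokens of the same encoding.
enc = tiktoken.get_encoding("cl100k_base")
print([len(enc.encode(t.page_content)) for t in texts])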

@codinseok

@alch00001 hi
#16435

I noticed that the sentences were being cut differently than I intended, so I checked the code. It turned out that the implementation was adding to the total length (total) directly, instead of always measuring the token length. So, I modified that part to let the tokenizer calculate the length, and then I was able to get the desired results. Could you check if it’s the same issue and try modifying the code on your end? I’d appreciate it if you could also share the results after you’ve tried.

@codinseok

I think the _merge_splits function within langchain.text_splitter has a problem.

@alch00001

@bhperry

Yes, I tried multiple encodings. I tried other splitters as well, but even the TokenTextSplitter does not give me good results. I might just write my own splitter.

@codinseok

Could you be a little more specific about the part you modified and post your own code? Do you mean the _length_function or _merge_splits?

@codinseok

@alch00001

Looking at the splitter, there's a process where it cuts sentences and checks their lengths before combining them. However, in the existing combination process (def _merge_splits), the length of the sentence is checked just once through the tokenizer and then added to the previous sentence. I believe that after merging the sentences, we should recalculate the length using the tokenizer again. This is because, in some languages, the token length can differ when words are combined compared to when they are separate.

For example, the current method is like this: total = len(word1) + len(word2), but I think it should be changed to total = len(word1 + word2), where len is the length measured by the tokenizer.

def _merge_splits

total = 0
for d in splits:
    _len = self._length_function(d)
    if (
        total + _len + (separator_len if len(current_doc) > 0 else 0)
        > self._chunk_size
    ):

my opinion

total = 0
for d in splits:
    current_text = self._join_docs(current_doc, separator)
    total = self._length_function(current_text)
    _len = self._length_function(d)
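For what it's worth, a small check of why the summed lengths can drift, using a tokenizer mentioned earlier in the thread (the word pair is arbitrary):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-large")

a, b = "token", "ization"
separate = len(tok.encode(a, add_special_tokens=False)) + len(tok.encode(b, add_special_tokens=False))
joined = len(tok.encode(a + b, add_special_tokens=False))
print(separate, joined)  # the two counts can differ, so summing per-split lengths drifts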

@codinseok

@alch00001

I'm not sure if this aligns exactly with the problem you're facing, but I encountered an issue where the sentences were being cut differently than the length calculated by the tokenizer: the chunks came out shorter than when I calculated the lengths separately. When I changed it to recalculate the length on the combined text rather than summing the lengths of the separate pieces, the issue was resolved.

If the problem you have is different, could you share an example sentence? I've become curious while responding.

@alch00001

@codinseok
Thank you, I will try to implement some of your solutions and get back to you. The text I'm working with is actually in German, so this may be a factor in the problems I am having.

@dosubot (bot) added the "stale" label (Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed) on Jun 14, 2024
@dosubot (bot) closed this as not planned (Won't fix, can't repro, duplicate, stale) on Jun 21, 2024
@dosubot (bot) removed the "stale" label on Jun 21, 2024