
Split by Tokens instead of characters: RecursiveCharacterTextSplitter #4678

Closed
zs-dima opened this issue May 14, 2023 · 35 comments

@zs-dima commented May 14, 2023

Feature request

LLMs usually limit input by tokens.
It may be useful to split a large text into chunks according to the number of tokens rather than the number of characters.
For example, if the LLM allows us to use 8000 tokens and we want to split the text into chunks of up to 4000 tokens, we could call

text_splitter = RecursiveCharacterTextSplitter(chunk_tokens=4000, ...

Motivation

If we split a text by number of characters, it is not obvious how many tokens each chunk will contain.
At the same time, if we want to split a text into the largest possible chunks while keeping each chunk under a given LLM token limit, we cannot operate on character counts.

Your contribution

As an example for a RecursiveCharacterTextSplitter(chunk_tokens=...) implementation, there is a very useful library that helps split text into tokens:
https://github.com/openai/tiktoken

import tiktoken

def split_large_text(large_text, max_tokens):
    # cl100k_base is the encoding used by gpt-3.5-turbo and gpt-4.
    enc = tiktoken.get_encoding("cl100k_base")
    tokenized_text = enc.encode(large_text)

    chunks = []
    current_chunk = []
    current_length = 0

    # Accumulate tokens until a chunk reaches max_tokens, then decode it back to text.
    for token in tokenized_text:
        current_chunk.append(token)
        current_length += 1

        if current_length >= max_tokens:
            chunks.append(enc.decode(current_chunk).rstrip(' .,;'))
            current_chunk = []
            current_length = 0

    # Flush the final partial chunk.
    if current_chunk:
        chunks.append(enc.decode(current_chunk).rstrip(' .,;'))

    return chunks
@s7726 (Contributor) commented May 27, 2023

The LLMs in LangChain have a token-count function. It's easier to use that than to bring in another library that doesn't know the specifics of the model.

That said, I noticed most of the LLMs don't implement their own and instead rely on the base LLM class, which uses the transformers library to count tokens.

I've been using llama.cpp and have been meaning to open a pull request to add its built-in token counting. It's ~3-4 lines of code.

TL;DR: I agree the default should be token count rather than character count. I'd prefer an alternate implementation.
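For illustration, a minimal sketch of that idea, reusing a LangChain LLM's own get_num_tokens as the splitter's length_function (the OpenAI model, chunk numbers, and text.txt path are placeholders):

from langchain.llms import OpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter

llm = OpenAI()  # any LangChain LLM exposing get_num_tokens(); OPENAI_API_KEY must be set

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=4000,  # interpreted in tokens because of the length_function below
    chunk_overlap=200,
    length_function=llm.get_num_tokens,
)
chunks = text_splitter.split_text(open("text.txt").read())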

@bhperry (Contributor) commented May 30, 2023

RecursiveCharacterTextSplitter (and others inheriting from TextSplitter) all support custom length functions, and even have the convenient from_huggingface_tokenizer/from_tiktoken_encoder classmethods.

One issue I've noticed with those is that there's no way to take advantage of batch tokenization; instead each split is tokenized separately (and then re-tokenized during the merge). That is particularly problematic for Hugging Face "fast" tokenizers, where batching gives significant speed improvements.

I'm looking at adding a batched length function to TextSplitter. Experimentally I've found it can run 5x faster than the current implementation.
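For example, a minimal use of the from_tiktoken_encoder classmethod (the encoding name, chunk sizes, and file path below are placeholders):

from langchain.text_splitter import RecursiveCharacterTextSplitter

# chunk_size and chunk_overlap are measured in tiktoken tokens here, not characters.
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=4000,
    chunk_overlap=200,
)
chunks = text_splitter.split_text(open("text.txt").read())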

@runonthespot

Some preliminary thoughts here:
Splitting recursively on \n\n, \n, etc. for OpenAI (tiktoken), I noticed that there are lots of edge cases to consider:
- \n\n is a single token
- \n is a token
- many words appear both with a leading space and without (different tokens).
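A quick way to inspect these edge cases (the exact token ids depend on the encoding):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
# Compare how separators and leading-space word variants tokenize.
for s in ["\n\n", "\n", "word", " word"]:
    print(repr(s), enc.encode(s))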

@sayan1999 commented Sep 3, 2023

@zs-dima
If I understand your requirement properly, you are looking for a way to use RecursiveCharacterTextSplitter with a restriction on the number of tokens per chunk.
You can do that by using the length_function parameter of the RecursiveCharacterTextSplitter class.
Look at the example below:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-large")
bart_tok_len = lambda x: len(tok(x)["input_ids"])


def chunkify_bart(text):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1024,
        chunk_overlap=50,
        length_function=bart_tok_len,
        is_separator_regex=False,
    )
    chunks = text_splitter.create_documents([text])

    print(" ".join([chunk.page_content for chunk in chunks]))
    print([bart_tok_len(chunk.page_content) for chunk in chunks])


chunkify_bart(open("text.txt").read())

@zs-dima (Author) commented Sep 3, 2023

@sayan1999
Thanks a lot for the answer.

  1. It looks a bit overcomplicated.
  2. chunk_size is in characters, which limits the functionality.
  3. chunk_overlap is in characters.

Or are chunk_size and chunk_overlap not in characters, but defined by the length_function?
It is not clear from the documentation currently:
Recursively split by character
https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter

It would be nice to have an option to pass a token count instead of a character count to these functions.

@sayan1999 commented Sep 3, 2023

So, the length function determines the way chunks are counted for size and overlap.

By default, the length function counts character by character.

In my example, chunk size and overlap will be counted in tokens.

@sayan1999

Found an easier alternative; please check https://stackoverflow.com/a/77042212/13717851

@JackLeick

I struggled with this too, and decided to subclass the NLTK splitter with a tiktoken length function, then just init the super with that as the length_function kwarg. Works exactly how I expected. You can also do the same with separators. But I'd just plan on using NLTK since it gives you full sentences. Much better chunking.
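A minimal sketch of that subclassing approach (the class name here is made up, and NLTK's punkt data must already be downloaded):

import tiktoken
from langchain.text_splitter import NLTKTextSplitter


class TiktokenNLTKTextSplitter(NLTKTextSplitter):
    # NLTK sentence splitting, with lengths measured in tiktoken tokens.
    def __init__(self, encoding_name="cl100k_base", **kwargs):
        enc = tiktoken.get_encoding(encoding_name)
        super().__init__(length_function=lambda text: len(enc.encode(text)), **kwargs)


splitter = TiktokenNLTKTextSplitter(chunk_size=500, chunk_overlap=50)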

@sayan1999 commented Oct 31, 2023

Really? Of all the solutions, I found this the easiest; would you like to elaborate on where the issue is?
I am using this in all my projects now, as it is fundamental for all my LLM projects.

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1024, chunk_overlap=50
)

chunks = text_splitter.split_text(long_text)

@JackLeick

I believe the issue is still that the length function is string length. Does that actually chunk on token length and not character length? That's the crux.

@bhperry (Contributor) commented Nov 6, 2023

The from_tiktoken_encoder and from_huggingface_tokenizer classmethods both use a token-based length_function.

@s7726 (Contributor) commented Nov 6, 2023

The from_tiktoken_encoder and from_huggingface_tokenizer classmethods both use a token-based length_function.

My particular ask is to have it called as a class method so that each LLM can provide its appropriate tokenizer, since they are not all the same. Also, not all setups and workflows allow online access to the mentioned tokenizers.

@ind1go commented Nov 7, 2023

It would be great to be able to mix the benefit of RecursiveCharacterTextSplitter (i.e. splitting by useful things like sentences and paragraphs) with custom tokenisers (i.e. the knowledge that a chunk is making the best use of a particular embedding model or LLM).

See AutoTokenizer as another class that retrieves that tokenising knowledge for a particular model.

@bhperry (Contributor) commented Nov 7, 2023

It does work with any tokenizer. That's why there's a length_function kwarg, so you can define the length of a string in any way that you choose. from_tiktoken_encoder and from_huggingface_tokenizer are just helper classmethods that define that function for you based on the particular type of tokenizer.

e.g. with AutoTokenizer:

tokenizer = AutoTokenizer.from_pretrained("some/model")
splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(tokenizer)

or for a custom tokenizer:

tokenizer = MyCustomTokenizer(...)


def custom_length_function(text):
  return len(tokenizer.encode(text))


splitter = RecursiveCharacterTextSplitter(length_function=custom_length_function)

@ind1go commented Nov 7, 2023

Thanks a lot @bhperry; it wasn't clear to me that it worked for subclasses like RCTS, so I'm very happy to have that confirmed.

@ind1go commented Nov 8, 2023

@bhperry Are there recommendations for settings we need to apply to the splitter or tokenizer when attempting to use RecursiveCharacterTextSplitter with a custom length function? The calculation isn't coming out correctly, and I think it's because of the classification tokens that are included in each token calculation.

For example, see the code below, which we would expect to divide the input text into probably 2 splits, given that the thenlper/gte-small tokenizer splits by word most of the time:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from transformers import AutoTokenizer

text = "This is some text. It should be split into chunks that don't exceed a limit of 40. When I say 'chunks', they're not necessarily on word boundaries but are defined by the tokenizer of the embedding model itself."
my_tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-small")
display("Number of tokens in entire input string:", len(my_tokenizer(text).tokens()))

my_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(tokenizer=my_tokenizer, chunk_size=40, chunk_overlap=0)
my_documents = [Document(page_content=text)]

my_splits = my_splitter.split_documents(my_documents)
display("Splits:", my_splits)
display("Split lengths in terms of tokens:", list(map(lambda x: len(my_tokenizer(x.page_content).tokens()), my_splits)))
Number of tokens in entire input string: 53

Splits:
[Document(page_content='This is some text. It should be split'),
 Document(page_content="into chunks that don't exceed a limit of"),
 Document(page_content="40. When I say 'chunks', they're not"),
 Document(page_content='necessarily on word boundaries but are defined by'),
 Document(page_content='the tokenizer of the embedding model itself.')]

Split lengths in terms of tokens: [11, 12, 15, 10, 13]

There are more splits than expected, and they're way shorter than expected; doing a bit of debugging, I think that's because the classification tokens are also added in the calculations.

For example, the tokenizer takes the first token as 'this', but the full encoding of it is ['[CLS]', 'this', '[SEP]'], which costs 3 towards my per-chunk budget of 40; then the next is ['[CLS]', 'is', '[SEP]'], and again that adds another three, whereas in reality that's not how the sentence as a whole would be embedded.
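A quick way to see that overhead, reusing my_tokenizer from the snippet above:

print(my_tokenizer.encode("this"))                            # ids include [CLS] and [SEP]
print(my_tokenizer.encode("this", add_special_tokens=False))  # just the id(s) for "this"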

@bhperry (Contributor) commented Nov 8, 2023

def length_function(text):
  return len(tokenizer.encode(text, add_special_tokens=False))

splitter = RecursiveCharacterTextSplitter(length_function=length_function, ...)

should do the trick

@bhperry (Contributor) commented Nov 8, 2023

Also worth noting: the recursive splitter does tend towards over-splitting. A split of exactly chunk length will be split again if it can be (not sure why that was chosen, but that's the way it was last I checked).

@ind1go commented Nov 8, 2023

Ah, nice!

Little tweak, just need to count the list length:

def length_function(text):
  return len(tokenizer.encode(text, add_special_tokens=False))

Thank you!

@bhperry (Contributor) commented Nov 8, 2023

ah yeah forgot that part

@Wolfsauge

@bhperry Thanks for implementing this. I tested it with a script and it seems to work fine. To do this, I checked out the PR #5589 and used the result locally with my script. I haven't measured the speedup, but it is quite noticeable. The example code I used is in this GitHub repo.

@Wolfsauge

@bhperry Sorry, I have to take that back. I repeated the steps with the method of defining the tokenizer via kwargs in RecursiveCharacterTextSplitter and then measured my script's execution duration; the script needs to do a lot of tokenization in parallel because it uses the text splitter.

Looking at the results, the method of using the Hugging Face AutoTokenizer.from_pretrained(), implementing a custom length function that measures the fragments in units of tokens, and passing it via the kwarg length_function=lambda x: get_length_of_chunk_in_tokens(x) to the RecursiveCharacterTextSplitter (as shown in posts above mine) gives me faster results than defining my RecursiveCharacterTextSplitter with the tokenizer keyword argument directly. I added the results of my measurements to the bottom of the GitHub page linked above.

@bhperry (Contributor) commented Nov 20, 2023

@Wolfsauge Did you mean #5583?

Your results are interesting. Are the different documents all the same? Doesn't quite make sense if they are, since I would expect the first and third to take roughly the same amount of time (as you noted on your repo, the default batched mode from my PR only works with a fast tokenizer).

story-0904-analysis-jondurbin_airoboros-m-7b-3.1.2-0023.json:
    "use_fast": false,
    "use_batched_tokenization": false,
    "tokenizer.is_fast": false,
    "summarize_duration_seconds": 95.42

story-0904-analysis-jondurbin_airoboros-m-7b-3.1.2-0024.json:
    "use_fast": true,
    "use_batched_tokenization": false,
    "tokenizer.is_fast": true,
    "summarize_duration_seconds": 34.86

story-0904-analysis-jondurbin_airoboros-m-7b-3.1.2-0025.json:
    "use_fast": false,
    "use_batched_tokenization": true,
    "tokenizer.is_fast": false,
    "summarize_duration_seconds": 37.23

story-0904-analysis-jondurbin_airoboros-m-7b-3.1.2-0026.json:
    "use_fast": true,
    "use_batched_tokenization": true,
    "tokenizer.is_fast": true,
    "summarize_duration_seconds": 43.5

If they are different documents, or different sections of the same document, then the durations are not necessarily directly comparable. The number of times it needs to tokenize and the amount of batching that can be done is heavily dependent on the splits that get created for a given section of text.

@Wolfsauge

@bhperry Hello, thanks for your reply.

Maybe I am not using the text splitter correctly yet, or have not been clear. Yes, I've used the same text as input and ran it with different ways of instantiating the RecursiveCharacterTextSplitter.

In this experiment, I was trying to find the best way to do my tokenization tasks by running a script which makes heavy use of tokenization indirectly (because it splits a lot), on the same input file (pg84 is Frankenstein; story-0904 was another random text I chose because of its size in KB/tokens).

The fastest results I can get at the moment are when I supply the RecursiveCharacterTextSplitter class with a custom length function based on the HF fast tokenizer. In the length function I just call the tokenizer and count the elements of the input_ids result. The number of elements in the result is roughly the size of the chunk measured in the model's own tokens, which is the "length" the splitter is looking for. How this works is already shown in this thread.

When using RecursiveCharacterTextSplitter with a tokenizer = AutoTokenizer.from_pretrained(model, use_fast=True) instead, my results weren't great yet.

Currently, I see my script.py running as 1 process with ~18 threads. I guess that is my error. Do I need to run many tokenizers in parallel to speed things up, and also make sure each of my RecursiveCharacterTextSplitter instances has 1 CPU to itself? Or how am I supposed to get the most out of this new "batched" feature? Has that been covered in a previous conversation, or is there some example code that makes use of the batching and parallelism?

I'd like to understand this better, because the tokenization is computationally so expensive in my script, but it's just a free-time project.

Thanks for any hints on the topic.

Regards

from icecream import ic
from langchain.text_splitter import RecursiveCharacterTextSplitter, TextSplitter
from transformers import AutoTokenizer, LlamaTokenizerFast


def get_length_of_chunk_in_tokens(my_chunk: str, buck_slip: dict) -> int:
    # Measure the chunk length in the model's tokens via the tokenizer carried in buck_slip.
    my_result = buck_slip["tokenizer"](my_chunk)
    input_ids = my_result.input_ids
    length_of_chunk_in_tokens = len(input_ids)

    return length_of_chunk_in_tokens


def get_tokenizer(buck_slip: dict) -> LlamaTokenizerFast:
    tokenizer = AutoTokenizer.from_pretrained(
        buck_slip["model_identifier"], use_fast=buck_slip["use_fast"]
    )
    ic(type(tokenizer))
    ic(tokenizer.is_fast)
    buck_slip["tokenizer.is_fast"] = tokenizer.is_fast

    encoding = tokenizer("My name is Sylvain and I work at Hugging Face in Brooklyn.")
    ic(type(encoding))
    ic(encoding.is_fast)
    buck_slip["encoding.is_fast"] = encoding.is_fast

    return tokenizer


def get_text_splitter(
    buck_slip: dict, custom_chunk_size: int, custom_chunk_overlap: int
) -> TextSplitter:
    batched_tokenization = buck_slip["use_batched_tokenization"]

    if batched_tokenization is True:
        text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
            tokenizer=buck_slip["tokenizer"],
            chunk_size=custom_chunk_size,
            chunk_overlap=custom_chunk_overlap,
        )
    else:
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=custom_chunk_size,
            chunk_overlap=custom_chunk_overlap,
            length_function=lambda x: get_length_of_chunk_in_tokens(x, buck_slip),
        )
    ic(type(text_splitter))

    return text_splitter

@bhperry (Contributor) commented Nov 21, 2023

This wasn't an official feature, just something I put together based on my own observations. And since it wasn't getting any traction from reviewers, I moved on to other things.

As noted in this article https://huggingface.co/learn/nlp-course/chapter6/3, the fast tokenizer's batching really only starts to shine when it is given large batches. So it's definitely possible that for some texts there simply aren't enough chunks per batch to make it worthwhile. I'll revisit this PR at some point and see if I can improve upon the original idea.

@codinseok

import pandas as pd
from transformers import AutoTokenizer
from langchain.text_splitter import RecursiveCharacterTextSplitter

tokenizer = AutoTokenizer.from_pretrained('intfloat/multilingual-e5-large')

def length_function(text):
    return len(tokenizer.encode(text, add_special_tokens=False))

text_splitter = RecursiveCharacterTextSplitter(
    length_function=length_function,
    chunk_size=250,
    chunk_overlap=0
)


This produces the following warning and error:

Token indices sequence length is longer than the specified maximum sequence length for this model (549 > 512). Running this sequence through the model will result in indexing errors
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3801             try:
-> 3802                 return self._engine.get_loc(casted_key)
   3803             except KeyError as err:

10 frames
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'para'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
KeyError: 'para'

During handling of the above exception, another exception occurred:

KeyboardInterrupt                         Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py in __getitem__(self, key)
   5320             return getitem(key)
   5321 
-> 5322         if isinstance(key, slice):
   5323             # This case is separated from the conditional above to avoid
   5324             # pessimization com.is_bool_indexer and ndim checks.

KeyboardInterrupt:


Why does it raise the "maximum sequence length for this model (549 > 512)" error?
I set chunk_size = 250.

@bhperry (Contributor) commented Jan 4, 2024

That's just a warning from the Hugging Face tokenizer. It tokenizes the full text in order to determine where to split it, then splits it down to chunk size. You can safely ignore that warning when the tokenizer is used in this way; it only matters when you are using the tokens as input to the associated model.
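If the warning is noisy, one way to silence it (assuming it is emitted through the transformers logger, which it normally is) is to lower the library's log verbosity before splitting:

from transformers.utils import logging as hf_logging

# Note: this hides all transformers warnings, not only the sequence-length one.
hf_logging.set_verbosity_error()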

@alch00001 commented Mar 11, 2024

I read this thread, but I'm still confused. I have the following code:

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(separators=["\n\n", "\n", " "], chunk_size=500, chunk_overlap=50)
texts = text_splitter.split_documents(documents)

But it still seems to split on character count instead of token count. So is this not a viable method? Am I missing some arguments?

@bhperry (Contributor) commented Mar 11, 2024

Less familiar with tiktoken, but looking at the function def it appears to be doing the right thing (note the _tiktoken_encoder function that gets passed into length_function for the splitter).

Are you using the right encoding_name? The default is gpt2, whose tokens may land closer to a character-level split than the longer tokens a newer encoding would give for your documents.
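For example, one way to check whether the split really is token-based (here documents stands in for the list loaded in the snippet above, and cl100k_base is just an example encoding):

import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # default is "gpt2"; pick the encoding that matches your model
    separators=["\n\n", "\n", " "],
    chunk_size=500,
    chunk_overlap=50,
)
texts = text_splitter.split_documents(documents)

# Sanity check: measure each resulting chunk in tokens of the same encoding.
enc = tiktoken.get_encoding("cl100k_base")
print([len(enc.encode(t.page_content)) for t in texts])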

@codinseok

@alch00001 hi
#16435

I noticed that the sentences were being cut differently than I intended, so I checked the code. It turned out that the implementation was adding to the total length (total) directly, instead of always measuring the token length. So, I modified that part to let the tokenizer calculate the length, and then I was able to get the desired results. Could you check if it’s the same issue and try modifying the code on your end? I’d appreciate it if you could also share the results after you’ve tried.

@codinseok

I think the _merge_splits function within langchain.text_splitter has a problem.

@alch00001

@bhperry

Yes, I tried multiple encodings. I tried other splitters as well, but even the TokenTextSplitter does not give me good results. I might just write my own splitter.

@codinseok

Could you be a little more specific about the part you modified and post your own code? Do you mean the _length_function or _merge_splits?

@codinseok

@alch00001

Looking at the splitter, there's a process where it cuts sentences and checks their lengths before combining them. However, in the existing combination process (def _merge_splits), the length of the sentence is checked just once through the tokenizer and then added to the previous sentence. I believe that after merging the sentences, we should recalculate the length using the tokenizer again. This is because, in some languages, the token length can differ when words are combined compared to when they are separate.

For example, the current method is like this: total = len(word1) + len(word2), but I think it should be changed to total = len(word1 + word2), where len is the length measured by the tokenizer.

def _merge_splits

total = 0
for d in splits:
    _len = self._length_function(d)
    if (
        total + _len + (separator_len if len(current_doc) > 0 else 0)
        > self._chunk_size
    ):

my opinion

total = 0
for d in splits:
    current_text = self._join_docs(current_doc, separator)
    total = self._length_function(current_text)
    _len = self._length_function(d)
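For what it's worth, a small check of why the summed lengths can drift, using a tokenizer mentioned earlier in the thread (the word pair is arbitrary):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-large")

a, b = "token", "ization"
separate = len(tok.encode(a, add_special_tokens=False)) + len(tok.encode(b, add_special_tokens=False))
joined = len(tok.encode(a + b, add_special_tokens=False))
print(separate, joined)  # the two counts can differ, so summing per-split lengths drifts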

@codinseok

@alch00001

I'm not sure if this aligns exactly with the problem you're facing, but I encountered an issue where the sentences were being cut differently than the length calculated by the tokenizer: the chunks came out shorter than when I calculated the lengths separately. When I changed it to recalculate the length on the combined text rather than summing the lengths of the separate pieces, the issue was resolved.

If the problem you have is different, could you share an example sentence? I've become curious while responding.

@alch00001

@codinseok
Thank you, I will try to implement some of your solutions and get back to you. The text I'm working with is actually in German, so this may be a factor in the problems I am having.

@dosubot (bot) added the "stale" label (Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed) on Jun 14, 2024
@dosubot (bot) closed this as not planned (Won't fix, can't repro, duplicate, stale) on Jun 21, 2024
@dosubot (bot) removed the "stale" label on Jun 21, 2024