- Rust Crate: text-splitter
- Python Bindings: semantic-text-splitter (unfortunately couldn't acquire the same package name)
Large language models (LLMs) can be used for many tasks, but often have a limited context size that can be smaller than documents you might want to use. To use documents of larger length, you often have to split your text into chunks to fit within this context size.
This crate provides methods for splitting longer pieces of text into smaller chunks, aiming to maximize a desired chunk size, but still splitting at semantically sensible boundaries whenever possible.
Add it to your project with
cargo add text-splitter
The simplest way to use this crate is to use the default implementation, which uses character count for chunk size.
use text_splitter::TextSplitter;
// Maximum number of characters in a chunk
let max_characters = 1000;
// Default implementation uses character count for chunk size
let splitter = TextSplitter::new(max_characters);
let chunks = splitter.chunks("your document text");
Requires the tokenizers feature to be activated and tokenizers to be added to your dependencies. The example below, using from_pretrained(), also requires the http feature of tokenizers to be enabled.
cargo add text-splitter --features tokenizers
cargo add tokenizers --features http
use text_splitter::{ChunkConfig, TextSplitter};
// Can also use anything else that implements the ChunkSizer
// trait from the text_splitter crate.
use tokenizers::Tokenizer;
let tokenizer = Tokenizer::from_pretrained("bert-base-cased", None).unwrap();
let max_tokens = 1000;
let splitter = TextSplitter::new(ChunkConfig::new(max_tokens).with_sizer(tokenizer));
let chunks = splitter.chunks("your document text");
Requires the tiktoken-rs feature to be activated and tiktoken-rs to be added to your dependencies.
cargo add text-splitter --features tiktoken-rs
cargo add tiktoken-rs
use text_splitter::{ChunkConfig, TextSplitter};
// Can also use anything else that implements the ChunkSizer
// trait from the text_splitter crate.
use tiktoken_rs::cl100k_base;
let tokenizer = cl100k_base().unwrap();
let max_tokens = 1000;
let splitter = TextSplitter::new(ChunkConfig::new(max_tokens).with_sizer(tokenizer));
let chunks = splitter.chunks("your document text");
You also have the option of specifying your chunk capacity as a range.
Once a chunk has reached a length that falls within the range, it will be returned.
A chunk may still be returned that is shorter than the start value, because adding the next piece of text would have made it larger than the end capacity.
use text_splitter::TextSplitter;
// Maximum number of characters in a chunk. Will fill up the
// chunk until it is somewhere in this range.
let max_characters = 500..2000;
// Default implementation uses character count for chunk size
let splitter = TextSplitter::new(max_characters);
let chunks = splitter.chunks("your document text");
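The range behavior described above can be illustrated with a small standalone sketch. It does not use the crate: `fill_chunks` is a hypothetical helper that greedily groups already-split pieces by character count, assuming a half-open `start..end` range and leaving oversized pieces as they are instead of splitting them further.

```rust
use std::ops::Range;

/// Hypothetical helper (not part of text-splitter): groups pre-split
/// pieces into chunks whose character counts aim for `capacity`.
fn fill_chunks(pieces: &[&str], capacity: Range<usize>) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    for piece in pieces {
        let current_len = current.chars().count();
        let piece_len = piece.chars().count();
        // Adding this piece would overshoot the end of the range, so emit
        // what we have, even if it is still shorter than `capacity.start`.
        if !current.is_empty() && current_len + piece_len >= capacity.end {
            chunks.push(std::mem::take(&mut current));
        }
        current.push_str(piece);
        // Once the chunk length falls within the range, return it.
        if capacity.contains(&current.chars().count()) {
            chunks.push(std::mem::take(&mut current));
        }
    }
    // Whatever is left over becomes the final chunk.
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

With a capacity of `5..10`, the pieces `"aaaa"`, `"bbbbbbbb"`, `"cc"` come back as three separate chunks: the first is emitted below the start value because appending the second piece would exceed the end of the range.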
All of the above examples can also work with Markdown text. If you enable the markdown feature, you can use the MarkdownSplitter in the same ways as the TextSplitter.
cargo add text-splitter --features markdown
use text_splitter::MarkdownSplitter;
// Maximum number of characters in a chunk. Can also use a range.
let max_characters = 1000;
// Default implementation uses character count for chunk size.
// Can also use all of the same tokenizer implementations as `TextSplitter`.
let splitter = MarkdownSplitter::new(max_characters);
let chunks = splitter.chunks("# Header\n\nyour document text");
All of the above examples can also work with code that can be parsed with tree-sitter. If you enable the code feature, you can use the CodeSplitter in the same ways as the TextSplitter.
cargo add text-splitter --features code
cargo add tree-sitter-<language>
use text_splitter::CodeSplitter;
// Maximum number of characters in a chunk. Can also use a range.
let max_characters = 1000;
// Default implementation uses character count for chunk size.
// Can also use all of the same tokenizer implementations as `TextSplitter`.
let splitter = CodeSplitter::new(tree_sitter_rust::LANGUAGE, max_characters).expect("Invalid tree-sitter language");
let chunks = splitter.chunks("your code file");
To preserve as much semantic meaning within a chunk as possible, each chunk is composed of the largest semantic units that can fit within the chunk size. For each splitter type, there is a defined set of semantic levels. Here is an example of the steps used:
- Split the text by increasing semantic levels.
- Check the first item for each level and select the highest level whose first item still fits within the chunk size.
- Merge as many neighboring sections of this level or above as possible into a chunk to maximize chunk length. Boundaries of higher semantic levels are always included when merging, so that the chunk doesn't inadvertently cross semantic boundaries.
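The steps above can be sketched with the standard library alone. This is a simplified illustration, not the crate's implementation: the `Level` enum, `sections`, `next_chunk`, and `split_text` are hypothetical names, the levels are reduced to paragraph, line, word, and character, and whitespace is kept attached to each section instead of being trimmed.

```rust
/// Toy semantic levels, highest first. The crate itself relies on Unicode
/// segmentation rules; these std-only splits are a simplifying assumption.
#[derive(Clone, Copy)]
enum Level {
    Paragraph,
    Line,
    Word,
    Char,
}

const LEVELS: [Level; 4] = [Level::Paragraph, Level::Line, Level::Word, Level::Char];

/// Step 1: split `text` at one level, keeping separators attached so the
/// sections concatenate back to the original text.
fn sections(text: &str, level: Level) -> Vec<&str> {
    match level {
        Level::Paragraph => text.split_inclusive("\n\n").collect(),
        Level::Line => text.split_inclusive('\n').collect(),
        Level::Word => text.split_inclusive(char::is_whitespace).collect(),
        Level::Char => text
            .char_indices()
            .map(|(i, c)| &text[i..i + c.len_utf8()])
            .collect(),
    }
}

fn next_chunk(text: &str, max_chars: usize) -> &str {
    // Step 2: pick the highest level whose first section fits.
    let level = LEVELS
        .iter()
        .copied()
        .find(|&level| {
            sections(text, level)
                .first()
                .map_or(false, |s| s.chars().count() <= max_chars)
        })
        .unwrap_or(Level::Char);

    // Step 3: merge neighboring sections of that level while they still fit.
    let (mut end, mut count) = (0, 0);
    for section in sections(text, level) {
        let len = section.chars().count();
        if count + len > max_chars {
            break;
        }
        count += len;
        end += section.len();
    }
    &text[..end]
}

/// Repeats steps 1 to 3 until the whole text has been consumed.
fn split_text(text: &str, max_chars: usize) -> Vec<&str> {
    assert!(max_chars > 0, "chunk size must be at least one character");
    let mut chunks = Vec::new();
    let mut rest = text;
    while !rest.is_empty() {
        let chunk = next_chunk(rest, max_chars);
        chunks.push(chunk);
        rest = &rest[chunk.len()..];
    }
    chunks
}
```

Because a lower level is only chosen when the first section of every higher level is too large, the merged sections in this sketch never run past the end of that higher-level section.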
The boundaries used to split the text if using the chunks method, in ascending order:
- Characters
- Unicode Grapheme Cluster Boundaries
- Unicode Word Boundaries
- Unicode Sentence Boundaries
- Ascending sequence length of newlines. (Newline is \r\n, \n, or \r.) Each unique length of consecutive newline sequences is treated as its own semantic level. So a sequence of 2 newlines is a higher level than a sequence of 1 newline, and so on.
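To make the newline levels concrete, here is a minimal std-only sketch. `newline_levels` is a hypothetical helper, not a crate API, and it assumes that `\r\n` counts as a single newline when measuring the length of a run.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: returns the distinct lengths of consecutive
/// newline runs found in `text`, in ascending order. Each length would
/// correspond to its own semantic level.
fn newline_levels(text: &str) -> Vec<usize> {
    let mut levels = BTreeSet::new();
    let mut run = 0usize;
    let mut chars = text.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            '\r' => {
                // Assumption: treat "\r\n" as one newline, not two.
                if chars.peek() == Some(&'\n') {
                    chars.next();
                }
                run += 1;
            }
            '\n' => run += 1,
            _ => {
                // A non-newline character ends the current run.
                if run > 0 {
                    levels.insert(run);
                }
                run = 0;
            }
        }
    }
    if run > 0 {
        levels.insert(run);
    }
    levels.into_iter().collect()
}
```

For a text containing single, double, and triple line breaks, this yields three levels, with the triple break being the highest.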
Splitting doesn't occur below the character level, otherwise you could get partial bytes of a char, which may not be a valid unicode str.
Markdown is parsed according to the CommonMark spec, along with some optional features such as GitHub Flavored Markdown.
- Characters
- Unicode Grapheme Cluster Boundaries
- Unicode Word Boundaries
- Unicode Sentence Boundaries
- Soft line breaks (single newline), which aren't necessarily a new element in Markdown.
- Inline elements such as: text nodes, emphasis, strong, strikethrough, link, image, table cells, inline code, footnote references, task list markers, and inline html.
- Block elements such as: paragraphs, code blocks, footnote definitions, metadata. Also, a block quote or row/item within a table or list that can contain other "block" type elements, and a list or table that contains items.
- Thematic breaks or horizontal rules.
- Headings by level
Splitting doesn't occur below the character level, otherwise you could get partial bytes of a char, which may not be a valid unicode str.
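The reason for stopping at the character level can be shown with plain Rust strings, which are UTF-8 and can only be sliced on char boundaries. `safe_prefix` below is a hypothetical helper for illustration, not something the crate exposes.

```rust
/// Hypothetical helper: the longest prefix of `text` that is at most
/// `max_bytes` bytes long and still ends on a char boundary.
fn safe_prefix(text: &str, max_bytes: usize) -> &str {
    let mut end = max_bytes.min(text.len());
    // Back up until we are no longer in the middle of a multi-byte char.
    // Index 0 is always a boundary, so this loop terminates.
    while !text.is_char_boundary(end) {
        end -= 1;
    }
    &text[..end]
}
```

In `"héllo"` the `é` occupies two bytes, so `"héllo".get(0..2)` is `None`, while `safe_prefix("héllo", 2)` backs up and returns `"h"`.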
For code parsed with tree-sitter, the boundaries used to split the text, in ascending order:
- Characters
- Unicode Grapheme Cluster Boundaries
- Unicode Word Boundaries
- Unicode Sentence Boundaries
- Ascending depth of the syntax tree. So a function would have a higher level than a statement inside of the function, and so on.
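The idea of tree depth as a semantic level can be sketched with a toy tree. This does not use tree-sitter: `Node` and `largest_fitting` are hypothetical, and the sketch simply drops text between child nodes and any leaf that is too large, where a real splitter would fall back to the lower levels listed above.

```rust
/// Toy syntax tree node: the source text it covers plus its children.
struct Node {
    text: &'static str,
    children: Vec<Node>,
}

/// Collects the shallowest (highest-level) nodes whose text fits in
/// `max_chars`, descending into children only when a node is too large.
fn largest_fitting<'a>(node: &'a Node, max_chars: usize, out: &mut Vec<&'a str>) {
    if node.text.chars().count() <= max_chars {
        out.push(node.text);
    } else {
        for child in &node.children {
            largest_fitting(child, max_chars, out);
        }
    }
}
```

A function node that fits is kept whole; if it is too large, its statements are considered individually.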
Splitting doesn't occur below the character level, otherwise you could get partial bytes of a char, which may not be a valid unicode str.
There are lots of methods of determining sentence breaks, with varying degrees of accuracy, and many require ML models. Rather than trying to find the perfect sentence breaks, we rely on the Unicode method of sentence boundaries, which in most cases is good enough for finding a decent semantic breaking point if a paragraph is too large, and avoids the performance penalties of many other methods.
Feature | Description |
---|---|
code | Enables the CodeSplitter struct for parsing code documents via tree-sitter parsers. |
markdown | Enables the MarkdownSplitter struct for parsing Markdown documents via the CommonMark spec. |
Dependency Feature | Version Supported | Description |
---|---|---|
rust_tokenizers | ^8.0.0 | Enables (Text/Markdown)Splitter::new to take any of the provided tokenizers as an argument. |
tiktoken-rs | ^0.6.0 | Enables (Text/Markdown)Splitter::new to take tiktoken_rs::CoreBPE as an argument. This is useful for splitting text for OpenAI models. |
tokenizers | ^0.20.0 | Enables (Text/Markdown)Splitter::new to take tokenizers::Tokenizer as an argument. This is useful for splitting text for models that have a Hugging Face-compatible tokenizer. |
This crate was inspired by LangChain's TextSplitter. Looking into that implementation, though, there was potential for better performance as well as better semantic chunking.
A big thank you to the unicode-rs team for their unicode-segmentation crate that manages a lot of the complexity of matching the Unicode rules for words and sentences.