Releases · benbrandt/text-splitter

06 Jul 05:38

benbrandt

v0.14.1

304e55f

v0.14.1

What's New

Small performance improvements where checking the size of the chunk is avoided if we already know it is too small or we don't need to. #261
Loosen dependency ranges for Rust crates to allow for more flexibility in the versions you can use.

Full Changelog: v0.14.0...v0.14.1


21 Jun 20:54

benbrandt

v0.14.0

7c3cbbd

v0.14.0

What's New

Performance fixes for large documents. The worst-case performance for certain documents was abysmal, leading to documents that seemed to run forever. This release makes sure that, in the worst case, the splitter no longer binary searches over the entire document, as it did before. That was prohibitively expensive, especially for the tokenizer implementations, and the search space should now always have a safe upper bound.

For the "happy path", this new approach also led to big speed gains in the CodeSplitter (50%+ speed increase in some cases), marginal regressions in the MarkdownSplitter, and not much difference in the TextSplitter. But overall, the performance should be more consistent across documents, since it wasn't uncommon for a document with certain formatting to hit the worst-case scenario previously.
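To make the idea of a bounded search space concrete, here is a minimal sketch in Python. It is not the crate's implementation: the function name `fit_sections` and the galloping strategy are assumptions chosen for illustration. Instead of binary searching over every section in the document, it doubles a small window until the window no longer fits, then binary searches only inside that window.

```python
from typing import Callable, Sequence


def fit_sections(
    sections: Sequence[str], size: Callable[[str], int], capacity: int
) -> int:
    """Return how many leading sections fit in `capacity` when joined.

    Hypothetical helper. Assumes `size` never shrinks as text is appended.
    """

    def fits(count: int) -> bool:
        return size("".join(sections[:count])) <= capacity

    if not sections or not fits(1):
        return 0

    # Gallop: double the window until it stops fitting or runs off the end.
    low, high = 1, 2
    while high <= len(sections) and fits(high):
        low, high = high, high * 2
    high = min(high, len(sections) + 1)

    # Binary search between `low` (known to fit) and `high` (does not fit,
    # or is one past the end), so the document tail is never measured.
    while high - low > 1:
        mid = (low + high) // 2
        if fits(mid):
            low = mid
        else:
            high = mid
    return low
```

With this shape, the number of `size` calls grows with the size of the chunk being built rather than with the length of the whole document, which matters most when `size` is an expensive tokenizer call.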

Breaking Changes

Chunk output may be slightly different because of the changes to the search optimizations. The previous optimization occasionally caused the splitter to stop too soon. For most cases, you may see no difference. It was most pronounced in the MarkdownSplitter at very small sizes, and any splitter using RustTokenizers because of its offset behavior.

Rust

ChunkSize has been removed. This was a holdover from a previous internal optimization, which turned out to not be very accurate anyway.
This makes implementing a custom ChunkSizer much easier, as you now only need to return the size of the chunk as a usize. Tokenization implementations often had to do extra work to calculate the size as well, which is no longer necessary.

Before

pub trait ChunkSizer {
    // Required method
    fn chunk_size(&self, chunk: &str, capacity: &ChunkCapacity) -> ChunkSize;
}

After

pub trait ChunkSizer {
    // Required method
    fn size(&self, chunk: &str) -> usize;
}
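As a rough illustration of how small a sizer can be once it only returns a single number, here is a Python sketch. It does not use the library: `word_count_size` and `fits` are hypothetical names, and the capacity check is shown only to make the point that the sizer itself no longer needs to know about capacity.

```python
def word_count_size(chunk: str) -> int:
    """Hypothetical sizer: measures a chunk by its whitespace-separated words."""
    return len(chunk.split())


def fits(chunk: str, capacity: int, size=word_count_size) -> bool:
    """The caller compares the size to the capacity; the sizer only counts."""
    return size(chunk) <= capacity
```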

Optimization for SemanticSplitRange searching by @benbrandt in #219
Performance Optimization: Expanding binary search window by @benbrandt in #231

Full Changelog: v0.13.3...v0.14.0

Contributors

benbrandt


02 Jun 21:10

benbrandt

v0.13.3

a3900eb

v0.13.3

What's Changed

Fixes broken PyPI publish because of a bad dev dependency specification

Full Changelog: v0.13.2...v0.13.3


02 Jun 20:36

benbrandt

v0.13.2

04317e9

v0.13.2 - CodeSplitter

What's Changed

New CodeSplitter for splitting code in any language for which a tree-sitter grammar is available. It should provide decent chunks, but please provide feedback if you notice any strange behavior.

Rust Usage

cargo add text-splitter --features code
cargo add tree-sitter-<language>

use text_splitter::CodeSplitter;
// Default implementation uses character count for chunk size.
// Can also use all of the same tokenizer implementations as `TextSplitter`.
let splitter = CodeSplitter::new(tree_sitter_rust::language(), 1000).expect("Invalid tree-sitter language");

let chunks = splitter.chunks("your code file");

Python Usage

from semantic_text_splitter import CodeSplitter
import tree_sitter_python

# Default implementation uses character count for chunk size.
# Can also use all of the same tokenizer implementations as `TextSplitter`.
splitter = CodeSplitter(tree_sitter_python.language(), capacity=1000)

chunks = splitter.chunks("your code file")

Full Changelog: v0.13.1...v0.13.2


07 May 20:59

benbrandt

v0.13.1

0fc34dd

v0.13.1

What's Changed

Fix a bug in the fallback logic to make sure we still respect the maximum number of bytes we should be searching in. Again, this only affects Markdown splitting at very small sizes. #174

Full Changelog: v0.13.0...v0.13.1


05 May 23:02

benbrandt

v0.13.0

9676ab1

v0.13.0

What's New / Breaking Changes

Unicode Segmentation is now only used as a fallback. This prioritizes the semantic levels of each splitter, and only uses Unicode grapheme/word/sentence segmentation when none of the semantic levels can be split at the desired capacity.

In most cases, this won't change the behavior of the splitter, and will likely mean that speed will improve because it is able to skip several semantic levels at the start, acting as a bisect or binary search, and only go back to the lower levels if it can't fit.

However, for the MarkdownSplitter at very small sizes (i.e., less than 16 tokens), this may produce different output, because prior to this change, the splitter may have used Unicode sentence segmentation instead of the Markdown semantic levels, due to an optimization in the level selection. Now, the splitter will prioritize the parsed Markdown levels before it falls back to Unicode segmentation, which preserves better structure at small sizes.

So in most cases this is likely a non-breaking update. However, if you were using extremely small chunk sizes for Markdown, the behavior is different, and I wanted to indicate that with a major version bump.
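The ordering described above can be sketched in a few lines of Python. This is a simplified illustration, not the splitter's algorithm: the regular expressions are rough stand-ins for real semantic levels and for Unicode sentence/word segmentation, the name `split_at_best_level` is hypothetical, and size is measured in characters.

```python
import re
from typing import List

# Hypothetical level order: semantic levels first, plain-text fallbacks last.
SEMANTIC_LEVELS = [r"\n{2,}", r"\n"]  # e.g. paragraph breaks, then line breaks
FALLBACK_LEVELS = [r"(?<=[.!?])\s+", r"\s+"]  # rough sentence / word stand-ins


def split_at_best_level(text: str, capacity: int) -> List[str]:
    """Split at the first level whose pieces all fit within `capacity`."""
    for pattern in SEMANTIC_LEVELS + FALLBACK_LEVELS:
        parts = [part for part in re.split(pattern, text) if part]
        if len(parts) > 1 and all(len(part) <= capacity for part in parts):
            return parts
    # Last resort: fixed-width slices (a stand-in for grapheme segmentation).
    return [text[i : i + capacity] for i in range(0, len(text), capacity)]
```

With a capacity of 30 characters, the text "First para.\n\nSecond para. More words." splits at the paragraph break. With a capacity of 15, no semantic level produces pieces that fit, so the sentence fallback is used.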

Full Changelog: v0.12.3...v0.13.0


01 May 13:42

benbrandt

v0.12.3

86e408f

v0.12.3

Bug Fix

Remove leftover dbg! statements in chunk overlap code #154 🤦🏻‍♂️

Apologies if I spammed your logs!

New Contributors

@Sagebati made their first contribution in #164

Full Changelog: v0.12.2...v0.12.3

Contributors

Sagebati


28 Apr 21:02

benbrandt

v0.12.2

c6e599e

v0.12.2 - Chunk Overlap

What's New

Support for chunk overlapping: Several of you have been waiting on this for a while now, and I am happy to say that chunk overlapping is now available in a way that still stays true to the spirit of finding good semantic break points.

When a new chunk is emitted, if chunk overlapping is enabled, the splitter will look back at the semantic sections of the current level and pull in as many as possible that fit within the overlap window. This does mean that sometimes none can be taken, which is often the case when close to a higher semantic level boundary.

An overlap will almost always be produced when the current semantic level couldn't fit into a single chunk. In that case, the overlapping sections help because we may not have found a good break point in the middle of the section, which seems to be the main motivation for using chunk overlapping in the first place.
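The look-back step can be sketched as follows. This is an illustrative Python model under stated assumptions, not the library's code: sections are plain strings, size is measured in characters, and `overlap_start` is a hypothetical helper name.

```python
from typing import List


def overlap_start(sections: List[str], chunk_start: int, overlap: int) -> int:
    """Return the index of the first section to include in the next chunk.

    The next chunk would normally begin at `chunk_start`. Walk backwards and
    pull in whole sections for as long as they fit in the overlap window.
    """
    start = chunk_start
    used = 0
    while start > 0 and used + len(sections[start - 1]) <= overlap:
        start -= 1
        used += len(sections[start])
    return start
```

For sections `["alpha ", "beta ", "gamma ", "delta"]` with the next chunk starting at index 3, an overlap of 11 pulls in "beta " and "gamma ", while an overlap of 5 pulls in nothing, because the nearest preceding section is already larger than the window.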

Rust Usage

use text_splitter::{ChunkConfig, TextSplitter};

let chunk_config = ChunkConfig::new(256)
    // .with_sizer(sizer) // Optional tokenizer or other chunk sizer impl
    .with_overlap(64)
    .expect("Overlap must be less than desired chunk capacity");
let splitter = TextSplitter::new(chunk_config); // Or MarkdownSplitter

Python Usage

splitter = TextSplitter(256, overlap=64) # or any of the class methods to use a tokenizer

Full Changelog: v0.12.1...v0.12.2


26 Apr 21:55

benbrandt

v0.12.1

0d6b722

v0.12.1 - rust_tokenizers support

What's Changed

rust_tokenizers support has been added to the Rust crate in #156

Full Changelog: v0.12.0...v0.12.1


23 Apr 21:31

benbrandt

v0.12.0

b03b1be

v0.12.0 - Centralized Chunk Configuration

What's New

This release is a big API change to pull all chunk configuration options into the same place, at initialization of the splitters. This was motivated by two things:

These settings are all important to deciding how to split the text for a given use case, and in practice I saw them often being set together anyway.
To prep the library for new features like chunk overlap, where error handling has to be introduced to make sure that invariants are kept between all of the settings. These errors should be handled as soon as possible before chunking the text.

Overall, I think this has aligned the library with the usage I have seen in the wild, and pulls all of the settings for the "domain" of chunking into a single unit.
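A small Python sketch of the "validate at initialization" idea follows. The class name `ChunkSettings` and its fields are hypothetical and only mirror the settings named in these notes. The point is that an invalid combination fails when the configuration is built, before any text is chunked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkSettings:
    """Hypothetical bundle of chunking settings, checked once at creation."""

    capacity: int
    overlap: int = 0
    trim: bool = True

    def __post_init__(self) -> None:
        if self.capacity <= 0:
            raise ValueError("capacity must be positive")
        # Invariant between settings: the overlap has to fit inside a chunk.
        if not 0 <= self.overlap < self.capacity:
            raise ValueError("overlap must be less than the chunk capacity")
```

Building `ChunkSettings(256, overlap=64)` succeeds, while `ChunkSettings(256, overlap=256)` raises a `ValueError` immediately.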

Breaking Changes

Rust

Trimming is now enabled by default. This brings the Rust crate in alignment with the Python package. But for every use case I saw, this was already being set to true, and this does logically make sense as the default behavior.
TextSplitter and MarkdownSplitter now take a ChunkConfig in their ::new method
- This brings the ChunkSizer, ChunkCapacity, and trim settings into a single struct that can be instantiated with a builder-lite pattern.
- with_trim_chunks method has been removed from TextSplitter and MarkdownSplitter. You can now set trim in the ChunkConfig struct.
ChunkCapacity is now a struct instead of a trait. If you were using a custom ChunkCapacity, you can change your impl to a From<TYPE> for ChunkCapacity instead, and you should still be able to pass it in to all of the same methods.
- This also means ChunkSizers take a concrete type in their method instead of an impl

Migration Examples

Default settings:

/// Before
let splitter = TextSplitter::default().with_trim_chunks(true);
let chunks = splitter.chunks("your document text", 500);

/// After
let splitter = TextSplitter::new(500);
let chunks = splitter.chunks("your document text");

Hugging Face Tokenizers:

/// Before
let tokenizer = Tokenizer::from_pretrained("bert-base-cased", None).unwrap();
let splitter = TextSplitter::new(tokenizer).with_trim_chunks(true);
let chunks = splitter.chunks("your document text", 500);

/// After
let tokenizer = Tokenizer::from_pretrained("bert-base-cased", None).unwrap();
let splitter = TextSplitter::new(ChunkConfig::new(500).with_sizer(tokenizer));
let chunks = splitter.chunks("your document text");

Tiktoken:

/// Before
let tokenizer = cl100k_base().unwrap();
let splitter = TextSplitter::new(tokenizer).with_trim_chunks(true);
let chunks = splitter.chunks("your document text", 500);

/// After
let tokenizer = cl100k_base().unwrap();
let splitter = TextSplitter::new(ChunkConfig::new(500).with_sizer(tokenizer));
let chunks = splitter.chunks("your document text");

Ranges:

/// Before
let splitter = TextSplitter::default().with_trim_chunks(true);
let chunks = splitter.chunks("your document text", 500..2000);

/// After
let splitter = TextSplitter::new(500..2000);
let chunks = splitter.chunks("your document text");

Markdown:

/// Before
let splitter = MarkdownSplitter::default().with_trim_chunks(true);
let chunks = splitter.chunks("your document text", 500);

/// After
let splitter = MarkdownSplitter::new(500);
let chunks = splitter.chunks("your document text");

ChunkSizer impls

pub trait ChunkSizer {
    /// Before
    fn chunk_size(&self, chunk: &str, capacity: &impl ChunkCapacity) -> ChunkSize;
    /// After
    fn chunk_size(&self, chunk: &str, capacity: &ChunkCapacity) -> ChunkSize;
}

ChunkCapacity impls

/// Before
impl ChunkCapacity for Range<usize> {
    fn start(&self) -> Option<usize> {
        Some(self.start)
    }

    fn end(&self) -> usize {
        self.end.saturating_sub(1).max(self.start)
    }
}

/// After
impl From<Range<usize>> for ChunkCapacity {
    fn from(range: Range<usize>) -> Self {
        ChunkCapacity::new(range.start)
            .with_max(range.end.saturating_sub(1).max(range.start))
            .expect("invalid range")
    }
}
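The arithmetic in the conversion above is easy to check by hand: a half-open range excludes its end, so the maximum is `end - 1`, clamped so it never drops below the start. The Python sketch below mirrors that calculation only. The function name `capacity_from_range` is hypothetical and the result is a plain `(desired, max)` tuple rather than a library type.

```python
from typing import Tuple


def capacity_from_range(r: range) -> Tuple[int, int]:
    """Map a half-open range to an inclusive (desired, max) capacity pair."""
    desired = r.start
    # Mirrors `end.saturating_sub(1).max(start)` from the Rust conversion.
    maximum = max(max(r.stop - 1, 0), desired)
    return desired, maximum
```

For example, `range(500, 2000)` maps to `(500, 1999)`, and the empty range `range(500, 500)` maps to `(500, 500)`.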

Python

Chunk capacity is now a required argument in the __init__ and classmethods of TextSplitter and MarkdownSplitter
trim_chunks parameter is now just trim in the __init__ and classmethods of TextSplitter and MarkdownSplitter

Migration Examples

Default settings:

# Before
splitter = TextSplitter()
chunks = splitter.chunks("your document text", 500)

# After
splitter = TextSplitter(500)
chunks = splitter.chunks("your document text")

Ranges:

# Before
splitter = TextSplitter()
chunks = splitter.chunks("your document text", (200,1000))

# After
splitter = TextSplitter((200,1000))
chunks = splitter.chunks("your document text")

Hugging Face Tokenizers:

# Before
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
splitter = TextSplitter.from_huggingface_tokenizer(tokenizer)
chunks = splitter.chunks("your document text", 500)

# After
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
splitter = TextSplitter.from_huggingface_tokenizer(tokenizer, 500)
chunks = splitter.chunks("your document text")

Tiktoken:

# Before
splitter = TextSplitter.from_tiktoken_model("gpt-3.5-turbo")
chunks = splitter.chunks("your document text", 500)

# After
splitter = TextSplitter.from_tiktoken_model("gpt-3.5-turbo", 500)
chunks = splitter.chunks("your document text")

Custom callback:

# Before
splitter = TextSplitter.from_callback(lambda text: len(text))
chunks = splitter.chunks("your document text", 500)

# After
splitter = TextSplitter.from_callback(lambda text: len(text), 500)
chunks = splitter.chunks("your document text")

Markdown:

# Before
splitter = MarkdownSplitter()
chunks = splitter.chunks("your document text", 500)

# After
splitter = MarkdownSplitter(500)
chunks = splitter.chunks("your document text")

Full Changelog: v0.11.0...v0.12.0

