# v0.8.0 - Performance Improvements
## What's New
Chunk generation now requires significantly fewer allocations, which should improve performance for most use cases. This was achieved both by reusing pre-allocated collections and by memoizing chunk size calculations, since sizing is often the bottleneck and tokenizer libraries tend to be very allocation-heavy!
Benchmarks show:
- 20-40% fewer allocations caused by the core algorithm.
- Up to 20% fewer allocations when using tokenizers to calculate chunk sizes.
- In some cases, especially with Markdown, these improvements can also result in up to 20% faster chunk generation.
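The memoization piece is easiest to see in a small sketch. Everything below (`MemoizedSizer`, the word-count closure) is a hypothetical stand-in rather than the crate's internal code; it only illustrates the technique of caching size calculations keyed by byte range, so repeated lookups for the same candidate chunk skip the expensive call:

```rust
use std::collections::HashMap;

/// Hypothetical memoized wrapper around an expensive chunk-size function,
/// keyed by the byte range of the candidate chunk within the source text.
struct MemoizedSizer<F: Fn(&str) -> usize> {
    size_fn: F,
    cache: HashMap<(usize, usize), usize>,
}

impl<F: Fn(&str) -> usize> MemoizedSizer<F> {
    fn new(size_fn: F) -> Self {
        Self {
            size_fn,
            cache: HashMap::new(),
        }
    }

    /// Returns the size of `text[start..end]`, computing it at most once per range.
    fn size(&mut self, text: &str, start: usize, end: usize) -> usize {
        let size_fn = &self.size_fn;
        *self
            .cache
            .entry((start, end))
            .or_insert_with(|| size_fn(&text[start..end]))
    }
}

fn main() {
    // Stand-in for an allocation-heavy tokenizer call.
    let mut sizer = MemoizedSizer::new(|s: &str| s.split_whitespace().count());
    let text = "the same candidate range is often sized repeatedly";

    // The second call with the same range hits the cache instead of re-tokenizing.
    assert_eq!(sizer.size(text, 0, 8), 2);
    assert_eq!(sizer.size(text, 0, 8), 2);
}
```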
## Breaking Changes
- There was a bug in the `MarkdownSplitter` logic that caused some strange split points; this has been fixed.
- The `Text` semantic level in `MarkdownSplitter` has been merged with inline elements to also find better split points inside content.
- Fixed a bug that could occasionally cause the algorithm to use a lower semantic level than necessary. This mostly impacted the `MarkdownSplitter`, but there were some cases of different behavior in the `TextSplitter` as well when chunks are not trimmed.
All of the above can cause different chunks to be output than before, depending on the text. So even though these are bug fixes that restore intended behavior, they are being treated as a major version bump.
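Because chunk boundaries may shift, it can be worth regenerating chunks after upgrading and comparing them against previously stored output. A minimal sketch, assuming the `markdown` cargo feature and the `chunks(text, max_characters)` / `with_trim_chunks` API shape of this release line:

```rust
use text_splitter::MarkdownSplitter;

fn main() {
    let text = "# Heading\n\nSome paragraph of content, with *inline* elements, to split.";

    // Trimming affects whether the semantic-level fix is observable,
    // so check both configurations if you depend on untrimmed chunks.
    let splitter = MarkdownSplitter::default().with_trim_chunks(true);

    // Regenerate and diff against chunks stored before the upgrade to see
    // whether the v0.8.0 fixes changed any split points for your texts.
    let chunks: Vec<&str> = splitter.chunks(text, 40).collect();
    for chunk in &chunks {
        println!("{chunk:?}");
    }
}
```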
Full Changelog: v0.7.0...v0.8.0