Releases: benbrandt/text-splitter

v0.20.0

14 Dec 20:50

Breaking Changes

  • Switched the backing Unicode segmentation implementation from unicode-segmentation to icu_segmenter. This brings modest performance gains and lets the crate leverage the official Unicode crate. Chunk behavior may differ slightly in some edge cases, so this is treated as a breaking change.

Full Changelog: v0.19.1...v0.20.0

v0.19.1

14 Dec 07:07

What's New

  • Python splitters have new chunk_all and chunk_all_indices methods so that multiple texts can be processed in parallel. (For Rust, you should already be able to do this with rayon.)
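
As a rough illustration of what processing multiple texts in parallel means on the Rust side, here is a minimal standard-library sketch. It makes several assumptions: `chunk_fixed` is a hypothetical stand-in for a real splitter (it cuts at fixed character counts, not semantic boundaries), `chunk_all` is a hypothetical helper named after the Python method and is not part of the crate's API, and scoped threads are swapped in for rayon so the example needs no dependencies. With rayon, the same shape would typically be a parallel iterator over the texts calling the real splitter.

```rust
use std::thread;

/// Hypothetical stand-in for a real splitter: cuts `text` into pieces of at
/// most `max_chars` characters, ignoring semantic boundaries.
fn chunk_fixed(text: &str, max_chars: usize) -> Vec<String> {
    assert!(max_chars > 0, "max_chars must be positive");
    let chars: Vec<char> = text.chars().collect();
    chars
        .chunks(max_chars)
        .map(|piece| piece.iter().collect())
        .collect()
}

/// Hypothetical helper: chunks each text on its own scoped thread and
/// returns the results in the same order as the inputs.
fn chunk_all(texts: &[&str], max_chars: usize) -> Vec<Vec<String>> {
    thread::scope(|s| {
        // Spawn every worker first so they run concurrently...
        let handles: Vec<_> = texts
            .iter()
            .map(|&text| s.spawn(move || chunk_fixed(text, max_chars)))
            .collect();
        // ...then join them in input order.
        handles
            .into_iter()
            .map(|h| h.join().expect("chunking thread panicked"))
            .collect()
    })
}
```

Calling `chunk_all(&["abcdef", "xy"], 4)` yields one list of chunks per input text. Spawning one thread per text keeps the sketch short; a thread pool such as rayon's is the better fit for large batches.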

Full Changelog: v0.19.0...v0.19.1

v0.19.0

28 Nov 10:49
9248906

Breaking Changes

  • Update to tokenizers v0.21

Full Changelog: v0.18.1...v0.19.0

v0.18.1

25 Oct 19:31
977b0c6

What's New

  • Ensure tokenizer sizers with truncation parameters count their overflow encodings by @Jeadie in #433

Full Changelog: v0.18.0...v0.18.1

v0.18.0

14 Oct 12:57
27fefce

Breaking Changes

  • Change supported tiktoken-rs version to 0.6.x

Full Changelog: v0.17.1...v0.18.0

v0.17.1

11 Oct 05:07
4eb54cf

What's New

  • Loosen regex crate version requirement

Full Changelog: v0.17.0...v0.17.1

v0.17.0

06 Oct 13:33
474f5a6

Breaking Changes

  • Support tree-sitter@v0.24 for CodeSplitters.
  • Due to a small change in the backing Unicode segmentation implementation, CodeSplitters also behave slightly differently (in my tests, mostly that semicolons are grouped more logically with the preceding content).

Full Changelog: v0.16.1...v0.17.0

v0.16.1

07 Sep 11:27
e53d5e2

What's New

  • Updates pulldown-cmark to v0.12.1 to address an issue with high CPU usage for certain Markdown elements.

Full Changelog: v0.16.0...v0.16.1

v0.16.0

02 Sep 21:32

Breaking Changes

  • Update to v0.23.0 of tree-sitter for CodeSplitter. There was a breaking change for language definitions, so this is also a breaking change for us, especially on the Python side, since we support passing the language in.
  • Minimum Python version for the Python bindings is now 3.9 since 3.8 will be EOL next month.

Python

Make sure to upgrade to the latest version of your tree-sitter language package.

Rust

Make sure to upgrade to the latest version of your tree-sitter language crate. These now have a LANGUAGE constant rather than a language() function.

// Before
tree_sitter_rust::language()
// After
tree_sitter_rust::LANGUAGE

What's New

  • MarkdownSplitter can better parse the Commonmark HS extension for Definition Lists.

Full Changelog: v0.15.0...v0.16.0

v0.15.0

11 Aug 05:21

What's New

  • Support version 0.20.0 of the tokenizers crate.

Python

  • No longer causes a segmentation fault when the wrong type is passed for tree-sitter languages. Fixes #265

Full Changelog: v0.14.1...v0.15.0