Releases · benbrandt/text-splitter

14 Dec 20:50

benbrandt

v0.20.0

919fba6

Latest

Breaking Changes

Switched the backing Unicode segmentation implementation from unicode-segmentation to icu_segmenter. This brings modest performance gains and lets the crate build on the official Unicode implementation. Chunk behavior may differ slightly in some edge cases, so this is treated as a breaking change.

Full Changelog: v0.19.1...v0.20.0

14 Dec 07:07

benbrandt

v0.19.1

da61ef2

What's New

Python splitters have new chunk_all and chunk_all_indices methods so that multiple texts can be processed in parallel. (In Rust, you should already be able to do this with rayon.)
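To illustrate the idea of chunking several texts in parallel while keeping results in input order, here is a minimal standard-library sketch. It does not use the text-splitter or rayon APIs: `chunk_words` is a hypothetical stand-in for a real splitter, and `chunk_all_parallel` is an illustrative name, not a function from the crate. It spawns one scoped thread per text, which is fine for a handful of inputs; a thread pool such as rayon's would be the better fit for large batches.

```rust
use std::thread;

/// Hypothetical stand-in for a real splitter (not the text-splitter API):
/// greedily packs whitespace-separated words into chunks of at most
/// `max_chars` characters. A single word longer than `max_chars` is kept whole.
fn chunk_words(text: &str, max_chars: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    for word in text.split_whitespace() {
        let word_len = word.chars().count();
        // Length of `current` if this word were appended with a separating space.
        let needed = if current.is_empty() {
            word_len
        } else {
            current.chars().count() + 1 + word_len
        };
        if needed > max_chars && !current.is_empty() {
            chunks.push(std::mem::take(&mut current));
        }
        if !current.is_empty() {
            current.push(' ');
        }
        current.push_str(word);
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

/// Chunks every text on its own scoped thread and returns the results
/// in the same order as the inputs.
fn chunk_all_parallel(texts: &[&str], max_chars: usize) -> Vec<Vec<String>> {
    thread::scope(|scope| {
        // Spawn all workers first so they run concurrently.
        let handles: Vec<_> = texts
            .iter()
            .copied()
            .map(|text| scope.spawn(move || chunk_words(text, max_chars)))
            .collect();
        // Joining in spawn order preserves input order.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("chunking thread panicked"))
            .collect()
    })
}
```

Calling `chunk_all_parallel(&["one two three", "alpha beta"], 7)` yields one list of chunks per input text, in the order the texts were passed.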

Full Changelog: v0.19.0...v0.19.1

28 Nov 10:49

benbrandt

v0.19.0

9248906

Breaking Changes

Update to tokenizers v0.21

Full Changelog: v0.18.1...v0.19.0

25 Oct 19:31

benbrandt

v0.18.1

977b0c6

What's New

Ensure tokenizer sizers with truncation parameters count their overflow encodings by @Jeadie in #433
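As a rough illustration of why overflow matters for sizing, the toy model below shows the difference between counting only the tokens kept after truncation and counting the overflow pieces as well. Everything here is hypothetical: `ToyEncoding`, `encode_truncated`, `truncated_len`, and `total_len` are made-up names, not the tokenizers crate types or the change made in #433, and the model assumes truncation without any stride overlap.

```rust
/// Hypothetical, simplified model of a tokenizer encoding
/// (not the `tokenizers` crate type).
struct ToyEncoding {
    /// Token ids kept after truncation.
    ids: Vec<u32>,
    /// Pieces cut off by truncation, each modelled as its own encoding.
    overflowing: Vec<ToyEncoding>,
}

/// Toy truncation: keeps the first `max_len` tokens and stores the rest
/// as overflow pieces of at most `max_len` tokens each (no stride overlap).
fn encode_truncated(tokens: &[u32], max_len: usize) -> ToyEncoding {
    assert!(max_len > 0, "max_len must be non-zero");
    let mut pieces = tokens.chunks(max_len);
    let ids = pieces.next().map(|piece| piece.to_vec()).unwrap_or_default();
    let overflowing = pieces
        .map(|piece| ToyEncoding {
            ids: piece.to_vec(),
            overflowing: Vec::new(),
        })
        .collect();
    ToyEncoding { ids, overflowing }
}

/// Counts only the tokens that survived truncation.
fn truncated_len(encoding: &ToyEncoding) -> usize {
    encoding.ids.len()
}

/// Counts the kept tokens plus every token in the overflow pieces.
fn total_len(encoding: &ToyEncoding) -> usize {
    encoding.ids.len() + encoding.overflowing.iter().map(total_len).sum::<usize>()
}
```

In this model, a sizer based on `truncated_len` would never report more than `max_len` tokens, so an oversized chunk would appear to fit; `total_len` reflects the full input.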

New Contributors

@Jeadie made their first contribution in #433

Full Changelog: v0.18.0...v0.18.1

14 Oct 12:57

benbrandt

v0.18.0

27fefce

Breaking Changes

Change supported tiktoken-rs version to 0.6.x

Full Changelog: v0.17.1...v0.18.0

11 Oct 05:07

benbrandt

v0.17.1

4eb54cf

What's New

Loosen regex crate version requirement

Full Changelog: v0.17.0...v0.17.1

06 Oct 13:33

benbrandt

v0.17.0

474f5a6

Breaking Changes

Support tree-sitter@v0.24 for CodeSplitters.
Due to a slight change in the backing Unicode segmentation implementation, CodeSplitter behavior also shifts slightly (in my tests, mostly that semicolons are grouped more logically with the preceding content).

Full Changelog: v0.16.1...v0.17.0

07 Sep 11:27

benbrandt

v0.16.1

e53d5e2

What's New

Updates pulldown-cmark to v0.12.1 to address an issue with high CPU usage for certain Markdown elements.

Full Changelog: v0.16.0...v0.16.1

02 Sep 21:32

benbrandt

v0.16.0

79a8137

Breaking Changes

Update to v0.23.0 of tree-sitter for CodeSplitter. There was a breaking change for language definitions, so this is also a breaking change for us, especially on the Python side, since we support passing the language in.
Minimum Python version for the Python bindings is now 3.9 since 3.8 will be EOL next month.

Python

Make sure to upgrade to the latest version of your tree-sitter language package.

Rust

Make sure to upgrade to the latest version of your tree-sitter language crate. These now have a LANGUAGE constant rather than a language() function.

// Before
tree_sitter_rust::language()
// After
tree_sitter_rust::LANGUAGE

What's New

MarkdownSplitter can better parse the Commonmark HS extension for Definition Lists.

Full Changelog: v0.15.0...v0.16.0

11 Aug 05:21

benbrandt

v0.15.0

67b20aa

What's New

Support version 0.20.0 of the tokenizers crate.

Python

No longer causes a segmentation fault when the wrong type is passed for a tree-sitter language. Fixes #265

Full Changelog: v0.14.1...v0.15.0
