Update changelog and documentation
benbrandt committed May 23, 2023

1 parent 66368c9 commit a534137
Showing 4 changed files with 24 additions and 26 deletions.
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,11 @@
 # Changelog
 
+## v0.3.1
+
+### What's New
+
+- Handle more levels of newlines. Will now find the largest newline sequence in the text, and then work back from there, treating each consecutive newline sequence length as its own semantic level.
+
 ## v0.3.0
 
 ### Breaking Changes
7 changes: 1 addition & 6 deletions Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "text-splitter"
-version = "0.3.0"
+version = "0.3.1"
 authors = ["Ben Brandt <benjamin.j.brandt@gmail.com>"]
 edition = "2021"
 description = "Split text into semantic chunks, up to a desired chunk size. Supports calculating length by characters and tokens (when used with large language models)."
@@ -42,8 +42,3 @@ opt-level = 3
 [profile.dev.package.similar]
 opt-level = 3
 
-[profile.dev.package.tiktoken-rs]
-opt-level = 3
-
-[profile.dev.package.tokenizers]
-opt-level = 3
15 changes: 7 additions & 8 deletions README.md
@@ -29,7 +29,7 @@ let chunks = splitter.chunks("your document text", max_characters);
 ### By Tokens
 
 ```rust
-use text_splitter::{TextSplitter};
+use text_splitter::TextSplitter;
 // Can also use tiktoken-rs, or anything that implements the TokenCount
 // trait from the text_splitter crate.
 use tokenizers::Tokenizer;
@@ -52,14 +52,13 @@ To preserve as much semantic meaning within a chunk as possible, a recursive app
 - Yes. Merge as many of these neighboring sections into a chunk as possible to maximize chunk length.
 - No. Split by the next level and repeat.
 
-The boundaries used to split the text if using the top-level `split` method, in descending length:
+The boundaries used to split the text if using the top-level `chunks` method, in descending length:
 
-1. 2 or more newlines (Newline is `\r\n`, `\n`, or `\r`)
-2. 1 newline
-3. [Unicode Sentence Boundaries](https://www.unicode.org/reports/tr29/#Sentence_Boundaries)
-4. [Unicode Word Boundaries](https://www.unicode.org/reports/tr29/#Word_Boundaries)
-5. [Unicode Grapheme Cluster Boundaries](https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries)
-6. Characters
+1. Descending sequence length of newlines. (Newline is `\r\n`, `\n`, or `\r`) Each unique length of consecutive newline sequences is treated as its own semantic level.
+2. [Unicode Sentence Boundaries](https://www.unicode.org/reports/tr29/#Sentence_Boundaries)
+3. [Unicode Word Boundaries](https://www.unicode.org/reports/tr29/#Word_Boundaries)
+4. [Unicode Grapheme Cluster Boundaries](https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries)
+5. Characters
 
 Splitting doesn't occur below the character level, otherwise you could get partial bytes of a char, which may not be a valid unicode str.
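
Review note: the newline level in the list above can be pictured with a small standalone sketch. It is not the crate's implementation; `newline_levels` is a hypothetical helper written for illustration, using only the standard library. It collects every distinct length of consecutive newline runs and returns them largest first, counting `\r\n` as one newline.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper (not part of text-splitter): returns the distinct
/// lengths of consecutive newline runs in `text`, largest first.
/// `\r\n`, `\n`, and `\r` each count as a single newline.
fn newline_levels(text: &str) -> Vec<usize> {
    let mut levels = BTreeSet::new();
    let mut run = 0usize;
    let mut chars = text.chars().peekable();

    while let Some(c) = chars.next() {
        match c {
            '\r' => {
                // Consume the `\n` of a `\r\n` pair so the pair counts once.
                if chars.peek() == Some(&'\n') {
                    chars.next();
                }
                run += 1;
            }
            '\n' => run += 1,
            _ => {
                // A non-newline character ends the current run, if any.
                if run > 0 {
                    levels.insert(run);
                    run = 0;
                }
            }
        }
    }
    // Record a run that reaches the end of the text.
    if run > 0 {
        levels.insert(run);
    }

    // BTreeSet iterates in ascending order; reverse it for descending levels.
    levels.into_iter().rev().collect()
}
```

For a text containing one run of three newlines, one of two, and one single newline, the helper yields the levels 3, 2, 1. Under the scheme described in the changelog, a splitter would try the longest run first and fall back to shorter runs before moving on to sentence boundaries.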

22 changes: 10 additions & 12 deletions src/lib.rs
@@ -55,12 +55,11 @@ To preserve as much semantic meaning within a chunk as possible, a recursive app
 The boundaries used to split the text if using the top-level `chunks` method, in descending length:
-1. 2 or more newlines (Newline is `\r\n`, `\n`, or `\r`)
-2. 1 newline
-3. [Unicode Sentence Boundaries](https://www.unicode.org/reports/tr29/#Sentence_Boundaries)
-4. [Unicode Word Boundaries](https://www.unicode.org/reports/tr29/#Word_Boundaries)
-5. [Unicode Grapheme Cluster Boundaries](https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries)
-6. Characters
+1. Descending sequence length of newlines. (Newline is `\r\n`, `\n`, or `\r`) Each unique length of consecutive newline sequences is treated as its own semantic level.
+2. [Unicode Sentence Boundaries](https://www.unicode.org/reports/tr29/#Sentence_Boundaries)
+3. [Unicode Word Boundaries](https://www.unicode.org/reports/tr29/#Word_Boundaries)
+4. [Unicode Grapheme Cluster Boundaries](https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries)
+5. Characters
 Splitting doesn't occur below the character level, otherwise you could get partial bytes of a char, which may not be a valid unicode str.
@@ -188,12 +187,11 @@ where
 ///
 /// The boundaries used to split the text if using the top-level `split` method, in descending length:
 ///
-/// 1. 2 or more newlines (Newline is `\r\n`, `\n`, or `\r`)
-/// 2. 1 newline
-/// 3. [Unicode Sentence Boundaries](https://www.unicode.org/reports/tr29/#Sentence_Boundaries)
-/// 4. [Unicode Word Boundaries](https://www.unicode.org/reports/tr29/#Word_Boundaries)
-/// 5. [Unicode Grapheme Cluster Boundaries](https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries)
-/// 6. Characters
+/// 1. Descending sequence length of newlines. (Newline is `\r\n`, `\n`, or `\r`) Each unique length of consecutive newline sequences is treated as its own semantic level.
+/// 2. [Unicode Sentence Boundaries](https://www.unicode.org/reports/tr29/#Sentence_Boundaries)
+/// 3. [Unicode Word Boundaries](https://www.unicode.org/reports/tr29/#Word_Boundaries)
+/// 4. [Unicode Grapheme Cluster Boundaries](https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries)
+/// 5. Characters
 ///
 /// Splitting doesn't occur below the character level, otherwise you could get partial
 /// bytes of a char, which may not be a valid unicode str.
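
Review note: the remark about partial bytes of a char can be checked with the standard library alone. The sketch below is illustrative and not taken from the crate; `char_split_offsets` and `split_prefix` are hypothetical names. It lists the interior byte offsets where a `&str` can be cut safely and shows that `str::get` refuses any offset that falls inside a multi-byte character.

```rust
/// Hypothetical helper (not part of text-splitter): interior byte offsets
/// at which `text` can be split without cutting a `char` in half.
fn char_split_offsets(text: &str) -> Vec<usize> {
    // `char_indices` yields the starting byte offset of every char;
    // offset 0 is skipped because it would produce an empty prefix.
    text.char_indices().map(|(i, _)| i).skip(1).collect()
}

/// Hypothetical helper: returns the prefix ending at byte `offset`,
/// or `None` if `offset` is out of range or not on a char boundary.
fn split_prefix(text: &str, offset: usize) -> Option<&str> {
    // `str::get` performs the boundary check instead of panicking.
    text.get(..offset)
}
```

With the input `"a\u{e9}b"`, where `\u{e9}` occupies two bytes in UTF-8, the valid interior offsets are 1 and 3. `split_prefix` returns `None` for offset 2, and `std::str::from_utf8` rejects the first two bytes of the same string for the same reason: they end in the middle of a character.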
