Update changelog and documentation

benbrandt · May 23, 2023 · a534137
1 parent 66368c9
Showing 4 changed files with 24 additions and 26 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,11 @@
 # Changelog
 
+## v0.3.1
+
+### What's New
+
+- Handle more levels of newlines. Will now find the largest newline sequence in the text, and then work back from there, treating each consecutive newline sequence length as its own semantic level.
+
 ## v0.3.0
 
 ### Breaking Changes

diff --git a/Cargo.toml b/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "text-splitter"
-version = "0.3.0"
+version = "0.3.1"
 authors = ["Ben Brandt <benjamin.j.brandt@gmail.com>"]
 edition = "2021"
 description = "Split text into semantic chunks, up to a desired chunk size. Supports calculating length by characters and tokens (when used with large language models)."
@@ -42,8 +42,3 @@ opt-level = 3
 [profile.dev.package.similar]
 opt-level = 3
 
-[profile.dev.package.tiktoken-rs]
-opt-level = 3
-
-[profile.dev.package.tokenizers]
-opt-level = 3
diff --git a/README.md b/README.md
@@ -29,7 +29,7 @@ let chunks = splitter.chunks("your document text", max_characters);
 ### By Tokens
 
 ```rust
-use text_splitter::{TextSplitter};
+use text_splitter::TextSplitter;
 // Can also use tiktoken-rs, or anything that implements the TokenCount
 // trait from the text_splitter crate.
 use tokenizers::Tokenizer;
@@ -52,14 +52,13 @@ To preserve as much semantic meaning within a chunk as possible, a recursive app
    - Yes. Merge as many of these neighboring sections into a chunk as possible to maximize chunk length.
    - No. Split by the next level and repeat.
 
-The boundaries used to split the text if using the top-level `split` method, in descending length:
+The boundaries used to split the text if using the top-level `chunks` method, in descending length:
 
-1. 2 or more newlines (Newline is `\r\n`, `\n`, or `\r`)
-2. 1 newline
-3. [Unicode Sentence Boundaries](https://www.unicode.org/reports/tr29/#Sentence_Boundaries)
-4. [Unicode Word Boundaries](https://www.unicode.org/reports/tr29/#Word_Boundaries)
-5. [Unicode Grapheme Cluster Boundaries](https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries)
-6. Characters
+1. Descending sequence length of newlines. (Newline is `\r\n`, `\n`, or `\r`) Each unique length of consecutive newline sequences is treated as its own semantic level.
+2. [Unicode Sentence Boundaries](https://www.unicode.org/reports/tr29/#Sentence_Boundaries)
+3. [Unicode Word Boundaries](https://www.unicode.org/reports/tr29/#Word_Boundaries)
+4. [Unicode Grapheme Cluster Boundaries](https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries)
+5. Characters
 
 Splitting doesn't occur below the character level, otherwise you could get partial bytes of a char, which may not be a valid unicode str.
 

diff --git a/src/lib.rs b/src/lib.rs
@@ -55,12 +55,11 @@ To preserve as much semantic meaning within a chunk as possible, a recursive app
 
 The boundaries used to split the text if using the top-level `chunks` method, in descending length:
 
-1. 2 or more newlines (Newline is `\r\n`, `\n`, or `\r`)
-2. 1 newline
-3. [Unicode Sentence Boundaries](https://www.unicode.org/reports/tr29/#Sentence_Boundaries)
-4. [Unicode Word Boundaries](https://www.unicode.org/reports/tr29/#Word_Boundaries)
-5. [Unicode Grapheme Cluster Boundaries](https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries)
-6. Characters
+1. Descending sequence length of newlines. (Newline is `\r\n`, `\n`, or `\r`) Each unique length of consecutive newline sequences is treated as its own semantic level.
+2. [Unicode Sentence Boundaries](https://www.unicode.org/reports/tr29/#Sentence_Boundaries)
+3. [Unicode Word Boundaries](https://www.unicode.org/reports/tr29/#Word_Boundaries)
+4. [Unicode Grapheme Cluster Boundaries](https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries)
+5. Characters
 
 Splitting doesn't occur below the character level, otherwise you could get partial bytes of a char, which may not be a valid unicode str.
 
@@ -188,12 +187,11 @@ where
     ///
     /// The boundaries used to split the text if using the top-level `split` method, in descending length:
     ///
-    /// 1. 2 or more newlines (Newline is `\r\n`, `\n`, or `\r`)
-    /// 2. 1 newline
-    /// 3. [Unicode Sentence Boundaries](https://www.unicode.org/reports/tr29/#Sentence_Boundaries)
-    /// 4. [Unicode Word Boundaries](https://www.unicode.org/reports/tr29/#Word_Boundaries)
-    /// 5. [Unicode Grapheme Cluster Boundaries](https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries)
-    /// 6. Characters
+    /// 1. Descending sequence length of newlines. (Newline is `\r\n`, `\n`, or `\r`) Each unique length of consecutive newline sequences is treated as its own semantic level.
+    /// 2. [Unicode Sentence Boundaries](https://www.unicode.org/reports/tr29/#Sentence_Boundaries)
+    /// 3. [Unicode Word Boundaries](https://www.unicode.org/reports/tr29/#Word_Boundaries)
+    /// 4. [Unicode Grapheme Cluster Boundaries](https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries)
+    /// 5. Characters
     ///
     /// Splitting doesn't occur below the character level, otherwise you could get partial
     /// bytes of a char, which may not be a valid unicode str.
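
Note on the newline levels described in the v0.3.1 changelog entry: the sketch below shows one way to derive those levels from a text. It is not the crate's actual implementation. The `newline_levels` helper is a hypothetical name, it uses only the standard library, and it assumes that `\r\n` counts as a single newline, as the README's newline definition suggests. It collects every distinct run length of consecutive newlines and returns them longest first, which is the order in which the splitter would try them.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper (not part of text-splitter): returns each distinct
/// length of consecutive newline runs in `text`, longest first.
/// Assumes "\r\n", "\n", and "\r" each count as one newline.
fn newline_levels(text: &str) -> Vec<usize> {
    let mut levels = BTreeSet::new();
    let mut run = 0usize;
    let mut chars = text.chars().peekable();

    while let Some(c) = chars.next() {
        match c {
            '\r' => {
                // Treat "\r\n" as a single newline by consuming the '\n'.
                if chars.peek() == Some(&'\n') {
                    chars.next();
                }
                run += 1;
            }
            '\n' => run += 1,
            _ => {
                // A non-newline character ends the current run, if any.
                if run > 0 {
                    levels.insert(run);
                    run = 0;
                }
            }
        }
    }
    // Record a run that reaches the end of the text.
    if run > 0 {
        levels.insert(run);
    }

    // BTreeSet iterates in ascending order; reverse for longest first.
    levels.into_iter().rev().collect()
}

fn main() {
    let text = "Title\n\n\nFirst paragraph.\nStill first.\n\nSecond paragraph.\r\n\r\nThird.";
    // Each entry is one semantic level, tried before sentence boundaries.
    println!("{:?}", newline_levels(text));
}
```

With this input the runs have lengths 3, 1, 2, and 2, so the helper yields three levels. Under the v0.3.0 rules the runs of 3 and 2 newlines would have shared the single "2 or more newlines" level.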