v0.4.0 - New Chunk Capacity #12
benbrandt announced in Announcements
What's New
New Chunk Capacity (can now size chunks with Ranges)
New ChunkCapacity trait. When calling splitter.chunks() or splitter.chunk_indices(), the chunk_size argument has been replaced with chunk_capacity, which can be anything that implements the ChunkCapacity trait. This means that the following can now all be passed in:
usize
Range<usize>
RangeFrom<usize>
RangeFull
RangeInclusive<usize>
RangeTo<usize>
RangeToInclusive<usize>
This helps when you have a maximum chunk size but don't necessarily want to fill it all the way every time. In embedding use cases, for example, you may have some maximum context size but not want to muddy the embeddings with lots of neighboring semantic elements. You can now express this with a range, and each chunk will stop filling up once it has reached a size within the range.
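To make the range behavior concrete, here is a minimal standalone sketch of how a capacity could decide when a chunk is "full enough". It does not use the crate: `CapacitySketch`, `bounds`, `fits`, and `first_chunk` are hypothetical names invented for illustration, and the real ChunkCapacity trait may be defined differently. The sketch assumes that a plain usize means "fill as close to this size as possible", while a range accepts any size inside it.

```rust
use std::cmp::Ordering;
use std::ops::{Range, RangeFrom, RangeFull, RangeInclusive, RangeTo, RangeToInclusive};

/// Hypothetical stand-in for the crate's `ChunkCapacity` trait (not its real definition).
trait CapacitySketch {
    /// Inclusive (minimum, maximum) chunk sizes this capacity accepts.
    fn bounds(&self) -> (usize, usize);

    /// Less: the chunk may keep growing. Equal: size is acceptable. Greater: too big.
    fn fits(&self, size: usize) -> Ordering {
        let (min, max) = self.bounds();
        if size > max {
            Ordering::Greater
        } else if size < min {
            Ordering::Less
        } else {
            Ordering::Equal
        }
    }
}

// Assumption: a bare usize only "fits" when the chunk is exactly that size,
// so chunks are filled as far as possible without going over.
impl CapacitySketch for usize {
    fn bounds(&self) -> (usize, usize) { (*self, *self) }
}
impl CapacitySketch for Range<usize> {
    fn bounds(&self) -> (usize, usize) { (self.start, self.end.saturating_sub(1)) }
}
impl CapacitySketch for RangeFrom<usize> {
    fn bounds(&self) -> (usize, usize) { (self.start, usize::MAX) }
}
impl CapacitySketch for RangeFull {
    fn bounds(&self) -> (usize, usize) { (0, usize::MAX) }
}
impl CapacitySketch for RangeInclusive<usize> {
    fn bounds(&self) -> (usize, usize) { (*self.start(), *self.end()) }
}
impl CapacitySketch for RangeTo<usize> {
    fn bounds(&self) -> (usize, usize) { (0, self.end.saturating_sub(1)) }
}
impl CapacitySketch for RangeToInclusive<usize> {
    fn bounds(&self) -> (usize, usize) { (0, self.end) }
}

/// Greedily joins words with single spaces (sizes counted in characters),
/// stopping as soon as the chunk size falls within the capacity.
fn first_chunk(words: &[&str], capacity: &impl CapacitySketch) -> String {
    let mut chunk = String::new();
    for word in words {
        let candidate = if chunk.is_empty() {
            word.to_string()
        } else {
            format!("{} {}", chunk, word)
        };
        match capacity.fits(candidate.chars().count()) {
            Ordering::Greater => break,          // this word would overflow the maximum
            Ordering::Equal => return candidate, // inside the capacity: stop filling
            Ordering::Less => chunk = candidate, // below the minimum: keep filling
        }
    }
    chunk
}
```

With the words "one two three four five", a capacity of 20usize keeps filling until the next word would overflow, while 8usize..=20 stops at the first size that lands inside the range. That is the trade-off described above: the upper bound is still respected, but the chunk does not have to be packed to it.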
Simplified Chunk Sizing traits
Simplified ChunkSizer trait that allows for various calculations of chunk size. It no longer requires full validation logic, since that now happens within the TextSplitter itself.
Breaking Changes
ChunkValidator trait removed. Implement ChunkSizer instead, which just requires calculating chunk_size and not the full validation logic.
TokenCount trait removed. You can just use ChunkSizer directly instead.
TextChunks iterator is no longer pub.
This discussion was created from the release v0.4.0 - New Chunk Capacity.
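For anyone migrating off ChunkValidator or TokenCount, the shape of the change can be sketched roughly as follows. This is a standalone illustration, not the crate's code: `SizerSketch`, `CharCount`, `WordCount`, and `within_limit` are hypothetical names, and the method signature is an assumption based only on the note that a sizer now "just requires calculating chunk_size".

```rust
/// Hypothetical stand-in for the crate's `ChunkSizer` trait; the real signature may differ.
trait SizerSketch {
    /// Returns the size of a chunk in whatever unit this sizer measures.
    fn chunk_size(&self, chunk: &str) -> usize;
}

/// Measures chunks in Unicode scalar values (characters).
struct CharCount;

impl SizerSketch for CharCount {
    fn chunk_size(&self, chunk: &str) -> usize {
        chunk.chars().count()
    }
}

/// Measures chunks in whitespace-separated words, as a crude token stand-in.
struct WordCount;

impl SizerSketch for WordCount {
    fn chunk_size(&self, chunk: &str) -> usize {
        chunk.split_whitespace().count()
    }
}

/// Validation lives with the caller (the splitter, in the real crate),
/// so each sizer only has to report a number.
fn within_limit(sizer: &impl SizerSketch, chunk: &str, max: usize) -> bool {
    sizer.chunk_size(chunk) <= max
}
```

The point of the design is that a sizer only answers "how big is this chunk?", and the comparison against the capacity happens in one place instead of being reimplemented by every sizer.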