Balanced Chunks #117
Comments
Hi Ben! I think what you describe here with "balanced chunks" is actually something that nowadays is called "semantic chunking" (not to be confused with the purely punctuation- or grammar-based semantics you already implemented), with strategies like MMR, RAPTOR, or combinations of those. I wrote about this in more detail in do-me/SemanticFinder#45, but the core idea really boils down to:
Imho, if you spin your idea of splitting text more "human-like" further, you would necessarily end up with a similar strategy at some point.
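As a rough illustration of that idea (a sketch only, not code from either project): one common form of semantic chunking embeds each sentence and starts a new chunk wherever the similarity between consecutive embeddings drops below a threshold. The embeddings here are assumed to come from some external model.

```rust
// Illustrative semantic chunking: start a new chunk wherever the cosine
// similarity between consecutive sentence embeddings falls below a threshold.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

fn semantic_chunks<'a>(
    sentences: &[&'a str],
    embeddings: &[Vec<f32>],
    threshold: f32,
) -> Vec<Vec<&'a str>> {
    let mut chunks = Vec::new();
    let mut current = Vec::new();
    for (i, sentence) in sentences.iter().enumerate() {
        current.push(*sentence);
        // Break the chunk when the next sentence is semantically distant.
        if i + 1 < sentences.len() && cosine(&embeddings[i], &embeddings[i + 1]) < threshold {
            chunks.push(std::mem::take(&mut current));
        }
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

fn main() {
    let sentences = ["Cats purr.", "Dogs bark.", "Stocks fell today."];
    // Toy 2-d "embeddings" purely for demonstration.
    let embeddings = vec![vec![1.0, 0.1], vec![0.9, 0.2], vec![0.0, 1.0]];
    // Prints [["Cats purr.", "Dogs bark."], ["Stocks fell today."]]
    println!("{:?}", semantic_chunks(&sentences, &embeddings, 0.5));
}
```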
Hi @do-me, thanks for this. I had this more scoped here: https://github.com/users/benbrandt/projects/2?pane=issue&itemId=57419637 Basically, this issue is more about the current algorithm being "greedy": if we know we are within a certain semantic level, I think there might be a way to spread the content out more evenly between the chunks at that level. Does that make sense? I do agree that all of the semantic embedding approaches are interesting. I am hopeful my crate can be helpful in terms of making sure that all of the chunks being embedded fit within the context window of the embedding model, and in allowing for a two-pass approach... but it all needs to be explored :)
That is because I haven't really scoped it out yet 😆 That being said, I want to get a few other features done first, namely chunk overlap and code splitting, and then I would probably reassess priority from there. Hopefully that makes sense?
Sure, sounds really exciting! :) Looking forward to your future developments!
This would be a great feature. Maybe food for thought... Previously I'd have a bunch of documents whose total (token) size I'd pre-calculate, find the outliers, then chunk them down using the median of the non-outlier distribution. This way, the outliers get merged into the bell curve. Obviously you wouldn't want to chunk twice and "waste" the compute of the first cycle. So I think you can take the total size of the
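A minimal sketch of that pre-calculation idea, assuming per-document token counts are already available; the 1.5 × IQR outlier rule and all names are my own choices for illustration:

```rust
// Compute per-document token counts up front, drop IQR outliers, and use the
// median of what is left as the target chunk size, so oversized documents get
// re-chunked down toward the typical distribution.
fn target_chunk_size(mut token_counts: Vec<usize>) -> Option<usize> {
    if token_counts.is_empty() {
        return None;
    }
    token_counts.sort_unstable();
    // Nearest-rank quartiles over the sorted counts.
    let quantile =
        |p: f64| token_counts[((token_counts.len() - 1) as f64 * p).round() as usize] as f64;
    let (q1, q3) = (quantile(0.25), quantile(0.75));
    let iqr = q3 - q1;
    let (lo, hi) = (q1 - 1.5 * iqr, q3 + 1.5 * iqr);
    // Keep only the counts inside the whiskers, then take their median.
    let kept: Vec<usize> = token_counts
        .iter()
        .copied()
        .filter(|&c| (c as f64) >= lo && (c as f64) <= hi)
        .collect();
    kept.get(kept.len() / 2).copied()
}

fn main() {
    // Most documents cluster around ~1000 tokens; the last two are outliers.
    let counts = vec![800, 900, 950, 1000, 1000, 1050, 1100, 1200, 9000, 12000];
    // The outliers are dropped, so the target stays near the typical size.
    println!("{:?}", target_chunk_size(counts)); // Some(1000)
}
```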
If anyone is curious, I implemented this in my crate. I had previously done this in Python last year and submitted it as a PR to langchain, but it went nowhere. I wrote a comparison test here.
The smallest_token_size tells the story. As for being computationally heavy: yes, I assumed it would be much slower. I fought hard to get it anywhere near the benchmarks for this library, and it was still 10x slower than the posted benchmarks in some cases. (Although there currently seems to be an issue with text-splitter and tiktoken where my implementation is faster.) Once that is resolved, I don't think it would be possible to get it into the same ballpark as this library. However, the orphan chunk issue is a deal breaker for me for text. That said, this library is much more polished, and the code and Markdown splitters are things I will be using. Happy to discuss.
Motivation:
Right now, the chunking uses a greedy algorithm: each chunk is filled as full as possible before moving on to the next. For a given input, this would output the following chunks:
This may not always be desirable, since it can leave "orphaned" elements at the end.
In some cases it would be better to recognize that, at this semantic level, there is a more even split available:
Finding this is not straightforward in all cases. I attempted it in the past, but that attempt led to the algorithm generating several more chunks rather than finding the best split points. Because tokenization isn't always predictable, there may need to be some allowance for extra chunks being generated, but ideally we can find good split points within the current number of chunks.
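To make the difference concrete, here is a toy sketch (with made-up sentence token counts, not the crate's actual implementation) comparing a greedy fill against a "balanced" split that keeps the greedy chunk count but binary-searches the smallest capacity that still fits it:

```rust
// Greedy: fill each chunk as full as possible before starting the next.
fn greedy(sizes: &[usize], capacity: usize) -> Vec<Vec<usize>> {
    let mut chunks: Vec<Vec<usize>> = vec![Vec::new()];
    for &size in sizes {
        let current: usize = chunks.last().unwrap().iter().sum();
        // Start a new chunk only when the next element no longer fits.
        if current > 0 && current + size > capacity {
            chunks.push(Vec::new());
        }
        chunks.last_mut().unwrap().push(size);
    }
    chunks
}

// Balanced: keep the greedy chunk count, but binary-search the smallest
// capacity that still yields that many chunks; refilling greedily at that
// capacity spreads the content out more evenly.
fn balanced(sizes: &[usize], capacity: usize) -> Vec<Vec<usize>> {
    let chunk_count = greedy(sizes, capacity).len();
    let mut lo = *sizes.iter().max().unwrap_or(&capacity);
    let mut hi = sizes.iter().sum::<usize>().max(lo);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if greedy(sizes, mid).len() <= chunk_count {
            hi = mid;
        } else {
            lo = mid + 1;
        }
    }
    greedy(sizes, lo)
}

fn main() {
    // Hypothetical token counts for seven sentences, with a capacity of 12.
    let sizes = [4, 4, 4, 4, 4, 4, 1];
    println!("greedy:   {:?}", greedy(&sizes, 12)); // [[4, 4, 4], [4, 4, 4], [1]]
    println!("balanced: {:?}", balanced(&sizes, 12)); // [[4, 4], [4, 4], [4, 4, 1]]
}
```

For these sizes, the greedy pass produces two full chunks plus an orphaned chunk of size 1, while the balanced pass spreads the same content across three chunks of sizes 8, 8, and 9.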
Todo:
TextSplitter and MarkdownSplitter should both have an opt-in method of enabling balanced chunking (this behavior may be better in all scenarios, but it is unclear, and we probably need to pore over the snapshot diffs to see how much of a difference it makes).
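As food for thought, one hypothetical shape for that opt-in (the ChunkOptions type and with_balanced method are invented for illustration and are not part of text-splitter's API):

```rust
// Hypothetical configuration shape for the opt-in (invented names; not
// text-splitter's actual API): balanced chunking stays off unless requested.
#[derive(Debug)]
struct ChunkOptions {
    capacity: usize,
    balanced: bool,
}

impl ChunkOptions {
    fn new(capacity: usize) -> Self {
        Self { capacity, balanced: false }
    }

    // Builder-style toggle so existing callers keep the greedy behavior.
    fn with_balanced(mut self, balanced: bool) -> Self {
        self.balanced = balanced;
        self
    }
}

fn main() {
    let opts = ChunkOptions::new(1000).with_balanced(true);
    println!("{:?}", opts); // ChunkOptions { capacity: 1000, balanced: true }
}
```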