Heuristics for very large documents #184
Comments
Update: it ran all night without success. It still seemed to be running this morning, so I aborted.
Hmm, ok. I am traveling today, but I will try to take a look this weekend. It does seem strange, and I'm not sure if there is something about loading it from Parquet that is causing an issue... But I assume not, since the others are working fine.
Hmm, also: the Rust crate assumes the entire string is in memory. The Rust side at least only ever uses references to the single string, but on the Python side it eventually has to allocate a string for every chunk so that Python owns them... That may be another issue. Just trying to think about what could be different, since I assume it is making some progress... I'll also try to see if it runs any differently straight from Rust rather than through the Python bindings.
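Roughly the difference being described, as a minimal sketch (illustrative function names, not the crate's real API):

```rust
// Illustrative sketch of the ownership difference, not the crate's real API.

/// Rust side: chunks can stay zero-copy slices borrowing from the one big string.
fn chunks_borrowed(text: &str, chunk_len: usize) -> Vec<&str> {
    text.as_bytes()
        .chunks(chunk_len)
        // For illustration only: assumes split points land on UTF-8 boundaries.
        .map(|bytes| std::str::from_utf8(bytes).expect("valid UTF-8"))
        .collect()
}

/// Binding side: each chunk has to become an owned String so Python can own
/// and free it independently of the original buffer.
fn chunks_owned(text: &str, chunk_len: usize) -> Vec<String> {
    chunks_borrowed(text, chunk_len)
        .into_iter()
        .map(str::to_owned) // one allocation + copy per chunk
        .collect()
}
```

So even if the Rust side never copies the input, the bindings pay one allocation and one copy per chunk once the results cross into Python.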
It's not about the Parquet file; I just exported it as Parquet for convenience. Yes, the other texts are working fine as well. It really seems to be the size of the text that somehow causes something to fail, or rather blows up the processing exponentially. Also, it fits nicely in memory, as it's only 37 MB.
Ok good to know, just wasn't something I had tried before so I wasn't sure. Thanks!
@do-me So I have been doing some tests. If I bump the chunk size up enough, it does finish, and it does seem to progress even at smaller sizes, including the one you posted. I ran it in Rust so I could output incremental progress with each chunk, and it is moving, just slowly. The issue I am seeing is that there are no newlines anywhere in this file. So for each chunk, it is binary searching across the entire file at sentence granularity, which at this size is indeed unfortunate... I will need to look into this some more. In some areas I assume that I have found a granularity that is too big to fit, and do a binary search over sections of the lower granularity, but only up to the byte offset of the end of the granularity I know is too big. But I am now thinking that it might be better to only do a binary search if I can reasonably limit the search space. If I can't, then I am searching across the entire document, and due to the expensive nature of tokenization, the binary search itself can become quite expensive and may not deliver the speed improvement it does when I can limit the search space. I could also change the binary search to not be a strict binary search... I will have to play around with it. I will give this a try, and run my benchmarks again on the happy path, where I often do have an upper bound on the search space, to make sure there aren't any regressions.
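To make the tradeoff concrete, here is a rough sketch of that bounded binary search (illustrative only, not the crate's actual implementation; `fits` stands in for the expensive token-counting check, and `upper_bound` is the limit on the search space):

```rust
/// Illustrative only: how many leading sentences fit in one chunk.
/// `boundaries` holds the byte offsets of sentence ends within `text`, and
/// `fits` stands in for the expensive check (e.g. counting tokens of a slice).
fn sentences_that_fit(
    text: &str,
    boundaries: &[usize],
    upper_bound: Option<usize>, // byte offset already known to be too big, if any
    fits: impl Fn(&str) -> bool,
) -> usize {
    // Without an upper bound, the search space is every sentence in the
    // document, so early probes tokenize slices close to the full input.
    let end = match upper_bound {
        Some(limit) => boundaries.partition_point(|&b| b <= limit),
        None => boundaries.len(),
    };

    let mut lo = 0; // number of sentences known to fit
    let mut hi = end;
    while lo < hi {
        let mid = lo + (hi - lo + 1) / 2; // bias upward so the loop makes progress
        if fits(&text[..boundaries[mid - 1]]) {
            lo = mid;
        } else {
            hi = mid - 1;
        }
    }
    lo
}
```

On an 86-million-character single line with no upper bound, the first probes hand nearly the whole document to the tokenizer, which is where the cost blows up.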
I see, thanks for looking into this! So from what you describe, this could happen on other semantic levels as well. The logic to stream the already-identified chunks would be awesome, especially for a potential Wasm version, as I could already start calculating embeddings! Indeed, we will work on the data side as well and try to include the line breaks.
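Something like the following is the kind of streaming meant here, as a minimal sketch (hypothetical helper, not the crate's API): chunks are yielded as soon as they are found, so a consumer can start embedding right away.

```rust
/// Illustrative lazy chunker: yields each chunk as soon as it is found,
/// instead of collecting the whole Vec before returning anything.
/// `next_boundary` stands in for whatever logic picks the end of a chunk.
fn stream_chunks<'a, F>(text: &'a str, next_boundary: F) -> impl Iterator<Item = &'a str>
where
    F: Fn(&str) -> usize,
{
    let mut rest = text;
    std::iter::from_fn(move || {
        if rest.is_empty() {
            return None;
        }
        // Assumes `next_boundary` returns a valid UTF-8 character boundary.
        let end = next_boundary(rest).clamp(1, rest.len());
        let (chunk, tail) = rest.split_at(end);
        rest = tail;
        Some(chunk)
    })
}

// A consumer could start work on each chunk immediately, e.g.:
// for chunk in stream_chunks(&huge_text, |s| s.len().min(1_000)) {
//     embed(chunk); // hypothetical downstream step, e.g. in a Wasm build
// }
```

In a Wasm build, that would let embedding start while the splitter is still working through the rest of the document.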
@do-me Just want to give you a heads-up so you know I didn't forget. I've been trying dozens of implementations, all with different tradeoffs, but I think I found one I am happy with. I was able to split this file in less than a second with my latest branch 🎉
Awesome @benbrandt, thank you very much for your efforts! Performance: [benchmark results not preserved in this copy]
Problem
I've been working on legal documents lately and am indexing 300k documents. Everything works perfectly fine with normal-sized docs (dozens of pages). However, when documents become very large, like the example below with 86,000,000 characters, it takes an eternity. I actually killed the process after 1 hour of processing and have no clue how long it might even take. I will let it run overnight and see whether it works eventually. The one CPU core in use is at 100%, so I take this as a sign that the code is not failing unexpectedly or anything similar.
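For anyone who wants to get a feel for the workload without the real file, here is a self-contained sketch that builds a comparably sized single-line input; `split_into_chunks` is a hypothetical placeholder, not the actual splitter call used for indexing.

```rust
use std::time::Instant;

fn main() {
    // Build one ~84-million-character string with no newlines, similar in
    // shape to the problematic document (the real one has ~86M characters).
    let text: String = "Lorem ipsum dolor sit amet. ".repeat(3_000_000);

    let start = Instant::now();
    let chunks = split_into_chunks(&text, 1_000);
    println!("{} chunks in {:?}", chunks.len(), start.elapsed());
}

// Hypothetical placeholder: in the real run this is the semantic splitter
// call that keeps one CPU core at 100% for over an hour.
fn split_into_chunks(text: &str, max_chars: usize) -> Vec<&str> {
    text.as_bytes()
        .chunks(max_chars)
        .map(|b| std::str::from_utf8(b).expect("ASCII input is always valid UTF-8"))
        .collect()
}
```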
Possible solutions
Here is my example (37 MB Parquet file):
I couldn't upload the file here in the issue, so I'm hosting it on my Drive here.
In any case, do you have any other ideas on how to make it work?