Heuristics for very large documents #184
Comments
Update: it ran all night without success. It still seemed to be running this morning, so I aborted.
Hmm, ok. I am traveling today, but I will try to take a look this weekend. It does seem strange, and I'm not sure if there is something about loading it from Parquet that is causing an issue... But I assume not, since the others are working fine.
Hmm, also: the Rust crate assumes the entire string is in memory. The Rust side at least only ever uses references to the single string, but on the Python side it eventually has to allocate a string for every chunk so that Python owns them... That may be another issue. Just trying to think about what could be different, since I assume it is making some progress... I'll also try to see if it runs any differently straight from Rust rather than through the Python bindings.
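Roughly the difference being described, as a minimal sketch (illustrative function names, not the crate's real API):

```rust
// Illustrative sketch of the ownership difference, not the crate's real API.

/// Rust side: chunks can stay zero-copy slices borrowing from the one big string.
fn chunks_borrowed(text: &str, chunk_len: usize) -> Vec<&str> {
    text.as_bytes()
        .chunks(chunk_len)
        // For illustration only: assumes split points land on UTF-8 boundaries.
        .map(|bytes| std::str::from_utf8(bytes).expect("valid UTF-8"))
        .collect()
}

/// Binding side: each chunk has to become an owned String so Python can own
/// and free it independently of the original buffer.
fn chunks_owned(text: &str, chunk_len: usize) -> Vec<String> {
    chunks_borrowed(text, chunk_len)
        .into_iter()
        .map(str::to_owned) // one allocation + copy per chunk
        .collect()
}
```

So even if the Rust side never copies the input, the bindings pay one allocation and one copy per chunk once the results cross into Python.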
It's not about the Parquet file; I just exported it as Parquet for convenience. Yes, the other texts are working fine as well. It really seems to be the size of the text that somehow causes something to fail, or rather blows up the processing exponentially. Also, it fits nicely in memory, as it's only 37 MB.
Ok good to know, just wasn't something I had tried before so I wasn't sure. Thanks!
@do-me So I have been doing some tests. If I bump the chunk size up enough, it does finish, and it does seem to progress even at smaller sizes, including the one you posted. I ran it in Rust so I could output incremental progress with each chunk, and it is moving, just slowly. The issue I am seeing is that there are no newlines anywhere in this file. So for each chunk, it is binary searching across the entire file at sentence granularity, which at this size is indeed unfortunate... I will need to look into this some more. In some areas I assume that I have found a granularity that is too big to fit, and do a binary search over sections of the lower granularity, but only up to the byte offset of the end of the granularity I know is too big. But I am now thinking that it might be better to only do a binary search if I can reasonably limit the search space. If I can't, then I am searching across the entire document, and due to the expensive nature of tokenization, the binary search itself can become quite expensive and may not deliver the speed improvement it does when I can limit the search space. I could also change the binary search to not be a strict binary search... I will have to play around with it. I will give this a try, and run my benchmarks again on the happy path, where I often do have an upper bound on the search space, to make sure there aren't any regressions.
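To make the tradeoff concrete, here is a rough sketch of that bounded binary search (illustrative only, not the crate's actual implementation; `fits` stands in for the expensive token-counting check, and `upper_bound` is the limit on the search space):

```rust
/// Illustrative only: how many leading sentences fit in one chunk.
/// `boundaries` holds the byte offsets of sentence ends within `text`, and
/// `fits` stands in for the expensive check (e.g. counting tokens of a slice).
fn sentences_that_fit(
    text: &str,
    boundaries: &[usize],
    upper_bound: Option<usize>, // byte offset already known to be too big, if any
    fits: impl Fn(&str) -> bool,
) -> usize {
    // Without an upper bound, the search space is every sentence in the
    // document, so early probes tokenize slices close to the full input.
    let end = match upper_bound {
        Some(limit) => boundaries.partition_point(|&b| b <= limit),
        None => boundaries.len(),
    };

    let mut lo = 0; // number of sentences known to fit
    let mut hi = end;
    while lo < hi {
        let mid = lo + (hi - lo + 1) / 2; // bias upward so the loop makes progress
        if fits(&text[..boundaries[mid - 1]]) {
            lo = mid;
        } else {
            hi = mid - 1;
        }
    }
    lo
}
```

On an 86-million-character single line with no upper bound, the first probes hand nearly the whole document to the tokenizer, which is where the cost blows up.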
I see, thanks for looking into this! So from what you describe, this could happen on other semantic levels as well. The logic to stream the already-identified chunks would be awesome, especially for a potential Wasm version, as I could already start calculating embeddings! Indeed, we will work on the data side as well and try to include the line breaks.
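Something like the following is the kind of streaming meant here, as a minimal sketch (hypothetical helper, not the crate's API): chunks are yielded as soon as they are found, so a consumer can start embedding right away.

```rust
/// Illustrative lazy chunker: yields each chunk as soon as it is found,
/// instead of collecting the whole Vec before returning anything.
/// `next_boundary` stands in for whatever logic picks the end of a chunk.
fn stream_chunks<'a, F>(text: &'a str, next_boundary: F) -> impl Iterator<Item = &'a str>
where
    F: Fn(&str) -> usize,
{
    let mut rest = text;
    std::iter::from_fn(move || {
        if rest.is_empty() {
            return None;
        }
        // Assumes `next_boundary` returns a valid UTF-8 character boundary.
        let end = next_boundary(rest).clamp(1, rest.len());
        let (chunk, tail) = rest.split_at(end);
        rest = tail;
        Some(chunk)
    })
}

// A consumer could start work on each chunk immediately, e.g.:
// for chunk in stream_chunks(&huge_text, |s| s.len().min(1_000)) {
//     embed(chunk); // hypothetical downstream step, e.g. in a Wasm build
// }
```

In a Wasm build, that would let embedding start while the splitter is still working through the rest of the document.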
@do-me Just want to give you a heads-up so you know I didn't forget. I've been trying dozens of implementations, all with different tradeoffs, but I think I found one I am happy with. I was able to split this file in less than a second with my latest branch 🎉
Awesome @benbrandt, thank you very much for your efforts! Performance: [benchmark results not preserved in this copy]
Problem
I've been working on legal documents lately and am indexing 300k documents. Everything works perfectly fine with normal-sized docs (dozens of pages). However, when documents become very large, like the example below with 86,000,000 characters, it takes an eternity. I actually killed the process after 1 hour of processing and have no clue how long it might even take. I will let it run overnight and see whether it works eventually. The one CPU core in use is at 100%, so I take this as a sign that the code is not failing unexpectedly or anything similar.
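For anyone who wants to get a feel for the workload without the real file, here is a self-contained sketch that builds a comparably sized single-line input; `split_into_chunks` is a hypothetical placeholder, not the actual splitter call used for indexing.

```rust
use std::time::Instant;

fn main() {
    // Build one ~84-million-character string with no newlines, similar in
    // shape to the problematic document (the real one has ~86M characters).
    let text: String = "Lorem ipsum dolor sit amet. ".repeat(3_000_000);

    let start = Instant::now();
    let chunks = split_into_chunks(&text, 1_000);
    println!("{} chunks in {:?}", chunks.len(), start.elapsed());
}

// Hypothetical placeholder: in the real run this is the semantic splitter
// call that keeps one CPU core at 100% for over an hour.
fn split_into_chunks(text: &str, max_chars: usize) -> Vec<&str> {
    text.as_bytes()
        .chunks(max_chars)
        .map(|b| std::str::from_utf8(b).expect("ASCII input is always valid UTF-8"))
        .collect()
}
```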
Possible solutions
Here is my example (37 MB Parquet file):
I couldn't upload the file here in the issue, so I'm hosting it on my Drive here.
In any case, do you have any other ideas on how to make it work?