Performance: use all available CPU cores #165
Hi @do-me, I have been thinking about this, and actually I think what you are doing is exactly what I would do as well. The issue is, because of how tokenization works, it isn't easy to split the work in an efficient manner that will produce the exact same chunks, because the chunking is greedy. It is possible there is a way to do this that will preserve the behavior, but all of my initial sketches this weekend have potential issues: they either still don't utilize enough cores, or produce different chunks, or do significantly more work to "fix" the behavior. If I can find a way to properly split all of the chunking and use the cores, that would definitely be nice.

But I also think that splitting the work by chunking each document as a separate task, rather than parallelizing lower in the chunking, is what I would do in most cases anyway, as each document is a nice chunk of work to parallelize (see the sketch at the end of this comment).

As for whether this should happen on the Python or the Rust side, I don't think it will matter too much in this case, because at some point the allocated strings have to come across the FFI boundary. Unless I find a more efficient way to handle this, I don't think it will make a large difference, and it would be better for now to leave the flexibility up to the user in how they want to compose this. For example, I wouldn't want to tie the parallelization to pandas or something like that, and it doesn't seem like too much work to manage this yourself (you might even be able to do your wrapping function as a …).

I'll put this in the backlog so I don't lose track of it, but for now I think you are doing the right and most efficient thing 👍🏻
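As a rough sketch of what I mean by chunking each document as a separate task (the chunk capacity, pool size, and inputs below are just placeholders, not a recommendation):

```python
# Rough sketch: parallelize chunking across documents, one task per document.
# Capacity (1000 characters) and the example inputs are illustrative only.
from multiprocessing import Pool

from semantic_text_splitter import TextSplitter

splitter = TextSplitter(1000)  # character-based splitter, max 1000 chars per chunk

def chunk_document(text: str) -> list[str]:
    # Top-level named function so worker processes can pickle it.
    return splitter.chunks(text)

if __name__ == "__main__":
    documents = ["first long document ...", "second long document ..."]
    with Pool() as pool:  # one worker per CPU core by default
        all_chunks = pool.map(chunk_document, documents)
```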
Thanks for sharing your ideas. I see, that's what I expected with tokenization, and that's totally understandable. I think it's a good decision. By the way, here is the (more or less) one-liner if you want to use multiprocessing without pandas/swifter:
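(A sketch of the idea; `chunk_document` and `texts` are assumed names, with the wrapper defined at module level as in the sketch above so the pool can pickle it.)

```python
# More or less a one-liner: map the chunking wrapper over all texts in parallel.
from multiprocessing import Pool

with Pool() as pool:
    all_chunks = pool.map(chunk_document, texts)
```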
Anyway, I was thinking a little ahead about my use case with the text-splitter wasm version, as I always have one document, but this document might be extremely large, e.g. containing the text of a book or a couple of books. What would you recommend in this case? I have two heuristics in mind.
I am using the Python bindings and noticed that text-splitter is running on one core only. I think it would be great to allow for an option to use all available CPU cores (at least for character-based splitting, if using tokenizers would add too much complexity).
As a workaround, I am currently using pandarallel for my pandas DataFrame:
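Something along these lines (a sketch; the `text` column and the 1000-character capacity are placeholders):

```python
# Sketch: parallel chunking over a DataFrame column with pandarallel.
import pandas as pd
from pandarallel import pandarallel
from semantic_text_splitter import TextSplitter

pandarallel.initialize()  # spawns one worker per available core by default

splitter = TextSplitter(1000)  # character-based splitting, placeholder capacity

def chunk_text(text: str) -> list[str]:
    # Must be a named top-level function so pandarallel can hash/pickle it.
    return splitter.chunks(text)

df = pd.DataFrame({"text": ["a long document ...", "another long document ..."]})
df["chunks"] = df["text"].parallel_apply(chunk_text)
```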
It requires a named wrapping function so that it can hash it. On my 16-core machine it makes a big difference: 7 minutes of single-core processing vs. 45 seconds multi-core!
I don't know what's more efficient: parallelizing on the Python side like this, or doing the multiprocessing natively on the Rust side.
Happy to test in any case!