Par bridge with optional buffering IO handling #1173

Closed
pickfire opened this issue Jun 10, 2024 · 2 comments

Comments

@pickfire

My use case is processing many files in a CPU-heavy workload. I glob the files and par_bridge them into a CPU-intensive task, but that task first has to load the file, which is IO-heavy rather than CPU-heavy.

I am wondering whether it is possible to use another thread to read the file into memory first, and only then par_iter over it?

glob(...).par_buffer(polars_read).par_iter(polars_process);

The buffering part reads files into memory and prepares an extra set of items, so that each CPU-intensive polars_process call can run without wasting CPU time on IO.
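To make the buffering idea concrete, here is a minimal sketch using only the standard library rather than rayon: one dedicated thread loads items and pushes them into a bounded channel, and a fixed set of worker threads pulls from it. The names `read_item`, `process_item` and `run_pipeline` are hypothetical stand-ins (for `polars_read`, `polars_process` and the proposed `par_buffer` step); none of them is an existing rayon or polars API.

```rust
use std::sync::mpsc::sync_channel;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for the IO-heavy load step (`polars_read` in the issue).
fn read_item(name: &str) -> Vec<u8> {
    name.as_bytes().to_vec()
}

/// Hypothetical stand-in for the CPU-heavy step (`polars_process` in the issue).
fn process_item(data: &[u8]) -> usize {
    data.iter().map(|&b| b as usize).sum()
}

/// Loads inputs on one dedicated thread and processes them on `workers` threads.
/// At most `buffer` loaded items wait in the channel at any time.
/// Results are returned sorted, because completion order is not deterministic.
fn run_pipeline(names: Vec<String>, workers: usize, buffer: usize) -> Vec<usize> {
    assert!(workers > 0, "at least one worker is needed to drain the channel");

    let (tx, rx) = sync_channel::<Vec<u8>>(buffer);
    let rx = Arc::new(Mutex::new(rx));

    // IO thread: `send` blocks once the buffer is full, which gives back-pressure.
    let reader = thread::spawn(move || {
        for name in names {
            if tx.send(read_item(&name)).is_err() {
                break; // every receiver is gone
            }
        }
        // `tx` is dropped here, which closes the channel.
    });

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || {
                let mut results = Vec::new();
                loop {
                    // The lock guard is released at the end of this statement,
                    // so it is not held while processing.
                    let next = rx.lock().unwrap().recv();
                    match next {
                        Ok(data) => results.push(process_item(&data)),
                        Err(_) => break, // channel closed and drained
                    }
                }
                results
            })
        })
        .collect();

    reader.join().unwrap();
    let mut all: Vec<usize> = handles
        .into_iter()
        .flat_map(|h| h.join().unwrap())
        .collect();
    all.sort_unstable();
    all
}
```

Calling `run_pipeline(names, 4, 8)` would keep up to eight loaded items ready for four workers. The bounded channel is what prevents the reader from pulling every file into memory at once.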

@adamreichold
Collaborator

If I understand you correctly, you would like to read in data with twice as many threads as you use to process it? What do you think about using two separate pools to do that, i.e. roughly

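use rayon::prelude::*;
use rayon::ThreadPoolBuilder;

// Default-sized pool for the CPU-heavy processing.
let cpu_pool = ThreadPoolBuilder::new().build().unwrap();
// Twice as many threads for the IO-heavy reads.
let io_pool = ThreadPoolBuilder::new().num_threads(cpu_pool.current_num_threads() * 2).build().unwrap();

// Work started inside `scope` runs on the IO pool's threads.
io_pool.scope(|_io_scope| {
    // `glob` yields a sequential iterator, so it is bridged with `par_bridge` rather than `par_iter`.
    glob(...).par_bridge().map(polars_read).for_each(|item| {
        // Hand the loaded item to the CPU pool; the IO thread waits here until it is processed.
        cpu_pool.in_place_scope(move |cpu_scope| {
            // Closures passed to `Scope::spawn` receive the scope as their argument.
            cpu_scope.spawn(move |_| polars_process(item));
        });
    });
});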

(I have not even compiled this; the code is only meant to illustrate using two thread pools to explicitly implement the two-IO-threads-per-CPU-thread approach.)

@pickfire
Author

pickfire commented Jul 28, 2024

Interesting, something like this could indeed work. But it seems like a lot of work given that I only want to glob files. I guess I can close this issue now.
