
Automatically optimize batch size for indexed CSVs #2174

Closed
jqnatividad opened this issue Sep 28, 2024 · 3 comments · Fixed by #2178
Labels
enhancement: New feature or request. Once marked with this label, it's in the backlog.
performance

Comments

@jqnatividad (Owner) commented Sep 28, 2024

There are several parallelized commands with a --batch size parameter (often with a default of 50k).

If a CSV is indexed, automatically optimize the --batch parameter so that it's set to abs(row_count/num_cpus) + 1.

This way, we can process the command in the optimum number of parallelized passes.

For example, the 1 million row NYC 311 benchmark data with an index will have a batch size of 62,500 rows on a 16-core CPU (1,000,000 / 16) instead of the default 50k.

Currently, 1M rows is processed in 20 batches of 50k rows. With this approach, it's processed in 16 batches of 62.5k rows.
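The proposed sizing can be sketched as follows. This is an illustrative sketch, not qsv's actual code: the issue's abs(row_count/num_cpus) + 1 is written here as a ceiling division, which matches the 62,500-row worked example, and the function name is hypothetical.

```rust
/// Hypothetical helper (not qsv's actual code): split the row count evenly
/// across CPUs, rounding up so every row lands in some batch. This ceiling
/// division agrees with the 1,000,000-row / 16-core = 62,500 example above.
fn optimal_batch_size(row_count: u64, num_cpus: u64) -> u64 {
    row_count.div_ceil(num_cpus)
}

fn main() {
    // NYC 311 benchmark example: 1,000,000 rows on 16 cores
    println!("{}", optimal_batch_size(1_000_000, 16)); // 62500
}
```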

jqnatividad added the enhancement and performance labels on Sep 28, 2024
@jqnatividad (Owner, Author)

Only do this when the row count is greater than the default 50k batch size.

Otherwise, the overhead of setting up batched parallelized commands may end up making the command slower - e.g. for an indexed CSV with 100 rows, the "optimized" batch size is 7 rows (100/16 = 6.25, round up to 7). This will result in 14 batches with a batch size of 7, and one additional batch with just 2 rows (7 * 14 = 98; 100 - 98 = 2).

We would have been better off just running one batch (as 100 < 50k) and letting rayon transparently handle parallelizing within the batch.
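The guard described above can be sketched as follows. This is an illustrative sketch under the stated assumptions, with hypothetical names, not qsv's actual code.

```rust
/// The default batch size the comment above refers to.
const DEFAULT_BATCH_SIZE: u64 = 50_000;

/// Hypothetical helper: at or below the 50k default, keep a single batch
/// and let rayon parallelize within it; above it, split across CPUs.
fn effective_batch_size(row_count: u64, num_cpus: u64) -> u64 {
    if row_count <= DEFAULT_BATCH_SIZE {
        row_count // one batch covers the whole file
    } else {
        row_count.div_ceil(num_cpus)
    }
}

fn main() {
    // 100 rows on 16 cores: one 100-row batch instead of
    // 14 batches of 7 plus one trailing batch of 2 (15 batches total)
    assert_eq!(effective_batch_size(100, 16), 100);
    assert_eq!(effective_batch_size(1_000_000, 16), 62_500);
    println!("ok");
}
```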

@jqnatividad (Owner, Author)

Benchmarking shows that optimizing batch size is worth it even for unindexed CSVs, so this was implemented for unindexed CSVs as well.

@jqnatividad (Owner, Author)

I ended up optimizing batch size only when a CSV is indexed or the polars feature is enabled.

2afb614

Otherwise, the perf hit from counting the rows to determine the optimal batch size may cancel out the optimization.
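The final gating described above can be sketched as follows. This is an illustrative sketch, not qsv's actual code: it assumes the row count is cheaply available when the CSV is indexed (or the polars feature is enabled), and is modeled here as an Option.

```rust
/// Hypothetical helper: only compute the optimized batch size when a row
/// count is cheaply available (an indexed CSV stores it; the polars feature
/// can count rows quickly). Without a cheap count, keep the default rather
/// than paying for a full scan just to size the batches.
fn choose_batch_size(cheap_row_count: Option<u64>, num_cpus: u64, default: u64) -> u64 {
    match cheap_row_count {
        Some(n) if n > default => n.div_ceil(num_cpus), // optimized split
        Some(n) => n,                                   // small file: one batch
        None => default,                                // no index/polars: default
    }
}

fn main() {
    assert_eq!(choose_batch_size(Some(1_000_000), 16, 50_000), 62_500);
    assert_eq!(choose_batch_size(Some(100), 16, 50_000), 100);
    assert_eq!(choose_batch_size(None, 16, 50_000), 50_000);
    println!("ok");
}
```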
