
Automatically optimize batch size for indexed CSVs #2174

Closed
jqnatividad opened this issue Sep 28, 2024 · 3 comments · Fixed by #2178
Labels
enhancement: New feature or request. Once marked with this label, it's in the backlog.
performance

Comments

@jqnatividad (Owner) commented Sep 28, 2024

There are several parallelized commands with a --batch size parameter (often with a default of 50k).

If a CSV is indexed, automatically optimize the --batch parameter so that it's set to abs(row_count/num_cpus) + 1.

This way, we can process the command in the optimum number of parallelized passes.

For example, the 1 million row NYC 311 benchmark data with an index will have a batch size of 62,500 rows on a 16-core CPU (1,000,000 / 16) instead of the default 50k.

Currently, 1M rows is processed in 20 batches of 50k rows. With this approach, it's processed in 16 batches of 62.5k rows.
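The proposed sizing can be sketched as follows. This is an illustrative sketch, not qsv's actual code: the issue's abs(row_count/num_cpus) + 1 is written here as a ceiling division, which matches the 62,500-row worked example, and the function name is hypothetical.

```rust
/// Hypothetical helper (not qsv's actual code): split the row count evenly
/// across CPUs, rounding up so every row lands in some batch. This ceiling
/// division agrees with the 1,000,000-row / 16-core = 62,500 example above.
fn optimal_batch_size(row_count: u64, num_cpus: u64) -> u64 {
    row_count.div_ceil(num_cpus)
}

fn main() {
    // NYC 311 benchmark example: 1,000,000 rows on 16 cores
    println!("{}", optimal_batch_size(1_000_000, 16)); // 62500
}
```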

jqnatividad added the enhancement and performance labels on Sep 28, 2024
@jqnatividad (Owner, Author)

Only do this when the row count is greater than the default 50k batch size.

Otherwise, the overhead of setting up batched parallelized commands may end up making the command slower - e.g. for an indexed CSV with 100 rows, the "optimized" batch size is 7 rows (100/16 = 6.25, round up to 7). This will result in 14 batches with a batch size of 7, and one additional batch with just 2 rows (7 * 14 = 98; 100 - 98 = 2).

We would have been better off just running one batch (as 100 < 50k) and letting rayon transparently handle parallelizing within the batch.
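The guard described above can be sketched as follows. This is an illustrative sketch under the stated assumptions, with hypothetical names, not qsv's actual code.

```rust
/// The default batch size the comment above refers to.
const DEFAULT_BATCH_SIZE: u64 = 50_000;

/// Hypothetical helper: at or below the 50k default, keep a single batch
/// and let rayon parallelize within it; above it, split across CPUs.
fn effective_batch_size(row_count: u64, num_cpus: u64) -> u64 {
    if row_count <= DEFAULT_BATCH_SIZE {
        row_count // one batch covers the whole file
    } else {
        row_count.div_ceil(num_cpus)
    }
}

fn main() {
    // 100 rows on 16 cores: one 100-row batch instead of
    // 14 batches of 7 plus one trailing batch of 2 (15 batches total)
    assert_eq!(effective_batch_size(100, 16), 100);
    assert_eq!(effective_batch_size(1_000_000, 16), 62_500);
    println!("ok");
}
```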

@jqnatividad (Owner, Author)

Benchmarking shows that optimizing batch size is worth it even for unindexed CSVs, so this was implemented for unindexed CSVs as well.

@jqnatividad (Owner, Author)

I ended up optimizing batch size only when a CSV is indexed or the polars feature is enabled.

2afb614

Otherwise, the perf hit from counting the rows to determine the optimal batch size may cancel out the optimization.
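The final gating described above can be sketched as follows. This is an illustrative sketch, not qsv's actual code: it assumes the row count is cheaply available when the CSV is indexed (or the polars feature is enabled), and is modeled here as an Option.

```rust
/// Hypothetical helper: only compute the optimized batch size when a row
/// count is cheaply available (an indexed CSV stores it; the polars feature
/// can count rows quickly). Without a cheap count, keep the default rather
/// than paying for a full scan just to size the batches.
fn choose_batch_size(cheap_row_count: Option<u64>, num_cpus: u64, default: u64) -> u64 {
    match cheap_row_count {
        Some(n) if n > default => n.div_ceil(num_cpus), // optimized split
        Some(n) => n,                                   // small file: one batch
        None => default,                                // no index/polars: default
    }
}

fn main() {
    assert_eq!(choose_batch_size(Some(1_000_000), 16, 50_000), 62_500);
    assert_eq!(choose_batch_size(Some(100), 16, 50_000), 100);
    assert_eq!(choose_batch_size(None, 16, 50_000), 50_000);
    println!("ok");
}
```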
