There are several parallelized commands with a `--batch` size parameter (often with a default of 50k).
If a CSV is indexed, automatically optimize the `--batch` parameter so that it's set to row_count / num_cpus, rounded up to the next whole row (ceiling division).
In this way, we can process the command using the optimal number of parallelized passes.
For example, the 1 million row NYC 311 benchmark data with an index will have a batch size of 62,500 rows on a 16-core CPU (1,000,000 / 16) instead of the default 50k.
Currently, 1M rows is processed in 20 batches of 50k rows. With this approach, it's processed in 16 batches of 62.5k rows.
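A minimal sketch of the proposed calculation (the function and parameter names here are hypothetical, not qsv's actual internals; ceiling division is assumed, consistent with both worked examples in this issue):

```rust
/// Hypothetical batch-size calculation: `row_count` would come from
/// the CSV index, `num_cpus` from the host.
fn optimal_batch_size(row_count: u64, num_cpus: u64) -> u64 {
    // Ceiling division: 1_000_000 / 16 = 62_500 exactly, while
    // 100 / 16 = 6.25 rounds up to 7.
    (row_count + num_cpus - 1) / num_cpus
}

fn main() {
    // The NYC 311 benchmark example above: one pass per core.
    assert_eq!(optimal_batch_size(1_000_000, 16), 62_500);
    assert_eq!(optimal_batch_size(100, 16), 7);
}
```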
Only do this when the row count is greater than the default 50k batch size.
Otherwise, the overhead of setting up batched parallelized commands may end up making the command slower. For example, for an indexed CSV with 100 rows, the "optimized" batch size is 7 rows (100 / 16 = 6.25, rounded up to 7). This results in 14 batches of 7 rows each, plus one additional batch with just 2 rows (7 × 14 = 98; 100 − 98 = 2).
We would have been better off just running one batch (as 100 < 50k) and letting rayon transparently handle parallelizing within the batch.
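Putting the two together, a hedged sketch of the guarded logic (again with hypothetical names; `DEFAULT_BATCH_SIZE` stands in for the existing 50k default):

```rust
const DEFAULT_BATCH_SIZE: u64 = 50_000;

/// Only override the default when the indexed row count exceeds the
/// 50k default; otherwise keep a single batch and let rayon
/// parallelize the work within it.
fn effective_batch_size(row_count: u64, num_cpus: u64) -> u64 {
    if row_count > DEFAULT_BATCH_SIZE {
        // Ceiling division: one batch per CPU.
        (row_count + num_cpus - 1) / num_cpus
    } else {
        // Small input: a single batch covers all rows.
        DEFAULT_BATCH_SIZE
    }
}

fn main() {
    assert_eq!(effective_batch_size(1_000_000, 16), 62_500); // 16 batches
    assert_eq!(effective_batch_size(100, 16), 50_000);       // one batch of 100 rows
}
```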