-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow Setting Minimum Parallelism with RowCount Based Demuxer #7841
Allow Setting Minimum Parallelism with RowCount Based Demuxer #7841
Conversation
fa73eb8
to
3d3149f
Compare
@alamb @metesynnada This PR and #7801 are rebased and ready for review when you have a chance. This one is smaller and addresses the performance regression, so probably best to prioritize this one. |
Thank you @devinjdangelo -- I have been accumulating quite a review backlog while working on some other writing projects lol -- I hope to make a dent in this backlog tomorrow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @devinjdangelo -- this is (another) really nice PR.
I reran the test from #7791 (review)
And I do confirm this PR goes much faster. The fact the setting is configurable also means users can trade off buffering and fewer/more compacted files, which is very nice
It is also really nice that this PR still doesn't make empty files if there are no batches to send.
The only thing I think this PR needs prior to merge is some sort of test (perhaps you could set minimum_parallel_output_files
to 1 and demonstrate that a single file is created, and then set it minimum_parallel_output_files
to 3 and demonstrate that more than 1 is created or something
/// RecordBatches will be distributed in round robin fashion to each | ||
/// parallel writer. Each writer is closed and a new file opened once | ||
/// soft_max_rows_per_output_file is reached. | ||
pub minimum_parallel_output_files: usize, default = 4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about defaulting to the number of cores (maybe if this was 0
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The returns to additional cores seems to decline very fast beyond 4 tasks in my testing. I believe this is because ~4 parallel serialization tasks no longer bottlenecks the end-to-end execution plan. Going beyond 4 tasks mostly gives higher memory usage and smaller output files for little benefit.
My testing is mostly on a 32core system. I have not tested on enough different configurations to know if core_count/8 is a reasonable default or if a static 4 tasks is a decent default.
It will also depend a lot on the actual execution plan. If you are writing a pre-cached in memory dataset, then you definitely want 1 task/output file per core.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I plan to work on a statement level option soon, so you could easily do:
copy my_in_memory_table to my_dir (format parquet, output_files 32);
to boost the parallelism for specific plans that benefit from it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me
This PR has a small conflict, but I am pretty sure once that is fixed it will be ready to go |
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
I'll sort this out today, and see if I can improve the tests as you suggested. |
LGTM -- thanks again @devinjdangelo |
Which issue does this PR close?
Addresses performance regression of #7791
Rationale for this change
#7791 introduced a row count targeting execution time partitioning strategy for DataSinks. The initial implementation only writes a single file at a time, which guarantees that only 1 file will ever be written with <
soft_max_rows_per_output_file
rows and all others will have >=soft_max_rows_per_output_file
. This PR introduces a new settingminimum_parallel_output_files
which will write N files in parallel, each targetingsoft_max_rows_per_output_file
. This allows the user to configure a balance between parallelism and achieving the desired file size.The behavior of this PR is identical to #7791 if minimum_parallel_output_files is set to 1.
What changes are included in this PR?
minimum_parallel_output_files
config settingminimum_parallel_output_files
setting.Are these changes tested?
Yes by existing tests
Are there any user-facing changes?
Default behavior is now to output at least 4 files in parallel even if
soft_max_rows_per_output_file
is not reached.