[ADAM-864] Don't force shuffle if reducing partition count. #866
Test PASSed.
Could you add |
b08f323 to 88368bc
@heuermh Done!
Test PASSed.
LGTM, thanks.
Can I merge this? If so, is the GitHub Merge button still OK? If I remember correctly, the |
I thought I'd pushed a fix for the commit-pr.sh issue? Let me rebase this first... or do you just want me to merge it?
Rebase and I'll hit the button, no need to bend the don't-merge-your-own-pull-requests rule. :)
Resolves bigdatagenomics#864. In Spark, coalescing can reduce the number of partitions in an RDD without performing a shuffle, but it can only increase the number of partitions if a shuffle is performed. This PR modifies Transform and Vcf2ADAM to check whether the coalesce option will increase or decrease the partition count, so that a shuffle is not forced when the count is being reduced. It also adds a flag that lets the user force a shuffle; this may be desirable because a forced shuffle is hash-partitioned, which may improve the balance of records across partitions. Finally, ADAM2Fasta is modified to support similar options.
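The decision described in this commit message can be sketched in a few lines. This is a minimal illustration in plain Python, not the code from the PR: the function name `needs_shuffle` and its parameters are hypothetical, and treating an unchanged partition count as "no shuffle" is an assumption.

```python
def needs_shuffle(current_partitions: int,
                  target_partitions: int,
                  force_shuffle: bool = False) -> bool:
    """Decide whether a coalesce to `target_partitions` should shuffle.

    Hypothetical helper for illustration only; names are not taken from ADAM.
    """
    if current_partitions <= 0 or target_partitions <= 0:
        raise ValueError("partition counts must be positive")
    # A shuffle is required to increase the partition count; reducing it
    # can be done without one. The user may still force a shuffle, e.g.
    # to rebalance records across partitions.
    return force_shuffle or target_partitions > current_partitions


# With PySpark, the result could be passed straight to the shuffle argument
# of RDD.coalesce (assumed usage, shown as a comment so the sketch runs
# without Spark installed):
#   rdd.coalesce(target, shuffle=needs_shuffle(rdd.getNumPartitions(), target))

if __name__ == "__main__":
    print(needs_shuffle(100, 10))                      # reducing
    print(needs_shuffle(10, 100))                      # increasing
    print(needs_shuffle(100, 10, force_shuffle=True))  # forced
```

Keeping the decision in a small pure function makes the three cases (reduce, increase, forced) easy to check independently of a Spark cluster.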
88368bc to fb550fd
Rebased!
Test PASSed.
[ADAM-864] Don't force shuffle if reducing partition count.
Thanks!