Asynchronously copy table data to the host during shuffle #11280

Merged: 3 commits, Jul 31, 2024

jlowe (Contributor) commented Jul 31, 2024

Leverages rapidsai/cudf#16429 to asynchronously copy the partitioned table data to the host. This avoids an unnecessary stream synchronization after each device buffer is copied back to the host and better overlaps CPU and GPU work during sliceInternalOnCpu. There is no measurable performance difference on NDS runs because the schemas of the shuffled data are relatively narrow (i.e., there are few separate buffers to copy back to the host and thus few unnecessary CUDA stream synchronizations to save), but for wide shuffled schemas this can make a significant difference. For example, sliceInternalOnCpu for a repartition of a table with 512 integer columns takes half the time with this asynchronous copy.
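
To illustrate why the number of synchronizations scales with schema width, here is a CPU-only toy model in Python. It is only a sketch and not the cudf or spark-rapids code: `ToyStream`, `copy_sync_per_buffer`, `copy_sync_once`, and `run_demo` are hypothetical names, a single worker thread stands in for a CUDA stream, and it assumes one buffer per column (real columns can carry additional buffers). It counts synchronization points only and does not measure time.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyStream:
    """Hypothetical stand-in for a CUDA stream: copies run in submission order."""

    def __init__(self):
        self._worker = ThreadPoolExecutor(max_workers=1)
        self._pending = []
        self.sync_count = 0

    def copy_async(self, src, dst):
        # Enqueue the copy and return immediately, like an async device-to-host copy.
        self._pending.append(self._worker.submit(self._copy, src, dst))

    @staticmethod
    def _copy(src, dst):
        dst[:] = src

    def synchronize(self):
        # Block the caller until every enqueued copy has finished.
        for future in self._pending:
            future.result()
        self._pending.clear()
        self.sync_count += 1

    def close(self):
        self._worker.shutdown()


def copy_sync_per_buffer(stream, device_buffers):
    """Old pattern (assumed): wait on the stream after every buffer."""
    host = []
    for buf in device_buffers:
        dst = bytearray(len(buf))
        stream.copy_async(buf, dst)
        stream.synchronize()  # the CPU stalls once per buffer
        host.append(dst)
    return host


def copy_sync_once(stream, device_buffers):
    """New pattern (assumed): enqueue every copy, then wait a single time."""
    host = []
    for buf in device_buffers:
        dst = bytearray(len(buf))
        stream.copy_async(buf, dst)  # the CPU keeps preparing the next copy
        host.append(dst)
    stream.synchronize()  # one wait covers all enqueued copies
    return host


def run_demo(num_columns=512):
    # One small fake "device" buffer per column.
    buffers = [bytes([i % 256]) * 16 for i in range(num_columns)]
    results = []
    for copy_fn in (copy_sync_per_buffer, copy_sync_once):
        stream = ToyStream()
        host = copy_fn(stream, buffers)
        copied_ok = [bytes(h) for h in host] == buffers
        results.append((stream.sync_count, copied_ok))
        stream.close()
    return results
```

In this model, `run_demo()` reports 512 synchronizations for the per-buffer pattern and 1 for the single-wait pattern, with identical host data in both cases. With only a handful of buffers the two counts are close, which is consistent with the narrow NDS schemas showing no measurable difference.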

jlowe added the performance label (A performance related task/issue) Jul 31, 2024
jlowe self-assigned this Jul 31, 2024

jlowe (Contributor, Author) commented Jul 31, 2024

build

jlowe merged commit dfcff71 into NVIDIA:branch-24.10 Jul 31, 2024
44 checks passed
jlowe deleted the copy-host-async branch July 31, 2024 22:03
Labels: performance (A performance related task/issue)

3 participants