[QST] SQL Performance issue #5342

Answered by jlowe
eyalhir74 asked this question in General

Thanks a ton for the qdrep file, that helps a lot!

So yes, shuffle and buffer reading are a significant chunk of this. Looking at the first stage of the query, where it first starts reading Parquet:

All the yellow ranges after the purple "Hash partition" and green "Parquet readBatch" ranges on that thread are serializing out the task's output for Spark shuffle. The gaps between those are the CPU thread writing out shuffle data to disk and scheduling the next task. These are all outside the scope of the RAPIDS Accelerator and are part of standard Apache Spark. Once we get to the Parquet read, we can see that the first 137ms of that range is spent buffering the input data from the filesyste…

Answer selected by sameerz
Labels
question Further information is requested
2 participants
This discussion was converted from issue #2783 on April 27, 2022 17:00.