[SPARK-13930][SQL] Apply fast serialization on collect limit operator #11759
Conversation
Test build #53309 has finished for PR 11759 at commit
retest this please.
Test build #53310 has finished for PR 11759 at commit
Test build #53320 has finished for PR 11759 at commit
* Runs this query returning the result as an array.
* Packing the UnsafeRows into byte array for faster serialization.
* The byte arrays are in the following format:
* [size] [bytes of UnsafeRow] [size] [bytes of UnsafeRow] ... [-1]
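The length-prefixed framing described in the comment above can be sketched as follows. This is a minimal, hypothetical illustration of the `[size] [bytes] ... [-1]` format, not Spark's actual implementation (the object and method names here are invented for the example):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
import scala.collection.mutable.ArrayBuffer

object RowFraming {
  // Pack each row's bytes as [size][bytes], then terminate with -1.
  def pack(rows: Seq[Array[Byte]]): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val out = new DataOutputStream(bos)
    rows.foreach { r =>
      out.writeInt(r.length) // [size]
      out.write(r)           // [bytes of UnsafeRow]
    }
    out.writeInt(-1)         // end marker
    out.flush()
    bos.toByteArray
  }

  // Read [size][bytes] pairs until the -1 end marker is seen.
  def unpack(bytes: Array[Byte]): Seq[Array[Byte]] = {
    val in = new DataInputStream(new ByteArrayInputStream(bytes))
    val rows = ArrayBuffer.empty[Array[Byte]]
    var size = in.readInt()
    while (size != -1) {
      val buf = new Array[Byte](size)
      in.readFully(buf)
      rows += buf
      size = in.readInt()
    }
    rows.toSeq
  }
}
```

Because the rows travel as one contiguous byte array with inline sizes, the receiver can rebuild them with a single pass and no per-row object serialization overhead, which is the source of the speedup discussed in this PR.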
These are implementation details; they have nothing to do with the APIs. I'd like to keep these as comments.
nvm, you made it a function.
Test build #53501 has finished for PR 11759 at commit
Test build #53502 has finished for PR 11759 at commit
cc @davies The comments are addressed and the tests pass. Please see if this is ok now. Thanks!
LGTM, merging this into master, thanks!
What changes were proposed in this pull request?

JIRA: https://issues.apache.org/jira/browse/SPARK-13930

Recently, fast serialization was introduced for collecting DataFrame/Dataset (#11664). The same technique can be applied to the collect limit operator too.

How was this patch tested?

Added a benchmark for collect limit to BenchmarkWholeStageCodegen.

Without this patch:

model name : Westmere E56xx/L56xx/X56xx (Nehalem-C)

collect limit:                      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
collect limit 1 million                  3413 / 3768          0.3        3255.0       1.0X
collect limit 2 millions                 9728 / 10440         0.1        9277.3       0.4X

With this patch:

model name : Westmere E56xx/L56xx/X56xx (Nehalem-C)

collect limit:                      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
collect limit 1 million                   833 / 1284          1.3         794.4       1.0X
collect limit 2 millions                 3348 / 4005          0.3        3193.3       0.2X

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #11759 from viirya/execute-take.
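The Best/Avg Time(ms) columns in the tables above come from Spark's benchmark utility. As a rough illustration of what such a measurement does, here is a minimal stand-in timing helper (a hypothetical `bestAvgMs`, not Spark's actual Benchmark class):

```scala
object Timing {
  // Run `body` `iters` times and return (best, average) wall-clock
  // milliseconds, mirroring the Best/Avg columns in the benchmark output.
  def bestAvgMs(iters: Int)(body: => Unit): (Double, Double) = {
    val times = (1 to iters).map { _ =>
      val t0 = System.nanoTime()
      body
      (System.nanoTime() - t0) / 1e6
    }
    (times.min, times.sum / times.length)
  }
}
```

A real benchmark would also add warm-up iterations so JIT compilation does not skew the first measurements.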