[SPARK-25317][CORE] Avoid perf regression in Murmur3 Hash on UTF8String #22338
Conversation
for (int i = lengthAligned; i < lengthInBytes; i++) {
-    int halfWord = base.getByte(i);
+    int halfWord = Platform.getByte(o, offset + i);
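The diff above swaps a per-byte virtual call on the memory block for a direct read against the base object and offset. The following is a simplified, self-contained sketch of that tail loop, not Spark's actual Murmur3_x86_32; ByteSource, ArraySource, and TailLoopSketch are hypothetical stand-ins (ByteSource plays the role of MemoryBlock), and the mixing steps use the standard Murmur3 constants.

```java
// Simplified sketch of the Murmur3 tail loop, contrasting a per-byte
// virtual call through an abstraction with a direct array read.
public class TailLoopSketch {
    // Hypothetical stand-in for MemoryBlock: each read is a virtual call.
    interface ByteSource {
        byte getByte(int i);
    }

    static final class ArraySource implements ByteSource {
        private final byte[] base;
        ArraySource(byte[] base) { this.base = base; }
        public byte getByte(int i) { return base[i]; }
    }

    // Style before the change: one virtual call per trailing byte.
    static int mixTailViaInterface(ByteSource base, int lengthAligned,
                                   int lengthInBytes, int h1) {
        for (int i = lengthAligned; i < lengthInBytes; i++) {
            int halfWord = base.getByte(i);
            h1 = mixH1(h1, mixK1(halfWord));
        }
        return h1;
    }

    // Style after the change: direct access via base object plus offset.
    static int mixTailDirect(byte[] o, int offset, int lengthAligned,
                             int lengthInBytes, int h1) {
        for (int i = lengthAligned; i < lengthInBytes; i++) {
            int halfWord = o[offset + i];
            h1 = mixH1(h1, mixK1(halfWord));
        }
        return h1;
    }

    // Standard Murmur3 32-bit mixing constants and steps.
    static int mixK1(int k1) {
        k1 *= 0xcc9e2d51;
        k1 = Integer.rotateLeft(k1, 15);
        k1 *= 0x1b873593;
        return k1;
    }

    static int mixH1(int h1, int k1) {
        h1 ^= k1;
        h1 = Integer.rotateLeft(h1, 13);
        return h1 * 5 + 0xe6546b64;
    }

    public static void main(String[] args) {
        byte[] data = "hello".getBytes();
        int a = mixTailViaInterface(new ArraySource(data), 4, 5, 42);
        int b = mixTailDirect(data, 0, 4, 5, 42);
        if (a != b) throw new AssertionError("hash states diverged");
        System.out.println("ok: both styles produce the same hash state");
    }
}
```

The two styles are semantically identical; the question debated in this thread is only how efficiently the JIT compiles each one.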
So it seems the performance regression is due to the cost of virtual function calls on MemoryBlock?
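If one wants to probe the virtual-call hypothesis directly, a rough harness like the one below can help, though as the next reply notes, the picture is muddier than it looks: the HotSpot JIT routinely devirtualizes and inlines monomorphic call sites, so a naive test may show no difference at all. VirtualCallProbe and Reader are hypothetical names; JMH would be the rigorous tool for a real measurement.

```java
// Naive timing harness (JMH would be the rigorous choice) to get a rough
// feel for interface-call dispatch versus a direct array read. Note: with
// a single Reader implementation the call site is monomorphic and the JIT
// will likely inline it, so both loops may run at the same speed.
public class VirtualCallProbe {
    interface Reader { byte read(byte[] a, int i); }

    public static long sumViaInterface(Reader r, byte[] a) {
        long s = 0;
        for (int i = 0; i < a.length; i++) s += r.read(a, i);
        return s;
    }

    public static long sumDirect(byte[] a) {
        long s = 0;
        for (int i = 0; i < a.length; i++) s += a[i];
        return s;
    }

    public static void main(String[] args) {
        byte[] a = new byte[1 << 20];
        for (int i = 0; i < a.length; i++) a[i] = (byte) i;
        Reader r = (arr, i) -> arr[i];

        long t0 = System.nanoTime();
        long s1 = 0;
        for (int it = 0; it < 100; it++) s1 += sumViaInterface(r, a);
        long t1 = System.nanoTime();
        long s2 = 0;
        for (int it = 0; it < 100; it++) s2 += sumDirect(a);
        long t2 = System.nanoTime();

        System.out.println("interface: " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("direct:    " + (t2 - t1) / 1_000_000 + " ms");
        if (s1 != s2) throw new AssertionError("sums diverged");
    }
}
```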
That was my guess too at the beginning, but if you apply only this change, performance doesn't improve. What @kiszk said about the clue being the size of the generated Java bytecode seems reasonable, but it needs more investigation.
OK. It seems there is more than a single cause for this performance regression.
cc @cloud-fan @kiszk I checked the generated bytecode, and its size does not seem to be the issue either. FYI, I am attaching the disassembled code from before and after:
Thanks for working on it! Will it be helpful if we move these hash methods to
@cloud-fan I think I tried doing something like what you suggested but it didn't help. Moreover, the current code in
This basically reverts the memory block in the hash computation: now the memory block is just a holder of the base object and base offset. This does fix the regression, but will we also lose the perf speedup? IIUC we observed a significant perf boost when introducing the memory block.
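The "holder" shape being described can be sketched as follows. This is a hypothetical illustration (BlockHolder and xorAllBytes are invented names, and a plain byte[] stands in for Spark's on-heap/off-heap base object): the block only carries the base object and offset, and the hot loop unpacks it once and reads bytes directly instead of calling virtual methods on the block per byte.

```java
// Hypothetical sketch of the "holder only" design: the block carries the
// base object and offset; hashing code reads bytes directly.
public class HolderSketch {
    static final class BlockHolder {
        final byte[] base;   // base object (on-heap case)
        final int offset;    // base offset into the object
        final int length;
        BlockHolder(byte[] base, int offset, int length) {
            this.base = base; this.offset = offset; this.length = length;
        }
    }

    // Caller unpacks the holder once, then loops over plain array reads;
    // no virtual dispatch happens inside the loop.
    static int xorAllBytes(BlockHolder block) {
        byte[] o = block.base;
        int off = block.offset;
        int acc = 0;
        for (int i = 0; i < block.length; i++) {
            acc ^= o[off + i];
        }
        return acc;
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 3, 4};
        System.out.println(xorAllBytes(new BlockHolder(data, 0, 4))); // prints 4
    }
}
```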
Test build #95704 has finished for PR 22338 at commit
@mgaido91 thanks, interestingly I ran experiments with similar code on my machine. Regarding bytecode size, what I have seen in my experiments is the following. I have not found the root cause of this behavior. Your original code: 38us
Your original code with moving
@kiszk yes, I know. Moving
In addition to your commit, I applied the following change, basically use
But, furthermore, when I also used
@cloud-fan I checked the original PR and I couldn't find any benchmark related to the Murmur3 hash. The only benchmark I found there was using the
You're right @kiszk. Let me update the PR accordingly in order to narrow down the change/problem, thanks.
While I saw these performance differences, I do not completely understand why they occur. That is why I said "I have not found the root cause". Let us narrow down the problem and find the root cause.
Yes, I am in the same situation. I see the difference but I cannot explain it.
Yeah, it's interesting... It seems both
Test build #95721 has finished for PR 22338 at commit
Since the change looks safer to me and it does fix the regression, I'm merging it to unblock the 2.4 release. Please continue to investigate the root cause, thanks!
## What changes were proposed in this pull request?

SPARK-10399 introduced a performance regression on the hash computation for UTF8String.

The regression can be evaluated with the code attached in the JIRA. That code runs in about 120 us per method on my laptop (MacBook Pro 2.5 GHz Intel Core i7, RAM 16 GB 1600 MHz DDR3) while the code from branch 2.3 takes on the same machine about 45 us for me. After the PR, the code takes about 45 us on the master branch too.

## How was this patch tested?

running the perf test from the JIRA

Closes #22338 from mgaido91/SPARK-25317.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 64c314e)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
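The per-call microsecond figures quoted above (roughly 120 us regressed vs. 45 us fixed) come from the code attached to the JIRA; that code is not reproduced here. Below is a minimal, hypothetical harness in the same spirit (PerCallMicros and avgMicrosPerCall are invented names, and a simple polynomial hash stands in for the UTF8String hash): run the operation many times and report the average cost per call in microseconds.

```java
// Hypothetical re-creation of the measurement style described in the JIRA
// (not the actual JIRA code): average cost per hash call in microseconds.
public class PerCallMicros {
    public static double avgMicrosPerCall(java.util.function.IntSupplier op, int calls) {
        long t0 = System.nanoTime();
        int sink = 0;
        for (int i = 0; i < calls; i++) sink += op.getAsInt();
        long elapsed = System.nanoTime() - t0;
        // Use the sink so the JIT cannot eliminate the loop as dead code.
        if (sink == 0x12345678) System.err.println(sink);
        return elapsed / 1000.0 / calls;
    }

    public static void main(String[] args) {
        byte[] data = "some reasonably long UTF8 string to hash repeatedly"
                .getBytes();
        // Stand-in workload: a simple polynomial hash over the bytes.
        double us = avgMicrosPerCall(() -> {
            int h = 0;
            for (byte b : data) h = h * 31 + b;
            return h;
        }, 1_000_000);
        System.out.println(String.format("avg: %.3f us/call", us));
    }
}
```

A single-shot average like this is only indicative; warm-up, JIT compilation tiers, and GC all add noise, which is part of why the numbers in this thread were hard to explain.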
## What changes were proposed in this pull request?

When running TPC-DS benchmarks on the 2.4 release, npoggi and winglungngai saw more than 10% performance regression on the following queries: q67, q24a and q24b. After applying PR #22338, the performance regression still exists. When the changes in #19222 were reverted, npoggi and winglungngai found the performance regression was resolved. Thus, this PR reverts the related changes to unblock the 2.4 release. In a future release, we can continue the investigation and find the root cause of the regression.

## How was this patch tested?

The existing test cases

Closes #22361 from gatorsmile/revertMemoryBlock.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 0b9ccd5)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>