
[SPARK-25317][CORE] Avoid perf regression in Murmur3 Hash on UTF8String #22338

Closed
wants to merge 2 commits

Conversation

mgaido91
Contributor

@mgaido91 mgaido91 commented Sep 5, 2018

What changes were proposed in this pull request?

SPARK-10399 introduced a performance regression on the hash computation for UTF8String.

The regression can be evaluated with the code attached in the JIRA. That code runs in about 120 us per method on my laptop (MacBook Pro 2.5 GHz Intel Core i7, RAM 16 GB 1600 MHz DDR3) while the code from branch 2.3 takes on the same machine about 45 us for me. After the PR, the code takes about 45 us on the master branch too.

How was this patch tested?

running the perf test from the JIRA
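The JIRA attachment itself is not reproduced here; a minimal sketch of this kind of per-call timing harness might look like the following (Murmur3Bench and its placeholder hash are hypothetical stand-ins, not Spark's actual benchmark code, which hashes UTF8String):

```java
// Hypothetical micro-benchmark sketch; the hash function is a placeholder,
// not real Murmur3, and the class name is assumed, not from Spark.
public class Murmur3Bench {
  public static int hashBytes(byte[] bytes, int seed) {
    int h1 = seed;
    for (byte b : bytes) {
      h1 = h1 * 31 + b;  // placeholder mixing, not Murmur3's mixK1/mixH1
    }
    return h1;
  }

  public static void main(String[] args) {
    byte[] data = new byte[64];
    int iterations = 1_000_000;
    int sink = 0;  // accumulate results to prevent dead-code elimination
    long start = System.nanoTime();
    for (int i = 0; i < iterations; i++) {
      sink += hashBytes(data, i);
    }
    double meanNs = (System.nanoTime() - start) / (double) iterations;
    System.out.println("mean ns/call: " + meanNs + " (sink=" + sink + ")");
  }
}
```

As with any JIT-compiled micro-benchmark, warm-up iterations and result sinks matter; the numbers quoted in this thread should be read as relative comparisons on one machine, not absolute costs.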

for (int i = lengthAligned; i < lengthInBytes; i++) {
-  int halfWord = base.getByte(i);
+  int halfWord = Platform.getByte(o, offset + i);
Member

So seems the performance regression is due to the cost of virtual function calls on MemoryBlock?
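The two call shapes under discussion can be illustrated with a simplified stand-in for Spark's MemoryBlock (DispatchSketch, MemoryBlockSketch and ByteArrayBlock are hypothetical names): a per-byte virtual call through an abstract block type versus a directly inlinable array access. HotSpot usually devirtualizes and inlines monomorphic call sites, which is consistent with the reply below that this change alone does not explain the regression.

```java
// Simplified stand-in illustrating virtual dispatch vs. direct array access.
// All names here are hypothetical, not Spark's actual MemoryBlock hierarchy.
public class DispatchSketch {
  public abstract static class MemoryBlockSketch {
    public abstract byte getByte(int offset);
  }

  public static final class ByteArrayBlock extends MemoryBlockSketch {
    private final byte[] bytes;
    public ByteArrayBlock(byte[] bytes) { this.bytes = bytes; }
    @Override public byte getByte(int offset) { return bytes[offset]; }
  }

  // Calls through the abstract type: a virtual dispatch per byte,
  // unless the JIT can prove the receiver type and devirtualize.
  public static int sumVirtual(MemoryBlockSketch block, int length) {
    int sum = 0;
    for (int i = 0; i < length; i++) sum += block.getByte(i);
    return sum;
  }

  // Direct array access: trivially inlinable, no dispatch.
  public static int sumDirect(byte[] bytes, int length) {
    int sum = 0;
    for (int i = 0; i < length; i++) sum += bytes[i];
    return sum;
  }
}
```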

Contributor Author

That was my guess too at the beginning, but if you make only this change, performance doesn't improve. What @kiszk said about the clue being the size of the generated Java bytecode seems reasonable, but it needs more investigation.

Member

Ok. It seems there is more than a single cause for this performance regression.

@mgaido91
Contributor Author

mgaido91 commented Sep 5, 2018

cc @cloud-fan @kiszk

I checked the bytecode generated and the size of the generated code seems not to be the issue either. FYI I am attaching here the disassembled code before and after:
afterPatch.txt
beforePatch.txt

@cloud-fan
Contributor

Thanks for working on it! Would it help if we moved these hash methods into MemoryBlock? e.g. the code could be int halfWord = bytes[offset + i];

@mgaido91
Contributor Author

mgaido91 commented Sep 5, 2018

@cloud-fan I think I tried doing something like what you suggested but it didn't help. Moreover, the current code in MemoryBlock already leverages Platform.get* for most of its methods, so that wouldn't really change much I think.

@cloud-fan
Contributor

This basically reverts the use of MemoryBlock in the hash computation; now the memory block is just a holder of the base object and base offset. It does fix the regression, but will we lose the perf speedup as well? IIUC we observed a significant perf boost when introducing the memory block.

@SparkQA

SparkQA commented Sep 5, 2018

Test build #95704 has finished for PR 22338 at commit 91adce5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member

kiszk commented Sep 5, 2018

@mgaido91 thanks, interestingly I ran experiments with similar code on my box.
Although I am using a Linux box, I can confirm the performance improvement (or rather, the performance recovery).

Regarding bytecode size, what I have seen in my experiments is the following. I have not found the root cause of this behavior. Is checkedCast so slow that it accounts for the 74 us difference?

Your original code: 38 us

  public static int hashUnsafeBytesBlock(MemoryBlock base, int seed) {
    return hashUnsafeBytesBlock(base, Ints.checkedCast(base.size()), seed);
  }

  private static int hashUnsafeBytesBlock(MemoryBlock base, int lengthInBytes, int seed) {
    // This is not compatible with the original and other implementations,
    // but is kept for backward compatibility for components existing before 2.3.
    assert (lengthInBytes >= 0): "lengthInBytes cannot be negative";
    int lengthAligned = lengthInBytes - lengthInBytes % 4;
    ...

Your original code with checkedCast moved: 112 us

  public static int hashUnsafeBytesBlock(MemoryBlock base, int seed) {
    // return hashUnsafeBytesBlock(base, Ints.checkedCast(base.size()), seed);
    return hashUnsafeBytesBlock(base, -1, seed);
  }

  private static int hashUnsafeBytesBlock(MemoryBlock base, int lengthInBytes, int seed) {
    // This is not compatible with the original and other implementations,
    // but is kept for backward compatibility for components existing before 2.3.
    lengthInBytes = Ints.checkedCast(base.size());  // MOVED checkedCast() here
    assert (lengthInBytes >= 0): "lengthInBytes cannot be negative";
    int lengthAligned = lengthInBytes - lengthInBytes % 4;
    ...
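For reference, Ints.checkedCast here is Guava's overflow-checked long-to-int narrowing. A behavior-equivalent sketch (CheckedCastSketch is a hypothetical name, not Guava's source):

```java
// Behavior-equivalent sketch of Guava's Ints.checkedCast: narrow long to int,
// throwing if the value does not fit. Class name is hypothetical.
public class CheckedCastSketch {
  public static int checkedCast(long value) {
    int result = (int) value;
    if (result != value) {
      throw new IllegalArgumentException("Out of range: " + value);
    }
    return result;
  }
}
```

The cast itself is a compare and branch, so the measurements above suggest that its position changes what the JIT can do with the surrounding call chain, rather than the cast dominating the time on its own.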

@mgaido91
Contributor Author

mgaido91 commented Sep 5, 2018

@kiszk yes, I know. Moving checkedCast makes a big difference, but if you move only that, without the other changes, there is no perf gain (at least that is what I found in my experiments).

@kiszk
Member

kiszk commented Sep 5, 2018

In addition to your commit, I applied the following change, which basically uses MemoryBlock in hashBytesByIntBlock(). I got 33 us.

  private static int hashBytesByIntBlock(MemoryBlock base, int lengthInBytes, int seed) {
    assert (lengthInBytes % 4 == 0);
    int h1 = seed;
    for (int i = 0; i < lengthInBytes; i += 4) {
      int halfWord = base.getInt(i);
      int k1 = mixK1(halfWord);
      h1 = mixH1(h1, k1);
    }
    return h1;
  }

But furthermore, when I also used MemoryBlock in hashUnsafeBytesBlock(), I got 111 us.
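For context, the mixing steps used by the loop above can be sketched self-contained over a plain byte[] (HashLoopSketch and getIntLE are hypothetical helpers; the constants follow Murmur3_x86_32, and little-endian word assembly is assumed, matching x86):

```java
// Self-contained sketch of the word-at-a-time Murmur3 loop discussed above.
// Class and helper names are hypothetical; mixK1/mixH1 use the standard
// Murmur3_x86_32 constants. Little-endian byte order is assumed.
public class HashLoopSketch {
  static int mixK1(int k1) {
    k1 *= 0xcc9e2d51;
    k1 = Integer.rotateLeft(k1, 15);
    k1 *= 0x1b873593;
    return k1;
  }

  static int mixH1(int h1, int k1) {
    h1 ^= k1;
    h1 = Integer.rotateLeft(h1, 13);
    return h1 * 5 + 0xe6546b64;
  }

  // Assemble a little-endian 4-byte word, as Platform.getInt would on x86.
  static int getIntLE(byte[] b, int i) {
    return (b[i] & 0xff)
        | (b[i + 1] & 0xff) << 8
        | (b[i + 2] & 0xff) << 16
        | (b[i + 3] & 0xff) << 24;
  }

  public static int hashBytesByInt(byte[] bytes, int lengthInBytes, int seed) {
    assert lengthInBytes % 4 == 0;
    int h1 = seed;
    for (int i = 0; i < lengthInBytes; i += 4) {
      h1 = mixH1(h1, mixK1(getIntLE(bytes, i)));
    }
    return h1;
  }
}
```

Spark's actual implementation lives in org.apache.spark.unsafe.hash.Murmur3_x86_32; this sketch only mirrors the aligned word loop being benchmarked, not the trailing-byte handling or final avalanche.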

@mgaido91
Contributor Author

mgaido91 commented Sep 5, 2018

This does fix the regression, but will we lose the perf speedup as well? IIUC we observed a significant perf boost when introducing the memory block.

@cloud-fan I checked the original PR and I couldn't find any benchmark related to Murmur3 hash. The only benchmark I found there was using the HiveHasher, so I cannot really answer this. Do you have a benchmark to test it? Thanks.

@mgaido91
Contributor Author

mgaido91 commented Sep 5, 2018

You're right @kiszk. Let me update the PR accordingly in order to narrow down the change/problem, thanks.

@kiszk
Member

kiszk commented Sep 5, 2018

While I see these performance differences, I do not completely understand why they occur. That is why I said "I have not found the root cause".

Let us narrow down the problem and find the root cause.

@mgaido91
Contributor Author

mgaido91 commented Sep 5, 2018

While I see these performance differences, I do not completely understand why they occur. That is why I said "I have not found the root cause".

Yes, I am in the same situation. I see the difference but I cannot explain it.

@viirya
Member

viirya commented Sep 5, 2018

Yeah, it's interesting... It seems performance improves only when both the checkedCast and the Platform.getByte changes are applied; neither change works on its own.

@SparkQA

SparkQA commented Sep 5, 2018

Test build #95721 has finished for PR 22338 at commit 6cb94ed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

Since the change looks safer to me and it does fix the regression, I'm merging it to unblock 2.4 release. Please continue to investigate the root cause, thanks!

asfgit pushed a commit that referenced this pull request Sep 6, 2018
## What changes were proposed in this pull request?

SPARK-10399 introduced a performance regression on the hash computation for UTF8String.

The regression can be evaluated with the code attached in the JIRA. That code runs in about 120 us per method on my laptop (MacBook Pro 2.5 GHz Intel Core i7, RAM 16 GB 1600 MHz DDR3) while the code from branch 2.3 takes on the same machine about 45 us for me. After the PR, the code takes about 45 us on the master branch too.

## How was this patch tested?

running the perf test from the JIRA

Closes #22338 from mgaido91/SPARK-25317.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 64c314e)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@asfgit asfgit closed this in 64c314e Sep 6, 2018
asfgit pushed a commit that referenced this pull request Sep 9, 2018
## What changes were proposed in this pull request?

When running TPC-DS benchmarks on the 2.4 release, npoggi and winglungngai saw a more than 10% performance regression on the following queries: q67, q24a and q24b. After applying PR #22338, the performance regression still exists. If we revert the changes in #19222, npoggi and winglungngai found that the performance regression is resolved. Thus, this PR reverts the related changes to unblock the 2.4 release.

In a future release, we can continue the investigation and find the root cause of the regression.

## How was this patch tested?

The existing test cases

Closes #22361 from gatorsmile/revertMemoryBlock.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 0b9ccd5)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
asfgit pushed a commit that referenced this pull request Sep 9, 2018
## What changes were proposed in this pull request?

When running TPC-DS benchmarks on the 2.4 release, npoggi and winglungngai saw a more than 10% performance regression on the following queries: q67, q24a and q24b. After applying PR #22338, the performance regression still exists. If we revert the changes in #19222, npoggi and winglungngai found that the performance regression is resolved. Thus, this PR reverts the related changes to unblock the 2.4 release.

In a future release, we can continue the investigation and find the root cause of the regression.

## How was this patch tested?

The existing test cases

Closes #22361 from gatorsmile/revertMemoryBlock.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>