
[SPARK-25317][CORE] Avoid perf regression in Murmur3 Hash on UTF8String #22338

Closed
wants to merge 2 commits

Conversation

mgaido91
Contributor

@mgaido91 mgaido91 commented Sep 5, 2018

What changes were proposed in this pull request?

SPARK-10399 introduced a performance regression on the hash computation for UTF8String.

The regression can be evaluated with the code attached in the JIRA. That code runs in about 120 us per method on my laptop (MacBook Pro 2.5 GHz Intel Core i7, RAM 16 GB 1600 MHz DDR3) while the code from branch 2.3 takes on the same machine about 45 us for me. After the PR, the code takes about 45 us on the master branch too.

How was this patch tested?

running the perf test from the JIRA
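The JIRA attachment itself is not reproduced here; a minimal sketch of this kind of per-call timing harness might look like the following (Murmur3Bench and its placeholder hash are hypothetical stand-ins, not Spark's actual benchmark code, which hashes UTF8String):

```java
// Hypothetical micro-benchmark sketch; the hash function is a placeholder,
// not real Murmur3, and the class name is assumed, not from Spark.
public class Murmur3Bench {
  public static int hashBytes(byte[] bytes, int seed) {
    int h1 = seed;
    for (byte b : bytes) {
      h1 = h1 * 31 + b;  // placeholder mixing, not Murmur3's mixK1/mixH1
    }
    return h1;
  }

  public static void main(String[] args) {
    byte[] data = new byte[64];
    int iterations = 1_000_000;
    int sink = 0;  // accumulate results to prevent dead-code elimination
    long start = System.nanoTime();
    for (int i = 0; i < iterations; i++) {
      sink += hashBytes(data, i);
    }
    double meanNs = (System.nanoTime() - start) / (double) iterations;
    System.out.println("mean ns/call: " + meanNs + " (sink=" + sink + ")");
  }
}
```

As with any JIT-compiled micro-benchmark, warm-up iterations and result sinks matter; the numbers quoted in this thread should be read as relative comparisons on one machine, not absolute costs.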

for (int i = lengthAligned; i < lengthInBytes; i++) {
-  int halfWord = base.getByte(i);
+  int halfWord = Platform.getByte(o, offset + i);
Member

So seems the performance regression is due to the cost of virtual function calls on MemoryBlock?
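The two call shapes under discussion can be illustrated with a simplified stand-in for Spark's MemoryBlock (DispatchSketch, MemoryBlockSketch and ByteArrayBlock are hypothetical names): a per-byte virtual call through an abstract block type versus a directly inlinable array access. HotSpot usually devirtualizes and inlines monomorphic call sites, which is consistent with the reply below that this change alone does not explain the regression.

```java
// Simplified stand-in illustrating virtual dispatch vs. direct array access.
// All names here are hypothetical, not Spark's actual MemoryBlock hierarchy.
public class DispatchSketch {
  public abstract static class MemoryBlockSketch {
    public abstract byte getByte(int offset);
  }

  public static final class ByteArrayBlock extends MemoryBlockSketch {
    private final byte[] bytes;
    public ByteArrayBlock(byte[] bytes) { this.bytes = bytes; }
    @Override public byte getByte(int offset) { return bytes[offset]; }
  }

  // Calls through the abstract type: a virtual dispatch per byte,
  // unless the JIT can prove the receiver type and devirtualize.
  public static int sumVirtual(MemoryBlockSketch block, int length) {
    int sum = 0;
    for (int i = 0; i < length; i++) sum += block.getByte(i);
    return sum;
  }

  // Direct array access: trivially inlinable, no dispatch.
  public static int sumDirect(byte[] bytes, int length) {
    int sum = 0;
    for (int i = 0; i < length; i++) sum += bytes[i];
    return sum;
  }
}
```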

Contributor Author

That was my guess too at the beginning, but if you make only this change, performance doesn't improve. What @kiszk said about the clue being the size of the generated Java bytecode seems reasonable, but it needs more investigation.

Member

Ok. It seems there is more than a single cause for this performance regression.

@mgaido91
Contributor Author

mgaido91 commented Sep 5, 2018

cc @cloud-fan @kiszk

I checked the bytecode generated and the size of the generated code seems not to be the issue either. FYI I am attaching here the disassembled code before and after:
afterPatch.txt
beforePatch.txt

@cloud-fan
Contributor

Thanks for working on it! Would it help if we moved these hash methods into MemoryBlock? e.g. the code could be int halfWord = bytes[offset + i];

@mgaido91
Contributor Author

mgaido91 commented Sep 5, 2018

@cloud-fan I think I tried doing something like what you suggested but it didn't help. Moreover, the current code in MemoryBlock already leverages Platform.get* for most of its methods, so that wouldn't really change much I think.

@cloud-fan
Contributor

This basically reverts the use of MemoryBlock in the hash computation; now the memory block is just a holder of the base object and base offset. It does fix the regression, but will we lose the perf speedup as well? IIUC we observed a significant perf boost when introducing the memory block.

@SparkQA

SparkQA commented Sep 5, 2018

Test build #95704 has finished for PR 22338 at commit 91adce5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member

kiszk commented Sep 5, 2018

@mgaido91 thanks, interestingly I ran experiments with similar code on my box.
Although I am using a Linux box, I can confirm the performance improvement (or rather, the performance recovery).

Regarding bytecode size, what I have seen in my experiments is the following. I have not found the root cause of this behavior. Is checkedCast so slow that it accounts for the 74 us difference?

Your original code: 38 us

  public static int hashUnsafeBytesBlock(MemoryBlock base, int seed) {
    return hashUnsafeBytesBlock(base, Ints.checkedCast(base.size()), seed);
  }

  private static int hashUnsafeBytesBlock(MemoryBlock base, int lengthInBytes, int seed) {
    // This is not compatible with the original and other implementations,
    // but is kept for backward compatibility for components existing before 2.3.
    assert (lengthInBytes >= 0): "lengthInBytes cannot be negative";
    int lengthAligned = lengthInBytes - lengthInBytes % 4;
    ...

Your original code with checkedCast moved: 112 us

  public static int hashUnsafeBytesBlock(MemoryBlock base, int seed) {
    // return hashUnsafeBytesBlock(base, Ints.checkedCast(base.size()), seed);
    return hashUnsafeBytesBlock(base, -1, seed);
  }

  private static int hashUnsafeBytesBlock(MemoryBlock base, int lengthInBytes, int seed) {
    // This is not compatible with the original and other implementations,
    // but is kept for backward compatibility for components existing before 2.3.
    lengthInBytes = Ints.checkedCast(base.size());  // MOVED checkedCast() here
    assert (lengthInBytes >= 0): "lengthInBytes cannot be negative";
    int lengthAligned = lengthInBytes - lengthInBytes % 4;
    ...
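For reference, Ints.checkedCast here is Guava's overflow-checked long-to-int narrowing. A behavior-equivalent sketch (CheckedCastSketch is a hypothetical name, not Guava's source):

```java
// Behavior-equivalent sketch of Guava's Ints.checkedCast: narrow long to int,
// throwing if the value does not fit. Class name is hypothetical.
public class CheckedCastSketch {
  public static int checkedCast(long value) {
    int result = (int) value;
    if (result != value) {
      throw new IllegalArgumentException("Out of range: " + value);
    }
    return result;
  }
}
```

The cast itself is a compare and branch, so the measurements above suggest that its position changes what the JIT can do with the surrounding call chain, rather than the cast dominating the time on its own.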

@mgaido91
Contributor Author

mgaido91 commented Sep 5, 2018

@kiszk yes, I know. Moving checkedCast makes a big difference, but if you move only that, without the other changes, there is no perf gain (at least that is what I found in my experiments).

@kiszk
Member

kiszk commented Sep 5, 2018

In addition to your commit, I applied the following change, which basically uses MemoryBlock in hashBytesByIntBlock(). I got 33 us.

  private static int hashBytesByIntBlock(MemoryBlock base, int lengthInBytes, int seed) {
    assert (lengthInBytes % 4 == 0);
    int h1 = seed;
    for (int i = 0; i < lengthInBytes; i += 4) {
      int halfWord = base.getInt(i);
      int k1 = mixK1(halfWord);
      h1 = mixH1(h1, k1);
    }
    return h1;
  }

But furthermore, when I also used MemoryBlock in hashUnsafeBytesBlock(), I got 111 us.
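For context, the mixing steps used by the loop above can be sketched self-contained over a plain byte[] (HashLoopSketch and getIntLE are hypothetical helpers; the constants follow Murmur3_x86_32, and little-endian word assembly is assumed, matching x86):

```java
// Self-contained sketch of the word-at-a-time Murmur3 loop discussed above.
// Class and helper names are hypothetical; mixK1/mixH1 use the standard
// Murmur3_x86_32 constants. Little-endian byte order is assumed.
public class HashLoopSketch {
  static int mixK1(int k1) {
    k1 *= 0xcc9e2d51;
    k1 = Integer.rotateLeft(k1, 15);
    k1 *= 0x1b873593;
    return k1;
  }

  static int mixH1(int h1, int k1) {
    h1 ^= k1;
    h1 = Integer.rotateLeft(h1, 13);
    return h1 * 5 + 0xe6546b64;
  }

  // Assemble a little-endian 4-byte word, as Platform.getInt would on x86.
  static int getIntLE(byte[] b, int i) {
    return (b[i] & 0xff)
        | (b[i + 1] & 0xff) << 8
        | (b[i + 2] & 0xff) << 16
        | (b[i + 3] & 0xff) << 24;
  }

  public static int hashBytesByInt(byte[] bytes, int lengthInBytes, int seed) {
    assert lengthInBytes % 4 == 0;
    int h1 = seed;
    for (int i = 0; i < lengthInBytes; i += 4) {
      h1 = mixH1(h1, mixK1(getIntLE(bytes, i)));
    }
    return h1;
  }
}
```

Spark's actual implementation lives in org.apache.spark.unsafe.hash.Murmur3_x86_32; this sketch only mirrors the aligned word loop being benchmarked, not the trailing-byte handling or final avalanche.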

@mgaido91
Contributor Author

mgaido91 commented Sep 5, 2018

This does fix the regression, but will we lose the perf speedup as well? IIUC we observed a significant perf boost when introducing the memory block.

@cloud-fan I checked the original PR and I couldn't find any benchmark related to Murmur3 hash. The only benchmark I found there was using the HiveHasher, so I cannot really answer this. Do you have a benchmark to test it? Thanks.

@mgaido91
Contributor Author

mgaido91 commented Sep 5, 2018

You're right @kiszk. Let me update the PR accordingly in order to narrow down the change/problem, thanks.

@kiszk
Member

kiszk commented Sep 5, 2018

While I see these performance differences, I do not completely understand why they occur. That is why I said "I have not found the root cause".

Let us narrow down the problem and find the root cause.

@mgaido91
Contributor Author

mgaido91 commented Sep 5, 2018

While I see these performance differences, I do not completely understand why they occur. That is why I said "I have not found the root cause".

Yes, I am in the same situation. I see the difference but I cannot explain it.

@viirya
Member

viirya commented Sep 5, 2018

Yeah, it's interesting... It seems performance improves only when both the checkedCast and the Platform.getByte changes are applied; neither change works on its own.

@SparkQA

SparkQA commented Sep 5, 2018

Test build #95721 has finished for PR 22338 at commit 6cb94ed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

Since the change looks safer to me and it does fix the regression, I'm merging it to unblock 2.4 release. Please continue to investigate the root cause, thanks!

asfgit pushed a commit that referenced this pull request Sep 6, 2018
## What changes were proposed in this pull request?

SPARK-10399 introduced a performance regression on the hash computation for UTF8String.

The regression can be evaluated with the code attached in the JIRA. That code runs in about 120 us per method on my laptop (MacBook Pro 2.5 GHz Intel Core i7, RAM 16 GB 1600 MHz DDR3) while the code from branch 2.3 takes on the same machine about 45 us for me. After the PR, the code takes about 45 us on the master branch too.

## How was this patch tested?

running the perf test from the JIRA

Closes #22338 from mgaido91/SPARK-25317.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 64c314e)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@asfgit asfgit closed this in 64c314e Sep 6, 2018
asfgit pushed a commit that referenced this pull request Sep 9, 2018
## What changes were proposed in this pull request?

When running TPC-DS benchmarks on the 2.4 release, npoggi and winglungngai saw a more than 10% performance regression on the following queries: q67, q24a and q24b. After applying PR #22338, the performance regression still exists. If we revert the changes in #19222, npoggi and winglungngai found that the performance regression is resolved. Thus, this PR reverts the related changes to unblock the 2.4 release.

In a future release, we can continue the investigation and find the root cause of the regression.

## How was this patch tested?

The existing test cases

Closes #22361 from gatorsmile/revertMemoryBlock.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 0b9ccd5)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
asfgit pushed a commit that referenced this pull request Sep 9, 2018
## What changes were proposed in this pull request?

When running TPC-DS benchmarks on the 2.4 release, npoggi and winglungngai saw a more than 10% performance regression on the following queries: q67, q24a and q24b. After applying PR #22338, the performance regression still exists. If we revert the changes in #19222, npoggi and winglungngai found that the performance regression is resolved. Thus, this PR reverts the related changes to unblock the 2.4 release.

In a future release, we can continue the investigation and find the root cause of the regression.

## How was this patch tested?

The existing test cases

Closes #22361 from gatorsmile/revertMemoryBlock.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>