diff --git a/README.md b/README.md
index 63937b2..1917fe3 100644
--- a/README.md
+++ b/README.md
@@ -4,15 +4,37 @@ murmur3
 Native Go implementation of Austin Appleby's third MurmurHash revision (aka
 MurmurHash3).
 
-Includes assembly for amd64 for go1.5+ for 128 bit hashes, seeding function,
+Includes assembly for amd64 for 64/128 bit hashes, seeding functions,
 and string functions to avoid string to slice conversions.
 
-Hand rolled 32 bit assembly was removed during 1.11 due to Go's compiler
-catching up and generating equal or better assembly.
+Hand rolled 32 bit assembly was removed during Go 1.11, but may be
+reintroduced if the compiler regresses further. As is, the compiler generates
+marginally slower code (by one instruction in the hot loop).
 
 The reference algorithm has been slightly hacked as to support the streaming
 mode required by Go's standard [Hash interface](http://golang.org/pkg/hash/#Hash).
 
+Endianness
+==========
+
+Unlike the canonical source, this library **always** reads bytes as little
+endian numbers. This makes the hashes portable across architectures, although
+it does mean that hashing is a bit slower on big endian architectures.
+
+Safety
+======
+
+This library used to use `unsafe` to convert four bytes to a `uint32` and
+eight bytes to a `uint64`, but Go 1.14 introduced checks around those types
+of conversions that flagged that code as erroneous when hashing unaligned
+input. While the code would not be problematic on amd64, it could be
+problematic on other architectures.
+
+As of Go 1.14, those conversions were removed at the expense of a very minor
+performance hit. This hit affects all CPU architectures for `Sum32`, and
+non-amd64 architectures for `Sum64` and `Sum128`. For the 64 and 128 bit
+sums, custom amd64 assembly preserves the prior performance.
+
 Testing
 =======
 
@@ -22,6 +44,11 @@ Testing includes comparing random inputs against the
 [canonical implementation](https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp),
 and testing length 0 through 17 inputs to force all branches.
 
+Because this code always reads input as little endian, testing against the
+canonical source is skipped on big endian architectures. The canonical source
+converts bytes to numbers using the host's byte order, so on big endian
+architectures it hashes different numbers.
+
 Documentation
 =============
 
@@ -32,53 +59,71 @@ Full documentation can be found on `godoc`.
 
 Benchmarks
 ==========
 
-The following benchmarks show deltas for the 128 bit algorithms only; the 32
-bit algorithms have the same implementation.
+Benchmarks below were run on an amd64 machine with _and_ without the custom
+assembly. The following numbers are for Go 1.14.1 and compare against
+[spaolacci/murmur3](https://github.com/spaolacci/murmur3).
+
+You will notice that at small sizes the other library is faster; this is due
+to this library switching to safe code for Go 1.14. At large sizes, the two
+libraries are nearly identical. On amd64, the 64 bit and 128 bit sums come
+out ~9% faster.
 
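+Numbers like these come from per-size throughput benchmarks, with two runs
+compared via `benchstat`. The sketch below is illustrative, not necessarily
+this repository's exact benchmark code, and the import path is assumed:
+
+```go
+package murmur3_test
+
+import (
+	"strconv"
+	"testing"
+
+	"github.com/twmb/murmur3" // import path assumed
+)
+
+var sink uint64 // keeps the compiler from optimizing the sum away
+
+func Benchmark64Sizes(b *testing.B) {
+	for _, size := range []int{32, 64, 128, 256, 512, 1024, 2048, 4096, 8192} {
+		buf := make([]byte, size)
+		b.Run(strconv.Itoa(size), func(b *testing.B) {
+			b.SetBytes(int64(size)) // makes `go test -bench` report GB/s
+			for i := 0; i < b.N; i++ {
+				sink = murmur3.Sum64(buf)
+			}
+		})
+	}
+}
+```
+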
-In comparison to [spaolacci/murmur3](https://github.com/spaolacci/murmur3) on
-Go at commit [447965d4e0](https://github.com/golang/go/commit/447965d4e0)
-(i.e., post 1.11):
+32 bit sums:
 
 ```
-benchmark                    old ns/op     new ns/op     delta
-Benchmark128Branches/0-4     22.2          6.28          -71.71%
-Benchmark128Branches/1-4     23.6          8.46          -64.15%
-Benchmark128Branches/2-4     24.3          8.68          -64.28%
-Benchmark128Branches/3-4     24.7          9.07          -63.28%
-Benchmark128Branches/4-4     25.2          8.16          -67.62%
-Benchmark128Branches/5-4     25.9          8.89          -65.68%
-Benchmark128Branches/6-4     26.8          9.32          -65.22%
-Benchmark128Branches/7-4     27.4          9.82          -64.16%
-Benchmark128Branches/8-4     28.1          7.68          -72.67%
-Benchmark128Branches/9-4     29.6          9.04          -69.46%
-Benchmark128Branches/10-4    30.2          9.14          -69.74%
-Benchmark128Branches/11-4    30.8          9.53          -69.06%
-Benchmark128Branches/12-4    31.5          8.65          -72.54%
-Benchmark128Branches/13-4    31.5          9.26          -70.60%
-Benchmark128Branches/14-4    32.5          9.69          -70.18%
-Benchmark128Branches/15-4    33.4          10.1          -69.76%
-Benchmark128Branches/16-4    24.9          10.0          -59.84%
-Benchmark64Sizes/32-4        27.8          13.6          -51.08%
-Benchmark64Sizes/64-4        35.2          18.8          -46.59%
-Benchmark64Sizes/128-4       49.6          30.5          -38.51%
-Benchmark64Sizes/256-4       77.9          54.5          -30.04%
-Benchmark64Sizes/512-4       136           105           -22.79%
-Benchmark64Sizes/1024-4      251           209           -16.73%
-Benchmark64Sizes/2048-4      492           419           -14.84%
-Benchmark64Sizes/4096-4      952           832           -12.61%
-Benchmark64Sizes/8192-4      1879          1658          -11.76%
-Benchmark128Sizes/32-4       28.5          13.6          -52.28%
-Benchmark128Sizes/64-4       35.7          18.7          -47.62%
-Benchmark128Sizes/128-4      49.8          30.3          -39.16%
-Benchmark128Sizes/256-4      78.0          54.2          -30.51%
-Benchmark128Sizes/512-4      135           105           -22.22%
-Benchmark128Sizes/1024-4     250           209           -16.40%
-Benchmark128Sizes/2048-4     489           419           -14.31%
-Benchmark128Sizes/4096-4     959           831           -13.35%
-Benchmark128Sizes/8192-4     1885          1659          -11.99%
-BenchmarkNoescape128-4       3226          1824          -43.46%
+32Sizes/32-12      3.00GB/s ± 1%  2.12GB/s ±11%  -29.24%  (p=0.000 n=9+10)
+32Sizes/64-12      3.61GB/s ± 3%  2.79GB/s ± 8%  -22.62%  (p=0.000 n=10+10)
+32Sizes/128-12     3.47GB/s ± 8%  2.79GB/s ± 4%  -19.47%  (p=0.000 n=10+10)
+32Sizes/256-12     3.66GB/s ± 4%  3.25GB/s ± 6%  -11.09%  (p=0.000 n=10+10)
+32Sizes/512-12     3.78GB/s ± 3%  3.54GB/s ± 4%   -6.30%  (p=0.000 n=9+9)
+32Sizes/1024-12    3.86GB/s ± 3%  3.69GB/s ± 5%   -4.46%  (p=0.000 n=10+10)
+32Sizes/2048-12    3.85GB/s ± 3%  3.81GB/s ± 3%      ~    (p=0.079 n=10+9)
+32Sizes/4096-12    3.90GB/s ± 3%  3.82GB/s ± 2%   -2.14%  (p=0.029 n=10+10)
+32Sizes/8192-12    3.82GB/s ± 3%  3.78GB/s ± 7%      ~    (p=0.529 n=10+10)
 ```
 
-The speedup for large inputs levels out around ~1.12x. Additionally,
-this code avoids allocating stack slices unnecessarily for the 128
-algorithm, unlike `spaolacci/murmur3`.
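+The small-size slowdown above is the cost of dropping `unsafe` (see the
+Safety section). As an illustration only (not this library's exact code),
+the safe little endian load that replaces an unsafe pointer cast:
+
+```go
+package main
+
+import (
+	"encoding/binary"
+	"fmt"
+)
+
+func main() {
+	data := []byte{0x01, 0x02, 0x03, 0x04}
+
+	// Old style, roughly: k := *(*uint32)(unsafe.Pointer(&data[0])).
+	// Go 1.14's checkptr instrumentation (enabled under -race) flags
+	// that cast when data is not 4-byte aligned.
+	k := binary.LittleEndian.Uint32(data) // always little endian, per above
+	fmt.Printf("%#x\n", k)                // 0x4030201
+}
+```
+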
+64/128 bit sums, non-amd64:
+
+```
+64Sizes/32-12      2.34GB/s ± 5%  2.64GB/s ± 9%  +12.87%  (p=0.000 n=10+10)
+64Sizes/64-12      3.62GB/s ± 5%  3.96GB/s ± 4%   +9.41%  (p=0.000 n=10+10)
+64Sizes/128-12     5.12GB/s ± 3%  5.44GB/s ± 4%   +6.09%  (p=0.000 n=10+9)
+64Sizes/256-12     6.35GB/s ± 2%  6.27GB/s ± 9%      ~    (p=0.796 n=10+10)
+64Sizes/512-12     6.58GB/s ± 7%  6.79GB/s ± 3%      ~    (p=0.075 n=10+10)
+64Sizes/1024-12    7.49GB/s ± 3%  7.55GB/s ± 9%      ~    (p=0.393 n=10+10)
+64Sizes/2048-12    8.06GB/s ± 2%  7.90GB/s ± 6%      ~    (p=0.156 n=9+10)
+64Sizes/4096-12    8.27GB/s ± 6%  8.22GB/s ± 5%      ~    (p=0.631 n=10+10)
+64Sizes/8192-12    8.35GB/s ± 4%  8.38GB/s ± 6%      ~    (p=0.631 n=10+10)
+128Sizes/32-12     2.27GB/s ± 2%  2.68GB/s ± 5%  +18.00%  (p=0.000 n=10+10)
+128Sizes/64-12     3.55GB/s ± 2%  4.00GB/s ± 3%  +12.47%  (p=0.000 n=8+9)
+128Sizes/128-12    5.09GB/s ± 1%  5.43GB/s ± 3%   +6.65%  (p=0.000 n=9+9)
+128Sizes/256-12    6.33GB/s ± 3%  5.65GB/s ± 4%  -10.79%  (p=0.000 n=9+10)
+128Sizes/512-12    6.78GB/s ± 3%  6.74GB/s ± 6%      ~    (p=0.968 n=9+10)
+128Sizes/1024-12   7.46GB/s ± 4%  7.56GB/s ± 4%      ~    (p=0.222 n=9+9)
+128Sizes/2048-12   7.99GB/s ± 4%  7.96GB/s ± 3%      ~    (p=0.666 n=9+9)
+128Sizes/4096-12   8.20GB/s ± 2%  8.25GB/s ± 4%      ~    (p=0.631 n=10+10)
+128Sizes/8192-12   8.24GB/s ± 2%  8.26GB/s ± 5%      ~    (p=0.673 n=8+9)
+```
+
+64/128 bit sums, amd64:
+
+```
+64Sizes/32-12      2.34GB/s ± 5%  4.36GB/s ± 3%  +85.86%  (p=0.000 n=10+10)
+64Sizes/64-12      3.62GB/s ± 5%  6.27GB/s ± 3%  +73.37%  (p=0.000 n=10+9)
+64Sizes/128-12     5.12GB/s ± 3%  7.70GB/s ± 6%  +50.27%  (p=0.000 n=10+10)
+64Sizes/256-12     6.35GB/s ± 2%  8.61GB/s ± 3%  +35.50%  (p=0.000 n=10+10)
+64Sizes/512-12     6.58GB/s ± 7%  8.59GB/s ± 4%  +30.48%  (p=0.000 n=10+9)
+64Sizes/1024-12    7.49GB/s ± 3%  8.81GB/s ± 2%  +17.66%  (p=0.000 n=10+10)
+64Sizes/2048-12    8.06GB/s ± 2%  8.90GB/s ± 4%  +10.49%  (p=0.000 n=9+10)
+64Sizes/4096-12    8.27GB/s ± 6%  8.90GB/s ± 4%   +7.54%  (p=0.000 n=10+10)
+64Sizes/8192-12    8.35GB/s ± 4%  9.00GB/s ± 3%   +7.80%  (p=0.000 n=10+9)
+128Sizes/32-12     2.27GB/s ± 2%  4.29GB/s ± 9%  +88.75%  (p=0.000 n=10+10)
+128Sizes/64-12     3.55GB/s ± 2%  6.10GB/s ± 8%  +71.78%  (p=0.000 n=8+10)
+128Sizes/128-12    5.09GB/s ± 1%  7.62GB/s ± 9%  +49.63%  (p=0.000 n=9+10)
+128Sizes/256-12    6.33GB/s ± 3%  8.65GB/s ± 3%  +36.71%  (p=0.000 n=9+10)
+128Sizes/512-12    6.78GB/s ± 3%  8.39GB/s ± 6%  +23.77%  (p=0.000 n=9+10)
+128Sizes/1024-12   7.46GB/s ± 4%  8.70GB/s ± 4%  +16.70%  (p=0.000 n=9+10)
+128Sizes/2048-12   7.99GB/s ± 4%  8.73GB/s ± 8%   +9.26%  (p=0.003 n=9+10)
+128Sizes/4096-12   8.20GB/s ± 2%  8.86GB/s ± 6%   +8.00%  (p=0.000 n=10+10)
+128Sizes/8192-12   8.24GB/s ± 2%  9.01GB/s ± 3%   +9.30%  (p=0.000 n=8+10)
+```
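+
+For reference, a small sketch of the functions being benchmarked above,
+covering both the one-shot sums and the streaming form required by Go's
+`hash` interfaces (import path assumed; see `godoc` for the full API):
+
+```go
+package main
+
+import (
+	"fmt"
+
+	"github.com/twmb/murmur3" // import path assumed
+)
+
+func main() {
+	data := []byte("hello, world")
+
+	// One-shot sums; Sum128 returns the 128 bit hash as two uint64 halves.
+	fmt.Println(murmur3.Sum32(data))
+	fmt.Println(murmur3.Sum64(data))
+	fmt.Println(murmur3.Sum128(data))
+
+	// Streaming mode: New64 returns a standard hash.Hash64.
+	h := murmur3.New64()
+	h.Write(data)
+	fmt.Println(h.Sum64())
+}
+```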