-
Notifications
You must be signed in to change notification settings - Fork 17.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cmd/compile: Array/slice clearing idiom inconsistently recognized #56954
Comments
It's not clear to me that any of these are particularly important. Certainly it is easy to work around if you notice it. It's not clear to me that 1 is faster than 2. I think it is on some more recent machines, but that's not why we chose it - sometimes we do memclrs in situations where calls are not allowed, so things like
This can be subtle. For instance,
We can't rewrite to a clear of some sort, because if |
My benchmarks suggest that they're about equally fast, on an Intel with the ERMS feature. I'll patch that code, feel free to close if you deem this too unimportant. |
Benchmark results on amd64 below. These do not take into account klauspost#701. They're on Go 1.19; Go 1.20 produces slightly better asm for the old code, but still produces terrible asm on 32-bit platforms. See also golang/go#56954. name old speed new speed delta Encoder_EncodeAllXML-8 283MB/s ± 1% 284MB/s ± 0% ~ (p=0.026 n=30+20) Encoder_EncodeAllSimple/fastest-8 111MB/s ± 0% 111MB/s ± 1% ~ (p=0.011 n=28+20) Encoder_EncodeAllSimple/default-8 78.4MB/s ± 1% 78.3MB/s ± 1% ~ (p=0.572 n=30+19) Encoder_EncodeAllSimple/better-8 65.9MB/s ± 1% 66.2MB/s ± 1% +0.53% (p=0.009 n=30+20) Encoder_EncodeAllSimple/best-8 11.1MB/s ± 1% 11.6MB/s ± 3% +4.42% (p=0.000 n=27+28) Encoder_EncodeAllSimple4K/fastest-8 911MB/s ± 1% 914MB/s ± 1% +0.31% (p=0.004 n=29+20) Encoder_EncodeAllSimple4K/default-8 73.1MB/s ± 1% 73.6MB/s ± 1% +0.67% (p=0.000 n=29+20) Encoder_EncodeAllSimple4K/better-8 60.5MB/s ± 1% 62.7MB/s ± 1% +3.64% (p=0.000 n=29+17) Encoder_EncodeAllSimple4K/best-8 8.62MB/s ± 3% 10.11MB/s ± 1% +17.24% (p=0.000 n=30+27) Encoder_EncodeAllHTML-8 133MB/s ± 1% 133MB/s ± 1% ~ (p=0.101 n=30+19) Encoder_EncodeAllTwain-8 84.8MB/s ± 1% 86.2MB/s ± 3% +1.63% (p=0.000 n=24+20) Encoder_EncodeAllPi-8 62.6MB/s ± 1% 62.7MB/s ± 0% ~ (p=0.102 n=30+20) Random4KEncodeAllFastest-8 2.50GB/s ± 1% 2.50GB/s ± 1% ~ (p=0.449 n=29+20) Random10MBEncodeAllFastest-8 2.39GB/s ± 2% 2.52GB/s ± 6% +5.23% (p=0.000 n=27+20) name old alloc/op new alloc/op delta Encoder_EncodeAllXML-8 0.00B 0.00B ~ (all equal) Encoder_EncodeAllSimple/fastest-8 2.73B ±27% 3.00B ± 0% ~ (p=0.018 n=30+18) Encoder_EncodeAllSimple/default-8 4.00B ± 0% 4.00B ± 0% ~ (all equal) Encoder_EncodeAllSimple/better-8 5.00B ± 0% 5.00B ± 0% ~ (all equal) Encoder_EncodeAllSimple/best-8 19.5B ± 3% 19.0B ± 0% -2.40% (p=0.000 n=30+24) Encoder_EncodeAllSimple4K/fastest-8 0.00B 0.00B ~ (all equal) Encoder_EncodeAllSimple4K/default-8 0.00B 0.00B ~ (all equal) Encoder_EncodeAllSimple4K/better-8 0.00B 0.00B ~ (all equal) Encoder_EncodeAllSimple4K/best-8 2.00B ± 0% 1.43B ±40% -28.33% (p=0.000 n=30+30) Encoder_EncodeAllHTML-8 2.37B ±27% 2.25B ±33% ~ (p=0.398 n=30+20) Encoder_EncodeAllTwain-8 0.00B 0.00B ~ (all equal) Encoder_EncodeAllPi-8 12.4B ± 5% 12.2B ± 6% ~ (p=0.283 n=30+20) Random4KEncodeAllFastest-8 0.00B 0.00B ~ (all equal) Random10MBEncodeAllFastest-8 31.9kB ± 2% 30.5kB ± 9% -4.27% (p=0.002 n=28+20)
Benchmark results on amd64 below. These do not take into account klauspost#701. They're for Go 1.19; Go 1.20 produces slightly better asm for the old code, but still produces terrible asm on 32-bit platforms. See also golang/go#56954. name old speed new speed delta Encoder_EncodeAllXML-8 283MB/s ± 1% 284MB/s ± 0% ~ (p=0.026 n=30+20) Encoder_EncodeAllSimple/fastest-8 111MB/s ± 0% 111MB/s ± 1% ~ (p=0.011 n=28+20) Encoder_EncodeAllSimple/default-8 78.4MB/s ± 1% 78.3MB/s ± 1% ~ (p=0.572 n=30+19) Encoder_EncodeAllSimple/better-8 65.9MB/s ± 1% 66.2MB/s ± 1% +0.53% (p=0.009 n=30+20) Encoder_EncodeAllSimple/best-8 11.1MB/s ± 1% 11.6MB/s ± 3% +4.42% (p=0.000 n=27+28) Encoder_EncodeAllSimple4K/fastest-8 911MB/s ± 1% 914MB/s ± 1% +0.31% (p=0.004 n=29+20) Encoder_EncodeAllSimple4K/default-8 73.1MB/s ± 1% 73.6MB/s ± 1% +0.67% (p=0.000 n=29+20) Encoder_EncodeAllSimple4K/better-8 60.5MB/s ± 1% 62.7MB/s ± 1% +3.64% (p=0.000 n=29+17) Encoder_EncodeAllSimple4K/best-8 8.62MB/s ± 3% 10.11MB/s ± 1% +17.24% (p=0.000 n=30+27) Encoder_EncodeAllHTML-8 133MB/s ± 1% 133MB/s ± 1% ~ (p=0.101 n=30+19) Encoder_EncodeAllTwain-8 84.8MB/s ± 1% 86.2MB/s ± 3% +1.63% (p=0.000 n=24+20) Encoder_EncodeAllPi-8 62.6MB/s ± 1% 62.7MB/s ± 0% ~ (p=0.102 n=30+20) Random4KEncodeAllFastest-8 2.50GB/s ± 1% 2.50GB/s ± 1% ~ (p=0.449 n=29+20) Random10MBEncodeAllFastest-8 2.39GB/s ± 2% 2.52GB/s ± 6% +5.23% (p=0.000 n=27+20) name old alloc/op new alloc/op delta Encoder_EncodeAllXML-8 0.00B 0.00B ~ (all equal) Encoder_EncodeAllSimple/fastest-8 2.73B ±27% 3.00B ± 0% ~ (p=0.018 n=30+18) Encoder_EncodeAllSimple/default-8 4.00B ± 0% 4.00B ± 0% ~ (all equal) Encoder_EncodeAllSimple/better-8 5.00B ± 0% 5.00B ± 0% ~ (all equal) Encoder_EncodeAllSimple/best-8 19.5B ± 3% 19.0B ± 0% -2.40% (p=0.000 n=30+24) Encoder_EncodeAllSimple4K/fastest-8 0.00B 0.00B ~ (all equal) Encoder_EncodeAllSimple4K/default-8 0.00B 0.00B ~ (all equal) Encoder_EncodeAllSimple4K/better-8 0.00B 0.00B ~ (all equal) Encoder_EncodeAllSimple4K/best-8 2.00B ± 0% 1.43B ±40% -28.33% (p=0.000 n=30+30) Encoder_EncodeAllHTML-8 2.37B ±27% 2.25B ±33% ~ (p=0.398 n=30+20) Encoder_EncodeAllTwain-8 0.00B 0.00B ~ (all equal) Encoder_EncodeAllPi-8 12.4B ± 5% 12.2B ± 6% ~ (p=0.283 n=30+20) Random4KEncodeAllFastest-8 0.00B 0.00B ~ (all equal) Random10MBEncodeAllFastest-8 31.9kB ± 2% 30.5kB ± 9% -4.27% (p=0.002 n=28+20)
Benchmark results on amd64 below. These do not take into account klauspost#701. They're for Go 1.19; Go 1.20 produces slightly better asm for the old code, but still produces pretty bad asm on 32-bit platforms. See also golang/go#56954. name old speed new speed delta Encoder_EncodeAllXML-8 283MB/s ± 1% 284MB/s ± 0% ~ (p=0.026 n=30+20) Encoder_EncodeAllSimple/fastest-8 111MB/s ± 0% 111MB/s ± 1% ~ (p=0.011 n=28+20) Encoder_EncodeAllSimple/default-8 78.4MB/s ± 1% 78.3MB/s ± 1% ~ (p=0.572 n=30+19) Encoder_EncodeAllSimple/better-8 65.9MB/s ± 1% 66.2MB/s ± 1% +0.53% (p=0.009 n=30+20) Encoder_EncodeAllSimple/best-8 11.1MB/s ± 1% 11.6MB/s ± 3% +4.42% (p=0.000 n=27+28) Encoder_EncodeAllSimple4K/fastest-8 911MB/s ± 1% 914MB/s ± 1% +0.31% (p=0.004 n=29+20) Encoder_EncodeAllSimple4K/default-8 73.1MB/s ± 1% 73.6MB/s ± 1% +0.67% (p=0.000 n=29+20) Encoder_EncodeAllSimple4K/better-8 60.5MB/s ± 1% 62.7MB/s ± 1% +3.64% (p=0.000 n=29+17) Encoder_EncodeAllSimple4K/best-8 8.62MB/s ± 3% 10.11MB/s ± 1% +17.24% (p=0.000 n=30+27) Encoder_EncodeAllHTML-8 133MB/s ± 1% 133MB/s ± 1% ~ (p=0.101 n=30+19) Encoder_EncodeAllTwain-8 84.8MB/s ± 1% 86.2MB/s ± 3% +1.63% (p=0.000 n=24+20) Encoder_EncodeAllPi-8 62.6MB/s ± 1% 62.7MB/s ± 0% ~ (p=0.102 n=30+20) Random4KEncodeAllFastest-8 2.50GB/s ± 1% 2.50GB/s ± 1% ~ (p=0.449 n=29+20) Random10MBEncodeAllFastest-8 2.39GB/s ± 2% 2.52GB/s ± 6% +5.23% (p=0.000 n=27+20) name old alloc/op new alloc/op delta Encoder_EncodeAllXML-8 0.00B 0.00B ~ (all equal) Encoder_EncodeAllSimple/fastest-8 2.73B ±27% 3.00B ± 0% ~ (p=0.018 n=30+18) Encoder_EncodeAllSimple/default-8 4.00B ± 0% 4.00B ± 0% ~ (all equal) Encoder_EncodeAllSimple/better-8 5.00B ± 0% 5.00B ± 0% ~ (all equal) Encoder_EncodeAllSimple/best-8 19.5B ± 3% 19.0B ± 0% -2.40% (p=0.000 n=30+24) Encoder_EncodeAllSimple4K/fastest-8 0.00B 0.00B ~ (all equal) Encoder_EncodeAllSimple4K/default-8 0.00B 0.00B ~ (all equal) Encoder_EncodeAllSimple4K/better-8 0.00B 0.00B ~ (all equal) Encoder_EncodeAllSimple4K/best-8 2.00B ± 0% 1.43B ±40% -28.33% (p=0.000 n=30+30) Encoder_EncodeAllHTML-8 2.37B ±27% 2.25B ±33% ~ (p=0.398 n=30+20) Encoder_EncodeAllTwain-8 0.00B 0.00B ~ (all equal) Encoder_EncodeAllPi-8 12.4B ± 5% 12.2B ± 6% ~ (p=0.283 n=30+20) Random4KEncodeAllFastest-8 0.00B 0.00B ~ (all equal) Random10MBEncodeAllFastest-8 31.9kB ± 2% 30.5kB ± 9% -4.27% (p=0.002 n=28+20)
Benchmark results on amd64 below. These do not take into account #701. They're for Go 1.19; Go 1.20 produces slightly better asm for the old code, but still produces pretty bad asm on 32-bit platforms. See also golang/go#56954. name old speed new speed delta Encoder_EncodeAllXML-8 283MB/s ± 1% 284MB/s ± 0% ~ (p=0.026 n=30+20) Encoder_EncodeAllSimple/fastest-8 111MB/s ± 0% 111MB/s ± 1% ~ (p=0.011 n=28+20) Encoder_EncodeAllSimple/default-8 78.4MB/s ± 1% 78.3MB/s ± 1% ~ (p=0.572 n=30+19) Encoder_EncodeAllSimple/better-8 65.9MB/s ± 1% 66.2MB/s ± 1% +0.53% (p=0.009 n=30+20) Encoder_EncodeAllSimple/best-8 11.1MB/s ± 1% 11.6MB/s ± 3% +4.42% (p=0.000 n=27+28) Encoder_EncodeAllSimple4K/fastest-8 911MB/s ± 1% 914MB/s ± 1% +0.31% (p=0.004 n=29+20) Encoder_EncodeAllSimple4K/default-8 73.1MB/s ± 1% 73.6MB/s ± 1% +0.67% (p=0.000 n=29+20) Encoder_EncodeAllSimple4K/better-8 60.5MB/s ± 1% 62.7MB/s ± 1% +3.64% (p=0.000 n=29+17) Encoder_EncodeAllSimple4K/best-8 8.62MB/s ± 3% 10.11MB/s ± 1% +17.24% (p=0.000 n=30+27) Encoder_EncodeAllHTML-8 133MB/s ± 1% 133MB/s ± 1% ~ (p=0.101 n=30+19) Encoder_EncodeAllTwain-8 84.8MB/s ± 1% 86.2MB/s ± 3% +1.63% (p=0.000 n=24+20) Encoder_EncodeAllPi-8 62.6MB/s ± 1% 62.7MB/s ± 0% ~ (p=0.102 n=30+20) Random4KEncodeAllFastest-8 2.50GB/s ± 1% 2.50GB/s ± 1% ~ (p=0.449 n=29+20) Random10MBEncodeAllFastest-8 2.39GB/s ± 2% 2.52GB/s ± 6% +5.23% (p=0.000 n=27+20) name old alloc/op new alloc/op delta Encoder_EncodeAllXML-8 0.00B 0.00B ~ (all equal) Encoder_EncodeAllSimple/fastest-8 2.73B ±27% 3.00B ± 0% ~ (p=0.018 n=30+18) Encoder_EncodeAllSimple/default-8 4.00B ± 0% 4.00B ± 0% ~ (all equal) Encoder_EncodeAllSimple/better-8 5.00B ± 0% 5.00B ± 0% ~ (all equal) Encoder_EncodeAllSimple/best-8 19.5B ± 3% 19.0B ± 0% -2.40% (p=0.000 n=30+24) Encoder_EncodeAllSimple4K/fastest-8 0.00B 0.00B ~ (all equal) Encoder_EncodeAllSimple4K/default-8 0.00B 0.00B ~ (all equal) Encoder_EncodeAllSimple4K/better-8 0.00B 0.00B ~ (all equal) Encoder_EncodeAllSimple4K/best-8 2.00B ± 0% 1.43B ±40% -28.33% (p=0.000 n=30+30) Encoder_EncodeAllHTML-8 2.37B ±27% 2.25B ±33% ~ (p=0.398 n=30+20) Encoder_EncodeAllTwain-8 0.00B 0.00B ~ (all equal) Encoder_EncodeAllPi-8 12.4B ± 5% 12.2B ± 6% ~ (p=0.283 n=30+20) Random4KEncodeAllFastest-8 0.00B 0.00B ~ (all equal) Random10MBEncodeAllFastest-8 31.9kB ± 2% 30.5kB ± 9% -4.27% (p=0.002 n=28+20) Co-authored-by: Klaus Post <klauspost@gmail.com>
What version of Go are you using (
go version
)?Does this issue reproduce with the latest release?
Yes.
What did you do?
What did you expect to see?
Same code generated for all three.
What did you see instead?
arrayclear1 becomes REP STOSQ on amd64, REP STOSL on 386, a loop doing STP from ZR on arm64 and a simple loop on arm.
arrayclear2 becomes a call to runtime.memclrNoHeapPointers.
arrayclear3, which represents a library function that I found to be a hot spot, becomes a loop that sets one element at a time. On 386 and arm, that loop even includes a bounds check.
I suspected this might have to do with conversion to slice, but slice clearing is similarly inconsistent, producing different code depending on whether [:] (or [:n]) is present or not.
The text was updated successfully, but these errors were encountered: