Improve performance of ncodeunits(::Char)
#54001
Conversation
This improves performance of `ncodeunits(::Char)` by simply counting the number of non-zero bytes (except for `\0`, which is encoded as all zero bytes). For a performance comparison, see https://gist.github.com/Seelengrab/ebb02d4b8d754700c2869de8daf88cad; there's an up to 10x improvement for collections of `Char`, with a minor improvement (and much smaller spread) for single `Char`. The version in this PR is called `nbytesencoded` in the benchmarks.

Correctness has been verified with Supposition.jl, using the existing implementation as an oracle:

```julia
julia> using Supposition

julia> const chars = Data.Characters()

julia> @check max_examples=1_000_000 function bytesenc(c=chars)
           ncodeunits(c) == nbytesencoded(c)
       end;
Test Summary: | Pass  Total  Time
bytesenc      |    1      1  1.2s

julia> ncodeunits('\0') == nbytesencoded('\0')
true
```

Notably, neither the existing nor the new implementation checks whether the given `Char` is valid or not, since the only thing that matters is how many bytes are written out.

Let's see if CI agrees!
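For concreteness, here is a minimal sketch of the byte-counting idea described above (the name `nbytesencoded` comes from the linked gist; the body is an illustration, not the PR's actual diff):

```julia
# Illustrative sketch of "count the non-zero bytes"; not the PR's actual diff.
# A Char stores its UTF-8 bytes left-justified in a UInt32,
# e.g. 'a' == reinterpret(Char, 0x61000000).
function nbytesencoded(c::Char)
    u = reinterpret(UInt32, c)
    # test each of the four bytes for any set bit
    n = (u & 0xff000000 != 0) + (u & 0x00ff0000 != 0) +
        (u & 0x0000ff00 != 0) + (u & 0x000000ff != 0)
    max(n, 1)  # '\0' encodes as all zero bytes but is still one codeunit
end
```

Note that for a malformed value such as `reinterpret(Char, 0x61006100)` this counts 2 non-zero bytes even though 3 bytes would be written out; that flaw is exactly what the review below turns up.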
Are you sure it's faster? On my x86 computer, your code is not faster than the old implementation, even when looping over a text with a varying number of codeunits per character (to defeat the branch predictor). It's possible that it only appears faster in your benchmark because the benchmark has an unrealistically unpredictable distribution of codeunits per char.
Also, some typos
Also, this implementation is incorrect - it fails for some malformed chars. Perhaps this implementation is correct?

```julia
function ncodeunits(c::Char)
    u = reinterpret(UInt32, c)
    4 - div(trailing_zeros(u), 8) + iszero(u)
end
```
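To see why the floor division by 8 works (a worked illustration, not from the thread): a `Char` stores its UTF-8 bytes left-justified in the `UInt32`, and the last used byte is always non-zero (continuation bytes have their high bit set), so `trailing_zeros` counts at most 7 bits into the last used byte and the division rounds down to the number of wholly-zero trailing bytes.

```julia
julia> u = reinterpret(UInt32, 'a')    # 1-byte char
0x61000000

julia> 4 - div(trailing_zeros(u), 8) + iszero(u)   # tz = 24: 4 - 3 + 0
1

julia> u = reinterpret(UInt32, '∀')    # 3-byte char
0xe2888000

julia> 4 - div(trailing_zeros(u), 8) + iszero(u)   # tz = 15: 4 - 1 + 0
3

julia> u = reinterpret(UInt32, '\0')   # all-zero bit pattern
0x00000000

julia> 4 - div(trailing_zeros(u), 8) + iszero(u)   # tz = 32: 4 - 4 + 1
1
```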
Could you share that benchmark? As far as I know,
Good catch! I'll look into that.
I took some Greek text (which has a mix of 1- and 2-byte chars), looped over the chars and then summed the codeunits. The idea was that randomly generated chars have an unrealistic distribution of codepoint lengths, and therefore unfairly punishes branchful code since the branch predictor will be much better on real strings with a more predictable number of codeunits per char.
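A sketch of that kind of benchmark (the sample text, repetition count, and the `nbytesencoded` sketch from above are assumptions, not the commenter's actual script):

```julia
using BenchmarkTools

# Real text has a fairly predictable mix of 1- and 2-byte chars, which is
# kinder to the branch predictor than rand(Char); the sample text is assumed.
greek = repeat("Ἐν ἀρχῇ ἐποίησεν ὁ θεὸς τὸν οὐρανὸν καὶ τὴν γῆν. ", 1_000)
chars = collect(greek)                 # Vector{Char} with mixed widths

@btime sum(ncodeunits, $chars)         # existing implementation
@btime sum(nbytesencoded, $chars)      # candidate from the sketch above
```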
Indeed it is, and it's faster than all of the masking I've done because of the heavy lifting in `trailing_zeros`. If it's alright with you, I'll use your implementation in the PR? The fault in my original implementation was assuming that the leading bytes must all be nonzero, which is of course not true for malformed `Char`s.

Results on my machine:

```julia
julia> @code_native debuginfo=:none nbytesdiv('a')
.text
.file "nbytesdiv"
.globl julia_nbytesdiv_15365 # -- Begin function julia_nbytesdiv_15365
.p2align 4, 0x90
.type julia_nbytesdiv_15365,@function
julia_nbytesdiv_15365: # @julia_nbytesdiv_15365
; Function Signature: nbytesdiv(Char)
# %bb.0: # %top
#DEBUG_VALUE: nbytesdiv:c <- $edi
push rbp
tzcnt ecx, edi
xor eax, eax
mov rbp, rsp
shr ecx, 3
cmp edi, 1
adc rax, 4
sub rax, rcx
pop rbp
ret
.Lfunc_end0:
.size julia_nbytesdiv_15365, .Lfunc_end0-julia_nbytesdiv_15365
# -- End function
.section ".note.GNU-stack","",@progbits
julia> @benchmark nbytesdiv(c) setup=(c=rand(Char))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max): 1.689 ns … 12.760 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.720 ns ┊ GC (median): 0.00%
Time (mean ± σ): 1.729 ns ± 0.169 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█ ▁
▃▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▂▇▁▁▁▁▁▁▂█▁▁▁▁▁▁▂█▁▁▁▁▁▁▂▅▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▃ ▂
1.69 ns Histogram: frequency by time 1.76 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark sum(nbytesdiv, c) setup=(c=rand(Char, 10_000))
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.115 μs … 2.093 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.152 μs ┊ GC (median): 0.00%
Time (mean ± σ): 1.156 μs ± 38.131 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▃▄▅▅▆▇██▆▆▅▃▁▁ ▁ ▂▂▂ ▂
▄▅▇▇▆▅▄▁▃▁▄▅███████████████████████▇██▇▃▁▄▄▄▃▅▅▆▅▃▃▃▃▁▃▅▅▇ █
1.12 μs Histogram: log(frequency) by time 1.22 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @check max_examples=1_000_000 function bytesenc(i=Data.Integers{UInt32}())
c = reinterpret(Char, i)
ncodeunits(c) == nbytesdiv(c)
end;
Test Summary: | Pass  Total  Time
bytesenc      |    1      1  1.0s
```
This came up in the context of JuliaLang/julia#54001, where doing the correct thing for malformed `Char` slipped through. Thanks, @jakobnissen!
Yep, you can use that implementation 👍
The previous implementation assumed that all `Char` are well-formed, which is of course not guaranteed to be the case (and which is also correctly handled by the existing implementation). On top of that, this is even faster, since counting the number of trailing zeros has hardware support on a wide range of architectures. Implementation based on a suggestion by @jakobnissen. Thanks!
Style nit: I'm not sure the blank lines do anything for readability, how about just:

```julia
function ncodeunits(c::Char)
    u = reinterpret(UInt32, c)
    # We care about how many trailing bytes are all zero
    n_nonzero_bytes = sizeof(UInt32) - div(trailing_zeros(u), 0x8)
    # Take care of '\0', which has an all-zero bitpattern
    n_nonzero_bytes + iszero(u)
end
```
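For illustration (not from the thread), and assuming the suggested body above, the new method matches the existing behavior, including on malformed input:

```julia
julia> ncodeunits('a'), ncodeunits('∀'), ncodeunits('\0')
(1, 3, 1)

julia> ncodeunits(reinterpret(Char, 0x61006100))  # malformed: embedded zero byte
3
```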
Failures seem to be due to GitHub/CI hiccups, unrelated to this PR.
Sure, that's reasonable. I've also added a clarification that the line doesn't just compute the number of trailing nonzero bytes.
Failure seems to be the Profile thing again - see #53984. Otherwise IMO good to merge.