show(io::IO, int) optimization #41415

Open · wants to merge 18 commits into master

Conversation

green-nsk

This is for #41396

As you'll notice, I couldn't quite figure out the threading/initialization part. Two main problems there:

  • The Threads module isn't loaded yet at the point intfuncs.jl is being loaded
  • If you make code work before Threads is loaded, it's not clear how to switch it to thread-compatible code after Threads is loaded

I could use some help there.

Otherwise, I think the change is quite straightforward. Some benchmarks:

# master:  61.000 ns (2 allocations: 88 bytes)
# after:   30.409 ns (0 allocations: 0 bytes)
@btime show($iobuf, 123456)
# master:  94.616 ns (2 allocations: 104 bytes)
# after:   73.560 ns (0 allocations: 0 bytes)
@btime show($iobuf, UInt(123456))
# master:  502.962 μs (32 allocations: 29.31 KiB)
# after:   505.108 μs (29 allocations: 29.17 KiB)
@btime show($iobuf, Ptr{Nothing}())

# master:  81.733 ns (2 allocations: 88 bytes)
# master, print(io, signed) = show(io, signed): 61.129 ns (2 allocations: 88 bytes)
# after:   30.411 ns (0 allocations: 0 bytes)
@btime print($iobuf, 123456)
# master:  61.317 ns (2 allocations: 88 bytes)
# after:   31.791 ns (0 allocations: 0 bytes)
@btime print($iobuf, UInt(123456))
# master:  501.052 μs (32 allocations: 29.31 KiB)
# after:   505.025 μs (29 allocations: 29.17 KiB)
@btime print($iobuf, Ptr{Nothing}())

The string(::Integer) call benchmarks about the same as master, as expected.

base/intfuncs.jl (outdated review thread, resolved)
@JeffBezanson
Member

If you look at the top of pcre.jl, it had a similar problem of needing the thread id before Threads is loaded. It has its own implementation of threadid for bootstrapping purposes that you could copy to Base.jl to use more widely.
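
From memory, the pcre.jl bootstrap helpers look roughly like this (exact names and runtime globals may differ across Julia versions):

# Bootstrap-safe thread id: calls the runtime directly instead of
# Threads.threadid(), so it works before the Threads module is loaded.
_tid() = Int(ccall(:jl_threadid, Int16, ())) + 1
# Thread count, read from the runtime's C global for the same reason.
_nth() = Int(unsafe_load(cglobal(:jl_n_threads, Cint)))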

JeffBezanson added the io (Involving the I/O subsystem: libuv, read, write, etc.) and performance (Must go faster) labels on Jun 30, 2021
@JeffBezanson
Member

Ah right, this has to be task-local, not thread-local, since a task can block when trying to print something. Here's one way we used to do it:

function getbuf()
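
(The snippet is cut off in this capture. For context, Base's old task-local buffer helper went along these lines; the TLS key name and buffer size here are my assumptions, not the PR's code.)

function getbuf()
    tls = task_local_storage()
    buf = get(tls, :print_buf, nothing)  # hypothetical TLS key
    if buf === nothing
        buf = Vector{UInt8}(undef, 100)
        tls[:print_buf] = buf
    end
    return buf::Vector{UInt8}
end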

@green-nsk
Author

green-nsk commented Jul 1, 2021

Hooking it up to TLS really slows it down. How about using alloca() instead?

UPD: alloca isn't really compatible with how llvmcall() works, as llvmcall() creates a new stack frame and the allocated storage is released right away. Never mind reviewing that.

@green-nsk
Author

So this time it works as expected. I also moved string(int) to use stack-allocated buffers, which helps with benchmarks too:

# master:  36.241 ns (2 allocations: 104 bytes)
# After:   24.208 ns (1 allocation: 40 bytes)
@btime string(123456; base = 2)
# master:  36.595 ns (2 allocations: 88 bytes)
# After:   25.597 ns (1 allocation: 24 bytes)
@btime string(123456; base = 8)
# master:  45.193 ns (2 allocations: 88 bytes)
# After:   27.618 ns (1 allocation: 24 bytes)
@btime string(123456)
# master:  35.552 ns (2 allocations: 88 bytes)
# After:   25.723 ns (1 allocation: 24 bytes)
@btime string(123456; base = 16)
# master:  36.212 ns (2 allocations: 104 bytes)
# After:   26.046 ns (1 allocation: 40 bytes)
@btime string(Unsigned(123456); base = 2)
# master:  36.160 ns (2 allocations: 88 bytes)
# After:   25.882 ns (1 allocation: 24 bytes)
@btime string(Unsigned(123456); base = 8)
# master:  45.268 ns (2 allocations: 88 bytes)
# After:   27.702 ns (1 allocation: 24 bytes)
@btime string(Unsigned(123456))
# master:  35.605 ns (2 allocations: 88 bytes)
# After:   24.031 ns (1 allocation: 24 bytes)
@btime string(Unsigned(123456); base = 16)

@vtjnash
Member

vtjnash commented Jul 2, 2021

alloca is also thread-local, and is therefore prohibited for this use case for the same reason

@green-nsk
Author

green-nsk commented Jul 2, 2021

Sorry, it's been a moving target, but I think that's all I wanted to push for this PR. Some benchmarks for floats:

# before:  147.485 ns (2 allocations: 432 bytes)
# after:   120.259 ns (1 allocation: 24 bytes)
# inbounds:  113.345 ns (1 allocation: 24 bytes)
@btime string(123.456)
# before:  184.321 ns (2 allocations: 432 bytes)
# after:   142.257 ns (0 allocations: 0 bytes)
# inbounds:  140.736 ns (0 allocations: 0 bytes)
@btime print($iobuf, 123.456)
# before:  154.228 ns (2 allocations: 432 bytes)
# after:   119.835 ns (0 allocations: 0 bytes)
# inbounds:  112.298 ns (0 allocations: 0 bytes)
@btime show($iobuf, 123.456)

base/ryu/exp.jl (outdated review thread, resolved)
@StefanKarpinski
Member

I worry about the Scratch type being useful and therefore leaking out of Base, so it seems like it would be good to pick a clearer name and maybe even consider exporting it. We have a Scratch package that is part of the stdlib, which obviously implements something quite different. Maybe call this type ScratchBuffer?

@green-nsk
Author

I've renamed Scratch -> ScratchBuffer as well as related names.

If exporting anything, I'd say export with_scratch_buffer(n). But we can always make that call later; it seems orthogonal to the purpose of this PR.

base/scratch_buffer.jl (outdated review thread, resolved)
@green-nsk
Author

Is there anything else required?

@StefanKarpinski
Member

Bump.

@vtjnash
Member

vtjnash commented Aug 16, 2021

As I mentioned above in #41415 (comment), this is still attempting to use stack-allocated buffers in most cases, which is unsafe here and therefore prohibited.

@green-nsk
Author

green-nsk commented Aug 18, 2021

This is how the buffer was allocated before the change: a = StringVector(n). Here's how it's allocated after the change: buf = Ref(ScratchBufferInline()). For all I know, neither form prescribes a particular way of allocating the storage.

A naive compiler would allocate both on the heap. On the contrary, Julia 2.0 could optimize out heap operations for small, scope-contained vectors of fixed size, or for Strings. The compiler we're currently using elides a Ref when it can prove there are no references to it outside the scope, but doesn't do the same for vectors or strings. So it makes sense to prefer the Ref form.
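
To make the point concrete (my own minimal example, not code from this PR): a Ref that provably never escapes can be elided entirely, while the equivalent Vector still allocates today. Whether the elision actually fires is version-dependent; verify with @allocated or @code_llvm.

function sum_via_ref(n)
    r = Ref(0)          # mutable box, but it never escapes this function
    for i in 1:n
        r[] += i
    end
    return r[]          # typically 0 bytes allocated: the Ref is elided
end

function sum_via_vec(n)
    v = [0]             # a Vector is not elided the same way today
    for i in 1:n
        v[1] += i
    end
    return v[1]         # typically one array allocation remains
end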

This optimization is exploited widely in the codebase I'm working on, and I'm sure I could find examples where it's already used in Julia Base if I tried. There's no reason the Base library shouldn't benefit from an optimization the Julia compiler team has delivered.

@vtjnash
Member

vtjnash commented Aug 18, 2021

Hopefully you don't find any places where we use stack memory for IO, as I have taken great pains to ensure that does not happen.

@green-nsk
Author

Can you elaborate? What's inherently unsafe about using stack-allocated buffers as opposed to heap-allocated ones?

@vtjnash
Member

vtjnash commented Aug 18, 2021

stack memory can become inaccessible while the Task is not running and is therefore forbidden from being passed across Tasks, while heap memory is stable and can thus be freely moved
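
A sketch of the failure mode being described (an assumed scenario for illustration, not code from this PR): an IO backend that queues a write and lets a worker task perform it later. If the pointer referred to the calling task's stack, that stack may have been torn down or moved by the time the worker runs; heap memory stays put.

const WRITE_QUEUE = Channel{Tuple{IO,Ptr{UInt8},Int}}(Inf)

# Returns to the caller before the bytes are actually consumed.
queued_write(io::IO, p::Ptr{UInt8}, n::Int) = put!(WRITE_QUEUE, (io, p, n))

# Runs on a different task, possibly much later:
function write_worker()
    for (io, p, n) in WRITE_QUEUE
        unsafe_write(io, p, n)   # only safe if p points at stable (heap) memory
    end
end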

@green-nsk
Author

Thank you, I think that makes sense. So we could actually use this trick for converting to string, but not for the IO part? And all because some hypothetical IO could decide to do the actual writing from another task?

> stack memory can become inaccessible while the Task is not running

I struggle to think of a real-world situation where this is the case. Is this non-x86?

@green-nsk
Author

More interestingly, if tomorrow the Julia compiler added an optimization that stack-allocates short-lived, scope-local strings, we wouldn't be able to print() them?

@vtjnash
Member

vtjnash commented Aug 18, 2021

Most (system) IO is done from another Task. And yes, if we do that optimization, it will cause that problem, so we will need to address that first.

@green-nsk
Author

I don't really know what to do about this. Performance gains are substantial when using stack buffers, at least in the applications I tried. It may not be the case for everyone, but for people writing a lot of JSON or logs it can add up quickly. Not sure it all has to be in Base though; not my call to make.

I see the following options:

  1. We find a way to identify "unsafe" IO, whether by platform or by some other attribute/environment. Then we keep using stack buffers when safe, and the rest use heap buffers (or convert to a string prior to printing). String conversion can keep using stack buffers everywhere.
  2. Use heap buffers via task-local storage for everything. Not ideal for my use cases, but safe.
  3. Abandon this effort altogether and keep the changes private in the codebase they came from.

I'd prefer option 1, but again it's not my call to make, as I'm still not sure whether it's even possible to identify platforms/situations where stack memory is truly inaccessible from other tasks.

@green-nsk
Author

With these changes, ScratchBuffer is always dynamically allocated, but we avoid (relatively) expensive TLS lookups and key on threadid() instead.
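
A hedged reconstruction of what that looks like (SCRATCH_BUFFERS, _tid, and with_scratch_buffer are names that appear in this PR's diff; the bodies here are my assumption of how they fit together, and a real version would also need to handle the thread count growing, as a review comment below points out):

const SCRATCH_BUFFERS = [(Vector{UInt8}(undef, 256), Ref(false))
                         for _ in 1:Threads.nthreads()]
_tid() = Threads.threadid()

function with_scratch_buffer(f, n::Int)
    buf, inuse = @inbounds SCRATCH_BUFFERS[_tid()]
    # Nested use on the same thread (e.g. a task switch mid-print):
    # fall back to a fresh heap buffer instead of clobbering the one
    # already handed out.
    inuse[] && return f(Vector{UInt8}(undef, n))
    inuse[] = true
    try
        length(buf) < n && resize!(buf, n)
        return f(buf)
    finally
        inuse[] = false
    end
end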

@green-nsk
Author

I appreciate it's been a long while since this PR started, but I think at this point, all comments have been addressed. Also, it's compatible with migrating threads.

@green-nsk
Author

@vtjnash @StefanKarpinski can you have another look at this? Based on previous comments I'm reasonably sure there's nothing else required to merge it.

@KristofferC
Member

KristofferC commented Jun 15, 2022

Sorry for the slow replies here. I think the reason is mostly that this type of code is kind of "scary": it deals with buffers shared between threads/tasks, which is quite hard to get right, so it might take a while before someone with enough confidence takes a look at it.

Perhaps one useful thing to add here is a more adversarial test that does its best to break this: start many threads and use the scratch buffers while writing integers, to see whether any data races can be detected. Something along the lines of the sketch below.
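
(My sketch of such a test, not code from the PR: every chunk printed concurrently must round-trip exactly; corruption from a shared buffer would show up as wrong or unparseable numbers.)

using Test

@testset "concurrent integer printing" begin
    ok = Vector{Bool}(undef, 1000)
    Threads.@threads for i in 1:1000
        io = IOBuffer()
        ns = [rand(Int) for _ in 1:100]
        for n in ns
            print(io, n, ',')
        end
        parts = split(String(take!(io)), ','; keepempty=false)
        ok[i] = parse.(Int, parts) == ns
    end
    @test all(ok)
end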

@green-nsk
Author

Thank you, Kristoffer! I appreciate that this is a sensitive part of the codebase, and reviewing can take time. I'm happy as long as it's on someone's radar.

@mentics

mentics commented Feb 17, 2023

I'm not sure if this is the right place to bring this up, but it seemed pertinent to this issue.

I ran into this allocation issue today and found this PR.

If covering only the built-in primitive types (and not arbitrary-precision numbers), then it could use a fixed-size buffer, because there is an absolute maximum string length for an integer of a particular size, and a fixed-size buffer doesn't have to allocate. I understand this code couldn't depend on StaticArrays, but I would expect that however StaticArrays accomplishes its magic would be reproducible here. Here's some code that demonstrates it (not optimized):

import Base: _dec_d100
using StaticArrays
const SomeInts = Union{UInt128, UInt16, UInt32, UInt64, UInt8}
const MAX_DIGITS = ndigits(typemax(Int128))
function dec_io(io::IO, x::SomeInts, pad::Int, neg::Bool)
    if neg; write(io, 0x2d); end
    n = ndigits(x)
    npads = max(0, pad - n)
    for _ in 1:npads
        write(io, '0') # TODO: use correct way to get the pad char
    end
    a = MVector{MAX_DIGITS,UInt8}(undef)
    i = n
    @inbounds while i >= 2
        d, r = divrem(x, 0x64)
        d100 = _dec_d100[(r % Int)::Int + 1]
        a[i-1] = d100 % UInt8
        a[i] = (d100 >> 0x8) % UInt8
        x = oftype(x, d)
        i -= 2
    end
    if i > 0
        @inbounds a[1] = 0x30 + (rem(x, 0xa) % UInt8)::UInt8
    end
    GC.@preserve a unsafe_write(io, pointer(a), n) # copied from Base.write(IO, String)
    return
end
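
For what it's worth, a quick way to exercise it (my example, assuming the definitions above):

io = IOBuffer()
dec_io(io, UInt64(123456), 8, false)
@assert String(take!(io)) == "00123456"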

Another option is to change the algorithm. A quick and dirty recursive example has no allocations:

function dec_io2(io::IO, x::SomeInts, pad::Int, neg::Bool)
    if neg; write(io, 0x2d); end
    padchar = '0' # TODO: use correct way to get the pad char
    ndigs = ndigits(x)
    npads = max(0, pad - ndigs)
    for _ in 1:npads
        write(io, padchar)
    end
    help1(io, x)
    return
end
@inline function help1(io::IO, x::SomeInts)
    if x < 10
        write(io, '0' + x)
    else
        help1(io, div(x, 10))
        write(io, '0' + (x % 10))
    end
end

Another option would be to reverse the algorithm. Currently it only needs a buffer because it's going from least significant digit to most. With an algorithm that goes most to least, it could write to the stream without any buffer.
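
That variant might look something like this (my sketch, not code from the PR; it trades the buffer for many small write calls, which ties into the IOBuffer point in the next paragraph):

# Emit decimal digits most-significant-first; no buffer needed.
function dec_io3(io::IO, x::Unsigned)
    p = oftype(x, 1)
    while x ÷ p >= 10        # find the largest power of 10 not exceeding x
        p *= oftype(x, 10)
    end
    while p > 0
        d, x = divrem(x, p)
        write(io, '0' + (d % UInt8))
        p = p ÷ oftype(x, 10)
    end
    return
end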

Actual benchmarks are a little tricky: if you use Base.devnull for the io argument, the above examples are very fast, even the recursive one. However, writing out one character at a time to an IOBuffer appears to be a performance issue. There are potential solutions to that on the IOBuffer side, or something in between.

@green-nsk
Author

@mentics, what you're suggesting is more or less the first iteration of this PR. It wasn't accepted because it's not cool to pass stack data to Julia IO operations. The current version uses heap memory (but avoids allocations in most cases) and practically works, but it:

  1. needs more stringent testing, as it's a core piece of functionality;
  2. needs to take care of the jl_adopt_thread feature added in 1.9.0;
  3. has some (presumably minor) conflicts with the current master.

@mentics

mentics commented Feb 18, 2023

I take it that stack data is a problem because the IO contract requires stable data? Is that because of the current implementation of some IO code, or is it a deliberate contract? This seems a bit surprising: I would expect an optimization like this PR to live in an implementation of IO, not in its callers. All callers would benefit from the optimization, so why not just put it in IO? And not all IO implementations would have this problem, would they?

So, could a simple IO implementation that never yields work without this extra optimization (neither in itself nor in its callers)? And could one that does yield internally copy the passed-in data to a buffer (ring buffer, non-blocking queue, thread-local, etc., depending on the published thread-safety guarantees for that implementation) without yielding?

end

function with_scratch_buffer(f, n::Int)
buf, buf_inuse = @inbounds SCRATCH_BUFFERS[_tid()]
A member commented on this hunk:
The number of threads can now change, so we should check here instead of using __init__.
