show(io::IO, int) optimization #41415
base: master
Conversation
If you look at the top of pcre.jl, it had a similar problem of needing the thread id before Threads is loaded. It has its own implementation of threadid for bootstrapping purposes that you could copy to Base.jl to use more widely.
Ah right, this has to be task-local, not thread-local, since a task can block when trying to print something. Here's one way we used to do it: line 30 in e66a39b.
Hooking up with TLS really slows it down. How about using UPD:
So this time it works as expected. I also moved

```julia
# master: 36.241 ns (2 allocations: 104 bytes)
# After: 24.208 ns (1 allocation: 40 bytes)
@btime string(123456; base = 2)

# master: 36.595 ns (2 allocations: 88 bytes)
# After: 25.597 ns (1 allocation: 24 bytes)
@btime string(123456; base = 8)

# master: 45.193 ns (2 allocations: 88 bytes)
# After: 27.618 ns (1 allocation: 24 bytes)
@btime string(123456)

# master: 35.552 ns (2 allocations: 88 bytes)
# After: 25.723 ns (1 allocation: 24 bytes)
@btime string(123456; base = 16)

# master: 36.212 ns (2 allocations: 104 bytes)
# After: 26.046 ns (1 allocation: 40 bytes)
@btime string(Unsigned(123456); base = 2)

# master: 36.160 ns (2 allocations: 88 bytes)
# After: 25.882 ns (1 allocation: 24 bytes)
@btime string(Unsigned(123456); base = 8)

# master: 45.268 ns (2 allocations: 88 bytes)
# After: 27.702 ns (1 allocation: 24 bytes)
@btime string(Unsigned(123456))

# master: 35.605 ns (2 allocations: 88 bytes)
# After: 24.031 ns (1 allocation: 24 bytes)
@btime string(Unsigned(123456); base = 16)
```
Sorry, it's been a moving target, but I think that's all I wanted to push for this PR. Some benchmarks for float:

```julia
# before: 147.485 ns (2 allocations: 432 bytes)
# after: 120.259 ns (1 allocation: 24 bytes)
# inbounds: 113.345 ns (1 allocation: 24 bytes)
@btime string(123.456)

# before: 184.321 ns (2 allocations: 432 bytes)
# after: 142.257 ns (0 allocations: 0 bytes)
# inbounds: 140.736 ns (0 allocations: 0 bytes)
@btime print($iobuf, 123.456)

# before: 154.228 ns (2 allocations: 432 bytes)
# after: 119.835 ns (0 allocations: 0 bytes)
# inbounds: 112.298 ns (0 allocations: 0 bytes)
@btime show($iobuf, 123.456)
```
This reverts commit f864790.
I worry about the
I've renamed `Scratch` -> `ScratchBuffer`, as well as related names. If exporting anything, I'd say export
Is there anything else required?
Bump.
As I mentioned above in #41415 (comment), this is still attempting to use stack-allocated buffers in most cases, which is unsafe here and therefore prohibited.
This is how the buffer was allocated before the change:

A naive compiler would allocate both on the heap. On the contrary, Julia 2.0 could optimize out heap operations for "small scope-contained vectors of fixed size" or `String`s. The compiler we're currently using optimizes

This optimization is exploited widely in the codebase I'm working on, and I'm sure I could find examples where it's already being used in Julia Base if I tried. There's no reason the Julia Base library shouldn't benefit from the optimization the Julia compiler team has delivered.
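The snippet showing the old allocation didn't survive this page capture. As a hedged reconstruction of the general heap-buffer pattern Base uses for decimal printing (`dec_string` is a hypothetical name; `Base.StringVector` is Base's unexported byte-buffer helper):

```julia
# Hypothetical sketch: allocate a byte vector sized by ndigits, fill it
# least-significant-digit first, then let String take ownership of it.
function dec_string(x::Unsigned)
    n = ndigits(x)
    a = Base.StringVector(n)            # heap-allocated byte buffer
    i = n
    while i > 0
        a[i] = 0x30 + (x % 0xa) % UInt8 # '0' + current digit
        x = x ÷ 0xa
        i -= 1
    end
    return String(a)                    # takes ownership; no extra copy
end
```

The `String(a)` step is why the heap buffer costs only one allocation here: the vector's storage is reused as the string's storage rather than copied.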
Hopefully you don't find any places where we use stack memory for IO, as I have taken great pains to ensure that does not happen. |
Can you elaborate: what's inherently unsafe about using stack-allocated buffers, as opposed to heap-allocated ones?
Stack memory can become inaccessible while the Task is not running, and is therefore forbidden from being passed across Tasks, while heap memory is stable and can thus be freely moved.
Thank you, I think that makes sense. So we could actually use this trick for converting to string, but not for the IO part? And all because some hypothetical IO could decide to do the actual writing from another task?
I struggle to think of a real-world situation where this is the case. Is this non-x86? |
More interestingly, if tomorrow the Julia compiler added an optimization where short, scope-local `String`s were stack-allocated, we'd not be able to `print()` them?
Most (system) IO is done from another Task. And yes, if we do that optimization, it will cause that problem, so we will need to address that first.
I don't really know what to do about this. Performance gains are substantial when using stack buffers, at least in the applications I tried. It may not be the case for everyone, but for people writing a lot of JSON or logs it can add up quickly. Not sure it all has to be in Base though, not my call to make. I see the following options:
I'd prefer option 1, but again it's not my call to make, as I am still not sure whether it's even possible to identify platforms/situations where stack memory is truly inaccessible from other tasks.
With these changes, `ScratchBuffer` is always dynamically allocated, but we avoid (relatively) expensive TLS lookups and key on
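A hedged, much-simplified sketch of that scheme (hypothetical names; the real PR also has to cope with tasks migrating between threads, which this toy version deliberately ignores):

```julia
# One (buffer, in-use flag) pair per thread. The flag guards against
# re-entrant use on the same thread (e.g. printing while printing);
# if the slot is busy or too small, fall back to a fresh allocation.
const SCRATCH = [(Vector{UInt8}(undef, 256), Ref(false)) for _ in 1:Threads.nthreads()]

function with_scratch(f, n::Int)
    buf, inuse = SCRATCH[Threads.threadid()]
    if inuse[] || n > length(buf)
        return f(Vector{UInt8}(undef, n))   # slow path: allocate
    end
    inuse[] = true
    try
        return f(buf)
    finally
        inuse[] = false
    end
end
```

The fast path reuses the preallocated heap buffer, so the common case does zero allocations while still handing IO a heap pointer rather than stack memory.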
I appreciate it's been a long while since this PR started, but I think at this point all comments have been addressed. Also, it's compatible with migrating threads.
@vtjnash @StefanKarpinski can you have another look at this? Based on previous comments I'm reasonably sure there's nothing else required to merge it.
Sorry for the slow replies here. I think the reason is mostly that this type of code is kind of "scary": it deals with buffers shared between threads/tasks, which is quite hard to get right, so it might take a while before someone with enough confidence takes a look at it. Perhaps one useful thing to add here is a more adverse test that does its best to break this: start many threads and use scratch buffers while writing integers, to see if there are data races that can be detected.
Thank you Kristoffer! I appreciate this is a sensitive part of the codebase, and reviewing can take time. I'm happy as long as it's on someone's radar.
I'm not sure if this is the right place to bring this up, but it seemed pertinent to this issue. I ran into this allocation issue today and found this PR. If covering only the built-in primitive types (and not numbers of unlimited size), then it could use a fixed-size buffer, because there is an absolute maximum string length for an int of a particular size. A fixed-size buffer doesn't have to allocate. I understand this code couldn't depend on StaticArrays, but I would expect that however it accomplishes its magic would be reproducible here. Here's some code that demonstrates it (not optimized):
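That snippet didn't make it into this capture; a minimal sketch of the fixed-size-buffer idea, assuming `Int64` (at most 20 characters in base 10, sign included) and using hypothetical names:

```julia
# Build the digits into a fixed-size NTuple, so no heap buffer is needed;
# tuples are immutable, so Base.setindex returns an updated copy. Only the
# final String conversion allocates.
function int_to_string_fixed(x::Int64)
    x == 0 && return "0"
    neg = x < 0
    u = reinterpret(UInt64, neg ? -x : x)   # magnitude; wraps correctly for typemin
    buf = ntuple(_ -> 0x30, Val(20))        # 20 bytes covers "-9223372036854775808"
    i = 20
    while u > 0
        buf = Base.setindex(buf, 0x30 + (u % 0xa) % UInt8, i)
        u ÷= 0xa
        i -= 1
    end
    if neg
        buf = Base.setindex(buf, UInt8('-'), i)
        i -= 1
    end
    return String(collect(buf[i+1:20]))     # only allocation: the result
end
```

This is only a demonstration of the bound, not an optimized routine: a real implementation would avoid the `collect` copy at the end.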
Another option is to change the algorithm. A quick and dirty recursive example has no allocations:
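The example itself isn't shown in this capture; a hedged sketch of such a recursive, buffer-free approach (hypothetical names) could look like:

```julia
# Recursion emits the most significant digit first, so no buffer is needed:
# each frame peels off one digit and writes it after the higher digits.
function print_uint_rec(io::IO, u::Unsigned)
    u >= 0xa && print_uint_rec(io, u ÷ 0xa)
    write(io, 0x30 + (u % 0xa) % UInt8)     # '0' + digit
    return nothing
end

function print_int_rec(io::IO, x::Signed)
    if x < 0
        write(io, UInt8('-'))
        print_uint_rec(io, unsigned(-x))    # wraps to the right magnitude for typemin
    else
        print_uint_rec(io, unsigned(x))
    end
end
```

The recursion depth is bounded by the digit count (at most 19 frames for `Int64` in base 10), so there is no stack-growth concern.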
Another option would be to reverse the algorithm. Currently it only needs a buffer because it goes from the least significant digit to the most; with an algorithm that goes most to least, it could write to the stream without any buffer. Actual benchmarks are a little tricky, because if you use Base.devnull for the io argument the above examples are very fast, even the recursive one. However, writing out one character at a time to an IOBuffer appears to be a performance issue. There are potential solutions to that on the IOBuffer side, or something in between.
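Sketched iteratively (hedged, with hypothetical names), the most-to-least order only needs the largest power of ten, not a buffer:

```julia
# Find the largest power of ten not exceeding u, then emit digits
# most-significant-first, writing each byte straight to the stream.
function print_uint_fwd(io::IO, u::UInt64)
    d = one(UInt64)
    while u ÷ d >= 0xa
        d *= 0xa
    end
    while d > zero(d)
        write(io, 0x30 + ((u ÷ d) % 0xa) % UInt8)
        d ÷= 0xa
    end
    return nothing
end
```

As the comment above notes, the per-byte `write` calls are where this loses to a buffered approach on a real stream.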
@mentics, what you're suggesting is somewhere in the first iteration of this PR. It wasn't accepted because it's not cool to pass stack data to Julia IO operations. The current version uses heap memory (but avoids allocations in most cases) and practically works, but:
I take it that stack data is a problem because the IO contract requires stable data? Is that because of the current implementation of some IO code, or is it a deliberate contract? This seems a bit surprising. I would expect an optimization like this PR to be in an implementation of IO, not its callers. All callers would benefit from optimization, so why not just put it in IO? And not all IO implementations would have this problem, would they? So, could a simple IO implementation that never yields do so without this extra optimization (neither in itself nor in its callers)? And one that does yield internally, could it copy the passed in data to a buffer (ring buffer, non-blocking queue, thread local, etc., depending on published thread safety for that implementation) without yielding? |
```julia
end

function with_scratch_buffer(f, n::Int)
    buf, buf_inuse = @inbounds SCRATCH_BUFFERS[_tid()]
```
The number of threads can now change, so we should check here instead of using `__init__`.
This is for #41396
As you'll notice, I couldn't quite figure out the threading/initialization part. Two main problems there: this code is needed before `Threads` is loaded, and it's not clear how to switch it to thread-compatible code after `Threads` is loaded. I could use some help there.
Otherwise, I think the change is quite straightforward. Some benchmarks: the `string(::Integer)` call benchmarks about the same as master, as expected.