Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: non-locking FastRand #18514

Closed
funny-falcon opened this issue Jan 4, 2017 · 17 comments
Closed

proposal: non-locking FastRand #18514

funny-falcon opened this issue Jan 4, 2017 · 17 comments

Comments

@funny-falcon
Copy link
Contributor

Sometimes there is a need to generate short-living non-ovelapping ids from several goroutines ie. request_id for some kind of RPC protocol.

"Obvious" way is to use atomic increment. But it forces write contention on cache-line.

Better way could be to allow to use set of per-thread non-overlapping generators with relatively long period.

Example implementation is here https://go-review.googlesource.com/#/c/34781/ . It implements non-overlapping per-thread generators with period 1<<44 of non-repeating values. So one may safely rely on generation of 1<<44 non-repeating values.

Implementation puts method into runtime package, but it certainly should be in math/rand.

BenchmarkFastRand-4             500000000                3.49 ns/op
BenchmarkFastRand-4             500000000                3.49 ns/op
BenchmarkMathRand-4             10000000               162 ns/op
BenchmarkMathRand-4             10000000               159 ns/op
BenchmarkRandSource-4           300000000                5.26 ns/op
BenchmarkRandSource-4           300000000                5.27 ns/op
BenchmarkRandAtomic-4           50000000                32.0 ns/op
BenchmarkRandAtomic-4           50000000                31.3 ns/op
@bradfitz bradfitz added this to the Proposal milestone Jan 4, 2017
@bradfitz
Copy link
Contributor

bradfitz commented Jan 4, 2017

Is "sometimes" enough to warrant it being in the standard library?

See https://golang.org/doc/faq#x_in_std

Could this live outside the standard library?

@funny-falcon
Copy link
Contributor Author

If there is a way to obtain thread-local value in a fast way, then yes.

Thread-local value could be safely accessed without lock (given function doesn't call any other function).

Currently there is no way to write-access any shared value without lock outside runtime package (but this access could be transferred to other package in standard library, like math/rand).

@funny-falcon
Copy link
Contributor Author

funny-falcon commented Jan 4, 2017

Most of "code" could be implemented outside of standard library.

But the need of fast non-locking random could not be thrown away.

Probably, if runtime.fastrand() will be runtime.FastRand() it will be already enough, despite it doesn't provide "long non-overlapping period".

@randall77
Copy link
Contributor

Your RPC example isn't motivating me.
I find it hard to believe that doing a single atomic increment is going to be noticeable compared to the cost of an RPC.
I also don't think that random is really what you want here - RPC IDs typically need to be distinct, at least within a session or whatever. It sounds like you're relying on the non-repeating of this generator, but that's not "random". And if it isn't random, it's hard to fix a requirement for what the non-random requirements are if we are to bake it into the standard library.

@dsnet
Copy link
Member

dsnet commented Jan 4, 2017

I much rather see #8281 be solved and see something like this appear as a third-party package.

@davecheney
Copy link
Contributor

davecheney commented Jan 4, 2017 via email

@dr2chase
Copy link
Contributor

dr2chase commented Jan 4, 2017

Is a Spliterator anything like a Deterministic Parallel RNG? I think a DPRNG might be what we want, not sure it needs to be in the library.

@minux
Copy link
Member

minux commented Jan 4, 2017 via email

@funny-falcon
Copy link
Contributor Author

@minux In example implementation I used construction similar to SPLITMIX (I even add link to Java's SplittableRandom).

Problem with "splittable random" without per-state cpu is how to get that "split" without lock contention? How to get safely "per-goroutine" RNG?

Common techinuque to reduce lock contention is sharding. But there is no way to choose "random" shard in Golang without another lock or atomic operation.

Probably, this original proposal is overkill.

What about proposal this way:

  • expose runtime.fastrand() as runtime.FastRand() with comment about "low-quality" of this generator (short period, predictable dependency between values).
  • expose getg().m.id as runtime.MID() or runtime.ThreaID().

Given this methods are exposed, it will be quite easy to perform light-weight lock-sharding. It will be certainly enough for my usecase.

@dsnet perhaps, it could reduce need for real cpu-local storage of #8281 by light-weight lock-sharding.

PS. Why runtime.fastrand() is still in assembly? My version of FastRand in inlined, and I think, fastrand will be inlined too if implemented in Go. Assembly contains "conditional load" instruction, so perhaps main barrier is ability of Go optimiser to generate this instruction, yes?

@josharian
Copy link
Contributor

Why runtime.fastrand() is still in assembly?

Good question. Want to try rewriting it in Go and benchmark the results? Having it in pure Go would make this optimization less ugly as well.

@funny-falcon
Copy link
Contributor Author

If there will be a way to get cpu number, as suggested in #8281 , it will also help to reduce lock-contention.

@funny-falcon
Copy link
Contributor Author

I've implemented fastrand in go, and it looks to be faster on i386 and x86_64.
Changeset: https://go-review.googlesource.com/34782

@dr2chase
Copy link
Contributor

dr2chase commented Jan 5, 2017 via email

@RLH
Copy link
Contributor

RLH commented Jan 5, 2017 via email

@rsc
Copy link
Contributor

rsc commented Jan 9, 2017

@robpike has a pending CL for a PCG-based generator (https://go-review.googlesource.com/#/c/10161/), which would be very lightweight and maybe cheap enough to duplicate and create a new one per-goroutine. @dr2chase's suggestion is also, as I understand it, lightweight and cheap to pass around.

There is a separate question in this proposal about whether this is important enough to have a per-thread instance hiding in the runtime to avoid even needing to thread one through your code explicitly. That's more compelling with the giant rand state we have today, but if you're using PCG, it's much less necessary. In general it seems like we should avoid new magical per-thread state and prefer explicit notation in the code for what is going on.

Like @randall77 said, it's hard to believe that threading one of these through the code (or a single atomic increment) is more expensive than the other work required in an RPC.

It doesn't seem like we should do anything to support these cases in the runtime.

@funny-falcon
Copy link
Contributor Author

@rsc @robpike If choose lightweight general purpose prng than I'd prefer xoroshiro:
http://xoroshiro.di.unimi.it
http://xoroshiro.di.unimi.it/xoroshiro128plus.c
It is faster than PCG, has 128 bit state, and has jump function to effectively split state.

In this proposal I used SPLITMIX variant cause it could be easily tweaked to produce non-intersecting values from splitted states.

But either way, now I think it is better to close this proposal, and open new: for exporting getg().m.id as runtime.MID(), and probably exposing runtime.fastrand() as runtime.Fastrand() (mixed with secure key to hide its original value).

@bradfitz
Copy link
Contributor

Closing per OP's request.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests