Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: runtime: expose current thread id or processor id #18590

Closed
funny-falcon opened this issue Jan 10, 2017 · 68 comments
Closed

proposal: runtime: expose current thread id or processor id #18590

funny-falcon opened this issue Jan 10, 2017 · 68 comments
Labels
Milestone

Comments

@funny-falcon
Copy link
Contributor

Often there is a need to reduce lock contention in algorithms.
Usual way to acomplish this is sharding, ie several separate locks protects non-intersecting parts of greater state.
But then there is a need for fast (and preferably non-locking) way to determine shard id.
And it will be great, if such shard id will further reduce lock contention.

  • Best way to do it is exposing current cpu number. On Intel cpu it could be done with CPUID instruction.
    CPUID it returns APIC-ID which are not consecutive, so probably runtime should analyze CPU topology to convert APIC-ID to consecutive cpu-number. On the other way, given some architectures allows hot cpu swapping, probably it is better to return APIC-ID as is.
  • Simple way is to expose thread id getg().m.id as runtime.Mid() . Cause program could contain a lot of threads (due to system calls, or cgo calls), there is a need to additional fast random value, so I propose to additionally expose runtime.fastrand() as runtime.Fastrand() (mixing it with secure random key to hide original fastrand value).

"Best way" could be implemented in some "golang/x/" package. "Simple way" needs to implement in "runtime" at least wrapper around "getg().m.id", to link with this wrapper from external library (if there is a way to call private "runtime" function from external package).

@ianlancetaylor ianlancetaylor changed the title Proposal: expose current thread id or even processor id to user-space proposal: expose current thread id or even processor id to user-space Jan 10, 2017
@ianlancetaylor
Copy link
Member

Historically we have always rejected goroutine-local information.

Your proposal is incomplete: you need to define exactly when the value is permitted to change, and you need to do so without overly constraining the runtime. Right now a g is permitted to move to a different m or p at any function call, which includes several implicit function calls inserted by the compiler for things like map opertions. In order to address #10958 it is likely that in future a releases a g will be permitted to move to a different m or p in any loop. It's not enough to define runtime.Mid without also providing a clear explanation for when it will return a different value.

The discussion in #17973 may be of interest.

@funny-falcon
Copy link
Contributor Author

Result of runtime.Mid() could be changed at any moment.
Value, returned from Mid() doesn't eliminate need of locking: since number of threads could be much larger than preallocated number of shards, locks are inevitable.
And since several threads could map into same shard number, there is a need for runtime.Fastrand(): shards are bucketized, bucket is choosen by Mid, shard in the bucket by Fastrand.
Mid() combined with Fastrand() are just a hint for reducing lock contention, but doesn't eliminates locks completely.

@mdempsky
Copy link
Contributor

mdempsky commented Jan 10, 2017

And since several threads could map into same shard number, there is a need for runtime.Fastrand(): shards are bucketized, bucket is choosen by Mid, shard in the bucket by Fastrand.

If runtime.Mid changes (e.g., because the goroutine is now being run on a new OS thread), how do you make sure a goroutine continues to use the correct bucket?

In case the answer is that each goroutine only calls runtime.Mid once, then why not just use math/rand.Int?

@funny-falcon
Copy link
Contributor Author

make sure a goroutine continues to use the correct bucket?

It is just hint to select a bucket. Per-shard mutex still will be acquired, so there is no "fail" if runtime.Mid will be changed after call. But most of time it will be not changed, and this will help to reduce mutex contention.

Problem with math/rand.Int is single global mutex around this call. See my previous closed proposal: #18514

BenchmarkFastRand-4             500000000                3.49 ns/op
BenchmarkFastRand-4             500000000                3.49 ns/op
BenchmarkMathRand-4             10000000               162 ns/op
BenchmarkMathRand-4             10000000               159 ns/op

@quentinmit
Copy link
Contributor

Does syscall.Gettid not give you a thread ID that roughly meets this criteria already? Why does the runtime need to provide it?

@funny-falcon
Copy link
Contributor Author

I bet syscall.Gettid is expensive. No?

@kostya-sh
Copy link
Contributor

kostya-sh commented Jan 10, 2017 via email

@funny-falcon
Copy link
Contributor Author

Probably, syscall.Gettid() is not so expensive on Linux, but it is much more expensive on other platforms (and probably doesn't exists on Windows).

@funny-falcon
Copy link
Contributor Author

@kostya-sh Here is example of usage of runtime.Mid() and runtime.Fastrand() based on your benchmark: https://gist.github.com/funny-falcon/d8adc39e777b79e56274a65abca29e78

results are:

BenchmarkRandInt63_Global-4                     10000000               154 ns/op
BenchmarkRandInt63_Channel-4                    10000000               220 ns/op
BenchmarkRandInt63_Pool-4                       50000000                28.8 ns/op
BenchmarkSourceInt63_Pool-4                     50000000                34.9 ns/op
BenchmarkRandInt63_Source-4                     300000000                5.71 ns/op
BenchmarkRandInt63_ShardsGettid-4               50000000                29.2 ns/op
BenchmarkRandInt63_ShardsMid-4                  100000000               14.2 ns/op
BenchmarkRandInt63_ShardsFastrand-4             30000000                49.0 ns/op
BenchmarkRandInt63_ShardsMidFastrand-4          100000000               17.1 ns/op

It shows that using Mid() (and combination with Fastrand()) could be almost twice faster than single Pool.

@funny-falcon
Copy link
Contributor Author

Looks like exposing getg().m.p.ptr().id gives even better results.

@funny-falcon
Copy link
Contributor Author

Looks like sync.Pool already uses hiddenly exposed getg().m.p.ptr().id (through runtime_procPin). So, effectiveness of sharding by P.id is known to core team.

So why not expose it to regular Go users?

(of cause, runtime_procPin does more than simply returning current P.id.
But just returning P.id with runtime.Pid() will be already great help).

@bradfitz
Copy link
Contributor

bradfitz commented Jan 10, 2017

So why not expose it to regular Go users?

Because Go considers simplicity a virtue. When there are trade-offs between simplicity and performance, Go considers simplicity. You seem to have a different value trade-off, and it's obvious you prefer performance above all else. That's probably why many of your proposals to Go end up getting rejected. Please keep that in mind for future proposals.

We understand that things can be always be faster, and we want things to be fast, but only if things are also simple and have clean APIs. We do not want to document and support and export all possible innards.

@funny-falcon
Copy link
Contributor Author

But such thing as P.id is very simple thing. Does anyone expect changing of Go runtime scheduler core schema? Will P ever be removed?

@funny-falcon
Copy link
Contributor Author

That's probably why many of your proposals to Go end up getting rejected.

"just keep pushing" :-)

@bradfitz
Copy link
Contributor

Pushing the same ideas over and over is actually more likely to annoy people. Please try to understand the space of trade-offs that Go strives for.

But such thing as P.id is very simple thing.

That's not the point. We don't expose every "simple thing" just because it's simple. We also have to then document it, support it, test it, and keep it available forever. And we have to consider whether it's redundant with other API, or will be misused, etc.

Does anyone expect changing of Go runtime scheduler core schema?

It has in the past.

@mdempsky
Copy link
Contributor

@funny-falcon My main impression from this thread is it's not obvious that the design space has been exhausted. E.g., you could already locklessly assign a unique ID to each worker yourself. Why have you ruled that out?

@funny-falcon
Copy link
Contributor Author

@mdempsky single lockless approach awailable in Go is atomic.AddUintxx , but it is also doesn't scale with a lot of cpu cores cause of cacheline contention.

Any "lock" or "lockless" technique will fail to scale without hint about current scheduling item. That is why it were used for sync.Pool, yes?

@jonhoo uses CPUID instruction and parsing '/proc/cpuinfo' , but it works only on linux x86 : https://github.com/jonhoo/drwmutex

@ianlancetaylor
Copy link
Member

The Go runtime is happy to use per-P or per-M data in support of specific, simple, well-defined functionality that can, if necessary, be provided in other ways. If you look back at the history of sync.Pool, you will see a lot of discussion about exactly these issues: is sync.Pool simple enough, well-defined enough, for people to use successfully.

To me your proposal seems very loosely defined. It's nearly indistinguishable from "give me a random number in the range 0 to GOMAXPROCS, and try to give me the same random number every time, but don't worry if you don't." That's not a well-defined API that people can use to write good code, because the expectations are not clear. Since the expectations are not clear, people will rely on the implementation, and that means that if we change the implementation the performance characteristics of people's code will change. That's very problematic for a feature that exists only for its effect on performance.

@funny-falcon
Copy link
Contributor Author

Pushing the same ideas over and over is actually more likely to annoy people.

I pushed different ideas. Yes, I am annoying a bit. But someone still should push strange ideas. At least, there will be documented rejection of such idea.

We don't expose every "simple thing" just because it's simple.

That is why here is discussion of proposal: to decide worthwhile it or not.

It has in the past.

Given two most successful current green thread runtimes (Go and Erlang) has similare structure (at least, they both has per-core schedulers - P ), it is quite unfortunately for next big change. I could be mistaken.

@davecheney
Copy link
Contributor

davecheney commented Jan 10, 2017 via email

@funny-falcon
Copy link
Contributor Author

@davecheney Original my use-case is still rpc. I just realized that runtime.Pid() is simply better.
Database, for which I'm maintaining connector, may handle 1Mrps easily, and at this rate every possibility for improvement is worthwhile.

I saw "per-cpu local info" requests at least several times during discussion of this and previous proposal (walking by links), so it doesn't look like it is just my crazy need. And sharding by runtime.Pid() looks like very close approximation, even if Mutex is still involved (ie no pinning like for sync.Pool).

@funny-falcon
Copy link
Contributor Author

@ianlancetaylor

It's nearly indistinguishable from "give me a random number in the range 0 to GOMAXPROCS, and try to give me the same random number every time, but don't worry if you don't."

Api for "Pid()" is "give me a number, that approximates scheduling internals: cpu number or scheduler (P), - so that two goroutines that calls this method simultaneously (ie in parallel) will receive different numbers with very high probability, so if I use it for choosing mutex or other synchronization primitive, it will likely to be non-contended. Preferable it should be bound with understandable small value (num-of-cpu cores or GOMAXPROCS)."

@davecheney
Copy link
Contributor

davecheney commented Jan 10, 2017 via email

@dsnet dsnet added this to the Proposal milestone Jan 10, 2017
@funny-falcon
Copy link
Contributor Author

Currently I choose shard for putting request into by atomic increment. But it means, even if goroutine is not preempted, it will spread its requests across many shards. (Shard - is datastructure that aggregates requests before network writer writes them into connection).

With Pid() cache locality will be much improved and lock contention decreased.

Another obvious use-case for Pid is statistic collector: without such hint even atomic operations will fail to scale on multi-core system. With such hint statistic will scale smoothly.

@davecheney
Copy link
Contributor

Another obvious use-case for Pid is statistic collector: without such hint even atomic operations will fail to scale on multi-core system. With such hint statistic will scale smoothly.

Please, can we talk about your specific problem, not other problems that other people may be able to solve with this feature.

@funny-falcon
Copy link
Contributor Author

funny-falcon commented Jan 10, 2017

Let it be called not Pid but ProcHint() :

// ProcHint gives a hint about current scheduler item
// (for example, cpu nunber, or id of internal scheduler).
// ProcHint is bounded by 0 <= ProcHint < ProcHintMax
// You may use it to reduce lock contention if you may allocate
// ProcHintMax independent mutexes or channels and
// then choose them using ProcHint.
func ProcHint() int

// ProcHintMax gives upper bound of ProcHint.
// Note: it could be changed in runtime, for example,
// if you change GOMAXPROCS or if you hot-swap cpu.
func ProcHintMax() int

@minux
Copy link
Member

minux commented Jan 11, 2017 via email

@funny-falcon
Copy link
Contributor Author

I understood, fixed allocator api could be simpler:

//NewFixSharded preallocates all values by calling alloc function, and returns new FixSharded.
//FixSharded never changes its size, ie never allocates new value after construction.
NewFixShareded(alloc func() interface) *FixSharded {}
//NewFixShardedN preallocates exactly n values by calling alloc function, and returns new FixSharded.
NewFixSharededN(n int, alloc func() interface) *FixSharded {}
func (a *FixSharded) Get() interface{} {}

@funny-falcon
Copy link
Contributor Author

Edit: FixSharded.Get is non-locking, so application code should contain synchronization around value usage.

@bcmills
Copy link
Contributor

bcmills commented Jan 26, 2017

@funny-falcon

No, I'm talking about connector to external database capable to serve > 1Mrps.

I still don't know what that means. How about some pseudocode, or (better still) a link to some real code?

@funny-falcon
Copy link
Contributor Author

That is current code:

It is capable to send 1Mrps requests and waits for 1Mrps responses through one connection using 4 cores. Database ( https://tarantool.org ) does not even uses 100% of single core at this request rate, so there is room for improvements.

@bcmills
Copy link
Contributor

bcmills commented Jan 26, 2017

  • fetching "request" from hash map
  • reader fetches reads response

Don't those steps introduce cross-thread contention anyway? (How are you keeping those calls colocated to the same thread as the putFuture call?)

At any rate: that's very complicated, heavily asynchronous code. I would argue that it's not idiomatic for a Go program, and we should be optimizing for making idiomatic code more efficient rather than speeding up complicated code.

A much simpler example (e.g. a self-contained benchmark with a similar request flow) would be helpful.

@funny-falcon
Copy link
Contributor Author

"Simple should be simple, and complex should be possible".

If you will not add possibilities for "complex cases", they still will be written in C++.

@funny-falcon
Copy link
Contributor Author

funny-falcon commented Jan 26, 2017

  • fetching "request" from hash map
  • reader fetches reads response

Don't those steps introduce cross-thread contention anyway?

Yes. But they are sharded by incoming response id, and it is much shorter than sending request, cause doesn't contain deserialization of response (and sending contains serialization of request).

Cause of serialization into reusable batch buffer, CPU alignment is more important on request sending than on filling request with response bytes.

@cznic
Copy link
Contributor

cznic commented Jan 26, 2017

If you will not add possibilities for "complex cases", they still will be written in C++.

A language trying too hard to be loved by everyone ends loved by few.

@funny-falcon
Copy link
Contributor Author

@cznic two simple methods lowlevel could not destroy love for Golang. Impossibility to do something efficiently can.

Number of those, who will directly write code using proposed methods, will be small. And this people actually knows what they wants. For those people low-level methods are usually better.

But number of indirect users are much larger, cause number of users who will use libraries written by former people is more countable.

@cznic
Copy link
Contributor

cznic commented Jan 26, 2017

Number of those, who will directly write code using proposed methods, will be small.

I rest my case 😉

@bcmills
Copy link
Contributor

bcmills commented Jan 26, 2017

@funny-falcon

Impossibility to do something efficiently can [destroy love for Go].

The example you've provided is so complex that it's very difficult to see whether your use-case is truly impossible to address efficiently (both with the standard library as it stands today and with the alternative proposals).

Which is to say: one of the great strengths of Go is that it is reasonably efficient for programs implemented in a direct, simple style. In many cases, attempts to "optimize" Go programs using idioms from other languages (e.g. futures and callbacks) end up fighting against optimizations in the runtime (escape analysis, goroutine scheduling, the transaction-oriented collector, etc.).

If you want to argue that something is necessary on efficiency grounds (and, let's be clear, the Go compiler and runtime have lots of room for improvement!), the first step is to describe the problem you're trying to solve as idiomatic code and show where the bottlenecks occur (e.g. using pprof profiles or similar).

@funny-falcon
Copy link
Contributor Author

About idiomatic Go code:

  • unsafe is not idiomatic, but many libraries ought to use it for performance
  • sync and sync/atomic is not idiomatic, but many libraries ought to use it for performance.
  • more and more often people uses fasthttp instead of idiomatic net/http . Why? Cause it is fast.

People choose Go cause it is simpler and faster. But it is faster not only cause it is "compiled and statically typed", but because there is possibility to write fast libraries in... and usage of sync, sync/atomic, unsafe is huge part of this possibility.

More possibilities to write faster libraries => more possibilities to write fast applications => more new users who want or need performance (hint: most of Go users).

@funny-falcon
Copy link
Contributor Author

@bcmills show me RPC written in idiomatic Go and capable to perform 1M requests per second. Then I will never write any new silly proposal. I will just use/copy code that you show me.

@bcmills
Copy link
Contributor

bcmills commented Jan 27, 2017

@funny-falcon There's no need to get hostile.

The point of asking for idiomatic code is not that it does, today, what you want it to do. The point is that it becomes much easier to see where and why it does not do what you want it to do, and the solution that enables your use-case is more likely to benefit the Go community as a whole rather than just one niche application.

And I disagree with what you wrote about unsafe, sync, and atomic. I would consider those — and even reflect — to be perfectly idiomatic in many situations. When I'm talking about "idiomatic Go code", I mean code with straightforward control flow, mostly synchronous function calls, mostly functional-style APIs, types with relatively few methods, methods with relatively few parameters, and so on.

The use of mostly synchronous functions is particularly important in that regard: synchronicity tends to produce good cache locality for server-style programs (with lots of hot "per-request" data, relatively little "background" data, and relatively few different kinds of tasks being performed), makes the behavior of the program clearer to the goroutine scheduler, and makes it a bit easier to thread parameters down through the call tree without introducing aliasing bugs (leading to, for example, the ability to allocate the "per thread" data at the top of a goroutine and pass it down through that goroutine rather than scattering map lookups throughout the code).

@funny-falcon
Copy link
Contributor Author

@bcmills I'm not even-tempered, but I'm not hostile.

@funny-falcon
Copy link
Contributor Author

I'm closing now. May be I'm mistaken somewhere.

@valyala
Copy link
Contributor

valyala commented Jan 28, 2017

I'd just expose a single function in runtime package:

// PID returns an id currently associated with the CPU core.
//
// The function may return distinct values for the same CPU core in the long run.
// There are no guarantees on the maximum and minimum id values.
func PID() uint32 {
  return getg().m.p.ptr().id
}

This functionality is the core of timers scalability CL - https://go-review.googlesource.com/#/c/34784/ .

@funny-falcon
Copy link
Contributor Author

@valyala, you did it! thank you for timers!

I'd just expose a single function in runtime package:

That is what I named ProcHint, cause core-team doesn't want for this to be named like a hook into internals.

I agree, single functions could be already enough.

@funny-falcon funny-falcon reopened this Jan 28, 2017
@funny-falcon
Copy link
Contributor Author

@bradfitz @bcmills technique used by @valyala to improve timers at https://go-review.googlesource.com/#/c/34784/ exactly shows why runtime.ProcHint() (or Pid()) is useful for high performance multi-core programming.

I agree with @valyala that ProcMaxHint most likely doesn't need. It is enough to have fixed size number of "shards".

@rsc
Copy link
Contributor

rsc commented Feb 13, 2017

The valid uses of this function are far outweighed by the buggy invalid uses. There are of course uses of per-P data inside the runtime. That's not an argument for exposing P numbers outside the runtime. I would rather add a few higher-level APIs that are easier to use correctly, like the ShardedValue proposal, than add one low-level API that is very hard to use correctly.

@happilymarrieddad
Copy link

Hey guys,

syscall.Gettid() works pretty well.

package main

import (
	"log"
	"runtime"
	"syscall"
)

func main() {
	runtime.GOMAXPROCS(runtime.NumCPU())

	num := 20
	sem := make(chan int, num)

	for i := 0; i < num; i++ {
		go func() {
			log.Println(syscall.Gettid())

			sem <- 0
		}()
	}

	for i := 0; i < num; i++ {
		<-sem
	}

	return
}

@happilymarrieddad
Copy link

selection_071

@funny-falcon
Copy link
Contributor Author

@happilymarrieddad , no it doesn't . There is only GOMAXPROCS of possible running schedulers (unless one use runtime.LockOsThread), and thousands of native threads (if some syscall or C library locks for a long time).

@golang golang locked as resolved and limited conversation to collaborators Feb 1, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests