sync: per-cpu storage #8281
Yes. I want more complex objects than a simple integer counter, with custom aggregation logic. I assume acquire fetches an instance of T from the current CPU's local storage(?); release returns the instance to the current CPU's local storage. Since a goroutine can be moved to another CPU, release may return the instance to a different CPU's storage. If my guess is correct, there could be multiple T instances corresponding to a single CPU at some point. Do you expect foreach to work as a flush mechanism for the Ts as well?
Yes, there can be more instances than CPUs (we can't pin arbitrary user code to a CPU). I would do something like at most 2*NCPU objects. I don't like the idea of foreach being a flush: first, it's not obvious; second, foreach may not be called at all. Does it work for you if we just have at most 2*NCPU objects?
I can't see any connection between async log flushing and per-CPU storage. Separately, I think that per-cpu storage is probably too low-level a primitive. It's defining an implementation instead of a high-level goal. For example, the high-level goal of sync.Pool is to reclaim memory kept in caches. It happens that we built a very fast sync.Pool by using per-CPU storage underneath, but the per-CPU storage is not the point. I suspect that the real thing you need is a better high-level target than per-CPU storage. Status changed to Thinking.
For logging, I'd like to have an interface that lets multiple goroutines write lines of text simultaneously while another goroutine consumes those lines, e.g., to flush them off to disk or the network. An implementation might provide per-CPU buffers to reduce contention. Hana's stats counters would be a different interface that might also leverage per-CPU counters plus some aggregation. In both cases, you have many writers and some aggregation in the consumer.
Agree that this is not a good fit for logging - we'd need an API that discourages such use. I expect the buffered channel perf improvement work to help for that case. What about the stat collection application? We've tried a channel-based implementation (many goroutines send stat updates through a channel, and a receiver goroutine updates the stats) and a mutex-based implementation (many goroutines fight over a single mutex to update the stats). Neither was satisfactory.
"buffered channel perf improvement work" is deferred, see: https://groups.google.com/d/msg/golang-dev/0IElw_BbTrk/FtV82Tr3S00J Your use case can potentially provide the necessary justification. You can compare performance of chan-based solution with and without the change. |
After listening to the Prometheus talk at GopherCon, I remembered this issue (exactly the same motivation), and I wonder whether per-CPU storage could help reduce lock contention.
(It turns out that if you're willing to abuse finalizers, you can mostly implement this API today in terms of ….) On the other hand, the use of …
Hmm... I suppose that having … Perhaps it would be clearer to structure the per-thread pool and the thread-locking behavior as separate APIs:

```go
thread.Locked(func() {
	v := pool.Get()
	doSomethingExclusiveWith(v)
})
```

(…) Unfortunately, moving the thread-locking to a separate API makes … It seems simpler to only provide a per-thread API that doesn't guarantee exclusiveness.