Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sync: per-cpu storage #8281

Open
hyangah opened this issue Jun 24, 2014 · 23 comments
Open

sync: per-cpu storage #8281

hyangah opened this issue Jun 24, 2014 · 23 comments
Labels
Milestone

Comments

@hyangah
Copy link
Contributor

hyangah commented Jun 24, 2014

Provide a way to create/access cpu local objects. This may make it easy to  batch data
per-CPU to reduce global locking. This is a feature request.

Potential use case:

scalable stats across many goroutines - stat updates are batched in the per-CPU stat
object. Periodically or on-demand, the stat batches are aggregated to a global stat
object. 

This is similar to sync.Pool in that it will help reducing lock contention, but
different in that the object lifetime should be explicit and the per-CPU objects need to
be accessible from other CPUs if necessary.
@ianlancetaylor
Copy link
Contributor

Comment 1:

Labels changed: added repo-main, release-go1.4.

@dvyukov
Copy link
Member

dvyukov commented Jun 24, 2014

Comment 2:

The stats here are not just counters, but rather complex objects indexed by dynamically
constructed strings and aggregating data into histograms, etc. Right?
So simple cpu-cpu counters won't do.
The API must be something like:
func acquire() T
func release(T)
func foreach(func(T))

@hyangah
Copy link
Contributor Author

hyangah commented Jun 25, 2014

Comment 3:

Yes. I want more complex objects than a simple integer counter, with custom aggregation
logic. 
I assume acquire fetches an instance of T from the cpu local(?); release returns the
instance to the current cpu local. Since a goroutine can be moved to another cpu,
release may return the instance to a different cpu local. If my guess is correct, there
could be multiple T instances corresponding to a single cpu at a certain point.
Do you expect foreach can work as flush mechanism of Ts as well?

@dvyukov
Copy link
Member

dvyukov commented Jun 25, 2014

Comment 4:

Yes, there can more instances than CPUs (we can't pin arbitrary user code to CPU). I
would do something like at most 2*NCPU objects.
I don't like the idea of foreach being a flush. First it's not obvious. Second foreach
may not be called at all. Does it work for you if we just have at most 2*NCPU objects?

@hyangah
Copy link
Contributor Author

hyangah commented Jun 25, 2014

Comment 5:

SGTM.

@gopherbot
Copy link
Contributor

Comment 6 by ajmani:

Would this approach work for log buffers?  I'd like to allow request handlers to log to
buffers that are flushed asynchronously.

@hyangah
Copy link
Contributor Author

hyangah commented Jul 2, 2014

Comment 7:

I guess T can represent a buffer. foreach caller can choose to empty the buffer. It may
be hard to guarantee the log entries are flushed in the logging time order.

@rsc
Copy link
Contributor

rsc commented Jul 2, 2014

Comment 8:

I can't see any connection between async log flushing and per-CPU storage.
Separately, I think that per-cpu storage is probably too low-level a primitive. It's
defining an implementation instead of a high-level goal. 
For example, the high-level goal of sync.Pool is to reclaim memory kept in caches. It
happens that we built a very fast sync.Pool by using per-CPU storage underneath, but the
per-CPU storage is not the point.
I suspect that the real thing you need is a better high-level target than per-CPU
storage.

Status changed to Thinking.

@dvyukov
Copy link
Member

dvyukov commented Jul 2, 2014

Comment 9:

The high-level idea here is a non-leaking pool of objects, with the interface along the
lines of:
type NonLeakingPool ...
func (*NonLeakingPool) Get() interface{}
func (*NonLeakingPool) Put(interface{})
func (*NonLeakingPool) ForEach(func(interface{}))

@rsc
Copy link
Contributor

rsc commented Jul 2, 2014

Comment 10:

That doesn't look like a high-level idea to me.

@Sajmani
Copy link
Contributor

Sajmani commented Jul 2, 2014

Comment 11:

For logging, I'd like to have an interface that lets multiple goroutines write lines of
text simultaneously while another goroutines consumes those lines, e.g., to flush them
off to disk or the network.  An implementation might provide per-CPU buffers to reduce
contention.
Hana's stats counters would be a different interface that might also leverage per-CPU
counters plus some aggregation.
In both cases, you have many writers and some aggregation in the consumer.

@rsc
Copy link
Contributor

rsc commented Jul 2, 2014

Comment 12:

Re #11 I really don't think you do. Logging needs backpressure or else you will fill all
your memory with logs when the writer is slow.

@Sajmani
Copy link
Contributor

Sajmani commented Jul 2, 2014

Comment 13:

I didn't say the buffer was unbounded.  I agree re: backpressure.

@rsc
Copy link
Contributor

rsc commented Jul 9, 2014

Comment 14:

per-cpu logging will also reorder log messages by a single goroutine. it's not a good
idea.

@hyangah
Copy link
Contributor Author

hyangah commented Jul 15, 2014

Comment 15:

Agree that this is not a good fit for logging - need api to discourage such
use. I expect the buffered channel perf improvement work to help for this
case.
What about the stat collection application? We've tried a channel based
implementation (let many goroutines send stat update through channel and
have a receiver goroutine that updates the stats.)  and a mutex-based
implementation (let many goroutines fight for a single mutex to update the
stats). None was satisfactory.

@dvyukov
Copy link
Member

dvyukov commented Jul 16, 2014

Comment 16:

"buffered channel perf improvement work" is deferred, see:
https://groups.google.com/d/msg/golang-dev/0IElw_BbTrk/FtV82Tr3S00J
Your use case can potentially provide the necessary justification. You can compare
performance of chan-based solution with and without the change.

@RLH
Copy link
Contributor

RLH commented Jul 16, 2014

Comment 17:

Would returning the CPU id (index) be sufficient? The id could be mapped to whatever
data one wants and the interfaces envisioned here could be implemented.

@hyangah
Copy link
Contributor Author

hyangah commented Jul 16, 2014

Comment 18:

Good point. 
I will need to protect the data myself from concurrent accesses (in case the goroutine
is migrated to other cpu in the middle, or the other reader goroutine iterates), but
that's not a big deal. Is it possible for runtime to return cpu index?

@dvyukov
Copy link
Member

dvyukov commented Jul 17, 2014

Comment 19:

From implementation point of view, it's trivial (I mean "virtual processor"
0..GOMAXRPOCS-1).

@rsc
Copy link
Contributor

rsc commented Sep 18, 2014

Comment 20:

Labels changed: added release-none, removed release-go1.4.

@hyangah
Copy link
Contributor Author

hyangah commented Jul 8, 2015

after listening to Prometheus talk at gophercon, i remembered this issue (exactly same motivation) and I wonder if per-cpu storage can help reducing lock contention.

@bcmills
Copy link
Contributor

bcmills commented Jan 25, 2017

func (*NonLeakingPool) Get() interface{}
func (*NonLeakingPool) Put(interface{})
func (*NonLeakingPool) ForEach(func(interface{}))

Get and ForEach suffice; you don't need Put unless you want threads to be able to deallocate their per-thread resources without using finalizers, and I think that's a pretty rare use-case.

(It turns out that if you're willing to abuse finalizers, you can mostly implement this API today in terms of sync.Pool. See https://go-review.googlesource.com/#/c/35676/.)

On the other hand, the use of Put makes it possible for values to migrate from one pool to another, which increases the maximum number of values in-flight (to an arbitrary number, if you're not careful and have a systematic bias in the scheduler) and decreases the effectiveness of any potential NUMA-aware allocation strategy.

@bcmills
Copy link
Contributor

bcmills commented Jan 25, 2017

Hmm... I suppose that having Put does allow you to lock the goroutine to the OS thread for the duration of use of the value, which would prevent values from migrating but introduce surprising locking behavior as a side-effect, even in cases which don't strictly require it (e.g. pooled variables accessed using atomic ops).

Perhaps it would be clearer to structure the per-thread pool and the thread-locking behavior as separate APIs:

thread.Locked(func() {
  v := pool.Get()
  doSomethingExclusiveWith(v)
})

(thread.Locked can't be implemented in terms of runtime.LockOSThread and runtime.UnlockOSThread because the caller may have locked the thread already.)

Unfortunately, moving the thread-locking to a separate API makes ForEach racy: the guarantee of exclusiveness means you'd need some other mechanism to lock each thread out of its own value when ForEach iterates over it. So if you want Get to imply exclusiveness, you do still need some kind of Put or Release, and at that point you may as well push the thread-locking behavior into the pool API. And then you've also go the issue of global lock ordering to contend with (if the use of the pooled variable and the ForEach call may both need to acquire other locks or perform channel ops).

It seems simpler to only provide a per-thread API that doesn't guarantee exclusiveness.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

8 participants