
proposal: sync: add GetLocal() method to sync.Pool in order to return only P-local objects #65104

Open
valyala opened this issue Jan 15, 2024 · 12 comments

Comments

@valyala
Contributor

valyala commented Jan 15, 2024

Proposal Details

Currently sync.Pool.Get() works in the following way:

  1. It returns a P-local object if there is one.
  2. Otherwise, it steals an object from another P's cache.

This logic works great on systems with a small number of CPU cores, but it may slow down significantly on systems with many CPU cores, for the following reasons:

  • Stealing an object from another P requires reading memory that is missing from the local CPU cache. Such a read may be multiple orders of magnitude slower than a read from the local CPU cache. The CPU must then remove the stolen object from the foreign P's queue, which requires writing to foreign memory and, in turn, a potentially very slow inter-CPU cache flush/update for the modified memory location.
  • When the stolen object is returned from sync.Pool.Get(), its contents may still be missing from the local CPU cache, so further work with this object may be much slower compared to work with a P-local object.
  • The probability of inter-CPU object ping-pong between different P caches in sync.Pool increases with GOMAXPROCS (i.e. the number of CPU cores), since every P-local cache holds at least one object only after all the Ps simultaneously execute the same CPU-bound code between Get() and Put() calls. In other words, sync.Pool's inefficiency grows with the number of CPU cores.

It looks like the way to go is to disallow stealing objects from other Ps. But this may slow down some valid use cases for sync.Pool, where Pool.Get() and Pool.Put() are called for the same pool from different sets of goroutines running on different sets of Ps. Disallowing stealing would then result in excess memory allocations on the Pool.Get() side, since the P-local cache would always be empty.

Given these facts, it would be great to add a GetLocal() method to sync.Pool, which returns nil if no object is found in the P-local cache, without attempting to steal an object from other P caches. GetLocal() could then be used instead of Get() in performance-critical code that calls Put() for the returned object on the same goroutine (and P) most of the time.

Update: the current implementation of sync.Pool.Put() already puts the object only into P-local storage, but this isn't documented and may change in the future, so it may be worth adding a PutLocal() method to make the API more consistent.
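A hedged sketch of what the proposed additions could look like (these signatures are illustrative only and do not exist in the standard library today):

// GetLocal returns an object from the P-local cache, or nil if the
// P-local cache is empty. Unlike Get, it never steals objects from
// other P caches and never falls back to New.
func (p *Pool) GetLocal() any

// PutLocal adds x to the P-local cache only, making the currently
// undocumented P-local behaviour of Put explicit.
func (p *Pool) PutLocal(x any)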

Summary

  • The proposal doesn't change the behaviour or the performance of existing code.
  • The proposal lets developers improve sync.Pool scalability and performance on systems with a large number of CPU cores by switching from Pool.Get to Pool.GetLocal in CPU-bound code where the object is returned to the pool on the same goroutine that obtained it.
@gopherbot gopherbot added this to the Proposal milestone Jan 15, 2024
@Jorropo
Member

Jorropo commented Jan 15, 2024

To me it looks like a worse version of #18802 because it unnecessarily ties this logic to sync.Pool and its implementation.

@valyala
Contributor Author

valyala commented Jan 15, 2024

To me it looks like a worse version of #18802 because it unnecessarily ties this logic to sync.Pool and its implementation.

The current proposal is aimed at solving a single practical task: the ability to use sync.Pool efficiently on systems with a large number of CPU cores, when Pool.Put() is called on the same goroutine as Pool.Get() after performing some CPU-bound task. It doesn't require changing the semantics of sync.Pool. It just adds one method, GetLocal(), with a very clear purpose: return a P-local object if the pool contains one, otherwise return nil.

The #18802 proposal, on the other hand, has a completely different purpose: to provide access to per-CPU objects. That proposal has no clear design yet, let alone an implementation. For example, as I understood #18802, it proposes creating some magic sharded object with exactly one shard per P. sync.Pool, in contrast, allows storing multiple local objects on the same P. It also allows dropping the stored objects at any time, so the user must expect that sync.Pool.Get() or sync.Pool.GetLocal() may return nil.

@Jorropo
Member

Jorropo commented Jan 16, 2024

@valyala thx for explaining how it is different.
To me they still seem extremely similar: I could create GetLocal with #18802 by storing a pointer to a slice in the sharded value (so it can store more than one value).

About your proposal: I don't think returning nil feels sync.Pool-ish; imo it should call sync.Pool.New.

@qiulaidongfeng
Member

Note that if sync.Pool also supports stealing objects from other P caches, a goroutine-local or P-local store may be required to prevent one goroutine from getting objects from a P-local cache while another goroutine steals objects from that same cache. That way {Get,Put}Local would not require atomic synchronization, because the same block of memory would never see unsynchronized (unlocked or non-atomic) reads and writes.

@prattmic
Member

cc @aclements @mknyszek

@aclements
Member

You've identified an interesting issue, but I don't think putting the problem on the user by adding an API is the right answer. I think it would be really unclear to users when to call Get versus GetLocal, and would likely depend on the properties of the system their code is running on.

Instead, I think we could improve sync.Pool's implementation without exposing this complexity to users. For example, Get is always allowed to return a new object, so we could certainly implement a policy where it steals less aggressively (or even never!) from distant caches. I think the main barrier to doing something like that would be that we don't have any sort of CPU/memory affinity right now, and I would worry that doing it strictly based on local/non-local P would interfere too much with existing uses. (Besides, stealing from other CPUs that share an LLC is still pretty cheap compared to stealing from a remote node.)

@valyala
Contributor Author

valyala commented Jan 18, 2024

I don't think putting the problem on the user by adding an API is the right answer. I think it would be really unclear to users when to call Get versus GetLocal, and would likely depend on the properties of the system their code is running on.

sync.Pool isn't a trivial concept. It already requires some qualification to use efficiently. For example:

  • If the code forgets to return the object to the pool, then sync.Pool usage degrades to just allocating a new object instead of calling sync.Pool.Get.
  • If the code stores an object by value in the pool, then this usually means an additional allocation and copy on each sync.Pool.Put() call, which makes the pool usage questionable.
  • If the code obtains many objects from the pool in one goroutine and returns them to the pool in another goroutine, after performing some IO and passing the objects between goroutines via channels, then the efficiency of sync.Pool usage is likely close to zero compared to just letting the garbage collector do its usual work, i.e. allocating objects in one goroutine and not bothering with manual recycling. Moreover, code with sync.Pool can be slower than code with regular object allocation when running on systems with many CPU cores, because of the inter-CPU memory access issues described in the original post above.
  • If the code continues using an object after returning it to the pool, this usually results in data races and random crashes.
  • If the recycled object is improperly cleared before the next use, this may result in various interesting bugs (see the sketch after this list).
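As an example of the last pitfall, here is a minimal sketch using the common bytes.Buffer pooling idiom; the Reset call is the part that is easy to forget:

var bufPool = sync.Pool{
  New: func() any { return new(bytes.Buffer) },
}

func render(data []byte) []byte {
  buf := bufPool.Get().(*bytes.Buffer)
  buf.Reset() // without this, bytes left over from a previous use leak into the result
  buf.Write(data)
  out := append([]byte(nil), buf.Bytes()...) // copy, since buf is reused after Put
  bufPool.Put(buf)
  return out
}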

So I don't think that adding sync.Pool.GetLocal and sync.Pool.PutLocal will complicate life for developers who are already aware of the sync.Pool pitfalls mentioned above. On the other hand, it will allow optimizing the case where the object is returned to the pool on the same goroutine that obtained it, after performing some CPU-bound work, as in the code below:

var p sync.Pool

func f() {
  v := p.Get()
  doSomeCPUBoundWork(v)
  p.Put(v)
}

Then this case can be optimized to:

var p sync.Pool

func f() {
  v := p.GetLocal()
  doSomeCPUBoundWork(v)
  p.PutLocal(v)
}

The code with GetLocal / PutLocal is clear enough to understand what it actually does if you are familiar with the sync.Pool concept.

@Jorropo
Member

Jorropo commented Jan 18, 2024

Is there any situation where someone would mix GetLocal and Get calls to the same pool?

@valyala
Contributor Author

valyala commented Jan 18, 2024

Is there any situation where someone would mix GetLocal and Get calls to the same pool?

In theory such situations can occur, though they should be pretty rare. For example, the pool could be used simultaneously in CPU-bound and IO-bound code:

var p sync.Pool

func f1() {
  v := p.GetLocal()
  doSomeCPUBoundWork(v)
  p.PutLocal(v)
}

func f2() {
  v := p.Get()
  doSomeIOBoundWork(v)
  p.Put(v)
}

Get() and Put() can still be used in the worker pattern, where the object is obtained from the pool in one goroutine and returned to the pool in another goroutine:

func f() {
  v := p.Get()
  workerCh <- v
}

func worker(ch chan any) {
  for v := range ch {
    doSomeWork(v)
    p.Put(v)
  }
}

var workerCh = func() chan any {
  ch := make(chan any)
  // Pass the channel to the worker explicitly: the goroutine may start
  // running before the workerCh package variable is assigned.
  go worker(ch)
  return ch
}()

@valyala
Contributor Author

valyala commented Jan 18, 2024

After a closer look at the current sync.Pool implementation, it looks like the IO-bound case also improves when using Pool.GetLocal instead of Pool.Get:

func f() {
  v := p.GetLocal()
  doSomeIOBoundWork(v)
  p.PutLocal(v)
}

var p sync.Pool

If this code is executed concurrently on all Ps, then it is faster to get a P-local object than to steal an object from other P caches in the pool: there is a high chance that the P-local cache already contains an object put there via a PutLocal() call. The only remaining valid case for Get() is the worker pattern described above. But even this pattern can be switched to GetLocal() if GOMAXPROCS concurrent workers read from workerCh, which increases the chance that an object already exists in the P-local cache (see the sketch below).
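For illustration, a sketch of such a worker setup with one worker per P (GetLocal and PutLocal are the proposed methods, not existing API):

var workerCh = make(chan any)

func startWorkers() {
  // One worker per P increases the chance that PutLocal places an object
  // into the same P-local cache that a later GetLocal call reads from.
  for i := 0; i < runtime.GOMAXPROCS(0); i++ {
    go func() {
      for v := range workerCh {
        doSomeWork(v)
        p.PutLocal(v) // proposed method
      }
    }()
  }
}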

So it is probably a good idea to disable stealing objects from other P caches in sync.Pool by default and, instead, add the ability to explicitly enable object stealing for the rare cases that need it, by setting an AllowStealing option on sync.Pool:

var p = &sync.Pool{
  // Explicitly allow stealing an object from other P caches
  AllowStealing: true,
}

func f() {
  v := p.Get()
  workCh <- v
}

var workCh = func() chan any {
  ch := make(chan any)
  go func() {
    for v := range ch {
      doSomeWork(v)
      p.Put(v)
    }
  }()
  return ch
}()

If the AllowStealing option isn't set in this case, then it degrades to allocating a new object in f(). This is probably not so bad, since it may still be faster if the Go runtime prefers allocating objects from P-local memory: the initial access to a freshly allocated object will then be faster than the initial access to an object stolen from another P's cache in sync.Pool.

@valyala
Contributor Author

valyala commented Jan 18, 2024

It looks like the following idea may automatically cover the worker pattern case without the need to introduce an AllowStealing option:

  1. sync.Pool must maintain a per-P counter of local cache misses.

Then Get() can be implemented in the following way:

  1. If an object exists in the P-local cache, it is returned immediately, and the per-P cache miss counter is reset if it doesn't exceed some threshold N.
  2. If no object is found in the local cache, the per-P cache miss counter is incremented.
  3. If the per-P counter doesn't exceed the threshold N, then nil is returned.
  4. Otherwise an attempt is made to steal an object from other P caches. The per-P cache miss counter isn't reset even if an object is successfully stolen from another P's cache.

This algorithm prefers returning P-local objects, returning nil for up to N repeated cache misses in the hope that the user code will soon return an object to the P-local cache via sync.Pool.Put(). Once N repeated cache misses are reached, we can assume the pool is used in some kind of worker pattern, so fall back to object stealing for all subsequent Get() calls.
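A minimal sketch of this policy, written against a hypothetical interface since none of these pool internals are exposed today:

// popLocal and stealFromOthers are hypothetical stand-ins for the
// P-local lookup and the cross-P stealing inside sync.Pool.
type poolInternals interface {
  popLocal() any
  stealFromOthers() any
}

const missThreshold = 42 // the arbitrary N discussed below

type adaptivePool struct {
  internals poolInternals
  misses    int // per-P in the real design; a single counter here for brevity
}

func (p *adaptivePool) Get() any {
  if v := p.internals.popLocal(); v != nil {
    if p.misses <= missThreshold {
      p.misses = 0 // step 1: reset only while below the threshold
    }
    return v
  }
  p.misses++ // step 2
  if p.misses <= missThreshold {
    return nil // step 3: hope that Put refills the P-local cache soon
  }
  // Step 4: sustained misses suggest a worker pattern, so fall back to
  // stealing. The counter is deliberately not reset on a successful steal.
  return p.internals.stealFromOthers()
}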

The question is which N threshold should be chosen:

  • 1 is a good value for CPU-bound code, which tends to run a single goroutine per P that obtains an object from the pool, works with it, and then returns it to the pool.
  • A threshold of 1 may not be enough for IO-bound code, where multiple goroutines can obtain objects from the pool on the same P and then perform some IO before returning them to the pool. Too small a threshold may switch goroutines that perform IO over to object stealing.
  • Too big a threshold may postpone the switch to object stealing in the worker pattern, resulting in excess object allocations before the switch. I don't think this is a big deal. So, how about setting N to some fairly high, arbitrarily chosen value such as 42? :)

@someview

someview commented Feb 6, 2024

Good idea. We have run into a similar problem: when the Put rate is slower than the Get rate, sync.Pool holds a lot of memory until the next GC. The v := p.GetLocal() method would give us more opportunity to control the pool's memory usage.
