
Reduce Garbage Collection pressure #1010

Closed
2opremio opened this issue Feb 23, 2016 · 10 comments

Labels
  chore: Related to fix/refinement/improvement of end user or new/existing developer functionality
  performance: Excessive resource usage and latency; usually a bug or chore

@2opremio (Contributor)

We can see from #812 (specifically #812 (comment)) and, more importantly, from #854 (specifically #854 (comment)) that the current main performance bottleneck is Garbage Collector pressure:

Note how runtime.scanobject accounts for almost 50% of the CPU time (scanobject+mallocgc) and how ps.Cons+ps.Set generate almost 30% of the garbage.

@2opremio 2opremio added the performance Excessive resource usage and latency; usually a bug or chore label Feb 25, 2016
@2opremio 2opremio added this to the 0.14.0 milestone Mar 4, 2016
@tomwilkie tomwilkie removed this from the 0.14.0 milestone Mar 17, 2016
@2opremio 2opremio modified the milestone: July2016 Jun 30, 2016
@rade rade added the chore Related to fix/refinement/improvement of end user or new/existing developer functionality label Jul 4, 2016
@rade (Member) commented Jul 19, 2016

I think I have identified a potentially quite significant improvement here...

While LatestMap.Merge is commutative, the order nevertheless matters for performance: when the newer entries are already in the target map rather than in the argument map, we avoid allocation altogether.
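
To make the order dependence concrete, here is a minimal sketch of the idea, using a plain Go map as a stand-in for the persistent LatestMap; the types and method bodies are illustrative assumptions, not Scope's actual implementation:

import "time"

// entry is a stand-in for a LatestMap entry: a value plus the probe-side
// timestamp at which it was observed.
type entry struct {
  Timestamp time.Time
  Value     string
}

type latestMap map[string]entry

// Merge keeps, per key, the entry with the newest timestamp. It only copies
// the receiver once the argument actually contributes a newer entry; when the
// target map already holds the newest data, no allocation happens at all,
// which is why feeding reports in newest-first order is cheaper.
func (m latestMap) Merge(other latestMap) latestMap {
  out := m
  copied := false
  for k, e := range other {
    if cur, ok := out[k]; ok && !e.Timestamp.After(cur.Timestamp) {
      continue // target already has the newer entry: no work, no garbage
    }
    if !copied {
      out = make(latestMap, len(m)+len(other)) // copy-on-first-write
      for k2, e2 := range m {
        out[k2] = e2
      }
      copied = true
    }
    out[k] = e
  }
  return out
}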

AFAICT, both collector and awsCollector invoke the merger with a list of reports that is largely in oldest to youngest order. SmartMerger does some hashing, as a result of which reports will be merged in random order.

Ideally SmartMerger would first merge reports from the same probe in reverse timestamp order, and only then merge them together with its tree-merge.

Another (tiny) optimisation: SmartMerger has this line:

report.MakeReport().Merge(left.rpt).Merge(right.rpt),

but note that report merging already copies the lhs, so the above really should just be

left.rpt.Merge(right.rpt)

(This won't save much since presumably copying an empty report is cheap).

Yet another, more substantial optimisation: merging should be n-ary, not binary, to avoid creation and copying of intermediate data structures.

@rade (Member) commented Jul 19, 2016

Re merging reports in newest-to-oldest order... note that it's actually the probe-side timestamps that matter here, since that is what LatestMap.Merge compares. Alas, in the awsCollector reports are keyed on the app timestamp, so successive reports from a single probe could hit different app instances and receive timestamps that effectively swap their order. Hopefully that is rare.

@rade (Member) commented Jul 19, 2016

Yet another, more substantial optimisation: merging should be n-ary, not binary, to avoid creation and copying of intermediate data structures.

Doing that will make SmartMerger superfluous, which in turn means we can do a dumb merge in reverse report timestamp order: start with a blank report, merge in the youngest report (this will end up copying that report), then the 2nd youngest (this will copy new entries and overwrite existing entries only where we see a newer timestamp, which should be rare), etc.

In the optimal case this will only allocate memory equivalent to the final merged report's size, and we will usually get close to that optimum.

As an additional optimisation, note that the blank report we start with and modify, including all its components, does not have to be a persistent map. It may need "freezing", i.e. converting to a ps map, at the end, depending on what happens to the result.
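
A rough sketch of that n-ary, reverse-timestamp-order merge (names, the map-of-strings payload, and the per-key rule are simplifying assumptions; the real merge would compare per-entry timestamps on the richer report.Report type):

import (
  "sort"
  "time"
)

// timedReport pairs a simplified report payload with its probe-side timestamp.
type timedReport struct {
  Timestamp time.Time
  Latest    map[string]string
}

// MergeAll merges n reports in one pass into a single plain (mutable)
// accumulator, so no intermediate merged reports are allocated. The
// accumulator could be "frozen" into a persistent map at the end if the
// caller needs that.
func MergeAll(reports []timedReport) map[string]string {
  // Newest first: older reports mostly find their keys already present
  // and are skipped, so we rarely overwrite and rarely allocate.
  sort.Slice(reports, func(i, j int) bool {
    return reports[i].Timestamp.After(reports[j].Timestamp)
  })
  out := map[string]string{}
  for _, r := range reports {
    for k, v := range r.Latest {
      if _, ok := out[k]; !ok {
        out[k] = v
      }
    }
  }
  return out
}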

@2opremio (Contributor, Author)

Another thing which could potentially make a difference: Field allocations on Node initialization.

All topologies use the same generic type of node, but their field usage is different. However, some fields (e.g. Metrics, Controls) are initialized with non-nil values, which will cause allocations even if they're not used.

It might be worth initializing those fields to nil (or migrating the Metrics and StringSets to immutable data structures, whose empty value can be a shared global).
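
A sketch of the nil-until-needed approach (the field names follow the discussion above; the real report.Node has more fields and different types):

// Node is a cut-down stand-in for report.Node.
type Node struct {
  Latest   map[string]string
  Metrics  map[string][]float64 // unused by some topologies
  Controls []string             // unused by some topologies
}

// MakeNode only allocates the field every topology needs; Metrics and
// Controls stay nil, which costs nothing and still reads safely
// (len(nil) == 0, ranging over nil is a no-op).
func MakeNode() Node {
  return Node{Latest: map[string]string{}}
}

// AddMetric allocates the Metrics map lazily, on first write.
func (n *Node) AddMetric(key string, sample float64) {
  if n.Metrics == nil {
    n.Metrics = map[string][]float64{}
  }
  n.Metrics[key] = append(n.Metrics[key], sample)
}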

@2opremio (Contributor, Author) commented Jul 25, 2016

Here are some notes I've been collecting over the weekend. The Scope App flamegraphs/profiles I've obtained show that the Set() operation of the immutable maps we use generates a lot of garbage: it allocates a new node for each element in the tree path to the target key, and even in the best-case scenario it makes one allocation (for the root node). Here are some potential improvements:

  • Try other immutable maps (radix tree, HAMT). They may generate less garbage (although I doubt it). Here are some examples: https://github.com/hashicorp/go-immutable-radix , https://github.com/mediocregopher/seq . See how it's done in Clojure: http://clojure.org/reference/transients , http://nyeggen.com/post/2013-08-07-persistent/ , http://blog.higher-order.net/2009/09/08/understanding-clojures-persistenthashmap-deftwice.html , https://groups.google.com/forum/m/#!topic/clojure/4kUQ9JmljyA
  • We should verify trees are being properly balanced (could the hashing operation used be leading to unbalanced trees causing more allocations due to longer paths?)
  • To construct maps, we incrementally call Set() for each map element, which causes a lot of garbage. Instead, we could use a "fromarray" constructor in which all initial keys and values are provided at once:
    • It lets you use mutable operations during creation since you know you have the only reference to the map.
    • It lets you allocate all tree entries at once in an array. However, this will result in undesired (and maybe even intolerable?) leaks, since individual tree entries can no longer be deallocated individually by the GC. For instance, when a map B is created from a given map A (e.g. using Set), B will still reference the original array even if a whole subtree is no longer needed; when A goes out of scope, those subtree entries cannot be deallocated by the GC since they live in the same array. This may be a showstopper due to long-term in-process caches (rendering cache, ...).
    • It can let you build the map in O(n) instead of O(n log n) (the cost of creating it through repeated external Set() calls). Depending on the persistent map implementation, this may have some requirements, like providing an array sorted by the keys (not the case in the hash tree we are using, since it's not sorted).
  • In Set(), we could allocate all new tree entries at once in an array to reduce garbage. This may be difficult/inefficient (since you probably don't know the number of new entries needed in advance). It also suffers from memory leak problems (like the "fromarray" constructor), since all arrays allocated in Set() may end up chained (referencing each other), not allowing the GC to deallocate them individually.
  • Introduce a new Set(keyValue1, keyValue2, …) operation to allocate all new tree entries at once. It also has the memory leak problems described above. (This could be particularly beneficial in the case of persistent lists.)
  • To avoid the leak problems above, maybe there's a way to efficiently allocate multiple individual objects (from the GC's perspective) at once? That would reduce the time spent allocating memory, but it won't reduce GC pressure.
  • Use a key+value array instead of a tree when/if the number of keys is small. In that case the O(log n) tree lookups might not be worth it, and I think we would save garbage. We could copy the whole array in every call to Set(), which would require a single allocation.
  • Another option is to keep the keys sorted (keeping lookups at O(log n) but making updates linear time).
  • Use native Go maps with size-limited byte arrays (instead of strings) to implement immutable maps. For sizes below 128 bytes, keys/values are stored inline and don't incur extra memory allocations; see https://github.com/golang/go/blob/master/src/runtime/hashmap.go#L70-L75
  • Forget about immutable data structures, since Go's GC is not optimized for them (see the notes below on Haskell's collector and how it exploits properties of immutable references to deal with huge amounts of intermediate garbage). As I understand it, our use-case is about simplifying reasoning about reports (sharing them without locks across goroutines) and not so much about performance. We will have to live with mutable types if we don't find a solution for the performance problems caused by immutable data structures. I recall @tomwilkie mentioned multiple times that immutable data structures were providing performance gains, but I think that was only because we were initially copying mutable data structures without paying attention to allocations (i.e. not using make) to implement immutable operations. It would be worth revisiting where those performance improvements were coming from.
  • As a middle-ground solution we can have mutable variants of merge operations, which should only be used when we own the only reference to the destination report. As brought up by @paulbellamy, we could implement something like Clojure's transients http://clojure.org/reference/transients . To prevent the mutable trees from escaping by error, we could try to imitate Haskell's runST and only let you run mutable operations in a specific function, something like func fromNewUnsafeMap(func(unsafeMutableMap a)) safeImmutableMap, which implicitly freezes the mutable value at the end (sketched below). Unlike in Haskell, nothing prevents you from explicitly copying the mutable map, but at least it simplifies ensuring that we only apply mutable operations to trees we create from scratch.
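
A minimal sketch of that builder, reusing the hypothetical names from the item above (unsafeMutableMap, fromNewUnsafeMap) and using a plain Go map where the real code would use the persistent tree:

// unsafeMutableMap is the scratch structure handed to the builder; only
// the builder function ever sees it.
type unsafeMutableMap map[string]string

// safeImmutableMap wraps the finished map behind a read-only interface.
type safeImmutableMap struct{ m map[string]string }

func (s safeImmutableMap) Lookup(k string) (string, bool) {
  v, ok := s.m[k]
  return v, ok
}

// fromNewUnsafeMap creates a fresh mutable map, lets the builder apply
// cheap in-place updates, and implicitly "freezes" it by returning only
// the read-only wrapper, imitating Haskell's runST: mutation stays
// confined to maps we created from scratch.
func fromNewUnsafeMap(build func(unsafeMutableMap)) safeImmutableMap {
  m := unsafeMutableMap{}
  build(m)
  return safeImmutableMap{m: m}
}

For example, frozen := fromNewUnsafeMap(func(m unsafeMutableMap) { m["pid"] = "42" }) builds and freezes in one step; nothing outside the closure ever holds a mutable reference.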

Tongue-in-cheek:

  • Use a language with a generational GC since immutable data structures comply with the high infant mortality hypothesis.
  • Use a GC which lets you exploit the properties of immutable data structures (references can only point to values allocated earlier; see https://wiki.haskell.org/GHC/Memory_Management and https://blogs.janestreet.com/building-a-lower-latency-gc/ ). It would be cool to implement a separate allocator/GC which exploited this property, but we would need help from Go's GC to track deallocation of escaping roots, with something like finalizers.
  • Use a language with a reference counting GC, which can let us resort to mutable operations if the reference count of the map is 1.

rade added a commit that referenced this issue Jul 28, 2016
Merge() is always returning a copy, so there is no need to Copy()
struct fields first before calling Merge() on them.

This reduces GC pressure (#1010) and helps overall app performance
(#1457).
@2opremio (Contributor, Author) commented Jul 29, 2016

Here's the object allocation profile and flamegraph of the Scope standalone app while monitoring Weave Cloud after the recent optimizations (namely #1728, #1720, #1709):

pprof.localhost:4040.alloc_objects.alloc_space.001.pb.gz

[flamegraph: app_allocs]

I think I am going to implement mutable variants of the merge functions next.

@2opremio (Contributor, Author) commented Jul 30, 2016

After seeing the positive results in #1732 and the lack of results in #1724, we have learned that:

  • Copying is remarkably cheaper than object allocation in Go (due to the characteristics of Go's GC, which is optimized for latency rather than throughput, and is not generational).
  • String allocations are dominating the garbage being generated.

Thus, it would be worth considering using key-value slices with fixed-size strings as maps (e.g. LatestMaps), something along the lines of:

const maxSize = 128

type fixedSizedString struct {
  content [maxSize]byte
  used    int
}

type kv struct {
  k, v fixedSizedString
}

type fixedMap []kv

To allow for unbounded strings we could add an optional extra slice, which we only allocate if the string doesn't fit in maxSize (which should rarely happen, if at all).

Also, for quicker access, the keys could be kept ordered (allowing O(log n) binary search), or we could store a hash of each key to speed up the O(n) scan:

const maxSize = 128

type fixedSizedString struct {
  content [maxSize]byte
  used    int
  extra   []byte
}

type kv struct {
  k, v fixedSizedString
  kHash int64
}

type fixedMap []kv
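
Building on the types above, Set could then cost a single allocation: copy the whole slice once and write the entry in place. Everything here is a sketch, not a drop-in implementation, and kHash maintenance is omitted for brevity:

// makeFixedSizedString stores s inline, spilling into the extra slice
// only when s is longer than maxSize.
func makeFixedSizedString(s string) fixedSizedString {
  var f fixedSizedString
  f.used = copy(f.content[:], s)
  if f.used < len(s) {
    f.extra = []byte(s[f.used:])
  }
  return f
}

// equals compares the stored string against s without rebuilding it.
func (f fixedSizedString) equals(s string) bool {
  if f.used > len(s) || string(f.content[:f.used]) != s[:f.used] {
    return false
  }
  return string(f.extra) == s[f.used:]
}

// Set copies the backing slice once and then updates it in place,
// overwriting the value if the key already exists or appending otherwise.
func (m fixedMap) Set(k, v string) fixedMap {
  out := make(fixedMap, len(m), len(m)+1)
  copy(out, m)
  for i := range out {
    if out[i].k.equals(k) {
      out[i].v = makeFixedSizedString(v)
      return out
    }
  }
  return append(out, kv{k: makeFixedSizedString(k), v: makeFixedSizedString(v)})
}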

@rade rade modified the milestones: July2016, August2016 Aug 4, 2016
@2opremio (Contributor, Author) commented Aug 9, 2016

Here's another idea to reduce garbage collection pressure. Cache the keys in the LatestMap like @tomwilkie did here:

2f760f2

@rade rade modified the milestones: 0.18/1.0, October2016 Sep 15, 2016
@rade rade modified the milestones: October2016, 0.18/1.0 Sep 15, 2016
@2opremio (Contributor, Author) commented Dec 5, 2016

Another possibility is to intern string keys in general.
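
A sketch of such an interner, assuming a small shared table guarded by a mutex (illustrative only; the real thing would live close to report decoding):

import "sync"

// interner maps each key to a canonical copy, so the thousands of
// identical keys decoded from successive reports share one allocation
// instead of each surviving as separate garbage.
type interner struct {
  mu sync.Mutex
  m  map[string]string
}

func newInterner() *interner {
  return &interner{m: map[string]string{}}
}

// Intern returns the canonical copy of s, storing s itself if it is the
// first occurrence.
func (i *interner) Intern(s string) string {
  i.mu.Lock()
  defer i.mu.Unlock()
  if canonical, ok := i.m[s]; ok {
    return canonical
  }
  i.m[s] = s
  return s
}

Decoders would then call Intern on every key they read, so repeated topology keys end up pointing at the same backing bytes.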

@bboreham (Collaborator)

Most of this has been done now.
