
Reduce Garbage Collection pressure #1010

Closed
2opremio opened this issue Feb 23, 2016 · 10 comments

Labels
  chore: Related to fix/refinement/improvement of end user or new/existing developer functionality
  performance: Excessive resource usage and latency; usually a bug or chore

@2opremio (Contributor)

We can see from #812 (specifically #812 (comment)) and, more importantly, from #854 (specifically #854 (comment)) that the current main performance bottleneck is Garbage Collector pressure:

Note how runtime.scanobject accounts for almost 50% of the CPU time (scanobject+mallocgc) and how ps.Cons+ps.Set generate almost 30% of the garbage.

@2opremio 2opremio added the performance Excessive resource usage and latency; usually a bug or chore label Feb 25, 2016
@2opremio 2opremio added this to the 0.14.0 milestone Mar 4, 2016
@tomwilkie tomwilkie removed this from the 0.14.0 milestone Mar 17, 2016
@2opremio 2opremio modified the milestone: July2016 Jun 30, 2016
@rade rade added the chore Related to fix/refinement/improvement of end user or new/existing developer functionality label Jul 4, 2016
@rade (Member) commented Jul 19, 2016

I think I have identified a potentially quite significant improvement here...

While LatestMap.Merge is commutative, the order nevertheless matters for performance: when the newer entries are already in the target map rather than in the argument map, we avoid allocation altogether.
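
To make the order dependence concrete, here is a minimal sketch of the idea, using a plain Go map as a stand-in for the persistent LatestMap; the types and method bodies are illustrative assumptions, not Scope's actual implementation:

import "time"

// entry is a stand-in for a LatestMap entry: a value plus the probe-side
// timestamp at which it was observed.
type entry struct {
  Timestamp time.Time
  Value     string
}

type latestMap map[string]entry

// Merge keeps, per key, the entry with the newest timestamp. It only copies
// the receiver once the argument actually contributes a newer entry; when the
// target map already holds the newest data, no allocation happens at all,
// which is why feeding reports in newest-first order is cheaper.
func (m latestMap) Merge(other latestMap) latestMap {
  out := m
  copied := false
  for k, e := range other {
    if cur, ok := out[k]; ok && !e.Timestamp.After(cur.Timestamp) {
      continue // target already has the newer entry: no work, no garbage
    }
    if !copied {
      out = make(latestMap, len(m)+len(other)) // copy-on-first-write
      for k2, e2 := range m {
        out[k2] = e2
      }
      copied = true
    }
    out[k] = e
  }
  return out
}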

AFAICT, both collector and awsCollector invoke the merger with a list of reports that is largely in oldest to youngest order. SmartMerger does some hashing, as a result of which reports will be merged in random order.

Ideally SmartMerger would first merge reports from the same probe in reverse timestamp order, and only then merge them together with its tree-merge.

Another (tiny) optimisation: SmartMerger has this line:

report.MakeReport().Merge(left.rpt).Merge(right.rpt),

but note that report merging already copies the lhs, so the above really should just be

left.rpt.Merge(right.rpt)

(This won't save much since presumably copying an empty report is cheap).

Yet another, more substantial optimisation: merging should be n-ary, not binary, to avoid creation and copying of intermediate data structures.

@rade (Member) commented Jul 19, 2016

Re merging reports in newest-to-oldest order... note that it's actually the probe-side timestamps that matter here, since that is what LatestMap.Merge compares. Alas, in the awsCollector reports are keyed on the app timestamp, so successive reports from a single probe could hit different app instances and receive timestamps that effectively swap their order. Hopefully that is rare.

@rade (Member) commented Jul 19, 2016

Yet another, more substantial optimisation: merging should be n-ary, not binary, to avoid creation and copying of intermediate data structures.

Doing that will make SmartMerger superfluous, which in turn means we can do a dumb merge in reverse report timestamp order: start with a blank report, merge in the youngest report (this will end up copying that report), then the 2nd youngest (this will copy new entries and overwrite existing entries only where we see a newer timestamp, which should be rare), etc.

In the optimal case this will only allocate memory equivalent to the final merged report's size, and we will usually get close to that optimum.

As an additional optimisation, note that the blank report we start with and modify, including all its components, does not have to be a persistent map. It may need "freezing", i.e. converting to a ps map, at the end, depending on what happens to the result.
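
A rough sketch of that n-ary, reverse-timestamp-order merge (names, the map-of-strings payload, and the per-key rule are simplifying assumptions; the real merge would compare per-entry timestamps on the richer report.Report type):

import (
  "sort"
  "time"
)

// timedReport pairs a simplified report payload with its probe-side timestamp.
type timedReport struct {
  Timestamp time.Time
  Latest    map[string]string
}

// MergeAll merges n reports in one pass into a single plain (mutable)
// accumulator, so no intermediate merged reports are allocated. The
// accumulator could be "frozen" into a persistent map at the end if the
// caller needs that.
func MergeAll(reports []timedReport) map[string]string {
  // Newest first: older reports mostly find their keys already present
  // and are skipped, so we rarely overwrite and rarely allocate.
  sort.Slice(reports, func(i, j int) bool {
    return reports[i].Timestamp.After(reports[j].Timestamp)
  })
  out := map[string]string{}
  for _, r := range reports {
    for k, v := range r.Latest {
      if _, ok := out[k]; !ok {
        out[k] = v
      }
    }
  }
  return out
}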

@2opremio (Contributor, Author)

Another thing which could potentially make a difference: Field allocations on Node initialization.

All topologies use the same generic type of node, but their field usage is different. However, some fields (e.g. Metrics, Controls) are initialized with non-nil values, which will cause allocations even if they're not used.

It might be worth initializing those fields to nil (or migrating the Metrics and StringSets to immutable data structures, whose empty value can be a shared global).
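
A sketch of the nil-until-needed approach (the field names follow the discussion above; the real report.Node has more fields and different types):

// Node is a cut-down stand-in for report.Node.
type Node struct {
  Latest   map[string]string
  Metrics  map[string][]float64 // unused by some topologies
  Controls []string             // unused by some topologies
}

// MakeNode only allocates the field every topology needs; Metrics and
// Controls stay nil, which costs nothing and still reads safely
// (len(nil) == 0, ranging over nil is a no-op).
func MakeNode() Node {
  return Node{Latest: map[string]string{}}
}

// AddMetric allocates the Metrics map lazily, on first write.
func (n *Node) AddMetric(key string, sample float64) {
  if n.Metrics == nil {
    n.Metrics = map[string][]float64{}
  }
  n.Metrics[key] = append(n.Metrics[key], sample)
}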

@2opremio (Contributor, Author) commented Jul 25, 2016

Here are some notes I've been collecting over the weekend. The Scope App flamegraphs/profiles I've obtained show that the Set() operation of the immutable maps we use generates a lot of garbage: it allocates a new node for each element in the tree path to the target key, and even in the best-case scenario it makes one allocation (for the root node). Here are some potential improvements:

  • Try other immutable maps (radix tree, HAMT). They may generate less garbage (although I doubt it). Here are some examples: https://github.com/hashicorp/go-immutable-radix , https://github.com/mediocregopher/seq . See how it's done in Clojure: http://clojure.org/reference/transients , http://nyeggen.com/post/2013-08-07-persistent/ , http://blog.higher-order.net/2009/09/08/understanding-clojures-persistenthashmap-deftwice.html , https://groups.google.com/forum/m/#!topic/clojure/4kUQ9JmljyA
  • We should verify trees are being properly balanced (could the hashing operation used be leading to unbalanced trees causing more allocations due to longer paths?)
  • To construct maps, we incrementally call Set() for each map element, which causes a lot of garbage. Instead, we could use a "fromarray" constructor in which all initial keys and values are provided at once:
    • It lets you use mutable operations during creation since you know you have the only reference to the map.
    • It lets you allocate all tree entries at once in an array. However, this will result in undesired (and maybe even intolerable?) leaks, since individual tree entries can no longer be deallocated individually by the GC. For instance, when a map B is created from a given map A (e.g. using Set), B will still reference the original array even if a whole subtree is no longer needed; when A goes out of scope, those subtree entries cannot be deallocated by the GC since they live in the same array. This may be a showstopper due to long-term in-process caches (rendering cache, ...).
    • It can let you build the map in O(n) instead of O(n log n) (the cost of creating it through repeated external Set() calls). Depending on the persistent map implementation, this may have some requirements, like providing an array sorted by the keys (not the case in the hash tree we are using, since it's not sorted).
  • In Set(), we could allocate all new tree entries at once in an array to reduce garbage. This may be difficult/inefficient (since you probably don't know the number of new entries needed in advance). It also suffers from memory leak problems (like the "fromarray" constructor), since all arrays allocated in Set() may end up chained (referencing each other), not allowing the GC to deallocate them individually.
  • Introduce a new Set(keyValue1, keyValue2, …) operation to allocate all new tree entries at once. It also has the memory leak problems described above. (This could be particularly beneficial in the case of persistent lists.)
  • To avoid the leak problems above, maybe there's a way to efficiently allocate multiple individual objects (from the GC's perspective) at once? That would reduce the time spent allocating memory, but it won't reduce GC pressure.
  • Use a key+value array instead of a tree when/if the number of keys is small. In that case the O(log n) tree lookups might not be worth it, and I think we would save garbage. We could copy the whole array in every call to Set(), which would require a single allocation.
  • Another option is to keep the keys sorted (keeping lookups at O(log n) but making updates linear time).
  • Use native Go maps with size-limited byte arrays (instead of strings) to implement immutable maps. For sizes below 128 bytes, keys/values are stored inline and don't incur extra memory allocations; see https://github.com/golang/go/blob/master/src/runtime/hashmap.go#L70-L75
  • Forget about immutable data structures, since Go's GC is not optimized for them (see the notes below on Haskell's collector and how it exploits properties of immutable references to deal with huge amounts of intermediate garbage). As I understand it, our use-case is about simplifying reasoning about reports (sharing them without locks across goroutines) and not so much about performance. We will have to live with mutable types if we don't find a solution for the performance problems caused by immutable data structures. I recall @tomwilkie mentioned multiple times that immutable data structures were providing performance gains, but I think that was only because we were initially copying mutable data structures without paying attention to allocations (i.e. not using make) to implement immutable operations. It would be worth revisiting where those performance improvements were coming from.
  • As a middle-ground solution we can have mutable variants of merge operations, which should only be used when we own the only reference to the destination report. As brought up by @paulbellamy, we could implement something like Clojure's transients http://clojure.org/reference/transients . To prevent the mutable trees from escaping by error, we could try to imitate Haskell's runST and only let you run mutable operations in a specific function, something like func fromNewUnsafeMap(func(unsafeMutableMap a)) safeImmutableMap, which implicitly freezes the mutable value at the end (sketched below). Unlike in Haskell, nothing prevents you from explicitly copying the mutable map, but at least it simplifies ensuring that we only apply mutable operations to trees we create from scratch.
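
A minimal sketch of that builder, reusing the hypothetical names from the item above (unsafeMutableMap, fromNewUnsafeMap) and using a plain Go map where the real code would use the persistent tree:

// unsafeMutableMap is the scratch structure handed to the builder; only
// the builder function ever sees it.
type unsafeMutableMap map[string]string

// safeImmutableMap wraps the finished map behind a read-only interface.
type safeImmutableMap struct{ m map[string]string }

func (s safeImmutableMap) Lookup(k string) (string, bool) {
  v, ok := s.m[k]
  return v, ok
}

// fromNewUnsafeMap creates a fresh mutable map, lets the builder apply
// cheap in-place updates, and implicitly "freezes" it by returning only
// the read-only wrapper, imitating Haskell's runST: mutation stays
// confined to maps we created from scratch.
func fromNewUnsafeMap(build func(unsafeMutableMap)) safeImmutableMap {
  m := unsafeMutableMap{}
  build(m)
  return safeImmutableMap{m: m}
}

For example, frozen := fromNewUnsafeMap(func(m unsafeMutableMap) { m["pid"] = "42" }) builds and freezes in one step; nothing outside the closure ever holds a mutable reference.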

Tongue-in-cheek:

  • Use a language with a generational GC since immutable data structures comply with the high infant mortality hypothesis.
  • Use a GC which lets you exploit the properties of immutable data structures (references can only point to values allocated earlier; see https://wiki.haskell.org/GHC/Memory_Management and https://blogs.janestreet.com/building-a-lower-latency-gc/ ). It would be cool to implement a separate allocator/GC which exploited this property, but we would need help from Go's GC to track deallocation of escaping roots, with something like finalizers.
  • Use a language with a reference counting GC, which can let us resort to mutable operations if the reference count of the map is 1.

rade added a commit that referenced this issue Jul 28, 2016
Merge() is always returning a copy, so there is no need to Copy()
struct fields first before calling Merge() on them.

This reduces GC pressure (#1010) and helps overall app performance
(#1457).
@2opremio (Contributor, Author) commented Jul 29, 2016

Here's the object allocation profile and flamegraph of the Scope standalone app while monitoring Weave Cloud after the recent optimizations (namely #1728, #1720, #1709):

pprof.localhost:4040.alloc_objects.alloc_space.001.pb.gz

[flamegraph: app_allocs]

I think I am going to implement mutable variants of the merge functions next.

@2opremio (Contributor, Author) commented Jul 30, 2016

After seeing the positive results in #1732 and the lack of results in #1724, we have learned that:

  • Copying is remarkably cheaper than object allocation in Go (due to the characteristics of Go's GC, which is optimized for latency rather than throughput, and is not generational).
  • String allocations are dominating the garbage being generated.

Thus, it would be worth considering using key-value slices with fixed-size strings as maps (e.g. LatestMaps), something along the lines of:

const maxSize = 128

type fixedSizedString struct {
  content [maxSize]byte
  used    int
}

type kv struct {
  k, v fixedSizedString
}

type fixedMap []kv

To allow for unbounded strings we could add an optional extra slice, which we only allocate if the string doesn't fit in maxSize (which should rarely happen, if at all).

Also, for quicker access, the keys could be kept ordered (allowing O(log n) binary search), or we could store a hash of each key to speed up the O(n) scan:

const maxSize = 128

type fixedSizedString struct {
  content [maxSize]byte
  used    int
  extra   []byte
}

type kv struct {
  k, v fixedSizedString
  kHash int64
}

type fixedMap []kv
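
Building on the types above, Set could then cost a single allocation: copy the whole slice once and write the entry in place. Everything here is a sketch, not a drop-in implementation, and kHash maintenance is omitted for brevity:

// makeFixedSizedString stores s inline, spilling into the extra slice
// only when s is longer than maxSize.
func makeFixedSizedString(s string) fixedSizedString {
  var f fixedSizedString
  f.used = copy(f.content[:], s)
  if f.used < len(s) {
    f.extra = []byte(s[f.used:])
  }
  return f
}

// equals compares the stored string against s without rebuilding it.
func (f fixedSizedString) equals(s string) bool {
  if f.used > len(s) || string(f.content[:f.used]) != s[:f.used] {
    return false
  }
  return string(f.extra) == s[f.used:]
}

// Set copies the backing slice once and then updates it in place,
// overwriting the value if the key already exists or appending otherwise.
func (m fixedMap) Set(k, v string) fixedMap {
  out := make(fixedMap, len(m), len(m)+1)
  copy(out, m)
  for i := range out {
    if out[i].k.equals(k) {
      out[i].v = makeFixedSizedString(v)
      return out
    }
  }
  return append(out, kv{k: makeFixedSizedString(k), v: makeFixedSizedString(v)})
}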

@rade rade modified the milestones: July2016, August2016 Aug 4, 2016
@2opremio (Contributor, Author) commented Aug 9, 2016

Here's another idea to reduce garbage collection pressure. Cache the keys in the LatestMap like @tomwilkie did here:

2f760f2

@rade rade modified the milestones: 0.18/1.0, October2016 Sep 15, 2016
@rade rade modified the milestones: October2016, 0.18/1.0 Sep 15, 2016
@2opremio (Contributor, Author) commented Dec 5, 2016

Another possibility is to intern string keys in general.
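
A sketch of such an interner, assuming a small shared table guarded by a mutex (illustrative only; the real thing would live close to report decoding):

import "sync"

// interner maps each key to a canonical copy, so the thousands of
// identical keys decoded from successive reports share one allocation
// instead of each surviving as separate garbage.
type interner struct {
  mu sync.Mutex
  m  map[string]string
}

func newInterner() *interner {
  return &interner{m: map[string]string{}}
}

// Intern returns the canonical copy of s, storing s itself if it is the
// first occurrence.
func (i *interner) Intern(s string) string {
  i.mu.Lock()
  defer i.mu.Unlock()
  if canonical, ok := i.m[s]; ok {
    return canonical
  }
  i.m[s] = s
  return s
}

Decoders would then call Intern on every key they read, so repeated topology keys end up pointing at the same backing bytes.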

@bboreham (Collaborator)

Most of this has been done now.
