Probe using 70% CPU #812

Closed
tomwilkie opened this issue Jan 13, 2016 · 24 comments

@tomwilkie

@2opremio reports the probe is using 70% CPU on the service hosts.

tomwilkie added this to the 0.12.0 milestone Jan 13, 2016
@2opremio

I restarted the probes with profiling (--probe.http.listen :4041) and curled http://localhost:4041/debug/pprof/profile from one of the probes: http://filebin.ca/2TJeW7veYKFA/profile.gz
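
For context, that endpoint is Go's standard net/http/pprof handler. A minimal sketch of how a Go process typically exposes it (generic illustration, not Scope's actual wiring):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// With the pprof handlers registered, a 30s CPU profile can be pulled with
	// curl http://localhost:4041/debug/pprof/profile, or by pointing
	// `go tool pprof` at that URL.
	log.Fatal(http.ListenAndServe(":4041", nil))
}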

@tomwilkie

(profile graph: cpu)

@tomwilkie

From a cursory analysis, the probes are spending:

  • 30% of the CPU reading the /proc/PID/net/tcp files
  • maybe <10% of CPU on GC (<3s of the 30s), according to that trace
  • 10% on copying within the report structure

I suspect fixing the copying will also fix the GC, but I can’t think of anything we can do to ‘fix’ the /proc CPU usage, as this has already been heavily optimised (we only read the tcp files for network namespaces we haven't seen before).

I guess we can decrease the polling frequency / proc spy interval, and maybe even come up with some rate-limiting scheme (i.e. only read 10 files/s), which would give us roughly constant CPU usage and, at larger scales, would mean the UI updates less frequently.
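
A minimal sketch of that kind of rate limiting (hypothetical helper, not the change that eventually landed): a ticker caps /proc reads at a fixed rate, so the CPU cost stays roughly constant however busy the host is.

package main

import (
	"fmt"
	"io/ioutil"
	"time"
)

// procReader caps how many /proc files are read per second, trading report
// freshness for a roughly constant CPU cost as hosts get busier.
type procReader struct {
	ticker *time.Ticker
}

func newProcReader(readsPerSecond int) *procReader {
	return &procReader{ticker: time.NewTicker(time.Second / time.Duration(readsPerSecond))}
}

// readFile blocks until the next tick before performing the read.
func (r *procReader) readFile(path string) ([]byte, error) {
	<-r.ticker.C
	return ioutil.ReadFile(path)
}

func main() {
	reader := newProcReader(10) // at most 10 reads/s
	buf, err := reader.readFile("/proc/self/net/tcp")
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	fmt.Printf("read %d bytes\n", len(buf))
}

A real implementation would also need to handle the race with /proc/PID/net/tcp mentioned further down the thread.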

@tomwilkie

@2opremio tried that suggestion (./scope -probe.spy.interval=3s) and confirmed CPU dropped to 20%, as expected.

@2opremio

2opremio commented Feb 4, 2016

The situation improves considerably after #898. Now the probes (2ed3968) are using 30-40% CPU, with peaks of 50%.

Here are two CPU profiles I obtained from two different probes:

pprof.localhost_4041.samples.cpu.001.pb.gz (profile graph: probe)

pprof.localhost_4041.samples.cpu.002.pb.gz (profile graph: probe2)

  • The probes are spending a long time encoding reports. A non-negligible amount of time (10% in one of the profiles) is spent sending gob types, which can be brought to zero through any of the solutions proposed in App using lots of CPU #854 (comment).
  • Although the reads of /proc/*/net/tcp{,6} are not the bottleneck anymore thanks to Fix reading of network namespace inodes #898, the stat()s of /proc/*/fd/* are still quite expensive (10% and 35% in the profiles above). So we should maybe reconsider [WIP] Rate limit reading expensive files from /proc #814, which I should have tested with more open files: sudo bash -c "echo /proc/*/fd/* | wc -w" prints numbers up to ~20K on the service machines, whereas my own machine only has 1K.
  • Kubernetes support (which is enabled in the probes) takes around 10% of the time in the second profile (I think this is the cause of the 10% CPU peaks), almost exclusively matching pods:
$ go tool pprof  ~/home-vagrant/scope-tested pprof.localhost_4041.samples.cpu.002.pb.gz 
Entering interactive mode (type "help" for commands)
(pprof) list podTopology.func1
Total: 11.78s
ROUTINE ======================== github.com/weaveworks/scope/probe/kubernetes.(*Reporter).podTopology.func1 in /go/src/github.com/weaveworks/scope/probe/kubernetes/reporter.go
      10ms      1.16s (flat, cum)  9.85% of Total
         .          .     51:}
         .          .     52:
         .          .     53:func (r *Reporter) podTopology(services []Service) (report.Topology, report.Topology, error) {
         .          .     54:   pods, containers := report.MakeTopology(), report.MakeTopology()
         .          .     55:   err := r.client.WalkPods(func(p Pod) error {
      10ms       10ms     56:           for _, service := range services {
         .      1.10s     57:                   if service.Selector().Matches(p.Labels()) {
         .          .     58:                           p.AddServiceID(service.ID())
         .          .     59:                   }
         .          .     60:           }
         .          .     61:           nodeID := report.MakePodNodeID(p.Namespace(), p.Name())
         .       40ms     62:           pods = pods.AddNode(nodeID, p.GetNode())
         .          .     63:
         .          .     64:           container := report.MakeNodeWith(map[string]string{
         .          .     65:                   PodID:     p.ID(),
         .          .     66:                   Namespace: p.Namespace(),
         .       10ms     67:           }).WithParents(report.EmptySets.Add(report.Pod, report.MakeStringSet(nodeID)))
         .          .     68:           for _, containerID := range p.ContainerIDs() {
         .          .     69:                   containers.AddNode(report.MakeContainerNodeID(containerID), container)
         .          .     70:           }
         .          .     71:           return nil
         .          .     72:   })
(pprof) list Selector
Total: 11.78s
ROUTINE ======================== github.com/weaveworks/scope/probe/kubernetes.(*service).Selector in /go/src/github.com/weaveworks/scope/probe/kubernetes/service.go
         0      1.04s (flat, cum)  8.83% of Total
         .          .     47:
         .          .     48:func (s *service) Selector() labels.Selector {
         .          .     49:   if s.Spec.Selector == nil {
         .          .     50:           return labels.Nothing()
         .          .     51:   }
         .      1.04s     52:   return labels.SelectorFromSet(labels.Set(s.Spec.Selector))
         .          .     53:}
         .          .     54:
         .          .     55:func (s *service) GetNode() report.Node {
         .          .     56:   return report.MakeNodeWith(map[string]string{
         .          .     57:           ServiceID:      s.ID(),

@paulbellamy Any ideas on how to improve this? SelectorFromSet seems prohibitively expensive, mostly due to the regular-expression matching, so maybe we should fix that (a caching sketch follows the listings below):

(pprof) list SelectorFromSet
Total: 11.78s
ROUTINE ======================== github.com/weaveworks/scope/vendor/k8s.io/kubernetes/pkg/labels.SelectorFromSet in /go/src/github.com/weaveworks/scope/vendor/k8s.io/kubernetes/pkg/labels/selector.go
      30ms      1.04s (flat, cum)  8.83% of Total
         .          .    719:func SelectorFromSet(ls Set) Selector {
         .          .    720:   if ls == nil {
         .          .    721:           return LabelSelector{}
         .          .    722:   }
         .          .    723:   var requirements []Requirement
      20ms       30ms    724:   for label, value := range ls {
         .      890ms    725:           if r, err := NewRequirement(label, EqualsOperator, sets.NewString(value)); err != nil {
         .          .    726:                   //TODO: double check errors when input comes from serialization?
         .          .    727:                   return LabelSelector{}
         .          .    728:           } else {
      10ms       80ms    729:                   requirements = append(requirements, *r)
         .          .    730:           }
         .          .    731:   }
         .       40ms    732:   return LabelSelector(requirements)
         .          .    733:}
(pprof) list NewRequirement
Total: 11.78s
ROUTINE ======================== github.com/weaveworks/scope/vendor/k8s.io/kubernetes/pkg/labels.NewRequirement in /go/src/github.com/weaveworks/scope/vendor/k8s.io/kubernetes/pkg/labels/selector.go
      20ms      660ms (flat, cum)  5.60% of Total
         .          .    105:// (5) The key is invalid due to its length, or sequence
         .          .    106://     of characters. See validateLabelKey for more details.
         .          .    107://
         .          .    108:// The empty string is a valid value in the input values set.
         .          .    109:func NewRequirement(key string, op Operator, vals sets.String) (*Requirement, error) {
      10ms      260ms    110:   if err := validateLabelKey(key); err != nil {
         .          .    111:           return nil, err
         .          .    112:   }
         .          .    113:   switch op {
         .          .    114:   case InOperator, NotInOperator:
         .          .    115:           if len(vals) == 0 {
         .          .    116:                   return nil, fmt.Errorf("for 'in', 'notin' operators, values set can't be empty")
         .          .    117:           }
      10ms       10ms    118:   case EqualsOperator, DoubleEqualsOperator, NotEqualsOperator:
         .          .    119:           if len(vals) != 1 {
         .          .    120:                   return nil, fmt.Errorf("exact-match compatibility requires one single value")
         .          .    121:           }
         .       10ms    122:   case ExistsOperator, DoesNotExistOperator:
         .          .    123:           if len(vals) != 0 {
         .          .    124:                   return nil, fmt.Errorf("values set must be empty for exists and does not exist")
         .          .    125:           }
         .          .    126:   default:
         .          .    127:           return nil, fmt.Errorf("operator '%v' is not recognized", op)
         .          .    128:   }
         .          .    129:
         .       20ms    130:   for v := range vals {
         .      320ms    131:           if err := validateLabelValue(v); err != nil {
         .          .    132:                   return nil, err
         .          .    133:           }
         .          .    134:   }
         .       40ms    135:   return &Requirement{key: key, operator: op, strValues: vals}, nil
         .          .    136:}
         .          .    137:
         .          .    138:// Matches returns true if the Requirement matches the input Labels.
         .          .    139:// There is a match in the following cases:
         .          .    140:// (1) The operator is Exists and Labels has the Requirement's key.
(pprof) 
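
A self-contained sketch of the caching idea mentioned above (generic code against the vendored k8s.io/kubernetes/pkg/labels package, not the fix that was eventually merged): build each service's selector once per report, instead of calling SelectorFromSet for every (service, pod) pair inside WalkPods.

package main

import (
	"fmt"

	"k8s.io/kubernetes/pkg/labels"
)

func main() {
	serviceSelectors := []map[string]string{
		{"app": "frontend"},
		{"app": "backend"},
	}
	podLabels := []map[string]string{
		{"app": "frontend", "tier": "web"},
		{"app": "backend", "tier": "db"},
	}

	// Hoist SelectorFromSet out of the pod loop: it validates every label
	// key/value (regex matching), which is what dominates the listings above
	// when it runs once per (service, pod) pair.
	selectors := make([]labels.Selector, len(serviceSelectors))
	for i, s := range serviceSelectors {
		selectors[i] = labels.SelectorFromSet(labels.Set(s))
	}

	for _, pl := range podLabels {
		for i, sel := range selectors {
			if sel.Matches(labels.Set(pl)) {
				fmt.Printf("pod %v matches service %d\n", pl, i)
			}
		}
	}
}

The same effect could presumably be had by caching the Selector inside the service wrapper itself.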

@tomwilkie

We made the gob encoding once-per-report a while back (it used to be once-per-report-per-app). That was a significant improvement. If we want to do gob-over-websocket, we'll have to go back to a separate encoder per app (as different app connections will have different lifecycles), which I worry will be a regression in CPU usage.

How many probes are there? How many apps? I suspect we might just be hitting the O(n*m) nature of this architecture. Going for 2 apps and N probes should be an improvement?

Other things you highlight also look good:

  • rate limit reading of FDs, although due to the race with /proc/PID/net/tcp this will have to look like a periodic reading of namespaces, with some dynamic feedback based on the number of fds in a namespace, which might be complicated
  • Kubernetes optimisations

Thanks for working on this, fons!

@2opremio

2opremio commented Feb 4, 2016

We made the gob encoding once-per-report a while back (it used to be once-per-report-per-app). That was a significant improvement. If we want to do gob-over-websocket, we'll have to go back to a separate encoder per app (as different app connections will have different lifecycles), which I worry will be a regression in CPU usage.

Thanks, I didn't think about the (apps * probes) scenario. This use case involves multiple probes (5) reporting to a single app (which should be the most common one ... if we were not doing (apps * probes) in standalone mode).

However, take into account that, with websockets, the type-sending part of the encoding (apparently the most expensive one) will only be done once per app. With HTTP, it's done once per report. My intuition says that with gob, up to a certain number (N) of reports, it will be cheaper to encode a report multiple times with an existing encoder than to encode it only once with a new encoder each time.

Either way, the fact that we need to make clear-cut performance choices for the gob codec between ...

  • Apps connecting to multiple probes
  • Probes connecting to multiple apps.

... is another sign that gob is a bad fit due to its stateful nature. We should probably be using a codec that lets you optimize globally instead of per-stream.
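
A tiny self-contained illustration of that statefulness (not Scope code): the first value written to a gob stream carries the type description, later values on the same encoder do not, and every fresh encoder pays the type cost again.

package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
	"log"
)

type Report struct {
	Hostname  string
	Adjacency map[string][]string
}

func main() {
	r := Report{Hostname: "probe-1", Adjacency: map[string][]string{"a": {"b"}}}

	// One long-lived encoder (the websocket case): the type description is
	// sent once, after which each report costs only its data.
	var stream bytes.Buffer
	enc := gob.NewEncoder(&stream)
	if err := enc.Encode(r); err != nil {
		log.Fatal(err)
	}
	first := stream.Len()
	if err := enc.Encode(r); err != nil {
		log.Fatal(err)
	}
	second := stream.Len() - first

	// A new encoder per report (the HTTP case): the type description is
	// re-sent every time.
	var oneShot bytes.Buffer
	if err := gob.NewEncoder(&oneShot).Encode(r); err != nil {
		log.Fatal(err)
	}

	fmt.Printf("first value on stream:     %d bytes\n", first)
	fmt.Printf("next value on same stream: %d bytes\n", second)
	fmt.Printf("value on a fresh encoder:  %d bytes\n", oneShot.Len())
}

The gap between the first and subsequent values on the same stream is the per-stream type cost being discussed.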

How many probes are there? How many apps? I suspect we might just be hitting the O(n*m) nature of this architecture. Going for 2 apps and N probes should be an improvement?

5 probes, 1 app. Yep, splitting the app would help performance-wise, but it would make Scope unusable on Kubernetes, since the user normally doesn't have a say in where the apps end up.

Using a tree structure with intermediate report-merging nodes would also work, but it may prove tricky to deploy. Either way, I would say 5 probes for a single app should be more than reasonable.

Thanks for working on this, fons!

No problem, thanks for the feedback.

The rate limiting from #912 helps quite a bit:

(profile graph: probe_rate_limit)

Now the profile is dominated by garbage collection and the Kubernetes problem.

@2opremio

2opremio commented Feb 4, 2016

After cherry-picking @paul's https://github.com/weaveworks/scope/tree/k8s-efficiency on top of the rate limiting, the CPU consumption is between 25% and 35%:

(profile graph: probe_rate_limit_k8s)

Now the profile is completely dominated by encoding, copying and garbage collection. I am going to work on the generated JSON encoding, clean up and land #912, and then see how we can reduce the garbage-collection dominance if it's still there.

@2opremio

After the codec improvements, the probes (a38db2e) still consume ~20% CPU in the service.

pprof.localhost:4041.samples.cpu.001.pb.gz (profile graph: probe)

The two main places to improve are:

  • Report copying. It consumes a lot of CPU and generates a lot of objects, putting a lot of pressure on the GC.
  • The codec. Yes, it still takes 17% of the time, but I think the solution here is to simply reduce the size of the reports or their frequency. I have created Send incremental reports probe->app #985.

@2opremio

@tomwilkie @paulbellamy @peterbourgon Any ideas on how to reduce or optimize the copying?

@peterbourgon

We can save a ton of allocs and CPU time by leveraging sync.Pool. For sure we can use pools of []byte or *bytes.Buffer when we serialize reports (and deserialize on the other end). That's easy and may get us some quick wins.

Beyond that, it'd be great to use pools for the reports themselves, i.e. don't allocate a new report and fill it, but fetch one from the pool, reset it, and write into it. But we'd need to implement a Reset method, and I'm not sure how simple that would be given all of our nested maps. This also assumes that resetting a map will maintain its allocated capacity; I think that's true, but without a way to check a map's capacity I'm not positive.
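
A minimal sketch of the first, easy part (a pooled *bytes.Buffer around report serialization; the names are illustrative, not actual Scope code):

package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
	"sync"
)

// bufPool recycles serialization buffers so each encode doesn't allocate
// (and later hand to the GC) a fresh, potentially large buffer.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

func encodeReport(report interface{}) ([]byte, error) {
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset() // keeps the underlying capacity for the next caller
		bufPool.Put(buf)
	}()

	if err := gob.NewEncoder(buf).Encode(report); err != nil {
		return nil, err
	}
	// Copy out, since buf goes back to the pool when we return.
	out := make([]byte, buf.Len())
	copy(out, buf.Bytes())
	return out, nil
}

func main() {
	b, err := encodeReport(map[string]string{"host": "probe-1"})
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Printf("encoded %d bytes\n", len(b))
}

The second, harder half (pooling the reports themselves) is what the following comments discuss.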

Other thoughts?

@paulbellamy

It's a bit unclear to me why copying is taking that much CPU and making that much garbage. Copy should be almost a no-op. Unless there is stuff left which is being copied? From the graph it looks a bit like Metrics?

Alternatively we could ditch the whole immutability/copying thing and rethink how the pipeline works. Go doesn't exactly make working with immutable data structures easy or elegant anyway.

@peterbourgon

Alternatively we could ditch the whole immutability/copying thing and rethink how the pipeline works. Go doesn't exactly make working with immutable data structures easy or elegant anyway.

👍 to this. The current report structure and pipeline architecture were designed for ease of use and expediency. If it's no longer easy to use (i.e. to maintain and mutate with changing product requirements) and it's a performance bottleneck with no obvious fix, it's absolutely appropriate to revisit the overall design.

@2opremio

We can save a ton of allocs and CPU time by leveraging sync.Pool. For sure we can use pools of []byte or *bytes.Buffer when we serialize reports (and deserialize on the other end).

AFAIK the new codec library already uses pools.

However, I believe that the fromIntermediate/toIntermediate methods used for custom serialization (e.g. the ones needed for mndrix/ps-based types like LatestMap) may be expensive.

Beyond that, it'd be great to use pools for the reports themselves

This sounds more like it!

@peterbourgon

This sounds more like it!

Unfortunately, after a bit of research and discussion with the community, I don't think the current report structure is going to lend itself to Pooling very well. In the end it's a collection of maps, and maps only 'reset' in linear time, i.e.

// Emptying the nested maps touches every key, so a 'reset' is O(entries):
for _, topo := range report.Topologies() {
    for k := range topo {
        delete(topo, k)
    }
}

It's worth a profile, of course, but the cost of reset increases as the map gets bigger, making it less and less appealing vs. GC. My intuition is that if we want to Pool reports, we need to get rid of most if not all of the maps, in favor of slices or something else — IMO worth considering together with @paulbellamy's bag-of-nodes approach to the pipelines.
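
For the "worth a profile" part, a rough micro-benchmark sketch (generic maps, not the real report types) comparing in-place clearing against allocating fresh maps and leaning on the GC:

package report_test

import (
	"strconv"
	"testing"
)

const entries = 20000

func fill(m map[string]string) {
	for i := 0; i < entries; i++ {
		m[strconv.Itoa(i)] = "node"
	}
}

// Reuse one map, deleting every key between iterations (the pool-and-reset
// strategy).
func BenchmarkClearAndReuse(b *testing.B) {
	m := make(map[string]string, entries)
	for i := 0; i < b.N; i++ {
		fill(m)
		for k := range m {
			delete(m, k)
		}
	}
}

// Allocate a fresh map each iteration and let the GC reclaim the old one.
func BenchmarkAllocateFresh(b *testing.B) {
	for i := 0; i < b.N; i++ {
		m := make(map[string]string, entries)
		fill(m)
	}
}

Running it with go test -bench . -benchmem shows both the timings and the allocations per iteration for each strategy.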

@2opremio

Actually, after having a closer look at the profile, one of the codec nodes (17.76%) comes not from encoding but from decoding, so it can't originate from the reports being sent to the app. 10% of that time is spent reading HTTP bodies, so the decoding itself is only 7%.

Docker/Kubernetes info? It's hard to tell because apparently it's being done in a separate goroutine. I'll investigate, but we may be polling something way too often.

If needed, to accelerate the decoding of that information we can always generate Selfers for those types in the vendored directory.

@2opremio

Also, to get a better sense of where we can reduce garbage generation, I just found out that pprof lets you obtain heap profiles with --alloc_objects. This will tell you how many objects have been allocated overall (not just in use). I will give that a try.

@2opremio

Docker/Kubernetes info? It's hard to tell because apparently it's being done in a separate goroutine. I'll investigate, but we may be polling something way too often.

Yep, it's docker stats being gathered:

(pprof) list StartGatheringStats
Total: 6.87s
ROUTINE ======================== github.com/weaveworks/scope/probe/docker.(*container).StartGatheringStats.func1 in /go/src/github.com/weaveworks/scope/probe/docker/container.go
         0      1.23s (flat, cum) 17.90% of Total
         .          .    202:           c.statsConn = nil
         .          .    203:       }()
         .          .    204:
         .          .    205:       var stats docker.Stats
         .          .    206:       decoder := codec.NewDecoder(resp.Body, &codec.JsonHandle{})
         .      1.23s    207:       for err := decoder.Decode(&stats); err != io.EOF; err = decoder.Decode(&stats) {
         .          .    208:           if err != nil {
         .          .    209:               log.Errorf("docker container: error reading event, did container stop? %v", err)
         .          .    210:               return
         .          .    211:           }
         .          .    212:

And I don't think it was being done in a separate goroutine; pprof just omitted the parents in the tree.

@paulbellamy

go-dockerclient (and the Docker API) has a streaming stats call that might reduce this.
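
For reference, a sketch of what using that streaming call might look like (field and option names follow the go-dockerclient godoc linked below; treat the exact shape, and the container ID, as assumptions):

package main

import (
	"log"

	docker "github.com/fsouza/go-dockerclient"
)

func main() {
	client, err := docker.NewClientFromEnv()
	if err != nil {
		log.Fatal(err)
	}

	statsCh := make(chan *docker.Stats)
	done := make(chan bool)

	go func() {
		// Blocks and pushes one decoded Stats struct per tick onto statsCh
		// until done is closed or the container stops.
		if err := client.Stats(docker.StatsOptions{
			ID:     "some-container-id", // hypothetical container ID
			Stats:  statsCh,
			Stream: true,
			Done:   done,
		}); err != nil {
			log.Printf("stats stream ended: %v", err)
		}
	}()

	for s := range statsCh {
		log.Printf("memory usage: %d bytes", s.MemoryStats.Usage)
	}
}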

@tomwilkie

Nah, their code is basically identical.

On Friday, 19 February 2016, Paul Bellamy wrote:

go-dockerclient (and docker API) has a streaming stats thing
https://godoc.org/github.com/fsouza/go-dockerclient#Client.Stats that
might reduce this.



@2opremio

Here's the --alloc_objects heap profile for a probe (d744278) running in the service:

pprof.localhost:4041.alloc_objects.alloc_space.003.pb.gz (profile graph: probe_obj_allocs)

@2opremio

After #1000, the probes are running stably under 20% (whilst kubelet 1.1 is running at >=70% on all nodes) and the garbage generated is tolerable. However, the /proc reads/stats are up at 26%. I will try to tweak the background reader a bit to see if I can lower that number.

CPU profile:

pprof.localhost:4041.samples.cpu.002.pb.gz (profile graph: probe)

Objects allocations profile:

pprof.localhost:4041.alloc_objects.alloc_space.001.pb.gz (profile graph: app_object_alloc)

@2opremio

After making another measurement and looking at the syscalls (-focus 'Syscall$'), I can see that the Syscall load is shared between the connection walker (probe/endpoint/procspy/proc_linux.go) and the process walker (probe/process/walker_linux.go), which after #961 also inspects the fd directories:

pprof.localhost:4041.samples.cpu.003.pb.gz (profile graph: probe_syscall)

@2opremio

Closing in favor of #1010 and #1011.
