
Improve Hawkular metrics collection #159

Merged

merged 1 commit into ManageIQ:master from yaacov:add-hawkualr-v1.5-collector on Dec 10, 2017

Conversation


@yaacov yaacov commented Nov 6, 2017

Description

Current metrics collection from Hawkular produces metrics that differ from the metrics shown by the OCP console UI. The metrics can also be inaccurate; for example, CPU usage percent may be more than 100 percent.

This PR makes use of the same metrics used by OCP.

BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1455186

Should also fix: https://bugzilla.redhat.com/show_bug.cgi?id=1517064

  • warn if no metric endpoint is defined
  • log failures only once; currently we emit a log line for each failed entity (a minimal sketch of the idea follows below)
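
A rough sketch of the log-once idea (hypothetical names; _log stands in for the ManageIQ logger, which is an assumption on my part):

require 'set'

# Hypothetical sketch: remember which failure messages were already
# emitted instead of logging one line per failed entity.
def log_failure_once(message)
  @logged_failures ||= Set.new
  # Set#add? returns nil when the message was already in the set,
  # so each distinct failure is logged only once.
  _log.warn(message) if @logged_failures.add?(message)
end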

@miq-bot miq-bot added the wip label Nov 6, 2017

yaacov commented Nov 6, 2017

@simon3z @moolitayer @Ladas @agrare [ this is a WIP ] please look at the direction this is going and advise if it is not what you imagined it to be.

My plan is to handle the containers/pods using the same pattern, and add a fallback check at the end.

@yaacov yaacov force-pushed the add-hawkualr-v1.5-collector branch 4 times, most recently from ff099b9 to 689b6b7 on November 6, 2017 16:22
# insert the raw metrics into the ts_values object
metrics['gauge'][full_key].each do |metric|
  timestamp = Time.at(metric['start'] / 1.in_milliseconds).utc
  @ts_values[timestamp][key] = metric['avg'] unless metric['empty']
end
Contributor

I wonder what metric['empty'] is. Does it mean the value is 0?

Member Author

what is metric['empty']?

empty means that this time slot is missing data; for example, a response struct [1] may look like this:

[
   {time: 1, value: 10, avg: 10, ... , empty: false},
   {time: 2, empty: true},
   ...
]

[1] http://www.hawkular.org/docs/rest/rest-metrics.html#NumericBucketPoint

Contributor

right, do we know what that means? Can we fill it with 0 values rather than ignoring it?

Member Author

IIUC a 0 value is problematic:

for example, if we have 4 samples with 2 empty samples:
(10 + 10) / 2 = 10 # we average only the two valid samples
while
(10 + 0 + 0 + 10) / 4 = 5 # we average 4 samples, 2 of them filled with 0
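
To make the arithmetic concrete, a minimal Ruby sketch of the two strategies, using hypothetical buckets in the NumericBucketPoint shape shown above:

# Hypothetical buckets: two valid samples and two empty time slots.
buckets = [
  { 'avg' => 10, 'empty' => false },
  { 'empty' => true },
  { 'empty' => true },
  { 'avg' => 10, 'empty' => false }
]

# Skipping empty buckets averages only the valid samples:
valid = buckets.reject { |b| b['empty'] }
valid.sum { |b| b['avg'] } / valid.size.to_f                       # => 10.0

# Filling empty buckets with 0 drags the average down:
buckets.sum { |b| b['empty'] ? 0 : b['avg'] } / buckets.size.to_f  # => 5.0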

Contributor

right, so back to my question: what does empty mean? Does it just mean there was no sample in the bucket we asked for? Or that the value was 0?

Member Author

Does just mean there was no sample in the bucket we asked for?

Yes, no samples in the bucket.

If value == 0 we have { empty: false, avg: 0, ... }


Contributor

right, so with a big enough interval size (like 1.minute, or twice the scraping period), this should not happen, right?

Member Author

this should not happen, right?

Right, but ... please 🙏 think hard about whether you trust this ...

Contributor

right, so for the g-release I am looking for a way to identify what we should fill with 0 (the pod was dead) and what is just a missing value that should not be 0. So at some point, I will need to trust something. :-)

@moolitayer moolitayer self-assigned this Nov 7, 2017
@yaacov yaacov force-pushed the add-hawkualr-v1.5-collector branch 5 times, most recently from d483624 to b8b2b23 on November 12, 2017 14:39

yaacov commented Nov 12, 2017

Collectors for containers and pods are ready for review:

Example of use for testing:

pod = ems.container_groups.first;
context = ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture::HawkularV15CaptureContext.new(pod, 5.minutes.ago, 0.minutes.ago, 30);
context.collect_metrics

@Ladas @cben please review the updated collectors


yaacov commented Nov 12, 2017

p.s. if someone has an idea how to appease the mighty codeclimate, I will be happy to get advice :-)

tenant = @tenant

# query capacity metrics from Hawkular server
@tenant = '_system'
Contributor

👎 to passing this through a var. It makes the code too hard to follow, and it's not really stateful; there are 2 distinct code paths.

What do you think of adding an optional tenant param to hawkular_client?
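
For illustration, a rough sketch of that suggestion (hypothetical helper name; the real client construction may differ):

# Hypothetical sketch: take the tenant as an optional keyword argument
# instead of mutating @tenant around the capacity queries.
def hawkular_client(tenant: @tenant)
  build_client_for_tenant(tenant) # hypothetical constructor helper
end

# Capacity metrics would then query the '_system' tenant explicitly,
# leaving @tenant untouched:
#   hawkular_client(tenant: '_system')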

Member Author (@yaacov, Nov 12, 2017)

@cben Thanks 👍 , I will.

p.s.
I wanted to avoid this because everywhere the code relies on @tenant being a global const, but it does look unavoidable; the current code is ugly :-(

Contributor

perhaps there should be 2 separate somethings — 2 clients, or 2 contexts?
(just a thought, no idea if it helps)

@moolitayer

@yaacov to understand the scope of this change

  • In which ManageIQ versions do we have the metrics discrepancies you described above?
  • In which OCP versions will we be able to use the new capture context?


yaacov commented Nov 13, 2017

In which ManageIQ versions do we have the metrics discrepancies you described above?

All of them; this is a "drop in replacement" for the current collector.

In which OCP versions will we be able to use the new capture context?

We know that the Hawkular/Heapster bundled with OCP 1.5 and above is compatible with these requests. We also plan to check compatibility before each read and use the appropriate collector depending on the test; see the description of this PR and the discussions on the mailing lists.
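
Sketched roughly (hypothetical probe name; the actual check in this PR queries the "/m" endpoint mentioned later in the review):

# Hypothetical sketch of the planned fallback: probe the newer
# Hawkular endpoint, then pick the matching capture context
# (class names as in the testing examples elsewhere in this thread).
def build_capture_context(target, start_time, end_time, interval)
  base = ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture
  klass = m_endpoint_available? ? base::HawkularV15CaptureContext : base::HawkularCaptureContext
  klass.new(target, start_time, end_time, interval)
end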

@yaacov yaacov force-pushed the add-hawkualr-v1.5-collector branch 6 times, most recently from 382c72d to 23ddc4d on November 26, 2017 15:47

yaacov commented Nov 27, 2017

@Ladas @cben good news everyone: I ran tests on a node, and the new collector was at worst 2x faster and at best 10x faster.

Will run tests on pods and containers next.

reload!;
prov = ExtManagementSystem.find(81);
node = prov.container_nodes.last; context = ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture::HawkularV15CaptureContext.new(node, 10.minutes.ago, 5.minutes.ago, 60);

[56] pry(main)> Benchmark.bm { |x| (1..10).each { |i| sleep(1); x.report("run #{i}:") { context.collect_metrics; } } }
       user     system      total        real
run 1:  0.260000   0.010000   0.270000 (  1.034560)
run 2:  0.010000   0.000000   0.010000 (  0.615654)
run 3:  0.010000   0.010000   0.020000 (  0.127177)
run 4:  0.010000   0.000000   0.010000 (  0.130462)
run 5:  0.020000   0.000000   0.020000 (  0.120418)
run 6:  0.010000   0.010000   0.020000 (  0.118279)
run 7:  0.010000   0.000000   0.010000 (  0.154871)
run 8:  0.010000   0.000000   0.010000 (  0.138120)
run 9:  0.000000   0.010000   0.010000 (  0.117617)
run 10:  0.010000   0.000000   0.010000 (  0.127029)

reload!;
prov = ExtManagementSystem.find(81);
node = prov.container_nodes.last; context = ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture::HawkularCaptureContext.new(node, 10.minutes.ago, 5.minutes.ago, 60);

[54] pry(main)> Benchmark.bm { |x| (1..10).each { |i| sleep(1); x.report("run #{i}:") { context.collect_metrics; } } }
       user     system      total        real
run 1:  0.240000   0.010000   0.250000 (  1.793949)
run 2:  0.040000   0.010000   0.050000 (  1.130210)
run 3:  0.030000   0.010000   0.040000 (  0.479410)
run 4:  0.030000   0.000000   0.030000 (  0.512484)
run 5:  0.030000   0.010000   0.040000 (  0.548518)
run 6:  0.200000   0.000000   0.200000 (  0.551442)
run 7:  0.030000   0.010000   0.040000 (  1.382974)
run 8:  0.030000   0.010000   0.040000 (  0.444941)
run 9:  0.040000   0.000000   0.040000 (  1.095027)
run 10:  0.030000   0.010000   0.040000 (  0.491803)


Ladas commented Nov 27, 2017

@yaacov awesome :-) Maybe also try the tests with a bigger timespan (like 7.days.ago for the initial collection)


yaacov commented Nov 27, 2017

@Ladas

a. I was happy too soon ... with containers I do not see an improvement in timing :-(

Maybe try the tests also with bigger timespan

doing it now :-)


yaacov commented Nov 27, 2017

@Ladas , the longer the span, the smaller the improvement :-(

prov = ExtManagementSystem.find(81); node = prov.container_nodes.last; context = ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture::HawkularV15CaptureContext.new(node, 5.days.ago, 5.minutes.ago, 60);

vs
prov = ExtManagementSystem.find(81); node = prov.container_nodes.last; context = ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture::HawkularCaptureContext.new(node, 5.days.ago, 5.minutes.ago, 60);

5 days metrics:

New way:

run 2:  0.500000   0.060000   0.560000 (  4.119586)
run 3:  0.330000   0.050000   0.380000 (  2.632578)
run 4:  0.370000   0.050000   0.420000 (  2.683264)
run 5:  0.370000   0.050000   0.420000 (  3.630635)
run 6:  0.530000   0.050000   0.580000 (  3.176897)
run 7:  1.080000   0.060000   1.140000 (  3.834927)
run 8:  0.300000   0.060000   0.360000 (  2.592284)
run 9:  0.400000   0.060000   0.460000 (  3.894900)
run 10:  0.290000   0.050000   0.340000 (  3.290654)

Old way:

run 2:  0.930000   0.020000   0.950000 (  3.320927)
run 3:  0.690000   0.030000   0.720000 (  4.836110)
run 4:  0.380000   0.020000   0.400000 (  2.749169)
run 5:  0.320000   0.020000   0.340000 (  2.950035)
run 6:  0.350000   0.020000   0.370000 (  4.053061)
run 7:  0.320000   0.010000   0.330000 (  3.079904)
run 8:  0.330000   0.030000   0.360000 (  2.734735)
run 9:  0.290000   0.020000   0.310000 (  4.319131)
run 10:  0.330000   0.030000   0.360000 (  2.788591)


yaacov commented Nov 27, 2017

For the containers and pods I do a new request to get the node capacity; the old way took these values from the inventory ...

So I can also take these values from the inventory instead of Hawkular ... and this will improve timing for containers and pods ...

@Ladas @cben, which is more important?
a. reducing time => take capacity from the inventory.
b. improving accuracy => take capacity from Hawkular.


yaacov commented Nov 27, 2017

@Ladas , I talked to @cben; I will try to improve the container/pod collection times by down-sampling the node-capacity collection.

Currently, for each container I also collect the node capacity ... this doubles the time, as I do 2x requests: one for the container and one for the node (to get the node capacity). A sketch of the idea follows below.
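
A minimal sketch of the down-sampling idea, assuming a hypothetical fetch_node_capacity helper:

# Hypothetical sketch: memoize the node capacity per host so each
# container/pod collection reuses one node request instead of issuing
# a second Hawkular query every time.
def node_capacity(host_id)
  @node_capacity_cache ||= {}
  @node_capacity_cache[host_id] ||= fetch_node_capacity(host_id)
end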

@yaacov yaacov force-pushed the add-hawkualr-v1.5-collector branch 2 times, most recently from bf161cf to e7014b2 on December 7, 2017 13:55
require_nested :HawkularCaptureContext
require_nested :PrometheusCaptureContext

-INTERVAL = 20.seconds
+INTERVAL = 60.seconds
Contributor

This one line affects all 3 modes (old, new and prometheus) — can you extract it to a separate PR?

Member Author

👍 Makes sense to me, will move it to a new PR

calculate_fields(cpu_node_capacity, mem_node_capacity)
end

# Query the Hawkular server for endpoint "/m" availabel on new versions
Contributor

s/availabel/available/

Member Author

👍 Thanks, fixed!

@cben cben (Contributor) left a comment

Reviewed fully this time. Overall good.
A few small suggestions, and two items that are just explanations (marked ✓) of things we discussed offline.

end

def collect_group_metrics
  group_id = @target.ems_ref
  @metrics = %w(cpu_usage_rate_average mem_usage_absolute_average net_usage_rate_average)
  host_id = @target.container_node.name
Contributor

✓ Does container_node always exist here (and in collect_container_metrics)?
=> YES, context construction raises if not:
https://github.com/yaacov/manageiq-providers-kubernetes/blob/e7014b2faf/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/capture_context_mixin.rb#L51

# @param host_id [String] host_id/url that identifies a node in the Hawkular DB.
# @param pod_id [String] pod_id of a pod in the Hawkular DB.
# @param container_name [String] container_name of a container in the Hawkular DB.
def collect_metrics_for_object(type = 'node', host_id = nil, pod_id = nil, container_name = nil)
Contributor

[cosmetic, your call] type is always passed, doesn't need a default.

Member Author

👍 done

# Search for a full key name in the metrics hash.
#
# @param type [String] metrics type (e.g. gauge / counter).
# @param group_id [String] the metrics key/group_id (e.g. cpu/usage).
Contributor

I think group_id is the wrong name?
The values I see being passed here are things like "cpu/usage_rate" from METRICS_NODE_KEYS/METRICS_POD_KEYS/METRICS_CONTAINER_KEYS.

Member Author

👍 it's key now ...

elsif type == 'pod_container'
  "#{tags},type:#{type},host_id:#{host_id},pod_id:#{pod_id},container_name:#{container_name}"
end
@raw_metrics = query_metrics_by_tags(metric_tags)
Contributor

@raw_metrics gets used in the insert_metrics call immediately below, via get_metrics_key and insert_metrics_key. It's not used afterwards AFAICT.

What do you think of passing it through as a parameter? (See the sketch below.)
Explicit data flow is easier to follow than instance variables, where possible.
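
For example, something along these lines (a sketch of the suggestion, not the final code):

# Sketch: pass the query result down explicitly instead of stashing
# it in @raw_metrics; insert_metrics forwards it to the key helpers.
raw_metrics = query_metrics_by_tags(metric_tags)
insert_metrics(raw_metrics)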

Member Author

👍 done

def collect_node_metrics
  cpu_resid = "machine/#{@target.name}/cpu/usage"
  process_cpu_counters_rate(fetch_counters_rate(cpu_resid))
  @metrics = %w(cpu_usage_rate_average mem_usage_absolute_average net_usage_rate_average)
Contributor

✓ These are necessary for ts_values accessor:
https://github.com/yaacov/manageiq-providers-kubernetes/blob/e7014b2faf/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/capture_context_mixin.rb#L34-L39
Not very elegant, but out of scope for this PR.
@yaacov says for one CaptureContext, only one of collect_{node,group,container}_metrics will be called.

METRICS_ENDPOINT = 'm/stats/query'.freeze
METRICS_NODE_TAGS = 'descriptor_name:' \
'network/tx_rate|network/rx_rate|' \
'cpu/usage_rate|memory/usage|'.freeze
Contributor

is the trailing | a bug (it might match everything)?
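
Presumably the fix just drops the trailing separator, along these lines (my presumption, not the confirmed diff):

# Presumed fix: no trailing '|', so the tag filter cannot
# accidentally match every descriptor_name.
METRICS_NODE_TAGS = 'descriptor_name:' \
                    'network/tx_rate|network/rx_rate|' \
                    'cpu/usage_rate|memory/usage'.freeze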

Member Author

Wow, Thanks 👍 fixed.

@yaacov yaacov force-pushed the add-hawkualr-v1.5-collector branch 5 times, most recently from e3bf7b2 to cf6bc44 on December 10, 2017 11:02

miq-bot commented Dec 10, 2017

Some comments on commit yaacov@204be00

spec/models/manageiq/providers/kubernetes/container_manager/metrics_capture_spec.rb

  • ⚠️ - 135 - Detected allow_any_instance_of. This RSpec method is highly discouraged, please only use when absolutely necessary.
  • ⚠️ - 142 - Detected allow_any_instance_of. This RSpec method is highly discouraged, please only use when absolutely necessary.
  • ⚠️ - 157 - Detected allow_any_instance_of. This RSpec method is highly discouraged, please only use when absolutely necessary.
  • ⚠️ - 164 - Detected allow_any_instance_of. This RSpec method is highly discouraged, please only use when absolutely necessary.


miq-bot commented Dec 10, 2017

Checked commit yaacov@204be00 with ruby 2.3.3, rubocop 0.47.1, haml-lint 0.20.0, and yamllint 1.10.0
24 files checked, 1 offense detected


  • 💣 💥 🔥 🚒 - Linter/Yaml - missing config files

@cben cben (Contributor) left a comment

LGTM

@moolitayer moolitayer merged commit d960961 into ManageIQ:master Dec 10, 2017
@moolitayer moolitayer added this to the Sprint 75 Ending Dec 11, 2017 milestone Dec 10, 2017

yaacov commented Dec 10, 2017

@miq-bot add_label gaprindashvili/yes

BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1455186 target 5.9

@simaishi

Gaprindashvili backport details:

$ git log -1
commit 5eae3559b57e37710bd4417bd0a490dd9e4098c8
Author: Mooli Tayer <mtayer@redhat.com>
Date:   Sun Dec 10 15:20:10 2017 +0200

    Merge pull request #159 from yaacov/add-hawkualr-v1.5-collector
    
    Improve Hawkular metrics collection
    (cherry picked from commit d960961068c94d0a199f1bd9edc296ef37d25a3f)
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1524626
    https://bugzilla.redhat.com/show_bug.cgi?id=1524628
