
Improve Hawkular metrics collection #159

Merged

merged 1 commit into ManageIQ:master from yaacov:add-hawkualr-v1.5-collector on Dec 10, 2017

Conversation


@yaacov yaacov commented Nov 6, 2017

Description

Current metrics collection from Hawkular produces metrics that differ from the metrics shown by the OCP console UI. The metrics can also be inaccurate; for example, CPU usage percent may be more than 100 percent.

This PR makes use of the same metrics used by OCP.

BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1455186

Should also fix: https://bugzilla.redhat.com/show_bug.cgi?id=1517064

  • warn if no metric endpoint is defined
  • log failures only once; currently we emit a log line for each failed entity (a minimal sketch of the idea follows below)
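
A rough sketch of the log-once idea (hypothetical names; _log stands in for the ManageIQ logger, which is an assumption on my part):

require 'set'

# Hypothetical sketch: remember which failure messages were already
# emitted instead of logging one line per failed entity.
def log_failure_once(message)
  @logged_failures ||= Set.new
  # Set#add? returns nil when the message was already in the set,
  # so each distinct failure is logged only once.
  _log.warn(message) if @logged_failures.add?(message)
end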

@miq-bot miq-bot added the wip label Nov 6, 2017

yaacov commented Nov 6, 2017

@simon3z @moolitayer @Ladas @agrare [ this is a WIP ] please look at the direction this is going and advise if it is not what you imagined it to be.

My plan is to handle the containers/pods using the same pattern, and add a fallback check at the end.

@yaacov yaacov force-pushed the add-hawkualr-v1.5-collector branch 4 times, most recently from ff099b9 to 689b6b7 on November 6, 2017 16:22
# insert the raw metrics into the ts_values object
metrics['gauge'][full_key].each do |metric|
  timestamp = Time.at(metric['start'] / 1.in_milliseconds).utc
  @ts_values[timestamp][key] = metric['avg'] unless metric['empty']
end
Contributor

I wonder what metric['empty'] is. Does it mean the value is 0?

Member Author

what is metric['empty']?

empty means that this time slot is missing data; for example, a response struct [1] may look like this:

[
   {time: 1, value: 10, avg: 10, ... , empty: false},
   {time: 2, empty: true},
   ...
]

[1] http://www.hawkular.org/docs/rest/rest-metrics.html#NumericBucketPoint

Contributor

right, do we know what that means? Can we fill it with 0 values rather than ignoring it?

Member Author

IIUC a 0 value is problematic:

for example, if we have 4 samples with 2 empty samples:
(10 + 10) / 2 = 10 # we average only the two valid samples
while
(10 + 0 + 0 + 10) / 4 = 5 # we average 4 samples, 2 of them filled with 0
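
To make the arithmetic concrete, a minimal Ruby sketch of the two strategies, using hypothetical buckets in the NumericBucketPoint shape shown above:

# Hypothetical buckets: two valid samples and two empty time slots.
buckets = [
  { 'avg' => 10, 'empty' => false },
  { 'empty' => true },
  { 'empty' => true },
  { 'avg' => 10, 'empty' => false }
]

# Skipping empty buckets averages only the valid samples:
valid = buckets.reject { |b| b['empty'] }
valid.sum { |b| b['avg'] } / valid.size.to_f                       # => 10.0

# Filling empty buckets with 0 drags the average down:
buckets.sum { |b| b['empty'] ? 0 : b['avg'] } / buckets.size.to_f  # => 5.0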

Contributor

right, so back to my question: what does empty mean? Does it just mean there was no sample in the bucket we asked for? Or that the value was 0?

Member Author

Does just mean there was no sample in the bucket we asked for?

Yes, no samples in the bucket.

If value == 0 we have { empty: false, avg: 0, ... }


Contributor

right, so with a big enough interval size (like 1.minute, or twice the scraping period), this should not happen, right?

Member Author

this should not happen, right?

Right, but ... please 🙏 think hard about whether you trust this ...

Contributor

right, so for the g-release I am looking for a way to identify what we should fill with 0 (the pod was dead) and what is just a missing value that should not be 0. So at some point, I will need to trust something. :-)

@moolitayer moolitayer self-assigned this Nov 7, 2017
@yaacov yaacov force-pushed the add-hawkualr-v1.5-collector branch 5 times, most recently from d483624 to b8b2b23 on November 12, 2017 14:39

yaacov commented Nov 12, 2017

Collectors for containers and pods are ready for review:

Example of use for testing:

pod = ems.container_groups.first;
context = ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture::HawkularV15CaptureContext.new(pod, 5.minutes.ago, 0.minutes.ago, 30);
context.collect_metrics

@Ladas @cben please review the updated collectors


yaacov commented Nov 12, 2017

p.s. if someone has an idea how to appease the mighty codeclimate, I will be happy to get advice :-)

tenant = @tenant

# query capacity metrics from Hawkular server
@tenant = '_system'
Contributor

👎 to passing this through a var. It makes the code too hard to follow, and it's not really stateful; there are 2 distinct code paths.

What do you think of adding an optional tenant param to hawkular_client?
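
For illustration, a rough sketch of that suggestion (hypothetical helper name; the real client construction may differ):

# Hypothetical sketch: take the tenant as an optional keyword argument
# instead of mutating @tenant around the capacity queries.
def hawkular_client(tenant: @tenant)
  build_client_for_tenant(tenant) # hypothetical constructor helper
end

# Capacity metrics would then query the '_system' tenant explicitly,
# leaving @tenant untouched:
#   hawkular_client(tenant: '_system')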

Member Author (@yaacov, Nov 12, 2017)

@cben Thanks 👍 , I will.

p.s.
I wanted to avoid this because everywhere the code relies on @tenant being a global const, but it does look unavoidable; the current code is ugly :-(

Contributor

perhaps there should be 2 separate somethings — 2 clients, or 2 contexts?
(just a thought, no idea if it helps)

@moolitayer

@yaacov to understand the scope of this change

  • In which ManageIQ versions do we have the metrics discrepancies you described above?
  • In which OCP versions will we be able to use the new capture context?


yaacov commented Nov 13, 2017

In which ManageIQ versions do we have the metrics discrepancies you described above?

All of them; this is a "drop in replacement" for the current collector.

In which OCP versions will we be able to use the new capture context?

We know that the Hawkular/Heapster bundled with OCP 1.5 and above is compatible with these requests. We also plan to check compatibility before each read and use the appropriate collector depending on the test; see the description of this PR and the discussions on the mailing lists.
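
Sketched roughly (hypothetical probe name; the actual check in this PR queries the "/m" endpoint mentioned later in the review):

# Hypothetical sketch of the planned fallback: probe the newer
# Hawkular endpoint, then pick the matching capture context
# (class names as in the testing examples elsewhere in this thread).
def build_capture_context(target, start_time, end_time, interval)
  base = ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture
  klass = m_endpoint_available? ? base::HawkularV15CaptureContext : base::HawkularCaptureContext
  klass.new(target, start_time, end_time, interval)
end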

@yaacov yaacov force-pushed the add-hawkualr-v1.5-collector branch 6 times, most recently from 382c72d to 23ddc4d on November 26, 2017 15:47

yaacov commented Nov 27, 2017

@Ladas @cben good news everyone: I ran tests on a node, and the new collector was at worst 2x faster and at best 10x faster.

Will run tests on pods and containers next.

reload!;
prov = ExtManagementSystem.find(81);
node = prov.container_nodes.last; context = ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture::HawkularV15CaptureContext.new(node, 10.minutes.ago, 5.minutes.ago, 60);

[56] pry(main)> Benchmark.bm { |x| (1..10).each { |i| sleep(1); x.report("run #{i}:") { context.collect_metrics; } } }
       user     system      total        real
run 1:  0.260000   0.010000   0.270000 (  1.034560)
run 2:  0.010000   0.000000   0.010000 (  0.615654)
run 3:  0.010000   0.010000   0.020000 (  0.127177)
run 4:  0.010000   0.000000   0.010000 (  0.130462)
run 5:  0.020000   0.000000   0.020000 (  0.120418)
run 6:  0.010000   0.010000   0.020000 (  0.118279)
run 7:  0.010000   0.000000   0.010000 (  0.154871)
run 8:  0.010000   0.000000   0.010000 (  0.138120)
run 9:  0.000000   0.010000   0.010000 (  0.117617)
run 10:  0.010000   0.000000   0.010000 (  0.127029)

reload!;
prov = ExtManagementSystem.find(81);
node = prov.container_nodes.last; context = ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture::HawkularCaptureContext.new(node, 10.minutes.ago, 5.minutes.ago, 60);

[54] pry(main)> Benchmark.bm { |x| (1..10).each { |i| sleep(1); x.report("run #{i}:") { context.collect_metrics; } } }
       user     system      total        real
run 1:  0.240000   0.010000   0.250000 (  1.793949)
run 2:  0.040000   0.010000   0.050000 (  1.130210)
run 3:  0.030000   0.010000   0.040000 (  0.479410)
run 4:  0.030000   0.000000   0.030000 (  0.512484)
run 5:  0.030000   0.010000   0.040000 (  0.548518)
run 6:  0.200000   0.000000   0.200000 (  0.551442)
run 7:  0.030000   0.010000   0.040000 (  1.382974)
run 8:  0.030000   0.010000   0.040000 (  0.444941)
run 9:  0.040000   0.000000   0.040000 (  1.095027)
run 10:  0.030000   0.010000   0.040000 (  0.491803)


Ladas commented Nov 27, 2017

@yaacov awesome :-) Maybe also try the tests with a bigger timespan (like 7.days.ago for the initial collection)


yaacov commented Nov 27, 2017

@Ladas

a. I was happy too soon ... with containers I do not see an improvement in timing :-(

Maybe try the tests also with bigger timespan

doing it now :-)


yaacov commented Nov 27, 2017

@Ladas , the longer the span, the smaller the improvement :-(

prov = ExtManagementSystem.find(81); node = prov.container_nodes.last; context = ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture::HawkularV15CaptureContext.new(node, 5.days.ago, 5.minutes.ago, 60);

vs
prov = ExtManagementSystem.find(81); node = prov.container_nodes.last; context = ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture::HawkularCaptureContext.new(node, 5.days.ago, 5.minutes.ago, 60);

5 days metrics:

New way:

run 2:  0.500000   0.060000   0.560000 (  4.119586)
run 3:  0.330000   0.050000   0.380000 (  2.632578)
run 4:  0.370000   0.050000   0.420000 (  2.683264)
run 5:  0.370000   0.050000   0.420000 (  3.630635)
run 6:  0.530000   0.050000   0.580000 (  3.176897)
run 7:  1.080000   0.060000   1.140000 (  3.834927)
run 8:  0.300000   0.060000   0.360000 (  2.592284)
run 9:  0.400000   0.060000   0.460000 (  3.894900)
run 10:  0.290000   0.050000   0.340000 (  3.290654)

Old way:

run 2:  0.930000   0.020000   0.950000 (  3.320927)
run 3:  0.690000   0.030000   0.720000 (  4.836110)
run 4:  0.380000   0.020000   0.400000 (  2.749169)
run 5:  0.320000   0.020000   0.340000 (  2.950035)
run 6:  0.350000   0.020000   0.370000 (  4.053061)
run 7:  0.320000   0.010000   0.330000 (  3.079904)
run 8:  0.330000   0.030000   0.360000 (  2.734735)
run 9:  0.290000   0.020000   0.310000 (  4.319131)
run 10:  0.330000   0.030000   0.360000 (  2.788591)


yaacov commented Nov 27, 2017

For the containers and pods I do a new request to get the node capacity; the old way took these values from the inventory ...

So I can also take these values from the inventory instead of Hawkular ... and this will improve timing for containers and pods ...

@Ladas @cben, which is more important?
a. reducing time => take capacity from the inventory.
b. improving accuracy => take capacity from Hawkular.


yaacov commented Nov 27, 2017

@Ladas , I talked to @cben; I will try to improve the container/pod collection times by down-sampling the node-capacity collection.

Currently, for each container I also collect the node capacity ... this doubles the time, as I do 2x requests: one for the container and one for the node (to get the node capacity). A sketch of the idea follows below.
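
A minimal sketch of the down-sampling idea, assuming a hypothetical fetch_node_capacity helper:

# Hypothetical sketch: memoize the node capacity per host so each
# container/pod collection reuses one node request instead of issuing
# a second Hawkular query every time.
def node_capacity(host_id)
  @node_capacity_cache ||= {}
  @node_capacity_cache[host_id] ||= fetch_node_capacity(host_id)
end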

@yaacov yaacov force-pushed the add-hawkualr-v1.5-collector branch 2 times, most recently from bf161cf to e7014b2 on December 7, 2017 13:55
require_nested :HawkularCaptureContext
require_nested :PrometheusCaptureContext

-INTERVAL = 20.seconds
+INTERVAL = 60.seconds
Contributor

This one line affects all 3 modes (old, new and prometheus) — can you extract it to a separate PR?

Member Author

👍 Makes sense to me, will move it to a new PR

calculate_fields(cpu_node_capacity, mem_node_capacity)
end

# Query the Hawkular server for endpoint "/m" availabel on new versions
Contributor

s/availabel/available/

Member Author

👍 Thanks, fixed!

@cben cben (Contributor) left a comment

Reviewed fully this time. Overall good.
A few small suggestions, and two items that are just explanations (marked ✓) of things we discussed offline.

end

def collect_group_metrics
  group_id = @target.ems_ref
  @metrics = %w(cpu_usage_rate_average mem_usage_absolute_average net_usage_rate_average)
  host_id = @target.container_node.name
Contributor

✓ Does container_node always exist here (and in collect_container_metrics)?
=> YES, context construction raises if not:
https://github.com/yaacov/manageiq-providers-kubernetes/blob/e7014b2faf/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/capture_context_mixin.rb#L51

# @param host_id [String] host_id/url that identifies a node in the Hawkular DB.
# @param pod_id [String] pod_id of a pod in the Hawkular DB.
# @param container_name [String] container_name of a container in the Hawkular DB.
def collect_metrics_for_object(type = 'node', host_id = nil, pod_id = nil, container_name = nil)
Contributor

[cosmetic, your call] type is always passed, doesn't need a default.

Member Author

👍 done

# Search for a full key name in the metrics hash.
#
# @param type [String] metrics type (e.g. gauge / counter).
# @param group_id [String] the metrics key/group_id (e.g. cpu/usage).
Contributor

I think group_id is the wrong name?
The values I see being passed here are things like "cpu/usage_rate" from METRICS_NODE_KEYS/METRICS_POD_KEYS/METRICS_CONTAINER_KEYS.

Member Author

👍 it's key now ...

elsif type == 'pod_container'
  "#{tags},type:#{type},host_id:#{host_id},pod_id:#{pod_id},container_name:#{container_name}"
end
@raw_metrics = query_metrics_by_tags(metric_tags)
Contributor

@raw_metrics gets used in the insert_metrics call immediately below, via get_metrics_key and insert_metrics_key. It's not used afterwards AFAICT.

What do you think of passing it through as a parameter? (See the sketch below.)
Explicit data flow is easier to follow than instance variables, where possible.
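
For example, something along these lines (a sketch of the suggestion, not the final code):

# Sketch: pass the query result down explicitly instead of stashing
# it in @raw_metrics; insert_metrics forwards it to the key helpers.
raw_metrics = query_metrics_by_tags(metric_tags)
insert_metrics(raw_metrics)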

Member Author

👍 done

def collect_node_metrics
  cpu_resid = "machine/#{@target.name}/cpu/usage"
  process_cpu_counters_rate(fetch_counters_rate(cpu_resid))
  @metrics = %w(cpu_usage_rate_average mem_usage_absolute_average net_usage_rate_average)
Contributor

✓ These are necessary for ts_values accessor:
https://github.com/yaacov/manageiq-providers-kubernetes/blob/e7014b2faf/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/capture_context_mixin.rb#L34-L39
Not very elegant, but out of scope for this PR.
@yaacov says for one CaptureContext, only one of collect_{node,group,container}_metrics will be called.

METRICS_ENDPOINT = 'm/stats/query'.freeze
METRICS_NODE_TAGS = 'descriptor_name:' \
'network/tx_rate|network/rx_rate|' \
'cpu/usage_rate|memory/usage|'.freeze
Contributor

is the trailing | a bug (it might match everything)?
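
Presumably the fix just drops the trailing separator, along these lines (my presumption, not the confirmed diff):

# Presumed fix: no trailing '|', so the tag filter cannot
# accidentally match every descriptor_name.
METRICS_NODE_TAGS = 'descriptor_name:' \
                    'network/tx_rate|network/rx_rate|' \
                    'cpu/usage_rate|memory/usage'.freeze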

Member Author

Wow, Thanks 👍 fixed.

@yaacov yaacov force-pushed the add-hawkualr-v1.5-collector branch 5 times, most recently from e3bf7b2 to cf6bc44 on December 10, 2017 11:02

miq-bot commented Dec 10, 2017

Some comments on commit yaacov@204be00

spec/models/manageiq/providers/kubernetes/container_manager/metrics_capture_spec.rb

  • ⚠️ - 135 - Detected allow_any_instance_of. This RSpec method is highly discouraged, please only use when absolutely necessary.
  • ⚠️ - 142 - Detected allow_any_instance_of. This RSpec method is highly discouraged, please only use when absolutely necessary.
  • ⚠️ - 157 - Detected allow_any_instance_of. This RSpec method is highly discouraged, please only use when absolutely necessary.
  • ⚠️ - 164 - Detected allow_any_instance_of. This RSpec method is highly discouraged, please only use when absolutely necessary.


miq-bot commented Dec 10, 2017

Checked commit yaacov@204be00 with ruby 2.3.3, rubocop 0.47.1, haml-lint 0.20.0, and yamllint 1.10.0
24 files checked, 1 offense detected


  • 💣 💥 🔥 🚒 - Linter/Yaml - missing config files

@cben cben (Contributor) left a comment

LGTM

@moolitayer moolitayer merged commit d960961 into ManageIQ:master Dec 10, 2017
@moolitayer moolitayer added this to the Sprint 75 Ending Dec 11, 2017 milestone Dec 10, 2017

yaacov commented Dec 10, 2017

@miq-bot add_label gaprindashvili/yes

BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1455186 target 5.9

@simaishi

Gaprindashvili backport details:

$ git log -1
commit 5eae3559b57e37710bd4417bd0a490dd9e4098c8
Author: Mooli Tayer <mtayer@redhat.com>
Date:   Sun Dec 10 15:20:10 2017 +0200

    Merge pull request #159 from yaacov/add-hawkualr-v1.5-collector
    
    Improve Hawkular metrics collection
    (cherry picked from commit d960961068c94d0a199f1bd9edc296ef37d25a3f)
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1524626
    https://bugzilla.redhat.com/show_bug.cgi?id=1524628
