Improve Hawkular metrics collection #159
Conversation
@simon3z @moolitayer @Ladas @agrare [ this is a WIP ] please look at the direction this is going and advise if this is not what you imagined it to be. My plan is to do the containers/pods using the same pattern, and add some checks for fallback at the end.
Force-pushed from ff099b9 to 689b6b7.
# insert the raw metrics into the ts_values object
metrics['gauge'][full_key].each do |metric|
  timestamp = Time.at(metric['start'] / 1.in_milliseconds).utc
  @ts_values[timestamp][key] = metric['avg'] unless metric['empty']
I wonder what metric['empty'] is? Does it mean the value is 0?
what is metric['empty']?
empty means that this time slot is missing data; for example, a response struct [1] may look like this:
[
{time: 1, value: 10, avg: 10, ... , empty: false},
{time: 2, empty: true},
...
]
[1] http://www.hawkular.org/docs/rest/rest-metrics.html#NumericBucketPoint
right, do we know what that means? Can we fill it with 0 values, rather than ignoring it?
IIUC a 0 value is problematic:
for example, if we have 4 samples of which 2 are empty:
(10 + 10) / 2 = 10 # averaging only the two valid samples
while
(10 + 0 + 0 + 10) / 4 = 5 # averaging 4 samples, 2 of them filled with 0
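The same arithmetic as a small Ruby sketch, with made-up sample values:

samples = [10, nil, nil, 10]           # nil marks the two "empty" buckets
valid = samples.compact
valid.reduce(:+) / valid.size.to_f     # => 10.0, averaging only the valid samples
filled = samples.map { |s| s || 0 }
filled.reduce(:+) / filled.size.to_f   # => 5.0, zero-filling drags the average down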
right, so back to my question: what does empty mean? Does it just mean there was no sample in the bucket we asked for? Or that the value was 0?
Does it just mean there was no sample in the bucket we asked for?
Yes, no samples in the bucket.
If value == 0 we have { empty=false, avg=0, ...}
P.S.
A bit off topic for this comment thread: we have another point where we lose metric buckets [1]
right, so with an interval size big enough (like 1.minute, or twice the scraping period), this should not happen, right?
this should not happen, right?
Right, but ... pray 🙏 hard if you trust this ...
right, so I am trying to find a way for g-release to identify what we should fill as 0 (the pod was dead) and what is just a missing value that should not be 0. So at some point, I will need to trust something. :-)
Force-pushed from d483624 to b8b2b23.
Collectors for containers and pods are ready for review. Example of use for testing:
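The example itself was not preserved in this thread; a hedged sketch of the kind of Rails console call that could exercise the collector (ContainerNode and perf_capture_realtime are the generic ManageIQ capture entry points; the node name is a placeholder):

node = ContainerNode.find_by(:name => 'my-node')  # placeholder node name
node.perf_capture_realtime                        # triggers metrics collection for that node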
P.S. if someone has an idea how to appease the mighty codeclimate, I will be happy to get advice :-)
tenant = @tenant

# query capacity metrics from Hawkular server
@tenant = '_system'
👎 to passing this through a var. It makes the code too hard to follow, and it's not really stateful; there are 2 distinct code paths.
What do you think of adding an optional tenant param to hawkular_client?
@cben Thanks 👍 , I will.
P.S. I wanted to avoid this, because everywhere the code relies on @tenant being a global const, but it does look unavoidable; the current code is ugly :-(
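A minimal sketch of the suggested optional-tenant parameter; the memoization and the build_hawkular_client helper are assumptions, not the PR's actual code:

def hawkular_client(tenant = @tenant)
  @hawkular_clients ||= {}
  # build_hawkular_client is an assumed helper that opens a connection for the given tenant
  @hawkular_clients[tenant] ||= build_hawkular_client(tenant)
end

hawkular_client            # pod/container metrics under the default tenant
hawkular_client('_system') # capacity metrics live under the '_system' tenant

This keeps the two code paths explicit instead of mutating @tenant around the capacity query.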
perhaps there should be 2 separate somethings — 2 clients, or 2 contexts?
(just a thought, no idea if it helps)
@yaacov, to understand the scope of this change:
All, this is a "drop in replacement" for the current collector.
We know that the Hawkular/Heapster bundled with OCP 1.5 and above is compatible with these requests. We also plan to check compatibility before each read and use the appropriate collector depending on the test; see the description of this PR and the discussions on the mailing lists.
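A hedged sketch of what that compatibility check might look like; HawkularCaptureContext is the class named in this PR, while HawkularLegacyCaptureContext, m_endpoint_available? and the constructor arguments are assumptions:

def capture_context(ems, target, start_time, end_time)
  # probe the newer Hawkular "m" stats endpoint; fall back to the old collector if it is missing
  if m_endpoint_available?(ems)
    HawkularCaptureContext.new(target, start_time, end_time)
  else
    HawkularLegacyCaptureContext.new(target, start_time, end_time)
  end
end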
Force-pushed from 382c72d to 23ddc4d.
@Ladas @cben good news everyone, I ran tests on a node, and the new collector was at worst 2x faster and at best 10x faster. Will run tests on pods and containers next.
@yaacov awesome :-) Maybe try the tests also with a bigger timespan (like 7.days.ago for the initial collection)
I was happy too soon ... with containers I do not see an improvement in timing :-(
doing it now :-)
@Ladas, the longer the span, the smaller the improvement :-(
vs 5 days of metrics:
New way:
Old way:
In the containers and pods I do a new request to get the node capacity. So I can also take these values from inventory instead of Hawkular ... and this will improve timing for containers and pods ... @Ladas @cben, what is more important:
@Ladas, talked to @cben; will try to improve the container/pod collection times by down-sampling the node-capacity collection. Currently, for each container I also collect the node capacity ... this doubles the time, as I do 2x requests: one for the container and one for the node (to get the node capacity)
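One possible shape for that down-sampling, caching the node capacity per node so it is fetched once and reused for every container on that node (method names here are assumptions):

def node_capacity(host_id)
  @node_capacities ||= {}
  # fetch_node_capacity is an assumed helper that does the single Hawkular node-capacity request
  @node_capacities[host_id] ||= fetch_node_capacity(host_id)
end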
Force-pushed from bf161cf to e7014b2.
require_nested :HawkularCaptureContext
require_nested :PrometheusCaptureContext

-INTERVAL = 20.seconds
+INTERVAL = 60.seconds
This one line affects all 3 modes (old, new, and prometheus). Can you extract it to a separate PR?
👍 Makes sense to me, will move it to a new PR
calculate_fields(cpu_node_capacity, mem_node_capacity)
end

# Query the Hawkular server for endpoint "/m" availabel on new versions
s/availabel/available/
👍 Thanks, fixed !
Reviewed fully this time. Overall good.
A few small suggestions, and two that are just explanations (marked ✓) for things we discussed offline.
end

def collect_group_metrics
  group_id = @target.ems_ref
  @metrics = %w(cpu_usage_rate_average mem_usage_absolute_average net_usage_rate_average)
  host_id = @target.container_node.name
✓ Does container_node always exist here (and in collect_container_metrics)?
=> YES, context construction raises if not:
https://github.com/yaacov/manageiq-providers-kubernetes/blob/e7014b2faf/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/capture_context_mixin.rb#L51
# @param host_id [String] host_id/url that identifies a node in the Hawkular DB.
# @param pod_id [String] pod_id of a pod in the Hawkular DB.
# @param container_name [String] container_name of a container in the Hawkular DB.
def collect_metrics_for_object(type = 'node', host_id = nil, pod_id = nil, container_name = nil)
[cosmetic, your call] type is always passed, so it doesn't need a default.
👍 done
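With the default dropped, the signature presumably becomes:

def collect_metrics_for_object(type, host_id = nil, pod_id = nil, container_name = nil)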
# Search for a full key name in the metrics hash.
#
# @param type [String] metrics type (e.g. gauge / counter).
# @param group_id [String] the metrics key/group_id (e.g. cpu/usage).
I think group_id is the wrong name?
The values I see get passed here are things like "cpu/usage_rate" from METRICS_NODE_KEYS/METRICS_POD_KEYS/METRICS_CONTAINER_KEYS.
👍 it's key now ...
elsif type == 'pod_container'
  "#{tags},type:#{type},host_id:#{host_id},pod_id:#{pod_id},container_name:#{container_name}"
end
@raw_metrics = query_metrics_by_tags(metric_tags)
@raw_metrics gets used in the insert_metrics called immediately below, via get_metrics_key and insert_metrics_key. It's not used afterwards AFAICT.
What do you think of passing it through as a parameter? Explicit data flow is easier to follow than instance variables, where possible.
👍 done
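A rough sketch of the suggested explicit data flow; apart from query_metrics_by_tags, insert_metrics, get_metrics_key and insert_metrics_key, which are mentioned above, the exact signatures are assumptions:

# caller: pass the query result straight through instead of storing it in @raw_metrics
raw_metrics = query_metrics_by_tags(metric_tags)
insert_metrics(raw_metrics)

# assumed shape of insert_metrics once it takes the metrics hash as an argument
def insert_metrics(raw_metrics)
  @metrics.each do |key|
    full_key = get_metrics_key(raw_metrics, key)
    insert_metrics_key(raw_metrics, key, full_key)
  end
end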
def collect_node_metrics
  cpu_resid = "machine/#{@target.name}/cpu/usage"
  process_cpu_counters_rate(fetch_counters_rate(cpu_resid))
  @metrics = %w(cpu_usage_rate_average mem_usage_absolute_average net_usage_rate_average)
✓ These are necessary for the ts_values accessor:
https://github.com/yaacov/manageiq-providers-kubernetes/blob/e7014b2faf/app/models/manageiq/providers/kubernetes/container_manager/metrics_capture/capture_context_mixin.rb#L34-L39
Not very elegant, but out of scope for this PR.
@yaacov says that for one CaptureContext, only one of collect_{node,group,container}_metrics will be called.
METRICS_ENDPOINT = 'm/stats/query'.freeze
METRICS_NODE_TAGS = 'descriptor_name:' \
                    'network/tx_rate|network/rx_rate|' \
                    'cpu/usage_rate|memory/usage|'.freeze
is the trailing | a bug (might match everything)?
Wow, Thanks 👍 fixed.
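With the fix applied, the constant presumably reads (no trailing pipe, so the descriptor_name filter matches only the listed metrics):

METRICS_NODE_TAGS = 'descriptor_name:' \
                    'network/tx_rate|network/rx_rate|' \
                    'cpu/usage_rate|memory/usage'.freeze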
Force-pushed from e3bf7b2 to cf6bc44.
Force-pushed from cf6bc44 to 204be00.
Some comments on commit yaacov@204be00:
spec/models/manageiq/providers/kubernetes/container_manager/metrics_capture_spec.rb
Checked commit yaacov@204be00 with ruby 2.3.3, rubocop 0.47.1, haml-lint 0.20.0, and yamllint 1.10.0
LGTM
@miq-bot add_label gaprindashvili/yes BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1455186 target 5.9
Improve Hawkular metrics collection (cherry picked from commit d960961)
https://bugzilla.redhat.com/show_bug.cgi?id=1524626
https://bugzilla.redhat.com/show_bug.cgi?id=1524628
Gaprindashvili backport details:
Description
The current metrics collection from Hawkular produces metrics that differ from the metrics shown in the OCP console UI. The metrics can also be inaccurate; for example, CPU usage percent may be more than 100 percent.
This PR makes use of the same metrics used by OCP.
BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1455186
Should also fix: https://bugzilla.redhat.com/show_bug.cgi?id=1517064