Make capture interval a multiple of 30s #187
@miq-bot add_label bug, metrics @cben @moolitayer @Ladas @agrare please review
SGTM based on the verbal explanation of expected effects, but that doesn't mean much :-) Would love it if we had any tests:
@cben 👍 about specs:
The rspec is updated for the new interval. It does not check the actual rows written to the DB, but it does check the methods that write data to the DB; this spec will fail if a different interval is used.
We do not have an integration test framework (?), but you can check that this is fixed by running commands in the rails console. Before the fix, we get 40s and 20s intervals, with a missing interval between the 1st and 2nd data points:
After the fix, we get 60s intervals consistently:
What I had in mind is precisely these gaps in metrics that previously resulted from some 20s intervals not containing any 30s-spaced samples. I think the existing test structure is good, doesn't have to be more "integration", just need more data.
P.S. tests do some stubbing on HawkularLegacyCaptureContext. Is that what tests actually run, or HawkularCaptureContext? Should they test both (or maybe all 3, including prometheus)?
This is simulated data to show that we correctly calculate a ΔT of 60s instead of 20s.
We need to check this calculation because ΔT always equals the interval (even if we have no data): if we have missing values we have ΔX == 0, but ΔT still equals the interval, which results in a wrong diff value. More simulated data would not change the test of the calculation.
Yes, you can run the example code from #187 (comment) on a Hawkular or Prometheus provider; we also have a VCR recording of the
No, only HawkularLegacyCaptureContext does the calculation; the new Hawkular and Prometheus code gets the diff values pre-calculated, so it is immune to the missing-data calculation error. [1] The rspecs in the metrics capture subclasses [1] check that we do not have gaps:
@yaacov can we align the data samples to always be at the beginning of the minute? So 12:20:00, 12:21:00, etc. Since there is aligning code that will store them at the closest 20s, and that might lead to different values, it could theoretically break the rollup.

@agrare @Fryguy I haven't found any code that would require the data to be 20s-aligned, have I missed anything? (The OpenShift provider has been breaking this condition for a while.) But there is code relying on the alignment, like when we put multiple metrics into one sample (cpu, mem, net) and when we sum samples into 1 value (network inbound + outbound). So from what I see, alignment to 00s should serve us the same as 20s alignment.

Also, using rates, we should not have a problem with accuracy. But using the old way, I think we need to calculate with the 'max' sample instead of 'avg'. Otherwise the values could again be bigger than 100% for e.g. 1 CPU core (as a visible bug in data), since we will have 2 samples in the 60s and the distance between the avg of the current and the avg of the previous is not 60s.
@Ladas sure :-) done, example:
This is not really a problem, as avg == min == max when we have one sample :-)
(force-pushed from f64d75c to 8775f1b)
@@ -51,6 +51,9 @@ def log_severity
  }

  def capture_context(ems, target, start_time, end_time)
    # make start_time align with INTERVAL
    start_time = (start_time.to_i / INTERVAL.to_i) * INTERVAL.to_i
will this result in the same?
start_time = start_time.beginning_of_minute
looks more readable
@Ladas nice !
will fix specs and then use this 👍
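The two alignment approaches discussed above should agree for a 60s interval. A minimal sketch in plain Ruby (assuming `INTERVAL` is 60 seconds, as in this PR; `align_by_division` is an illustrative stand-in for the inline expression, and `beginning_of_minute` is the ActiveSupport method mentioned by @Ladas):

```ruby
INTERVAL = 60

def align_by_division(epoch_seconds)
  # Integer division floors the timestamp down to the nearest INTERVAL boundary
  (epoch_seconds / INTERVAL) * INTERVAL
end

t = Time.utc(2017, 12, 18, 12, 20, 41)
aligned = align_by_division(t.to_i)
# For a 60s INTERVAL this matches t.beginning_of_minute:
Time.at(aligned).utc # => 2017-12-18 12:20:00 UTC
```

The two forms diverge only if `INTERVAL` stops being exactly one minute, which is worth keeping in mind if the constant ever changes.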
@@ -110,7 +110,7 @@ def compute_derivative(counters)
  {
    'start' => n['start'],
    'end' => n['end'],
    'avg' => n['avg'] - prv['avg']
    'max' => n['max'] - prv['max']
is n['start'] the same as prv['end']?
not if we are missing a window ... but usually yes
right, so is it safer to use 'start' => prv['end']? although, we do not use this information in the calculation. Should we rather use
(n['max'] - prv['max']) / (1.kilobyte.to_f * (n['end'] - prv['end']))
here, instead of using @interval above?
@Ladas 👍 nice catch, I saw that a couple of times, didn't fix it immediately, and forgot about it :-)
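The point of @Ladas's suggestion can be sketched in isolation: dividing by the real distance between samples instead of a fixed `@interval` keeps the rate correct when a window is missing. `rate_between` is a hypothetical helper (the real code is inline in `compute_derivative`), and `1.kilobyte.to_f` is written out as `1024.0` to keep this self-contained:

```ruby
# Hedged sketch: rate from the actual span between consecutive samples,
# in KiB/s, rather than from an assumed fixed interval.
def rate_between(prv, n)
  (n['max'] - prv['max']) / (1024.0 * (n['end'] - prv['end']))
end

prv = { 'end' => 0,   'max' => 0 }
n   = { 'end' => 120, 'max' => 245_760 } # one 60s window was missed
rate_between(prv, n) # => 2.0
```

With a fixed 60s `@interval` the same data would yield 4.0, double the true rate, which is exactly the missing-window error discussed above.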
(force-pushed from 8775f1b to 6aafa91)
@ts_values[timestamp]['mem_usage_absolute_average'] = avg_usage
    end
  end

  def process_net_counters_rate(counters_rate)
    @metrics |= ['net_usage_rate_average'] unless counters_rate.empty?
    counters_rate.each do |x|
      interval = (x['end'] - x['start']) / 1.in_milliseconds
right, this is true only if n['start'] is the same as prv['end'], right? otherwise we would need to move the calculation below, where we have the n and prv elements available?
@Ladas 👍 fixed, now x['end'] is defined as next['start'] on line 114.
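The interval computation in that hunk divides a millisecond span by ActiveSupport's `1.in_milliseconds` (which is 1000). A tiny self-contained sketch, with the divisor written out and `interval_seconds` as an illustrative helper name:

```ruby
# Sketch: window length in seconds from millisecond timestamps,
# assuming x['end'] is the next sample's 'start' (as fixed above).
def interval_seconds(x)
  (x['end'] - x['start']) / 1000
end

interval_seconds('start' => 1_513_600_800_000, 'end' => 1_513_600_860_000) # => 60
```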
(force-pushed from f593931 to 282e287)
@miq-bot add_label compute/containers, gaprindashvili/yes BZ target 5.9
@yaacov Cannot apply the following label because they are not recognized: compute/containers
'avg' => n['avg'] - prv['avg']
'start' => prv['start'],
'end' => n['start'],
'max' => n['max'] - prv['max']
hm, if we do <prv['start'], n['start']>
shouldn't we use n['min'] - prv['min'], since samples are on the start timestamp?
Also, it's still 'avg' => n['max'] - prv['max']
, right? Since we are computing avg between 2 samples?
the data look like this, right?
prv_start---------------(prv_end, n_start)---------------n_end
prv_min-----------------(prv_max, n_min)-----------------n_max
or, probably, the actual min & max samples are not aligned?
prv_start---------------(prv_end, n_start)---------------n_end
----prv_min--------prv_max------------n_min--------n_max--
Can we actually get the real start time of each sample? Because if we are using bucket start/end, that would explain why I saw bad values (like more than 100% CPU on a 1-core machine)
hm, if we do <prv['start'], n['start']> shouldn't we use n['min'] - prv['min'], since samples are on the start timestamp?
Right 👍 will update to ['min'].
probably the actual min&max samples are not aligned?
Correct, but we are talking about 30s samples over a 60s window:
a. we have 1..2 samples per window => min ~= max ~= avg
b. the max sample-time error is < 30s

Can we actually get the real start time of each sample?

Hawkular:
Not without changing too much of the code; we use Hawkular to sort the samples (and it drops the real-time info), and if we ask Hawkular to give us raw data, we will have to add the code to do the sorting ourselves.
Prometheus:
No way to know the raw data time, by design (max sample-time error is 30s :-))
P.S.
This code is obsolete; it is only a fallback in case we cannot get the rate already calculated from Hawkular. OpenShift v3.5 and above gives us the already-calculated rate, and this code is not called any more. See #159.
@Ladas @moolitayer @cben @bazulay Points to note when reviewing:
Looks good to me now. The legacy code is as precise as it can be now.
@Ladas thanks 👍 cc @cben @moolitayer
# ^ (real sample time) ^ (real sample time) ^ (real sample time)
# ^ (real sample value) ^ (real sample value) ^ (real sample value)
# we use:
# (T = start of window timestamp, V = min value of winodw samples)
s/winodw/window
replaced only once :-)
(force-pushed from 7cbe3b8 to fb03951)
@@ -97,20 +99,29 @@ def compute_summation(data)
  # Add min, median, max, percentile95th, etc. if needed
  {
    'start' => k,
    'end' => [sum['end'], n['end']].max,
    'avg' => sum['avg'] + n['avg']
    'end' => [sum['end'], n['end']].min,
this one I'm not sure about — if end is supposed to be the last timestamp of the summation period, this should remain max?
r/max/min/g :-)
'end' => [sum['end'], n['end']].max,
'avg' => sum['avg'] + n['avg']
'end' => [sum['end'], n['end']].min,
'min' => sum['min'] + n['min']
this is weird both before and after.
average over many windows is not the sum of sub-window averages (should be divided by count)
minimum over many windows is not the sum of sub-window minimums
?
r/max/min/g :-) again ...

this is weird both before and after.

place sad emoji here :-)

average over many windows is not the sum of sub-window averages (should be divided by count)
minimum over many windows is not the sum of sub-window minimums
?

This function just sums the tx + rx values to give net_total; there is no diff here, so summation is ok here.
ah, that's what I missed, I thought it's summing along time axis. 👍
(force-pushed from fb03951 to e56db5d)
Checked commit yaacov@e56db5d with ruby 2.3.3, rubocop 0.47.1, haml-lint 0.20.0, and yamllint 1.10.0
LGTM
LGTM 👍
Make capture interval a multiple of 30s (cherry picked from commit 9be0e66) https://bugzilla.redhat.com/show_bug.cgi?id=1527094
Gaprindashvili backport details:
Description
TL;DR This PR changes the query rate from Hawkular/Prometheus to 60s.
Metrics are scraped from k8s/OpenShift once every 30s into Hawkular/Prometheus.
Currently we query data from Hawkular/Prometheus every 20s, which creates responses with missing data.
After discussion we decided to query Hawkular/Prometheus every 60s, because it is a multiple of the base rate (e.g. 30s) and allows for a missing heartbeat once in a while.
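The window arithmetic behind the 60s choice can be checked directly: 30s-spaced scrapes guarantee that every 60s window contains samples, while some 20s windows contain none. A small sketch (names are illustrative, not from the PR):

```ruby
SCRAPE = 30 # k8s/OpenShift scrape spacing in seconds

# Count 30s-spaced sample times falling in [offset, offset + window)
def samples_in_window(window, offset)
  (offset...offset + window).count { |t| (t % SCRAPE).zero? }
end

(0...SCRAPE).map { |o| samples_in_window(20, o) }.min # => 0, some 20s windows are empty
(0...SCRAPE).map { |o| samples_in_window(60, o) }.min # => 2, every 60s window has samples
```

Any window length of 60s onward spans at least two scrapes, so a single missed heartbeat still leaves one sample per window, which is the tolerance the description refers to.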
Removed from #159
BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1524183