Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[suggestion] Omitting metrics if not success #579

Closed
zerkms opened this issue Mar 5, 2020 · 8 comments · Fixed by #865
Closed

[suggestion] Omitting metrics if not success #579

zerkms opened this issue Mar 5, 2020 · 8 comments · Fixed by #865

Comments

@zerkms
Copy link

zerkms commented Mar 5, 2020

At the moment if the http probe fails probe_success=0 - the exporter still exports

probe_http_duration_seconds{phase="connect"} 0
probe_http_duration_seconds{phase="processing"} 0
probe_http_duration_seconds{phase="resolve"} 0
probe_http_duration_seconds{phase="tls"} 0
probe_http_duration_seconds{phase="transfer"} 0

Which is a lie - none of those phases completed in a blazing fast 0 seconds.

Why this is a problem: if you perform any analytics over those metrics (eg percentiles, or min) it would be skewed.

Suggestion: what if the exporter does not export metrics it could not really scrape instead of exporting zero values?

@brian-brazil
Copy link
Contributor

That a probe failed doesn't mean that the HTTP request failed, the status code could have been wrong for example.

@zerkms
Copy link
Author

zerkms commented Mar 6, 2020

@brian-brazil that's right, but imagine the following query:

min_over_time(probe_http_duration_seconds{phase="connect"}[1h]) which semantically means: "the quickest tcp connect over period", practically it would return 0 though, which is wrong.

My core point here is - if a particular metric was not actually scraped (for whatever reason: be it a transport error, or http status code error, or any other error) - the complete metric must be removed from the response, not the zero value returned as it's now.

@brian-brazil
Copy link
Contributor

The TCP connection time doesn't become invalid because a subsequent HTTP request failed, this isn't sane in general.

@zerkms
Copy link
Author

zerkms commented Mar 7, 2020

@bbigras

The TCP connection time doesn't become invalid because a subsequent HTTP request failed, this isn't sane in general.

That's NOT what I said.

I probably am terrible at explaining, but I will try once again.

If a particular metric was not actually scraped for whatever reason - it must not be exported, at all, since 0 is a lie.

Returning to my initial example:

probe_http_duration_seconds{phase="connect"} 0
probe_http_duration_seconds{phase="processing"} 0
probe_http_duration_seconds{phase="resolve"} 0
probe_http_duration_seconds{phase="tls"} 0
probe_http_duration_seconds{phase="transfer"} 0

Here are the metrics exported for a target with not-existing DNS address. Do you think those zeroes make any sense?

@SuperQ
Copy link
Member

SuperQ commented Mar 7, 2020

@brian-brazil I think the point is that exposing TCP connection duration isn't valid if we fail early, for example at URL parsing or DNS lookup.

Currently we init all the label values far too early in the setup of the probe. This results in invalid (0) results for data when we haven't even reached the connect phase of the probe.

SuperQ added a commit that referenced this issue Mar 7, 2020
Don't init http probe phase labels for phases we haven't observed.
Avoids exposing 0 value samples for things we haven't measured.

Fixes: #579

Signed-off-by: Ben Kochie <superq@gmail.com>
SuperQ added a commit that referenced this issue Mar 9, 2021
Don't init http probe phase labels for phases we haven't observed.
Avoids exposing 0 value samples for things we haven't measured.

Fixes: #579

Signed-off-by: Ben Kochie <superq@gmail.com>
SuperQ added a commit that referenced this issue Jan 22, 2022
Don't init http probe phase labels until we're ready to send the
request.

Avoids exposing 0 value samples for things we haven't started to
measure.

Fixes: #579

Signed-off-by: SuperQ <superq@gmail.com>
@houstonheat
Copy link

houstonheat commented Jan 25, 2022

Hi, faced the same confusing situation:
https://groups.google.com/u/1/g/prometheus-users/c/NbyDxdk05K4

The metrics return the value 0, but phases, in principle, did not start processing at all (except dns lookup). This set us on the wrong track of latency diagnostics followed by metrics dashboards:
probe_duration_decrease
probe_duration
The check time of probe_duration_seconds will tend to the timeout value while the phases metrics will decrease to 0.

It would be much clearer if the phases that did not actually occur for the query were not present in the results but I'm not sure how this compares to the best practices for writing exporters.

# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 0.003334816
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 5.000336144
# HELP probe_failed_due_to_regex Indicates if probe failed due to regex
# TYPE probe_failed_due_to_regex gauge
probe_failed_due_to_regex 0
# HELP probe_http_content_length Length of http content response
# TYPE probe_http_content_length gauge
probe_http_content_length 0
# HELP probe_http_duration_seconds Duration of http request by phase, summed over all redirects
# TYPE probe_http_duration_seconds gauge
probe_http_duration_seconds{phase="connect"} 0
probe_http_duration_seconds{phase="processing"} 0
probe_http_duration_seconds{phase="resolve"} 0.003334816
probe_http_duration_seconds{phase="tls"} 0
probe_http_duration_seconds{phase="transfer"} 0
# HELP probe_http_ssl Indicates if SSL was used for the final redirect
# TYPE probe_http_ssl gauge
probe_http_ssl 0
...
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 0

SuperQ added a commit that referenced this issue Jan 27, 2022
Don't init http probe phase labels until we're ready to send the
request.

Avoids exposing 0 value samples for things we haven't started to
measure.

Fix http resolve timing to report lookup time even if it's an error
(ie not found).

Fixes: #579

Signed-off-by: SuperQ <superq@gmail.com>
roidelapluie pushed a commit that referenced this issue Feb 17, 2022
Don't init http probe phase labels until we're ready to send the
request.

Avoids exposing 0 value samples for things we haven't started to
measure.

Fix http resolve timing to report lookup time even if it's an error
(ie not found).

Fixes: #579

Signed-off-by: SuperQ <superq@gmail.com>
@IMAGNUMI
Copy link

How can i get metric probe_http_duration_seconds "time now" and " 24h previous"?

@zerkms
Copy link
Author

zerkms commented Sep 17, 2022

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants