Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

http probe - fix how the tls phase is calculated #758

Merged
merged 2 commits into from
Mar 18, 2021

Conversation

kgersen
Copy link
Contributor

@kgersen kgersen commented Mar 14, 2021

I have issue with some tls phase with the http probe. It appears it is caused by the way it's calculated (trace.gotConn - trace.dnsDone ) which isn't always reliable.
This proposed fix uses TLSHandshakeStart and TLSHandshakeDone hooks.

Signed-off-by: kgersen <kgersen@hotmail.com>
@@ -411,6 +424,8 @@ func ProbeHTTP(ctx context.Context, target string, module config.Module, registr
ConnectDone: tt.ConnectDone,
GotConn: tt.GotConn,
GotFirstResponseByte: tt.GotFirstResponseByte,
TLSHandshakeStart: tt.TLSHandshakeStart,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this got introduced in Go 1.8, which is fine, as we are already requiring at least Go 1.13

durationGaugeVec.WithLabelValues("tls").Add(trace.gotConn.Sub(trace.dnsDone).Seconds())
} else {
durationGaugeVec.WithLabelValues("connect").Add(trace.gotConn.Sub(trace.dnsDone).Seconds())
durationGaugeVec.WithLabelValues("tls").Add(trace.tlsDone.Sub(trace.tlsStart).Seconds())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is subtly changing the meaning of the tls phase, but I think it's OK since this reflects more accurately the duration of the TLS handshake. The existing method is measuring the handshake plus whatever time it takes to finish setting the connection up plus whatever time it takes to get to the point where the TLS handshake starts after DNS is done.

@kgersen
Copy link
Contributor Author

kgersen commented Mar 16, 2021

with this fix I ran in a new issue: the global "probe_duration_seconds" metric doesn't match the sum of the phases anymore like before.
The tls phase duration is now counted approximately twice since it's also part of 'connect' phase.
A fix would be to subtract "tls" from "connect" or this will introduces a breaking change (breaking stacked graphs for instance).

edit: I just reverted to how 'connect' was calculated before so to not introduce a breaking change.

so to clarify this PR:
in case of tls, 'connect' is unchanged, it's still connectDone - dnsDone
and 'tls' is changed it's now tlsDone - tlsStart
in case of no tls, 'connect' is unchanged, it's still gotConn - dnsDone

Signed-off-by: kgersen <kgersen@hotmail.com>
@mem mem merged commit 7f5deb2 into prometheus:master Mar 18, 2021
@dswarbrick
Copy link
Contributor

Out of curiosity, is this somehow related to #485 ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants