
use exec probe for queue proxy readiness check #4148

Merged: 6 commits merged into knative:master from joshrider:execprobe on Jun 25, 2019

Conversation

joshrider (Contributor) commented May 22, 2019

Fixes #3308

Proposed Changes

  • Replaces the HTTPGet ReadinessProbe with an ExecAction. The ExecAction recreates the HTTPGet semantics from within the queue proxy container and hits the same health check HTTP path as before. This approach was taken because the queue proxy process maintains state on whether it's active.

Also worth noting that this makes us dependent on how ko builds images. If the binary ever stops showing up at /ko-app/queue, this will break.
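For context, here is a rough Go sketch (not the PR's verbatim code) of the probe shape this change produces. The exec command comes from the diff quoted later in this thread, and the 10-second timeout matches the constant added alongside it.

package sketch

import corev1 "k8s.io/api/core/v1"

// queueReadinessProbe sketches the readiness probe attached to the queue-proxy
// container after this change: an ExecAction that re-runs the queue binary in
// probe mode instead of a kubelet HTTPGet.
func queueReadinessProbe() *corev1.Probe {
	return &corev1.Probe{
		Handler: corev1.Handler{
			Exec: &corev1.ExecAction{
				// Relies on ko placing the queue binary at /ko-app/queue.
				Command: []string{"/ko-app/queue", "-probe", "true"},
			},
		},
		// Matches the queue-proxy's 10s probe window discussed below.
		TimeoutSeconds: 10,
	}
}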

Release Note

NONE

@googlebot added the "cla: yes" label (indicates the PR's author has signed the CLA) on May 22, 2019
@knative-prow-robot added the "size/L" label (denotes a PR that changes 100-499 lines, ignoring generated files) on May 22, 2019
knative-prow-robot (Contributor) left a comment:

@joshrider: 0 warnings.

In response to this: (the PR description quoted above)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

markusthoemmes (Contributor):

/assign

markusthoemmes (Contributor) left a comment:

Thanks for doing this! 🎉 🎉

A few nits throughout.

(Four inline review threads on cmd/queue/main.go, since resolved.)
markusthoemmes (Contributor) left a comment:

A few more in-depth comments on the retry logic at play.

We currently retry in at least two places:

  1. In the newly added logic to retry on /health
  2. In /health itself until the user-container is up

The timeouts of both need to be aligned for each respective mechanism to work correctly. With the proposed changes, though, it feels like we could maybe strip the retries from the /health route itself and send a single TCP probe per call to /health (if not already healthy). That way the only retry would be in the newly proposed logic.

Thoughts?
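To make the two layers concrete, here is a rough Go sketch. The function names appear elsewhere in this thread, but the signatures, bodies, and /health handling are simplified assumptions; the 25ms/10s values are the ones the thread settles on later, while the diff as first submitted (quoted just below) used 100ms/1s.

package sketch

import (
	"fmt"
	"net"
	"net/http"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// Layer 1: the exec probe run by the kubelet polls the queue-proxy's own
// /health endpoint until it answers 200.
func probeQueueHealthPath(port int) error {
	url := fmt.Sprintf("http://127.0.0.1:%d/health", port)
	return wait.PollImmediate(25*time.Millisecond, 10*time.Second, func() (bool, error) {
		resp, err := http.Get(url)
		if err != nil {
			return false, nil // queue-proxy not listening yet; keep retrying
		}
		defer resp.Body.Close()
		return resp.StatusCode == http.StatusOK, nil
	})
}

// Layer 2: the /health handler itself retries a TCP probe of the
// user-container until it accepts connections.
func probeUserContainer(addr string) error {
	return wait.PollImmediate(25*time.Millisecond, 10*time.Second, func() (bool, error) {
		conn, err := net.Dial("tcp", addr)
		if err != nil {
			return false, nil // user-container not up yet; keep retrying
		}
		conn.Close()
		return true, nil
	})
}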


var lastErr error

timeoutErr := wait.PollImmediate(100*time.Millisecond, time.Second, func() (bool, error) {
Contributor:

Should we give this some more leeway? As it stands, this will fail if the queue-proxy doesn't actually come up within a second. It feels to me as if we could have a much tighter retry interval, because the actual request will block again as soon as the queue-proxy comes up, until the user-container has been successfully probed.

As such, each retry is pretty cheap and we can likely lower this to 50ms or even less. However, I think the upper bound can be bumped towards 5 or even 10 seconds to make sure we practically never retry on kube's periodSeconds interval.

Contributor:

+1 - if we long poll here we can make this initial retry loop very tight.

Can we also pull this loop timeout out into a constant?

joshrider (Contributor Author):

Turned this up to a 10 second timeout with a 25 millisecond interval.

We should totally take advantage of the full 10 second timeout on the ExecProbe. That was an oversight in the initial implementation.

The 25ms retry interval is a totally made up number. 🤷‍♂️

Member:

25ms (40 times/sec) seems like it might be too aggressive. Help me convince myself that's not true.

Up to the provided frequency, we will:

  1. Open a TCP socket
  2. Send an HTTP request
  3. Get an HTTP response
  4. Check that the response code is 200

But, wait.PollImmediate will ignore timer ticks while the ConditionFunc is being evaluated, so if it takes (for example) 33ms to do 1-4, you will end up polling every 50ms. Additionally, this will only happen until we get a 200, so in the normal case, we don't care about the retry window.

The worst case for this would be something that actually spends a bunch of time working to get a good health check answer, and these requests end up clogging up the CPU, while blocking actual initialization from taking place (which would end the health check pounding).

Might be worth documenting next to the 25ms that the number was made up as a balance between "don't make the user wait" and "don't pummel the container too hard".
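A toy program (not from the PR) illustrating the tick-skipping behaviour described above: with a 25ms interval but a condition that takes roughly 33ms, ticks that fire while the condition runs are dropped, so the effective period stretches to about 50ms instead of calls piling up.

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	calls := 0
	start := time.Now()
	// The condition never succeeds, so this runs until the 500ms timeout.
	_ = wait.PollImmediate(25*time.Millisecond, 500*time.Millisecond, func() (bool, error) {
		calls++
		time.Sleep(33 * time.Millisecond) // stand-in for a slow health check
		return false, nil
	})
	// Expect roughly 10 condition calls rather than the ~20 a naive 25ms period implies.
	fmt.Printf("%d condition calls in %v\n", calls, time.Since(start))
}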

joshrider (Contributor Author):

Added a comment explaining the "best guess" nature of the retry interval. Seems like something to keep an eye on and adjust in future.

greghaynes (Contributor) commented May 30, 2019:

Adding some detail for the sake of clarity:

In the queue-proxy we are probing here, once a connection can be established we hold it open for up to 10s. This means there are two cases for the work we will do here, which differ a bit from what's described above.

Until the queue-proxy is running we will:

  1. open a TCP socket
  2. get a RST (since nothing is listening on the socket yet) and retry

This is the loop we want to optimize to retry quickly, as it is very lightweight. Once the queue-proxy is up we do essentially as you describe (send an HTTP request, get a response), but waiting on the response will block for up to 10s until the user container is ready, so our poll frequency here is mostly irrelevant.

@@ -26,6 +26,8 @@ import (
"os"
"time"

"github.com/pkg/errors"
Contributor:

Do you mind collapsing the imports here? (we should have a bot for this... @mattmoor another socketpuppet?)

joshrider (Contributor Author):

done!

// Do not use the cached connection
DisableKeepAlives: true,
},
Timeout: 500 * time.Millisecond,
Contributor:

Is this an end-to-end request timeout? If so, I think this is too short and doesn't make use of the queue-proxy's current mechanism of keeping the connection on /health open for 10 seconds.

It feels like we might want these retries in only one place. Maybe we should strip them out of the prober itself and only have this health binary retry tightly?

Contributor:

Practically, I don't think it matters much either way, so feel free to ignore, but my preference would be to long-poll here as well: better response, less network/CPU spamming, etc.

Regardless, can we pull this timeout (or the retry interval, if we go that route) out into a constant?
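As a sketch of what that could look like, using the constant name and 10-second value that show up later in this thread (the client construction mirrors the snippet above, but this is not the PR's exact code):

package sketch

import (
	"net/http"
	"time"
)

// queueProbeTimeout is the end-to-end budget for one probe attempt, set equal
// to the queue-proxy's ExecProbe timeout so the /health long-poll can use the
// full window.
const queueProbeTimeout = 10 * time.Second

var probeClient = &http.Client{
	Transport: &http.Transport{
		// Do not reuse a cached connection between probe attempts.
		DisableKeepAlives: true,
	},
	Timeout: queueProbeTimeout,
}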

joshrider (Contributor Author):

Put this up to 10 seconds. Makes sense to use the full 10 seconds on /health.

I don't understand all of the moving parts, but it seems like we may want to keep the retry logic in the prober. The function with the prober (probeUserContainer) is also used by a handler on the queueServingPort when a request has a K-Network-Probe header. Do we care about retries there?
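For readers unfamiliar with that second caller, a heavily hedged sketch of what such a handler might look like. The K-Network-Probe header name comes from the comment above; the wiring and the probe stub are assumptions, not the actual queue-proxy code.

package sketch

import "net/http"

// probeUserContainerOK stands in for the real TCP probe of the user-container
// that lives in cmd/queue/main.go.
func probeUserContainerOK() bool { return true }

// withProbeHeader answers requests carrying the K-Network-Probe header by
// probing the user-container, and forwards everything else to the proxy.
func withProbeHeader(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("K-Network-Probe") != "" {
			if probeUserContainerOK() {
				w.WriteHeader(http.StatusOK)
			} else {
				w.WriteHeader(http.StatusServiceUnavailable)
			}
			return
		}
		next.ServeHTTP(w, r) // normal request path
	})
}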

greghaynes (Contributor) left a comment:

Thanks a ton for doing this!

(The inline comments attached to this review duplicate the threads quoted above.)

markusthoemmes (Contributor) left a comment:

One more comment from my end.

Have we verified via @greghaynes's tracing that this actually has an effect and is worth the added complexity/is behaving as intended?

@@ -61,6 +60,10 @@ const (
// in the mesh.
quitSleepDuration = 20 * time.Second

// Set equal to the queue-proxy's ExecProbe timeout to take
// advantage of the full window
queueProbeTimeout = 10 * time.Second
Contributor:

Please use this in the actual probeUserContainer as well to make sure all of these values are properly aligned.

joshrider (Contributor Author):

I wasn't sure whether it was suitable to use it there since probeUserContainer was also used in another handler and I didn't feel confident totally tying the concepts together.

How is the K-Network-Probe header used by the system with respect to the queue-proxy?

If we don't reuse the same probe in the /health check and the K-Network-Probe, we can leave the network probe's functionality as-is and then potentially remove the retry/timeout logic from the /health "server-side" TCP probe. Having all of the readinessProbe retry interval/timeout logic live in a single place on the -probe HTTP "client" side of things would make it easier to read/reason about. There would be a little bit of duplicated code and each retried TCP probe would then also have the additional overhead of an HTTP roundtrip. Thoughts?

If not, should I just make a duplicate probeUserContainer and use the constant there?

And I'd definitely appreciate someone else booting it up and taking it for a spin. 😁

Contributor:

Not sure I'm following. The inner TCP loop currently has exactly this 10s timeout as well and those timeouts are meant to align, I think?

joshrider (Contributor Author):

Totally agree that they should align, and I guess that's a great reason for me to stop complaining and make the change. My reluctance came from the fact that probeUserContainer is also used by a handler for what appears to be another purpose, though I guess that can be worried about when it becomes an issue. I will make the swap now. 👍

joshrider (Contributor Author):

@markusthoemmes @greghaynes
Is the above-mentioned tracing tool available somewhere that I can run myself?

Contributor:

@joshrider #3491 will give you the most detail if you're able to rebase on it. Otherwise, using just master, if you enable tracing in config/config-tracing.yaml you should see a more consistent / smaller probe span in Zipkin.

(joshrider pushed a commit whose message repeats the PR description above.)
joshrider (Contributor Author):
/assign @tcnghia

greghaynes (Contributor):
/lgtm

@knative-prow-robot added the "lgtm" label (indicates that a PR is ready to be merged) on Jun 18, 2019
tcnghia (Contributor) commented Jun 18, 2019:

/approve
/lgtm

knative-prow-robot (Contributor):
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: joshrider, tcnghia

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot added the "approved" label (indicates a PR has been approved by an approver from all required OWNERS files) on Jun 18, 2019
tcnghia (Contributor) commented Jun 18, 2019:

/test pull-knative-serving-integration-tests

Port: intstr.FromInt(networking.QueueAdminPort),
Path: queue.RequestQueueHealthPath,
Exec: &corev1.ExecAction{
Command: []string{"/ko-app/queue", "-probe", "true"},
Member:

Can you confirm that it's our ultimate goal to:

  1. Pass the user HTTP/TCP probe through for the queue to evaluate
  2. Default the probe to a TCP probe

This seems like a reasonable first step, but I wanted to confirm we have a shared understanding of the end state.

joshrider (Contributor Author):

That is the plan for #4014, but I don't know that this is necessary for that.

The hope was for this to address #3308 (with the possibility of maybe using it in #4014).

@@ -64,6 +63,10 @@ const (
// in the mesh.
quitSleepDuration = 20 * time.Second

// Set equal to the queue-proxy's ExecProbe timeout to take
// advantage of the full window
probeTimeout = 10 * time.Second
Member:

Should we make this a flag so it can be kept consistent with TimeoutSeconds in the probe configuration? Maybe we shouldn't do this until we pass the entire probe descriptor through?
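Hypothetically, that flag could look something like the sketch below; the flag name and help text are made up, and the 10-second default matches the constant in the diff above.

package sketch

import (
	"flag"
	"time"
)

// probeTimeout would let operators keep the binary's probe window in sync
// with the readiness probe's TimeoutSeconds once the full probe spec is
// passed through.
var probeTimeout = flag.Duration("probe-timeout", 10*time.Second,
	"how long the readiness probe waits for the queue-proxy to report ready")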

joshrider (Contributor Author):

Yeah, the idea was to keep this separate from the issue of passing in the full probe so that something user-defined can be applied via the queue-proxy. There are a bunch of things in this file that will change once we start passing all of the configuration in.

greghaynes (Contributor) commented Jun 19, 2019:

I just fixed my local setup so I'm getting better tracing results now, and I was wrong earlier to say this looks good. Here's a picture of what I am seeing with this patch + the pod-tracing patch (using https://github.com/joshrider/serving/tree/execprobe-trace):

[screenshot: exec-probe-1]

Note the container_startup spans: the bottom one is the user application, which goes ready as soon as it runs (due to no readiness probe), and the top one is the queue-proxy. There's about a 0.6s delay between them (it really shouldn't take that long for the queue-proxy to start up).

Here's one from master where I got lucky and the kubelet didn't send the networking probe until after the queue-proxy had the socket open. Note how quickly after the user app goes ready the queue-proxy does as well. I'd think this is what a fix should look like:

[screenshot: master-3]

I don't grok why this is happening, though; maybe there's some other delay occurring before the exec probe starts running?

tcnghia (Contributor) commented Jun 19, 2019:

@greghaynes our queue-proxy also performs a TCP probe on the user-container, so maybe the 0.6 seconds is the time for the user-container web server to finish starting up (and listening)?

greghaynes (Contributor):
@tcnghia It could be coming from between the queue-proxy and the user-container, but I am surprised it would take nearly that long - in this case I'm testing the helloworld-go sample app. The second set of traces is, I think, a good baseline for what the startup cost should be (~100ms).

markusthoemmes (Contributor):
/hold

Until we figure out whether this is behaving as designed. Feel free to remove the hold if you are all fine with this and I misinterpreted the last comments.

@knative-prow-robot added the "do-not-merge/hold" label (indicates that a PR should not merge because someone has issued a /hold command) on Jun 19, 2019
joshrider (Contributor Author):
@greghaynes my tests with the pod-trace branch also looked a lot like your top graph. I'm going to play around a bit more with the tracing (and whatever other debug strategies I can cook up) to try to get a better idea of what is actually going on inside.

@markusthoemmes good call 👍

mattmoor (Member):
tl;dr Given the sad state of our CI, I agree with holding on this (and probably a lot of the rest of 0.7 items) until 0.8.

A few things come to mind:

  1. 600ms is faster than the 1s floor we get through kubelet (Something something bird in the hand),
  2. I'd like to start baking this early in 0.8,
  3. This is just the first piece of a bigger change to how we probe (I'd like to get all of it baking),
  4. Getting it in enables us to parallelize better.
  5. (bonus!) The traces above continue to be SUPER useful, if there are things we can do to incorporate them into our testing infrastructure, then shut up and take my money. :)

@greghaynes @markusthoemmes @tcnghia WDYT?

mattmoor (Member):
/retest

markusthoemmes (Contributor):
@mattmoor I agree, if it works and doesn't cause a regression, ship it!

mattmoor (Member):
Ok, I plan to /hold cancel after the cut Tuesday.

mattmoor (Member):
/hold cancel

@knative-prow-robot removed the "do-not-merge/hold" label on Jun 25, 2019
@knative-prow-robot knative-prow-robot merged commit 91c4e8e into knative:master Jun 25, 2019
@joshrider joshrider deleted the execprobe branch June 25, 2019 15:36
hohaichi pushed a commit to hohaichi/serving that referenced this pull request Jun 25, 2019
* use exec probe for queue proxy readiness check

Replaces HTTPGet ReadinessProbe with ExecAction. The ExecAction
recreates the HTTPGet semantics within the queue proxy container and hits
the same previous health check HTTP path. This approach was taken
since the queue proxy process maintains state on whether it's active.

* use descriptive name for error, pass port as int to probeQueueHealthPath()

* pull probe timeout into constant, tighten up retry interval, clean up imports

* add comment explaining queue probe retry interval

* remove out of date comment

* rename queueProbeTimeout, use constant in probeUserContainer
Labels

approved - Indicates a PR has been approved by an approver from all required OWNERS files.
area/API - API objects and controllers
area/networking
cla: yes - Indicates the PR's author has signed the CLA.
lgtm - Indicates that a PR is ready to be merged.
size/L - Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Good gRPC deployment pods frequently fail at least one health check
8 participants