
Enforce bounds on MaxQueryTime #9064

Merged · 1 commit · Oct 15, 2020

Conversation

@pierreca (Contributor) commented Oct 9, 2020

The MaxQueryTime value used in QueryOptions.HasTimedOut() can be set to
an invalid value that would throw off how RPC requests are retried.

This fix uses the same logic that enforces the MaxQueryTime bounds in the
blockingRPC() call.
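
For illustration, here is a minimal sketch of the kind of bounds enforcement described above; it is not the actual patch, and the maxQueryTime and defaultQueryTime constants are assumed placeholders rather than Nomad's real configuration values:

```go
package sketch

import "time"

// Assumed placeholder limits; the real values come from Nomad's configuration.
const (
	defaultQueryTime = 300 * time.Second
	maxQueryTime     = 300 * time.Second
)

// boundedQueryTime clamps a user-supplied MaxQueryTime before it is used
// in a timeout check, mirroring the bounds that blockingRPC() enforces on
// the server side.
func boundedQueryTime(requested time.Duration) time.Duration {
	if requested > maxQueryTime {
		return maxQueryTime
	}
	if requested <= 0 {
		return defaultQueryTime
	}
	return requested
}
```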

@tgross (Member) left a comment

Hi @pierreca! Thanks for opening this PR!

I'm not totally sure I understand the circumstances of the issue here.

I did a search for callers of that HasTimedOut method, and it's only being called in the client RPC (which turns out to be from your original PR, #8921). The default configuration for that RPC hold timeout is much lower than it is for the server: only 5 seconds. Can you explain a bit more about what you're trying to do here?

@pierreca (Contributor, Author) commented:

@tgross

Say you have a long-poll RPC request between a client and a server, and MaxQueryTime is not set by the user of the API. The server's default behavior is to pin it to a default value (300 seconds), but the client uses it without checking its bounds, and if it isn't set it defaults to 0. IMHO this is problematic for two reasons:

  • It is unchecked, unbounded user input that can be used to throw off the retry code.
  • It is inconsistent: if MaxQueryTime isn't set at all, the default reverts to RPCHoldTimeout, which only accounts for 5 seconds of leader election and is not an appropriate default for a long-poll request. If the server's default timeout for a long-poll request is 300 seconds, the client should match it, and right now it doesn't.

This PR fixes both problems, and it is the most minimal way I have found to do it.

This is part of a larger scenario that I'm trying to fix: a user should be able to have a long-poll request opened through a client survive a server dying and a leader election taking place. There's a very easy way to test this with a minimal cluster of 3 or 5 servers, a client, and a long-poll request made to the client (for node state, for example). Kill the leader and watch the request fail with a wrapped RPC error.
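
For reference, a long-poll (blocking) query of that kind could be issued with Nomad's Go API client roughly like this; the address, wait index, and wait time below are placeholder values for a local test cluster:

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(&api.Config{Address: "http://127.0.0.1:4646"})
	if err != nil {
		log.Fatal(err)
	}

	// Block for up to 5 minutes waiting for the node list to change past
	// the given index. Killing the leader while this poll is in flight is
	// the failure scenario discussed in this thread.
	nodes, meta, err := client.Nodes().List(&api.QueryOptions{
		WaitIndex: 1,
		WaitTime:  5 * time.Minute,
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(len(nodes), "nodes at index", meta.LastIndex)
}
```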

Right now, this does not work for multiple reasons (and IMHO this breaks Nomad's HA promise):

@tgross (Member) commented Oct 12, 2020

> This is part of a larger scenario that I'm trying to fix: a user should be able to have a long-poll request opened through a client survive a server dying and a leader election taking place. There's a very easy way to test this with a minimal cluster of 3 or 5 servers, a client, and a long-poll request made to the client (for node state, for example). Kill the leader and watch the request fail with a wrapped RPC error.

I'm not sure that can be made to work, regardless of timeouts. In that scenario, the request is being forwarded through the leader (as all RPCs are) to the server connected to that client node, and from there forwarded to the client. If the leader is killed, there's no way to maintain that connection; it must be retried by the API client by sending the request again (either directly to the new leader or to another server that forwards it to the leader).

@pierreca (Contributor, Author) commented Oct 12, 2020

> the request is being forwarded through the leader (as all RPCs are) to the server connected to that client node, and from there forwarded to the client.

I agree with the path taken by the request. This is my understanding of a typical request path, with an optional follower in the middle:

api user -> client -> [server (follower) ->] server (leader)

and the response follows the same path in the opposite direction.

> If the leader is killed, there's no way to maintain that connection;

There is existing code in the client that rotates servers and tries to survive leader election: https://github.com/hashicorp/nomad/blob/master/client/rpc.go#L83

The RPCHoldTimeout documentation seems to support this idea:

// RPCHoldTimeout is how long an RPC can be "held" before it is errored.

> it must be retried by the API client by sending the request again (either directly to the new leader or to another server that forwards it to the leader).

I think that in order to uphold the high-availability promise of Nomad, you have to make it the client's responsibility to retry, not the user's: the client has all the necessary information AND pre-existing code supporting this, and that's what I'm trying to fix :)
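
To illustrate the hold-and-retry behavior being described, here is a simplified sketch, not the actual client/rpc.go code; rpcHoldTimeout, call, and isLeadershipErr are hypothetical stand-ins for the real configuration and forwarding logic:

```go
package sketch

import (
	"errors"
	"time"
)

// Assumed placeholder for the configured hold timeout.
const rpcHoldTimeout = 5 * time.Second

// retryRPC keeps retrying a call that fails with a leadership error until
// a deadline derived from RPCHoldTimeout passes, so the caller sees a slow
// response instead of an immediate error.
func retryRPC(call func() error, isLeadershipErr func(error) bool) error {
	deadline := time.Now().Add(rpcHoldTimeout)
	for {
		err := call()
		if err == nil || !isLeadershipErr(err) {
			return err
		}
		if time.Now().After(deadline) {
			return errors.New("rpc hold timeout exceeded: " + err.Error())
		}
		// Back off briefly before retrying, presumably against another server.
		time.Sleep(100 * time.Millisecond)
	}
}
```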

@tgross (Member) commented Oct 13, 2020

(posting that whole comment here: config.go#L222-L226)

> RPCHoldTimeout is how long an RPC can be "held" before it is errored.
> This is used to paper over a loss of leadership by instead holding RPCs,
> so that the caller experiences a slow response rather than an error.
> This period is meant to be long enough for a leader election to take
> place, and a small jitter is applied to avoid a thundering herd.

So by my reading of this, the only thing extending the RPCHoldTimeout gets us is longer timeouts if the leader election takes more than 5s?

@pierreca (Contributor, Author) commented:

@tgross I'm not suggesting extending RPCHoldTimeout. The goal of this constant, if I understand its description, is to extend any timeout by some arbitrary duration in order to account for leader election. I think it is fine as is. Taking a step back:

Any retry attempt on a long-poll request has to deal with 3 variables:

  1. the maximum duration of the request as specified by MaxQueryTime
  2. some jitter to avoid thundering herd problems
  3. some arbitrary time to allow for leader election in case the request is retried because the leader was lost (that is RPCHoldTimeout)

What I'm trying to make consistent is 1.

On the server side (nomad/rpc.go), the MaxQueryTime value passed by the user is checked to make sure it is bounded between 0 and the maximum allowed, and if it is not set, it is set to the default (sane) value.

On the client side (client/rpc.go), the RPC code doesn't do any of that; it just uses what the user passes without checking it. That is neither a sane nor a safe default. This PR is just a way to align the client behavior with the server behavior.
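
As a rough illustration of how the three variables listed above might combine, here is a sketch with assumed placeholder values, not the actual nomad/rpc.go code:

```go
package sketch

import (
	"math/rand"
	"time"
)

// Assumed placeholder values; the real ones come from Nomad's server and
// client configuration.
const (
	defaultQueryTime = 300 * time.Second
	maxQueryTime     = 300 * time.Second
	rpcHoldTimeout   = 5 * time.Second
	jitterFraction   = 16
)

// blockingDeadline combines the three variables: a bounded MaxQueryTime,
// a small jitter, and an allowance for leader election.
func blockingDeadline(requested time.Duration) time.Time {
	// 1. Bound the user-supplied MaxQueryTime.
	queryTime := requested
	if queryTime > maxQueryTime {
		queryTime = maxQueryTime
	} else if queryTime <= 0 {
		queryTime = defaultQueryTime
	}
	// 2. Add a small jitter to avoid a thundering herd of retries.
	queryTime += time.Duration(rand.Int63n(int64(queryTime / jitterFraction)))
	// 3. Leave room for a leader election (RPCHoldTimeout).
	return time.Now().Add(queryTime + rpcHoldTimeout)
}
```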

@tgross (Member) commented Oct 15, 2020

Oh ok, I understand now. Apologies for my density, @pierreca, and thanks for walking me through it! 😀

@tgross (Member) left a comment

LGTM 👍

@tgross merged commit 2b56475 into hashicorp:master on Oct 15, 2020
fredrikhgrelland pushed a commit to fredrikhgrelland/nomad that referenced this pull request Oct 22, 2020
@github-actions

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 14, 2022