Fix RPC retry logic in nomad client's rpc.go for blocking queries #9266

Merged 2 commits into hashicorp:master on Nov 30, 2020

Conversation

@benbuzbee (Contributor) commented:

Overhauls the RPC retry logic based on recent learnings about how it can fail. For full detail about those learnings and the care we put into this, please see #9265.

I realize this probably looks scary, so I am also available for a voice chat, and I hope the details in #9265 help.

Description

The fix we merged into Nomad for the above issue is a good one - without it, retries cannot possibly work - but it does have a flaw.

Consider:

  1. A long poll is made with a timeout of 5 minutes
  2. A retriable error occurs at the 4 minute mark
  3. We make a new request to the server

What you would expect is that the second request (step 3) has a timeout of 1 minute. Unfortunately, that is not what happens. The request is made again from the top - with a timeout of 5 minutes, just like the first - so the entire operation could take as long as 9 minutes, even though the client asked for 5.

Even worse, if another retriable error occurs at the 7-minute mark, the code will not retry, because it sees that 7 minutes have elapsed against a 5-minute timeout, and the client gets an EOF. What it should have seen instead was a timeout at the 5-minute mark.

The correct fix for this is clear: step 3 should make a request to the server with a timeout of 1 minute (the original timeout minus the time already elapsed). However, how to implement that is not so straightforward.
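To make the intended behavior concrete, here is a minimal sketch assuming a hypothetical retry helper (the function names and error classification are illustrative, not Nomad's actual client code): each attempt is handed only the time remaining against the caller's original deadline.

```go
package retrysketch

import (
	"context"
	"errors"
	"io"
	"time"
)

// retryBlockingQuery sketches "original timeout minus elapsed time" on retry.
// retryBlockingQuery and doRPC are hypothetical names, not Nomad's API.
func retryBlockingQuery(maxQueryTime time.Duration, doRPC func(timeout time.Duration) error) error {
	deadline := time.Now().Add(maxQueryTime)
	for {
		remaining := time.Until(deadline)
		if remaining <= 0 {
			// The client's budget is spent; surface a timeout, not an EOF.
			return context.DeadlineExceeded
		}
		// Each attempt is only given whatever budget is left.
		err := doRPC(remaining)
		if err == nil || !isRetriable(err) {
			return err
		}
	}
}

// isRetriable is a stand-in for whatever classification the client uses,
// e.g. treating io.EOF from a lost server connection as retriable.
func isRetriable(err error) bool {
	return errors.Is(err, io.EOF)
}
```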

Because the RPC helpers bury the request timeout in interfaces and re-infer defaults on the server, there is no easy way for the RPC helper to adjust the request timeout to account for the time already elapsed.

One way to do that is with reflection. It is an inelegant solution that fixes the bug with a hammer. To fix it properly, the client would need a way to tell the server the remaining timeout in a form that is not invisible and type-erased to the RPC function; however, the complexity and risk of such a change do not seem appropriate.
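For illustration, the reflection "hammer" could look roughly like the sketch below; the helper name is hypothetical and the field names merely mirror Nomad's structs, so treat it as a sketch of the approach rather than the code in this PR.

```go
package reflectsketch

import (
	"reflect"
	"time"
)

// setRemainingBlockTime walks pointers down to the concrete request struct
// and, if it embeds a QueryOptions with a MaxQueryTime field, overwrites that
// field with the remaining budget so a retried request cannot exceed the
// caller's original deadline. Illustrative only.
func setRemainingBlockTime(args interface{}, remaining time.Duration) {
	v := reflect.ValueOf(args)
	for v.Kind() == reflect.Ptr && !v.IsNil() {
		v = v.Elem()
	}
	if v.Kind() != reflect.Struct {
		return
	}
	qo := v.FieldByName("QueryOptions")
	if !qo.IsValid() || qo.Kind() != reflect.Struct {
		return
	}
	mqt := qo.FieldByName("MaxQueryTime")
	if mqt.IsValid() && mqt.CanSet() && mqt.Type() == reflect.TypeOf(remaining) {
		mqt.Set(reflect.ValueOf(remaining))
	}
}
```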

Tested by

The following cases were tested. As a note, we have been running this patch at Cloudflare on thousands of nodes for the past two weeks and have seen only improvement.

1. Ran this bash script for 8 minutes. It did eventually fail due to "no cluster leader", but everything retried appropriately:
```
while :; do sudo docker restart nomad-server-$(($RANDOM % 5)) && sleep $((5 + $RANDOM % 10)); done
```
2. Changed the default blocking time and logged how long blocking queries took while the above script ran. They took 30-40s each, which is expected once you add a default RPCHoldtime of 5s that can be applied twice for blocking queries. This seems sane to me.

3. Added test code to force an EOF error every 10 seconds (separate from the above tests) and verified it always retries.
@tgross (Member) left a comment:

Hi @benbuzbee! This looks great! I've left a question about the reflection approach, but it's possible I'm just missing something there.

client/rpc.go Outdated
```
@@ -46,19 +47,53 @@ func (c *Client) StreamingRpcHandler(method string) (structs.StreamingRpcHandler
	return c.streamingRpcs.GetHandler(method)
}

// Given a type that is or eventually points to a concrete type with an embedded QueryOptions
```
@tgross (Member) commented:

These reflection-based functions are really here because we've removed HasTimeout from the RPCInfo interface, which was implemented by both QueryOptions and WriteRequest. It's been replaced only on the QueryOptions side by TimeToBlock. But it looks to me like in all cases where we call TimeToBlock, we test for a non-zero result.

So couldn't we avoid this by having TimeToBlock on the RPCInfo interface, having WriteRequest implement it, but have WriteRequest always return 0? That seems cheaper to me if we can make it work.
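A minimal sketch of what that suggestion could look like, with illustrative type definitions rather than the actual structs package:

```go
package rpcinfosketch

import "time"

// RPCInfo is what the client's RPC helpers see for every request. Putting
// TimeToBlock on the interface lets the retry loop ask any request how long
// it still intends to block, with no reflection needed.
type RPCInfo interface {
	TimeToBlock() time.Duration
}

// QueryOptions backs blocking queries and reports its block time.
type QueryOptions struct {
	MaxQueryTime time.Duration
}

func (q *QueryOptions) TimeToBlock() time.Duration { return q.MaxQueryTime }

// WriteRequest never blocks, so it always reports zero; callers that test
// TimeToBlock for a non-zero result simply skip the blocking-query path.
type WriteRequest struct{}

func (w *WriteRequest) TimeToBlock() time.Duration { return 0 }
```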

@benbuzbee (Contributor, Author) replied:

I originally considered this but for some reason thought it was better to use reflection; maybe that was early in my thinking, before it evolved a bit. Either way, I don't see any reason not to now, so I updated the PR.

@benbuzbee (Contributor, Author) commented:

For posterity, here is a simple Go app I used to test the code: I ran it while killing Nomad servers in the cluster.

```go
package main

import (
	"fmt"
	"math"
	"os"
	"os/signal"
	"sync/atomic"

	"math/rand"
	"time"

	"github.com/hashicorp/nomad/api"
)

// How many blocking requests to run in parallel
const numBlockingRequests int = 1000

// When asserting timings, how much grace to give in the calculation for
// rpc hold time and jitter
const timingGrace = time.Minute

var successfulUpdates uint64 = 0
var successfulWaits uint64 = 0

func main() {
	cfg := api.DefaultConfig()
	c, err := api.NewClient(cfg)
	if err != nil {
		panic(err)
	}

	startTime := time.Now()
	printSummary := func() {
		fmt.Printf("We lasted for %s\n", time.Now().Sub(startTime))
		fmt.Printf("We successfully got updated indexes %d times\n", atomic.LoadUint64(&successfulUpdates))
		fmt.Printf("We successfully got waited max block duration %d times\n", atomic.LoadUint64(&successfulWaits))
	}
	for i := 0; i < numBlockingRequests; i++ {
		go func() {
			// In case panic
			defer printSummary()
			BlockRandomLoop(c)
		}()
	}

	sigChan := make(chan os.Signal, 1)
	signal.Notify(sigChan, os.Interrupt, os.Kill)
	<-sigChan
	printSummary()
}

// BlockRandomLoop creates a blocking long-poll to a Nomad endpoint and asserts there are no errors
func BlockRandomLoop(c *api.Client) {
	time.Sleep(time.Duration(rand.Int63()) % (time.Second * 1))
	i := uint64(1)
	for {
		// Random time between 0 and 5 minutes
		blockTime := (time.Duration(rand.Int63()) % (time.Minute * 5))

		startTime := time.Now()
		_, meta, err := c.Jobs().List(&api.QueryOptions{
			WaitIndex: i,
			WaitTime:  blockTime,
		})
		elapsedDuration := time.Now().Sub(startTime)

		if err != nil {
			panic(err)
		}

		if meta.LastIndex != i {
			// Unblocked because stuff changed, should be < requested blocked time
			if elapsedDuration > (blockTime + timingGrace) {
				panic(fmt.Errorf("%s > %s", elapsedDuration, blockTime))
			}
			i = meta.LastIndex
			atomic.AddUint64(&successfulUpdates, 1)
			continue
		}

		// Should have blocked for "exactly" WaitTime
		e := time.Duration(math.Abs(float64(elapsedDuration - blockTime)))
		if e > timingGrace {
			panic(fmt.Errorf("block error too high between expected '%s' and actual '%s'", blockTime, elapsedDuration))
		}
		atomic.AddUint64(&successfulWaits, 1)
	}
}
```

@tgross (Member) left a comment:

LGTM 👍

@tgross tgross added this to the 1.0 milestone Nov 25, 2020
@tgross (Member) commented Nov 25, 2020:

Sorry for the delay on merging this @benbuzbee. I want to get one more Nomad engineer's eyes on this, but we're heading into the holiday weekend here in the US. I'm going to make sure this lands in Nomad 1.0.0, so it'll get merged sometime next week I'm sure.

@tgross tgross merged commit 6a6547b into hashicorp:master Nov 30, 2020
@tgross (Member) commented Nov 30, 2020:

This will ship in 1.0.0. The changelog already includes the original #8921 so nothing to add there. Thanks again @benbuzbee!

@github-actions bot commented Dec 8, 2022:

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 8, 2022