-
Notifications
You must be signed in to change notification settings - Fork 688
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FIXED] Added jitter in the reconnect logic #564
Conversation
I added a random delay capped to the reconnect wait. Also made sure that we kick out the doReconnect go routine (was previously possibly waiting for reconnectWait even when connection was closed. Should we have a minimum wait? For instance use random but for [reconnectWait/2:reconnectWait]? |
Some tests will now sometimes fail because they expect the connection to be reconnecting based on the ReconnectWait duration. Waiting for decision on wait time before fixing those (for instance: TestReconnectAllowedFlags |
go.mod
Outdated
github.com/nats-io/jwt v0.3.2 | ||
github.com/nats-io/nats-server/v2 v2.1.6 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that go mod tooling added this, but should we continue to remove it to avoid depending on server?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arg.. it's my local update of staticcheck that did that likely and I forgot to remove. Will update.
go.mod
Outdated
) | ||
|
||
go 1.13 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not completely sure of the impact of adding this one yet too...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will reset the go.mod/sum changes, but will be easier for me to do a git push force, so be aware.
nats.go
Outdated
@@ -1844,7 +1848,11 @@ func (nc *Conn) doReconnect(err error) { | |||
if sleepTime <= 0 { | |||
runtime.Gosched() | |||
} else { | |||
time.Sleep(time.Duration(sleepTime)) | |||
rt.Reset(time.Duration(rand.Int63n(sleepTime))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reconnect Wait time should still hold same semantics. Do not attempt reconnect for some period of time. What I think we need to add is a Jitter option which has a sane default, I might say 2 secs or so that is added onto reconnectWait.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So random between [0,2sec] added to the base, ok got it. Will update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Argg.. quite a bit of tests seem to expect no more than reconnectWait, so we may get some flappers pop up, will have to fix those along the way... unless you think that added jitter should be an option?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it should be an option, but should be low as default. For tests we can set to 0 which should do old behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
expect that some of those tests are in /test package so I cannot play that trick without exposing an option or function to set that value to zero..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jitter was always meant to be a public option, just wanted a sane default. For the case we were dealing with today they would probably set it to 5-10 secs.. I think that is alot for default cases where you have live backup servers and want quick reconnect. Maybe 1 second is plenty in that case, so [0,1] is default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok so we do make it an Option in that it is settable. Still, it means that all tests in /test that relied on no more than reconnectWait will have to be altered, but that's ok. Just that I may not find them all at once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes they will need to set Jitter to 0.. ;)
In my ideal world I want this to be an exponential back off. Or at least the ability to pass in a delay function to achieve that. Ie. Jitter by default but make it possible to do expo back offs via user supplied sleeper. Pass the sleeper a reconnect counter. At large site I had nats using all cpu for several minutes on restarts due to combos of slow TLS handshakes, timeouts and synced reconnects. |
eaa6848
to
b0b0bd3
Compare
@ripienaar Yes we were discussing about the serves and clients interacting. This is mostly to avoid thundering herd but if org controls both ends they can configure there clients to flatten the curve of reconnect rates, especially for high-demanding TLS handshakes, like 4K RSA keys ;) |
Sure, I am saying I want more control than this pr provides. This high cpu and literal outage for minutes was even with the fastest ciphers. |
ok gotcha.. |
@ripienaar @derekcollison I can certainly add a
Question: Should the value returned by the invocation of the user's custom reconnect jitter callback be used as-is (in addition to the reconnect wait), or should the library still use a random value between 0 and that value? |
I think jitter should be handled as we discussed. What sounds like something RI might need is a reconnect failed handler such that he can possibly change reconnect time. What I think he wants is a simple option (or default behavior) that just does exponential backoff. @ripienaar is this more for the client app's benefit or the servers? |
Yeah no URL - this is for me to use in a go client. If I know I am working on a system with >50k clients, I will do the extra work to use write a custom jitter. Using the stateless backoff here https://blog.gopheracademy.com/advent-2014/backoff/ CustomReconnectDelay(func(n int) time.Duration {
return backoff.Default.Duration(n)
}) And that's it, if I specify this you ignore the default you are proposing and people with complex needs can implement their crazy however they like. When you're doing TLS at scale with NATS you have to be VERY careful - that's fine it's normal its not a NATS problem - but if I am at scale this is a key knob needed. |
Yes definitely needed for large scale TLS (not even large TBH, user we are working with was dealing with 1200). Do we have a callback that exists now that says I tried everyone and no one picked up? Feels like this callback would be where we reset the reconnect time (Jitter should probably not be tweaked here IMO). The on connect again we can reset it back to normal, etc. |
Yeah, this is why my example takes a |
IMO, this can be more general than jitter, maybe label it ReconnectAttemptFailedCB. Could be very useful in debugging. Couldn't attempts be tracked via closure? Along with custom, we could provide a few predefined callbacks that users could choose - jitter only, backoff w/ jitter, etc... |
nats.go
Outdated
// It can return a jitter value based on those metrics. | ||
// The library will take a random value between 0 and the returned value | ||
// and add that to the ReconnectWait duration. | ||
// Note that any jitter is capped with ReconnectJitterMax. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so with this design library users still cant use their own backoff logic, we can only influence the jitter which is not what I was suggesting earlier. It's not nothing and will help, just pointing out that's not what I was suggesting at all.
The desired effect of a backoff is that the minimal reconnect interval increases over time + jitter. This implimentation means still that minimal reconnect time is ~2seconds which is not good
Here's 30 reconnect attempts by grpc, note its capped at 2 minutes and then jitters around that, but the minimum reconnect time approaches max over time not min+jitter:
0: 1s
1: 1.599317579s
2: 2.56393758s
3: 3.990555513s
4: 5.962072093s
5: 10.763383515s
6: 18.872196205s
7: 29.559588104s
8: 37.147186587s
9: 1m10.383560759s
10: 1m37.91169291s
11: 2m18.715950662s
12: 1m45.303404133s
13: 1m59.446929074s
14: 1m51.271171221s
15: 2m13.320486513s
16: 1m53.109727095s
17: 2m22.762489936s
18: 1m54.080148652s
19: 2m1.71590643s
20: 2m1.300598788s
21: 1m49.431812662s
22: 2m4.949590579s
23: 1m46.34926616s
24: 1m42.787212095s
25: 2m6.799644605s
26: 1m48.488322855s
27: 2m13.048933797s
28: 2m14.636264763s
29: 1m48.410590953s
30: 1m43.727174768s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you want to control both then yes? The most important is to spread out jitter. What the min wait is stops the attempts from happening quickly but changing that and not jitter (unless you already have randomness in how folks got disconnected) will not help smooth reconnect. So for the grpc example above what are the achieving besides the client just not trying as often?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a true backoff:
- not as often
- over time less often
- capped at a max duration
- with jitter always
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Point I was trying to make was that jitter is more important then the delay. I understand we could have both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In example you provided above can you influence jitter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is fair, use some library for wait and jitter combined. And just make sane defaults for most scenarios.
In your case is your algorithm TLS aware?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no - but your defaults could be of course and would be fine for most.
choria cant be run without tls unless you go and recompile it at which point i dont care :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I vote for that simple function. If set, supersede any ReconnectWait option. So we keep the current jitter options (nonTLS and TLS) for cases where the custom option is not set. Doc for CustomReconnectDelay() will clearly state that the library will sleep for given delay and it is recommended that the function returns a value that include some jitter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am ready to submit updates. Do you guys prefer that I squash first or want to see just the difference (and I will only squash prior to merge)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Squash for me.
nats.go
Outdated
// Note that any jitter is capped with ReconnectJitterMax. | ||
ReconnectJitterTLS time.Duration | ||
|
||
// ReconnectJitterMax caps the maximum jitter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
worth pointing out here that setting to 0 will disable the capping
// user the current number of attempts. This function returns the | ||
// amount of time the library will sleep before attempting to reconnect | ||
// again. It is strongly recommended that this value contains some | ||
// jitter to prevent all connections to attempt reconnecting at the same time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sweet 👍
go.mod
Outdated
@@ -1,5 +1,7 @@ | |||
module github.com/nats-io/nats.go | |||
|
|||
go 1.14 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want this here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No... but I updated to go 1.14 and that touched this file.. you don't know how mad this thing makes me.
I just discovered that VSCode is now updating go.mod/sum anytime I save because they use go list, etc.. that does change those files. Will revert.
nats.go
Outdated
RequestChanLen = 8 | ||
DefaultDrainTimeout = 30 * time.Second | ||
LangString = "go" | ||
Version = "1.9.2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably bump to 1.9.3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
nats.go
Outdated
DefaultPort = 4222 | ||
DefaultMaxReconnect = 60 | ||
DefaultReconnectWait = 2 * time.Second | ||
DefaultReconnectJitterNonTLS = 100 * time.Millisecond |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just DefaultReconnectJitter
imo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. If no TLS specified, it means non-TLS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just going to say that this is uber complex - 2 jitter values, and a function that tells them that it failed when the close should do that. My question is if that is the case, why don't we just have one jitter that is a random value added to reconnect time wait, and also an exponential backoff which starts at reconnect time wait + jitter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also with the advent of host:port knowing the type of server is more difficult - and in most cases the client doesn't care, as the server will be whatever they have deployed and likely will only be either tls or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The callback is for users that want to implement their own backoff, in which case the lib will use whatever value is returned. For the default jitters, it is true that TLS connections have usually more issues reconnecting due to the cost of tls handshake, so this is why there are 2 different values. Note that in both cases (plain and TLS) the lib still takes a random value between 0 and that value that is added to the reconnect wait.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more - I think the correct thing would be to have a 'maxReconnectWait' that is analogous to reconnectWait, but that is random between 0-maxReconnectWait. That would give a bigger jitter window without the complexity of the multipliers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The callback is for users that want to implement their own backoff, in which case the lib will use whatever value is returned. For the default jitters, it is true that TLS connections have usually more issues reconnecting due to the cost of tls handshake, so this is why there are 2 different values. Note that in both cases (plain and TLS) the lib still takes a random value between 0 and that value that is added to the reconnect wait.
So why not just say have jitter
as a boolean and that randomizes reconnectTimeWait. The jitter callback can be a dynamic thing - in case of node the property can be a boolean or a function, and it's done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We had long discussions on all of these, base connect and jitter are separate things in almost all other systems. We tried to simplify and had quite a few back and forths.
-Add ReconnectJitter option setter allowing user to set jitter for non and TLS connections (possibly different values) -Default for non-TLS will be 100ms and TLS 1sec -Add a CustomReconnectDelay handler that is passed the number of full url lists attempts. The user callback will then return a duration that is going to be used for the wait. It is the responsibility of the user to send a delay with some jitter if desired. Resolves #563 Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
@derekcollison @ripienaar Please consider the changes we have done here. Since we now sleep only at the end of the list, it changes the fact that we use to check the lastAttempt from each server we are trying to reconnect to. We have changed that (and I have removed the lastAttempt variable that was no longer used in b2be4bb). Let's make sure this is correct before others port to respective client libraries. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
i = 0 | ||
var st time.Duration | ||
if crd != nil { | ||
wlf++ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we count wlf regardless of crd? Noop now but we may use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we will move it out of the condition when we actually need it, will be easy to do.
When library would wait for the reconnect wait interval, it will
now wait for a random time bounded by this value.
This prevents thundering herd reconnect issue, especially with
TLS.
Resolves #563
Signed-off-by: Ivan Kozlovic ivan@synadia.com