Allow JetStream Publish retries iff ErrNoResponders was returned. #930
Conversation
This test had 2 failures out of the 4 runs, so we may want to see if that can be improved.
Force-pushed from 26b58d2 to 15d1a42
js.go (outdated)
// To protect against small blips in leadership changes etc, if we get a no responders here retry.
time.Sleep(o.rwait)
if o.ttl > 0 {
	resp, err = js.nc.RequestMsg(m, time.Duration(o.ttl))
Note that the call below uses the context that is created at the beginning, so the deadline is likely the original one, whereas here each request is made with the original timeout. Maybe that timeout should be reduced by the time already spent? If not, there will be some difference in behavior between the context and timeout paths.
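For a rough sense of the difference (numbers are illustrative, not from the PR): with o.ttl = 2s, o.rwait = 250ms and two retries, the timeout path could block for up to roughly 2s + 250ms + 2s + 250ms + 2s ≈ 6.5s, whereas a 2s context deadline would cut the whole sequence off at about the 2s mark.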
I think this might be fine; I like how the context controls the max duration of the retry logic. For example:
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
_, err := js.Publish(subj, msg, nats.Context(ctx), nats.RetryWait(500*time.Millisecond), nats.RetryAttempts(30))
The above would mean attempting to publish for 10 seconds max; the retries alone could take up to ~15s, but the context still sets the deadline. It also means it should be possible to have infinite retries, retrying as needed until the context expires. (edit: infinite retries are actually not possible with the current logic, so I made a suggestion below.)
I will do a rough estimation here, subtracting out o.rwait and bailing if the local ttl in the loop goes negative. I will post a separate CL for ease of review but can squash before merging.
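A minimal sketch of that estimation, reusing the names from the diff above (o.ttl, o.rwait, o.rnum, m, resp, err) and assuming a pubOpts context field o.ctx for the context-based path; this illustrates the idea, not the final code:

// Sketch: shrink the remaining time budget by the wait on each retry
// and bail out once it goes negative.
ttl := o.ttl
for r := 0; err == ErrNoResponders && r < o.rnum; r++ {
	time.Sleep(o.rwait)
	if o.ttl > 0 {
		ttl -= o.rwait
		if ttl <= 0 {
			break // retry budget exhausted, keep the last error
		}
		resp, err = js.nc.RequestMsg(m, ttl)
	} else {
		// Context-based publish: the context deadline bounds the retries.
		resp, err = js.nc.RequestMsgWithContext(o.ctx, m)
	}
}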
js.go (outdated)
@@ -422,11 +433,23 @@ func (js *js) PublishMsg(m *Msg, opts ...PubOpt) (*PubAck, error) {
}

if err != nil {
if err == ErrNoResponders {
err = ErrNoStreamResponse
for r := 0; err == ErrNoResponders && r < o.rnum; r++ {
Suggested change:
-	for r := 0; err == ErrNoResponders && r < o.rnum; r++ {
+	for r := 0; err == ErrNoResponders && (r < o.rnum || o.rnum < 0); r++ {
That change would make it possible to allow infinite retries, so that retries happen as needed while the context is still active:
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
_, err := js.Publish(subj, msg, nats.Context(ctx), nats.RetryWait(500*time.Millisecond), nats.RetryAttempts(-1))
Should we make sure a ctx is present to do that one?
I will make the change, but we could get into an infinite loop if the stream is truly gone and there is no context provided to the Publish call.
Change made below.
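For reference, a guard along those lines could be as small as the sketch below (field names o.rnum and o.ctx follow the diff; the error message is illustrative and the merged change may differ):

// Sketch: refuse unlimited retries unless a context is present to bound them,
// otherwise a missing stream would keep the publish retrying forever.
if o.rnum < 0 && o.ctx == nil {
	return nil, errors.New("nats: RetryAttempts(-1) requires a Context to bound the retries")
}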
Force-pushed from bf7a7ad to b44711d
LGTM
LGTM!
Force-pushed from b44711d to 7ccdf20
This is to avoid small blips from leader changes surfacing to the end application. Signed-off-by: Derek Collison <derek@nats.io>
Force-pushed from 7ccdf20 to 2a5ee5f