
Allow JetStream Publish retries iff ErrNoResponders was returned. #930

Merged: 1 commit merged into main from js_pub_retry_2 on Mar 17, 2022

Conversation

derekcollison (Member)

This is to avoid small blips from leader changes surfacing to the end application.

Signed-off-by: Derek Collison derek@nats.io
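
For context, a minimal sketch of how an application might opt into this behavior once merged. The subject, payload, and option values are illustrative assumptions; RetryWait and RetryAttempts are the options discussed in the review below.

// Assumes js is a JetStreamContext and the target stream exists.
_, err := js.Publish("ORDERS.new", []byte("hello"),
	nats.RetryWait(250*time.Millisecond),
	nats.RetryAttempts(10),
)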

@coveralls commented Mar 16, 2022

Coverage Status

Coverage decreased (-0.2%) to 85.097% when pulling 2a5ee5f on js_pub_retry_2 into c92df80 on main.

@kozlovic (Member) left a comment

This test had 2 failures out of 4 runs, so we may want to see if that can be improved.

test/js_test.go (Outdated, resolved)
test/js_test.go (resolved)
js.go (Outdated)
// To protect against small blips in leadership changes etc, if we get a no responders here retry.
time.Sleep(o.rwait)
if o.ttl > 0 {
	resp, err = js.nc.RequestMsg(m, time.Duration(o.ttl))
Member

Note that the call below uses the context that is created at the beginning, so its deadline is likely the original one, whereas here you make each request with the original timeout. Maybe that timeout should be reduced by the time already spent? If not, there will be some difference in behavior between the context and timeout variants.
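
To put rough, hypothetical numbers on that difference: with a 2s publish timeout, nats.RetryWait(500*time.Millisecond), and nats.RetryAttempts(4), a few quick ErrNoResponders failures plus the sleeps already consume about 2s, and the final attempt could then still wait the full 2s on its own, so the timeout path may block for roughly 4s in total, whereas a 2s context would cap the whole call at about 2s.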

@wallyqs (Member) Mar 17, 2022

I think this might be fine; I like how the context controls the max duration of the retry logic. For example:

ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
_, err := js.Publish(subj, msg, nats.Context(ctx), nats.RetryWait(500*time.Millisecond), nats.RetryAttempts(30))

The above would mean attempting to publish for 10 seconds max; although the retries could take up to 15s, the context still sets the deadline. It would also mean infinite retries are possible, retrying as needed until the context expires (edit: infinite retries are actually not possible with the current logic, so I made a suggestion below).

derekcollison (Member, Author)

I will do a rough estimation here subtracting out o.rwait and bailing if local ttl in the loop goes negative. Will post separate CL for ease of review but can squash before merging.
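
A minimal sketch of what that estimation might look like, reusing the field names from the snippet above; the exact accounting in the merged change may differ. resp and err are assumed to be declared by the surrounding function, and the context-based path (which keeps its original deadline) is omitted.

// Keep a local copy of the timeout and shrink it by the retry wait each
// time around, bailing out once the budget is exhausted.
ttl := o.ttl
for r := 0; err == ErrNoResponders && r < o.rnum; r++ {
	time.Sleep(o.rwait)
	if o.ttl > 0 {
		ttl -= o.rwait
		if ttl <= 0 {
			// Budget used up: stop retrying and surface a timeout.
			err = ErrTimeout
			break
		}
		resp, err = js.nc.RequestMsg(m, ttl)
	}
}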

js.go (Outdated, resolved)
js.go (Outdated)
@@ -422,11 +433,23 @@ func (js *js) PublishMsg(m *Msg, opts ...PubOpt) (*PubAck, error) {
}

if err != nil {
if err == ErrNoResponders {
err = ErrNoStreamResponse
for r := 0; err == ErrNoResponders && r < o.rnum; r++ {
Member

Suggested change:
- for r := 0; err == ErrNoResponders && r < o.rnum; r++ {
+ for r := 0; err == ErrNoResponders && (r < o.rnum || o.rnum < 0); r++ {

That change would make infinite retries possible, letting retries happen as needed while the context is still active:

ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
_, err := js.Publish(subj, msg, nats.Context(ctx), nats.RetryWait(500*time.Millisecond), nats.RetryAttempts(-1))

derekcollison (Member, Author)

Should we make sure a ctx is present to do that one?
I will make the change, but we could get into an infinite loop if the stream is truly gone and there is no context provided to the Publish call.
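
For illustration, the kind of guard being discussed might look like the following sketch; whether the merged change enforces it exactly this way is an assumption (field names follow the earlier snippets, and the errors package would need to be imported).

// Hypothetical guard: unbounded retries (a negative retry count) only make
// sense when a context can eventually cancel the loop.
if o.rnum < 0 && o.ctx == nil {
	return nil, errors.New("nats: negative retry attempts requires a context to bound the retries")
}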

derekcollison (Member, Author)

Change made below.

@kozlovic (Member) left a comment

LGTM

@wallyqs (Member) left a comment

LGTM!
