[IMPROVEMENT] added client retry for jetstream async publish old API #1695

Merged (6 commits) on Aug 14, 2024

Conversation

pranavmehta94
Contributor

This PR contains the following:

  1. Added retry logic for JetStream async publish using the old API
  2. Retries are controlled by the retry wait and retry attempts publish options
  3. The no responders error is retried
  4. The pub ack future object is re-used across retries
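
As a rough illustration of the retry behavior listed above, here is a minimal, simplified sketch. It is synchronous, whereas the PR retries from the async reply handler and re-uses the same pub ack future. The names publishWithRetry and errNoResponders are hypothetical and exist only for this example; errNoResponders is a local stand-in for the library's no responders error.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNoResponders is a local stand-in for the "no responders" error
// (hypothetical; defined here only to keep the sketch self-contained).
var errNoResponders = errors.New("no responders available for request")

// publishWithRetry calls publish until it succeeds, fails with an error
// other than errNoResponders, or the retry attempts are used up.
// attempts is the number of retries after the first call.
// It returns the number of calls made and the last error.
func publishWithRetry(publish func() error, attempts int, wait time.Duration) (int, error) {
	calls := 0
	for {
		calls++
		err := publish()
		if err == nil {
			return calls, nil
		}
		// Only the "no responders" error is retried; anything else is final.
		if !errors.Is(err, errNoResponders) || calls > attempts {
			return calls, err
		}
		time.Sleep(wait)
	}
}

func main() {
	// Simulate a stream that has no responders for the first two calls.
	failures := 2
	publish := func() error {
		if failures > 0 {
			failures--
			return errNoResponders
		}
		return nil
	}
	calls, err := publishWithRetry(publish, 3, 10*time.Millisecond)
	fmt.Println(calls, err) // prints "3 <nil>"
}
```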

This PR fixes the following issues:

@pranavmehta94 pranavmehta94 changed the title added client retry for jetstream async publish old API [IMPROVEMENT] added client retry for jetstream async publish old API Aug 8, 2024
@pranavmehta94 pranavmehta94 force-pushed the async_publish_retry branch 2 times, most recently from e038a42 to 6753a97 Compare August 8, 2024 12:49
@pranavmehta94 pranavmehta94 marked this pull request as draft August 8, 2024 13:43
@pranavmehta94 pranavmehta94 marked this pull request as ready for review August 8, 2024 15:04
Collaborator

@piotrpio piotrpio left a comment

Thanks for the contribution! I have a few comments.

js.go Outdated
dch := js.dch
js.dch = nil
// Defer here so error is processed and can be checked.
defer close(dch)
Collaborator

This defer does not work as expected: it runs when closeDch returns, not when the caller returns. It would be good to extract this func, since the code is repeated twice, but we should make sure that close(dch) is executed at the very end, so that signals are sent on the pubAckFuture first and only later to PublishAsyncComplete().
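
To make the scoping point concrete, here is a small standalone sketch (not code from the PR). The helper names closeInHelper, viaHelper, and viaCallerDefer are hypothetical. A defer placed inside an extracted helper fires as soon as the helper returns, so the channel is already closed by the time the caller continues; a defer placed in the caller itself fires only when the caller returns.

```go
package main

import "fmt"

// closeInHelper mimics the extracted helper: the deferred close runs when
// this function returns, before the caller does any further work.
func closeInHelper(done chan struct{}) {
	defer close(done)
}

// isClosed reports whether done has been closed, without blocking.
func isClosed(done chan struct{}) bool {
	select {
	case <-done:
		return true
	default:
		return false
	}
}

// viaHelper reports whether done is already closed at the point where the
// caller would still need to signal the future.
func viaHelper() bool {
	done := make(chan struct{})
	closeInHelper(done)
	return isClosed(done)
}

// viaCallerDefer defers the close in the caller, so done is still open
// while the rest of the caller's body runs.
func viaCallerDefer() bool {
	done := make(chan struct{})
	defer close(done)
	return isClosed(done)
}

func main() {
	fmt.Println(viaHelper(), viaCallerDefer()) // prints "true false"
}
```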

js.go Outdated
@@ -894,7 +933,10 @@ func (js *js) PublishAsync(subj string, data []byte, opts ...PubOpt) (PubAckFutu
 const defaultStallWait = 200 * time.Millisecond

 func (js *js) PublishMsgAsync(m *Msg, opts ...PubOpt) (PubAckFuture, error) {
-	var o pubOpts
+	var o = pubOpts{
+		rwait: DefaultPubRetryWait,
Collaborator

Since this is an old API many people rely on, I would be hesitant to make the retries default, but rather treat them as opt-in and clearly state it in the doc. I know that this makes it work differently than the new API, but I don't think we should be changing the default behavior in this case.
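
One way to picture the opt-in approach is the sketch below. It is a simplified illustration, not the library's definitions: pubOpts, PubOpt, RetryWait, RetryAttempts, and resolve are local stand-ins modeled on names in the diff, and the rnum field name is an assumption. Options are applied on top of zero values, so no retry happens unless the caller asks for it.

```go
package main

import (
	"fmt"
	"time"
)

// pubOpts is a simplified stand-in for the publish options struct.
// rnum is an assumed field name for the number of retry attempts.
type pubOpts struct {
	rwait time.Duration
	rnum  int
}

// PubOpt is a simplified functional option (hypothetical shape).
type PubOpt func(*pubOpts)

// RetryWait sets the wait between retries.
func RetryWait(d time.Duration) PubOpt { return func(o *pubOpts) { o.rwait = d } }

// RetryAttempts sets the number of retries.
func RetryAttempts(n int) PubOpt { return func(o *pubOpts) { o.rnum = n } }

// resolve starts from zero values, so retries stay disabled by default
// and are enabled only by an explicit option.
func resolve(opts ...PubOpt) pubOpts {
	var o pubOpts
	for _, opt := range opts {
		opt(&o)
	}
	return o
}

func main() {
	fmt.Println(resolve().rnum) // prints "0": no retries unless requested
	o := resolve(RetryAttempts(3), RetryWait(250*time.Millisecond))
	fmt.Println(o.rnum, o.rwait) // prints "3 250ms"
}
```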

js.dch = nil
// Defer here so error is processed and can be checked.
defer close(dch)
}
}

doErr := func(err error) {
Collaborator

We should clear the reply subject before passing the message to the error handler. If the user wants to retry manually inside the error handler, having paf.msg.Reply set will cause the retry attempt to fail with nats: reply subject should be empty
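
The ordering being asked for can be sketched as follows. This is a standalone illustration with hypothetical stand-ins (Msg, publishAsync, failPublish); it only assumes that an async publish rejects a message whose Reply is set, using the error text quoted above.

```go
package main

import (
	"errors"
	"fmt"
)

// Msg is a minimal, hypothetical stand-in for the message type.
type Msg struct {
	Subject string
	Reply   string
}

// publishAsync rejects messages that already carry a reply subject,
// mirroring the error text quoted in the review comment.
func publishAsync(m *Msg) error {
	if m.Reply != "" {
		return errors.New("nats: reply subject should be empty")
	}
	return nil
}

// failPublish clears the generated reply subject first and only then
// hands the message to the user's error handler.
func failPublish(m *Msg, handler func(*Msg)) {
	m.Reply = ""
	handler(m)
}

func main() {
	m := &Msg{Subject: "orders.new", Reply: "_INBOX.abc.123"}
	var retryErr error
	// The handler retries manually; this works because Reply was cleared.
	failPublish(m, func(failed *Msg) { retryErr = publishAsync(failed) })
	fmt.Println(retryErr) // prints "<nil>"
}
```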

object.go Outdated
@@ -398,7 +398,7 @@ func (obs *obs) Put(meta *ObjectMeta, r io.Reader, opts ...ObjectOpt) (*ObjectIn
return nil
}

m, h := NewMsg(chunkSubj), sha256.New()
Collaborator

Why is this change necessary?

Contributor Author

Earlier, the reply field in msg was reset to an empty string on exit from PublishMsgAsync using a defer. After adding the retry logic, go test detected a race between setting the reply in the async handler handleAsyncReply and emptying it in PublishMsgAsync, so the defer call that emptied the reply inside PublishMsgAsync was removed.
The code re-used the msg object inside the loop that calls PublishMsgAsync, and the test failed complaining that the reply subject was not empty. Hence, a new msg object is now created in every iteration of the loop.

Collaborator

This is no good I think. We should not leave Msg.Reply set with the generated subject as it makes it impossible to re-publish by the user (as in the object store now). By the way, having Put() in object store recreate the message with each iteration means we have increased allocations (especially for large payloads / small chunks).

I may have an idea of how to tackle this problem - we could stop modifying the message passed by the user (which would prevent race) and utilize the low level nc.publish(). Here's my initial attempt from your branch: https://gist.github.com/piotrpio/796e5203e2d5be2cabda771e9ec51d54

Possibly this can be improved, but it seems to work.
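
The general idea (not the contents of the linked gist) can be sketched like this: the generated reply subject is passed to a lower-level publish as a separate argument, so the caller's message is never written to and can be re-used across iterations without a race or extra allocations. Msg, wire, rawPublish, and publishAsync below are hypothetical stand-ins for illustration only.

```go
package main

import "fmt"

// Msg is a minimal, hypothetical stand-in for the user's message.
type Msg struct {
	Subject string
	Reply   string
	Data    []byte
}

// wire records what would be sent to the server in this sketch.
type wire struct {
	subject string
	reply   string
	data    []byte
}

// rawPublish is a hypothetical stand-in for a low-level publish that takes
// the reply subject as its own argument instead of reading it from a Msg.
func rawPublish(subject, reply string, data []byte) wire {
	return wire{subject: subject, reply: reply, data: data}
}

// publishAsync passes the generated inbox out-of-band and only reads from m,
// so m stays in its original state.
func publishAsync(m *Msg, inbox string) wire {
	return rawPublish(m.Subject, inbox, m.Data)
}

func main() {
	// One message re-used across iterations, as in a chunked upload loop.
	m := &Msg{Subject: "chunks.1", Data: []byte("payload")}
	for i := 0; i < 3; i++ {
		w := publishAsync(m, fmt.Sprintf("_INBOX.reply.%d", i))
		fmt.Println(w.reply, m.Reply == "")
	}
}
```

Because the message is only read, a retry can re-send it with the same reply subject without touching any state the user can observe.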

Contributor Author

// PublishMsgAsync publishes a Msg to JetStream and returns a PubAckFuture.
// The message should not be changed until the PubAckFuture has been processed.

Going by the comment above, the way Put() uses PublishMsgAsync is incorrect, since it seems to modify the msg object without waiting on the PubAckFuture.

Collaborator

@piotrpio piotrpio Aug 12, 2024

Ok but even so, with your changes the message will never be restored to its original state right? Even after the PubAckFuture has been processed. If we need to fix it anyway we can also improve by not modifying the message at all I think.

Contributor Author

@pranavmehta94 pranavmehta94 Aug 12, 2024

I understand that a caller would assume that the message is not modified in any way once PublishMsgAsync has been called and the PubAckFuture has been processed. This is the reason why Reply was unset earlier. I will update the PR by applying your patch.

Collaborator

But it was unset before your changes: https://github.com/nats-io/nats.go/pull/1695/files#diff-176a2342c55218b70f165bed6a7731dc83d901f9513382f3c98bbd3d08ce0678L941

After PubAckFuture is processed, the message can be re-used by the user and thus should be in its original state:
// The message should not be changed *until* the PubAckFuture has been processed.

Again, your changes are good in general and we should definitely process those but we also need to be very careful when changing any behavior in the old JetStream API.

Contributor Author

Yes I understand it was unset before these changes as well

Contributor Author

@pranavmehta94 pranavmehta94 Aug 13, 2024

@piotrpio I have updated the PR by applying your patch with one small change. Please review the changes

test/js_test.go Outdated
Collaborator

We should really have tests for async retries. For reference, see tests in the new API: https://github.com/nats-io/nats.go/blob/main/jetstream/test/publish_test.go#L1402

@pranavmehta94
Contributor Author

@piotrpio I incorporated the review comments but the CI pipeline now fails in an unrelated test. When I run the tests locally, all the tests succeed

Collaborator

@piotrpio piotrpio left a comment

LGTM, thanks for the contribution!

@piotrpio piotrpio merged commit 6f181d3 into nats-io:main Aug 14, 2024
1 check passed
@piotrpio piotrpio mentioned this pull request Dec 17, 2024