Sarama Async Producer Encounters 'Out of Order' Error: what are the reasons? #2803

Open
nevillus opened this issue Feb 15, 2024 · 6 comments
Labels
needs-investigation Issues that require followup from maintainers

Comments

@nevillus

Description

We are encountering an error (once every few weeks) while using the async producer in our Kafka setup. The error message encountered is as follows:

assertion failed: message out of sequence added to a batch

This error seems to originate from the following line in the Sarama library:
produce_set.go#L89

The occurrence of this error is sporadic, and we are struggling to understand the underlying cause or identify any corrective measures. It appears that, occasionally, messages are being added to the batch in an incorrect order.

We are seeking insights or suggestions on what might be triggering this error. Our investigations have considered network issues as a potential cause; however, we have not found any corresponding logs or indicators to substantiate this theory when the error occurs.

Versions
Sarama: v1.42.1
Kafka: 2.6.2
Go: 1.20.6
Configuration
	config := sarama.NewConfig()
	config.Version = version
	config.Consumer.Group.Rebalance.Strategy = sarama.NewBalanceStrategySticky()
	config.Producer.RequiredAcks = sarama.WaitForAll
	config.Producer.Idempotent = true
	config.Net.MaxOpenRequests = 1
	config.Producer.Retry.Max = 100000
	config.Producer.Retry.Backoff = 100 * time.Millisecond
	config.Producer.Return.Successes = true
	config.Producer.Return.Errors = true
	config.Producer.Partitioner = sarama.NewHashPartitioner
Logs

We are facing the error detailed at the following location:
produce_set.go#L89

Additional Context

All messages are dispatched using an asynchronous producer, configured with a high retry count to ensure message delivery even in the event of transient Kafka broker failures. Despite this, we observe that occasionally a message fails to be added to the batch, rendering it ineligible for any retry mechanism in Sarama.

@dnwe dnwe added the needs-investigation Issues that require followup from maintainers label Feb 15, 2024
@nevillus
Author

nevillus commented Mar 12, 2024

With the setup described above, we have recently also seen a few instances of the error "The broker received an out of order sequence number". These are also very rare, but we wonder whether they indicate a problem with how messages are being pushed that causes them to be ordered incorrectly.

@dnwe
Collaborator

dnwe commented Mar 12, 2024

So this appears to be an ordering issue / race condition between new batches being produced and batches being retried in the idempotent producer:

sarama/async_producer.go

Lines 1144 to 1148 in f21c512

	if bp.parent.conf.Producer.Idempotent {
		go bp.parent.retryBatch(topic, partition, pSet, block.Err)
	} else {
		bp.parent.retryMessages(pSet.msgs, block.Err)
	}

This shouldn't occur with config.Net.MaxOpenRequests = 1, but we have had other reports (e.g., #2619) suggesting that the introduction of request pipelining inadvertently changed the producer's behaviour so that it lost some of its ordering guarantees.
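The race described above can be illustrated with a toy model that uses only the standard library. Nothing here is Sarama code: `enqueueOrder`, the `queue` channel and the `release` gate are hypothetical names, and the gate deliberately forces the unlucky schedule that would only happen occasionally in a real program.

```go
package main

import "fmt"

// enqueueOrder returns the order in which sequence numbers reach a shared
// partition queue. Sequences 0 and 1 belong to a failed batch that must be
// retried; sequence 2 is a new message produced afterwards.
func enqueueOrder(asyncRetry bool) []int32 {
	queue := make(chan int32, 8)
	retry := func() {
		for _, seq := range []int32{0, 1} {
			queue <- seq
		}
	}

	if asyncRetry {
		release := make(chan struct{})
		done := make(chan struct{})
		// The retry runs in its own goroutine, like a "go retry(...)" call.
		// The release gate models that goroutine being scheduled late.
		go func() {
			<-release
			retry()
			close(done)
		}()
		queue <- 2 // the new message overtakes the pending retry
		close(release)
		<-done
	} else {
		retry() // the retry is requeued before any new message is accepted
		queue <- 2
	}

	close(queue)
	var order []int32
	for seq := range queue {
		order = append(order, seq)
	}
	return order
}

func main() {
	fmt.Println(enqueueOrder(true))  // prints [2 0 1]
	fmt.Println(enqueueOrder(false)) // prints [0 1 2]
}
```

The synchronous branch is only there for contrast; it is not a claim about how the issue is or should be fixed in Sarama.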

@nevillus
Author

Thank you, @dnwe. Is there currently someone addressing this issue? If not, we're willing to assist and contribute to a solution. Could you provide some guidance on where we might start or what to look into?

@prestona
Member

I was able to reproduce with a simple async producer that sets:

config.Net.MaxOpenRequests = 1
config.Producer.Idempotent = true

In my case, the trigger for the "assertion failed: message out of sequence added to a batch" message is interrupting network connectivity between the Sarama client and the brokers (connecting to / disconnecting from a VPN).

I don't see the same problem if I switch to using the sync producer in a loop (keeping the same configuration). I suspect this is because my test program will block until Kafka acks each message - effectively preventing the possibility of there being more than one request in flight at any time.
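The difference between blocking on each ack and sending without waiting can be sketched with a toy broker goroutine. This is a stdlib-only illustration, not the test program mentioned above: `maxInFlight` and its channels are hypothetical, and it models only the caller's side, not Sarama's MaxOpenRequests handling.

```go
package main

import "fmt"

// maxInFlight sends n requests to a toy broker goroutine and reports the
// largest number of unacknowledged requests the caller had outstanding.
// With waitForAck set, the caller blocks on each ack before sending again.
func maxInFlight(n int, waitForAck bool) int {
	requests := make(chan int)
	acks := make(chan int, n) // buffered so the toy broker never blocks on acking

	// Toy broker: acknowledges every request it receives.
	go func() {
		for req := range requests {
			acks <- req
		}
	}()

	inFlight, peak := 0, 0
	for i := 0; i < n; i++ {
		inFlight++
		if inFlight > peak {
			peak = inFlight
		}
		requests <- i
		if waitForAck {
			<-acks // sync style: wait for the ack before the next send
			inFlight--
		}
	}
	close(requests)

	// Async style: collect the outstanding acks only after all sends.
	for inFlight > 0 {
		<-acks
		inFlight--
	}
	return peak
}

func main() {
	fmt.Println(maxInFlight(3, true))  // prints 1
	fmt.Println(maxInFlight(3, false)) // prints 3
}
```

With one send outstanding at a time there is nothing for a retry to be reordered against, which is consistent with the suspicion above.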

@richardartoul
Contributor

Should be fixed by #2943 if someone can review

@flylhcat

Hi @nevillus, do you know whether this error recovers automatically? We plan to upgrade Sarama. If it recovers on its own and doesn't cause actual message disorder, perhaps the impact is small? Thanks a lot!

5 participants