Offsets are not managed correctly when the connector fails to process some of the records in a batch #68
Comments
Ugh, sorry about this @byreshb. I introduced this behavior with the guard in Lines 144 to 147 in 55c7d40, which was added to stop unnecessary exceptions from being thrown when the framework calls flush on an already-failed task due to KAFKA-10188. It does technically prevent those unnecessary exceptions from being rethrown, but it has the unfortunate side effect that the call to flush during the framework's final offset commit attempt now appears to succeed, so offsets can be committed for records that were never actually written.

This should only affect snapshot 2.0+ builds of the connector running on Connect workers that don't yet have apache/kafka#8910, which removes that final offset commit attempt (since it's not a great idea to invoke anything on a task after it has already failed). I'll file a fix PR that should hopefully work for this case and prevent the obnoxious messages about not being able to await tasks on a closed executor. Thanks for catching this before it made its way into a release!
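To make the failure mode concrete, here is a minimal, hypothetical sketch of a sink task whose flush swallows errors once the task has already failed. The class and helper names are invented for illustration and this is not the connector's actual code; it only shows how such a guard lets the framework's final offset commit see a "successful" flush even though some records were never delivered.

```java
import java.util.Collection;
import java.util.Map;

import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.errors.ConnectException;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

// Illustrative only: shows how a "skip flush if we already failed" guard can let
// the framework's final offset commit succeed for records that were never written.
public abstract class GuardedSinkTaskSketch extends SinkTask {

    // Set when an earlier write failed and the task is about to be killed.
    private volatile boolean taskFailed = false;

    @Override
    public void put(Collection<SinkRecord> records) {
        try {
            writeToExternalSystem(records);   // hypothetical helper; may throw on a bad record
        } catch (RuntimeException e) {
            taskFailed = true;
            throw new ConnectException("Failed to write batch", e);
        }
    }

    @Override
    public void flush(Map<TopicPartition, OffsetAndMetadata> offsets) {
        if (taskFailed) {
            // Guard meant to avoid noisy, redundant exceptions (see KAFKA-10188).
            // Side effect: flush "succeeds", so the framework may still commit
            // offsets for records that never reached the sink.
            return;
        }
        awaitPendingWrites();                 // hypothetical helper; rethrows buffered failures
    }

    // Hypothetical helpers standing in for the connector's real write path.
    protected abstract void writeToExternalSystem(Collection<SinkRecord> records);
    protected abstract void awaitPendingWrites();
}
```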
Looks like the offset and the number of records in the BQ table don't match. Test details are documented below (done on my local setup):
- Publish Message 1 = Correct Data. Lag = 0, Current Offset = 1, End Offset = 1 (0 messages in BQ table) – New error? Expected 1 message in BQ table. Connector: Running, Connect Task: Running
- Publish Message 6 = Invalid Data. Lag = 1, Current Offset = 5, End Offset = 6 (4 messages in BQ table). Connector: Running, Connect Task: Failed
- Publish Message 7 = Correct Data. Lag = 2, Current Offset = 5, End Offset = 7 (4 messages in BQ table)
- Restart Connect Task: Lag = 8, Current Offset = 5, End Offset = 13 (7 messages in BQ table) – Offset didn't change, but there are 3 new records in the BQ table, which doesn't look correct
- Restarted the connect task and the connector service a few more times. No change in the result.
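For anyone trying to reproduce these numbers, the committed offset and end offset can be checked programmatically against the connector's consumer group (Connect sink tasks use the group connect-&lt;connector name&gt; by default). The sketch below uses Kafka's AdminClient; the bootstrap address, connector name, and resulting group id are placeholders for this example.

```java
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

// Illustrative helper for comparing a sink connector's committed offsets with log end offsets.
public class ConnectorLagCheck {
    public static void main(String[] args) throws Exception {
        String group = "connect-bigquery-sink";   // placeholder: "connect-" + connector name
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // Offsets the Connect framework has committed for the sink task's consumer group.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets(group)
                         .partitionsToOffsetAndMetadata()
                         .get();

            // Latest (end) offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> latestSpec = committed.keySet().stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> endOffsets =
                    admin.listOffsets(latestSpec).all().get();

            committed.forEach((tp, meta) -> {
                long end = endOffsets.get(tp).offset();
                System.out.printf("%s committed=%d end=%d lag=%d%n",
                        tp, meta.offset(), end, end - meta.offset());
            });
        }
    }
}
```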
@byreshb extra records in the table are expected in this case. The connector doesn't manually manage offsets unless upsert/delete is turned on, since manual offset management causes things to break when SMTs are used that change topic names in sink records. As a result, it can only commit an entire batch at a time, where a "batch" is all the records sent to the task since the last offset commit. So, if it reads some valid records, sends those to BigQuery successfully, then reads a record that causes it to fail, it's unable to commit offsets for the valid records, and will re-process them when restarted.

This is obviously not great, but it's not a regression; the regression here is that instead of reprocessing failed records, the connector actually skips them and drops them forever, which can lead to data loss. That's all that #70 is meant to address; manual offset management can be added separately (although it would either cause the connector to break on topic-altering SMTs or would require a KIP to add some kind of API for knowing the original topic/partition/offset of each record).
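As an illustration of why topic-renaming SMTs and manual offset management don't mix, here is a hypothetical preCommit override; it is not the connector's code, just a sketch. The offsets a sink task reports back to the framework are keyed by the record's topic and partition, but after an SMT such as RegexRouter has rewritten the topic name, record.topic() no longer names the real Kafka topic, so the task would report offsets for a topic/partition the consumer isn't actually reading from.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

// Illustrative only: a sink task that tries to manage offsets itself via preCommit().
public abstract class ManualOffsetSinkTaskSketch extends SinkTask {

    // Highest offset that has actually been written to the external system, per partition.
    private final Map<TopicPartition, OffsetAndMetadata> safeToCommit = new HashMap<>();

    @Override
    public void put(Collection<SinkRecord> records) {
        for (SinkRecord record : records) {
            writeRecord(record);              // hypothetical write path
            // PROBLEM: if an SMT (e.g. RegexRouter) rewrote the topic name, record.topic()
            // is the *renamed* topic, so this TopicPartition doesn't match the partition
            // the framework's consumer is actually reading from.
            TopicPartition tp = new TopicPartition(record.topic(), record.kafkaPartition());
            safeToCommit.put(tp, new OffsetAndMetadata(record.kafkaOffset() + 1));
        }
    }

    @Override
    public Map<TopicPartition, OffsetAndMetadata> preCommit(
            Map<TopicPartition, OffsetAndMetadata> currentOffsets) {
        // Report only offsets we know were written; with renamed topics these keys are wrong,
        // so the framework either commits nothing or commits for nonexistent partitions.
        return new HashMap<>(safeToCommit);
    }

    protected abstract void writeRecord(SinkRecord record); // hypothetical write path
}
```

This is the constraint described above: without some API for recovering the original topic/partition/offset of each record (which would need a KIP), per-record offset tracking can't be done safely once topics are renamed.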
The connector was processing 500 records in a batch. When one of the records threw an exception, the connector stopped processing (connector task status: FAILED). When the connector task was restarted, the connector resumed processing, skipping the failed record. Maybe the connector falsely committed the offset for all records in the batch.
Connector Plugin Version:
Steps to reproduce:
Expected Result:
Actual Result:
Error log snippet: