Checkpointing: Partial Success in BufferedStreamConsumer (Destination) #3555

Merged
merged 8 commits into from
Jul 21, 2021

@cgardens cgardens commented May 22, 2021

What

  • We already have functionality that allows us to checkpoint a partial success in the case of a source failure. This PR is doing the same on the destination side for most of our destinations (any that use BufferedStreamConsumer).
  • In onClose, if any records were successfully flushed to a tmp table, the destination will still try to commit them. If the commit succeeds (doesn't throw an exception), the destination will emit a state. If it fails, no state is emitted.
  • Note: This PR can go in separately from the rest of the checkpointing feature--I am recommending that we do not block the release on this PR.
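
For illustration, the behavior described in the bullets above could be sketched roughly as follows. This is a simplified sketch under stated assumptions, not the actual BufferedStreamConsumer code: the class name `PartialSuccessSketch`, the string-typed records and states, and the use of plain `Consumer` callbacks are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical, simplified stand-in for a buffered consumer; not the real Airbyte class.
final class PartialSuccessSketch {
  private final int maxBufferSize;
  private final Consumer<List<String>> recordWriter; // writes a batch to the tmp table
  private final Consumer<Boolean> onClose;           // destination-specific commit step
  private final Consumer<String> stateCollector;     // receives the emitted state

  private final List<String> buffer = new ArrayList<>();
  private String pendingState;     // latest state whose records may still be in memory
  private String lastFlushedState; // latest state whose records reached the tmp table

  PartialSuccessSketch(int maxBufferSize, Consumer<List<String>> recordWriter,
                       Consumer<Boolean> onClose, Consumer<String> stateCollector) {
    this.maxBufferSize = maxBufferSize;
    this.recordWriter = recordWriter;
    this.onClose = onClose;
    this.stateCollector = stateCollector;
  }

  void acceptRecord(String record) {
    buffer.add(record);
    if (buffer.size() >= maxBufferSize) {
      flush();
    }
  }

  void acceptState(String state) {
    pendingState = state;
  }

  private void flush() {
    if (!buffer.isEmpty()) {
      recordWriter.accept(new ArrayList<>(buffer));
      buffer.clear();
    }
    if (pendingState != null) {
      lastFlushedState = pendingState; // its records are now in the tmp table
      pendingState = null;
    }
  }

  void close(boolean hasFailed) {
    if (!hasFailed) {
      flush(); // clean run: write whatever is still buffered
    }
    // Tell the destination to discard only when the run failed and nothing was flushed.
    onClose.accept(lastFlushedState == null && hasFailed);
    // Reached only if onClose did not throw, so the flushed records count as committed.
    if (lastFlushedState != null) {
      stateCollector.accept(lastFlushedState);
    }
  }
}
```

In this sketch, a run that fails after at least one flush still calls the commit step with `false` and emits the last flushed state, which is the partial-success case.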

Pre-merge Checklist

  • Improve testing: need to split out testing of the buffer size into its own test.
  • Improve testing: test more failure / partial success cases.
  • bump versions

Base automatically changed from cgardens/migrate_copy_destinations to cgardens/migrate_other_destinations May 25, 2021 20:48
@cgardens cgardens force-pushed the cgardens/migrate_other_destinations branch 2 times, most recently from f917e65 to e7fa9e5 Compare May 25, 2021 21:07
Base automatically changed from cgardens/migrate_other_destinations to cgardens/checkpointing_respect_destination_state May 25, 2021 21:08
@cgardens cgardens force-pushed the cgardens/checkpointing_respect_destination_state branch from bb6cd74 to bdb60ab Compare May 25, 2021 21:22
@cgardens cgardens force-pushed the cgardens/checkpointing_destination_failures branch from 5c6cdad to 5792690 Compare May 25, 2021 22:00
@cgardens cgardens marked this pull request as ready for review May 25, 2021 22:47
Base automatically changed from cgardens/checkpointing_respect_destination_state to master May 25, 2021 23:47
@cgardens cgardens force-pushed the cgardens/checkpointing_destination_failures branch from 3e53b2f to 9bd3e6d Compare May 26, 2021 16:57
Contributor

@davinchia davinchia left a comment

This is awesome!

I just realised this approach might not work for copy destinations in their current form since the actual db insert, via the copy command, only happens at the end of the sync. Storing state in the middle of the sync means we would miss some records on the next sync. Does that make sense?

@cgardens
Contributor Author

@davinchia I don't think I understand why we would miss records. Could you explain the problematic scenario?

My understanding:

  • upload records to "tmp" table (in this case a file)
  • failure happens during this process
  • attempt to copy what was uploaded into the raw table
  • if the previous step is successful, emit the last state message for which we know the records were uploaded to the tmp table. if not, assume nothing was written.

@davinchia
Contributor

davinchia commented May 28, 2021

What I was thinking of:

  1. failure happens during a sync
  2. the onClose function is triggered. however because hasFailed is true, we don't actually write the data to a tmp table. the copied data is later removed (a couple of LOC down).
  3. if we have saved any state prior to this, the next sync would start from the state we last saved, and cause us to miss records that were read but not actually written to the final table.

I think the approach you wrote out makes sense if we always try to copy the uploaded file to the tmp and final table - might be missing something, but I don't think we are doing that yet.

Maybe we should add integration test cases for this?
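
The scenario in the numbered list above can be simulated with a small sketch. Everything here is an assumption made for illustration: the class name `MissedRecordsSketch`, the integer records that double as the source cursor, and the `commitStagedDataOnFailure` switch are hypothetical and do not reflect the copy destinations' real code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation; record values double as the source cursor.
final class MissedRecordsSketch {

  /** Returns the final table after a failed sync followed by a resumed sync. */
  static List<Integer> run(boolean commitStagedDataOnFailure) {
    List<Integer> finalTable = new ArrayList<>();
    List<Integer> staged = new ArrayList<>();

    // Sync 1: records 1..5 are staged and a state with cursor 5 is saved,
    // then the sync fails before record 6 arrives.
    for (int r = 1; r <= 5; r++) {
      staged.add(r);
    }
    int savedCursor = 5;
    if (commitStagedDataOnFailure) {
      finalTable.addAll(staged); // staged data is copied despite the failure
    }
    staged.clear(); // the staging area is cleaned up either way

    // Sync 2: the source resumes after the saved cursor and succeeds.
    for (int r = savedCursor + 1; r <= 10; r++) {
      staged.add(r);
    }
    finalTable.addAll(staged);
    return finalTable;
  }

  public static void main(String[] args) {
    System.out.println(run(false)); // records 1..5 are missing
    System.out.println(run(true));  // all ten records are present
  }
}
```

Records go missing only when a state is saved and the staged data is then discarded. If the staged data is always committed before the state is emitted, the resumed sync is complete.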

@cgardens
Contributor Author

cgardens commented Jul 14, 2021

@davinchia I think we're talking past each other. i don't think the case you're describing is possible given the way the code is written. so one of us is missing something. 😅 let's discuss offline.

Contributor

@sherifnada sherifnada left a comment

just a couple of changes requested but overall the logic seems sound

* buffered records are flushed out of memory using the user-provided recordWriter. When this flush
* happens, a state message is moved from pending to flushed. On close, if the user-provided onClose
* function is successful, then the flushed state record is considered committed and is then
* emitted. We expect this class to only ever emit either 1 state message (in the case of a full or
Contributor

shouldn't this class be emitting state messages potentially continuously to allow halfway failure checkpointing?

Contributor

or is the hangup here on normalization?

Contributor Author

Normalization is orthogonal to this.

Emitting state as we go would be another valid strategy for approaching checkpointing in destinations.

The current strategy is to consume records and put them in a temporary table in the destination until something goes wrong. As soon as something goes wrong (or we consume all records), we attempt to commit everything in the temporary table into the final tables. This approach gets us checkpointing behavior, but it is vulnerable to the case where the commit would have been successful for a subset of the flushed records but not all of them. I'm not sure to what extent this is really a high-frequency failure point.

The reason to not go for emitting state as we go is just expedience. If we emit state as we go then we will need to be committing records from the temporary tables to the final tables as we go. So now each sync can potentially have multiple temporary tables. It is definitely doable and something we should probably shoot for, but at the time I didn't have appetite for the added complexity.
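
To make the trade-off concrete, the alternative "emit state as we go" strategy might look roughly like the sketch below. It is only an illustration of the idea described above, not something this PR implements. The class `IncrementalCommitSketch` and its fields are hypothetical, and a real commit step could throw where the sketch simply copies a list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the alternative strategy; not what this PR implements.
final class IncrementalCommitSketch {
  final List<String> finalTable = new ArrayList<>();
  final List<String> emittedStates = new ArrayList<>();
  int tmpTablesUsed = 0;
  private List<String> tmpTable = new ArrayList<>();

  void acceptRecord(String record) {
    tmpTable.add(record);
  }

  /** On each state message, commit the current tmp table and only then emit the state. */
  void acceptState(String state) {
    finalTable.addAll(tmpTable);  // commit step; a real commit could throw here
    tmpTable = new ArrayList<>(); // a fresh tmp table for the next batch
    tmpTablesUsed++;
    emittedStates.add(state);
  }
}
```

The extra bookkeeping is visible even in this toy version: one sync now goes through several temporary tables, which is the added complexity mentioned above.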

// if any state message flushed that means we can still go for at least a partial success. if none
// was emitted, if there were still no failures, then we can still succeed. the latter case is full
// refresh.
onClose.accept(lastFlushedState == null && hasFailed);
Contributor

I've re-read this line like 5 times to map out the boolean table in my head. Do you mind unfurling this e.g:

if (lastFlushedState == null) {
  onClose.accept(hasFailed);
} else {
  outputRecordCollector.accept(lastFlushedState);
  onClose.accept(false);
}

Contributor Author

hahaha. no i don't mind at all. it took me a long time to write this expression (which should have been a sign that it was too hard to think about)

Contributor Author

if (lastFlushedState == null) {
  onClose.accept(hasFailed);
} else {
  onClose.accept(false);
}

if (lastFlushedState != null) {
  outputRecordCollector.accept(lastFlushedState);
}
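
As a sanity check on the boolean logic, the sketch below compares the original one-line expression with the unfurled form across all four input combinations. The class and method names (`OnCloseArgumentSketch`, `compact`, `unfurled`) are hypothetical helpers for illustration. Only the conditions themselves come from the snippets above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper names; only the boolean logic mirrors the snippets above.
final class OnCloseArgumentSketch {

  /** The original one-line expression passed to onClose. */
  static boolean compact(String lastFlushedState, boolean hasFailed) {
    return lastFlushedState == null && hasFailed;
  }

  /** The unfurled form of the same decision. */
  static boolean unfurled(String lastFlushedState, boolean hasFailed) {
    if (lastFlushedState == null) {
      return hasFailed;
    } else {
      return false;
    }
  }

  /** Builds one row per combination of inputs. */
  static List<String> truthTable() {
    List<String> rows = new ArrayList<>();
    for (String state : new String[] {null, "s1"}) {
      for (boolean hasFailed : new boolean[] {false, true}) {
        rows.add("flushedState=" + state + " hasFailed=" + hasFailed
            + " -> onClose(" + compact(state, hasFailed) + ")");
      }
    }
    return rows;
  }

  public static void main(String[] args) {
    truthTable().forEach(System.out::println);
  }
}
```

The only combination that passes `true` to onClose is "nothing flushed and a failure occurred". Every other combination passes `false`.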

Contributor

why separate them?

Contributor Author

Notes from talking with davin:

  • hasFailed
  • clarify the lastFlushedState concept (especially clarifying in the context of the copy strategy)

Contributor Author

also need to add rethrowing the exception on close?

Contributor

Spoke offline: went through code and confirmed this works for Copy.

One note here to rename hasFailed to something closer to commit or discard since this no longer represents failure.

Contributor Author

@cgardens cgardens Jul 16, 2021

@sherifnada they need to be separate because the conditional is not the same. one is lastFlushedState == null and one is lastFlushedState != null

Contributor

couldn't it go in the else clause?

}

@Test
void testExceptionDuringOnClose() throws Exception {
Contributor

this test (and potentially others) should verify that this line throws an exception (also that line should throw an exception)

Contributor Author

agreed.
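
One way such a check might look, written without a test framework so it stays self-contained, is sketched below. The names (`OnCloseExceptionSketch`, the simplified `close` helper) are hypothetical and do not come from the PR's test class. In JUnit 5 the same check is typically written with `assertThrows`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical, framework-free sketch; not the PR's actual test.
final class OnCloseExceptionSketch {

  /** Minimal close step: commit via onClose, then emit the flushed state. */
  static void close(String lastFlushedState, Consumer<Boolean> onClose,
                    Consumer<String> stateCollector) {
    onClose.accept(false);
    stateCollector.accept(lastFlushedState); // skipped when onClose throws
  }

  /** Returns true if close() propagated the exception and emitted no state. */
  static boolean exceptionPropagatesWithoutState() {
    List<String> emitted = new ArrayList<>();
    Consumer<Boolean> failingOnClose = hasFailed -> {
      throw new IllegalStateException("commit failed");
    };
    try {
      close("s1", failingOnClose, emitted::add);
      return false; // no exception reached the caller, so the check fails
    } catch (IllegalStateException e) {
      return emitted.isEmpty();
    }
  }
}
```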

* function is successful, then the flushed state record is considered committed and is then
* emitted. We expect this class to only ever emit either 1 state message (in the case of a full or
* partial success) or 0 state messages (in the case where the onClose step was never reached or did
* not complete without exception).
Contributor

Spoke offline: make clearer 'flushing' can mean very different things. Link to Copy Strategy implementation.

@cgardens cgardens force-pushed the cgardens/checkpointing_destination_failures branch from 570927c to 900efc0 Compare July 16, 2021 18:45
@cgardens
Contributor Author

@davinchia, I added the clarification around close destinations.

I started trying to change hasFailed and I became pretty convinced it is not a good idea. The boolean that is being passed is an indicator of what happened in the lifecycle of the consumer. Ultimately it is up to the implementer to use that information as it sees fit. If we switch it to commit, it states an opinion about how the destination should handle the failure, when ultimately that is destination-specific.
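
As a rough illustration of that point, two destinations could read the same flag differently. Both policies below are hypothetical and are not taken from any real connector. The class name `HasFailedInterpretationSketch` and its in-memory lists are assumptions made for the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical destinations; the names and behaviors are illustrative only.
final class HasFailedInterpretationSketch {
  final List<String> staged = new ArrayList<>(List.of("a", "b"));
  final List<String> finalTable = new ArrayList<>();

  /** One destination might discard staged data whenever a failure is reported. */
  Consumer<Boolean> discardOnFailure() {
    return hasFailed -> {
      if (!hasFailed) {
        finalTable.addAll(staged);
      }
      staged.clear();
    };
  }

  /** Another might commit whatever was staged, regardless of the flag. */
  Consumer<Boolean> commitRegardless() {
    return hasFailed -> {
      finalTable.addAll(staged);
      staged.clear();
    };
  }
}
```

Keeping the parameter named hasFailed leaves both policies open, whereas a name like commit would suggest only one of them.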

@cgardens
Contributor Author

@davinchia also I feel a little saner about this bug. We had mentioned that we would have expected to see this bug really frequently. Really, the only case where it appears is when the close function has a failure (which I can convince myself isn't too frequent). Other failures thrown during the lifecycle of the consumer are still thrown and thus cause the process to exit with a non-zero status code. The reason I was confused about this yesterday was that I forgot that FailureTrackingAirbyteMessageConsumer rethrows exceptions after it sees them in start and track.
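
The rethrow behavior mentioned above can be pictured with a minimal sketch. This is not the real FailureTrackingAirbyteMessageConsumer. The class `FailureTrackingSketch`, its string-typed messages, and its unchecked exceptions are simplifying assumptions.

```java
// Hypothetical sketch of a failure-tracking wrapper; not the real Airbyte class.
abstract class FailureTrackingSketch {
  private boolean hasFailed = false;

  protected abstract void acceptTracked(String message);

  protected abstract void close(boolean hasFailed);

  /** Records that a failure happened, then rethrows so the caller still sees it. */
  public final void accept(String message) {
    try {
      acceptTracked(message);
    } catch (RuntimeException e) {
      hasFailed = true;
      throw e;
    }
  }

  /** Passes the recorded flag to the subclass; exceptions from close propagate as-is. */
  public final void close() {
    close(hasFailed);
  }
}
```

Because the exception is rethrown rather than swallowed, a failure while consuming messages still reaches the caller, and the flag only informs what the close step does afterwards.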

@cgardens
Contributor Author

@sherifnada I think this should be ready for you to take another look.

@cgardens cgardens force-pushed the cgardens/checkpointing_destination_failures branch from ee723cc to 368533a Compare July 16, 2021 20:13
@cgardens
Contributor Author

cgardens commented Jul 16, 2021

/test connector=connectors/destination-mssql

🕑 connectors/destination-mssql https://github.com/airbytehq/airbyte/actions/runs/1038682924
❌ connectors/destination-mssql https://github.com/airbytehq/airbyte/actions/runs/1038682924

@cgardens
Contributor Author

cgardens commented Jul 16, 2021

/test connector=connectors/destination-mysql

🕑 connectors/destination-mysql https://github.com/airbytehq/airbyte/actions/runs/1038683051
✅ connectors/destination-mysql https://github.com/airbytehq/airbyte/actions/runs/1038683051

@cgardens
Contributor Author

cgardens commented Jul 16, 2021

/test connector=connectors/destination-jdbc

🕑 connectors/destination-jdbc https://github.com/airbytehq/airbyte/actions/runs/1038683445
✅ connectors/destination-jdbc https://github.com/airbytehq/airbyte/actions/runs/1038683445

@cgardens
Contributor Author

local

@cgardens
Contributor Author

cgardens commented Jul 16, 2021

/test connector=connectors/destination-local-json

🕑 connectors/destination-local-json https://github.com/airbytehq/airbyte/actions/runs/1038684185
✅ connectors/destination-local-json https://github.com/airbytehq/airbyte/actions/runs/1038684185

@cgardens
Contributor Author

cgardens commented Jul 16, 2021

/test connector=connectors/destination-meilisearch

🕑 connectors/destination-meilisearch https://github.com/airbytehq/airbyte/actions/runs/1038684378
✅ connectors/destination-meilisearch https://github.com/airbytehq/airbyte/actions/runs/1038684378

@cgardens cgardens force-pushed the cgardens/checkpointing_destination_failures branch from 368533a to c6fe7ae Compare July 21, 2021 21:52
@github-actions github-actions bot added the area/connectors Connector related issues label Jul 21, 2021
@cgardens
Contributor Author

cgardens commented Jul 21, 2021

/publish connector=connectors/destination-snowflake

🕑 connectors/destination-snowflake https://github.com/airbytehq/airbyte/actions/runs/1054158946
✅ connectors/destination-snowflake https://github.com/airbytehq/airbyte/actions/runs/1054158946

@cgardens
Contributor Author

cgardens commented Jul 21, 2021

/publish connector=connectors/destination-redshift

🕑 connectors/destination-redshift https://github.com/airbytehq/airbyte/actions/runs/1054159482
✅ connectors/destination-redshift https://github.com/airbytehq/airbyte/actions/runs/1054159482

@cgardens
Contributor Author

cgardens commented Jul 21, 2021

/publish connector=connectors/destination-postgres

🕑 connectors/destination-postgres https://github.com/airbytehq/airbyte/actions/runs/1054159828
✅ connectors/destination-postgres https://github.com/airbytehq/airbyte/actions/runs/1054159828

@cgardens
Contributor Author

cgardens commented Jul 21, 2021

/publish connector=connectors/destination-oracle

🕑 connectors/destination-oracle https://github.com/airbytehq/airbyte/actions/runs/1054160098
✅ connectors/destination-oracle https://github.com/airbytehq/airbyte/actions/runs/1054160098

@cgardens
Contributor Author

cgardens commented Jul 21, 2021

/publish connector=connectors/destination-meilisearch

🕑 connectors/destination-meilisearch https://github.com/airbytehq/airbyte/actions/runs/1054160478
✅ connectors/destination-meilisearch https://github.com/airbytehq/airbyte/actions/runs/1054160478

@cgardens
Contributor Author

cgardens commented Jul 21, 2021

/publish connector=connectors/destination-local-json

🕑 connectors/destination-local-json https://github.com/airbytehq/airbyte/actions/runs/1054160962
✅ connectors/destination-local-json https://github.com/airbytehq/airbyte/actions/runs/1054160962

@cgardens
Contributor Author

cgardens commented Jul 21, 2021

/publish connector=connectors/destination-mysql

🕑 connectors/destination-mysql https://github.com/airbytehq/airbyte/actions/runs/1054161252
✅ connectors/destination-mysql https://github.com/airbytehq/airbyte/actions/runs/1054161252

@cgardens
Contributor Author

cgardens commented Jul 21, 2021

/publish connector=connectors/destination-mssql

🕑 connectors/destination-mssql https://github.com/airbytehq/airbyte/actions/runs/1054161939
✅ connectors/destination-mssql https://github.com/airbytehq/airbyte/actions/runs/1054161939

@cgardens
Contributor Author

cgardens commented Jul 21, 2021

/publish connector=connectors/destination-csv

🕑 connectors/destination-csv https://github.com/airbytehq/airbyte/actions/runs/1054164943
✅ connectors/destination-csv https://github.com/airbytehq/airbyte/actions/runs/1054164943

@cgardens
Contributor Author

cgardens commented Jul 21, 2021

/publish connector=connectors/destination-bigquery

🕑 connectors/destination-bigquery https://github.com/airbytehq/airbyte/actions/runs/1054166173
✅ connectors/destination-bigquery https://github.com/airbytehq/airbyte/actions/runs/1054166173

@cgardens
Contributor Author

cgardens commented Jul 21, 2021

/publish connector=connectors/destination-bigquery-denormalized

🕑 connectors/destination-bigquery-denormalized https://github.com/airbytehq/airbyte/actions/runs/1054166818
✅ connectors/destination-bigquery-denormalized https://github.com/airbytehq/airbyte/actions/runs/1054166818

4 participants