Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

End to end acknowledgments for Dynamo source #3538

Closed
graytaylor0 opened this issue Oct 21, 2023 · 1 comment · Fixed by #3575
Closed

End to end acknowledgments for Dynamo source #3538

graytaylor0 opened this issue Oct 21, 2023 · 1 comment · Fixed by #3575
Assignees
Labels
enhancement New feature or request
Milestone

Comments

@graytaylor0
Copy link
Member

Is your feature request related to a problem? Please describe.
Currently, the s3 and opensearch source have an internal acknowledgment timeout of 1 or 2 hours. This is because these are mainly used for historical migrations of data, so they are less time sensitive. However, as dynamodb streams is a streaming use case where order as well as any delay really matters, it is not as simple to provide a high default timeout like these other sources.

Describe the solution you'd like
The simple approach would be to just make the timeout configurable for the user with an acknowledgment_timeout parameter that specifies the timeout, but this is not ideal either since this value can be hard to gauge for streams on their own (also with OpenSearch backpressure), and would still require a fairly large default value to make sure that timeouts don't occur too early right before the data gets to OpenSearch. We need a way to lower the acknowledgment_timeout to quickly recover for streams, but to do that we need to lower the amount of data, since a shard could take a long time to process, which means a longer time to indicate a failure and timeout (and save progress instead of restarting the entire shard when the last part of it has an acknowledgment timeout).

To handle these complications, I would propose that we chunk the AcknowledgmentSets for the dynamo source to a general amount of data (in bytes). We can do this by grouping by sequence numbers for a DDB stream shard. For example, we have a configurable

acknowledgment_checkpoint_size: "10mb"
acknowledgment_timeout: "30s"

which will keep reading from the shard until it hits a sequence number that goes to (or past this point, we don't care if it's a little overestimated). Now when we receive an acknowledgment for one of these chunks (we should receive them in order), we can update the partition ownership timeout for that partition to be something small (potentially as low as 30 seconds, or however long the acknowledgment_checkpoint_size takes to travel through the pipeline (#3494 could be used to tune these values)but it is wise for us to allow this to be configurable as well since there is always sink backpressure. For a well scaled sink, low values for acknowledgment_timeout may be possible, which would allow nodes to pick up on crashes very quickly and continue processing the shard.

When an acknowledgment callback is received, not only do we update the ownership timeout with acknowledgment_timeout amount of time, we also update the partition progress state of the partition with the last sequence number that we received an ack callback for. This will allow us to pick up right where we left off in the case of failure instead of having to restart the processing of the entire shard.

Describe alternatives you've considered (Optional)
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

@graytaylor0
Copy link
Member Author

For now, due to simplicity, it may make sense to have a single AcknowledgmentSet per shard, and have an hour timeout for each. We can optimize to get closer to the above proposal in the future if needed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

2 participants