Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

File CDK: Implement incremental sync #26758

Closed
5 tasks
clnoll opened this issue May 30, 2023 · 0 comments
Closed
5 tasks

File CDK: Implement incremental sync #26758

clnoll opened this issue May 30, 2023 · 0 comments

Comments

@clnoll
Copy link
Contributor

clnoll commented May 30, 2023

Support incremental syncs using the pattern established in the existing implementation.

This addresses the race condition described in the tech spec where we have files A and B where the last-modified timestamp of A is earlier than B. If B becomes visible to the storage system before A, we could accidentally omit file A (since the cutoff time for the next slice may be based on B’s timestamp).

To address this, we modify the state message to include a “history” key whose value is a mapping of timestamp to a list of the last files that were synced at that time. When a sync starts, if we see any files between the timestamps in the “history” key, whose value is not in the list for the associated timestamp, we’ve skipped the file and should sync it.

Some of the existing code can be reused, but some aspects of it should be modified:

  • The size of the history key is set to 1GB, but I believe we want to keep state messages under 1MB (to avoid hitting the 2MB max).
  • Currently, if there are more files than will fit into the history, we delete the history key and always sync all records. This should be modified to only sync files newer than (3?) days, per the proposal (see section 7 Incremental Syncs), or equal to or newer than the oldest file(s) recorded in the history if there are too many files newer than 3 days. To support this, we'll need to know whether the number of synced files fit into the history. This can be achieved by adding a boolean to the history key indicating whether it was able to store all the files. We should also stop deleting the history key when the number of files synced gets too large, and instead keep a moving window of synced files where the oldest drop off if the history gets too large.
  • We should log a warning when the number of files to be synced exceeds that allowed in the history key.

Acceptance criteria

  • When read is called for an incremental sync, we output a state message with a "history" key containing a mapping of last-modified timestamp to a list of synced files.
  • When the number of files with the same last-modified timestamp is too large, we log an appropriate warning.
  • Tests verify that we sync any new files that have shown up between the timestamps in the history key, if the history key does not exceed the maximum size.
  • Tests verify that we sync only files newer than the last timestamp in the history key, if the history key does exceed the maximum size.
  • Support and User-facing documentation is created describing the new contract for incremental syncs.
@clnoll clnoll changed the title Implement incremental reads File CDK: Implement incremental reads May 30, 2023
@clnoll clnoll changed the title File CDK: Implement incremental reads File CDK: Implement incremental sync May 30, 2023
@girarda girarda self-assigned this Jun 8, 2023
@clnoll clnoll closed this as completed Jul 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants