Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ctlog: write pending tiles to protect against concurrent sequencers #1

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

jellevandenhooff
Copy link
Owner

If two sequencers (somehow) end up running at the same time, they could scribble over each other's uploads to the backend storage.

Protect against that special case by writing the tiles in two steps. First, all tiles are staged in a unique tar file (keyed by the hash of the new tree), and only written to their final location after a successful lock update.

In the normal case, this will incur one extra upload and delete, adding one write operation, and double the amount of data written to backend storage.

To make sure that the (rare) recovery code path is well-tested, use the same code path in both the recovery and normal case.

A special IAM policy could allow deletes only of pending files.

If two sequencers (somehow) end up running at the same time, they
could scribble over each other's uploads to the backend storage.

Protect against that special case by writing the tiles in two steps. First, all
tiles are staged in a unique tar file (keyed by the hash of the new tree), and
only written to their final location after a successful lock update.

In the normal case, this will incur one extra upload and delete, adding one
write operation, and double the amount of data written to backend storage.

To make sure that the (rare) recovery code path is well-tested, use the same
code path in both the recovery and normal case.

A special IAM policy could allow deletes only of pending files.
if err := uploadPending(ctx, l.c, checkpoint, tree.Tree); err != nil {
// Return an error so we don't produce SCTs that, although committed,
// aren't yet part of a publicly visible tree.
return fmtErrorf("couldn't copy pending files to object storage: %w", err)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello! Sorry for resurrecting this so much later. I am finally doing a pass of large changes to Sunlight, and decided to go for a variant of this, thank you so much for bringing up the issue and proposing this solution.

I think this needs to be a fatal error, right? If we just fail the round and continue, the next round will make progress, but there will be missing tiles in the tree. (They will be in staging because we won't reach the Delete, so it will be recoverable, but still.)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, unsure how TestRecoverLog is passing, since it checks that Sequence returns an error, which only happens for fatal errors.

Copy link
Owner Author

@jellevandenhooff jellevandenhooff Aug 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello! Sorry for resurrecting this so much later. I am finally doing a pass of large changes to Sunlight, and decided to go for a variant of this, thank you so much for bringing up the issue and proposing this solution.

That's exciting to hear! No worries at all. Let me know if you would like me to update this PR or if you prefer your own code.

I think this needs to be a fatal error, right? If we just fail the round and continue, the next round will make progress, but there will be missing tiles in the tree. (They will be in staging because we won't reach the Delete, so it will be recoverable, but still.)

That sounds right to me. It would be good to spell out exactly what the protocol guarantees at each write to either the lock or storage backends. With a fatal error here you get the guarantee that every lock update is followed by a successful uploadPending call, and I agree that without it's not clear that all tiles will be uploaded.

Also, unsure how TestRecoverLog is passing, since it checks that Sequence returns an error, which only happens for fatal errors.

It's been a while and so I don't remember the details, but I think TestRecoverLog passes because uploadPending returns a fatal error in the test.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented this in FiloSottile@1ed5201. If you have time to review it, that would be amazing.

FiloSottile added a commit to FiloSottile/sunlight that referenced this pull request Aug 4, 2024
Based on a design by @jellevandenhooff at jellevandenhooff#1.

Fixes #11

Co-authored-by: Jelle van den Hooff <jelle@vandenhooff.name>
FiloSottile added a commit to FiloSottile/sunlight that referenced this pull request Aug 4, 2024
Based on a design by @jellevandenhooff at jellevandenhooff#1.

Fixes #11

Co-authored-by: Jelle van den Hooff <jelle@vandenhooff.name>
FiloSottile added a commit to FiloSottile/sunlight that referenced this pull request Aug 7, 2024
Based on a design by @jellevandenhooff at jellevandenhooff#1.

Fixes #11

Co-authored-by: Jelle van den Hooff <jelle@vandenhooff.name>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants