Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ctlog: test recover behavior with errors on overwrite #12

Conversation

jellevandenhooff
Copy link
Contributor

@jellevandenhooff jellevandenhooff commented Mar 17, 2024

Test recovery behavior when using conditional requests to upload immutable files (using "If-Match" on Tigris).

A sequencing failure after uploading a tile but before updating the lock can cause the log to get stuck:

  • Sequencing starts.
  • The sequencer writer tile file "A".
  • The sequences fails to update the lock.
  • Sequencing restarts.
  • The sequencer tries to write tile file "A".
  • The sequencer fails because tile file "A" already exists.

In the log this looks as follows:

testlog_test.go:79: time=2024-03-16T17:44:35.325-07:00 level=DEBUG source=ctlog.go:751 msg="uploading partial data tile" tree_size=1 tile="{H:8 L:-1 N:0 W:1}" size=25
testlog_test.go:79: time=2024-03-16T17:44:35.325-07:00 level=DEBUG source=ctlog.go:771 msg="uploading tree tile" old_tree_size=0 tree_size=1 tile="{H:8 L:0 N:0 W:1}" size=32
testlog_test.go:79: time=2024-03-16T17:44:35.325-07:00 level=DEBUG source=ctlog.go:795 msg="uploading checkpoint" size=209
testlog_test.go:79: time=2024-03-16T17:44:35.325-07:00 level=ERROR source=ctlog.go:671 msg="pool sequencing failed" old_tree_size=0 entries=1 err="fatal sequencing error\nfailing commit checkpoint for test"
...
testlog_test.go:79: time=2024-03-16T17:44:35.326-07:00 level=DEBUG source=ctlog.go:751 msg="uploading partial data tile" tree_size=1 tile="{H:8 L:-1 N:0 W:1}" size=32
testlog_test.go:79: time=2024-03-16T17:44:35.326-07:00 level=DEBUG source=ctlog.go:771 msg="uploading tree tile" old_tree_size=0 tree_size=1 tile="{H:8 L:0 N:0 W:1}" size=32
testlog_test.go:79: time=2024-03-16T17:44:35.326-07:00 level=ERROR source=ctlog.go:671 msg="pool sequencing failed" old_tree_size=0 entries=1 err="couldn't upload a tile: key \"tile/8/0/000.p/1\" already exists"

Test recovery behavior when using conditional requests to upload immutable
files (using "If-Match" on Tigris).

A sequencing failure after uploading a tile but before updating the lock can
cause the log to get stuck:
- Sequencing starts.
- The sequencer writer tile file "A".
- The sequences fails to update the lock.
- Sequencing restarts.
- The sequencer tries to write tile file "A".
- The sequencer fails because tile file "A" already exists.

In the log this looks as follows:

    testlog_test.go:79: time=2024-03-16T17:44:35.325-07:00 level=DEBUG source=ctlog.go:751 msg="uploading partial data tile" tree_size=1 tile="{H:8 L:-1 N:0 W:1}" size=25
    testlog_test.go:79: time=2024-03-16T17:44:35.325-07:00 level=DEBUG source=ctlog.go:771 msg="uploading tree tile" old_tree_size=0 tree_size=1 tile="{H:8 L:0 N:0 W:1}" size=32
    testlog_test.go:79: time=2024-03-16T17:44:35.325-07:00 level=DEBUG source=ctlog.go:795 msg="uploading checkpoint" size=209
    testlog_test.go:79: time=2024-03-16T17:44:35.325-07:00 level=ERROR source=ctlog.go:671 msg="pool sequencing failed" old_tree_size=0 entries=1 err="fatal sequencing error\nfailing commit checkpoint for test"
    ...
    testlog_test.go:79: time=2024-03-16T17:44:35.326-07:00 level=DEBUG source=ctlog.go:751 msg="uploading partial data tile" tree_size=1 tile="{H:8 L:-1 N:0 W:1}" size=32
    testlog_test.go:79: time=2024-03-16T17:44:35.326-07:00 level=DEBUG source=ctlog.go:771 msg="uploading tree tile" old_tree_size=0 tree_size=1 tile="{H:8 L:0 N:0 W:1}" size=32
    testlog_test.go:79: time=2024-03-16T17:44:35.326-07:00 level=ERROR source=ctlog.go:671 msg="pool sequencing failed" old_tree_size=0 entries=1 err="couldn't upload a tile: key \"tile/8/0/000.p/1\" already exists"
FiloSottile added a commit that referenced this pull request Mar 19, 2024
This can cause the log to lock up as describe in #12, or even in steady
state due to request hedging: if request A wins, but request B returns a
412 before request A's 200, the sequencing will fail (safely) and leave
behind non-overwritable tiles.

This means the tree might suffer fatal failures on Tigris if two
sequencers run at the same time, or if the stale Get issues in the
LockBackend reoccur, since there is no object versioning.
FiloSottile added a commit that referenced this pull request Aug 4, 2024
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants