Disaster recovery integration test #13

Closed
arhag opened this issue Mar 25, 2024 · 0 comments · Fixed by #54, #72 or #153
Labels: 👍 lgtm, OCI (Work exclusive to OCI team)

arhag commented Mar 25, 2024

Scenario 1

Create an integration test with 4 nodes (A, B, C, and D), each of which has its own producer and finalizer. The finalizer policy consists of the four finalizers with a threshold of 3. The proposer policy involves all four proposers.
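
To make the threshold concrete, here is a minimal sketch of the quorum rule implied by this policy. `FinalizerPolicy` and `has_quorum` are hypothetical names made up for illustration, not part of any real test harness, and the sketch assumes every finalizer carries an equal weight of 1, which the issue does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FinalizerPolicy:
    """Hypothetical stand-in for a finalizer policy with equal-weight finalizers."""
    finalizers: frozenset
    threshold: int

    def has_quorum(self, voters) -> bool:
        # Only votes from finalizers that are in the policy count toward the threshold.
        return len(self.finalizers & set(voters)) >= self.threshold


# Four finalizers, one per node, with a threshold of 3 as in the scenario.
policy = FinalizerPolicy(finalizers=frozenset({"A", "B", "C", "D"}), threshold=3)

print(policy.has_quorum({"A", "B", "C"}))  # True
print(policy.has_quorum({"A", "B"}))       # False
```

With this policy, any single node can be missing and a quorum is still reachable, but losing two nodes stalls finality.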

The 4 nodes are cleanly shut down in the following state:

  • A has LIB N. A has a finalizer safety information file that locks on a block after N.
  • B, C, and D have LIB less than N. They have finalizer safety information files that lock on N.

Nodes B, C, and D lose their reversible blocks. All nodes restart from an earlier snapshot.

A is restarted from the snapshot and replays up to its last reversible block (which has a block number greater than N). Blocks N and later are sent to the other nodes B, C, and D after they have also been started up again.

Verify that LIB advances and that A, B, C, and D are eventually voting strong on new blocks.
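
The "LIB advances" check can be sketched as a polling helper. This is only an illustration: `wait_for_lib_advance` and the `get_lib` callable are hypothetical, and how LIB is actually read from a node is left to the test harness.

```python
import time
from typing import Callable


def wait_for_lib_advance(get_lib: Callable[[], int], min_advance: int = 1,
                         timeout: float = 30.0, poll: float = 0.5,
                         sleep=time.sleep, clock=time.monotonic) -> bool:
    """Poll get_lib() until LIB has advanced by min_advance blocks or timeout expires."""
    start_lib = get_lib()
    deadline = clock() + timeout
    while clock() < deadline:
        if get_lib() >= start_lib + min_advance:
            return True
        sleep(poll)
    return False


# Usage with a fake LIB source (assumed readings, not real node output).
readings = iter([100, 100, 101])
print(wait_for_lib_advance(lambda: next(readings), sleep=lambda s: None))  # True
```

The same helper returning `False` can serve the "LIB has stalled" checks, as long as the timeout is long enough to rule out a slow but advancing LIB.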

Scenario 2

Create an integration test with 5 nodes (A, B, C, D, and P). Nodes A, B, C, and D each have one finalizer but no proposers. Node P has a proposer but no finalizers. The finalizer policy consists of the four finalizers with a threshold of 3. The proposer policy involves just the single proposer P.

A, B, C, and D can be connected to each other however we like as long as blocks sent to node A can traverse to the other nodes B, C, and D. However, node P should only be connected to node A.
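
The connectivity requirement can be checked mechanically before the test starts. The sketch below assumes the topology is described as a plain dictionary mapping each node name to the list of its peers. That representation, and the `reachable` and `topology_ok` names, are illustrative assumptions.

```python
from collections import deque


def reachable(peers, start):
    """Return the set of nodes reachable from start via breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for peer in peers.get(node, ()):
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return seen


def topology_ok(peers):
    """Check the Scenario 2 constraints on a peer map."""
    # P may only be connected to A, and no node other than A may list P.
    if set(peers.get("P", ())) != {"A"}:
        return False
    if any("P" in links for node, links in peers.items() if node not in ("A", "P")):
        return False
    # Blocks arriving at A must reach B, C, and D without going through P.
    finalizer_links = {
        node: [p for p in links if p != "P"]
        for node, links in peers.items()
        if node != "P"
    }
    return {"B", "C", "D"} <= reachable(finalizer_links, "A")


# A simple line topology: P - A - B - C - D
line = {"P": ["A"], "A": ["P", "B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
print(topology_ok(line))  # True
```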

At some point after the IF transition has completed and LIB is advancing, block production on node P should be paused. Enough time should be given to allow any in-flight votes on the latest produced blocks to be delivered to node P. Then, the connection between node P and node A should be severed, and block production on node P resumed. The LIB on node P should advance to, and then stall at, block N. Shortly after that, node P should be cleanly shut down.

Verify that the LIB on A, B, C, and D has stalled and is less than block N. Then, nodes A, B, C, and D can all be cleanly shut down.

Then, reversible blocks from all nodes should be removed. All nodes are restarted from an earlier snapshot (prior to block N).

P is restarted and replays up to block N after restarting from snapshot. Blocks up to and including block N are sent to the other nodes A, B, C, and D after they are also started up again.

Verify that LIB advances and that A, B, C, and D are eventually voting strong on new blocks.

Scenario 3

Create an integration test with 4 nodes (A, B, C, and D), each of which has its own producer and finalizer. The finalizer policy consists of the four finalizers with a threshold of 3. The proposer policy involves all four proposers.

  • At least two of the four nodes should have a LIB N and a finalizer safety information file that locks on a block after N. The other two nodes should have a LIB that is less than or equal to block N.

All nodes are shut down. The reversible blocks on all nodes are deleted. Then all nodes are restarted from an earlier snapshot.

All nodes eventually sync up to block N. Some nodes will consider block N to be LIB but others may not.

Not enough finalizers should be voting because of the lock in their finalizer safety information file. Verify that LIB does not advance on any node.

Cleanly shut down all nodes and delete their finalizer safety information files. Then restart the nodes.
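
The stall and its resolution come down to simple counting. The sketch below is a deliberate simplification: it assumes a finalizer whose safety file locks on a block that no longer exists casts no vote at all, and that every finalizer has equal weight. `lib_can_advance` is a hypothetical helper, not a harness function.

```python
def lib_can_advance(finalizers, threshold, locked_out):
    """True if enough finalizers remain able to vote to reach the threshold."""
    able_to_vote = set(finalizers) - set(locked_out)
    return len(able_to_vote) >= threshold


finalizers = {"A", "B", "C", "D"}

# Two finalizers are held back by their locks: only 2 of the required 3 can vote.
print(lib_can_advance(finalizers, 3, locked_out={"A", "B"}))  # False

# After the safety files are deleted, no finalizer is locked out.
print(lib_can_advance(finalizers, 3, locked_out=set()))       # True
```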

Verify that LIB advances on all nodes and they all agree on the LIB. In particular, verify that block N has the same ID on all nodes as it had before the nodes were first shut down.
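
One way to express this final check is sketched below. It assumes the test has already collected each node's LIB number and its ID for block N into dictionaries, and recorded the ID of block N before the first shutdown. The function name, the dictionary shapes, and the example IDs are all hypothetical.

```python
def recovery_problems(lib_by_node, block_n_id_by_node, original_block_n_id):
    """Return a list of human-readable problems; an empty list means the check passed."""
    problems = []
    # All nodes should report the same LIB.
    if len(set(lib_by_node.values())) > 1:
        problems.append(f"nodes disagree on LIB: {lib_by_node}")
    # Block N must be the same block that existed before the first shutdown.
    for node, block_id in sorted(block_n_id_by_node.items()):
        if block_id != original_block_n_id:
            problems.append(f"node {node} has a different block N: {block_id}")
    return problems


# Made-up values for illustration only.
libs = {"A": 250, "B": 250, "C": 250, "D": 250}
ids = {"A": "00c8aa", "B": "00c8aa", "C": "00c8aa", "D": "00c8aa"}
print(recovery_problems(libs, ids, "00c8aa"))  # []
```

Returning a list of problems rather than a boolean makes a failing run easier to diagnose, since every mismatching node is reported at once.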

@arhag arhag transferred this issue from AntelopeIO/leap Apr 10, 2024
@arhag arhag added this to the Savanna: Production-Ready milestone Apr 10, 2024
@arhag arhag changed the title from "IF: Disaster recovery integration test" to "Disaster recovery integration test" Apr 10, 2024
@arhag arhag added 👍 lgtm and removed triage labels Apr 10, 2024
@heifner heifner self-assigned this Apr 15, 2024
@heifner heifner added the OCI Work exclusive to OCI team label Apr 15, 2024
@heifner heifner moved this from Todo to In Progress in Team Backlog Apr 15, 2024
@heifner heifner linked a pull request Apr 19, 2024 that will close this issue
heifner added a commit that referenced this issue Apr 23, 2024
heifner added a commit that referenced this issue Apr 23, 2024
heifner added a commit that referenced this issue Apr 23, 2024
@BenjaminGormanPMP BenjaminGormanPMP moved this from In Progress to Awaiting Review in Team Backlog Apr 23, 2024
heifner added a commit that referenced this issue Apr 25, 2024
IF: Add the beginning of a savanna disaster recovery test
@heifner heifner moved this from Awaiting Review to In Progress in Team Backlog Apr 25, 2024
@heifner heifner linked a pull request Apr 25, 2024 that will close this issue
heifner added a commit that referenced this issue Apr 25, 2024
@BenjaminGormanPMP BenjaminGormanPMP moved this from In Progress to Awaiting Review in Team Backlog Apr 25, 2024
heifner added a commit that referenced this issue Apr 29, 2024
heifner added a commit that referenced this issue Apr 29, 2024
heifner added a commit that referenced this issue Apr 29, 2024
heifner added a commit that referenced this issue Apr 30, 2024
heifner added a commit that referenced this issue Apr 30, 2024
heifner added a commit that referenced this issue May 1, 2024
IF: Disaster_recovery scenario 2 test
@github-project-automation github-project-automation bot moved this from Awaiting Review to Done in Team Backlog May 1, 2024
@heifner heifner reopened this May 1, 2024
@github-project-automation github-project-automation bot moved this from Done to Todo in Team Backlog May 1, 2024
heifner added a commit that referenced this issue May 17, 2024
heifner added a commit that referenced this issue May 17, 2024
@heifner heifner moved this from Todo to In Progress in Team Backlog May 17, 2024
heifner added a commit that referenced this issue May 17, 2024
heifner added a commit that referenced this issue May 20, 2024
…lso update the wait on node2 and node3 to be on N.
@BenjaminGormanPMP BenjaminGormanPMP moved this from In Progress to Awaiting Review in Team Backlog May 21, 2024
heifner added a commit that referenced this issue May 22, 2024
heifner added a commit that referenced this issue May 22, 2024
@github-project-automation github-project-automation bot moved this from Awaiting Review to Done in Team Backlog May 22, 2024