FSM fault injection #13419

tgross · 2022-06-17T18:35:55Z

This changeset is a proof-of-concept for a fault injection interface
into the FSM.Apply function. This would allow us to introduce
timeouts or errors in unit testing by adding a LogApplier
implementation to a map of interceptionAppliers. This is similar to
how we register LogAppliers for the enterprise FSM functions
currently. Most interception appliers are expected to then call the
normal applier directly.
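
For discussion, here is a minimal sketch of the interception idea described above. It is not the code in this changeset: the `FSM` struct, the `MessageType` constant, the `failFirst` helper, and the simplified `Apply` signature are all hypothetical stand-ins, and only the `LogApplier` and `interceptionAppliers` names come from the description. It assumes an interception applier takes precedence over the normal applier for the same message type and usually delegates to it.

```go
package main

import (
	"errors"
	"fmt"
)

// MessageType is a simplified stand-in for a Raft log message type.
type MessageType uint8

// jobRegisterType is a hypothetical message type used only in this sketch.
const jobRegisterType MessageType = 1

// LogApplier applies one Raft log entry and returns a result or an error value.
type LogApplier func(buf []byte, index uint64) interface{}

// FSM is a toy state machine: appliers holds the normal appliers and
// interceptionAppliers holds test-only overrides keyed by message type.
type FSM struct {
	appliers             map[MessageType]LogApplier
	interceptionAppliers map[MessageType]LogApplier
	applied              []uint64 // indexes that reached the normal applier
}

// NewFSM registers the normal applier and leaves the interception map empty.
func NewFSM() *FSM {
	f := &FSM{interceptionAppliers: map[MessageType]LogApplier{}}
	f.appliers = map[MessageType]LogApplier{jobRegisterType: f.applyJobRegister}
	return f
}

// applyJobRegister is the "normal" applier; it only records the index.
func (f *FSM) applyJobRegister(buf []byte, index uint64) interface{} {
	f.applied = append(f.applied, index)
	return nil
}

// Apply prefers an interception applier when one is registered for the
// message type, and otherwise falls back to the normal applier.
func (f *FSM) Apply(msgType MessageType, buf []byte, index uint64) interface{} {
	if applier, ok := f.interceptionAppliers[msgType]; ok {
		return applier(buf, index)
	}
	if applier, ok := f.appliers[msgType]; ok {
		return applier(buf, index)
	}
	return fmt.Errorf("no applier for message type %d", msgType)
}

// errInjected is the fault returned by the interception applier below.
var errInjected = errors.New("injected fault")

// failFirst returns an interception applier that fails the first n calls
// and then delegates every later call to next, the normal applier.
func failFirst(n int, next LogApplier) LogApplier {
	remaining := n
	return func(buf []byte, index uint64) interface{} {
		if remaining > 0 {
			remaining--
			return errInjected
		}
		return next(buf, index)
	}
}
```

A test would register the fault with something like `f.interceptionAppliers[jobRegisterType] = failFirst(1, f.appliers[jobRegisterType])` and then drive `f.Apply`. A timeout could be injected the same way by sleeping before delegating.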

This was developed initially for #13407 but can't be used to reproduce
that particular bug. But I'm opening this PR for further discussion
about whether this is a worthwhile tool to have for testing otherwise.
(Once #13407 is merged I'll rebase this on main)

cc @lgfa29 @jazzyfresh

The plan applier has to get a snapshot with a minimum index for the plan it's working on in order to ensure consistency. Under heavy Raft load, getting that snapshot can exceed the timeout. When this happens, we hit a bug where the plan applier blocks waiting on the `indexCh` forever, and all schedulers block in `Plan.Submit`. Closing the `indexCh` when the `asyncPlanWait` is done with it prevents the deadlock without impacting correctness of the previous snapshot index. This changeset includes a PoC failing test that works by injecting a large timeout into the state store. We need to turn this into a test we can run normally without breaking the state store before we can merge this PR.
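
The channel-closing fix described above can be illustrated with a small sketch. This is not the plan applier's actual code: `asyncWait`, `nextSnapshotIndex`, and `errTimeout` are hypothetical names, and the sketch assumes the waiter sends an index only on success. The point is that closing the channel on every exit path unblocks the receiver, which then keeps its previous index.

```go
package main

import "errors"

// errTimeout is a hypothetical error standing in for a snapshot timeout.
var errTimeout = errors.New("timed out waiting for index")

// asyncWait is a simplified stand-in for the async wait on a plan's result.
// It sends the new index only on success, but always closes indexCh when it
// is done so that a receiver is never left blocked.
func asyncWait(indexCh chan<- uint64, apply func() (uint64, error)) {
	defer close(indexCh)
	index, err := apply()
	if err != nil {
		return // nothing is sent; the deferred close unblocks the receiver
	}
	indexCh <- index
}

// nextSnapshotIndex waits for the async result and returns the index to use
// for the next snapshot. A closed channel with no value means the wait
// failed, so the previous index is kept.
func nextSnapshotIndex(prev uint64, apply func() (uint64, error)) uint64 {
	indexCh := make(chan uint64, 1)
	go asyncWait(indexCh, apply)
	if idx, ok := <-indexCh; ok && idx > prev {
		return idx
	}
	return prev
}
```

Without the `defer close(indexCh)`, the receive in `nextSnapshotIndex` would block forever whenever `apply` returns an error, which is the shape of the deadlock described above.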


tgross · 2022-08-17T14:07:37Z

I'm going to close this out, as it doesn't really seem all that worthwhile except as a one-off experiment.

github-actions · 2022-12-16T02:13:27Z

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

tgross added 3 commits June 17, 2022 09:01

remove temporarily broken state store code

6fab937

changelog entry

5e0964e

tgross added the stage/needs-discussion and theme/testing labels Jun 17, 2022

vercel bot deployed to Preview – nomad-storybook-and-ui June 17, 2022 18:38

tgross force-pushed the fsm-fault-injection branch from 98a79b8 to 41c5318 June 17, 2022 19:03

vercel bot deployed to Preview – nomad-storybook-and-ui June 17, 2022 19:06

Base automatically changed from plan-apply-deadlock to main June 23, 2022 16:06

tgross closed this Aug 17, 2022

github-actions bot locked as resolved and limited conversation to collaborators Dec 16, 2022
