Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FSM fault injection #13419

Closed
wants to merge 4 commits into from
Closed

FSM fault injection #13419

wants to merge 4 commits into from

Conversation

tgross
Copy link
Member

@tgross tgross commented Jun 17, 2022

This changeset is a proof-of-concept for a fault injection interface
into the FSM.Apply function. This would allow us to introduce
timeouts or errors in unit testing by adding a LogApplier
implementation to a map of interceptionAppliers. This is similar to
how we register LogAppliers for the enterprise FSM functions
currently. Most interception appliers are expected to then call the
normal applier directly.

This was developed initially for #13407 but can't be used to reproduce
that particular bug. But I'm opening this PR for further discussion
about whether this is a worthwhile tool to have for testing otherwise.
(Once #13407 is merged I'll rebase this on main)

cc @lgfa29 @jazzyfresh

The plan applier has to get a snapshot with a minimum index for the
plan it's working on in order to ensure consistency. Under heavy raft
loads, we can exceed the timeout. When this happens, we hit a bug
where the plan applier blocks waiting on the `indexCh` forever, and
all schedulers will block in `Plan.Submit`.

Closing the `indexCh` when the `asyncPlanWait` is done with it will
prevent the deadlock without impacting correctness of the previous
snapshot index.

This changeset includes the a PoC failing test that works by injecting
a large timeout into the state store. We need to turn this into a test
we can run normally without breaking the state store before we can
merge this PR.
This changeset is a proof-of-concept for a fault injection interface
into the `FSM.Apply` function. This would allow us to introduce
timeouts or errors in unit testing by adding a LogApplier
implementation to a map of `interceptionAppliers`. This is similar to
how we register LogAppliers for the enterprise FSM functions
currently. Most interception appliers are expected to then call the
normal applier directly.

This was developed initially for #13407 but can't be used to reproduce
that particular bug. But I'm opening this PR for further discussion
about whether this is a worthwhile tool to have for testing otherwise.
@tgross
Copy link
Member Author

tgross commented Aug 17, 2022

I'm going to close this out, as it doesn't really seem all that worthwhile except as a one-off experiment.

@tgross tgross closed this Aug 17, 2022
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 16, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant