FSM fault injection #13419

Closed
wants to merge 4 commits into from

Commits on Jun 17, 2022

  1. fix deadlock in plan_apply

    The plan applier has to get a snapshot with a minimum index for the
    plan it's working on in order to ensure consistency. Under heavy raft
    loads, we can exceed the timeout. When this happens, we hit a bug
    where the plan applier blocks waiting on the `indexCh` forever, and
    all schedulers will block in `Plan.Submit`.
    
    Closing the `indexCh` when the `asyncPlanWait` is done with it will
    prevent the deadlock without impacting correctness of the previous
    snapshot index.
    
    This changeset includes a PoC failing test that works by injecting
    a large timeout into the state store. We need to turn this into a test
    we can run normally without breaking the state store before we can
    merge this PR.
    tgross committed Jun 17, 2022
    2f7f862
  2. 6fab937
  3. changelog entry

    tgross committed Jun 17, 2022
    5e0964e
  4. proof of concept for injecting faults into FSM.Apply

    This changeset is a proof-of-concept for a fault injection interface
    into the `FSM.Apply` function. This would allow us to introduce
    timeouts or errors in unit testing by adding a LogApplier
    implementation to a map of `interceptionAppliers`. This is similar to
    how we register LogAppliers for the enterprise FSM functions
    currently. Most interception appliers are expected to then call the
    normal applier directly.
    
    This was developed initially for #13407 but can't be used to reproduce
    that particular bug. I'm opening this PR anyway to discuss whether
    this is a worthwhile tool to have for other testing.
    tgross committed Jun 17, 2022
    41c5318
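
The deadlock described in the first commit (2f7f862) can be illustrated with a minimal sketch. This is not the code from the PR: `asyncWait`, `nextSnapshotIndex`, and `errTimeout` are hypothetical stand-ins, and only the `indexCh` name comes from the commit message. The idea is that a sender which closes the channel on every exit path releases the receiver even when no index is ever sent, and the receiver then keeps its previous snapshot index.

```go
package main

import (
	"errors"
	"fmt"
)

// errTimeout is a hypothetical error standing in for an exceeded timeout.
var errTimeout = errors.New("timed out waiting for index")

// asyncWait is a hypothetical stand-in for the plan applier's async wait.
// It sends the applied index on indexCh only when the apply succeeded.
func asyncWait(applyErr error, index uint64, indexCh chan<- uint64) {
	// Closing the channel on every exit path means a receiver is never
	// left blocked, even if nothing is sent.
	defer close(indexCh)
	if applyErr != nil {
		return // failure path: no index is sent
	}
	indexCh <- index
}

// nextSnapshotIndex returns the new index if one arrived on indexCh;
// if the channel was closed without a value, it keeps the previous index.
func nextSnapshotIndex(prev uint64, indexCh <-chan uint64) uint64 {
	if idx, ok := <-indexCh; ok {
		return idx
	}
	return prev
}

func main() {
	okCh := make(chan uint64, 1)
	go asyncWait(nil, 42, okCh)
	fmt.Println(nextSnapshotIndex(10, okCh)) // prints 42

	failCh := make(chan uint64, 1)
	go asyncWait(errTimeout, 0, failCh)
	fmt.Println(nextSnapshotIndex(10, failCh)) // prints 10
}
```

Without the `defer close(indexCh)`, the second call to `nextSnapshotIndex` would block forever, which is the shape of the bug the commit message describes.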
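
The interception idea from the fourth commit (41c5318) might look roughly like the sketch below. It is deliberately simplified and makes several assumptions: the `LogApplier` signature, the `MessageType` constant, the `FSM` fields other than `interceptionAppliers`, and the `Apply` signature are all hypothetical, and a real `FSM.Apply` would decode the message type from a Raft log entry rather than take it as an argument. What it shows is the lookup order: an interception applier registered for a message type runs first, and it can inject a fault or call the normal applier directly.

```go
package main

import (
	"errors"
	"fmt"
)

// MessageType and LogApplier are hypothetical simplifications.
type MessageType uint8

type LogApplier func(buf []byte, index uint64) interface{}

// NodeRegister is a hypothetical message type used for illustration.
const NodeRegister MessageType = 1

var errInjected = errors.New("injected fault")

// FSM is a toy state machine that records which indexes were applied.
type FSM struct {
	appliers             map[MessageType]LogApplier
	interceptionAppliers map[MessageType]LogApplier
	applied              []uint64
}

func NewFSM() *FSM {
	f := &FSM{interceptionAppliers: map[MessageType]LogApplier{}}
	f.appliers = map[MessageType]LogApplier{NodeRegister: f.applyNodeRegister}
	return f
}

// applyNodeRegister is the normal applier: it records the index.
func (f *FSM) applyNodeRegister(buf []byte, index uint64) interface{} {
	f.applied = append(f.applied, index)
	return nil
}

// Apply prefers an interception applier when one is registered for the
// message type, and otherwise falls back to the normal applier.
func (f *FSM) Apply(msgType MessageType, buf []byte, index uint64) interface{} {
	if intercept, ok := f.interceptionAppliers[msgType]; ok {
		return intercept(buf, index)
	}
	if applier, ok := f.appliers[msgType]; ok {
		return applier(buf, index)
	}
	return fmt.Errorf("unknown message type %d", msgType)
}

func main() {
	fsm := NewFSM()
	// Inject a fault on even indexes; otherwise call the normal applier.
	fsm.interceptionAppliers[NodeRegister] = func(buf []byte, index uint64) interface{} {
		if index%2 == 0 {
			return errInjected
		}
		return fsm.applyNodeRegister(buf, index)
	}
	fmt.Println(fsm.Apply(NodeRegister, nil, 1)) // prints <nil>
	fmt.Println(fsm.Apply(NodeRegister, nil, 2)) // prints injected fault
	fmt.Println(fsm.applied)                     // prints [1]
}
```

Keeping the interceptors in a separate map means production code paths are unchanged when the map is empty, which is one plausible reason for the design the commit message describes.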