Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Atomic eval insertion with job (de-)registration #8435

Merged
merged 5 commits into from
Jul 15, 2020
Merged

Commits on Jul 14, 2020

  1. Atomic eval insertion with job (de-)registration

    This fixes a bug where jobs may get "stuck" unprocessed that
    dispropotionately affect periodic jobs around leadership transitions.
    When registering a job, the job registration and the eval to process it
    get applied to raft as two separate transactions; if the job
    registration succeeds but eval application fails, the job may remain
    unprocessed. Operators may detect such failure, when submitting a job
    update and get a 500 error code, and they could retry; periodic jobs
    failures are more likely to go unnoticed, and no further periodic
    invocations will be processed until an operator force evaluation.
    
    This fixes the issue by ensuring that the job registration and eval
    application get persisted and processed atomically in the same raft log
    entry.
    
    Also, applies the same change to ensure atomicity in job deregistration.
    
    Backward Compatibility
    
    We must maintain compatibility in two scenarios: mixed clusters where a
    leader can handle atomic updates but followers cannot, and a recent
    cluster processes old log entries from legacy or mixed cluster mode.
    
    To handle this constraints: ensure that the leader continue to emit the
    Evaluation log entry until all servers have upgraded; also, when
    processing raft logs, the servers honor evaluations found in both spots,
    the Eval in job (de-)registration and the eval update entries.
    
    When an updated server sees mix-mode behavior where an eval is inserted
    into the raft log twice, it ignores the second instance.
    
    I made one compromise in consistency in the mixed-mode scenario: servers
    may disagree on the eval.CreateIndex value: the leader and updated
    servers will report the job registration index while old servers will
    report the index of the eval update log entry. This discripency doesn't
    seem to be material - it's the eval.JobModifyIndex that matters.
    Mahmood Ali committed Jul 14, 2020
    Configuration menu
    Copy the full SHA
    97c69ee View commit details
    Browse the repository at this point in the history

Commits on Jul 15, 2020

  1. time.Now().UTC().UnixNano() -> time.Now().UnixNano()

    Mahmood Ali committed Jul 15, 2020
    Configuration menu
    Copy the full SHA
    6a082ad View commit details
    Browse the repository at this point in the history
  2. only set args.Eval after all servers upgrade

    We set the Eval field on job (de-)registration only after all servers
    get upgraded, to avoid dealing with duplicate evals.
    Mahmood Ali committed Jul 15, 2020
    Configuration menu
    Copy the full SHA
    a6a96c4 View commit details
    Browse the repository at this point in the history
  3. no need to handle duplicate evals anymore

    Mahmood Ali committed Jul 15, 2020
    Configuration menu
    Copy the full SHA
    921a42b View commit details
    Browse the repository at this point in the history
  4. comment compat concern in fsm.go

    Mahmood Ali committed Jul 15, 2020
    Configuration menu
    Copy the full SHA
    37ad947 View commit details
    Browse the repository at this point in the history