client: defensive against getting stale alloc updates #5906

Merged
merged 1 commit into from
Jul 2, 2019

Commits on Jun 29, 2019

  1. client: defensive against getting stale alloc updates

    When fetching a node's alloc assignments, be defensive against a stale read
    before killing the node's local allocs.
    
    The bug occurs when both the client and the servers are restarting: when the
    client requests the allocations for its node, it might get stale data because
    the server hasn't finished applying all of the restored Raft transactions to
    its state store.
    
    Consequently, the client would kill and destroy the alloc locally, only to
    fetch it again moments later once the server's state store is up to date.
    
    The bug can be reproduced quite reliably with a single-node setup (configured
    with persistence). I suspect it's too much of an edge case to occur in a
    production cluster with multiple servers, but we may need to examine leader
    failover scenarios more closely.
    
    In this commit, we only remove and destroy allocs if the removal index is more
    recent than the alloc index. This seems like a cheap resiliency fix, and it is
    the same check we already use for detecting alloc updates.
    
    A more proper fix would be to ensure that a Nomad server only serves RPC
    calls once its state store is fully restored, or is up to date in leadership
    transition cases.
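
One way to picture the server-side alternative is a readiness gate that RPC handlers consult before answering. The sketch below is purely illustrative and assumes a design the commit does not implement: `readinessGate`, `SetTarget`, `Applied`, `Check`, and `errNotReady` are hypothetical names, not part of Nomad or its Raft library.

```go
package main

import (
	"errors"
	"sync/atomic"
)

// errNotReady is a hypothetical error returned while the state store lags.
var errNotReady = errors.New("state store has not caught up yet")

// readinessGate is a hypothetical guard that tracks how far the state store
// has been restored relative to the index it must reach before serving reads.
type readinessGate struct {
	target  uint64 // index the store must reach (e.g. last restored log index)
	applied uint64 // highest index applied to the store so far
}

// SetTarget records the index that must be applied before RPCs are served.
func (g *readinessGate) SetTarget(index uint64) { atomic.StoreUint64(&g.target, index) }

// Applied records that the store has applied entries up to index.
func (g *readinessGate) Applied(index uint64) { atomic.StoreUint64(&g.applied, index) }

// Check returns errNotReady until the applied index reaches the target.
func (g *readinessGate) Check() error {
	if atomic.LoadUint64(&g.applied) < atomic.LoadUint64(&g.target) {
		return errNotReady
	}
	return nil
}
```

An RPC handler would call `Check` first and return the error (or block and retry) instead of answering from a partially restored store.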
    Mahmood Ali committed Jun 29, 2019
    2e1978e