disconnected clients: Unknown alloc gets marked for migrate on node drain #12469

Closed
DerekStrickland opened this issue Apr 5, 2022 · 3 comments

@DerekStrickland
Contributor

Nomad version

1.3.0-beta

Operating system and Environment details

1 Server
2 Clients

Issue

Marking a disconnected client for drain sets the DesiredTransition.Migrate field to true for every allocation on that client. As a result, the allocReconciler treats those allocations as migrations.
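The behavior described above can be pictured with a minimal Go sketch. Every type and function here (Node, Allocation, markForDrain) is a simplified, hypothetical stand-in written for this illustration and is not Nomad's actual code; only the DesiredTransition.Migrate field name comes from the report. The point it shows is that the node's status is never consulted when allocations are flagged.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for illustration only.
// These are not Nomad's real structs.
type NodeStatus string

const (
	NodeReady        NodeStatus = "ready"
	NodeDisconnected NodeStatus = "disconnected"
)

type Node struct {
	ID     string
	Status NodeStatus
}

type DesiredTransition struct {
	Migrate bool
}

type Allocation struct {
	ID                string
	NodeID            string
	DesiredTransition DesiredTransition
}

// markForDrain mimics the reported behavior: every allocation on the
// drained node is flagged for migration, and node.Status is ignored.
// It returns the number of allocations it flagged.
func markForDrain(node Node, allocs []*Allocation) int {
	marked := 0
	for _, a := range allocs {
		if a.NodeID == node.ID {
			a.DesiredTransition.Migrate = true
			marked++
		}
	}
	return marked
}

func main() {
	disconnected := Node{ID: "node-2", Status: NodeDisconnected}
	allocs := []*Allocation{
		{ID: "a1", NodeID: "node-1"},
		{ID: "a2", NodeID: "node-2"},
	}

	// The alloc on the disconnected node is flagged even though a
	// migration cannot proceed without connectivity to that node.
	n := markForDrain(disconnected, allocs)
	fmt.Println(n, allocs[0].DesiredTransition.Migrate, allocs[1].DesiredTransition.Migrate)
	// prints: 1 false true
}
```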

Reproduction steps

  • Create a job that ensures allocations are spread across multiple clients.
  • Once the allocations are running, simulate a network drop on one of the clients.
  • While that client is disconnected, run nomad node drain -enable -yes :id, where :id is the ID of the disconnected node.
  • Delete the job with nomad job stop -purge.
  • Rerun the job and observe the deployment output.
  • The Desired count will be off by the number of allocs on the disconnected clients. This is a symptom of those allocs being marked as Migrate.
    Deployed
    Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    cache       1        2       2        0          2022-04-05T20:19:42Z

Expected Result

The allocation should not be marked for migration since migrations require connectivity to the node.

Actual Result

Allocations are marked as Migrate.

Job file (if appropriate)

job "spread" {
  datacenters = ["dc1"]

  group "cache" {
    count = 2

    max_client_disconnect = "2m"
    
    spread {
      attribute = "${node.datacenter}"
    }

    network {
      port "db" {
        to = 6379
      }
    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"

        ports = ["db"]
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}
@DerekStrickland
Contributor Author

This issue was resolved by changing the order of filtering in filterByTainted so that a reconnecting alloc with DesiredStatus stop is detected before the filter for migrations runs. This is working as desired on main as of early yesterday morning.
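To make the ordering change concrete, here is a rough Go sketch of why check order matters. It is not the real filterByTainted: the Alloc struct, the Reconnecting flag, the classify function, and the bucket names are all hypothetical and exist only for this illustration. It assumes an alloc that is both reconnecting with DesiredStatus stop and flagged Migrate, and shows that whichever check runs first decides the outcome.

```go
package main

import "fmt"

// Alloc is a hypothetical, flattened stand-in for an allocation.
type Alloc struct {
	ID            string
	DesiredStatus string // "run" or "stop"
	Migrate       bool   // stands in for DesiredTransition.Migrate
	Reconnecting  bool   // assumed flag: the alloc's client has reconnected
}

// classify returns an illustrative bucket name for the alloc.
// stopFirst selects the check order: when true, the stop check for a
// reconnecting alloc runs before the migrate check.
func classify(a Alloc, stopFirst bool) string {
	stoppedReconnect := a.Reconnecting && a.DesiredStatus == "stop"

	if stopFirst && stoppedReconnect {
		return "stopped"
	}
	if a.Migrate {
		return "migrate"
	}
	if stoppedReconnect {
		return "stopped"
	}
	return "untainted"
}

func main() {
	a := Alloc{ID: "a2", DesiredStatus: "stop", Migrate: true, Reconnecting: true}

	// Migrate check first: the stop is never seen.
	fmt.Println(classify(a, false)) // prints: migrate

	// Stop check first: the alloc is no longer treated as a migration.
	fmt.Println(classify(a, true)) // prints: stopped
}
```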

@DerekStrickland
Contributor Author

Closed by #12476

@github-actions

github-actions bot commented Oct 9, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 9, 2022