
stop_after_client_disconnect causes incorrect alloc stops #11943

Open
tgross opened this issue Jan 26, 2022 · 4 comments
tgross (Member) commented Jan 26, 2022

While testing out a netsplit scenario for #11892 I ran into buggy behavior around the stop_after_client_disconnect field.

I have the following jobspec running on three clients (note that no CSI is involved here, to reduce variables):

jobspec
job "httpd" {
  datacenters = ["dc1"]

  group "web" {

    count = 3
    stop_after_client_disconnect = "10s"

    network {
      mode = "bridge"
      port "www" {
        to = 8001
      }
    }

    task "http" {

      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-v", "-f", "-p", "8001", "-h", "/local"]
        ports   = ["www"]
      }

      template {
        data        = "<html>hello, world</html>"
        destination = "local/index.html"
      }

      resources {
        cpu    = 128
        memory = 128
      }

    }
  }
}

Forcibly disconnect one of the clients by running sudo iptables -A OUTPUT -p tcp --dport 4647 -j DROP, which results in the expected RPC errors and the node eventually being marked as down:

2022-01-26T20:33:27.510Z [ERROR] client: error querying node allocations: error="rpc error: failed to get conn: dial tcp 192.168.56.20:4647: i/o timeout"
2022-01-26T20:33:27.510Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: failed to get conn: rpc error: lead thread didn't get connection" rpc=Node.UpdateStatus server=192.168.56.20:4647
2022-01-26T20:33:27.510Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: failed to get conn: rpc error: lead thread didn't get connection" rpc=Node.UpdateStatus server=192.168.56.20:4647
2022-01-26T20:33:27.510Z [ERROR] client: error heartbeating. retrying: error="failed to update status: rpc error: failed to get conn: rpc error: lead thread didn't get connection" period=16.919215002s

The allocation on that client is stopped, as we want, with the following log entry on the disconnected client:

2022-01-26T20:31:20.975Z [DEBUG] client: stopping alloc for stop_after_client_disconnect: alloc=63722979-1fc1-d683-e4dc-664ea93260d1

So far so good. But then a replacement allocation is started and, a few seconds later, almost immediately exits and is marked complete:

$ nomad job status httpd
ID            = httpd
Name          = httpd
Submit Date   = 2022-01-26T20:30:50Z
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
web         0       0         2        0       2         0

Latest Deployment
ID          = 2eb4e3fd
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
web         3        3       3        0          2022-01-26T20:41:02Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
172ed8d3  bcd3edfe  web         0        run      complete  36s ago    5s ago
63722979  39ec5cf0  web         0        run      complete  2m10s ago  1m35s ago
cfdce531  64551e34  web         0        run      running   2m10s ago  1m59s ago
e5644f89  64551e34  web         0        run      running   2m10s ago  1m59s ago

Alloc status is as follows (the alloc logs are immediately GC'd, which is also suspicious):

$ nomad alloc status 172ed8d3
ID                  = 172ed8d3-b044-6a21-9575-42a2ae4f0ffa
Eval ID             = 8c932c8d
Name                = httpd.web[1]
Node ID             = bcd3edfe
Node Name           = nomad-client1
Job ID              = httpd
Job Version         = 0
Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = run
Desired Description = <none>
Created             = 4m49s ago
Modified            = 4m18s ago

Allocation Addresses (mode = "bridge")
Label  Dynamic  Address
*www   yes      10.0.2.15:30147 -> 8001

Task "http" is "dead"
Task Resources
CPU        Memory          Disk     Addresses
0/128 MHz  48 KiB/128 MiB  300 MiB

Task Events:
Started At     = 2022-01-26T20:32:26Z
Finished At    = 2022-01-26T20:32:55Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2022-01-26T20:32:55Z  Killed      Task successfully killed
2022-01-26T20:32:55Z  Terminated  Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"
2022-01-26T20:32:50Z  Killing     Sent interrupt. Waiting 5s before force killing
2022-01-26T20:32:26Z  Started     Task started by client
2022-01-26T20:32:26Z  Task Setup  Building Task Directory
2022-01-26T20:32:25Z  Received    Task received by client

The client for this new allocation on the new node has the following log entry:

2022-01-26T20:32:50.432Z [DEBUG] client: stopping alloc for stop_after_client_disconnect: alloc=172ed8d3-b044-6a21-9575-42a2ae4f0ffa

Firing off new evals results in the same behavior:

  • If I run nomad job run again, I get a new allocation that also exits.
  • When I reconnect the client, a new eval fires and a new allocation is launched, but it meets the same fate and is marked complete immediately.

The status is always:

Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = run
Desired Description = <none>

My hypothesis is that the replacement allocations are being "tainted" in some way by the scheduler, and that taint is getting picked up by the client. But I don't have a theory as to the mechanism for that.
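To make the hypothesis concrete, here is a minimal sketch in plain Python (the names Alloc, disconnected_at, and should_stop are illustrative, not Nomad's actual code): if the client's stop decision keyed only off the group-level timeout plus a disconnect timestamp that survives onto the replacement allocation, a freshly placed replacement on a healthy node would inherit the stale deadline and be stopped almost immediately after starting — which matches the observed behavior.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class Alloc:
    """Illustrative stand-in for a Nomad allocation, not the real struct."""
    alloc_id: str
    stop_after_client_disconnect: Optional[timedelta]  # group-level setting
    disconnected_at: Optional[datetime]  # hypothetically inherited "taint"

def should_stop(alloc: Alloc, now: datetime) -> bool:
    """Hypothesized (buggy) client-side check: it never asks whether THIS
    client is disconnected, only whether a disconnect deadline has passed."""
    if alloc.stop_after_client_disconnect is None or alloc.disconnected_at is None:
        return False
    return now >= alloc.disconnected_at + alloc.stop_after_client_disconnect

timeout = timedelta(seconds=10)
disconnect = datetime(2022, 1, 26, 20, 31, 10)

# Original alloc on the disconnected client: stopped 10s later, as intended.
original = Alloc("63722979", timeout, disconnected_at=disconnect)

# Replacement placed on a healthy node ~1 minute later, but carrying the
# predecessor's disconnect timestamp: the stale deadline has long passed,
# so the alloc is stopped almost immediately after starting.
replacement = Alloc("172ed8d3", timeout, disconnected_at=disconnect)
now = disconnect + timedelta(minutes=1)
print(should_stop(original, now), should_stop(replacement, now))  # True True
```

Again, this is only a sketch of the shape such a bug could take, not a claim about where it actually lives in the client or scheduler code.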

full alloc API object
{
  "AllocModifyIndex": 351,
  "AllocatedResources": {
    "Shared": {
      "DiskMB": 300,
      "Networks": [
        {
          "CIDR": "",
          "DNS": null,
          "Device": "",
          "DynamicPorts": [
            {
              "HostNetwork": "default",
              "Label": "www",
              "To": 8001,
              "Value": 30262
            }
          ],
          "IP": "10.0.2.15",
          "MBits": 0,
          "Mode": "bridge",
          "ReservedPorts": null
        }
      ],
      "Ports": [
        {
          "HostIP": "10.0.2.15",
          "Label": "www",
          "To": 8001,
          "Value": 30262
        }
      ]
    },
    "TaskLifecycles": {
      "http": null
    },
    "Tasks": {
      "http": {
        "Cpu": {
          "CpuShares": 128,
          "ReservedCores": null
        },
        "Devices": null,
        "Memory": {
          "MemoryMB": 128,
          "MemoryMaxMB": 0
        },
        "Networks": null
      }
    }
  },
  "ClientDescription": "All tasks have completed",
  "ClientStatus": "complete",
  "CreateIndex": 351,
  "CreateTime": 1643229878250158600,
  "DeploymentStatus": {
    "Canary": false,
    "Healthy": true,
    "ModifyIndex": 360,
    "Timestamp": "2022-01-26T20:44:49.14186693Z"
  },
  "DesiredStatus": "run",
  "EvalID": "b5fc2caa-6edb-3319-8854-4085d54fd44c",
  "ID": "75fc16f0-3a6e-92f1-e5c8-11bbad7bb0ce",
  "Job": {
    "Affinities": null,
    "AllAtOnce": false,
    "Constraints": null,
    "ConsulNamespace": "",
    "ConsulToken": "",
    "CreateIndex": 293,
    "Datacenters": [
      "dc1"
    ],
    "DispatchIdempotencyToken": "",
    "Dispatched": false,
    "ID": "httpd",
    "JobModifyIndex": 293,
    "Meta": null,
    "ModifyIndex": 304,
    "Multiregion": null,
    "Name": "httpd",
    "Namespace": "default",
    "NomadTokenID": "c59b7c6c-afbd-2907-e909-3709dea629f8",
    "ParameterizedJob": null,
    "ParentID": "",
    "Payload": null,
    "Periodic": null,
    "Priority": 50,
    "Region": "global",
    "Spreads": null,
    "Stable": true,
    "Status": "running",
    "StatusDescription": "",
    "Stop": false,
    "SubmitTime": 1643229050631335200,
    "TaskGroups": [
      {
        "Affinities": null,
        "Constraints": null,
        "Consul": {
          "Namespace": ""
        },
        "Count": 3,
        "EphemeralDisk": {
          "Migrate": false,
          "SizeMB": 300,
          "Sticky": false
        },
        "Meta": null,
        "Migrate": {
          "HealthCheck": "checks",
          "HealthyDeadline": 300000000000,
          "MaxParallel": 1,
          "MinHealthyTime": 10000000000
        },
        "Name": "web",
        "Networks": [
          {
            "CIDR": "",
            "DNS": null,
            "Device": "",
            "DynamicPorts": [
              {
                "HostNetwork": "default",
                "Label": "www",
                "To": 8001,
                "Value": 0
              }
            ],
            "IP": "",
            "MBits": 0,
            "Mode": "bridge",
            "ReservedPorts": null
          }
        ],
        "ReschedulePolicy": {
          "Attempts": 0,
          "Delay": 30000000000,
          "DelayFunction": "exponential",
          "Interval": 0,
          "MaxDelay": 3600000000000,
          "Unlimited": true
        },
        "RestartPolicy": {
          "Attempts": 2,
          "Delay": 15000000000,
          "Interval": 1800000000000,
          "Mode": "fail"
        },
        "Scaling": null,
        "Services": null,
        "ShutdownDelay": null,
        "Spreads": null,
        "StopAfterClientDisconnect": 10000000000,
        "Tasks": [
          {
            "Affinities": null,
            "Artifacts": null,
            "CSIPluginConfig": null,
            "Config": {
              "ports": [
                "www"
              ],
              "args": [
                "-v",
                "-f",
                "-p",
                "8001",
                "-h",
                "/local"
              ],
              "command": "httpd",
              "image": "busybox:1"
            },
            "Constraints": null,
            "DispatchPayload": null,
            "Driver": "docker",
            "Env": null,
            "KillSignal": "",
            "KillTimeout": 5000000000,
            "Kind": "",
            "Leader": false,
            "Lifecycle": null,
            "LogConfig": {
              "MaxFileSizeMB": 10,
              "MaxFiles": 10
            },
            "Meta": null,
            "Name": "http",
            "Resources": {
              "CPU": 128,
              "Cores": 0,
              "Devices": null,
              "DiskMB": 0,
              "IOPS": 0,
              "MemoryMB": 128,
              "MemoryMaxMB": 0,
              "Networks": null
            },
            "RestartPolicy": {
              "Attempts": 2,
              "Delay": 15000000000,
              "Interval": 1800000000000,
              "Mode": "fail"
            },
            "ScalingPolicies": null,
            "Services": null,
            "ShutdownDelay": 0,
            "Templates": [
              {
                "ChangeMode": "restart",
                "ChangeSignal": "",
                "DestPath": "local/index.html",
                "EmbeddedTmpl": "<html>hello, world</html>",
                "Envvars": false,
                "LeftDelim": "{{",
                "Perms": "0644",
                "RightDelim": "}}",
                "SourcePath": "",
                "Splay": 5000000000,
                "VaultGrace": 0,
                "Wait": null
              }
            ],
            "User": "",
            "Vault": null,
            "VolumeMounts": null
          }
        ],
        "Update": {
          "AutoPromote": false,
          "AutoRevert": false,
          "Canary": 0,
          "HealthCheck": "checks",
          "HealthyDeadline": 300000000000,
          "MaxParallel": 1,
          "MinHealthyTime": 10000000000,
          "ProgressDeadline": 600000000000,
          "Stagger": 30000000000
        },
        "Volumes": null
      }
    ],
    "Type": "service",
    "Update": {
      "AutoPromote": false,
      "AutoRevert": false,
      "Canary": 0,
      "HealthCheck": "",
      "HealthyDeadline": 0,
      "MaxParallel": 1,
      "MinHealthyTime": 0,
      "ProgressDeadline": 0,
      "Stagger": 30000000000
    },
    "VaultNamespace": "",
    "VaultToken": "",
    "Version": 0
  },
  "JobID": "httpd",
  "Metrics": {
    "AllocationTime": 49016,
    "ClassExhausted": null,
    "ClassFiltered": null,
    "CoalescedFailures": 0,
    "ConstraintFiltered": null,
    "DimensionExhausted": null,
    "NodesAvailable": {
      "dc1": 3
    },
    "NodesEvaluated": 2,
    "NodesExhausted": 0,
    "NodesFiltered": 0,
    "QuotaExhausted": null,
    "ResourcesExhausted": null,
    "ScoreMetaData": [
      {
        "NodeID": "bcd3edfe-41bc-edbd-f388-bf38528c9146",
        "NormScore": 0.8059322314648598,
        "Scores": {
          "binpack": 0.8059322314648598,
          "job-anti-affinity": 0,
          "node-reschedule-penalty": 0,
          "node-affinity": 0
        }
      },
      {
        "NodeID": "39ec5cf0-3e25-f610-6164-2af984c203ad",
        "NormScore": 0.4304707020253908,
        "Scores": {
          "node-reschedule-penalty": 0,
          "node-affinity": 0,
          "binpack": 0.4304707020253908,
          "job-anti-affinity": 0
        }
      }
    ],
    "Scores": null
  },
  "ModifyIndex": 364,
  "ModifyTime": 1643229908619675600,
  "Name": "httpd.web[1]",
  "Namespace": "default",
  "NetworkStatus": {
    "Address": "172.26.64.30",
    "DNS": {
      "Options": [],
      "Searches": [],
      "Servers": []
    },
    "InterfaceName": "eth0"
  },
  "NodeID": "bcd3edfe-41bc-edbd-f388-bf38528c9146",
  "NodeName": "nomad-client1",
  "Resources": {
    "CPU": 128,
    "Cores": 0,
    "Devices": null,
    "DiskMB": 300,
    "IOPS": 0,
    "MemoryMB": 128,
    "MemoryMaxMB": 128,
    "Networks": [
      {
        "CIDR": "",
        "DNS": null,
        "Device": "",
        "DynamicPorts": [
          {
            "HostNetwork": "default",
            "Label": "www",
            "To": 8001,
            "Value": 30262
          }
        ],
        "IP": "10.0.2.15",
        "MBits": 0,
        "Mode": "bridge",
        "ReservedPorts": null
      }
    ]
  },
  "SharedResources": {
    "CPU": 0,
    "Cores": 0,
    "Devices": null,
    "DiskMB": 300,
    "IOPS": 0,
    "MemoryMB": 0,
    "MemoryMaxMB": 0,
    "Networks": [
      {
        "CIDR": "",
        "DNS": null,
        "Device": "",
        "DynamicPorts": [
          {
            "HostNetwork": "default",
            "Label": "www",
            "To": 8001,
            "Value": 30262
          }
        ],
        "IP": "10.0.2.15",
        "MBits": 0,
        "Mode": "bridge",
        "ReservedPorts": null
      }
    ]
  },
  "TaskGroup": "web",
  "TaskResources": {
    "http": {
      "CPU": 128,
      "Cores": 0,
      "Devices": null,
      "DiskMB": 0,
      "IOPS": 0,
      "MemoryMB": 128,
      "MemoryMaxMB": 0,
      "Networks": null
    }
  },
  "TaskStates": {
    "http": {
      "Events": [
        {
          "Details": {},
          "DiskLimit": 0,
          "DisplayMessage": "Task received by client",
          "DownloadError": "",
          "DriverError": "",
          "DriverMessage": "",
          "ExitCode": 0,
          "FailedSibling": "",
          "FailsTask": false,
          "GenericSource": "",
          "KillError": "",
          "KillReason": "",
          "KillTimeout": 0,
          "Message": "",
          "RestartReason": "",
          "SetupError": "",
          "Signal": 0,
          "StartDelay": 0,
          "TaskSignal": "",
          "TaskSignalReason": "",
          "Time": 1643229878253418500,
          "Type": "Received",
          "ValidationError": "",
          "VaultError": ""
        },
        {
          "Details": {
            "message": "Building Task Directory"
          },
          "DiskLimit": 0,
          "DisplayMessage": "Building Task Directory",
          "DownloadError": "",
          "DriverError": "",
          "DriverMessage": "",
          "ExitCode": 0,
          "FailedSibling": "",
          "FailsTask": false,
          "GenericSource": "",
          "KillError": "",
          "KillReason": "",
          "KillTimeout": 0,
          "Message": "Building Task Directory",
          "RestartReason": "",
          "SetupError": "",
          "Signal": 0,
          "StartDelay": 0,
          "TaskSignal": "",
          "TaskSignalReason": "",
          "Time": 1643229878985618200,
          "Type": "Task Setup",
          "ValidationError": "",
          "VaultError": ""
        },
        {
          "Details": {},
          "DiskLimit": 0,
          "DisplayMessage": "Task started by client",
          "DownloadError": "",
          "DriverError": "",
          "DriverMessage": "",
          "ExitCode": 0,
          "FailedSibling": "",
          "FailsTask": false,
          "GenericSource": "",
          "KillError": "",
          "KillReason": "",
          "KillTimeout": 0,
          "Message": "",
          "RestartReason": "",
          "SetupError": "",
          "Signal": 0,
          "StartDelay": 0,
          "TaskSignal": "",
          "TaskSignalReason": "",
          "Time": 1643229879140377600,
          "Type": "Started",
          "ValidationError": "",
          "VaultError": ""
        },
        {
          "Details": {
            "kill_timeout": "5s"
          },
          "DiskLimit": 0,
          "DisplayMessage": "Sent interrupt. Waiting 5s before force killing",
          "DownloadError": "",
          "DriverError": "",
          "DriverMessage": "",
          "ExitCode": 0,
          "FailedSibling": "",
          "FailsTask": false,
          "GenericSource": "",
          "KillError": "",
          "KillReason": "",
          "KillTimeout": 5000000000,
          "Message": "",
          "RestartReason": "",
          "SetupError": "",
          "Signal": 0,
          "StartDelay": 0,
          "TaskSignal": "",
          "TaskSignalReason": "",
          "Time": 1643229903258884400,
          "Type": "Killing",
          "ValidationError": "",
          "VaultError": ""
        },
        {
          "Details": {
            "exit_code": "137",
            "signal": "0",
            "oom_killed": "false",
            "exit_message": "Docker container exited with non-zero exit code: 137"
          },
          "DiskLimit": 0,
          "DisplayMessage": "Exit Code: 137, Exit Message: \"Docker container exited with non-zero exit code: 137\"",
          "DownloadError": "",
          "DriverError": "",
          "DriverMessage": "",
          "ExitCode": 137,
          "FailedSibling": "",
          "FailsTask": false,
          "GenericSource": "",
          "KillError": "",
          "KillReason": "",
          "KillTimeout": 0,
          "Message": "Docker container exited with non-zero exit code: 137",
          "RestartReason": "",
          "SetupError": "",
          "Signal": 0,
          "StartDelay": 0,
          "TaskSignal": "",
          "TaskSignalReason": "",
          "Time": 1643229908315548700,
          "Type": "Terminated",
          "ValidationError": "",
          "VaultError": ""
        },
        {
          "Details": {},
          "DiskLimit": 0,
          "DisplayMessage": "Task successfully killed",
          "DownloadError": "",
          "DriverError": "",
          "DriverMessage": "",
          "ExitCode": 0,
          "FailedSibling": "",
          "FailsTask": false,
          "GenericSource": "",
          "KillError": "",
          "KillReason": "",
          "KillTimeout": 0,
          "Message": "",
          "RestartReason": "",
          "SetupError": "",
          "Signal": 0,
          "StartDelay": 0,
          "TaskSignal": "",
          "TaskSignalReason": "",
          "Time": 1643229908321478100,
          "Type": "Killed",
          "ValidationError": "",
          "VaultError": ""
        }
      ],
      "Failed": false,
      "FinishedAt": "2022-01-26T20:45:08.321988108Z",
      "LastRestart": null,
      "Restarts": 0,
      "StartedAt": "2022-01-26T20:44:39.140379991Z",
      "State": "dead",
      "TaskHandle": null
    }
  }
}
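One detail worth noting when reading the dump above: the HTTP API serializes durations as integer nanoseconds, so the values line up with the jobspec and the task events. A quick sanity check in plain Python:

```python
NS_PER_SECOND = 1_000_000_000

# Values taken from the alloc API object above.
stop_after_client_disconnect = 10_000_000_000  # group-level setting
kill_timeout = 5_000_000_000                   # task-level default

print(stop_after_client_disconnect / NS_PER_SECOND)  # 10.0 -> "10s" in the jobspec
print(kill_timeout / NS_PER_SECOND)                  # 5.0  -> "Waiting 5s" in the Killing event
```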
@tgross tgross changed the title allocs rescheduled for stop_after_client_disconnect are also stopped stop_after_client_disconnect causes incorrect alloc stops Jan 26, 2022
tgross (Member, Author) commented Jan 26, 2022

On further testing, it looks like this happens whether or not the allocation is rescheduled. After some time, even on a stable cluster without any disconnects, all of the allocations eventually stop and are marked complete.

@tgross tgross self-assigned this Jan 26, 2022
@tgross tgross removed their assignment May 19, 2022
morphine1900 commented

Got the same issue. Is this solved?

jrasell (Member) commented Aug 4, 2023

Hi @morphine1900, it is not currently solved. Once a PR is raised to fix the issue, it will be linked here, and this issue closed once it has been merged.

morphine1900 commented

> Hi @morphine1900, it is not currently solved. Once a PR is raised to fix the issue, it will be linked here, and this issue closed once it has been merged.

Hi @jrasell, thanks for the reply. Is there any plan to fix it?
