Poststart task exiting causes alloc health to be marked as unhealthy incorrectly #12303

Closed
dadgar opened this issue Mar 15, 2022 · 2 comments

dadgar commented Mar 15, 2022

Nomad version

Nomad v1.2.6 (a6c6b47)

Operating system and Environment details

OSX 12.2

Issue

When a task group has a poststart task and that task exits, even with exit code 0, the allocation is marked as unhealthy and the main task receives this event:

"Task not running for min_healthy_time of 10s by deadline"

Reproduction steps

Run this job:

job "example" {
  datacenters = ["dc1"]

  group "cache" {
    network {
      port "db" {}
    }

    task "init" {
      lifecycle {
        hook = "poststart"
      }

      driver = "docker"
      config {
        image = "redis:3.2"
        args  = [ "sleep", "5" ]
      }
    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"
        ports = ["db"]
      }
    }
  }
}

Expected Result

The allocation should be marked healthy.

Actual Result

$ nomad alloc status 59
ID                  = 59d0528a-a011-c4fc-3016-e6475ef41165
Eval ID             = 8753cb28
Name                = example.cache[0]
Node ID             = 4e95ec5b
Node Name           = alexdadgar-YWFQ2L5CXR
Job ID              = example
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 15s ago
Modified            = 9s ago
Deployment ID       = 23d80bf8
Deployment Health   = unhealthy

Allocation Addresses
Label  Dynamic  Address
*db    yes      127.0.0.1:31454

Task "init" (poststart) is "dead"
Task Resources
CPU        Memory           Disk     Addresses
0/100 MHz  336 KiB/300 MiB  300 MiB

Task Events:
Started At     = 2022-03-15T18:35:53Z
Finished At    = 2022-03-15T18:35:58Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2022-03-15T11:35:58-07:00  Terminated  Exit Code: 0
2022-03-15T11:35:53-07:00  Started     Task started by client
2022-03-15T11:35:52-07:00  Task Setup  Building Task Directory
2022-03-15T11:35:52-07:00  Received    Task received by client

Task "redis" is "running"
Task Resources
CPU        Memory           Disk     Addresses
5/100 MHz  1.3 MiB/300 MiB  300 MiB

Task Events:
Started At     = 2022-03-15T18:35:52Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type             Description
2022-03-15T11:35:58-07:00  Alloc Unhealthy  Task not running for min_healthy_time of 10s by deadline
2022-03-15T11:35:52-07:00  Started          Task started by client
2022-03-15T11:35:52-07:00  Task Setup       Building Task Directory
2022-03-15T11:35:52-07:00  Received         Task received by client
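
The timestamps in this output already show that the unhealthy mark is premature. The following is a minimal Python sketch that uses only values copied from the output above, plus the 5 minute healthy deadline that appears as `HealthyDeadline` in the JSON output of this report. It checks how long `redis` had been running when it was marked unhealthy.

```python
from datetime import datetime, timedelta

# Values copied from the `nomad alloc status` output above.
redis_started = datetime.fromisoformat("2022-03-15T18:35:52+00:00")
init_finished = datetime.fromisoformat("2022-03-15T18:35:58+00:00")
marked_unhealthy = datetime.fromisoformat("2022-03-15T11:35:58-07:00")

min_healthy_time = timedelta(seconds=10)
healthy_deadline = timedelta(minutes=5)  # HealthyDeadline in the JSON output

# Aware datetimes are compared as instants, so the mixed offsets are fine.
running_for = marked_unhealthy - redis_started

print(f"redis running for {running_for.total_seconds():.0f}s when marked unhealthy")
print("min_healthy_time elapsed:", running_for >= min_healthy_time)
print("healthy_deadline elapsed:", running_for >= healthy_deadline)
print("coincides with init exit:", marked_unhealthy == init_finished)
```

At one-second resolution, `redis` was marked unhealthy 6s after it started, in the same second that `init` exited, and long before any deadline could have passed.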

$ nomad alloc status -json 59
{
    "AllocModifyIndex": 17,
    "AllocatedResources": {
        "Shared": {
            "DiskMB": 300,
            "Networks": [
                {
                    "CIDR": "",
                    "DNS": null,
                    "Device": "",
                    "DynamicPorts": [
                        {
                            "HostNetwork": "default",
                            "Label": "db",
                            "To": 0,
                            "Value": 31454
                        }
                    ],
                    "Hostname": "",
                    "IP": "127.0.0.1",
                    "MBits": 0,
                    "Mode": "",
                    "ReservedPorts": null
                }
            ],
            "Ports": [
                {
                    "HostIP": "127.0.0.1",
                    "Label": "db",
                    "To": 0,
                    "Value": 31454
                }
            ]
        },
        "Tasks": {
            "init": {
                "Cpu": {
                    "CpuShares": 100
                },
                "Devices": null,
                "Memory": {
                    "MemoryMB": 300,
                    "MemoryMaxMB": 0
                },
                "Networks": null
            },
            "redis": {
                "Cpu": {
                    "CpuShares": 100
                },
                "Devices": null,
                "Memory": {
                    "MemoryMB": 300,
                    "MemoryMaxMB": 0
                },
                "Networks": null
            }
        }
    },
    "ClientDescription": "Tasks are running",
    "ClientStatus": "running",
    "CreateIndex": 11,
    "CreateTime": 1647369352236687000,
    "DeploymentID": "23d80bf8-7797-a6a4-e065-8a8777f65475",
    "DeploymentStatus": {
        "Canary": false,
        "Healthy": false,
        "ModifyIndex": 16,
        "Timestamp": "2022-03-15T11:35:58.213659-07:00"
    },
    "DesiredDescription": "",
    "DesiredStatus": "run",
    "DesiredTransition": {
        "Migrate": null,
        "Reschedule": true
    },
    "EvalID": "8753cb28-d69f-0cd5-9266-30e15d4c3de0",
    "FollowupEvalID": "",
    "ID": "59d0528a-a011-c4fc-3016-e6475ef41165",
    "Job": {
        "Affinities": null,
        "AllAtOnce": false,
        "Constraints": null,
        "ConsulNamespace": "",
        "ConsulToken": "",
        "CreateIndex": 10,
        "Datacenters": [
            "dc1"
        ],
        "DispatchIdempotencyToken": "",
        "Dispatched": false,
        "ID": "example",
        "JobModifyIndex": 10,
        "Meta": null,
        "Migrate": null,
        "ModifyIndex": 10,
        "Multiregion": null,
        "Name": "example",
        "Namespace": "default",
        "NomadTokenID": "",
        "ParameterizedJob": null,
        "ParentID": "",
        "Payload": null,
        "Periodic": null,
        "Priority": 50,
        "Region": "global",
        "Reschedule": null,
        "Spreads": null,
        "Stable": false,
        "Status": "pending",
        "StatusDescription": "",
        "Stop": false,
        "SubmitTime": 1647369352213738000,
        "TaskGroups": [
            {
                "Affinities": null,
                "Constraints": null,
                "Consul": {
                    "Namespace": ""
                },
                "Count": 1,
                "EphemeralDisk": {
                    "Migrate": false,
                    "SizeMB": 300,
                    "Sticky": false
                },
                "Meta": null,
                "Migrate": {
                    "HealthCheck": "checks",
                    "HealthyDeadline": 300000000000,
                    "MaxParallel": 1,
                    "MinHealthyTime": 10000000000
                },
                "Name": "cache",
                "Networks": [
                    {
                        "CIDR": "",
                        "DNS": null,
                        "Device": "",
                        "DynamicPorts": [
                            {
                                "HostNetwork": "default",
                                "Label": "db",
                                "To": 0,
                                "Value": 0
                            }
                        ],
                        "Hostname": "",
                        "IP": "",
                        "MBits": 0,
                        "Mode": "",
                        "ReservedPorts": null
                    }
                ],
                "ReschedulePolicy": {
                    "Attempts": 0,
                    "Delay": 30000000000,
                    "DelayFunction": "exponential",
                    "Interval": 0,
                    "MaxDelay": 3600000000000,
                    "Unlimited": true
                },
                "RestartPolicy": {
                    "Attempts": 2,
                    "Delay": 15000000000,
                    "Interval": 1800000000000,
                    "Mode": "fail"
                },
                "Scaling": null,
                "Services": null,
                "ShutdownDelay": null,
                "Spreads": null,
                "StopAfterClientDisconnect": null,
                "Tasks": [
                    {
                        "Affinities": null,
                        "Artifacts": null,
                        "Config": {
                            "args": [
                                "sleep",
                                "5"
                            ],
                            "image": "redis:3.2"
                        },
                        "Constraints": null,
                        "DispatchPayload": null,
                        "Driver": "docker",
                        "Env": null,
                        "KillSignal": "",
                        "KillTimeout": 5000000000,
                        "Kind": "",
                        "Leader": false,
                        "Lifecycle": {
                            "Hook": "poststart",
                            "Sidecar": false
                        },
                        "LogConfig": {
                            "MaxFileSizeMB": 10,
                            "MaxFiles": 10
                        },
                        "Meta": null,
                        "Name": "init",
                        "Resources": {
                            "CPU": 100,
                            "Cores": 0,
                            "Devices": null,
                            "DiskMB": 0,
                            "IOPS": 0,
                            "MemoryMB": 300,
                            "MemoryMaxMB": 0,
                            "Networks": null
                        },
                        "RestartPolicy": {
                            "Attempts": 2,
                            "Delay": 15000000000,
                            "Interval": 1800000000000,
                            "Mode": "fail"
                        },
                        "ScalingPolicies": null,
                        "Services": null,
                        "ShutdownDelay": 0,
                        "Templates": null,
                        "User": "",
                        "Vault": null,
                        "VolumeMounts": null
                    },
                    {
                        "Affinities": null,
                        "Artifacts": null,
                        "Config": {
                            "image": "redis:3.2",
                            "ports": [
                                "db"
                            ]
                        },
                        "Constraints": null,
                        "DispatchPayload": null,
                        "Driver": "docker",
                        "Env": null,
                        "KillSignal": "",
                        "KillTimeout": 5000000000,
                        "Kind": "",
                        "Leader": false,
                        "Lifecycle": null,
                        "LogConfig": {
                            "MaxFileSizeMB": 10,
                            "MaxFiles": 10
                        },
                        "Meta": null,
                        "Name": "redis",
                        "Resources": {
                            "CPU": 100,
                            "Cores": 0,
                            "Devices": null,
                            "DiskMB": 0,
                            "IOPS": 0,
                            "MemoryMB": 300,
                            "MemoryMaxMB": 0,
                            "Networks": null
                        },
                        "RestartPolicy": {
                            "Attempts": 2,
                            "Delay": 15000000000,
                            "Interval": 1800000000000,
                            "Mode": "fail"
                        },
                        "ScalingPolicies": null,
                        "Services": null,
                        "ShutdownDelay": 0,
                        "Templates": null,
                        "User": "",
                        "Vault": null,
                        "VolumeMounts": null
                    }
                ],
                "Update": {
                    "AutoPromote": false,
                    "AutoRevert": false,
                    "Canary": 0,
                    "HealthCheck": "checks",
                    "HealthyDeadline": 300000000000,
                    "MaxParallel": 1,
                    "MinHealthyTime": 10000000000,
                    "ProgressDeadline": 600000000000,
                    "Stagger": 30000000000
                },
                "Volumes": null
            }
        ],
        "Type": "service",
        "Update": {
            "AutoPromote": false,
            "AutoRevert": false,
            "Canary": 0,
            "HealthCheck": "",
            "HealthyDeadline": 0,
            "MaxParallel": 1,
            "MinHealthyTime": 0,
            "ProgressDeadline": 0,
            "Stagger": 30000000000
        },
        "VaultNamespace": "",
        "VaultToken": "",
        "Version": 0
    },
    "JobID": "example",
    "Metrics": {
        "AllocationTime": 4944000,
        "ClassExhausted": null,
        "ClassFiltered": null,
        "CoalescedFailures": 0,
        "ConstraintFiltered": null,
        "DimensionExhausted": null,
        "NodesAvailable": {
            "dc1": 1
        },
        "NodesEvaluated": 1,
        "NodesExhausted": 0,
        "NodesFiltered": 0,
        "QuotaExhausted": null,
        "ResourcesExhausted": null,
        "ScoreMetaData": [
            {
                "NodeID": "4e95ec5b-2b8a-0e64-8eeb-c12790686ac3",
                "NormScore": 0.016893542950934384,
                "Scores": {
                    "node-reschedule-penalty": 0.0,
                    "node-affinity": 0.0,
                    "binpack": 0.016893542950934384,
                    "job-anti-affinity": 0.0
                }
            }
        ],
        "Scores": null
    },
    "ModifyIndex": 17,
    "ModifyTime": 1647369358368951000,
    "Name": "example.cache[0]",
    "Namespace": "default",
    "NextAllocation": "",
    "NodeID": "4e95ec5b-2b8a-0e64-8eeb-c12790686ac3",
    "NodeName": "alexdadgar-YWFQ2L5CXR",
    "PreemptedAllocations": null,
    "PreemptedByAllocation": "",
    "PreviousAllocation": "",
    "RescheduleTracker": null,
    "Resources": {
        "CPU": 200,
        "Cores": 0,
        "Devices": null,
        "DiskMB": 300,
        "IOPS": 0,
        "MemoryMB": 600,
        "MemoryMaxMB": 600,
        "Networks": [
            {
                "CIDR": "",
                "DNS": null,
                "Device": "",
                "DynamicPorts": [
                    {
                        "HostNetwork": "default",
                        "Label": "db",
                        "To": 0,
                        "Value": 31454
                    }
                ],
                "Hostname": "",
                "IP": "127.0.0.1",
                "MBits": 0,
                "Mode": "",
                "ReservedPorts": null
            }
        ]
    },
    "Services": null,
    "TaskGroup": "cache",
    "TaskResources": {
        "redis": {
            "CPU": 100,
            "Cores": 0,
            "Devices": null,
            "DiskMB": 0,
            "IOPS": 0,
            "MemoryMB": 300,
            "MemoryMaxMB": 0,
            "Networks": null
        },
        "init": {
            "CPU": 100,
            "Cores": 0,
            "Devices": null,
            "DiskMB": 0,
            "IOPS": 0,
            "MemoryMB": 300,
            "MemoryMaxMB": 0,
            "Networks": null
        }
    },
    "TaskStates": {
        "init": {
            "Events": [
                {
                    "Details": {},
                    "DiskLimit": 0,
                    "DiskSize": 0,
                    "DisplayMessage": "Task received by client",
                    "DownloadError": "",
                    "DriverError": "",
                    "DriverMessage": "",
                    "ExitCode": 0,
                    "FailedSibling": "",
                    "FailsTask": false,
                    "GenericSource": "",
                    "KillError": "",
                    "KillReason": "",
                    "KillTimeout": 0,
                    "Message": "",
                    "RestartReason": "",
                    "SetupError": "",
                    "Signal": 0,
                    "StartDelay": 0,
                    "TaskSignal": "",
                    "TaskSignalReason": "",
                    "Time": 1647369352255561000,
                    "Type": "Received",
                    "ValidationError": "",
                    "VaultError": ""
                },
                {
                    "Details": {
                        "message": "Building Task Directory"
                    },
                    "DiskLimit": 0,
                    "DiskSize": 0,
                    "DisplayMessage": "Building Task Directory",
                    "DownloadError": "",
                    "DriverError": "",
                    "DriverMessage": "",
                    "ExitCode": 0,
                    "FailedSibling": "",
                    "FailsTask": false,
                    "GenericSource": "",
                    "KillError": "",
                    "KillReason": "",
                    "KillTimeout": 0,
                    "Message": "Building Task Directory",
                    "RestartReason": "",
                    "SetupError": "",
                    "Signal": 0,
                    "StartDelay": 0,
                    "TaskSignal": "",
                    "TaskSignalReason": "",
                    "Time": 1647369352770316000,
                    "Type": "Task Setup",
                    "ValidationError": "",
                    "VaultError": ""
                },
                {
                    "Details": {},
                    "DiskLimit": 0,
                    "DiskSize": 0,
                    "DisplayMessage": "Task started by client",
                    "DownloadError": "",
                    "DriverError": "",
                    "DriverMessage": "",
                    "ExitCode": 0,
                    "FailedSibling": "",
                    "FailsTask": false,
                    "GenericSource": "",
                    "KillError": "",
                    "KillReason": "",
                    "KillTimeout": 0,
                    "Message": "",
                    "RestartReason": "",
                    "SetupError": "",
                    "Signal": 0,
                    "StartDelay": 0,
                    "TaskSignal": "",
                    "TaskSignalReason": "",
                    "Time": 1647369353174229000,
                    "Type": "Started",
                    "ValidationError": "",
                    "VaultError": ""
                },
                {
                    "Details": {
                        "oom_killed": "false",
                        "exit_code": "0",
                        "signal": "0"
                    },
                    "DiskLimit": 0,
                    "DiskSize": 0,
                    "DisplayMessage": "Exit Code: 0",
                    "DownloadError": "",
                    "DriverError": "",
                    "DriverMessage": "",
                    "ExitCode": 0,
                    "FailedSibling": "",
                    "FailsTask": false,
                    "GenericSource": "",
                    "KillError": "",
                    "KillReason": "",
                    "KillTimeout": 0,
                    "Message": "",
                    "RestartReason": "",
                    "SetupError": "",
                    "Signal": 0,
                    "StartDelay": 0,
                    "TaskSignal": "",
                    "TaskSignalReason": "",
                    "Time": 1647369358192845000,
                    "Type": "Terminated",
                    "ValidationError": "",
                    "VaultError": ""
                }
            ],
            "Failed": false,
            "FinishedAt": "2022-03-15T18:35:58.212949Z",
            "LastRestart": null,
            "Restarts": 0,
            "StartedAt": "2022-03-15T18:35:53.174239Z",
            "State": "dead",
            "TaskHandle": null
        },
        "redis": {
            "Events": [
                {
                    "Details": {},
                    "DiskLimit": 0,
                    "DiskSize": 0,
                    "DisplayMessage": "Task received by client",
                    "DownloadError": "",
                    "DriverError": "",
                    "DriverMessage": "",
                    "ExitCode": 0,
                    "FailedSibling": "",
                    "FailsTask": false,
                    "GenericSource": "",
                    "KillError": "",
                    "KillReason": "",
                    "KillTimeout": 0,
                    "Message": "",
                    "RestartReason": "",
                    "SetupError": "",
                    "Signal": 0,
                    "StartDelay": 0,
                    "TaskSignal": "",
                    "TaskSignalReason": "",
                    "Time": 1647369352256061000,
                    "Type": "Received",
                    "ValidationError": "",
                    "VaultError": ""
                },
                {
                    "Details": {
                        "message": "Building Task Directory"
                    },
                    "DiskLimit": 0,
                    "DiskSize": 0,
                    "DisplayMessage": "Building Task Directory",
                    "DownloadError": "",
                    "DriverError": "",
                    "DriverMessage": "",
                    "ExitCode": 0,
                    "FailedSibling": "",
                    "FailsTask": false,
                    "GenericSource": "",
                    "KillError": "",
                    "KillReason": "",
                    "KillTimeout": 0,
                    "Message": "Building Task Directory",
                    "RestartReason": "",
                    "SetupError": "",
                    "Signal": 0,
                    "StartDelay": 0,
                    "TaskSignal": "",
                    "TaskSignalReason": "",
                    "Time": 1647369352259879000,
                    "Type": "Task Setup",
                    "ValidationError": "",
                    "VaultError": ""
                },
                {
                    "Details": {},
                    "DiskLimit": 0,
                    "DiskSize": 0,
                    "DisplayMessage": "Task started by client",
                    "DownloadError": "",
                    "DriverError": "",
                    "DriverMessage": "",
                    "ExitCode": 0,
                    "FailedSibling": "",
                    "FailsTask": false,
                    "GenericSource": "",
                    "KillError": "",
                    "KillReason": "",
                    "KillTimeout": 0,
                    "Message": "",
                    "RestartReason": "",
                    "SetupError": "",
                    "Signal": 0,
                    "StartDelay": 0,
                    "TaskSignal": "",
                    "TaskSignalReason": "",
                    "Time": 1647369352769667000,
                    "Type": "Started",
                    "ValidationError": "",
                    "VaultError": ""
                },
                {
                    "Details": {
                        "message": "Task not running for min_healthy_time of 10s by deadline"
                    },
                    "DiskLimit": 0,
                    "DiskSize": 0,
                    "DisplayMessage": "Task not running for min_healthy_time of 10s by deadline",
                    "DownloadError": "",
                    "DriverError": "",
                    "DriverMessage": "",
                    "ExitCode": 0,
                    "FailedSibling": "",
                    "FailsTask": false,
                    "GenericSource": "",
                    "KillError": "",
                    "KillReason": "",
                    "KillTimeout": 0,
                    "Message": "Task not running for min_healthy_time of 10s by deadline",
                    "RestartReason": "",
                    "SetupError": "",
                    "Signal": 0,
                    "StartDelay": 0,
                    "TaskSignal": "",
                    "TaskSignalReason": "",
                    "Time": 1647369358213554000,
                    "Type": "Alloc Unhealthy",
                    "ValidationError": "",
                    "VaultError": ""
                }
            ],
            "Failed": false,
            "FinishedAt": null,
            "LastRestart": null,
            "Restarts": 0,
            "StartedAt": "2022-03-15T18:35:52.769842Z",
            "State": "running",
            "TaskHandle": null
        }
    }
}
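
Most of the JSON above is not needed to see the bug. The sketch below is illustrative only: it embeds a trimmed copy of the relevant fields (event lists shortened, values unchanged) and prints each task's state along with the gap between the `init` exit and the `Alloc Unhealthy` event. The helper name `last_event` is made up for this example.

```python
import json

# Trimmed copy of the JSON above; only the fields the summary reads are kept.
alloc_json = """
{
  "DeploymentStatus": {"Healthy": false},
  "TaskStates": {
    "init": {
      "State": "dead",
      "Failed": false,
      "Events": [
        {"Type": "Started", "Time": 1647369353174229000},
        {"Type": "Terminated", "Time": 1647369358192845000}
      ]
    },
    "redis": {
      "State": "running",
      "Failed": false,
      "Events": [
        {"Type": "Started", "Time": 1647369352769667000},
        {"Type": "Alloc Unhealthy", "Time": 1647369358213554000}
      ]
    }
  }
}
"""


def last_event(alloc, task, event_type):
    """Return the Time (ns since epoch) of the last matching event, or None."""
    events = alloc["TaskStates"][task]["Events"]
    times = [e["Time"] for e in events if e["Type"] == event_type]
    return times[-1] if times else None


alloc = json.loads(alloc_json)
init_exit = last_event(alloc, "init", "Terminated")
unhealthy = last_event(alloc, "redis", "Alloc Unhealthy")
gap_ms = (unhealthy - init_exit) / 1e6  # nanoseconds to milliseconds

for name, state in alloc["TaskStates"].items():
    print(f"{name}: state={state['State']} failed={state['Failed']}")
print(f"deployment healthy: {alloc['DeploymentStatus']['Healthy']}")
print(f"unhealthy event follows init exit by {gap_ms:.1f} ms")
```

Neither task has `Failed` set, yet the deployment is unhealthy, and the unhealthy event lands about 20.7 ms after `init` terminates. To run the same summary against a live allocation, replace `alloc_json` with the output of `nomad alloc status -json`.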

tgross commented Mar 15, 2022

Fixed in #11945, which will ship in Nomad 1.3.0 (with backports)
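
The comment does not describe the change itself. As an illustration only, and not the code from #11945, the behavior this issue expects can be sketched as a predicate in which a non-sidecar poststart task that exits cleanly does not count against allocation health. The `Task` class and `blocks_health` function are hypothetical names, not Nomad APIs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    """Hypothetical, simplified view of one entry in TaskStates."""
    name: str
    state: str                  # "pending", "running", or "dead"
    failed: bool = False
    hook: Optional[str] = None  # lifecycle hook, e.g. "poststart"; None for main tasks
    sidecar: bool = False


def blocks_health(task: Task) -> bool:
    """True if this task should prevent the allocation from becoming healthy."""
    if task.failed:
        return True
    if task.state == "running":
        return False
    # Assumption: an ephemeral (non-sidecar) poststart task is expected to
    # exit, so a clean exit is not a health problem.
    if task.state == "dead" and task.hook == "poststart" and not task.sidecar:
        return False
    return True


# The two tasks from this report, in the states shown by `nomad alloc status`.
tasks = [
    Task("init", state="dead", hook="poststart"),
    Task("redis", state="running"),
]
print(any(blocks_health(t) for t in tasks))  # False
```

The sketch leaves out `min_healthy_time` and service checks. It only shows that a dead main task or a failed poststart task should still block health, while the clean `init` exit in this report should not.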

@tgross tgross closed this as completed Mar 15, 2022
@tgross tgross added this to the 1.3.0 milestone Mar 15, 2022
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 10, 2022