aws-asg target: Panic while scaling in nodes #506

Closed
josegonzalez opened this issue Jul 1, 2021 · 3 comments · Fixed by #508

josegonzalez commented Jul 1, 2021

Using nomad-autoscaler v0.3.3 OSS, we get the following panic (retrieved from our Datadog logs, so the lines may be out of order if the ordering looks funny):

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xd38732]
goroutine 90 [running]:
github.com/hashicorp/nomad-autoscaler/sdk/helper/scaleutils/nodeselector.computeNodeTotalResources(...)
/home/circleci/project/project/sdk/helper/scaleutils/nodeselector/least_busy.go:130
github.com/hashicorp/nomad-autoscaler/sdk/helper/scaleutils/nodeselector.(*leastBusyClusterScaleInNodeSelector).computeNodeResources(0xc000c4e348, 0xc000b1e540, 0x8, 0xc000b34300, 0xc000646270)
/home/circleci/project/project/sdk/helper/scaleutils/nodeselector/least_busy.go:87 +0x112
github.com/hashicorp/nomad-autoscaler/sdk/helper/scaleutils/nodeselector.(*leastBusyClusterScaleInNodeSelector).Select(0xc000c4e348, 0xc000b342c0, 0x8, 0x8, 0x1, 0x2, 0x0, 0x0)
/home/circleci/project/project/sdk/helper/scaleutils/nodeselector/least_busy.go:55 +0xaa
github.com/hashicorp/nomad-autoscaler/sdk/helper/scaleutils.(*ClusterScaleUtils).IdentifyScaleInNodes(0xc0005f6f00, 0xc000e78450, 0x1, 0x40db9b, 0xc000bf9310, 0x8, 0x8, 0x17e8320)
/home/circleci/project/project/sdk/helper/scaleutils/cluster.go:140 +0x5f8
github.com/hashicorp/nomad-autoscaler/plugins/builtin/target/aws-asg/plugin.(*TargetPlugin).scaleIn(0xc0006462a0, 0x1cd04c8, 0xc00003c090, 0xc000cee160, 0x1, 0xc000e78450, 0x0, 0x0)
/home/circleci/project/project/plugins/builtin/target/aws-asg/plugin/aws.go:106 +0x6c
github.com/hashicorp/nomad-autoscaler/plugins/builtin/target/aws-asg/plugin.(*TargetPlugin).Scale(0xc0006462a0, 0x7, 0xc000118780, 0x27, 0xff00, 0xc0010f0630, 0xc000e78450, 0x7f74047277a8, 0xc0012cbc08)
/home/circleci/project/project/plugins/builtin/target/aws-asg/plugin/plugin.go:125 +0x399
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).runTargetScale(0xc00008d9f0, 0x7f740472d010, 0xc0006462a0, 0xc000544770, 0x7, 0xc000118780, 0x27, 0xff00, 0xc0010f0630, 0x0, ...)
/home/circleci/project/project/policyeval/base_worker.go:249 +0x23c
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).handlePolicy(0xc00008d9f0, 0x1cd0490, 0xc000284bc0, 0xc000a8c050, 0x0, 0x0)
/home/circleci/project/project/policyeval/base_worker.go:215 +0xf4d
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00008d9f0, 0x1cd0490, 0xc000284bc0)
/home/circleci/project/project/policyeval/base_worker.go:76 +0x2d9
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
/home/circleci/project/project/agent/agent.go:130 +0x42e
2021/07/01 03:00:07.983048 [ERR] (cli) child process died with exit code 2

Running against Nomad 0.12.11+ent.

Happy to provide any extra details that might be useful for debugging.
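For what it's worth, the panic looks like a plain nil pointer dereference on a node list stub. A trivial standalone sketch of the same failure pattern (the NodeResources access is my guess at what the least_busy selector reads, not code taken from the autoscaler):

package main

import "github.com/hashicorp/nomad/api"

func main() {
    // Depending on Nomad version and query parameters, a node list stub can
    // come back without resource details, leaving NodeResources nil
    // (an assumption on my part, based on the trace above).
    stub := &api.NodeListStub{ID: "example-node", Status: "down"}

    // Dereferencing the nil pointer panics with the same
    // "invalid memory address or nil pointer dereference" signal.
    _ = *stub.NodeResources
}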

josegonzalez commented Jul 1, 2021

Here is a better log capture (from an event I just saw happen):

2021-07-01T19:25:10.946Z [DEBUG] internal_plugin.aws-asg: performing node pool filtering: node_class=applications
2021-07-01T19:25:10.950Z [DEBUG] internal_plugin.aws-asg: performing node selection: selector_strategy=least_busy
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xd38732]

goroutine 81 [running]:
github.com/hashicorp/nomad-autoscaler/sdk/helper/scaleutils/nodeselector.computeNodeTotalResources(...)
	/home/circleci/project/project/sdk/helper/scaleutils/nodeselector/least_busy.go:130
github.com/hashicorp/nomad-autoscaler/sdk/helper/scaleutils/nodeselector.(*leastBusyClusterScaleInNodeSelector).computeNodeResources(0xc000eba540, 0xc00031ae40, 0x2, 0xc000679df0, 0xc0005954d0)
	/home/circleci/project/project/sdk/helper/scaleutils/nodeselector/least_busy.go:87 +0x112
github.com/hashicorp/nomad-autoscaler/sdk/helper/scaleutils/nodeselector.(*leastBusyClusterScaleInNodeSelector).Select(0xc000eba540, 0xc000679dd0, 0x2, 0x2, 0x1, 0x2, 0x0, 0x0)
	/home/circleci/project/project/sdk/helper/scaleutils/nodeselector/least_busy.go:55 +0xaa
github.com/hashicorp/nomad-autoscaler/sdk/helper/scaleutils.(*ClusterScaleUtils).IdentifyScaleInNodes(0xc000210840, 0xc000ad7140, 0x1, 0x40db9b, 0xc000c1a9b0, 0x8, 0x8, 0x17e8320)
	/home/circleci/project/project/sdk/helper/scaleutils/cluster.go:140 +0x5f8
github.com/hashicorp/nomad-autoscaler/plugins/builtin/target/aws-asg/plugin.(*TargetPlugin).scaleIn(0xc000595500, 0x1cd04c8, 0xc00003a088, 0xc000d57b80, 0x1, 0xc000ad7140, 0x0, 0x0)
	/home/circleci/project/project/plugins/builtin/target/aws-asg/plugin/aws.go:106 +0x6c
github.com/hashicorp/nomad-autoscaler/plugins/builtin/target/aws-asg/plugin.(*TargetPlugin).Scale(0xc000595500, 0x1, 0xc000ca9c80, 0x27, 0xff00, 0xc000e8c2a0, 0xc000ad7140, 0x0, 0xc000f21c08)
	/home/circleci/project/project/plugins/builtin/target/aws-asg/plugin/plugin.go:125 +0x399
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).runTargetScale(0xc0001015e0, 0x7fb6ff4b3af0, 0xc000595500, 0xc0005ac770, 0x1, 0xc000ca9c80, 0x27, 0xff00, 0xc000e8c2a0, 0x0, ...)
	/home/circleci/project/project/policyeval/base_worker.go:249 +0x23c
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).handlePolicy(0xc0001015e0, 0x1cd0490, 0xc0000a2d80, 0xc000ac0000, 0x0, 0x0)
	/home/circleci/project/project/policyeval/base_worker.go:215 +0xf4d
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc0001015e0, 0x1cd0490, 0xc0000a2d80)
	/home/circleci/project/project/policyeval/base_worker.go:76 +0x2d9
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
	/home/circleci/project/project/agent/agent.go:130 +0x42e
2021/07/01 19:25:10.964401 [ERR] (cli) child process died with exit code 2


jrasell commented Jul 2, 2021

Hi @josegonzalez, and thanks for this report. I believe I know the cause after looking at the stack output, so I will hopefully dig into a fix today. In the meantime, it would be great to get the version information of the Nomad client(s) that are being processed by the node selector. If you're also able to share what an entry from the node list API response looks like for a node in the target pool, that would be very helpful.
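If it helps, something along these lines will print exactly what the node list endpoint (GET /v1/nodes) returns for each node, using the official Go API client; a plain curl against the same endpoint works just as well. This is only a sketch, so adjust the address and ACL settings to your environment:

package main

import (
    "encoding/json"
    "fmt"
    "log"

    "github.com/hashicorp/nomad/api"
)

func main() {
    // Uses NOMAD_ADDR / NOMAD_TOKEN from the environment, like the Nomad CLI.
    client, err := api.NewClient(api.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }

    // Fetch the node list stubs that the autoscaler's node selector works from.
    stubs, _, err := client.Nodes().List(nil)
    if err != nil {
        log.Fatal(err)
    }

    // Print each stub as indented JSON so the full entry can be shared.
    for _, stub := range stubs {
        out, err := json.MarshalIndent(stub, "", "    ")
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(string(out))
    }
}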

jrasell self-assigned this Jul 2, 2021
josegonzalez commented

We're using the same version (Nomad 0.12.11+ent) everywhere. Here is the first node from that list (it's ineligible; not sure if that matters).

{
    "Address": "10.2.99.133",
    "ID": "023358e3-adec-c68f-441b-75dc43fed9ed",
    "Datacenter": "us-east-1",
    "Name": "high-cpu-applications-i-08c9d5ea6ea73b1f7",
    "NodeClass": "high-cpu-applications",
    "Version": "0.12.11+ent",
    "Drain": false,
    "SchedulingEligibility": "ineligible",
    "Status": "down",
    "StatusDescription": "",
    "Drivers": {
        "raw_exec": {
            "Attributes": {
                "driver.raw_exec": "true"
            },
            "Detected": true,
            "Healthy": true,
            "HealthDescription": "Healthy",
            "UpdateTime": "2021-07-01T17:03:40.201671603Z"
        },
        "java": {
            "Attributes": null,
            "Detected": false,
            "Healthy": false,
            "HealthDescription": "",
            "UpdateTime": "2021-07-01T17:03:40.201730824Z"
        },
        "exec": {
            "Attributes": {
                "driver.exec": "true"
            },
            "Detected": true,
            "Healthy": true,
            "HealthDescription": "Healthy",
            "UpdateTime": "2021-07-01T17:03:40.202224437Z"
        },
        "docker": {
            "Attributes": {
                "driver.docker.os_type": "linux",
                "driver.docker": "true",
                "driver.docker.version": "20.10.3",
                "driver.docker.volumes.enabled": "true",
                "driver.docker.runtimes": "io.containerd.runc.v2,io.containerd.runtime.v1.linux,runc"
            },
            "Detected": true,
            "Healthy": true,
            "HealthDescription": "Healthy",
            "UpdateTime": "2021-07-01T17:03:40.219787043Z"
        },
        "qemu": {
            "Attributes": null,
            "Detected": false,
            "Healthy": false,
            "HealthDescription": "",
            "UpdateTime": "2021-07-01T17:03:40.201629192Z"
        }
    },
    "HostVolumes": null,
    "CreateIndex": 113830814,
    "ModifyIndex": 113889153
}

Happy to provide additional details via a HashiCorp support ticket if that helps.
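One thing that stands out: the entry above has no NodeResources or ReservedResources fields at all, which would line up with a nil dereference in computeNodeTotalResources if that function reads them without a guard (my assumption; I haven't checked the selector source). Something along these lines inside the selector would skip such nodes instead of crashing (illustrative only, not the autoscaler's actual code):

// Hypothetical guard, not the autoscaler's actual implementation: skip node
// list stubs that were returned without resource details rather than
// dereferencing a nil pointer further down.
if node.NodeResources == nil || node.ReservedResources == nil {
    continue
}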
