ECS Probe - SIGSEGV: segmentation violation #2508
Thanks for the report @pecigonzalo. What version of Scope are you running? |
@2opremio thanks for the quick reply |
Thanks, I will take a look. Hopefully it will be an easy one to fix :) |
One more piece of info: this is running on Amazon Linux: |
I am almost certain that this was caused by the ARN of a task being @ekimekim Any idea of how this could happen? I vaguely recall discussing #2026 and you mentioning that the API guaranteed no task pointers could be null. To start with, should the initialization happen if there's a failure in the loop above? @pecigonzalo Would you mind sending us the full logs? In particular, do you see errors like |
It might be possible that in the event of a partial failure, empty tasks can be returned? That seems odd to me, but the documentation isn't very specific so it's a possibility. |
@pecigonzalo I have merged #2514 with the hope of avoiding the crash. Would you mind running Even if it does fix the crashes, there must be a deeper problem causing the null values to be obtained, so please send us the logs anyhow. |
@2opremio I don't see the log line you mentioned, but I'll run this again and let you know what I can find, and then run with latest, thanks |
@2opremio OK, I just ran some tests again; unfortunately, the crash is not happening anymore with I'll keep I'm getting a lot of
for several tasks |
Yep, probably this happens when tasks die/end. I bet |
I believe it's true, as I was getting more tasks listed than I actually had. Unfortunately I don't have the full log; if I can replicate it I'll post here. |
Please do, thanks. |
OK, I just had this again @2opremio
After this, I ran with |
OK, good to know. We will be releasing |
This is still happening, now I believe due to the AWS API rate limit, maybe related to #2050.
But it also makes it so that Scope reports the node in |
I was at a loss as to how the code can panic in that place:

    for _, task := range resp.Tasks {
        if task.TaskArn != nil {
            c.taskCache.Set(*task.TaskArn, newECSTask(task)) // panics here
        }
    }

though, I've just noticed that the addr in the SIGSEGV is not 0x0 but 0x8. |
is |
maybe, but the code should defend against that. |
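The `addr=0x8` detail supports that theory: dereferencing a field of a nil struct pointer faults at that field's offset, not at 0x0, so the slice may have contained a nil task pointer rather than a task with a nil ARN. A minimal sketch below (the `task` type and `safeArns` helper are invented here as a stand-in for the SDK's `ecs.Task`, not Scope's actual code) shows the defensive check that would avoid the crash:

```go
package main

import "fmt"

// task stands in for the AWS SDK's ecs.Task: pointer fields, returned in
// slices of *task that may, per the theory above, contain nil entries.
type task struct {
	ClusterArn *string // field at offset 0 on 64-bit
	TaskArn    *string // field at offset 8 on 64-bit
}

// safeArns collects ARNs while guarding against both a nil task pointer
// and a nil TaskArn field. Checking only `t.TaskArn != nil` is not enough:
// if t itself is nil, reading t.TaskArn dereferences address 0x8, which
// matches the addr reported in the SIGSEGV.
func safeArns(tasks []*task) []string {
	var arns []string
	for _, t := range tasks {
		if t == nil || t.TaskArn == nil {
			continue
		}
		arns = append(arns, *t.TaskArn)
	}
	return arns
}

func main() {
	arn := "arn:aws:ecs:us-east-1:123456789012:task/example"
	tasks := []*task{nil, {TaskArn: &arn}} // one nil entry, one valid task
	fmt.Println(safeArns(tasks))
}
```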
@rade This seems to still be an issue on:
And it's causing crazy high CPU usage. Maybe affects #1457 |
@pecigonzalo are you sure you are running scope 1.6.2? |
That is my launch command |
@pecigonzalo Please look for a line in the logs saying |
OK. I am at a complete loss as to what is going on here. I cannot think of any way that https://github.com/weaveworks/scope/blob/v1.6.2/probe/awsecs/client.go#L244 can explode with that error. @pecigonzalo can you? Also, do you think you could put together a minimal example / set of instructions that we could follow to reproduce the problem? |
I'm taking a wild shot, but this seems to coincide with the rate limit in many cases; could it be that |
I'm trying to find a way to trigger the issue, so we can reproduce it consistently, but my Go is a bit rusty for such a big project. Some questions to bounce around:
As a side note from my findings, I think we are polling ECS too often. In my experience, if you have many failed tasks that keep being polled from ECS instead of the cache, that is a lot of calls, multiplied by the number of nodes in the cluster. |
It's not getting that far. Unless the stack trace is incomplete. It's an interesting theory.
Because
Yes, but I would rather not ship a new scope release just to add some debug statements. Can you build scope yourself? |
Yeah, I built it with that and it's running manually; let's see if I hit it. Regarding |
Do you have evidence of undesirable behaviour, such as excessive CPU usage or something like that? Mind filing a separate issue? |
Not really for the rate-limit comment; the CPU usage I experience from this issue is due to the SIGSEGV causing the thread/probe to restart constantly. Filed under: #2844 |
A similar but different stack trace from probe 1.6.4:
given the protection against pointer access on that line, I would have to suspect the fault is actually in |
@bboreham I have the same problem that you reported, but I'm using

<probe> INFO: 2017/11/01 19:29:19.092470 command line args: --mode=probe --probe.docker=true --probe.ecs=true --probe.ecs.cluster.region=us-east-1 --service-token=<elided>
<probe> INFO: 2017/11/01 19:29:19.092508 probe starting, version 1.6.5, ID 6634efb2e53a7246
<probe> INFO: 2017/11/01 19:29:19.613257 Control connection to cloud.weave.works starting
<probe> INFO: 2017/11/01 19:29:19.663189 Success collecting weave status
<probe> ERRO: 2017/11/01 19:29:19.674958 conntrack stderr:NOTICE: Netlink socket buffer size has been set to 8388608 bytes.
<probe> INFO: 2017/11/01 19:29:19.719058 Success collecting weave ps
<probe> WARN: 2017/11/01 19:29:21.665221 awsecs tagger took longer than 1s
<probe> WARN: 2017/11/01 19:29:21.869640 Failed to describe ECS task arn:aws:ecs:us-east-1:832266673134:task/ebbb24c8-1f70-42f9-884c-45335995505d, ECS service report may be incomplete: MISSING
<probe> WARN: 2017/11/01 19:29:21.869667 Failed to describe ECS task arn:aws:ecs:us-east-1:832266673134:task/7a8ba970-af67-4039-8fc9-8104bdf4ae68, ECS service report may be incomplete: MISSING
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x17fd0fa]
goroutine 100 [running]:
github.com/weaveworks/scope/probe/awsecs.ecsClientImpl.getTasks(0xc422a54420, 0xc423d9a336, 0xa, 0x2f2b0a0, 0xc42041a180, 0x2f2b0a0, 0xc42080af00, 0xc423251000, 0x3f, 0x40)
/go/src/github.com/weaveworks/scope/probe/awsecs/client.go:247 +0x4aa
github.com/weaveworks/scope/probe/awsecs.ecsClientImpl.ensureTasksAreCached(0xc422a54420, 0xc423d9a336, 0xa, 0x2f2b0a0, 0xc42041a180, 0x2f2b0a0, 0xc42080af00, 0xc423270400, 0x3f, 0x3f)
/go/src/github.com/weaveworks/scope/probe/awsecs/client.go:301 +0x27a
github.com/weaveworks/scope/probe/awsecs.ecsClientImpl.GetInfo(0xc422a54420, 0xc423d9a336, 0xa, 0x2f2b0a0, 0xc42041a180, 0x2f2b0a0, 0xc42080af00, 0xc423270400, 0x3f, 0x3f, ...)
/go/src/github.com/weaveworks/scope/probe/awsecs/client.go:365 +0x126
github.com/weaveworks/scope/probe/awsecs.(*ecsClientImpl).GetInfo(0xc42343aac0, 0xc423270400, 0x3f, 0x3f, 0x0, 0x3f, 0xc421330c70)
<autogenerated>:12 +0x94
github.com/weaveworks/scope/probe/awsecs.Reporter.Tag(0xc422efcd80, 0x100000, 0x34630b8a000, 0x7fff00ebef28, 0x9, 0xc421330d90, 0xc421330c70, 0x10, 0x0, 0x0, ...)
/go/src/github.com/weaveworks/scope/probe/awsecs/reporter.go:147 +0x486
github.com/weaveworks/scope/probe/awsecs.(*Reporter).Tag(0xc422efb280, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc423d9cb10, 0xc423d9cb40, 0x0, ...)
<autogenerated>:18 +0xb8
github.com/weaveworks/scope/probe.(*Probe).tag(0xc4200fe7e0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc423d9cb10, 0xc423d9cb40, 0x0, ...)
/go/src/github.com/weaveworks/scope/probe/probe.go:180 +0x1b2
github.com/weaveworks/scope/probe.(*Probe).spyLoop(0xc4200fe7e0)
/go/src/github.com/weaveworks/scope/probe/probe.go:129 +0x1d1
created by github.com/weaveworks/scope/probe.(*Probe).Start
/go/src/github.com/weaveworks/scope/probe/probe.go:102 +0x5c
time="2017-11-01T19:29:22Z" level=info msg="publishing to: https://cloud.weave.works:443" |
As soon as #2918 is released I'll test this. |
When running Scope on ECS with --probe.ecs=true we get:

Setting --probe.ecs=false works without issues.