
ECS Probe - SIGSEGV: segmentation violation #2508

Closed
pecigonzalo opened this issue May 9, 2017 · 34 comments · Fixed by #2514
@pecigonzalo

When running Scope on ECS with --probe.ecs=true we get:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x642ca6]

goroutine 51 [running]:
panic(0x1510980, 0xc420012050)
        /usr/lib/go-1.7/src/runtime/panic.go:500 +0x1a1
github.com/weaveworks/scope/probe/awsecs.ecsClientImpl.getTasks(0xc420021ca0, 0xc4232ce1e0, 0x18, 0x23b93c0, 0xc422d57c20, 0x23b93c0, 0xc422d57ce0, 0xc42360e000, 0x13, 0x20)
        /go/src/github.com/weaveworks/scope/probe/awsecs/client.go:236 +0x4a6
github.com/weaveworks/scope/probe/awsecs.ecsClientImpl.ensureTasksAreCached(0xc420021ca0, 0xc4232ce1e0, 0x18, 0x23b93c0, 0xc422d57c20, 0x23b93c0, 0xc422d57ce0, 0xc422e6cc80, 0x13, 0x13)
        /go/src/github.com/weaveworks/scope/probe/awsecs/client.go:289 +0x285
github.com/weaveworks/scope/probe/awsecs.ecsClientImpl.GetInfo(0xc420021ca0, 0xc4232ce1e0, 0x18, 0x23b93c0, 0xc422d57c20, 0x23b93c0, 0xc422d57ce0, 0xc422e6cc80, 0x13, 0x13, ...)
        /go/src/github.com/weaveworks/scope/probe/awsecs/client.go:353 +0x12f
github.com/weaveworks/scope/probe/awsecs.(*ecsClientImpl).GetInfo(0xc423664000, 0xc422e6cc80, 0x13, 0x13, 0x0, 0x13, 0xc4232ce1e0)
        <autogenerated>:12 +0x99
github.com/weaveworks/scope/probe/awsecs.Reporter.Tag(0xc4215bac00, 0x100000, 0x34630b8a000, 0xc421043270, 0xc421043150, 0x10, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/weaveworks/scope/probe/awsecs/reporter.go:145 +0x4b8
github.com/weaveworks/scope/probe/awsecs.(*Reporter).Tag(0xc4215bac90, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc4228dfa10, 0xc4228dfad0, 0x0, ...)
        <autogenerated>:18 +0xac
github.com/weaveworks/scope/probe.(*Probe).tag(0xc42035a090, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc4228dfa10, 0xc4228dfad0, 0x0, ...)
        /go/src/github.com/weaveworks/scope/probe/probe.go:180 +0x1b0
github.com/weaveworks/scope/probe.(*Probe).spyLoop(0xc42035a090)
        /go/src/github.com/weaveworks/scope/probe/probe.go:129 +0x1c1
created by github.com/weaveworks/scope/probe.(*Probe).Start
        /go/src/github.com/weaveworks/scope/probe/probe.go:102 +0x5c

Setting --probe.ecs=false works without issues.

@2opremio 2opremio added the bug Broken end user or developer functionality; not working as the developers intended it label May 9, 2017
@2opremio 2opremio added this to the Next milestone May 9, 2017
@2opremio
Contributor

2opremio commented May 9, 2017

Thanks for the report, @pecigonzalo. What version of Scope are you running?

@2opremio 2opremio self-assigned this May 9, 2017
@pecigonzalo
Author

pecigonzalo commented May 9, 2017

@2opremio Thanks for the quick reply. We are running Weave Scope version 1.3.0. Let me know if there is any additional information I can provide.

@2opremio
Contributor

2opremio commented May 9, 2017

Thanks, I will take a look. Hopefully it will be an easy one to fix :)

@pecigonzalo
Author

One more piece of info: this is running on Amazon Linux: Linux <hostname> 4.4.39-34.54.amzn1.x86_64 #1 SMP Fri Dec 30 19:11:28 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

@2opremio
Contributor

2opremio commented May 10, 2017

I am almost certain that this was caused by the ARN of a task being nil.

@ekimekim Any idea of how this could happen? I vaguely recall discussing #2026 and you mentioning that the API guaranteed no task pointers could be null. To start with, should the initialization happen if there's a failure in the loop above?

@pecigonzalo Would you mind sending us the full logs? In particular, do you see errors like "Failed to describe ECS task %s, ECS service report may be incomplete: %s" ?

@ekimekim
Contributor

It might be possible that, in the event of a partial failure, empty tasks can be returned? That seems odd to me, but the documentation isn't very specific, so it's a possibility.

@2opremio
Contributor

2opremio commented May 10, 2017

@pecigonzalo I have merged #2514 with the hope of avoiding the crash. Would you mind running weaveworks/scope:latest to see if the crash goes away?

Even if it does fix the crashes, there must be a deeper problem causing the null values, so please send us the logs anyhow.

@pecigonzalo
Author

@2opremio I don't see the log line you mentioned, but I'll run this again, let you know what I find, and then run with latest.

Thanks.

@pecigonzalo
Author

pecigonzalo commented May 10, 2017

@2opremio OK, I just ran some tests again. Unfortunately, the crash is not happening anymore with 1.3.0 either, so I can't tell whether it's fixed. I do believe it is related to your PR, though, perhaps triggered when certain listed tasks/services are no longer present.

I'll keep 1.3.0 running and report back if I see the issue again.

I'm getting a lot of

Failed to describe ECS task arn:aws:ecs:eu-central-1:<redacted>:task/<redacted>, ECS service report may be incomplete: MISSING

for several tasks.

@2opremio
Contributor

2opremio commented May 10, 2017

maybe when certain listed tasks/services are no longer present.

Yep, probably this happens when tasks die/end.

I bet DescribeTasks may return nil pointers for tasks which don't exist anymore.

@pecigonzalo
Author

I believe that's true, as I was getting more tasks listed than I actually had. Unfortunately I don't have the full log; if I can replicate it I'll post here.

@2opremio
Contributor

Please do, thanks.

@pecigonzalo
Author

pecigonzalo commented May 19, 2017

OK, I just hit this again @2opremio:

<probe> INFO: 2017/05/19 10:55:21.722228 command line args: --logtostderr=true --mode=probe --probe.docker=true --probe.ecs=true
<probe> INFO: 2017/05/19 10:55:21.722307 probe starting, version 1.3.0, ID 6ed3ea4b47fb014f
<probe> ERRO: 2017/05/19 10:55:21.832364 conntrack stderr:NOTICE: Netlink socket buffer size has been set to 425984 bytes.
<probe> INFO: 2017/05/19 10:55:21.872855 Success collecting weave plugin status
<probe> INFO: 2017/05/19 10:55:21.890689 Success collecting weave ps
<probe> ERRO: 2017/05/19 10:55:21.890778 conntrack stderr:NOTICE: Netlink socket buffer size has been set to 425984 bytes.
<probe> INFO: 2017/05/19 10:55:21.930740 Success collecting weave status
<probe> INFO: 2017/05/19 10:55:21.945030 Control connection to 127.0.0.1 starting
<probe> INFO: 2017/05/19 10:55:21.989464 Success collecting weave proxy status
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x642ca6]

goroutine 51 [running]:
panic(0x1510980, 0xc420010050)
        /usr/lib/go-1.7/src/runtime/panic.go:500 +0x1a1
github.com/weaveworks/scope/probe/awsecs.ecsClientImpl.getTasks(0xc424274090, 0xc4228af720, 0x18, 0x23b93c0, 0xc4242ed620, 0x23b93c0, 0xc4242ed800, 0xc4243c0a20, 0x2, 0x2)
        /go/src/github.com/weaveworks/scope/probe/awsecs/client.go:236 +0x4a6
github.com/weaveworks/scope/probe/awsecs.ecsClientImpl.ensureTasksAreCached(0xc424274090, 0xc4228af720, 0x18, 0x23b93c0, 0xc4242ed620, 0x23b93c0, 0xc4242ed800, 0xc4243c0a00, 0x2, 0x2)
        /go/src/github.com/weaveworks/scope/probe/awsecs/client.go:289 +0x285
github.com/weaveworks/scope/probe/awsecs.ecsClientImpl.GetInfo(0xc424274090, 0xc4228af720, 0x18, 0x23b93c0, 0xc4242ed620, 0x23b93c0, 0xc4242ed800, 0xc4243c0a00, 0x2, 0x2, ...)
        /go/src/github.com/weaveworks/scope/probe/awsecs/client.go:353 +0x12f
github.com/weaveworks/scope/probe/awsecs.(*ecsClientImpl).GetInfo(0xc42439f940, 0xc4243c0a00, 0x2, 0x2, 0x0, 0x2, 0xc4228af720)
        <autogenerated>:12 +0x99
github.com/weaveworks/scope/probe/awsecs.Reporter.Tag(0xc4215b6c90, 0x100000, 0x34630b8a000, 0xc421095250, 0xc4210950f0, 0x10, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/weaveworks/scope/probe/awsecs/reporter.go:145 +0x4b8
github.com/weaveworks/scope/probe/awsecs.(*Reporter).Tag(0xc4215b6d20, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc42431c210, 0xc42431c240, 0x0, ...)
        <autogenerated>:18 +0xac
github.com/weaveworks/scope/probe.(*Probe).tag(0xc420071e60, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc42431c210, 0xc42431c240, 0x0, ...)
        /go/src/github.com/weaveworks/scope/probe/probe.go:180 +0x1b0
github.com/weaveworks/scope/probe.(*Probe).spyLoop(0xc420071e60)
        /go/src/github.com/weaveworks/scope/probe/probe.go:129 +0x1c1
created by github.com/weaveworks/scope/probe.(*Probe).Start
        /go/src/github.com/weaveworks/scope/probe/probe.go:102 +0x5c
time="2017-05-19T10:55:23Z" level=info msg="publishing to: 127.0.0.1:4040" 
<probe> INFO: 2017/05/19 10:55:23.275463 command line args: --logtostderr=true --mode=probe --probe.docker=true --probe.ecs=true
<probe> INFO: 2017/05/19 10:55:23.275542 probe starting, version 1.3.0, ID 18b75ed5e656b12c
<probe> ERRO: 2017/05/19 10:55:23.390745 conntrack stderr:NOTICE: Netlink socket buffer size has been set to 425984 bytes.
<probe> INFO: 2017/05/19 10:55:23.426392 Success collecting weave plugin status
<probe> INFO: 2017/05/19 10:55:23.437603 Success collecting weave ps
<probe> ERRO: 2017/05/19 10:55:23.439163 conntrack stderr:NOTICE: Netlink socket buffer size has been set to 425984 bytes.
<probe> INFO: 2017/05/19 10:55:23.476731 Success collecting weave status
<probe> INFO: 2017/05/19 10:55:23.504597 Control connection to 127.0.0.1 starting
<probe> INFO: 2017/05/19 10:55:23.528011 Success collecting weave proxy status
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x642ca6]

goroutine 51 [running]:
panic(0x1510980, 0xc420010050)
        /usr/lib/go-1.7/src/runtime/panic.go:500 +0x1a1
github.com/weaveworks/scope/probe/awsecs.ecsClientImpl.getTasks(0xc42001e088, 0xc4228e0080, 0x18, 0x23b93c0, 0xc42004e120, 0x23b93c0, 0xc42004e180, 0xc4243d3420, 0x2, 0x2)
        /go/src/github.com/weaveworks/scope/probe/awsecs/client.go:236 +0x4a6
github.com/weaveworks/scope/probe/awsecs.ecsClientImpl.ensureTasksAreCached(0xc42001e088, 0xc4228e0080, 0x18, 0x23b93c0, 0xc42004e120, 0x23b93c0, 0xc42004e180, 0xc4243d3400, 0x2, 0x2)
        /go/src/github.com/weaveworks/scope/probe/awsecs/client.go:289 +0x285
github.com/weaveworks/scope/probe/awsecs.ecsClientImpl.GetInfo(0xc42001e088, 0xc4228e0080, 0x18, 0x23b93c0, 0xc42004e120, 0x23b93c0, 0xc42004e180, 0xc4243d3400, 0x2, 0x2, ...)
        /go/src/github.com/weaveworks/scope/probe/awsecs/client.go:353 +0x12f
github.com/weaveworks/scope/probe/awsecs.(*ecsClientImpl).GetInfo(0xc42436c140, 0xc4243d3400, 0x2, 0x2, 0x0, 0x2, 0xc4228e0080)
        <autogenerated>:12 +0x99
github.com/weaveworks/scope/probe/awsecs.Reporter.Tag(0xc4215bac90, 0x100000, 0x34630b8a000, 0xc421086f20, 0xc421086dc0, 0x10, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/weaveworks/scope/probe/awsecs/reporter.go:145 +0x4b8
github.com/weaveworks/scope/probe/awsecs.(*Reporter).Tag(0xc4215bad20, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc4243224b0, 0xc4243224e0, 0x0, ...)
        <autogenerated>:18 +0xac
github.com/weaveworks/scope/probe.(*Probe).tag(0xc420071ef0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc4243224b0, 0xc4243224e0, 0x0, ...)
        /go/src/github.com/weaveworks/scope/probe/probe.go:180 +0x1b0
github.com/weaveworks/scope/probe.(*Probe).spyLoop(0xc420071ef0)
        /go/src/github.com/weaveworks/scope/probe/probe.go:129 +0x1c1
created by github.com/weaveworks/scope/probe.(*Probe).Start
        /go/src/github.com/weaveworks/scope/probe/probe.go:102 +0x5c
time="2017-05-19T10:55:24Z" level=info msg="publishing to: 127.0.0.1:4040" 
<probe> INFO: 2017/05/19 10:55:24.886530 command line args: --logtostderr=true --mode=probe --probe.docker=true --probe.ecs=true
<probe> INFO: 2017/05/19 10:55:24.886612 probe starting, version 1.3.0, ID 7a3040bbfa6b7b85
<probe> ERRO: 2017/05/19 10:55:25.023033 conntrack stderr:NOTICE: Netlink socket buffer size has been set to 425984 bytes.
<probe> INFO: 2017/05/19 10:55:25.064751 Success collecting weave plugin status
<probe> ERRO: 2017/05/19 10:55:25.091617 conntrack stderr:NOTICE: Netlink socket buffer size has been set to 425984 bytes.
<probe> INFO: 2017/05/19 10:55:25.092015 Success collecting weave ps
<probe> INFO: 2017/05/19 10:55:25.104618 Success collecting weave status
<probe> INFO: 2017/05/19 10:55:25.129010 Control connection to 127.0.0.1 starting
<probe> INFO: 2017/05/19 10:55:25.131402 Success collecting weave proxy status

After this, I ran with :latest and had no issues. I do believe this is related to empty/missing tasks, as it happened right after I updated some of them.

@2opremio
Contributor

OK, good to know. We will be releasing 1.3.1 soon (in a day or two), so please stay tuned.

@pecigonzalo
Author

This is still happening; I now believe it is due to the AWS API rate limit, maybe related to #2050.

<probe> INFO: 2017/07/20 13:37:53.406117 command line args: --logtostderr=true --mode=probe --probe.docker=true --probe.ecs=true
<probe> INFO: 2017/07/20 13:37:53.406256 probe starting, version 1.5.1, ID 2729b73587f6c30b
<probe> INFO: 2017/07/20 13:37:53.454647 Control connection to 127.0.0.1 starting
<probe> INFO: 2017/07/20 13:37:53.961431 Success collecting weave proxy status
<probe> ERRO: 2017/07/20 13:37:53.968585 conntrack stderr:NOTICE: Netlink socket buffer size has been set to 425984 bytes.
<probe> INFO: 2017/07/20 13:37:53.973148 Success collecting weave plugin status
<probe> INFO: 2017/07/20 13:37:54.014688 Success collecting weave status
<probe> INFO: 2017/07/20 13:37:54.042530 Control connection to scope.weave.local. starting
<probe> INFO: 2017/07/20 13:37:54.042923 Control connection to scope.weave.local. starting
<probe> INFO: 2017/07/20 13:37:54.042684 Control connection to scope.weave.local. starting
<probe> INFO: 2017/07/20 13:37:54.055308 Success collecting weave ps
<probe> WARN: 2017/07/20 13:37:55.122235 Failed to describe ECS task arn:aws:ecs:eu-central-1:123123:task/36945a1a-3efa-4965-a8d4-01594383b520, ECS service report may be incomplete: MISSING
<probe> WARN: 2017/07/20 13:37:55.122256 Failed to describe ECS task arn:aws:ecs:eu-central-1:123123:task/2d0e4381-52a2-49a4-9213-47f695e3e969, ECS service report may be incomplete: MISSING
<probe> WARN: 2017/07/20 13:37:55.122262 Failed to describe ECS task arn:aws:ecs:eu-central-1:123123:task/17d138ca-1c3c-47e7-8dc0-9f54bc380a46, ECS service report may be incomplete: MISSING
<probe> WARN: 2017/07/20 13:37:55.122267 Failed to describe ECS task arn:aws:ecs:eu-central-1:123123:task/f5c3e08f-f391-4972-b169-46e2e1812d90, ECS service report may be incomplete: MISSING
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x6480bb]
goroutine 80 [running]:
panic(0x1518ce0, 0xc420010050)
        /usr/lib/go-1.7/src/runtime/panic.go:500 +0x1a1
github.com/weaveworks/scope/probe/awsecs.ecsClientImpl.getTasks(0xc422c40c18, 0xc422a75b00, 0x18, 0x23e4280, 0xc4235cb200, 0x23e4280, 0xc4235cb260, 0xc4230da100, 0xf, 0x10)
        /go/src/github.com/weaveworks/scope/probe/awsecs/client.go:244 +0x4fb
github.com/weaveworks/scope/probe/awsecs.ecsClientImpl.ensureTasksAreCached(0xc422c40c18, 0xc422a75b00, 0x18, 0x23e4280, 0xc4235cb200, 0x23e4280, 0xc4235cb260, 0xc4230fa000, 0xf, 0xf)
        /go/src/github.com/weaveworks/scope/probe/awsecs/client.go:298 +0x285
github.com/weaveworks/scope/probe/awsecs.ecsClientImpl.GetInfo(0xc422c40c18, 0xc422a75b00, 0x18, 0x23e4280, 0xc4235cb200, 0x23e4280, 0xc4235cb260, 0xc4230fa000, 0xf, 0xf, ...)
        /go/src/github.com/weaveworks/scope/probe/awsecs/client.go:362 +0x12f
github.com/weaveworks/scope/probe/awsecs.(*ecsClientImpl).GetInfo(0xc423874000, 0xc4230fa000, 0xf, 0xf, 0x0, 0xf, 0xc422a75b00)
        <autogenerated>:12 +0x99
github.com/weaveworks/scope/probe/awsecs.Reporter.Tag(0xc422c74660, 0x100000, 0x34630b8a000, 0xc4217bd5c0, 0xc4217bd460, 0x10, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/weaveworks/scope/probe/awsecs/reporter.go:145 +0x4d0
github.com/weaveworks/scope/probe/awsecs.(*Reporter).Tag(0xc422c746f0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc423900a20, 0xc423900a50, 0x0, ...)
        <autogenerated>:18 +0xac
github.com/weaveworks/scope/probe.(*Probe).tag(0xc42015a750, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc423900a20, 0xc423900a50, 0x0, ...)
        /go/src/github.com/weaveworks/scope/probe/probe.go:180 +0x1b0
github.com/weaveworks/scope/probe.(*Probe).spyLoop(0xc42015a750)
        /go/src/github.com/weaveworks/scope/probe/probe.go:129 +0x1c1
created by github.com/weaveworks/scope/probe.(*Probe).Start
        /go/src/github.com/weaveworks/scope/probe/probe.go:102 +0x5c
time="2017-07-20T13:37:55Z" level=info msg="publishing to: 127.0.0.1:4040" 
<probe> INFO: 2017/07/20 13:37:55.566826 command line args: --logtostderr=true --mode=probe --probe.docker=true --probe.ecs=true
<probe> INFO: 2017/07/20 13:37:55.566894 probe starting, version 1.5.1, ID 339d52cb7b1d2abe
<probe> INFO: 2017/07/20 13:37:55.583605 Control connection to 127.0.0.1 starting
<probe> INFO: 2017/07/20 13:37:56.001452 Success collecting weave plugin status
<probe> INFO: 2017/07/20 13:37:56.001650 Success collecting weave proxy status
<probe> INFO: 2017/07/20 13:37:56.124219 Control connection to scope.weave.local. starting
<probe> INFO: 2017/07/20 13:37:56.124564 Control connection to scope.weave.local. starting
<probe> INFO: 2017/07/20 13:37:56.113682 Control connection to scope.weave.local. starting
<probe> ERRO: 2017/07/20 13:37:56.132358 conntrack stderr:NOTICE: Netlink socket buffer size has been set to 425984 bytes.
<probe> INFO: 2017/07/20 13:37:56.137031 Success collecting weave status
<probe> INFO: 2017/07/20 13:37:56.196769 Success collecting weave ps

It also makes Scope report the node in the Weave Net section as unmanaged. I believe we need to back off, and maybe even collect from only a single node.

@rade rade reopened this Jul 20, 2017
@rade
Member

rade commented Jul 20, 2017

I was at a loss as to how the code could panic in that place:

	for _, task := range resp.Tasks {
		if task.TaskArn != nil {
			c.taskCache.Set(*task.TaskArn, newECSTask(task)) // panics here
		}
	}

Though I've just noticed that the addr in the SIGSEGV is not 0x0 but 0x8.

@pecigonzalo
Author

Is resp.Tasks maybe a generator? (Sorry, I did not get to look at the code yet.) Could we be getting the panic due to the rate limit?

@rade
Member

rade commented Jul 20, 2017

getting the panic due to the rate limit?

maybe, but the code should defend against that.

rade added a commit that referenced this issue Jul 20, 2017
Fixes #2508. Hopefully.
@rade rade assigned rade and unassigned 2opremio Jul 20, 2017
@rade rade modified the milestones: Next, 1.6 Jul 28, 2017
@pecigonzalo
Author

@rade This seems to still be an issue on: weaveworks/scope:1.6.2

<probe> WARN: 2017/08/29 13:31:26.782603 Failed to describe ECS task arn:aws:ecs:eu-central-1:123:task/6741f751-06f2-41bc-aa2c-8192ab0c66df, ECS service report may be incomplete: MISSING
<probe> WARN: 2017/08/29 13:31:26.782610 Failed to describe ECS task arn:aws:ecs:eu-central-1:123:task/3ab4ac29-872b-4443-867c-ddab0291f4d6, ECS service report may be incomplete: MISSING
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x17f4811]
goroutine 67 [running]:
github.com/weaveworks/scope/probe/awsecs.ecsClientImpl.getTasks(0xc42000e8d0, 0xc422b54420, 0x18, 0x2f18f60, 0xc421e100c0, 0x2f18f60, 0xc421e10120, 0xc42492a400, 0x3d, 0x40)
        /go/src/github.com/weaveworks/scope/probe/awsecs/client.go:244 +0x4d1
github.com/weaveworks/scope/probe/awsecs.ecsClientImpl.ensureTasksAreCached(0xc42000e8d0, 0xc422b54420, 0x18, 0x2f18f60, 0xc421e100c0, 0x2f18f60, 0xc421e10120, 0xc42492a000, 0x3d, 0x3d)
        /go/src/github.com/weaveworks/scope/probe/awsecs/client.go:298 +0x27a
github.com/weaveworks/scope/probe/awsecs.ecsClientImpl.GetInfo(0xc42000e8d0, 0xc422b54420, 0x18, 0x2f18f60, 0xc421e100c0, 0x2f18f60, 0xc421e10120, 0xc42492a000, 0x3d, 0x3d, ...)
        /go/src/github.com/weaveworks/scope/probe/awsecs/client.go:362 +0x126
github.com/weaveworks/scope/probe/awsecs.(*ecsClientImpl).GetInfo(0xc42492edc0, 0xc42492a000, 0x3d, 0x3d, 0x0, 0x3d, 0xc422b54420)
        <autogenerated>:12 +0x94
github.com/weaveworks/scope/probe/awsecs.Reporter.Tag(0xc422e03590, 0x100000, 0x34630b8a000, 0xc4219e74b0, 0xc4219e7410, 0x10, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/weaveworks/scope/probe/awsecs/reporter.go:145 +0x486
github.com/weaveworks/scope/probe/awsecs.(*Reporter).Tag(0xc422e03620, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc421ce03f0, 0xc421ce0420, 0x0, ...)
        <autogenerated>:18 +0xb8
github.com/weaveworks/scope/probe.(*Probe).tag(0xc420408360, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc421ce03f0, 0xc421ce0420, 0x0, ...)
        /go/src/github.com/weaveworks/scope/probe/probe.go:180 +0x1b2
github.com/weaveworks/scope/probe.(*Probe).spyLoop(0xc420408360)
        /go/src/github.com/weaveworks/scope/probe/probe.go:129 +0x1d1
created by github.com/weaveworks/scope/probe.(*Probe).Start
        /go/src/github.com/weaveworks/scope/probe/probe.go:102 +0x5c
time="2017-08-29T13:31:27Z" level=info msg="publishing to: 127.0.0.1:4040" 

And it's causing very high CPU usage. It may also affect #1457.

@rade
Member

rade commented Aug 29, 2017

@pecigonzalo are you sure you are running scope 1.6.2?

@pecigonzalo
Author

/usr/bin/docker run --name=weavescope --privileged \
    --cpu-shares=512 \
    --userns=host --net=host --pid=host \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v /var/run/scope/plugins:/var/run/scope/plugins \
    -v /sys/kernel/debug:/sys/kernel/debug \
    -e CHECKPOINT_DISABLE \
    weaveworks/scope:1.6.2 \
    --probe.docker=true \
    --probe.ecs=true

That is my launch command.

@rade
Member

rade commented Aug 29, 2017

@pecigonzalo Please look for a line in the logs saying probe starting, which should contain the version number. Just to make sure it is really running the expected version.

@pecigonzalo
Author

<app> INFO: 2017/09/01 10:57:37.032091 command line args: --mode=app --probe.docker=true --probe.ecs=true
time="2017-09-01T10:57:37Z" level=info msg="publishing to: 127.0.0.1:4040" 
<probe> INFO: 2017/09/01 10:57:37.060698 command line args: --mode=probe --probe.docker=true --probe.ecs=true
<probe> INFO: 2017/09/01 10:57:37.060757 probe starting, version 1.6.2, ID 63708bfbfd75030
<app> INFO: 2017/09/01 10:57:37.110162 listening on :4040
<app> WARN: 2017/09/01 10:57:37.194233 Error updating weaveDNS, backing off 20s: Error running weave ps: exit status 1: "Link not found\n"
<probe> INFO: 2017/09/01 10:57:37.372689 Control connection to 127.0.0.1 starting

@rade
Member

rade commented Sep 1, 2017

OK. I am at a complete loss as to what is going on here. I cannot think of any way that https://github.com/weaveworks/scope/blob/v1.6.2/probe/awsecs/client.go#L244 can explode with that error. @pecigonzalo, can you? Also, do you think you could put together a minimal example / set of instructions that we could follow to reproduce the problem?

@rade rade reopened this Sep 1, 2017
@rade rade modified the milestones: Next, 1.6 Sep 1, 2017
@pecigonzalo
Author

I'm taking a wild shot here, but this seems to coincide with the rate limit in many cases. Could it be that ecsClientImpl is nil because the rate limit causes session creation to fail, or something similar?

@rade
Member

rade commented Sep 1, 2017

c is a struct, not a pointer, so it cannot be nil. Besides, it is used earlier on in the same function.

@pecigonzalo
Author

pecigonzalo commented Sep 1, 2017

I'm trying to find a way to trigger the issue so we can reproduce it consistently, but my Go is a bit rusty for such a big project.

Some questions to bounce around:

  • Could it be that newECSTask(task) fails because a task is missing a field?
  • Could we put something like log.Debugf("TaskArn %s, Task: %s", task.TaskArn, ecsTask) before L244? That way we could see the inputs.
  • Why is it doing *task.TaskArn and not just task.TaskArn? As far as I can see we could use the value directly instead of dereferencing a pointer.

As a side note from my findings, I think we are polling ECS too often. In my experience, if you have many failed tasks that keep being polled from ECS instead of the cache, that is a lot of calls, multiplied by the number of nodes in the cluster.
Maybe Scope could track the failure count per task and, once it reaches some threshold, stop polling for a while.

@rade
Member

rade commented Sep 1, 2017

Could it be newECSTask(task) not being able to generate because a task is missing a key?

It's not getting that far, unless the stack trace is incomplete. It's an interesting theory.

Why is it doing *task.TaskArn and not just task.TaskArn?

Because task.TaskArn is a *string rather than string and we want the cache entry to be keyed on the string rather than the pointer.

Could we put something like log.Debugf("TaskArn %s, Task: %s", task.TaskArn, ecsTask) before the L244? That way we can see the inputs

Yes, but I would rather not ship a new scope release just to add some debug statements. Can you build scope yourself?

@pecigonzalo
Author

pecigonzalo commented Sep 1, 2017

Yeah, I built it with that change and it's running manually; let's see if I hit it.

Regarding *task.TaskArn, I understand what you are saying, but if I understood gcache correctly it will only store the key as a string anyway, so it would not matter.

@rade
Member

rade commented Sep 1, 2017

I think we are polling ECS too often. Based on my experience if you have many failed task which keeps polling from ECS instead of cache that is a lot of calls, times the number of nodes in the cluster.

Do you have evidence of undesirable behaviour, such as excessive CPU usage or something like that? Mind filing a separate issue?

@pecigonzalo
Author

Not really for the rate limit comment; the CPU usage I experience from this issue is due to the SIGSEGV causing the probe to restart constantly.

Filed under: #2844

@bboreham
Collaborator

bboreham commented Nov 1, 2017

Similar but different stack trace from probe 1.6.4:

<probe> INFO: 2017/11/01 14:09:15.970888 Publish loop for cloud.weave.works starting
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x17fc89a]

goroutine 45 [running]:
github.com/weaveworks/scope/probe/awsecs.ecsClientImpl.getTasks(0xc422fdc040, 0xc422eeab60, 0x17, 0x2f2c0a0, 0xc422f2d2c0, 0x2f2c0a0, 0xc42306b260, 0xc424151a50, 0x1, 0x1)
        /go/src/github.com/weaveworks/scope/probe/awsecs/client.go:244 +0x4aa
github.com/weaveworks/scope/probe/awsecs.ecsClientImpl.ensureTasksAreCached(0xc422fdc040, 0xc422eeab60, 0x17, 0x2f2c0a0, 0xc422f2d2c0, 0x2f2c0a0, 0xc42306b260, 0xc4242dd540, 0x13, 0x13)
        /go/src/github.com/weaveworks/scope/probe/awsecs/client.go:298 +0x27a
github.com/weaveworks/scope/probe/awsecs.ecsClientImpl.GetInfo(0xc422fdc040, 0xc422eeab60, 0x17, 0x2f2c0a0, 0xc422f2d2c0, 0x2f2c0a0, 0xc42306b260, 0xc4242dd540,0x13, 0x13, ...)
        /go/src/github.com/weaveworks/scope/probe/awsecs/client.go:362 +0x126
github.com/weaveworks/scope/probe/awsecs.(*ecsClientImpl).GetInfo(0xc42026c080, 0xc4242dd540, 0x13, 0x13, 0x0, 0x13, 0xc422dddd20)
        <autogenerated>:12 +0x94
github.com/weaveworks/scope/probe/awsecs.Reporter.Tag(0xc421a716b0, 0x100000, 0x34630b8a000, 0xc4214186e0, 0xc4214185c0, 0x10, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/weaveworks/scope/probe/awsecs/reporter.go:145 +0x486
github.com/weaveworks/scope/probe/awsecs.(*Reporter).Tag(0xc421a71740, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc424affd70, 0xc424affda0, 0x0, ...)
        <autogenerated>:18 +0xb8
github.com/weaveworks/scope/probe.(*Probe).tag(0xc421a72120, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc424affd70, 0xc424affda0, 0x0, ...)
        /go/src/github.com/weaveworks/scope/probe/probe.go:180 +0x1b2
github.com/weaveworks/scope/probe.(*Probe).spyLoop(0xc421a72120)
        /go/src/github.com/weaveworks/scope/probe/probe.go:129 +0x1d1
created by github.com/weaveworks/scope/probe.(*Probe).Start
        /go/src/github.com/weaveworks/scope/probe/probe.go:102 +0x5c
time="2017-11-01T14:09:24Z" level=info msg="publishing to: https://cloud.weave.works:443"

Given the protection against nil pointer access on that line, I would have to suspect the fault is actually in newECSTask, which has been inlined.

@galindro

galindro commented Nov 1, 2017

@bboreham I have the same problem you reported, but I'm using 1.6.5:

<probe> INFO: 2017/11/01 19:29:19.092470 command line args: --mode=probe --probe.docker=true --probe.ecs=true --probe.ecs.cluster.region=us-east-1 --service-token=<elided>
<probe> INFO: 2017/11/01 19:29:19.092508 probe starting, version 1.6.5, ID 6634efb2e53a7246
<probe> INFO: 2017/11/01 19:29:19.613257 Control connection to cloud.weave.works starting
<probe> INFO: 2017/11/01 19:29:19.663189 Success collecting weave status
<probe> ERRO: 2017/11/01 19:29:19.674958 conntrack stderr:NOTICE: Netlink socket buffer size has been set to 8388608 bytes.
<probe> INFO: 2017/11/01 19:29:19.719058 Success collecting weave ps
<probe> WARN: 2017/11/01 19:29:21.665221 awsecs tagger took longer than 1s
<probe> WARN: 2017/11/01 19:29:21.869640 Failed to describe ECS task arn:aws:ecs:us-east-1:832266673134:task/ebbb24c8-1f70-42f9-884c-45335995505d, ECS service report may be incomplete: MISSING
<probe> WARN: 2017/11/01 19:29:21.869667 Failed to describe ECS task arn:aws:ecs:us-east-1:832266673134:task/7a8ba970-af67-4039-8fc9-8104bdf4ae68, ECS service report may be incomplete: MISSING
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x17fd0fa]

goroutine 100 [running]:
github.com/weaveworks/scope/probe/awsecs.ecsClientImpl.getTasks(0xc422a54420, 0xc423d9a336, 0xa, 0x2f2b0a0, 0xc42041a180, 0x2f2b0a0, 0xc42080af00, 0xc423251000, 0x3f, 0x40)
	/go/src/github.com/weaveworks/scope/probe/awsecs/client.go:247 +0x4aa
github.com/weaveworks/scope/probe/awsecs.ecsClientImpl.ensureTasksAreCached(0xc422a54420, 0xc423d9a336, 0xa, 0x2f2b0a0, 0xc42041a180, 0x2f2b0a0, 0xc42080af00, 0xc423270400, 0x3f, 0x3f)
	/go/src/github.com/weaveworks/scope/probe/awsecs/client.go:301 +0x27a
github.com/weaveworks/scope/probe/awsecs.ecsClientImpl.GetInfo(0xc422a54420, 0xc423d9a336, 0xa, 0x2f2b0a0, 0xc42041a180, 0x2f2b0a0, 0xc42080af00, 0xc423270400, 0x3f, 0x3f, ...)
	/go/src/github.com/weaveworks/scope/probe/awsecs/client.go:365 +0x126
github.com/weaveworks/scope/probe/awsecs.(*ecsClientImpl).GetInfo(0xc42343aac0, 0xc423270400, 0x3f, 0x3f, 0x0, 0x3f, 0xc421330c70)
	<autogenerated>:12 +0x94
github.com/weaveworks/scope/probe/awsecs.Reporter.Tag(0xc422efcd80, 0x100000, 0x34630b8a000, 0x7fff00ebef28, 0x9, 0xc421330d90, 0xc421330c70, 0x10, 0x0, 0x0, ...)
	/go/src/github.com/weaveworks/scope/probe/awsecs/reporter.go:147 +0x486
github.com/weaveworks/scope/probe/awsecs.(*Reporter).Tag(0xc422efb280, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc423d9cb10, 0xc423d9cb40, 0x0, ...)
	<autogenerated>:18 +0xb8
github.com/weaveworks/scope/probe.(*Probe).tag(0xc4200fe7e0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc423d9cb10, 0xc423d9cb40, 0x0, ...)
	/go/src/github.com/weaveworks/scope/probe/probe.go:180 +0x1b2
github.com/weaveworks/scope/probe.(*Probe).spyLoop(0xc4200fe7e0)
	/go/src/github.com/weaveworks/scope/probe/probe.go:129 +0x1d1
created by github.com/weaveworks/scope/probe.(*Probe).Start
	/go/src/github.com/weaveworks/scope/probe/probe.go:102 +0x5c
time="2017-11-01T19:29:22Z" level=info msg="publishing to: https://cloud.weave.works:443" 

@pecigonzalo
Author

As soon as #2918 is released I'll test this.
