ECS Probe - SIGSEGV: segmentation violation #2508
Thanks for the report @pecigonzalo. What version of Scope are you running? |
@2opremio thanks for the quick reply |
Thanks, I will take a look. Hopefully it will be an easy one to fix :) |
One more piece of info: this is running on Amazon Linux: |
I am almost certain that this was caused by the ARN of a task being @ekimekim Any idea of how this could happen? I vaguely recall discussing #2026 and you mentioning that the API guaranteed no task pointers could be null. To start with, should the initialization happen if there's a failure in the loop above? @pecigonzalo Would you mind sending us the full logs? In particular, do you see errors like |
It might be possible that in the event of a partial failure, empty tasks can be returned? That seems odd to me, but the documentation isn't very specific so it's a possibility. |
@pecigonzalo I have merged #2514 with the hope of avoiding the crash. Would you mind running Even if it does fix the crashes, there must be a deeper problem causing the null values to be obtained, so please send us the logs anyhow. |
@2opremio I don't see the log line you mentioned, but I'll run this again and let you know what I can find, and then run with latest, thanks |
@2opremio OK, I just ran some tests again; unfortunately, the crash is not happening anymore with I'll keep I'm getting a lot of
for several tasks |
Yep, probably this happens when tasks die/end. I bet |
I believe it's true, as I was getting more tasks listed than I actually had. Unfortunately I don't have the full log; if I can replicate it I'll post here. |
Please do, thanks. |
OK, I just had this again @2opremio
After this, I ran with |
OK, good to know. We will be releasing |
This is still happening, now I believe due to the AWS API rate limit, maybe related to #2050.
But it also makes it so that Scope reports the node in |
I was at a loss as to how the code can panic in that place:

    for _, task := range resp.Tasks {
        if task.TaskArn != nil {
            c.taskCache.Set(*task.TaskArn, newECSTask(task)) // panics here
        }
    }

though, I've just noticed that the addr in the SIGSEGV is not 0x0 but 0x8. |
is |
maybe, but the code should defend against that. |
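The `addr=0x8` detail supports that theory: dereferencing a field of a nil struct pointer faults at that field's offset, not at 0x0, so the slice may have contained a nil task pointer rather than a task with a nil ARN. A minimal sketch below (the `task` type and `safeArns` helper are invented here as a stand-in for the SDK's `ecs.Task`, not Scope's actual code) shows the defensive check that would avoid the crash:

```go
package main

import "fmt"

// task stands in for the AWS SDK's ecs.Task: pointer fields, returned in
// slices of *task that may, per the theory above, contain nil entries.
type task struct {
	ClusterArn *string // field at offset 0 on 64-bit
	TaskArn    *string // field at offset 8 on 64-bit
}

// safeArns collects ARNs while guarding against both a nil task pointer
// and a nil TaskArn field. Checking only `t.TaskArn != nil` is not enough:
// if t itself is nil, reading t.TaskArn dereferences address 0x8, which
// matches the addr reported in the SIGSEGV.
func safeArns(tasks []*task) []string {
	var arns []string
	for _, t := range tasks {
		if t == nil || t.TaskArn == nil {
			continue
		}
		arns = append(arns, *t.TaskArn)
	}
	return arns
}

func main() {
	arn := "arn:aws:ecs:us-east-1:123456789012:task/example"
	tasks := []*task{nil, {TaskArn: &arn}} // one nil entry, one valid task
	fmt.Println(safeArns(tasks))
}
```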
@rade This seems to still be an issue on:
And it's causing crazy high CPU usage. Maybe affects #1457 |
@pecigonzalo are you sure you are running scope 1.6.2? |
That is my launch command |
@pecigonzalo Please look for a line in the logs saying |
OK. I am at a complete loss as to what is going on here. I cannot think of any way that https://github.com/weaveworks/scope/blob/v1.6.2/probe/awsecs/client.go#L244 can explode with that error. @pecigonzalo can you? Also, do you think you could put together a minimal example / set of instructions that we could follow to reproduce the problem? |
I'm taking a wild shot, but this seems to coincide with the rate limit in many cases; could it be that |
I'm trying to find a way to trigger the issue, so we can reproduce it consistently, but my Go is a bit rusty for such a big project. Some questions to bounce around:
As a side note from my findings, I think we are polling ECS too often. In my experience, if you have many failed tasks that keep being polled from ECS instead of the cache, that is a lot of calls, multiplied by the number of nodes in the cluster. |
It's not getting that far. Unless the stack trace is incomplete. It's an interesting theory.
Because
Yes, but I would rather not ship a new scope release just to add some debug statements. Can you build scope yourself? |
Yeah, I built it with that and it's running manually; let's see if I hit it. Regarding |
Do you have evidence of undesirable behaviour, such as excessive CPU usage or something like that? Mind filing a separate issue? |
Not really for the rate-limit comment; the CPU usage I experience from this issue is due to the SIGSEGV causing the thread/probe to restart constantly. Filed under: #2844 |
A similar but different stack trace from probe 1.6.4:
given the protection against pointer access on that line, I would have to suspect the fault is actually in |
@bboreham I have the same problem that you reported, but I'm using

<probe> INFO: 2017/11/01 19:29:19.092470 command line args: --mode=probe --probe.docker=true --probe.ecs=true --probe.ecs.cluster.region=us-east-1 --service-token=<elided>
<probe> INFO: 2017/11/01 19:29:19.092508 probe starting, version 1.6.5, ID 6634efb2e53a7246
<probe> INFO: 2017/11/01 19:29:19.613257 Control connection to cloud.weave.works starting
<probe> INFO: 2017/11/01 19:29:19.663189 Success collecting weave status
<probe> ERRO: 2017/11/01 19:29:19.674958 conntrack stderr:NOTICE: Netlink socket buffer size has been set to 8388608 bytes.
<probe> INFO: 2017/11/01 19:29:19.719058 Success collecting weave ps
<probe> WARN: 2017/11/01 19:29:21.665221 awsecs tagger took longer than 1s
<probe> WARN: 2017/11/01 19:29:21.869640 Failed to describe ECS task arn:aws:ecs:us-east-1:832266673134:task/ebbb24c8-1f70-42f9-884c-45335995505d, ECS service report may be incomplete: MISSING
<probe> WARN: 2017/11/01 19:29:21.869667 Failed to describe ECS task arn:aws:ecs:us-east-1:832266673134:task/7a8ba970-af67-4039-8fc9-8104bdf4ae68, ECS service report may be incomplete: MISSING
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x17fd0fa]
goroutine 100 [running]:
github.com/weaveworks/scope/probe/awsecs.ecsClientImpl.getTasks(0xc422a54420, 0xc423d9a336, 0xa, 0x2f2b0a0, 0xc42041a180, 0x2f2b0a0, 0xc42080af00, 0xc423251000, 0x3f, 0x40)
/go/src/github.com/weaveworks/scope/probe/awsecs/client.go:247 +0x4aa
github.com/weaveworks/scope/probe/awsecs.ecsClientImpl.ensureTasksAreCached(0xc422a54420, 0xc423d9a336, 0xa, 0x2f2b0a0, 0xc42041a180, 0x2f2b0a0, 0xc42080af00, 0xc423270400, 0x3f, 0x3f)
/go/src/github.com/weaveworks/scope/probe/awsecs/client.go:301 +0x27a
github.com/weaveworks/scope/probe/awsecs.ecsClientImpl.GetInfo(0xc422a54420, 0xc423d9a336, 0xa, 0x2f2b0a0, 0xc42041a180, 0x2f2b0a0, 0xc42080af00, 0xc423270400, 0x3f, 0x3f, ...)
/go/src/github.com/weaveworks/scope/probe/awsecs/client.go:365 +0x126
github.com/weaveworks/scope/probe/awsecs.(*ecsClientImpl).GetInfo(0xc42343aac0, 0xc423270400, 0x3f, 0x3f, 0x0, 0x3f, 0xc421330c70)
<autogenerated>:12 +0x94
github.com/weaveworks/scope/probe/awsecs.Reporter.Tag(0xc422efcd80, 0x100000, 0x34630b8a000, 0x7fff00ebef28, 0x9, 0xc421330d90, 0xc421330c70, 0x10, 0x0, 0x0, ...)
/go/src/github.com/weaveworks/scope/probe/awsecs/reporter.go:147 +0x486
github.com/weaveworks/scope/probe/awsecs.(*Reporter).Tag(0xc422efb280, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc423d9cb10, 0xc423d9cb40, 0x0, ...)
<autogenerated>:18 +0xb8
github.com/weaveworks/scope/probe.(*Probe).tag(0xc4200fe7e0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc423d9cb10, 0xc423d9cb40, 0x0, ...)
/go/src/github.com/weaveworks/scope/probe/probe.go:180 +0x1b2
github.com/weaveworks/scope/probe.(*Probe).spyLoop(0xc4200fe7e0)
/go/src/github.com/weaveworks/scope/probe/probe.go:129 +0x1d1
created by github.com/weaveworks/scope/probe.(*Probe).Start
/go/src/github.com/weaveworks/scope/probe/probe.go:102 +0x5c
time="2017-11-01T19:29:22Z" level=info msg="publishing to: https://cloud.weave.works:443" |
As soon as #2918 is released I'll test this. |
When running Scope on ECS with --probe.ecs=true we get:

Setting --probe.ecs=false works without issues.