Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault after shutting down ebpf tracker #3711

Closed
bboreham opened this issue Oct 13, 2019 · 0 comments · Fixed by #3712
Closed

Segfault after shutting down ebpf tracker #3711

bboreham opened this issue Oct 13, 2019 · 0 comments · Fixed by #3712
Labels
bug Broken end user or developer functionality; not working as the developers intended it

Comments

@bboreham
Copy link
Collaborator

bboreham commented Oct 13, 2019

BUG REPORT

Seen in logs on a Kubernetes cluster. Sometimes the process would restart a few times then work; sometimes it would restart endlessly.

scope version 1.11.6 (1.11.4 and 1.11.3 had same symptom)

I believe this is a race condition from running getInitialState() on a background thread - if we shut down the ebpf tracker before that has finished it will crash.

These nodes seem to have a lot of TCP connections, e.g. conntrack -L reports 245155 on one, only around 3,000 in TIME_WAIT.

Logs:

2019-10-13T19:27:25.821787845Z time="2019-10-13T19:27:25Z" level=info msg="publishing to: https://<redacted>@cloud.weave.works."
2019-10-13T19:27:25.821833615Z <probe> INFO: 2019/10/13 19:27:25.821238 Basic authentication disabled
2019-10-13T19:27:25.849864019Z <probe> INFO: 2019/10/13 19:27:25.849501 command line args: --mode=probe --probe-only=true --probe.docker=true --probe.docker.bridge=docker0 --probe.kubernetes.role=host https://<elided>@cloud.weave.works.
2019-10-13T19:27:25.849895079Z <probe> INFO: 2019/10/13 19:27:25.849589 probe starting, version 1.11.6, ID 4240307090b3c701
2019-10-13T19:27:25.856663104Z <probe> WARN: 2019/10/13 19:27:25.856137 Cannot resolve 'scope.weave.local.': dial tcp 172.17.0.1:53: connect: connection refused
2019-10-13T19:27:26.177284533Z <probe> INFO: 2019/10/13 19:27:26.176898 Control connection to cloud.weave.works. starting
2019-10-13T19:27:26.563771966Z <probe> WARN: 2019/10/13 19:27:26.563445 Error collecting weave status, backing off 10s: Get http://127.0.0.1:6784/report: dial tcp 127.0.0.1:6784: connect: connection refused. If you are not running Weave Net, you may wish to suppress this warning by launching scope with the `--weave=false` option.
2019-10-13T19:27:26.604484396Z <probe> ERRO: 2019/10/13 19:27:26.604152 tcp tracer received event with timestamp 12652644202032895 even though the last timestamp was 12652644205786052. Stopping the eBPF tracker.
2019-10-13T19:27:26.604712068Z <probe> ERRO: 2019/10/13 19:27:26.604464 tcp tracer received event with timestamp 12652644202138598 even though the last timestamp was 12652644205786052. Stopping the eBPF tracker.
2019-10-13T19:27:26.624841063Z <probe> INFO: 2019/10/13 19:27:26.624417 Publish loop for cloud.weave.works. starting
2019-10-13T19:27:26.715718105Z <probe> WARN: 2019/10/13 19:27:26.714537 Dropping report to cloud.weave.works.
2019-10-13T19:27:26.758087095Z <probe> WARN: 2019/10/13 19:27:26.757677 Dropping report to cloud.weave.works.
[[[.]
2019-10-13T19:27:27.629609655Z <probe> WARN: 2019/10/13 19:27:27.629252 Dropping report to cloud.weave.works.
2019-10-13T19:27:27.661782679Z <probe> WARN: 2019/10/13 19:27:27.659335 ebpf tracker died, restarting it
2019-10-13T19:27:27.662414461Z <probe> WARN: 2019/10/13 19:27:27.661235 Dropping report to cloud.weave.works.
[...]
2019-10-13T19:27:28.205549324Z <probe> WARN: 2019/10/13 19:27:28.205110 Dropping report to cloud.weave.works.
2019-10-13T19:27:28.659833456Z <probe> WARN: 2019/10/13 19:27:28.659448 Endpoint reporter took longer than 1s
2019-10-13T19:27:28.820249185Z <probe> WARN: 2019/10/13 19:27:28.819872 Endpoint reporter took 1.160715041s (longer than 1s)
2019-10-13T19:27:28.878843652Z <probe> ERRO: 2019/10/13 19:27:28.875629 tcp tracer received event with timestamp 12652646202038856 even though the last timestamp was 12652646204876031. Stopping the eBPF tracker.
2019-10-13T19:27:28.878955243Z <probe> ERRO: 2019/10/13 19:27:28.875708 tcp tracer received event with timestamp 12652646202110612 even though the last timestamp was 12652646204876031. Stopping the eBPF tracker.
2019-10-13T19:27:28.963800677Z <probe> WARN: 2019/10/13 19:27:28.961378 ebpf tracker died again, gently falling back to proc scanning
2019-10-13T19:27:31.067514362Z panic: runtime error: invalid memory address or nil pointer dereference
2019-10-13T19:27:31.067560271Z [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x15a49fa]
2019-10-13T19:27:31.067568748Z 
2019-10-13T19:27:31.067573672Z goroutine 1545 [running]:
2019-10-13T19:27:31.067579267Z github.com/weaveworks/scope/probe/endpoint.(*EbpfTracker).feedInitialConnections(0x0, 0x22d8fc0, 0xc42104ca80, 0xc425108600, 0xc428593f48, 0x0, 0x0, 0xc420b7f8e0, 0x18)
2019-10-13T19:27:31.067583045Z  /go/src/github.com/weaveworks/scope/probe/endpoint/ebpf.go:324 +0x3a
2019-10-13T19:27:31.067586942Z github.com/weaveworks/scope/probe/endpoint.(*connectionTracker).getInitialState(0xc420628cc0)
2019-10-13T19:27:31.067590264Z  /go/src/github.com/weaveworks/scope/probe/endpoint/connection_tracker.go:186 +0x2bc
2019-10-13T19:27:31.06759352Z created by github.com/weaveworks/scope/probe/endpoint.(*connectionTracker).ReportConnections
2019-10-13T19:27:31.067596559Z  /go/src/github.com/weaveworks/scope/probe/endpoint/connection_tracker.go:99 +0x383
@bboreham bboreham added the bug Broken end user or developer functionality; not working as the developers intended it label Oct 13, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Broken end user or developer functionality; not working as the developers intended it
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant