Add eBPF connection tracking #1967

Closed

Conversation

@alban (Contributor) commented Oct 31, 2016

Introduces connection tracking via eBPF. This allows Scope to get
notified of every connection event, without relying on parsing
/proc/$pid/net/tcp{,6} and /proc/$pid/fd/*, and therefore improves
performance.

The eBPF program is in a Python script using bcc: docker/tcpv4tracer.py.
It is contributed upstream via iovisor/bcc#762.
It uses kprobes on the following kernel functions:

  • tcp_v4_connect
  • inet_csk_accept
  • tcp_close

It generates "connect", "accept" and "close" events containing the
connection tuple as well as the pid and the netns.
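
For illustration only, such an event could be modeled on the Go side as follows; the struct fields and the whitespace-separated line format are assumptions for this sketch, not the exact output of docker/tcpv4tracer.py:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// tcpEvent models one tracer event: the connection tuple plus pid and netns.
type tcpEvent struct {
	Type             string // "connect", "accept" or "close"
	Pid              uint32
	SrcIP, DstIP     string
	SrcPort, DstPort uint16
	NetNS            uint64 // network namespace inode
}

// parseEvent parses one whitespace-separated line, e.g.
// "connect 1234 10.0.0.1 10.0.0.2 43210 80 4026531969".
func parseEvent(line string) (tcpEvent, error) {
	f := strings.Fields(line)
	if len(f) != 7 {
		return tcpEvent{}, fmt.Errorf("unexpected field count %d", len(f))
	}
	pid, err := strconv.ParseUint(f[1], 10, 32)
	if err != nil {
		return tcpEvent{}, err
	}
	sport, _ := strconv.ParseUint(f[4], 10, 16)
	dport, _ := strconv.ParseUint(f[5], 10, 16)
	netns, _ := strconv.ParseUint(f[6], 10, 64)
	return tcpEvent{
		Type:    f[0],
		Pid:     uint32(pid),
		SrcIP:   f[2],
		DstIP:   f[3],
		SrcPort: uint16(sport),
		DstPort: uint16(dport),
		NetNS:   netns,
	}, nil
}

func main() {
	ev, err := parseEvent("connect 1234 10.0.0.1 10.0.0.2 43210 80 4026531969")
	fmt.Println(ev, err)
}
```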

The Python script's output is piped into the Scope Probe, and
probe/endpoint/ebpf.go maintains the list of connections. As with
conntrack, we keep dead connections for one iteration in order to
report short-lived connections.
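
A minimal sketch of that bookkeeping, using hypothetical names rather than the actual probe/endpoint/ebpf.go API: connections closed since the last walk are reported once more and then dropped.

```go
package main

import "fmt"

type fourTuple struct {
	srcIP, dstIP     string
	srcPort, dstPort uint16
}

type connection struct {
	tuple fourTuple
	pid   uint32
	netNS uint64
}

// tracker keeps live connections plus connections closed since the last
// report, so short-lived connections still show up in one report.
type tracker struct {
	live   map[fourTuple]connection
	closed []connection
}

func newTracker() *tracker { return &tracker{live: map[fourTuple]connection{}} }

func (t *tracker) handleConnect(c connection) { t.live[c.tuple] = c }
func (t *tracker) handleAccept(c connection)  { t.live[c.tuple] = c }

func (t *tracker) handleClose(tuple fourTuple) {
	if c, ok := t.live[tuple]; ok {
		delete(t.live, tuple)
		t.closed = append(t.closed, c)
	}
}

// walkConnections reports live and recently closed connections, then
// forgets the closed ones: they are kept for exactly one iteration.
func (t *tracker) walkConnections(f func(connection)) {
	for _, c := range t.live {
		f(c)
	}
	for _, c := range t.closed {
		f(c)
	}
	t.closed = t.closed[:0]
}

func main() {
	t := newTracker()
	tup := fourTuple{"10.0.0.1", "10.0.0.2", 43210, 80}
	t.handleConnect(connection{tuple: tup, pid: 1234, netNS: 4026531969})
	t.handleClose(tup)
	t.walkConnections(func(c connection) { fmt.Printf("%+v\n", c) }) // still reported once
	t.walkConnections(func(c connection) { fmt.Printf("%+v\n", c) }) // now gone
}
```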

The code for parsing /proc/$pid/net/tcp{,6} and /proc/$pid/fd/* is still
there and still used at start-up, because eBPF only brings us the events
and not the initial state. However, the /proc parsing for the initial
state is now done in the foreground instead of the background, via
newForegroundReader().
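
As a rough illustration of a synchronous start-up read (the real newForegroundReader() also walks /proc/$pid/fd/* to attribute sockets to processes, so this sketch covers only the /proc/<pid>/net/tcp half):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// readInitialState is an illustrative, foreground walk of /proc/<pid>/net/tcp
// done once at start-up, counting established connections.
func readInitialState(pid string) (established int, err error) {
	f, err := os.Open("/proc/" + pid + "/net/tcp")
	if err != nil {
		return 0, err
	}
	defer f.Close()

	s := bufio.NewScanner(f)
	s.Scan() // skip the header line
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) < 4 {
			continue
		}
		// fields[1] and fields[2] are the hex-encoded local and remote
		// address:port; fields[3] is the TCP state, "01" == ESTABLISHED.
		if fields[3] == "01" {
			established++
		}
	}
	return established, s.Err()
}

func main() {
	n, err := readInitialState("self")
	fmt.Println("established connections:", n, err)
}
```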

The Scope Probe also falls back on the old /proc parsing if eBPF is not
working (e.g. a kernel that is too old, or missing kernel headers). There
is also a flag, "probe.ebpf.connections", which disables eBPF when set to
false.

NAT resolution on connections from eBPF works the same way as it did for
connections from /proc: by using conntrack.

The Scope Docker image is bigger because we need a few more packages
for bcc:

  • weaveworks/scope in current master: 22 MB
  • weaveworks/scope with this patch: 147 MB

Limitations:

  • Does not support IPv6
  • Sets `procspied: true` on connections coming from eBPF
  • Size of the Docker images: 6 times bigger
  • Requirement on kernel headers
  • Location of kernel headers: iovisor/bcc#743

Fixes #1168 (walking /proc to obtain connections is very expensive)

Fixes #1260 (Short-lived connections not tracked for containers in shared
networking namespaces)


This is the code from @asymmetric's ebpf branch after rebasing on master and squashing it.

@2opremio (Contributor)

Fixes #1168 but unfortunately cannot be merged into master (at least as is) until #1962 is addressed due to the big image increase (from 22MB to 147MB).

@2opremio (Contributor)

@alban Is the loopback connection tracking problem seen in the last demo fixed in this PR?

@alban (Contributor, Author) commented Nov 1, 2016

@alban Is the loopback connection tracking problem seen in the last demo fixed in this PR?

No, it is not fixed. You directed me to #1780 but I didn't have time to look at it yet.

@iaguis (Contributor) commented Nov 1, 2016

Limitation:

  • For refused connections, we get a connect event without a matching close event, which makes the node appear permanently connected

WIP fix in https://github.com/kinvolk/bcc/commits/iaguis/tcp_tracer_established

@2opremio (Contributor) commented Nov 1, 2016

Are there any other limitations apart from refused connections and the loopback connections?

Can you document what testing has been done?

@alban (Contributor, Author) commented Nov 1, 2016

I could reproduce the issue with the loopback connection on the Scope master branch. In fact, I can reproduce the issue both when connecting to the 127.0.0.1 address and when connecting to the eth0 address.

I can see the two "nc" processes. When I look at /proc, I see they are connected, but Scope does not show it.

Reproducer:

apiVersion: v1
kind: Pod
metadata:
  name: connectlo
  labels:
    name: connectlo
spec:
  containers:
  - name: connectlo
    image: albanc/toolbox
    args:
      - '/bin/sh'
      - '-c'
      - "MYIP=$(ip addr show eth0 | grep \"inet\\b\" | awk '{print $2}' | cut -d/ -f1) ; echo MYIP=$MYIP ; (/bin/nc -l --keep-open -p 8081 & ) ; sleep 1 ; echo CLIENT ; (echo Hello ; sleep 1000 ; echo 'Bye' ) | /bin/nc $MYIP 8081"

So it does not seem to be a regression from this branch.

@alban (Contributor, Author) commented Nov 1, 2016

Are there any other limitations apart from refused connections and the loopback connections?

There are the other limitations listed in the description of the PR, but beyond that I am not aware of any others.

Can you document what testing has been done?

Works for me:

  • ✅ get initial connections before Scope was started
  • ✅ works well with NAT (tested with Kubernetes Services)
  • ✅ displaying connections in the container view when both the source and the destination are pods with multiple containers

Not tested:

  • ❓ falling back to the old parsing method when ebpf does not work at all (e.g. old kernel)
  • ❓ falling back to the old parsing method when ebpf stops working (e.g. python script crashed - I have not seen that happening, but fallback would be nice)
  • ❓ connection between 2 containers using the same network namespace (It should work but I don't remember the result of that test)

Does not work:

  • ❌ IPv6

@2opremio (Contributor) commented Nov 2, 2016

I could reproduce the issue with the loopback connection with the Scope master branch

Ah, I see. Could you please create a separate issue for it?

@alban (Contributor, Author) commented Nov 3, 2016

Ah, I see. Could you please create a separate issue for it?

Done in #1972

@alban (Contributor, Author) commented Nov 14, 2016

To compare the performance between the master branch and the ebpf branch, I used the following script, dialer.sh. It reads the CPU time spent by the scope-probe process in kernel space and user space from /proc. It takes two measurements: the first 10 seconds and the next 10 seconds. I only consider the second measurement so that we don't count the initialization phase. The script makes sure there are 1000 TCP connections established to an nginx container.

  • master branch: 3.5 seconds
  • golang-ebpf branch: 4.5 seconds
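
For reference, the kind of measurement dialer.sh makes can be sketched in Go: read utime and stime (in clock ticks, usually 100 per second) from /proc/<pid>/stat and compare two samples. dialer.sh itself is a shell script and may extract the values differently; this is only illustrative.

```go
package main

import (
	"fmt"
	"io/ioutil"
	"os"
	"strconv"
	"strings"
)

// cpuTicks returns the user and system CPU time of a process in clock ticks
// (usually 100 per second), read from /proc/<pid>/stat.
func cpuTicks(pid int) (utime, stime uint64, err error) {
	b, err := ioutil.ReadFile(fmt.Sprintf("/proc/%d/stat", pid))
	if err != nil {
		return 0, 0, err
	}
	// The comm field (field 2) is wrapped in parentheses and may contain
	// spaces, so split after the closing parenthesis.
	rest := string(b)
	if i := strings.LastIndex(rest, ")"); i >= 0 && i+2 < len(rest) {
		rest = rest[i+2:]
	}
	f := strings.Fields(rest)
	if len(f) < 13 {
		return 0, 0, fmt.Errorf("unexpected /proc/<pid>/stat format")
	}
	// Overall fields 14 (utime) and 15 (stime) are the 12th and 13th
	// fields after the comm field.
	utime, err = strconv.ParseUint(f[11], 10, 64)
	if err != nil {
		return 0, 0, err
	}
	stime, err = strconv.ParseUint(f[12], 10, 64)
	return utime, stime, err
}

func main() {
	u, s, err := cpuTicks(os.Getpid())
	fmt.Println("utime:", u, "stime:", s, "err:", err)
}
```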

This is not as expected, unfortunately. To investigate what's going on, we ran "go tool pprof". pprof does not work with Golang 1.7, so we had to revert to Golang 1.6.

Then I can edit dialer.sh to run Scope for longer and enable probe.http.listen with:

sudo VERSION=${VERSION} ./scope launch -probe.http.listen 0.0.0.0:4041

Then, we can run commands like:

go tool pprof -raw -seconds 30 http://localhost:4041/debug/pprof/heap
go tool pprof -raw -seconds 30 http://localhost:4041/debug/pprof/profile
go tool pprof ./docker/scope $HOME/pprof/pprof.localhost\:4041.samples...

See images:

  • pprof-master.svg
  • pprof-ebpf.svg

As expected, we see that spyLoop takes most of the time in master but no longer in golang-ebpf. However, the regression is in the GC-related functions in golang-ebpf: runtime.gcDrain, runtime.bgsweep and runtime.scanobject take most of the time.

Looking at the code, I don't see much difference in the way objects are allocated and released between master and golang-ebpf, so I'm still investigating.

@2opremio (Contributor) commented Nov 14, 2016

@alban It would help to get:

  • CPU flame graphs
  • An object allocation profile (with object allocation counts)
  • An explanation of what runtime.sysmon is and, in case the underlying runtime.futex is a symptom of contention, a blocking profile as well (see the sketch after this list).
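
For reference, a blocking profile in Go needs a non-zero block profile rate; a minimal standalone sketch (the probe already exposes pprof when launched with -probe.http.listen, so only the runtime call is the new part):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
	"runtime"
)

func main() {
	// Record every blocking event; a rate greater than 1 samples less often.
	runtime.SetBlockProfileRate(1)

	// Serve the profiles; the Scope probe exposes a similar endpoint when
	// launched with -probe.http.listen 0.0.0.0:4041.
	log.Fatal(http.ListenAndServe("0.0.0.0:4041", nil))
}
```

The profile can then be fetched with `go tool pprof http://localhost:4041/debug/pprof/block`.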

@2opremio (Contributor)

Side note: We should try to reduce the allocations (caching? pools?) in the functions below (see the sketch after this list):

  • makeEndpointNode
  • ReadDirNames (@alban, is this still used in the new /proc Connector code?)
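
As a sketch of the pooling idea (the node type and helpers below are hypothetical stand-ins, not the actual makeEndpointNode code):

```go
package main

import (
	"fmt"
	"sync"
)

// node is a hypothetical stand-in for whatever makeEndpointNode allocates
// per endpoint; only the pooling pattern is the point here.
type node struct {
	labels map[string]string
}

var nodePool = sync.Pool{
	New: func() interface{} { return &node{labels: make(map[string]string, 8)} },
}

func makeNode(addr string) *node {
	n := nodePool.Get().(*node)
	n.labels["addr"] = addr
	return n
}

// releaseNode clears the node and returns it to the pool for reuse,
// avoiding a fresh allocation on the next makeNode call.
func releaseNode(n *node) {
	for k := range n.labels {
		delete(n.labels, k)
	}
	nodePool.Put(n)
}

func main() {
	n := makeNode("10.0.0.1:80")
	fmt.Println(n.labels)
	releaseNode(n)
}
```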

@alban (Contributor, Author) commented Nov 14, 2016

ReadDirNames (@alban , is this still used in the new /proc Connector code?)

No, with the proc connector branch, ReadDirNames() is called only once at startup to get the initial state. Note that I am testing without the proc connector code, since it is independent from the ebpf branch.

@alban (Contributor, Author) commented Nov 15, 2016

Flamegraphs

It does not seem possible to upload flamegraph SVGs to GitHub, imgur or similar, and converting the flamegraphs to PNG loses information. I have the flamegraphs:

  • torch-master-perf.svg
  • torch-golang-ebpf.svg

Object allocation profile

  • pprof-master-alloc_objects.png
  • pprof-ebpf-alloc_objects.png

Blocking profiles

  • pprof-master-block.png
  • pprof-ebpf-block.png

@alban (Contributor, Author) commented Nov 16, 2016

On master, with my test of 1000 connections, the reporter gets 1000 connections from conntrack but only 66 connections from /proc (checked with @iaguis' patch kinvolk-archives@ec98c48).

So the test with 1000 connections is not a good comparison between master and the ebpf branch, because they don't handle the same number of connections.

I tried to change the rate limiting parameters with this patch on master. It becomes slower, but the reporter adds all the connections from /proc (instead of only 66), so it seems more correct.

  • master with this change on the rate limits: 7490ms
  • golang-ebpf: 6400ms
    (both with Golang version reverted to 1.6)

Changing the timings in this way might not be what we want to test either since it makes master run the /proc parsing more often... @2opremio any idea of what would be a better test?

@2opremio (Contributor) commented Nov 16, 2016

Why is it a bad comparison? IMO it's a good overall black-box comparison.

they don't handle the same amount of connections

They do, it's 1000 in both. The internals shouldn't matter for absolute CPU metrics.

If I am following your reasoning correctly, it seems the problem is most likely that we are duplicating work with conntrack and ebpf. Did you disable conntrack for everything except nat?

@2opremio (Contributor) commented Nov 16, 2016

Did you disable conntrack for everything except nat?

It seems you didn't:
https://github.com/kinvolk/scope/blob/alban/conn-perf-ebpf-pr-1/probe/endpoint/reporter.go#L68

alban pushed a commit to kinvolk-archives/scope that referenced this pull request Nov 22, 2016
Based on work from Lorenzo, updated by Iago and Alban

This is the second attempt to add eBPF connection tracking. The first
one was via weaveworks#1967 by forking a
python script using bcc. This one is done in Golang directly thanks to
[gobpf](https://github.com/iovisor/gobpf).

This is not enabled by default. For now, it should be enabled manually
with:
```
sudo ./scope launch --probe.ebpf.connections=true
```
Scope Probe also falls back on the old /proc parsing if eBPF is not
working (e.g. too old kernel, or missing kernel headers).

This allows scope to get notified of every connection event, without
relying on the parsing of /proc/$pid/net/tcp{,6} and /proc/$pid/fd/*,
and therefore improve performance.

The eBPF program is in probe/endpoint/ebpf.go. It was discussed in bcc
via iovisor/bcc#762.
It is using kprobes on the following kernel functions:
- tcp_v4_connect
- inet_csk_accept
- tcp_close

It generates "connect", "accept" and "close" events containing the
connection tuple but also the pid and the netns.

probe/endpoint/ebpf.go maintains the list of connections. Similarly to
conntrack, we keep the dead connections for one iteration in order to
report the short-lived connections.

The code for parsing /proc/$pid/net/tcp{,6} and /proc/$pid/fd/* is still
there and still used at start-up because eBPF only brings us the events
and not the initial state. However, the /proc parsing for the initial
state is now done in foreground instead of background, via
newForegroundReader().

NAT resolutions on connections from eBPF works in the same way as it did
on connections from /proc: by using conntrack. One of the two conntrack
instances was removed since eBPF is able to get short-lived connections.

The Scope Docker image is bigger because we need a few more packages
for bcc:
- weaveworks/scope in current master: 22 MB (compressed), 71 MB
  (uncompressed)
- weaveworks/scope with this patch:   97 MB (compressed), 363 MB
  (uncompressed)

But @iaguis has ongoing work to reduce the size of the image.

Limitations:
- [ ] Does not support IPv6
- [ ] Sets `procspied: true` on connections coming from eBPF
- [ ] Size of the Docker images
- [ ] Requirement on kernel headers for now
- [ ] Location of kernel headers: iovisor/bcc#743

Fixes weaveworks#1168 (walking /proc to obtain connections is very expensive)

Fixes weaveworks#1260 (Short-lived connections not tracked for containers in
shared networking namespaces)
alban pushed a commit to kinvolk-archives/scope that referenced this pull request Nov 22, 2016
Based on work from Lorenzo, updated by Iago and Alban

This is the second attempt to add eBPF connection tracking. The first
one was via weaveworks#1967 by forking a
python script using bcc. This one is done in Golang directly thanks to
[gobpf](https://github.com/iovisor/gobpf).

This is not enabled by default. For now, it should be enabled manually
with:
```
sudo ./scope launch --probe.ebpf.connections=true
```
Scope Probe also falls back on the old /proc parsing if eBPF is not
working (e.g. too old kernel, or missing kernel headers).

This allows scope to get notified of every connection event, without
relying on the parsing of /proc/$pid/net/tcp{,6} and /proc/$pid/fd/*,
and therefore improve performance.

The eBPF program is in probe/endpoint/ebpf.go. It was discussed in bcc
via iovisor/bcc#762.
It is using kprobes on the following kernel functions:
- tcp_v4_connect
- inet_csk_accept
- tcp_close

It generates "connect", "accept" and "close" events containing the
connection tuple but also the pid and the netns.

probe/endpoint/ebpf.go maintains the list of connections. Similarly to
conntrack, we keep the dead connections for one iteration in order to
report the short-lived connections.

The code for parsing /proc/$pid/net/tcp{,6} and /proc/$pid/fd/* is still
there and still used at start-up because eBPF only brings us the events
and not the initial state. However, the /proc parsing for the initial
state is now done in foreground instead of background, via
newForegroundReader().

NAT resolutions on connections from eBPF works in the same way as it did
on connections from /proc: by using conntrack. One of the two conntrack
instances was removed since eBPF is able to get short-lived connections.

The Scope Docker image is bigger because we need a few more packages
for bcc:
- weaveworks/scope in current master:  22 MB (compressed),  71 MB
  (uncompressed)
- weaveworks/scope with this patchset: 83 MB (compressed), 223 MB
  (uncompressed)

But @iaguis has ongoing work to reduce the size of the image.

Limitations:
- [ ] Does not support IPv6
- [ ] Sets `procspied: true` on connections coming from eBPF
- [ ] Size of the Docker images
- [ ] Requirement on kernel headers for now
- [ ] Location of kernel headers: iovisor/bcc#743

Fixes weaveworks#1168 (walking /proc to obtain connections is very expensive)

Fixes weaveworks#1260 (Short-lived connections not tracked for containers in
shared networking namespaces)
alban pushed a commit to kinvolk-archives/scope that referenced this pull request Nov 22, 2016
alepuccetti pushed a commit to kinvolk-archives/scope that referenced this pull request Dec 7, 2016
alepuccetti pushed a commit to kinvolk-archives/scope that referenced this pull request Dec 7, 2016
alepuccetti pushed a commit to kinvolk-archives/scope that referenced this pull request Dec 7, 2016
iaguis pushed a commit to kinvolk-archives/scope that referenced this pull request Dec 7, 2016
@alban (Contributor, Author) commented Dec 8, 2016

This PR #1967 (ebpf using bcc/python) is superseded by #2024 (ebpf using bcc/gobpf) and #2070 (ebpf in Go without bcc).

@alban closed this Dec 8, 2016