Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

connections created during eBPF initialisation can be missed #2689

Closed
rade opened this issue Jul 6, 2017 · 5 comments
Closed

connections created during eBPF initialisation can be missed #2689

rade opened this issue Jul 6, 2017 · 5 comments
Assignees
Labels
accuracy Incorrect information is being shown to the user; usually a bug bug Broken end user or developer functionality; not working as the developers intended it
Milestone

Comments

@rade
Copy link
Member

rade commented Jul 6, 2017

eBPF-based connection tracking proceeds in the following stages:

  1. create a tracer.NewTracer. This will kick off some go-routines that send events to our EbfTracker by invoking tcpEventCbV4. However, we ignore these events to start with (this is controlled via the readyToHandleConnections boolean member)
  2. kick off a go-routine that walks /proc, to obtain information about processes and the sockets they own
  3. obtain NAT information via a conntrack run
  4. obtain the results of the /proc walk from (2)
  5. read /net/tcp{6}, create an iterator that produces connection information from the latter, annotated with pids from the former.
  6. enable processing of eBPF events
  7. iterate through the connections via the iterator from (5), feeding them as pseudo 'Accept' and 'Connect' events to the EbpfTracker.

There are several ways in which connections created during this process can be missed, even when they are relatively long-lived, i.e. survive past step 7.

  • if the connection is created between step 5 and 6, we miss it completely
  • if the connection is created before step 5, but the connecting/accepting process only started up after step 4 (or indeed during the proc walk we kicked off in 2, and we just missed it), then we see the connection but it will be lacking PIDs

We have seen a number of test failures in the 330 and 340 tests in CircleCI, and @2opremio has done some instrumentation in #2674 to help us track down the cause. Analysis of the scope reports produced in run 7608 strongly suggests we are hitting the 2nd case since the report contains the expected edge but without associated PIDs at the endpoints.

@rade rade added the bug Broken end user or developer functionality; not working as the developers intended it label Jul 6, 2017
@rade
Copy link
Member Author

rade commented Jul 6, 2017

btw, there's a small concurrency bug here too: EbpfTracker.feedInitialConnections sets readyToHandleConnections = true without taking any locks, even though the value is read by other go-routines (which do take a lock).

@rade
Copy link
Member Author

rade commented Jul 6, 2017

I am tempted to simply ditch readyToHandleConnections. Let the events flow freely from the start :)

We then need to be careful not to accidentally add back connections which get closed between steps 5 and 7. A cheap way to do that is to not add a procspied connection when its 4-tuple is in EbpfTracker.closedConnections, indicating that we've recently seen a 'Close' event.

rade added a commit that referenced this issue Jul 6, 2017
There really should be two variants - with and without eBPF - but the
latter is broken due to #2689.
@rade
Copy link
Member Author

rade commented Jul 6, 2017

To reproduce the first type of failure - connections missed completely - apply

diff --git a/probe/endpoint/connection_tracker.go b/probe/endpoint/connection_tracker.go
index 7d6f7ea2..9cd90bcc 100644
--- a/probe/endpoint/connection_tracker.go
+++ b/probe/endpoint/connection_tracker.go
@@ -2,6 +2,7 @@ package endpoint
 
 import (
        "strconv"
+       "time"
 
        log "github.com/Sirupsen/logrus"
        "github.com/weaveworks/scope/probe/endpoint/procspy"
@@ -170,6 +171,8 @@ func (t *connectionTracker) getInitialState() {
                }
        })
 
+       time.Sleep(10 * time.Second)
+
        t.ebpfTracker.feedInitialConnections(conns, seenTuples, processesWaitingInAccept, report.MakeHostNodeID(t.conf.HostID))
 }

and then run scope normally and within 10 seconds execute nc -l 1122 in one terminal and nc localhost 1122 in another. Wait for a about 15 seconds and then curl both the report and process topologies with

curl -s http://localhost:4040/api/report | jq -S '.' > /tmp/report.json
curl -s http://localhost:4040/api/topology/processes | jq -S '.' > /tmp/processes.json

Then inspect the json:

  • the full report contains both nc processes, but no endpoints referencing their pids or port 1122.
  • the process topology does not contain the nc processes (because we elide unconnected processes)

@rade
Copy link
Member Author

rade commented Jul 6, 2017

Actually, step 2 obtains the connection information. Step 5 only reads /proc/net/tcp{6} when no full proc traversal was performed.

That makes the window for the "connections present but missing pids" type failure quite narrow, moving it inside the proc walking logic.

@rade
Copy link
Member Author

rade commented Jul 6, 2017

The "connections present but missing pids" failure can be reproduced by applying

diff --git a/probe/endpoint/procspy/proc_linux.go b/probe/endpoint/procspy/proc_linux.go
index ec676072..0eccabac 100644
--- a/probe/endpoint/procspy/proc_linux.go
+++ b/probe/endpoint/procspy/proc_linux.go
@@ -254,6 +254,8 @@ func (w pidWalker) walk(buf *bytes.Buffer) (map[uint64]*Proc, error) {
                namespaces[namespaceID] = append(namespaces[namespaceID], &p)
        })
 
+       time.Sleep(10 * time.Second)
+
        for namespaceID, procs := range namespaces {
                select {
                case <-w.tickc:

and following the same steps as above.

The json inspection reveals:

  • the full report contains both nc processes, the accepting endpoint, the connecting endpoint, with an edge from the former to the latter, but no pids associated with either
  • the process topology does not contain the nc processes (because we elide unconnected processes)

@rade rade added the accuracy Incorrect information is being shown to the user; usually a bug label Jul 7, 2017
@rade rade self-assigned this Jul 7, 2017
rade added a commit that referenced this issue Jul 11, 2017
...when initialising eBPF-based connection tracking.

Previously we were ignoring all eBPF events until we had gathered the
existing connections. That means we could a) miss connections created
during the gathering, and b) fail to forget connections that got
closed during the gathering.

The fix comprises the following changes:

1. pay attention to eBPF events immediately. That way we do not
miss anything.

2. remember connections for which we received a Close event during the
initalisation phase, and subsequently drop gathered existing
connections that match these. That way we do not erroneously consider
a gathered connection as open when it got closed since the gathering.

3. drop gathered existing connections which match connections detected
through eBPF events. The latter typically have more / current
metadata. In particular, PIDs can be missing from the former.

Fixes #2689.
Fixes #2700.
rade added a commit that referenced this issue Jul 11, 2017
...when initialising eBPF-based connection tracking.

Previously we were ignoring all eBPF events until we had gathered the
existing connections. That means we could a) miss connections created
during the gathering, and b) fail to forget connections that got
closed during the gathering.

The fix comprises the following changes:

1. pay attention to eBPF events immediately. That way we do not
miss anything.

2. remember connections for which we received a Close event during the
initalisation phase, and subsequently drop gathered existing
connections that match these. That way we do not erroneously consider
a gathered connection as open when it got closed since the gathering.

3. drop gathered existing connections which match connections detected
through eBPF events. The latter typically have more / current
metadata. In particular, PIDs can be missing from the former.

Fixes #2689.
Fixes #2700.
@rade rade modified the milestone: 1.6 Jul 12, 2017
rade added a commit that referenced this issue Jul 13, 2017
don't miss, or fail to forget, initial connections

Fixes #2689.
Fixes #2700.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
accuracy Incorrect information is being shown to the user; usually a bug bug Broken end user or developer functionality; not working as the developers intended it
Projects
None yet
Development

No branches or pull requests

1 participant