connection to dead process associated with different process #2638

Closed
rade opened this issue Jun 24, 2017 · 1 comment · Fixed by #2639
Labels: accuracy, bug

Comments

rade commented Jun 24, 2017

Running

$ scope launch --weave=false --probe.ebpf.connections=false
$ tcpserver -v 0 1122 echo foo

and then

$ nc localhost 1122

produces
[screenshot from 2017-06-24 08-36-45]
i.e. the connection is erroneously associated with the chrome process.

rade added the accuracy and bug labels Jun 24, 2017
rade commented Jun 24, 2017

During other occurrences of the problem, the destination endpoint got associated with other processes I have running, such as firefox or dropbox. In all cases these are long-running processes, so I don't think the problem is due to pid re-use.

tcpserver forks for new connections, executing the specified program. In the above example the new process terminates quickly, so the connection really shouldn't show up at all in scope since the process associated with the destination is long gone (though the source process remains alive).

When instead running with

$ tcpserver -v 0 1122 cat

which keeps the process running, the endpoint gets associated with the cat process:
[screenshot from 2017-06-24 09-37-19]
That kinda makes sense, but note that when running with eBPF connection tracking enabled...

$ scope launch --weave=false

the endpoint gets associated with the tcpserver process:
[screenshot from 2017-06-24 09-59-01]
This happens for both the echo foo and cat tests.

Here are some relevant details from the reports when the problem occurs...

For the first ~16s the connection doesn't show up. During that time the report contains

"xps;127.0.0.1;55794": {
  "adjacency": ["xps;127.0.0.1;1122"],
  "latest": {
    "conntracked": "true"
  },
  "topology": "endpoint"
},
"xps;127.0.0.1;1122": {
  "latest": {
    "conntracked": "true"
  },
  "topology": "endpoint"
},

i.e. the endpoints are conntracked and lack pids.
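
For reference, each of these excerpts is a node of roughly the following shape; this is a sketch inferred from the excerpts themselves, not scope's actual report types:

package main

import (
	"encoding/json"
	"fmt"
)

// Node approximates the endpoint entries shown above, keyed by strings
// like "host;address;port" (or "host-netnsid;address;port" once procspied).
type Node struct {
	Adjacency []string          `json:"adjacency,omitempty"` // outbound edges
	Latest    map[string]string `json:"latest"`              // "conntracked", "procspied", "pid", ...
	Topology  string            `json:"topology"`            // "endpoint" or "process"
}

func main() {
	raw := `{"adjacency":["xps;127.0.0.1;1122"],"latest":{"conntracked":"true"},"topology":"endpoint"}`
	var n Node
	if err := json.Unmarshal([]byte(raw), &n); err != nil {
		panic(err)
	}
	// A conntracked endpoint has no "pid" key in latest.
	fmt.Println(n.Topology, n.Latest["conntracked"], n.Latest["pid"] == "")
}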

Once the connection does show up, the report contains this:

"xps-4026531969;127.0.0.1;55794": {
  "adjacency": ["xps-4026531969;127.0.0.1;1122"],
  "latest": {
    "pid": "20692",
    "procspied": "true"
  },
  "topology": "endpoint"
},
"xps-4026531969;127.0.0.1;1122": {
  "latest": {
    "pid": "17057",
    "procspied": "true"
  },
  "topology": "endpoint"
},

So the endpoints are procspied now and have pids.

The connection disappears after another 70s, at which point the endpoints look like this:

"xps-4026531969;127.0.0.1;55794": {
  "adjacency": ["xps-4026531969;127.0.0.1;1122"],
  "latest": {
    "pid": "20692",
    "procspied": "true"
  },
  "topology": "endpoint"
},
"xps-4026531969;127.0.0.1;1122": {
  "latest": {
    "procspied": "true"
  },
  "topology": "endpoint"
},

i.e. the destination endpoint is no longer associated with a process.

Throughout, the process topology contains

"xps;20692": {
  "latest": {
    "name": "nc"
  },
  "topology": "process"
},
"xps;17057": {
  "latest": {
    "name": "/opt/google/chrome/chrome"
  },
  "topology": "process"
},

So, as we can see, while the connection was showing up in the UI, the destination endpoint was indeed mysteriously associated with the chrome process.

rade self-assigned this Jun 24, 2017
rade added a commit that referenced this issue Jun 24, 2017
ProcNet.Next does not allocate Connection structs, for efficiency.
Instead it always returns a *Connection pointing to the same instance.
As a result, any mutations by the caller to struct elements that
aren't actually set by ProcNet.Next, in particular Connection.Proc,
are carried across to subsequent calls.

This had hilarious consequences: connections referencing an inode
which we hadn't come across during proc walking would be associated
with the process corresponding to the last successfully looked up
inode.

The fix is to clear out the garbage left over from previous calls.

Fixes #2638.
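
To make the mechanism concrete, here is a minimal, self-contained Go sketch of the same class of bug, using hypothetical simplified types (ProcNet, Connection, Proc) rather than scope's actual ones:

package main

import "fmt"

// Proc identifies the process owning a connection (hypothetical shape).
type Proc struct {
	PID  int
	Name string
}

// Connection is the record yielded for each /proc/net/tcp row.
type Connection struct {
	LocalPort uint16
	Inode     uint64
	Proc      Proc // filled in by the caller, not by Next
}

// ProcNet iterates over connection rows, reusing a single Connection
// value to avoid allocating one per row.
type ProcNet struct {
	rows []Connection
	i    int
	conn Connection // the one reused instance
}

// Next returns a pointer to the same Connection on every call.
func (p *ProcNet) Next() *Connection {
	if p.i >= len(p.rows) {
		return nil
	}
	row := p.rows[p.i]
	p.i++
	p.conn.LocalPort = row.LocalPort
	p.conn.Inode = row.Inode
	// The bug: Next never touches p.conn.Proc, so the caller's write
	// from the previous row survives. The fix is to reset it here:
	// p.conn.Proc = Proc{}
	return &p.conn
}

func main() {
	// Only inode 1001 was seen during proc walking.
	inodeToProc := map[uint64]Proc{1001: {PID: 17057, Name: "chrome"}}
	pn := &ProcNet{rows: []Connection{
		{LocalPort: 443, Inode: 1001},  // resolvable inode
		{LocalPort: 1122, Inode: 2002}, // inode we never came across
	}}
	for c := pn.Next(); c != nil; c = pn.Next() {
		if proc, ok := inodeToProc[c.Inode]; ok {
			c.Proc = proc // caller mutates the shared struct
		}
		fmt.Printf("port %d -> %+v\n", c.LocalPort, c.Proc)
	}
	// Without the reset, port 1122 is reported as belonging to chrome,
	// i.e. the process from the last successfully looked-up inode.
}

Uncommenting the reset line makes the second row report an empty Proc, which is the behavior the fix restores.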
rade added a commit that referenced this issue Jun 24, 2017
rade added a commit that referenced this issue Jun 25, 2017
rade added a commit that referenced this issue Jun 26, 2017
ensure connections from /proc/net/tcp{,6} get the right pid

Fixes #2638.
2opremio pushed a commit that referenced this issue Jul 5, 2017