This repository has been archived by the owner on Aug 21, 2020. It is now read-only.

Watch seems to leak connections #15

Closed
nelljerram opened this issue Mar 29, 2018 · 8 comments · Fixed by #16

Comments

@nelljerram
Contributor

Run a local etcd server on port 2379, e.g. with docker run --net=host quay.io/coreos/etcd.

Create an Etcd3Client and make an initial connection to the server:

(py27) neil@smaug:~/calico/networking-calico$ python
Python 2.7.14+ (default, Mar 13 2018, 15:23:44) 
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from etcd3gw.client import Etcd3Client
>>> c = Etcd3Client()
>>> c.status()
{u'raftTerm': u'2', u'header': {u'raft_term': u'2', u'revision': u'1', u'cluster_id': u'14841639068965178418', u'member_id': u'10276657743932975437'}, u'version': u'3.2.9', u'raftIndex': u'4', u'dbSize': u'24576', u'leader': u'10276657743932975437'}

Find out the PID of that python shell, and then in another window monitor its etcd connections with

watch 'lsof -p 6770 | grep 2379'

You'll initially see that there is 1 connection:

python  6770 neil    3u  IPv4 837856      0t0      TCP localhost:40306->localhost:2379 (ESTABLISHED)

Start and then cancel a watch:

>>> stream, cancel = c.watch_prefix("/")
>>> cancel()

The lsof output still shows just 1 connection.

However, now, for each further iteration of

>>> stream, cancel = c.watch_prefix("/")
>>> cancel()

we see 1 more ESTABLISHED connection. For example, after 8 more iterations, there are now 9 ESTABLISHED connections:

python  6770 neil    3u  IPv4 837856      0t0      TCP localhost:40306->localhost:2379 (ESTABLISHED)
python  6770 neil    4u  IPv4 852249      0t0      TCP localhost:40374->localhost:2379 (ESTABLISHED)
python  6770 neil    5u  IPv4 851242      0t0      TCP localhost:40386->localhost:2379 (ESTABLISHED)
python  6770 neil    6u  IPv4 880526      0t0      TCP localhost:40580->localhost:2379 (ESTABLISHED)
python  6770 neil    7u  IPv4 884101      0t0      TCP localhost:40584->localhost:2379 (ESTABLISHED)
python  6770 neil    8u  IPv4 884108      0t0      TCP localhost:40588->localhost:2379 (ESTABLISHED)
python  6770 neil   10u  IPv4 884113      0t0      TCP localhost:40592->localhost:2379 (ESTABLISHED)
python  6770 neil   11u  IPv4 885316      0t0      TCP localhost:40596->localhost:2379 (ESTABLISHED)
python  6770 neil   12u  IPv4 880574      0t0      TCP localhost:40600->localhost:2379 (ESTABLISHED)

I believe this pattern continues without any bound, because in an overnight OpenStack churn test with networking-calico, the DHCP agent (which uses etcd3gw, and creates and cancels watches as above) was found to have more than 900 open connections to etcd.

@nelljerram
Contributor Author

I guess it's this, from http://docs.python-requests.org/en/latest/user/advanced/#keep-alive:

Note that connections are only released back to the pool for reuse once all body data has been read; be sure to either set stream to False or read the content property of the Response object.

But since we don't tell the server that we're no longer interested in the watch, I'm not sure there's a correct definition of "once all body data has been read", and in practice accessing <watcher>._response.content hangs for me.

I guess this is a known problem, and that this -

            if six.PY2:
                self._response.raw._fp.close()

is trying to work around it by reaching in and closing the socket by hand. But apparently it isn't working (with Python 2, which is what I'm using).
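For reference, a shutdown-based force-close is the shape of fix that ended up working (see the discussion below). A minimal sketch of the idea, as my own illustration rather than the actual etcd3gw code: shutdown() tears down the TCP conversation itself, even while requests/urllib3 still holds other references to the descriptor, whereas close() only drops one reference.

```python
import socket

def force_close(sock):
    """Tear down a connection even if other objects still reference it.

    Hypothetical helper: shutdown() ends the TCP conversation itself
    (the peer sees EOF at once), then close() releases our reference
    to the descriptor.
    """
    try:
        sock.shutdown(socket.SHUT_RDWR)
    except OSError:
        pass  # already disconnected
    sock.close()
```
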

@dims
Owner

dims commented Mar 31, 2018

hey @neiljerram do you have time to throw a debugger at it? else i'll try to get to it next week

@nelljerram
Contributor Author

@dims Thanks for reading and commenting! I'm happy to try debugging if you can give me a bit more guidance on what to do or look for... at the moment I'm afraid I don't know what you have in mind.

@dims
Owner

dims commented Apr 1, 2018

@neiljerram oh, sorry: stick the debugger on the self._response.raw._fp.close() line and step in, to see whether it actually closes the file pointer or throws an exception somewhere that leaves it open.

@fasaxc
Contributor

fasaxc commented Apr 3, 2018

Hi @dims, I work with Neil; I'm picking this up and trying to find a workaround. I believe that close() is being called on the socket, but the connection is still stuck open. I find that the code in the Python 3 branch of that function does seem to work on Python 2.7.12 as well; the call to shutdown() seems to do the trick.

I'm guessing you put the conditional there for a reason though?

From the python docs:

Note close() releases the resource associated with a connection but does not necessarily close the connection immediately. If you want to close the connection in a timely fashion, call shutdown() before close().

It doesn't seem like shutdown() is new; it looks like it's always been there.
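That distinction is easy to demonstrate with a plain socket pair: when a second reference to the descriptor exists (much as a pooled connection object can keep one alive), close() alone does not end the conversation, but shutdown() does. A small self-contained sketch:

```python
import socket

a, b = socket.socketpair()
extra = a.dup()        # a second reference, like a pooled connection object
a.close()              # close() just drops one reference...
b.setblocking(False)
try:
    b.recv(1)
    peer_saw_eof = True
except BlockingIOError:
    peer_saw_eof = False
assert not peer_saw_eof  # ...so the peer still sees the stream as open

extra.shutdown(socket.SHUT_RDWR)  # shutdown() ends the conversation itself
assert b.recv(1) == b''           # peer now sees EOF immediately
extra.close()
b.close()
```
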

@dims
Owner

dims commented Apr 3, 2018

@fasaxc if that works, then let's switch over to that. please make sure the unit tests work too. if i remember right, i had trouble with the unit tests too w/o that messy hack

fasaxc added a commit to fasaxc/etcd3-gateway that referenced this issue Apr 4, 2018
The previous technique used to try to force-close the
watch socket didn't work on (at least) Python 2.7.12 but the
old Python 3.x workaround seems to work fine on Python 2.7 as
well. Use that.

Fixes dims#15
@fasaxc
Contributor

fasaxc commented Apr 4, 2018

I put up a PR; tox -e py27 passes for me locally. I noticed that the contributor guidelines say to follow the OpenStack process (via gerrit), is this an OpenStack project?

@dims
Owner

dims commented Apr 4, 2018

@fasaxc at some point i'll propose this to oslo. until then, this is my personal repo.

@dims dims closed this as completed in #16 Apr 4, 2018
dims pushed a commit that referenced this issue Apr 4, 2018
The previous technique used to try to force-close the
watch socket didn't work on (at least) Python 2.7.12 but the
old Python 3.x workaround seems to work fine on Python 2.7 as
well. Use that.

Fixes #15