This repository has been archived by the owner on Aug 21, 2020. It is now read-only.

Watch seems to leak connections #15

Closed
nelljerram opened this issue Mar 29, 2018 · 8 comments · Fixed by #16

Comments

@nelljerram
Contributor

Run a local etcd server on port 2379, e.g. with docker run --net=host quay.io/coreos/etcd.

Create an Etcd3Client and make an initial connection to the server:

(py27) neil@smaug:~/calico/networking-calico$ python
Python 2.7.14+ (default, Mar 13 2018, 15:23:44) 
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from etcd3gw.client import Etcd3Client
>>> c = Etcd3Client()
>>> c.status()
{u'raftTerm': u'2', u'header': {u'raft_term': u'2', u'revision': u'1', u'cluster_id': u'14841639068965178418', u'member_id': u'10276657743932975437'}, u'version': u'3.2.9', u'raftIndex': u'4', u'dbSize': u'24576', u'leader': u'10276657743932975437'}

Find out the PID of that python shell, and then in another window monitor its etcd connections with

watch 'lsof -p 6770 | grep 2379'

You'll initially see that there is 1 connection:

python  6770 neil    3u  IPv4 837856      0t0      TCP localhost:40306->localhost:2379 (ESTABLISHED)

Start and then cancel a watch:

>>> stream, cancel = c.watch_prefix("/")
>>> cancel()

The lsof output still shows just 1 connection.

However, now, for each further iteration of

>>> stream, cancel = c.watch_prefix("/")
>>> cancel()

we see 1 more ESTABLISHED connection. For example, after 8 more iterations, there are now 9 ESTABLISHED connections:

python  6770 neil    3u  IPv4 837856      0t0      TCP localhost:40306->localhost:2379 (ESTABLISHED)
python  6770 neil    4u  IPv4 852249      0t0      TCP localhost:40374->localhost:2379 (ESTABLISHED)
python  6770 neil    5u  IPv4 851242      0t0      TCP localhost:40386->localhost:2379 (ESTABLISHED)
python  6770 neil    6u  IPv4 880526      0t0      TCP localhost:40580->localhost:2379 (ESTABLISHED)
python  6770 neil    7u  IPv4 884101      0t0      TCP localhost:40584->localhost:2379 (ESTABLISHED)
python  6770 neil    8u  IPv4 884108      0t0      TCP localhost:40588->localhost:2379 (ESTABLISHED)
python  6770 neil   10u  IPv4 884113      0t0      TCP localhost:40592->localhost:2379 (ESTABLISHED)
python  6770 neil   11u  IPv4 885316      0t0      TCP localhost:40596->localhost:2379 (ESTABLISHED)
python  6770 neil   12u  IPv4 880574      0t0      TCP localhost:40600->localhost:2379 (ESTABLISHED)

I believe this pattern continues without any bound, because in an overnight OpenStack churn test with networking-calico, the DHCP agent (which uses etcd3gw, and creates and cancels watches as above) was found to have more than 900 open connections to etcd.

@nelljerram
Contributor Author

I guess it's this, from http://docs.python-requests.org/en/latest/user/advanced/#keep-alive:

Note that connections are only released back to the pool for reuse once all body data has been read; be sure to either set stream to False or read the content property of the Response object.

But since we don't tell the server that we're no longer interested in the watch, I'm not sure there's a correct definition of "once all body data has been read", and in practice accessing <watcher>._response.content hangs for me.

I guess this is a known problem, and that this -

            if six.PY2:
                self._response.raw._fp.close()

is trying to work around it by reaching in and closing the socket by hand. But apparently it isn't working (with Python 2, which is what I'm using).
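For reference, a shutdown-based force-close is the shape of fix that ended up working (see the discussion below). A minimal sketch of the idea, as my own illustration rather than the actual etcd3gw code: shutdown() tears down the TCP conversation itself, even while requests/urllib3 still holds other references to the descriptor, whereas close() only drops one reference.

```python
import socket

def force_close(sock):
    """Tear down a connection even if other objects still reference it.

    Hypothetical helper: shutdown() ends the TCP conversation itself
    (the peer sees EOF at once), then close() releases our reference
    to the descriptor.
    """
    try:
        sock.shutdown(socket.SHUT_RDWR)
    except OSError:
        pass  # already disconnected
    sock.close()
```
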

@dims
Owner

dims commented Mar 31, 2018

hey @neiljerram do you have time to throw a debugger at it? else i'll try to get to it next week

@nelljerram
Contributor Author

@dims Thanks for reading and commenting! I'm happy to try debugging if you can give me a bit more guidance on what to do or look for... at the moment I'm afraid I don't know what you have in mind.

@dims
Owner

dims commented Apr 1, 2018

@neiljerram oh, sorry: stick the debugger on the self._response.raw._fp.close() line and step in, to see whether it actually closes the file pointer or throws an exception somewhere that leaves it open.

@fasaxc
Contributor

fasaxc commented Apr 3, 2018

Hi @dims, I work with Neil; I'm picking this up and trying to find a workaround. I believe that close() is being called on the socket, but the connection is still stuck open. I find that the code in the Python 3 branch of that function does seem to work on Python 2.7.12 as well; the call to shutdown() seems to do the trick.

I'm guessing you put the conditional there for a reason though?

From the python docs:

Note close() releases the resource associated with a connection but does not necessarily close the connection immediately. If you want to close the connection in a timely fashion, call shutdown() before close().

It doesn't seem like shutdown() is new; it looks like it's always been there.
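That distinction is easy to demonstrate with a plain socket pair: when a second reference to the descriptor exists (much as a pooled connection object can keep one alive), close() alone does not end the conversation, but shutdown() does. A small self-contained sketch:

```python
import socket

a, b = socket.socketpair()
extra = a.dup()        # a second reference, like a pooled connection object
a.close()              # close() just drops one reference...
b.setblocking(False)
try:
    b.recv(1)
    peer_saw_eof = True
except BlockingIOError:
    peer_saw_eof = False
assert not peer_saw_eof  # ...so the peer still sees the stream as open

extra.shutdown(socket.SHUT_RDWR)  # shutdown() ends the conversation itself
assert b.recv(1) == b''           # peer now sees EOF immediately
extra.close()
b.close()
```
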

@dims
Owner

dims commented Apr 3, 2018

@fasaxc if that works, then let's switch over to that. please make sure the unit tests work too. if i remember right, i had trouble with the unit tests too w/o that messy hack

fasaxc added a commit to fasaxc/etcd3-gateway that referenced this issue Apr 4, 2018
The previous technique used to try to force-close the
watch socket didn't work on (at least) Python 2.7.12 but the
old Python 3.x workaround seems to work fine on Python 2.7 as
well. Use that.

Fixes dims#15
@fasaxc
Contributor

fasaxc commented Apr 4, 2018

I put up a PR; tox -e py27 passes for me locally. I noticed that the contributor guidelines say to follow the OpenStack process (via gerrit), is this an OpenStack project?

@dims
Owner

dims commented Apr 4, 2018

@fasaxc at some point i'll propose this to oslo. until then, this is my personal repo.

@dims dims closed this as completed in #16 Apr 4, 2018
dims pushed a commit that referenced this issue Apr 4, 2018
The previous technique used to try to force-close the
watch socket didn't work on (at least) Python 2.7.12 but the
old Python 3.x workaround seems to work fine on Python 2.7 as
well. Use that.

Fixes #15