Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v0.25.0 with grpcio==1.3.5 hangs in completion_queue.poll() #3450

Closed
yaalaa opened this issue May 23, 2017 · 14 comments
Closed

v0.25.0 with grpcio==1.3.5 hangs in completion_queue.poll() #3450

yaalaa opened this issue May 23, 2017 · 14 comments
Assignees
Labels
grpc type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@yaalaa
Copy link

yaalaa commented May 23, 2017

I'm not sure that's the right place to file this issue. It could be something well-known one of another project. Nevertheless here it is:
I'm running Google App Engine Flexible VM.
The webservice uses gunicorn + gevent + flask chain.
It uses this lib v0.25.0
Yesterday, upon new deployment, I've faced with old and relatively seldom issue - it hangs in completion_queue.poll() in deep of grpcio.
I have been seeing such an issue from time to time.
However it has become a nightmare with recent deployment. Workers end up in critical timeout and been rebooted all the way.
And here's what I've found out:

  • my recent deployment picks grpcio==1.3.5 - it hangs all the way
  • I've added explicit dependency to grpcio==1.3.0 and things go well.
@dhermes dhermes added the grpc label May 24, 2017
@dhermes
Copy link
Contributor

dhermes commented May 24, 2017

/cc @nathanielmanistaatgoogle

@dhermes
Copy link
Contributor

dhermes commented May 24, 2017

@yaalaa Thanks for taking the time to report this!

Which subpackage are you using (other than just google-cloud==0.25.0)? Also, can you provide a code snippet that causes this failure (or describe a little bit of what is failing)?

@dhermes dhermes added the type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. label May 24, 2017
@yaalaa
Copy link
Author

yaalaa commented May 24, 2017

Here's the pip freeze output that hangs (Google-specific highlighted):

gapic-google-cloud-datastore-v1==0.15.3
gapic-google-cloud-error-reporting-v1beta1==0.15.3
gapic-google-cloud-logging-v2==0.91.3
gapic-google-cloud-pubsub-v1==0.15.4
gapic-google-cloud-spanner-admin-database-v1==0.15.3
gapic-google-cloud-spanner-admin-instance-v1==0.15.3
gapic-google-cloud-spanner-v1==0.15.3
gapic-google-cloud-speech-v1beta1==0.15.3
gapic-google-cloud-vision-v1==0.90.3
google-auth==1.0.1
google-auth-httplib2==0.0.2
google-cloud==0.25.0
google-cloud-bigquery==0.24.0
google-cloud-bigtable==0.24.0
google-cloud-core==0.24.1
google-cloud-datastore==1.0.0
google-cloud-dns==0.24.0
google-cloud-error-reporting==0.24.2
google-cloud-language==0.24.1
google-cloud-logging==1.0.0
google-cloud-monitoring==0.24.0
google-cloud-pubsub==0.25.0
google-cloud-resource-manager==0.24.0
google-cloud-runtimeconfig==0.24.0
google-cloud-spanner==0.24.2
google-cloud-speech==0.24.0
google-cloud-storage==1.1.1
google-cloud-translate==0.24.0
google-cloud-vision==0.24.0
google-gax==0.15.13
google-resumable-media==0.0.2
googleapis-common-protos==1.5.2
grpc-google-iam-v1==0.11.1
grpcio==1.3.5
oauth2client==4.0.0
proto-google-cloud-datastore-v1==0.90.4
proto-google-cloud-error-reporting-v1beta1==0.15.3
proto-google-cloud-logging-v2==0.91.3
proto-google-cloud-pubsub-v1==0.15.4
proto-google-cloud-spanner-admin-database-v1==0.15.3
proto-google-cloud-spanner-admin-instance-v1==0.15.3
proto-google-cloud-spanner-v1==0.15.3
proto-google-cloud-speech-v1beta1==0.15.3
proto-google-cloud-vision-v1==0.90.3
protobuf==3.3.0

@yaalaa
Copy link
Author

yaalaa commented May 24, 2017

I've looked at the logs. Sometimes it's hard to describe an use-case.
I think my app uses Google Cloud Datastore only.

Here's an example of how worker feels at the end:

16:21:06 gunicorn.1  | [2017-05-23 16:21:06 +0000] [14] [CRITICAL] WORKER TIMEOUT (pid:1335)
16:21:06 gunicorn.1  | [2017-05-23 16:21:06 +0000] [1335] [INFO] gunicorn_config.worker_abort:
16:21:06 gunicorn.1  | [2017-05-23 16:21:06 +0000] [1335] [DEBUG]
16:21:06 gunicorn.1  | # Thread: (140447432369920)
16:21:06 gunicorn.1  | File: "/env/local/lib/python2.7/site-packages/gevent/threadpool.py", line 200, in _worker
16:21:06 gunicorn.1  |   task = task_queue.get()
16:21:06 gunicorn.1  | File: "/env/local/lib/python2.7/site-packages/gevent/_threading.py", line 437, in get
16:21:06 gunicorn.1  |   self.not_empty.wait()
16:21:06 gunicorn.1  | File: "/env/local/lib/python2.7/site-packages/gevent/_threading.py", line 153, in wait
16:21:06 gunicorn.1  |   waiter.acquire()
16:21:06 gunicorn.1  |
16:21:06 gunicorn.1  | # Thread: (140447440762624)
16:21:06 gunicorn.1  | File: "/env/local/lib/python2.7/site-packages/gevent/threadpool.py", line 200, in _worker
16:21:06 gunicorn.1  |   task = task_queue.get()
16:21:06 gunicorn.1  | File: "/env/local/lib/python2.7/site-packages/gevent/_threading.py", line 437, in get
16:21:06 gunicorn.1  |   self.not_empty.wait()
16:21:06 gunicorn.1  | File: "/env/local/lib/python2.7/site-packages/gevent/_threading.py", line 153, in wait
16:21:06 gunicorn.1  |   waiter.acquire()
16:21:06 gunicorn.1  |
16:21:06 gunicorn.1  | # Thread: (140447547774720)
16:21:06 gunicorn.1  | File: "/env/local/lib/python2.7/site-packages/gevent/greenlet.py", line 536, in run
16:21:06 gunicorn.1  |   result = self._run(*self.args, **self.kwargs)
16:21:06 gunicorn.1  | File: "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
16:21:06 gunicorn.1  |   self.__bootstrap_inner()
16:21:06 gunicorn.1  | File: "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
16:21:06 gunicorn.1  |   self.run()
16:21:06 gunicorn.1  | File: "/usr/lib/python2.7/threading.py", line 763, in run
16:21:06 gunicorn.1  |   self.__target(*self.__args, **self.__kwargs)
16:21:06 gunicorn.1  | File: "/env/local/lib/python2.7/site-packages/grpc/_channel.py", line 852, in _poll_connectivity
16:21:06 gunicorn.1  |   event = completion_queue.poll()
16:21:06 gunicorn.1  | File: "/env/local/lib/python2.7/site-packages/gunicorn/workers/base.py", line 191, in handle_abort
16:21:06 gunicorn.1  |   self.cfg.worker_abort(self)
16:21:06 gunicorn.1  | File: "gunicorn_config.py", line 41, in worker_abort
16:21:06 gunicorn.1  |   for filename, lineno, name, line in traceback.extract_stack( stack ):

@yaalaa
Copy link
Author

yaalaa commented May 24, 2017

I suppose each worker timeout is preceded by query or get.

@tcroiset
Copy link

We have the same issue (and indeed grpcio==1.3.0 fix it)

@dhermes
Copy link
Contributor

dhermes commented May 31, 2017

@nathanielmanistaatgoogle Bump

@dhermes
Copy link
Contributor

dhermes commented May 31, 2017

@tcroiset Which subpackage are you using? (Is it google-cloud-datastore or something else?) Could you provide a code snippet that fails?

@tcroiset
Copy link

It's google-cloud
The service is a custom flex engine based on gcr.io/google_appengine/python

It's part of a big project but I'll try by tomorrow to extract only the relevant code and reproduce the issue

@dhermes
Copy link
Contributor

dhermes commented May 31, 2017

Thanks @tcroiset.


It's worth pointing out that google-cloud doesn't actually have any code in it. It is literally just a setup.py, a README and some other package artifacts:

$ wget https://pypi.python.org/packages/10/a8/\
> 94cb99ec6e61814fc6e1093f9b81570b998028b7a5a9a6efe3ebc116acf7/\
> google-cloud-0.25.0.tar.gz#md5=59fa23b57e806e267153ac041c834fbc
$
$ tar xzvf google-cloud-0.25.0.tar.gz
google-cloud-0.25.0/
google-cloud-0.25.0/google_cloud.egg-info/
google-cloud-0.25.0/google_cloud.egg-info/not-zip-safe
google-cloud-0.25.0/google_cloud.egg-info/top_level.txt
google-cloud-0.25.0/google_cloud.egg-info/dependency_links.txt
google-cloud-0.25.0/google_cloud.egg-info/PKG-INFO
google-cloud-0.25.0/google_cloud.egg-info/requires.txt
google-cloud-0.25.0/google_cloud.egg-info/SOURCES.txt
google-cloud-0.25.0/MANIFEST.in
google-cloud-0.25.0/PKG-INFO
google-cloud-0.25.0/setup.cfg
google-cloud-0.25.0/setup.py
google-cloud-0.25.0/README.rst

So in general, you may be happier being explicit with the google-cloud-${SERVICE} packages you use (these are the ones that actually have code in them).

@nathanielmanistaatgoogle

We've known for a while that gRPC Python doesn't yet play nicely with gevent (and gunicorn? I don't know as much about gunicorn), but the most likely change responsible for what you're seeing is pull request 11154. It added a use of threading, and gevent breaks use of threading, so... there's probably not an immediate or readily available solution other than pinning the grpcio dependency to the version that happens to work for you.

@kpayson64, anything to add that I've missed?

@dhermes
Copy link
Contributor

dhermes commented May 31, 2017

Thanks for the pointers @nathanielmanistaatgoogle

@lukesneeringer
Copy link
Contributor

As this seems to be an upstream dependency issue, closing here.

@AnilkumarKomanduru
Copy link

Hi,
Im using ttn version 2.1.1, and still seeing same issue,
I did not use any certificate for your info for testing.

gapic-google-cloud-datastore-v1==0.15.3
gapic-google-cloud-error-reporting-v1beta1==0.15.3
gapic-google-cloud-logging-v2==0.91.3
gapic-google-cloud-pubsub-v1==0.15.4
gapic-google-cloud-spanner-admin-database-v1==0.15.3
gapic-google-cloud-spanner-admin-instance-v1==0.15.3
gapic-google-cloud-spanner-v1==0.15.3
gapic-google-cloud-speech-v1beta1==0.15.3
gapic-google-cloud-vision-v1==0.90.3
gitdb==0.6.4
gitdb2==2.0.2
gitpython==2.1.10
google-api-core==0.1.4
google-api-python-client==1.7.3
google-auth==1.5.0
google-auth-httplib2==0.0.3
google-cloud==0.33.1
google-cloud-bigquery==0.28.0
google-cloud-bigquery-datatransfer==0.1.1
google-cloud-bigtable==0.28.1
google-cloud-container==0.1.1
google-cloud-core==0.28.1
google-cloud-datastore==1.4.0
google-cloud-dns==0.28.0
google-cloud-error-reporting==0.28.0
google-cloud-firestore==0.28.0
google-cloud-language==1.0.2
google-cloud-logging==1.4.0
google-cloud-monitoring==0.28.1
google-cloud-pubsub==0.30.1
google-cloud-resource-manager==0.28.1
google-cloud-runtimeconfig==0.28.1
google-cloud-spanner==0.29.0
google-cloud-speech==0.30.0
google-cloud-storage==1.6.0
google-cloud-trace==0.17.0
google-cloud-translate==1.3.1
google-cloud-videointelligence==1.0.1
google-cloud-vision==0.29.0
google-gax==0.15.16
google-resumable-media==0.3.1
googleapis-common-protos==1.5.3
grpc-google-iam-v1==0.11.4
grpcio==1.7.3
httplib2==0.11.3
idna==2.7
imagesize==0.7.1
intelhex==2.1
Jinja2==2.9.6
MarkupSafe==1.0
oauth2client==3.0.0
packaging==17.1
paho-mqtt==1.3.1
ply==3.8
proto-google-cloud-datastore-v1==0.90.4
proto-google-cloud-error-reporting-v1beta1==0.15.3
proto-google-cloud-logging-v2==0.91.3
proto-google-cloud-pubsub-v1==0.15.4
proto-google-cloud-spanner-admin-database-v1==0.15.3
proto-google-cloud-spanner-admin-instance-v1==0.15.3
proto-google-cloud-spanner-v1==0.15.3
proto-google-cloud-speech-v1beta1==0.15.3
proto-google-cloud-vision-v1==0.90.3
protobuf==3.6.0
psutil==5.4.6
PTable==0.9.2
pyasn1==0.4.3
pyasn1-modules==0.2.1
pycparser==2.18
Pygments==2.2.0
pyparsing==2.2.0
pyreadline==2.1
pyserial==3.3
python-jose==3.0.0
pytz==2017.2
PyYAML==3.12
requests==2.19.1
rsa==3.4.2
selenium==2.53.6
six==1.10.0
smmap==0.9.0
smmap2==2.0.3
snowballstemmer==1.2.1
Sphinx==1.5.1
spider==3.3.1
ttn==2.1.1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
grpc type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
Development

No branches or pull requests

6 participants