Timed out waiting for UP message from ForkProcess #1337

Closed
gzhukov opened this issue Sep 22, 2023 · 7 comments · Fixed by #1345

Comments

@gzhukov

gzhukov commented Sep 22, 2023

Hello,
We tried to upgrade Querybook from 2.4.0 to 3.28.0. Everything went well (no errors on the webserver, migrations, etc.), but all requests to the query engine got stuck and the following errors appeared on the worker:

[2023-09-22 11:09:31,258: INFO/MainProcess] celery@bc18e9efe089 ready.
[2023-09-22 11:09:31,262: INFO/MainProcess] Task tasks.run_query.run_query_task[01756288-9a00-4d16-b7e8-17c6fb897cb7] received
[2023-09-22 11:09:31,264: INFO/MainProcess] Task tasks.run_query.run_query_task[3d377c95-0717-415e-b3a5-779d6d0bf3a0] received
[2023-09-22 11:09:32,307: INFO/ForkPoolWorker-1] POST http://querybook-elasticsearch:9200/search_query_executions_v1/_update/87508 [status:201 request:0.927s]
[2023-09-22 11:09:32,315: INFO/MainProcess] Task tasks.log_query_per_table.log_query_per_table_task[3c1a2ecd-e52b-49fd-aee6-117e25363f05] received
[2023-09-22 11:09:32,316: INFO/ForkPoolWorker-1] Task tasks.run_query.run_query_task[01756288-9a00-4d16-b7e8-17c6fb897cb7] succeeded in 1.0514106303453445s: (3, 87508)
[2023-09-22 11:09:36,553: ERROR/MainProcess] Timed out waiting for UP message from <ForkProcess(ForkPoolWorker-151, started daemon)>
[2023-09-22 11:09:36,560: ERROR/MainProcess] Process 'ForkPoolWorker-151' pid:232 exited with 'signal 9 (SIGKILL)'
[2023-09-22 11:09:40,674: ERROR/MainProcess] Timed out waiting for UP message from <ForkProcess(ForkPoolWorker-152, started daemon)>
[2023-09-22 11:09:40,680: ERROR/MainProcess] Process 'ForkPoolWorker-152' pid:233 exited with 'signal 9 (SIGKILL)'

We tried changing the query engine and starting with an empty Redis and Elasticsearch, but it made no difference.
We can find the task_id in our Redis:

127.0.0.1:6379[12]> keys *
1) "celery-task-meta-01756288-9a00-4d16-b7e8-17c6fb897cb7"
2) "unacked"
3) "_kombu.binding.celeryev"
4) "celery-task-meta-dbfaa772-15ef-4d20-93b4-b9723564270a"
5) "_kombu.binding.celery"
6) "unacked_index"
7) "_kombu.binding.celery.pidbox"
127.0.0.1:6379[12]> get celery-task-meta-01756288-9a00-4d16-b7e8-17c6fb897cb7
"{\"status\": \"SUCCESS\", \"result\": [3, 87508], \"traceback\": null, \"children\": [[[\"3c1a2ecd-e52b-49fd-aee6-117e25363f05\", null], null]], \"date_done\": \"2023-09-22T08:09:32.315638\", \"task_id\": \"01756288-9a00-4d16-b7e8-17c6fb897cb7\"}"

Could you please give me a hint about this issue?
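
For reference, the stored task metadata shown above can be decoded to confirm what the result backend recorded for the first task. This is a minimal sketch using only the standard library; the variable names are illustrative, and the JSON string is the value returned by the `get` command above.

```python
import json

# Value of the celery-task-meta-01756288-... key, copied from the Redis output above.
raw_meta = (
    '{"status": "SUCCESS", "result": [3, 87508], "traceback": null, '
    '"children": [[["3c1a2ecd-e52b-49fd-aee6-117e25363f05", null], null]], '
    '"date_done": "2023-09-22T08:09:32.315638", '
    '"task_id": "01756288-9a00-4d16-b7e8-17c6fb897cb7"}'
)

meta = json.loads(raw_meta)
status = meta["status"]
# Each entry in "children" is nested as [[task_id, parent], extra]; pull out the task ids.
child_task_ids = [child[0][0] for child in meta["children"]]

print(status, meta["result"], child_task_ids)
```

The decoded status is SUCCESS, and the single child id matches the log_query_per_table_task that the worker log shows as received, so the first run_query_task did complete before the pool processes started timing out.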

@mlivirov

mlivirov commented Sep 24, 2023

+1
I faced the same issue in a prod deployment, but on my local machine with a dev build it works as expected.

The worker startup script behaves differently depending on the presence of the production flag:
https://github.com/pinterest/querybook/blob/master/querybook/server/tasks/all_tasks.py

As a workaround I added production=false to the worker env variables, which helped. I'm not sure what side effects that may have.

@gzhukov
Author

gzhukov commented Sep 24, 2023

Thanks. I have a prod environment too.

@adamstruck
Contributor

I experienced the same issue. Adding kombu==5.3.1 in requirements/base.txt fixed the issue in my production deployment.
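
After adding the pin and rebuilding, it may be worth confirming that the worker environment actually resolved kombu to the pinned version. The following is a rough sketch using the standard library's `importlib.metadata`; the helper names `installed_version` and `matches_pin` are made up for illustration, and 5.3.1 is simply the version reported to work in this thread.

```python
from importlib import metadata

PINNED_KOMBU = "5.3.1"  # version reported to work in this thread


def installed_version(package):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(installed, pinned=PINNED_KOMBU):
    """True only when the installed version string equals the pinned one."""
    return installed is not None and installed == pinned


if __name__ == "__main__":
    found = installed_version("kombu")
    print(f"kombu installed: {found}, matches pin {PINNED_KOMBU}: {matches_pin(found)}")
```

Running this inside the worker container (rather than on the host) checks the environment the Celery worker really uses.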

@jczhong84
Collaborator

jczhong84 commented Sep 26, 2023

@mlivirov @adamstruck thanks for sharing findings.

Regarding the production=false flag in all_tasks.py: the only difference is that the query execution cleanup runs when production is true. This reminds me of what @baumandm mentioned in https://querybook.slack.com/archives/CHCNR2Y5B/p1695153621351919.

As for kombu, it appears to be a dependency of celery, and we did upgrade the celery version. Does changing kombu to 5.3.1 work for other people?

@baumandm
Contributor

I ran into this issue running locally (via make) only with production=true set, while I was investigating the worker startup issue.

Fortunately we haven't seen it in our production instance, but we are building our own Docker image and it's possible that the requirements are slightly different.

@czgu
Collaborator

czgu commented Oct 11, 2023

Hey all, we faced a similar issue. It turns out that running celery with -P gevent resolves it. I am making gevent the default for workers to avoid this issue in the future.
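
As an illustration of the pool switch described above, here is a small sketch that builds a `celery worker` command line selecting the pool with `-P`. The helper `worker_command` and the app path `tasks.all_tasks` are placeholders for illustration and are not taken from Querybook's actual startup script; the gevent package also has to be installed for this pool to be available.

```python
import shlex


def worker_command(app, pool="gevent", concurrency=None):
    """Build a `celery worker` command line that selects the pool with -P."""
    cmd = ["celery", "-A", app, "worker", "-P", pool]
    if concurrency is not None:
        cmd += ["--concurrency", str(concurrency)]
    return cmd


# "tasks.all_tasks" is a placeholder app path, not confirmed by the thread.
print(shlex.join(worker_command("tasks.all_tasks")))
```

With a non-prefork pool the worker does not fork pool processes, which would explain why the "waiting for UP message from ForkProcess" timeout no longer appears, though that explanation is an inference rather than something stated in the thread.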

@czgu
Collaborator

czgu commented Oct 12, 2023

It turns out the cause should be celery/kombu#1785.
Will pin kombu to 5.3.1 for now.
