Too many threads generated until "-su: fork: retry: Resource temporarily unavailable" #1511

Closed
jack-gits opened this issue Mar 15, 2022 · 4 comments · Fixed by #1552
@jack-gits (Contributor)

Context

  • torchserve version: pytorch/torchserve:latest-gpu
  • torch-model-archiver version: 0.5.2
  • torch version: 1.6.0
  • torchvision version [if any]: 0.7.0
  • torchtext version [if any]:
  • torchaudio version [if any]:
  • java version: openjdk version "1.8.0_292"
  • Operating System and version: Ubuntu 16.04

Your Environment

  • Installed using source? [yes/no]:
  • Are you planning to deploy it using docker container? [yes/no]: yes
  • Is it a CPU or GPU environment?: GPU
  • Using a default/custom handler? [If possible upload/share custom handler/model]: custom handler
  • What kind of model is it e.g. vision, text, audio?: vision
  • Are you planning to use local models from the model store or a public URL (e.g. an S3 bucket)? [If public URL then provide link.]: local model
  • Provide config.properties, logs [ts.log] and parameters used for model registration/update APIs:
  • Link to your project [if any]:

Expected Behavior

I'm using a TorchServe workflow in Docker. During inference, the system generates more and more threads until it fails with "-su: fork: retry: Resource temporarily unavailable".

[screenshot attached]

Current Behavior

Possible Solution

Steps to Reproduce

  1. torchserve --start
  2. Register the workflow via the management API.
  3. Run inference; there are about 6,000 cases to infer (example commands below).
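
For reference, registering and invoking a workflow looks roughly like this (the store paths, workflow archive name, and input file below are placeholders, not the reporter's actual ones):

$ torchserve --start --model-store model_store --workflow-store wf_store
$ curl -X POST "http://127.0.0.1:8081/workflows?url=my_workflow.war"
$ curl http://127.0.0.1:8080/wfpredict/my_workflow -T input.jpg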

Failure Logs [if any]

2022-03-14T15:03:32,999 [ERROR] pool-3-thread-2 org.pytorch.serve.metrics.MetricCollector -
java.io.IOException: Cannot run program "/usr/bin/python3" (in directory "/usr/local/lib/python3.6/dist-packages"): error=11, Resource temporarily unavailable
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1128) ~[?:?]
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1071) ~[?:?]
at java.lang.Runtime.exec(Runtime.java:592) ~[?:?]
at org.pytorch.serve.metrics.MetricCollector.run(MetricCollector.java:42) ~[model-server.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) ~[?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.io.IOException: error=11, Resource temporarily unavailable
at java.lang.ProcessImpl.forkAndExec(Native Method) ~[?:?]
at java.lang.ProcessImpl.<init>(ProcessImpl.java:340) ~[?:?]
at java.lang.ProcessImpl.start(ProcessImpl.java:271) ~[?:?]
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1107) ~[?:?]
... 9 more
[15806.571s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 136k, guardsize: 0k, detached.
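
error=11 (EAGAIN) from fork/pthread_create usually means the process has hit a thread or process limit rather than running out of memory. A few generic checks to confirm that inside the container (not from the original report; the TorchServe PID is a placeholder):

$ ulimit -u                          # per-user process/thread limit
$ cat /proc/sys/kernel/threads-max   # system-wide thread limit
$ ps -o nlwp= -p <torchserve_pid>    # current thread count of the TorchServe JVM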

@msaroufim msaroufim added the workflowx Issues related to workflow / ensemble models label Mar 16, 2022
@jack-gits (Contributor, Author)

any update?

@msaroufim msaroufim added bug Something isn't working urgent labels Mar 22, 2022
@msaroufim msaroufim added this to the v0.6.0 milestone Mar 22, 2022
@maaquib (Collaborator) commented Mar 23, 2022

Can reproduce using the dog-cat classification workflow example

$ curl -X POST "http://127.0.0.1:8081/workflows?url=dog_breed_wf.war"
{
  "status": "Workflow dog_breed_wf has been registered and scaled successfully."
}
$ ps -efT | cat | grep wf_store | grep -v grep | wc -l
52
$ curl https://raw.githubusercontent.com/udacity/dog-project/master/images/Labrador_retriever_06457.jpg -o Dog1.jpg
$ curl -s http://127.0.0.1:8080/wfpredict/dog_breed_wf -T Dog1.jpg > /dev/null
$ ps -efT | cat | grep wf_store | grep -v grep | wc -l
60
$ curl -s http://127.0.0.1:8080/wfpredict/dog_breed_wf -T Dog1.jpg > /dev/null
model-server@1182613e41ce:~$ ps -efT | cat | grep wf_store | grep -v grep | wc -l
66
$ for i in {1..100}; do curl -s http://127.0.0.1:8080/wfpredict/dog_breed_wf -T Dog1.jpg > /dev/null; done
$ ps -efT | cat | grep wf_store | grep -v grep | wc -l
407
$ ps -efT | cat | grep wf_store | head -1
model-s+    15    15     1  0 18:13 pts/0    00:00:00 java -Dmodel_server_home=/home/venv/lib/python3.8/site-packages -Djava.io.tmpdir=/home/model-server/tmp -cp .:/home/venv/lib/python3.8/site-packages/ts/frontend/* org.pytorch.serve.ModelServer --python /home/venv/bin/python -s model_store/ -w wf_store/ -ncs
  • Num of WAITING (parking) threads increases by 3 with every inference request
$ jstack 15 | grep WAITING | wc -l
335
$ jstack 15 | grep "ThreadPoolExecutor.runWorker" | wc -l
334

From heap dump

"pool-100-thread-1" #378 prio=5 os_prio=0 cpu=0.35ms elapsed=726.22s tid=0x00007f3070213800 nid=0x281 waiting on condition  [0x00007f2f23027000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@11.0.13/Native Method)
	- parking to wait for  <0x0000000424864c40> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(java.base@11.0.13/LockSupport.java:194)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@11.0.13/AbstractQueuedSynchronizer.java:2081)
	at java.util.concurrent.LinkedBlockingQueue.take(java.base@11.0.13/LinkedBlockingQueue.java:433)
	at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@11.0.13/ThreadPoolExecutor.java:1054)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.13/ThreadPoolExecutor.java:1114)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.13/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run(java.base@11.0.13/Thread.java:829)
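
The growing "pool-N-thread-1" names and the workers parked in ThreadPoolExecutor.getTask() are consistent with a new ExecutorService being created per request and never shut down, so its idle workers accumulate until fork/pthread_create fails. A minimal Java sketch of that anti-pattern and an obvious remedy (illustrative only, not the actual TorchServe workflow code):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadLeakSketch {
    // Anti-pattern: a fresh pool per request. Its worker threads park in
    // ThreadPoolExecutor.getTask() waiting for work that never comes, because
    // shutdown() is never called, so every request leaks a few threads.
    static void handleRequestLeaky(Runnable work) {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        pool.submit(work);
        // missing: pool.shutdown();
    }

    // Remedy: shut the per-request pool down when done, or reuse one
    // long-lived shared pool for all requests.
    static final ExecutorService SHARED = Executors.newFixedThreadPool(4);

    static void handleRequestFixed(Runnable work) {
        SHARED.submit(work);
    }
}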

maaquib added a commit to maaquib/serve that referenced this issue Mar 23, 2022
@jack-gits (Contributor, Author)

When will this be released?

@msaroufim (Member)

As soon as the PR is merged, it needs a day to show up in the nightly builds: https://pypi.org/project/torchserve-nightly/

For an official release we will probably include this in 0.6; we're still discussing an exact date with the team.
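
Once the fix is in a nightly build, something like this should pick it up (assuming a pip-based install rather than the Docker image):

$ pip install --upgrade torchserve-nightly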

maaquib added commits to maaquib/serve that referenced this issue Apr 6, 2022