
ConcurrentModificationException when doing batch classification for more than one instance #1283

Closed
johann-petrak opened this issue Oct 14, 2021 · 10 comments
Labels
triaged Issue has been reviewed and triaged

Comments

@johann-petrak

  • torchserve version: 0.4.1
  • torch-model-archiver version: 0.4.1
  • torch version: 1.7.1+cu101
  • torchvision version [if any]:
  • torchtext version [if any]:
  • torchaudio version [if any]:
  • java version: openjdk version "11.0.12"
  • Operating System and version: Ubuntu 18.04.5 LTS

Getting the following exception:

2021-10-14 22:57:59,490 [WARN ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker thread exception.
java.util.ConcurrentModificationException
	at java.base/java.util.LinkedHashMap$LinkedHashIterator.nextNode(LinkedHashMap.java:719)
	at java.base/java.util.LinkedHashMap$LinkedKeyIterator.next(LinkedHashMap.java:741)
	at org.pytorch.serve.wlm.BatchAggregator.sendResponse(BatchAggregator.java:81)
	at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:194)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
2021-10-14 22:57:59,491 [ERROR] W-9000-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Unexpected job: 460bf910-0e24-4b65-9ed1-bc2d376352dc

This happens when more than one request is sent to the server within the maxBatchDelay window. If only one request is sent, everything works fine and I get back the expected array containing a single prediction.

The handler is implemented so that the method handle(self, data, context) returns a list of dicts with as many elements as data contains.

Before returning the list of dicts, the following code is also executed:

for idx, response_content_type in enumerate(response_content_types):
    context.set_response_content_type(idx, response_content_type)

where response_content_types is a list with as many elements as data (or, equivalently, as the returned list); in this case, all elements are just 'application/json'. A simplified sketch of the overall handler structure is shown below.
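
For clarity, the handler roughly follows this structure (a simplified sketch, not the actual code; the MyBatchHandler name and the preprocessing/inference parts are placeholders):

# Simplified sketch of the batch-aware handler structure described above.
# The class name and per-item processing are placeholders, not the real code.
class MyBatchHandler:
    def initialize(self, context):
        self.initialized = True

    def handle(self, data, context):
        # data is a list with one entry per request in the batch
        results = []
        response_content_types = []
        for item in data:
            # item.get("body") / item.get("data") would hold the request payload;
            # preprocessing and inference for this single item would go here
            results.append({"prediction": "placeholder"})
            response_content_types.append("application/json")

        # one content type per batch index, as in the snippet above
        for idx, response_content_type in enumerate(response_content_types):
            context.set_response_content_type(idx, response_content_type)

        # returns exactly len(data) elements
        return results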

The exception seems to indicate a problem in the Java code, which I assume should be completely independent of the Python code running in the handler (a ConcurrentModificationException in Java usually indicates that an element of a collection was added or removed while iterating over it, and I do not think my handler has anything to do with that).
I therefore assume this is a severe bug in TorchServe or, should it be caused by a client-side problem, that running into such an exception is definitely not the right way to signal it.

@johann-petrak
Author

This is the config file:

enable_envvars_config=true
decode_input_request=false
default_response_timeout=60
inference_address=http://0.0.0.0:9090
management_address=http://0.0.0.0:9090
load_models=model.mar
models={ "model": { "1.0": { "defaultVersion": true, "minWorkers": 1, "maxWorkers": 1, "batchSize": 3, "maxBatchDelay": 5000, "responseTimeout": 120}}}

No environment variables TS_* are set.
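
To reproduce, it is enough to send two requests within the 5000 ms maxBatchDelay window so that they land in the same batch, roughly like this (illustrative client code only; it assumes the standard /predictions/{model_name} inference endpoint with the port and model name from the config above, and the payload shape depends on the handler):

# Illustrative only: fire two requests concurrently so they fall into the
# same batch (batchSize=3, maxBatchDelay=5000 in the config above).
import concurrent.futures
import requests

URL = "http://localhost:9090/predictions/model"  # port and model name from the config

def send_one(text):
    # payload shape is just an example; the real handler expects JSON bodies
    return requests.post(URL, json={"text": text}, timeout=120)

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(send_one, t) for t in ("first request", "second request")]
    for f in futures:
        r = f.result()
        print(r.status_code, r.text)

With a single request everything is fine; with two or more, the worker thread dies with the ConcurrentModificationException above.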

@msaroufim
Member

msaroufim commented Oct 15, 2021

Hi @johann-petrak, do you mind trying out this PR, building from source, and seeing if it fixes your problem? It worked for me, but it'd be great to test it out on more than one model: #1272

msaroufim added the triaged label on Oct 15, 2021
@johann-petrak
Author

I would have loved to try it, but the procedure for installing from source requires sudo, insists on changing the global system configuration, and is not easy to adapt because it is a Python program rather than a bash script. It also insists on installing dependencies which are incompatible with the dependencies my handler needs.

@msaroufim
Member

Hmm, yeah, we need to fix that. Would you mind listing the issues you have with our installation script here? #1254

That would be super helpful so I can fix this.

@johann-petrak
Author

I took a closer look at what those scripts actually try to do and ran the commands relevant to me (or their equivalents) locally, then installed from the pythonbatchtest branch at commit 66fe2ec.

When testing my handler with this version, everything seems to work just fine when submitting a few requests (no heavy load testing was done).

@johann-petrak
Author

OK, I noticed there were some new commits in the meantime, so I re-installed from f713d49 and tested again. Works fine!

@msaroufim
Member

Thank you for testing, I appreciate it!

@johann-petrak
Author

Thank you for your work and the PR, hope it gets reviewed/merged/released soon!

@lxning
Collaborator

lxning commented Oct 22, 2021

PR #1272 has been merged.

lxning closed this as completed on Oct 22, 2021
@johann-petrak
Author

johann-petrak commented Nov 3, 2021

@lxning any chance of a hotfix release with this?
Having this working is very important for deploying models in production.
