
ConcurrentModificationException when doing batch classification for more than one instance #1283

Closed
johann-petrak opened this issue Oct 14, 2021 · 10 comments
Labels
triaged Issue has been reviewed and triaged

Comments

@johann-petrak

  • torchserve version: 0.4.1
  • torch-model-archiver version: 0.4.1
  • torch version: 1.7.1+cu101
  • torchvision version [if any]:
  • torchtext version [if any]:
  • torchaudio version [if any]:
  • java version: openjdk version "11.0.12"
  • Operating System and version: Ubuntu 18.04.5 LTS

Getting the following exception:

2021-10-14 22:57:59,490 [WARN ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker thread exception.
java.util.ConcurrentModificationException
	at java.base/java.util.LinkedHashMap$LinkedHashIterator.nextNode(LinkedHashMap.java:719)
	at java.base/java.util.LinkedHashMap$LinkedKeyIterator.next(LinkedHashMap.java:741)
	at org.pytorch.serve.wlm.BatchAggregator.sendResponse(BatchAggregator.java:81)
	at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:194)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
2021-10-14 22:57:59,491 [ERROR] W-9000-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Unexpected job: 460bf910-0e24-4b65-9ed1-bc2d376352dc

This happens when more than one request is sent to the server within the maxBatchDelay window. If only one request is sent, everything works fine and I get back the expected array containing a single prediction.

The handler is implemented so that the method handle(self, data, context) returns a list of dicts with as many elements as data contains.

Before returning the list of dicts, the following code is also executed:

for idx, response_content_type in enumerate(response_content_types):
    context.set_response_content_type(idx, response_content_type)

where response_content_types is a list with as many elements as data (or, equivalently, as the returned list); in this case, all elements are just 'application/json'. A simplified sketch of the overall handler structure is shown below.
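
For clarity, the handler roughly follows this structure (a simplified sketch, not the actual code; the MyBatchHandler name and the preprocessing/inference parts are placeholders):

# Simplified sketch of the batch-aware handler structure described above.
# The class name and per-item processing are placeholders, not the real code.
class MyBatchHandler:
    def initialize(self, context):
        self.initialized = True

    def handle(self, data, context):
        # data is a list with one entry per request in the batch
        results = []
        response_content_types = []
        for item in data:
            # item.get("body") / item.get("data") would hold the request payload;
            # preprocessing and inference for this single item would go here
            results.append({"prediction": "placeholder"})
            response_content_types.append("application/json")

        # one content type per batch index, as in the snippet above
        for idx, response_content_type in enumerate(response_content_types):
            context.set_response_content_type(idx, response_content_type)

        # returns exactly len(data) elements
        return results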

The exception seems to indicate a problem in the Java code, which I assume should be completely independent of the Python code running in the handler (a ConcurrentModificationException in Java usually indicates that an element of a collection was added or removed while iterating over it, and I do not think my handler has anything to do with that).
I therefore assume this is a severe bug in TorchServe or, should it be caused by a client-side problem, that running into such an exception is definitely not the right way to signal it.

@johann-petrak
Author

This is the config file:

enable_envvars_config=true
decode_input_request=false
default_response_timeout=60
inference_address=http://0.0.0.0:9090
management_address=http://0.0.0.0:9090
load_models=model.mar
models={ "model": { "1.0": { "defaultVersion": true, "minWorkers": 1, "maxWorkers": 1, "batchSize": 3, "maxBatchDelay": 5000, "responseTimeout": 120}}}

No environment variables TS_* are set.
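
To reproduce, it is enough to send two requests within the 5000 ms maxBatchDelay window so that they land in the same batch, roughly like this (illustrative client code only; it assumes the standard /predictions/{model_name} inference endpoint with the port and model name from the config above, and the payload shape depends on the handler):

# Illustrative only: fire two requests concurrently so they fall into the
# same batch (batchSize=3, maxBatchDelay=5000 in the config above).
import concurrent.futures
import requests

URL = "http://localhost:9090/predictions/model"  # port and model name from the config

def send_one(text):
    # payload shape is just an example; the real handler expects JSON bodies
    return requests.post(URL, json={"text": text}, timeout=120)

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(send_one, t) for t in ("first request", "second request")]
    for f in futures:
        r = f.result()
        print(r.status_code, r.text)

With a single request everything is fine; with two or more, the worker thread dies with the ConcurrentModificationException above.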

@msaroufim
Member

msaroufim commented Oct 15, 2021

Hi @johann-petrak, do you mind trying out this PR, building from source, and seeing if it fixes your problem? It worked for me, but it'd be great to test it out on more than one model: #1272

msaroufim added the triaged label on Oct 15, 2021
@johann-petrak
Author

I would have loved to try it, but the procedure for installing from source requires sudo, insists on changing the global system configuration, and is not easy to adapt because it is a Python program rather than a bash script. It also insists on installing dependencies which are incompatible with the dependencies my handler needs.

@msaroufim
Member

Hmm, yeah, we need to fix that. Would you mind listing the issues you have with our installation script here? #1254

That would be super helpful so I can fix this.

@johann-petrak
Author

I took a closer look at what those scripts actually try to do and ran the commands relevant to me (or their equivalents) locally, then installed from the pythonbatchtest branch at commit 66fe2ec.

When testing my handler with this version, everything seems to work just fine when submitting a few requests (no heavy load testing was done).

@johann-petrak
Author

OK, I noticed there were some new commits in the meantime, so I re-installed from f713d49 and tested again. Works fine!

@msaroufim
Member

Thank you for testing, I appreciate it!

@johann-petrak
Author

Thank you for your work and the PR, hope it gets reviewed/merged/released soon!

@lxning
Collaborator

lxning commented Oct 22, 2021

PR #1272 has been merged.

lxning closed this as completed on Oct 22, 2021
@johann-petrak
Author

johann-petrak commented Nov 3, 2021

@lxning any chance of a hotfix release with this?
Having this working is very important for deploying models in production.
