
Engine Thread Pool Configuration #1490

Closed
nickdgriffin opened this issue Feb 28, 2020 · 4 comments

nickdgriffin commented Feb 28, 2020

Hello!

Long time browser, first time raiser. We recently ran into an issue where making more than one request concurrently to the predictions endpoint of the Engine would result in all requests hanging indefinitely.

It turned out to be a combination of the complexity of our graph (max depth of 7) and the fact that the Engine uses Spring's default thread pool configuration, which has an unbounded queue but a core pool size of 8. This means only 8 threads are ever spawned and tasks are queued indefinitely, which in our case caused a blockage because the PredictiveUnitBean blocks its thread whilst it waits for the child. As one thread is taken up by the GRPC server, the remaining 7 are all required for a single request to complete, so any additional concurrent request causes a lock up - although it's not reported as a deadlock, and as the queue is unbounded there are no task rejections reported either.
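To make the failure mode concrete, here's a rough standalone sketch (plain java.util.concurrent, not the Engine's actual code) of a blocking parent/child chain running on a pool with the same shape as Spring Boot's defaults - 8 core threads and an unbounded queue - with the graph depth and blocking behaviour modelled on the description above:

```java
import java.util.concurrent.*;

public class UnboundedQueueHangSketch {

    // Same shape as Spring Boot's default task execution pool: 8 core threads and an
    // unbounded queue, so the pool never grows beyond 8 regardless of the max size.
    static final ThreadPoolExecutor pool = new ThreadPoolExecutor(
            8, Integer.MAX_VALUE, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());

    // Each "node" holds its pool thread until its child finishes, analogous to
    // PredictiveUnitBean blocking whilst it waits for the child unit.
    static String node(int depth) throws Exception {
        if (depth == 0) return "leaf";
        Future<String> child = pool.submit(() -> node(depth - 1));
        return child.get(); // blocks this pool thread while the child runs (or queues)
    }

    public static void main(String[] args) throws Exception {
        // A single depth-7 request ties up 7 blocked parents plus a thread for the leaf,
        // so it only just fits in an 8-thread pool. Two concurrent requests fill the
        // pool with blocked parents, their remaining children sit in the queue forever,
        // and both requests hang - with nothing rejected because the queue never fills.
        Future<String> first = pool.submit(() -> node(7));
        Future<String> second = pool.submit(() -> node(7));
        try {
            System.out.println(first.get(5, TimeUnit.SECONDS));
            System.out.println(second.get(5, TimeUnit.SECONDS));
        } catch (TimeoutException e) {
            System.out.println("hung: pool stuck at " + pool.getPoolSize()
                    + " threads with " + pool.getQueue().size() + " tasks queued");
        }
        pool.shutdownNow(); // interrupt the blocked threads so the JVM can exit
    }
}
```

Bumping max-size alone doesn't help here; the queue has to be bounded before the pool will ever grow.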

My request is that these properties are exposed or documented so that they can be configured - at the moment we've resorted to setting them via JAVA_OPTS - or that more sensible defaults are set, along with logging for when the Engine is unable to process a request: all we could see was the REST controller logging the request and nothing else, and we had to inspect the heap and perform a thread dump to see what was happening under the hood.

Alternatively, PredictiveUnitBean shouldn't block the thread until the child completes, as this will create issues for larger graphs, which will require even larger thread pools to operate whilst becoming more susceptible to locking up.

Thanks,
Nick

EDIT:
We applied the JAVA_OPTS changes with the following annotation on the SeldonDeployment resource (in our case it's actually set via a Helm chart, so we can configure those pool values):

seldon.io/engine-java-opts: "-server -Dcom.sun.management.jmxremote.rmi.port=9090 -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9090 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.local.only=false -Djava.rmi.server.hostname=127.0.0.1 -Dspring.task.execution.pool.core-size=15 -Dspring.task.execution.pool.max-size=70 -Dspring.task.execution.pool.queue-capacity=1"
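For anyone applying the same workaround, here's a sketch of where that annotation sits on the resource - assuming it goes under spec.annotations of the SeldonDeployment, with placeholder names/images, the JAVA_OPTS string shortened, and an apiVersion that may differ depending on your Seldon Core version (carry the JMX flags over from the full string above, since the annotation replaces the defaults):

```yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: my-deployment                       # placeholder
spec:
  annotations:
    # Shortened here - use the full string from above, as this replaces the default JAVA_OPTS
    seldon.io/engine-java-opts: "-server -Dspring.task.execution.pool.core-size=15 -Dspring.task.execution.pool.max-size=70 -Dspring.task.execution.pool.queue-capacity=1"
  predictors:
    - name: default
      replicas: 1
      componentSpecs:
        - spec:
            containers:
              - name: classifier            # placeholder model container
                image: my-registry/my-model:0.1
      graph:
        name: classifier
        type: MODEL
```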

The pool values were based on:

  • core-size: 1 thread for the GRPC server, then 2x our graph depth
  • max-size: 10 concurrent graphs being processed
  • queue-capacity: 1, to force it to grow the pool whilst it can - once the pool is at max size and the queue is full it will reject any further tasks (sketched below), but you get an exception for that and the requester should also receive an error (which we figured was better than it just hanging until a timeout kicks in)

There's room for tuning, but it was adequate for what we needed. Keep an eye on memory utilisation when tuning the pool too, as you might need to allocate more.

The other values were taken from the defaults that are applied, as it seems this annotation overwrites them.
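If it helps to see how those three values interact, here's a rough standalone sketch (plain java.util.concurrent rather than Spring's wrapper, and scaled down to core 2 / max 4 / queue capacity 1) of the grow-then-reject behaviour described in the bullets above:

```java
import java.util.concurrent.*;

public class SmallQueueSketch {
    public static void main(String[] args) throws Exception {
        // Scaled-down version of the values above: core 2, max 4, queue capacity 1.
        // Spring's spring.task.execution.pool.* properties set the same three
        // parameters on the task executor it builds.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 4, 60, TimeUnit.SECONDS, new ArrayBlockingQueue<>(1));

        CountDownLatch release = new CountDownLatch(1);
        Runnable blocker = () -> {
            try { release.await(); } catch (InterruptedException ignored) { }
        };

        // Tasks 1-2 occupy the core threads, task 3 fills the single queue slot,
        // and tasks 4-5 force the pool to grow to its max size of 4.
        for (int i = 1; i <= 5; i++) {
            pool.execute(blocker);
            System.out.printf("task %d accepted, pool size now %d%n", i, pool.getPoolSize());
        }

        // Task 6 finds the queue full and the pool at max size, so it is rejected
        // straight away rather than queuing indefinitely - the trade-off chosen above.
        try {
            pool.execute(blocker);
        } catch (RejectedExecutionException e) {
            System.out.println("task 6 rejected: " + e);
        }

        release.countDown();
        pool.shutdown();
    }
}
```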

nickdgriffin added the triage label Feb 28, 2020
ukclivecox (Contributor) commented

Thanks. Yes, it would be great to add a PR for this.
From 1.1 we are releasing a Golang rewrite of the engine called the "executor": https://github.com/SeldonIO/seldon-core/tree/master/executor
It would be great if you have a chance to test your graphs on this by installing from master, which defaults to the executor.

ukclivecox added the bug label and removed the triage label Feb 28, 2020
ukclivecox added this to the 1.2 milestone Feb 28, 2020
amigniox commented

@nickdgriffin we are having a similar issue with the service orchestrator hanging requests. Could you please share how you solved the problem by configuring JAVA_OPTS? Thank you!

nickdgriffin (Author) commented

@amigniox Yes, that would have been helpful - I've edited the original post to show how we did it.


eavidan commented Apr 7, 2020

Thanks @nickdgriffin, this solved the issue.

ukclivecox removed this from the 1.2 milestone Apr 23, 2020
This issue was closed.