
Engine Thread Pool Configuration #1490

Closed
nickdgriffin opened this issue Feb 28, 2020 · 4 comments

nickdgriffin commented Feb 28, 2020

Hello!

Long time browser, first time raiser. We recently ran into an issue where making more than one request concurrently to the predictions endpoint of the Engine would result in all requests hanging indefinitely.

It turned out to be a combination of the complexity of our graph (max depth of 7) and the fact that the Engine uses Spring's default thread pool configuration, which has an unbounded queue but a core pool size of 8. This means only 8 threads are ever spawned and tasks are queued indefinitely, which in our case caused a blockage because the PredictiveUnitBean blocks its thread whilst it waits for the child. As one thread is taken up by the GRPC server, the remaining 7 are all required for a single request to complete, so any additional concurrent request causes a lock up - although it's not reported as a deadlock, and as the queue is unbounded there are no task rejections reported either.
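To make the failure mode concrete, here's a rough standalone sketch (plain java.util.concurrent, not the Engine's actual code) of a blocking parent/child chain running on a pool with the same shape as Spring Boot's defaults - 8 core threads and an unbounded queue - with the graph depth and blocking behaviour modelled on the description above:

```java
import java.util.concurrent.*;

public class UnboundedQueueHangSketch {

    // Same shape as Spring Boot's default task execution pool: 8 core threads and an
    // unbounded queue, so the pool never grows beyond 8 regardless of the max size.
    static final ThreadPoolExecutor pool = new ThreadPoolExecutor(
            8, Integer.MAX_VALUE, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());

    // Each "node" holds its pool thread until its child finishes, analogous to
    // PredictiveUnitBean blocking whilst it waits for the child unit.
    static String node(int depth) throws Exception {
        if (depth == 0) return "leaf";
        Future<String> child = pool.submit(() -> node(depth - 1));
        return child.get(); // blocks this pool thread while the child runs (or queues)
    }

    public static void main(String[] args) throws Exception {
        // A single depth-7 request ties up 7 blocked parents plus a thread for the leaf,
        // so it only just fits in an 8-thread pool. Two concurrent requests fill the
        // pool with blocked parents, their remaining children sit in the queue forever,
        // and both requests hang - with nothing rejected because the queue never fills.
        Future<String> first = pool.submit(() -> node(7));
        Future<String> second = pool.submit(() -> node(7));
        try {
            System.out.println(first.get(5, TimeUnit.SECONDS));
            System.out.println(second.get(5, TimeUnit.SECONDS));
        } catch (TimeoutException e) {
            System.out.println("hung: pool stuck at " + pool.getPoolSize()
                    + " threads with " + pool.getQueue().size() + " tasks queued");
        }
        pool.shutdownNow(); // interrupt the blocked threads so the JVM can exit
    }
}
```

Bumping max-size alone doesn't help here; the queue has to be bounded before the pool will ever grow.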

My request is that these properties are exposed or documented so that they can be configured - at the moment we've resorted to setting them via JAVA_OPTS - or that more sensible defaults are set, along with logging for when the Engine is unable to process a request: all we could see was the REST controller logging the request and nothing else, and we had to inspect the heap and perform a thread dump to see what was happening under the hood.

Alternatively, PredictiveUnitBean shouldn't block the thread until the child completes, as this will create issues for larger graphs, which will require even larger thread pools to operate whilst becoming more susceptible to locking up.

Thanks,
Nick

EDIT:
We applied the JAVA_OPTS changes with the following annotation on the SeldonDeployment resource (in our case it's actually set via a Helm chart, so we can configure those pool values):

seldon.io/engine-java-opts: "-server -Dcom.sun.management.jmxremote.rmi.port=9090 -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9090 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.local.only=false -Djava.rmi.server.hostname=127.0.0.1 -Dspring.task.execution.pool.core-size=15 -Dspring.task.execution.pool.max-size=70 -Dspring.task.execution.pool.queue-capacity=1"
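For anyone applying the same workaround, here's a sketch of where that annotation sits on the resource - assuming it goes under spec.annotations of the SeldonDeployment, with placeholder names/images, the JAVA_OPTS string shortened, and an apiVersion that may differ depending on your Seldon Core version (carry the JMX flags over from the full string above, since the annotation replaces the defaults):

```yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: my-deployment                       # placeholder
spec:
  annotations:
    # Shortened here - use the full string from above, as this replaces the default JAVA_OPTS
    seldon.io/engine-java-opts: "-server -Dspring.task.execution.pool.core-size=15 -Dspring.task.execution.pool.max-size=70 -Dspring.task.execution.pool.queue-capacity=1"
  predictors:
    - name: default
      replicas: 1
      componentSpecs:
        - spec:
            containers:
              - name: classifier            # placeholder model container
                image: my-registry/my-model:0.1
      graph:
        name: classifier
        type: MODEL
```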

The pool values were based on:

  • core-size: 1 thread for the GRPC server, then 2x our graph depth
  • max-size: 10 concurrent graphs being processed
  • queue-capacity: 1, to force it to grow the pool whilst it can - once the pool is at max size and the queue is full it will reject any further tasks (sketched below), but you get an exception for that and the requester should also receive an error (which we figured was better than it just hanging until a timeout kicks in)

There's room for tuning, but it was adequate for what we needed. Keep an eye on memory utilisation when tuning the pool too, as you might need to allocate more.

The other values were taken from the defaults that are applied, as it seems this annotation overwrites them.
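If it helps to see how those three values interact, here's a rough standalone sketch (plain java.util.concurrent rather than Spring's wrapper, and scaled down to core 2 / max 4 / queue capacity 1) of the grow-then-reject behaviour described in the bullets above:

```java
import java.util.concurrent.*;

public class SmallQueueSketch {
    public static void main(String[] args) throws Exception {
        // Scaled-down version of the values above: core 2, max 4, queue capacity 1.
        // Spring's spring.task.execution.pool.* properties set the same three
        // parameters on the task executor it builds.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 4, 60, TimeUnit.SECONDS, new ArrayBlockingQueue<>(1));

        CountDownLatch release = new CountDownLatch(1);
        Runnable blocker = () -> {
            try { release.await(); } catch (InterruptedException ignored) { }
        };

        // Tasks 1-2 occupy the core threads, task 3 fills the single queue slot,
        // and tasks 4-5 force the pool to grow to its max size of 4.
        for (int i = 1; i <= 5; i++) {
            pool.execute(blocker);
            System.out.printf("task %d accepted, pool size now %d%n", i, pool.getPoolSize());
        }

        // Task 6 finds the queue full and the pool at max size, so it is rejected
        // straight away rather than queuing indefinitely - the trade-off chosen above.
        try {
            pool.execute(blocker);
        } catch (RejectedExecutionException e) {
            System.out.println("task 6 rejected: " + e);
        }

        release.countDown();
        pool.shutdown();
    }
}
```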

nickdgriffin added the triage label Feb 28, 2020
ukclivecox (Contributor) commented

Thanks. Yes, it would be great to add a PR for this.
From 1.1 we are releasing a Golang rewrite of the engine called the "executor": https://github.com/SeldonIO/seldon-core/tree/master/executor
It would be great if you have a chance to test your graphs on this by installing from master, which defaults to the executor.

ukclivecox added the bug label and removed the triage label Feb 28, 2020
ukclivecox added this to the 1.2 milestone Feb 28, 2020
amigniox commented

@nickdgriffin we are having a similar issue with the service orchestrator hanging requests. Could you please share how you solved the problem by configuring JAVA_OPTS? Thank you!

nickdgriffin (Author) commented

@amigniox Yes, that would have been helpful - I've edited the original post to show how we did it.


eavidan commented Apr 7, 2020

Thanks @nickdgriffin, this solved the issue.

ukclivecox removed this from the 1.2 milestone Apr 23, 2020
This issue was closed.