stop training from blocking other requests to the server #4774
Conversation
Not only training: having the Agent load a local model also blocks other requests. We can resolve it the same way, by running
Good point. I've updated the description to reflect that this PR only addresses part of that issue.
Thanks for figuring out the test 👍 I know it was a lot of pain, but the test will keep us safe forever (even if we were to replace Sanic). This definitely earns you a 🥇😀
tests/test_server.py
Outdated
server_ready = (
    requests.get("http://localhost:5005/status").status_code == 200
)
except requests.exceptions.ConnectionError:
Should we have a variable `max_tries` to avoid having a test which runs forever?
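The bounded-polling idea could look something like this sketch (the function name, timings, and `max_tries` default are illustrative, not the actual test code):

```python
import time

import requests  # third-party HTTP client already used by the test


def wait_for_server(
    url: str = "http://localhost:5005/status",
    max_tries: int = 30,
    delay: float = 1.0,
) -> bool:
    """Poll the status endpoint, giving up after ``max_tries`` attempts."""
    for _ in range(max_tries):
        try:
            if requests.get(url).status_code == 200:
                return True
        except requests.exceptions.ConnectionError:
            # Server not up yet; retry after a short pause.
            pass
        time.sleep(delay)
    return False
```

With a cap like this, the test fails cleanly after `max_tries * delay` seconds instead of hanging the CI job forever.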
@erohmensing is this something that could be part of a patch?
@akelad yes, I don't see why not. Just have to get the test to pass 🙄
Well, at this point we're releasing 1.5 soon, right? So we may as well merge into master... but we should definitely get it in before the release if possible :D
Yes Tobi and I are meeting this morning to try to get this resolved! |
So, if there is no way to make this test robust, we should just drop it.
@federicotdn I'll have a look and tag you for review then. |
Looks great 👍
I added a few non-blocking comments (no pun intended)
model_path: Optional[Text] = None
# pass `None` to run in default executor
model_path = await loop.run_in_executor(
    None, functools.partial(train_model, **info)
So here we've chosen to run the training function on a separate thread instead of a separate process, right? Is there a chance the training thread holds the GIL in a way that blocks Sanic from processing requests? (If TensorFlow does not hold the GIL for long stretches, then I guess it's not a problem.)
> Is there a chance the training thread might use the GIL in a way that might block Sanic from processing requests?
That shouldn't matter: when running things in a separate thread, Python switches between threads every few milliseconds, so other threads can continue processing.
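A small sketch illustrating the point: a coroutine keeps getting scheduled while a blocking call runs in the default executor (all names and timings here are illustrative, not PR code):

```python
import asyncio
import time


def blocking_work() -> str:
    # Occupies a worker thread for 0.5 s; time.sleep releases the GIL,
    # so the event-loop thread stays free to run coroutines.
    time.sleep(0.5)
    return "done"


async def main() -> tuple:
    loop = asyncio.get_running_loop()
    ticks = []
    work = loop.run_in_executor(None, blocking_work)
    # These iterations run while blocking_work is still sleeping.
    for _ in range(5):
        ticks.append("tick")
        await asyncio.sleep(0.05)
    return await work, ticks
```

Running `asyncio.run(main())` collects all five ticks before the executor call finishes, which is exactly the behavior the status-endpoint test relies on.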
try:
    loop = asyncio.get_event_loop()
except RuntimeError:
This can only happen when running from inside the executor, right?
yep, should I add a comment?
added a comment
@@ -779,14 +780,25 @@ def _get_events_from_request_body(request: Request) -> List[Event]:
    with app.active_training_processes.get_lock():
        app.active_training_processes.value += 1

-   model_path = await train_async(
+   info = dict(
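The `active_training_processes` counter seen in the hunk above follows the standard `multiprocessing.Value` pattern: increment and decrement under the value's own lock. A self-contained sketch (the helper names are hypothetical):

```python
import multiprocessing

# Process-safe integer counter guarded by its built-in lock.
active_training_processes = multiprocessing.Value("i", 0)


def begin_training() -> None:
    with active_training_processes.get_lock():
        active_training_processes.value += 1


def finish_training() -> None:
    with active_training_processes.get_lock():
        active_training_processes.value -= 1
```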
Really wish `run_in_executor` also accepted `**kwargs`! Using `partial` was a good workaround 👍
That's also the suggested way in the docs (Ella pointed it out to me :-) )
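The `partial` workaround can be sketched end to end; note that `train_model` here is a stand-in with a made-up signature, not the real Rasa training function:

```python
import asyncio
import functools
import time


def train_model(epochs: int, lr: float) -> str:
    # Stand-in for the real blocking training call (hypothetical signature).
    time.sleep(0.1)
    return f"model(epochs={epochs}, lr={lr})"


async def main() -> str:
    loop = asyncio.get_running_loop()
    # run_in_executor only forwards positional *args, so keyword
    # arguments are bound up front with functools.partial.
    info = {"epochs": 10, "lr": 0.01}
    return await loop.run_in_executor(
        None, functools.partial(train_model, **info)
    )


print(asyncio.run(main()))  # → model(epochs=10, lr=0.01)
```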
# Fake training function which blocks until we tell it to stop blocking.
# If we can send a status request while this is blocking, we can be sure that
# the actual training is also not blocking.
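Such a fake training function can be sketched with a `threading.Event` (illustrative, not the actual test code):

```python
import threading

# The test sets this event to let the fake training finish.
unblock_training = threading.Event()


def fake_train(*args, **kwargs) -> str:
    # Block the executor worker thread until the test says otherwise;
    # meanwhile the event loop should still answer /status requests.
    unblock_training.wait()
    return "fake model"
```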
Would it be possible, in the future, to add a test similar to `test_train_status_is_not_blocked_by_training`, but using the real `rasa.train`? This new test is very helpful, but I think it would also be interesting to test our choice of threading vs. processing with the real training function.
Ella tried this, and it makes the test very fragile because you don't know how long the training is going to take on Travis; all these timing dependencies make the test flaky.
@@ -0,0 +1 @@
Requests to ``/model/train`` no longer block other requests to the Rasa server.
Just had this thought: what happens if I call this endpoint and then call it immediately again, while the previous request has not finished?
It will train a second model. Since this doesn't store any information in global variables, we should be fine.