stop training from blocking other requests to the server #4774

Merged: 27 commits, Jan 28, 2020

Conversation

@erohmensing (Contributor) commented Nov 14, 2019

Proposed changes:

Status (please check what you already did):

  • made PR ready for code review
  • added some tests for the functionality
  • updated the documentation
  • updated the changelog
  • reformat files using black (please check Readme for instructions)

@erohmensing changed the title from "Train http" to "stop training from blocking other requests to the server" on Nov 14, 2019
@charlielin

Not only training: loading a local model via Agent.load_local_model() also blocks other requests. We can resolve it in the same way, by running it in an executor.
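A minimal sketch of the pattern being suggested here, with placeholder names (not code from this PR): offload the blocking load to the event loop's default executor so request handling stays responsive.

    import asyncio

    def load_local_model(model_path: str):
        # Placeholder for the blocking model-loading call mentioned above.
        ...

    async def load_model_without_blocking(model_path: str):
        loop = asyncio.get_running_loop()
        # Run the blocking load in the default thread pool executor so the
        # event loop stays free to serve other requests in the meantime.
        return await loop.run_in_executor(None, load_local_model, model_path)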

@erohmensing (Contributor, Author)

Good point. I've updated the description to reflect that this PR is only addressing some of that issue.

@wochinge (Contributor) left a review comment:

Thanks for figuring out the test 👍 I know it was a lot of pain, but the test will keep us safe forever (even if we were to replace Sanic). This definitely earns you a 🥇😀

(Resolved review threads on rasa/core/train.py and tests/test_server.py; comments now outdated.)
    try:
        server_ready = (
            requests.get("http://localhost:5005/status").status_code == 200
        )
    except requests.exceptions.ConnectionError:
Contributor:

should we have a variable max_tries to avoid having a test which runs forever?
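A sketch of what a bounded retry could look like (illustrative only; MAX_TRIES and the sleep interval are assumptions, not code from this PR):

    import time
    import requests

    MAX_TRIES = 30  # assumed upper bound so the test fails instead of hanging

    server_ready = False
    for _ in range(MAX_TRIES):
        try:
            server_ready = (
                requests.get("http://localhost:5005/status").status_code == 200
            )
        except requests.exceptions.ConnectionError:
            pass
        if server_ready:
            break
        time.sleep(1)

    assert server_ready, "Rasa server did not become ready in time"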

(Further resolved review threads on tests/test_server.py; comments now outdated.)
@akelad (Contributor) commented Nov 15, 2019

@erohmensing is this something that could be part of a patch?

@erohmensing (Contributor, Author) commented Nov 15, 2019

@akelad yes, I don't see why not. Just have to get the test to pass 🙄

@akelad (Contributor) commented Nov 25, 2019

Well, at this point we're releasing 1.5 soon, right? So we may as well merge into master... but we should definitely get it in before the release if possible :D

@erohmensing (Contributor, Author)

Yes, Tobi and I are meeting this morning to try to get this resolved!

@wochinge (Contributor)

So,

  • test how long requests take on Travis
  • try with a small connection timeout and a bigger read timeout (see the sketch below)

If there's no way to make this test robust, we should just drop it.
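For reference, requests lets you tune the two timeouts independently by passing a (connect, read) tuple; the values below are only illustrative:

    import requests

    # Small connect timeout, larger read timeout, both in seconds (example values).
    response = requests.get("http://localhost:5005/status", timeout=(1, 60))
    print(response.status_code)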

@akelad akelad added this to the Rasa 1.7 milestone Jan 23, 2020
@wochinge (Contributor)

@federicotdn I'll have a look and tag you for review then.

@federicotdn (Contributor) left a review comment:

Looks great 👍
I added a few non-blocking comments (no pun intended)

    model_path: Optional[Text] = None
    # pass `None` to run in default executor
    model_path = await loop.run_in_executor(
        None, functools.partial(train_model, **info)
    )
Contributor:

So here we've chosen to run the training function on a separate thread instead of a separate process, right? Is there a chance the training thread might use the GIL in a way that might block Sanic from processing requests? (If Tensorflow does not lock the GIL significantly then I guess it's not a problem)

Contributor:

> Is there a chance the training thread might use the GIL in a way that might block Sanic from processing requests?

That shouldn't matter. When running things in a separate thread, Python will switch between the threads every x milliseconds so that other threads can continue processing.
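If GIL contention ever did become a problem, run_in_executor can be handed a process pool instead of the default thread pool (None). A hypothetical sketch, not part of this PR:

    import asyncio
    import functools
    from concurrent.futures import ProcessPoolExecutor

    def train_model(**info) -> str:
        # Stand-in for the CPU-heavy training call.
        return "models/model.tar.gz"

    async def train_in_subprocess(info: dict) -> str:
        loop = asyncio.get_running_loop()
        with ProcessPoolExecutor() as pool:
            # The function and its arguments must be picklable for a process pool.
            return await loop.run_in_executor(
                pool, functools.partial(train_model, **info)
            )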

Contributor:

    -    loop = asyncio.get_event_loop()
    +    try:
    +        loop = asyncio.get_event_loop()
    +    except RuntimeError:
Contributor:

This can only happen when running from inside the executor, right?

Contributor:

yep, should I add a comment?

Contributor:

added a comment
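For context, the usual fallback when asyncio.get_event_loop() raises RuntimeError (for example when called from a worker thread inside the executor) looks roughly like this; a sketch, not necessarily the exact code added here:

    import asyncio

    try:
        loop = asyncio.get_event_loop()
    except RuntimeError:
        # No event loop exists in this thread, so create and register a new one.
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)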

    @@ -779,14 +780,25 @@ def _get_events_from_request_body(request: Request) -> List[Event]:
         with app.active_training_processes.get_lock():
             app.active_training_processes.value += 1

    -    model_path = await train_async(
    +    info = dict(
Contributor:

Really wish run_in_executor also accepted **kwargs! Using partial was a good workaround 👍

Contributor:

That's also the suggested way in the docs (Ella pointed it out to me :-) )
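For illustration: run_in_executor(executor, func, *args) only forwards positional arguments, so keyword arguments have to be bound beforehand, e.g. with functools.partial. The function and argument names below are made up:

    import asyncio
    import functools

    def train_model(domain: str, config: str, augmentation: int = 50) -> str:
        # Stand-in for a blocking training call.
        return f"models/{domain}-{config}-aug{augmentation}.tar.gz"

    async def main() -> None:
        loop = asyncio.get_running_loop()
        # Bind the keyword argument first, since run_in_executor takes no **kwargs.
        bound = functools.partial(train_model, "domain.yml", "config.yml", augmentation=20)
        model_path = await loop.run_in_executor(None, bound)
        print(model_path)

    asyncio.run(main())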

Comment on lines +156 to +158
# Fake training function which blocks until we tell it to stop blocking
# If we can send a status request while this is blocking, we can be sure that the
# actual training is also not blocking
Contributor:

Would it be possible, in the future, to add a similar test to test_train_status_is_not_blocked_by_training, but using the real rasa.train? This new test is very helpful, but I think it would also be interesting to test our choice of threading vs. processing with the real training function.

Contributor:

Ella tried this, and it makes the test very fragile because you don't know how long the training is going to take on Travis, and all these timings make the test very shaky.
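A rough sketch of the pattern under discussion, simplified and with made-up names (the real test lives in tests/test_server.py): replace training with a function that blocks on an event, and check that the event loop stays responsive while it is "training".

    import asyncio
    import threading

    # Event the fake training blocks on; released at the end of the check.
    training_may_finish = threading.Event()

    def fake_blocking_train(**kwargs) -> str:
        # Blocks until told to finish. If other work (e.g. a /status request)
        # succeeds while this blocks, training does not block request handling.
        training_may_finish.wait()
        return "models/fake-model.tar.gz"

    async def run_check() -> None:
        loop = asyncio.get_running_loop()
        # Start the "training" in the default executor, like the train endpoint does.
        training = loop.run_in_executor(None, fake_blocking_train)

        # The real test would hit /status here; we just confirm the loop is free.
        await asyncio.sleep(0.1)
        assert not training.done()

        training_may_finish.set()
        assert await training == "models/fake-model.tar.gz"

    asyncio.run(run_check())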

@@ -0,0 +1 @@
Requests to ``/model/train`` no longer block other requests to the Rasa server.
Contributor:

Just had this thought: what happens if I call this endpoint and then call it immediately again, while the previous request has not finished?

Contributor:

It will train a second model. Since this doesn't store any information in global variables, we should be fine.
