stop training from blocking other requests to the server #4774

Merged: 27 commits, Jan 28, 2020

Conversation

@erohmensing (Contributor) commented Nov 14, 2019

Proposed changes:

Status (please check what you already did):

  • made PR ready for code review
  • added some tests for the functionality
  • updated the documentation
  • updated the changelog
  • reformat files using black (please check Readme for instructions)

@erohmensing changed the title from "Train http" to "stop training from blocking other requests to the server" on Nov 14, 2019
@charlielin

Not only training: loading a local model via Agent.load_local_model() also blocks other requests. We can resolve it in the same way, by running it in an executor.
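A minimal sketch of the pattern being suggested here, with placeholder names (not code from this PR): offload the blocking load to the event loop's default executor so request handling stays responsive.

    import asyncio

    def load_local_model(model_path: str):
        # Placeholder for the blocking model-loading call mentioned above.
        ...

    async def load_model_without_blocking(model_path: str):
        loop = asyncio.get_running_loop()
        # Run the blocking load in the default thread pool executor so the
        # event loop stays free to serve other requests in the meantime.
        return await loop.run_in_executor(None, load_local_model, model_path)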

@erohmensing (Contributor, Author)

Good point. I've updated the description to reflect that this PR is only addressing some of that issue.

@wochinge (Contributor) left a review comment:

Thanks for figuring out the test 👍 I know it was a lot of pain, but the test will keep us safe forever (even if we were to replace Sanic). This definitely earns you a 🥇😀

(Resolved review threads on rasa/core/train.py and tests/test_server.py; comments now outdated.)
    try:
        server_ready = (
            requests.get("http://localhost:5005/status").status_code == 200
        )
    except requests.exceptions.ConnectionError:
Contributor:

should we have a variable max_tries to avoid having a test which runs forever?
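A sketch of what a bounded retry could look like (illustrative only; MAX_TRIES and the sleep interval are assumptions, not code from this PR):

    import time
    import requests

    MAX_TRIES = 30  # assumed upper bound so the test fails instead of hanging

    server_ready = False
    for _ in range(MAX_TRIES):
        try:
            server_ready = (
                requests.get("http://localhost:5005/status").status_code == 200
            )
        except requests.exceptions.ConnectionError:
            pass
        if server_ready:
            break
        time.sleep(1)

    assert server_ready, "Rasa server did not become ready in time"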

(Further resolved review threads on tests/test_server.py; comments now outdated.)
@akelad (Contributor) commented Nov 15, 2019

@erohmensing is this something that could be part of a patch?

@erohmensing (Contributor, Author) commented Nov 15, 2019

@akelad yes, I don't see why not. Just have to get the test to pass 🙄

@akelad (Contributor) commented Nov 25, 2019

Well, at this point we're releasing 1.5 soon, right? So we may as well merge into master... but we should definitely get it in before the release if possible :D

@erohmensing (Contributor, Author)

Yes, Tobi and I are meeting this morning to try to get this resolved!

@wochinge (Contributor)

So,

  • test how long requests take on Travis
  • try with a small connection timeout and a bigger read timeout (see the sketch below)

If there's no way to make this test robust, we should just drop it.
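For reference, requests lets you tune the two timeouts independently by passing a (connect, read) tuple; the values below are only illustrative:

    import requests

    # Small connect timeout, larger read timeout, both in seconds (example values).
    response = requests.get("http://localhost:5005/status", timeout=(1, 60))
    print(response.status_code)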

@akelad akelad added this to the Rasa 1.7 milestone Jan 23, 2020
@wochinge (Contributor)

@federicotdn I'll have a look and tag you for review then.

@federicotdn (Contributor) left a review comment:

Looks great 👍
I added a few non-blocking comments (no pun intended)

    model_path: Optional[Text] = None
    # pass `None` to run in default executor
    model_path = await loop.run_in_executor(
        None, functools.partial(train_model, **info)
    )
Contributor:

So here we've chosen to run the training function on a separate thread instead of a separate process, right? Is there a chance the training thread might use the GIL in a way that might block Sanic from processing requests? (If Tensorflow does not lock the GIL significantly then I guess it's not a problem)

Contributor:

> Is there a chance the training thread might use the GIL in a way that might block Sanic from processing requests?

That shouldn't matter. When running things in a separate thread, Python will switch between the threads every x milliseconds so that other threads can continue processing.
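If GIL contention ever did become a problem, run_in_executor can be handed a process pool instead of the default thread pool (None). A hypothetical sketch, not part of this PR:

    import asyncio
    import functools
    from concurrent.futures import ProcessPoolExecutor

    def train_model(**info) -> str:
        # Stand-in for the CPU-heavy training call.
        return "models/model.tar.gz"

    async def train_in_subprocess(info: dict) -> str:
        loop = asyncio.get_running_loop()
        with ProcessPoolExecutor() as pool:
            # The function and its arguments must be picklable for a process pool.
            return await loop.run_in_executor(
                pool, functools.partial(train_model, **info)
            )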

Contributor:

    -    loop = asyncio.get_event_loop()
    +    try:
    +        loop = asyncio.get_event_loop()
    +    except RuntimeError:
Contributor:

This can only happen when running from inside the executor, right?

Contributor:

yep, should I add a comment?

Contributor:

added a comment
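For context, the usual fallback when asyncio.get_event_loop() raises RuntimeError (for example when called from a worker thread inside the executor) looks roughly like this; a sketch, not necessarily the exact code added here:

    import asyncio

    try:
        loop = asyncio.get_event_loop()
    except RuntimeError:
        # No event loop exists in this thread, so create and register a new one.
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)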

    @@ -779,14 +780,25 @@ def _get_events_from_request_body(request: Request) -> List[Event]:
         with app.active_training_processes.get_lock():
             app.active_training_processes.value += 1

    -    model_path = await train_async(
    +    info = dict(
Contributor:

Really wish run_in_executor also accepted **kwargs! Using partial was a good workaround 👍

Contributor:

That's also the suggested way in the docs (Ella pointed it out to me :-) )
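For illustration: run_in_executor(executor, func, *args) only forwards positional arguments, so keyword arguments have to be bound beforehand, e.g. with functools.partial. The function and argument names below are made up:

    import asyncio
    import functools

    def train_model(domain: str, config: str, augmentation: int = 50) -> str:
        # Stand-in for a blocking training call.
        return f"models/{domain}-{config}-aug{augmentation}.tar.gz"

    async def main() -> None:
        loop = asyncio.get_running_loop()
        # Bind the keyword argument first, since run_in_executor takes no **kwargs.
        bound = functools.partial(train_model, "domain.yml", "config.yml", augmentation=20)
        model_path = await loop.run_in_executor(None, bound)
        print(model_path)

    asyncio.run(main())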

Comment on lines +156 to +158
# Fake training function which blocks until we tell it to stop blocking
# If we can send a status request while this is blocking, we can be sure that the
# actual training is also not blocking
Contributor:

Would it be possible, in the future, to add a similar test to test_train_status_is_not_blocked_by_training, but using the real rasa.train? This new test is very helpful, but I think it would also be interesting to test our choice of threading vs. processing with the real training function.

Contributor:

Ella tried this, and it makes the test very fragile because you don't know how long the training is going to take on Travis, and all these timings make the test very shaky.
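A rough sketch of the pattern under discussion, simplified and with made-up names (the real test lives in tests/test_server.py): replace training with a function that blocks on an event, and check that the event loop stays responsive while it is "training".

    import asyncio
    import threading

    # Event the fake training blocks on; released at the end of the check.
    training_may_finish = threading.Event()

    def fake_blocking_train(**kwargs) -> str:
        # Blocks until told to finish. If other work (e.g. a /status request)
        # succeeds while this blocks, training does not block request handling.
        training_may_finish.wait()
        return "models/fake-model.tar.gz"

    async def run_check() -> None:
        loop = asyncio.get_running_loop()
        # Start the "training" in the default executor, like the train endpoint does.
        training = loop.run_in_executor(None, fake_blocking_train)

        # The real test would hit /status here; we just confirm the loop is free.
        await asyncio.sleep(0.1)
        assert not training.done()

        training_may_finish.set()
        assert await training == "models/fake-model.tar.gz"

    asyncio.run(run_check())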

@@ -0,0 +1 @@
Requests to ``/model/train`` no longer block other requests to the Rasa server.
Contributor:

Just had this thought: what happens if I call this endpoint and then call it immediately again, while the previous request has not finished?

Contributor:

It will train a second model. Since this doesn't store any information in global variables, we should be fine.
