This repository has been archived by the owner on Apr 24, 2021. It is now read-only.
(apologies for the mostly copy/paste from the mailing list)
At the moment, there is a useful but simple approach: restarting containers when they die.
On a very simple level, the two obvious approaches are:
- Let single containers die alone
- If one container dies, kill all
However, it would then be useful to add slightly different rules for stopping cleanly versus dying unexpectedly:
- Let containers stop alone, kill all if one dies
- If one container dies, die alone
- If one container stops, kill all
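To make the distinction between stopping and dying concrete, here is a minimal sketch of how such a rule could be modelled. All names (`ExitKind`, `Reaction`, `ExitPolicy`, `containers_to_kill`) are hypothetical and only for illustration; nothing here reflects an existing implementation.

```python
from dataclasses import dataclass
from enum import Enum


class ExitKind(Enum):
    STOPPED = "stopped"  # clean, expected exit
    DIED = "died"        # unexpected exit


class Reaction(Enum):
    ALONE = "alone"        # leave the other containers running
    KILL_ALL = "kill_all"  # take the rest of the group down too


@dataclass(frozen=True)
class ExitPolicy:
    """Hypothetical rule: one reaction per kind of exit."""
    on_stop: Reaction
    on_death: Reaction


def containers_to_kill(policy, exited, kind, running):
    """Return the other running containers that should be killed."""
    reaction = policy.on_stop if kind is ExitKind.STOPPED else policy.on_death
    if reaction is Reaction.KILL_ALL:
        return [name for name in running if name != exited]
    return []


# "Let containers stop alone, kill all if one dies"
stop_alone_die_together = ExitPolicy(
    on_stop=Reaction.ALONE,
    on_death=Reaction.KILL_ALL,
)
```

The other rules in the list would just be different combinations of `on_stop` and `on_death`.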
Then there's an additional rule, which would simply be to restart containers if they die.
Then we may have different rules for different containers: you may want to stop everything if your computation finishes but restart a database if it crashes, or restart a stateless webserver but panic and restart everything if the load balancer crashes.
Beyond this are different strategies for restarting: try restarting one of your webservers 5 times and, if that doesn't work, kill all of them and start again, but only 4 times in 60 seconds before terminating. If the database crashes, on the other hand, kill all of the webservers, start the DB again and then fire up the webservers again.
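The "only 4 times in 60 seconds" part could be sketched as a sliding-window counter. This is a rough illustration only; `RestartLimiter` and its parameters are hypothetical names, and the caller is assumed to pass in the current time (for example from `time.monotonic()`).

```python
from collections import deque


class RestartLimiter:
    """Allow at most `max_restarts` restarts within any `period` seconds."""

    def __init__(self, max_restarts, period):
        self.max_restarts = max_restarts
        self.period = period
        self._times = deque()  # timestamps of restarts still inside the window

    def allow(self, now):
        """Record a restart at time `now`; return False if the limit is hit."""
        # Forget restarts that have fallen out of the window.
        while self._times and now - self._times[0] >= self.period:
            self._times.popleft()
        if len(self._times) >= self.max_restarts:
            return False  # too many restarts recently: stop trying
        self._times.append(now)
        return True
```

With `RestartLimiter(4, 60)`, four restarts inside a minute are allowed and a fifth is refused until the oldest one ages out of the window.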
My examples may not match the behaviour you'd want, but I'm sure we can all think of cases where we'd want to start/stop/restart things differently depending on how things die.
I think it would be great if this was extended to consider trees of supervisors and restart strategies as used heavily in Erlang.
If your web serving containers die, try restarting them, but no more often than 4 times in a minute. If that limit is exceeded, kill the group and let the layer above deal with the problem. The layer above then kills all containers and restarts them.
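A rough sketch of that escalation, under some simplifying assumptions: `Supervisor` and the action names are hypothetical, the functions only return decisions rather than touching real containers, and the parent here simply restarts the failed group instead of everything it supervises. What the root should do when it gives up is deliberately left as a plain `"give_up"` decision.

```python
class Supervisor:
    """Hypothetical supervisor that restarts children up to a limit."""

    def __init__(self, name, max_restarts, period, parent=None):
        self.name = name
        self.max_restarts = max_restarts
        self.period = period
        self.parent = parent
        self._restarts = []  # timestamps of recent restarts

    def child_died(self, child, now):
        """Return a list of (supervisor, action, target) decisions."""
        # Keep only the restarts that are still inside the time window.
        self._restarts = [t for t in self._restarts if now - t < self.period]
        if len(self._restarts) < self.max_restarts:
            self._restarts.append(now)
            return [(self.name, "restart", child)]
        # Limit exceeded: kill this group and hand the problem upwards.
        decisions = [(self.name, "kill_group", self.name)]
        if self.parent is None:
            decisions.append((self.name, "give_up", self.name))
        else:
            decisions += self.parent.child_died(self.name, now)
        return decisions


root = Supervisor("root", max_restarts=1, period=60)
web = Supervisor("web", max_restarts=4, period=60, parent=root)
```

Here the first four deaths of a web container within a minute are handled by `web` itself; the fifth makes `web` kill its group and pass the failure to `root`.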
There are probably several different stages to this, each valuable:
- Some basic restart strategies, "restart" or "don't restart"
- Groups, with a few extra strategies (one_for_one, all_for_one, rest_for_one)
- Trees of these groups
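For reference, the three group strategies differ only in which children get restarted when one fails. A small sketch following the Erlang supervisor semantics, with `children_to_restart` as a hypothetical helper and `children` assumed to be listed in start order:

```python
def children_to_restart(strategy, children, failed):
    """Return the children to restart, given the one that failed.

    `children` is the group in start order.
    """
    if failed not in children:
        raise ValueError(f"unknown child: {failed}")
    if strategy == "one_for_one":
        # Only the failed child is restarted.
        return [failed]
    if strategy == "all_for_one":
        # Every child in the group is stopped and restarted.
        return list(children)
    if strategy == "rest_for_one":
        # The failed child and everything started after it.
        return children[children.index(failed):]
    raise ValueError(f"unknown strategy: {strategy}")
```

So with a group started as `["db", "web1", "web2"]`, a `rest_for_one` failure of `db` restarts everything, while a failure of `web1` leaves `db` alone.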
One bit I'm not sure about is what to do when the root dies. Personally I'd like the instance to be stopped given my use cases but maybe others would rather keep it up to inspect?
Some background on supervisor trees:
A quick example of what this could look like in the manifest file (may contain syntax errors):