✨Autoscaling: terminate long pending EC2s #5832
Conversation
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files:

@@            Coverage Diff            @@
##           master   #5832      +/-  ##
=========================================
- Coverage    84.5%   80.6%     -4.0%
=========================================
  Files          10    1366     +1356
  Lines         214   56724    +56510
  Branches       25    1284     +1259
=========================================
+ Hits          181   45755    +45574
- Misses         23   10702    +10679
- Partials       10     267      +257

Flags with carried forward coverage won't be shown.
Force-pushed from c555f5b to 80d5512.
Looking good, thanks!
👍
services/autoscaling/src/simcore_service_autoscaling/modules/auto_scaling_core.py
👍
Sorry, what does "missing autoscaling ENV variable: EC2_INSTANCES_TIME_BEFORE_TERMINATION" in your PR description mean? Should this variable be removed?
@mrnicegyu11 that variable was in osparc-config but not in the docker-compose in osparc-simcore. Weird, and not the end of the world, but now it is in.
Thanks, looks good.
services/autoscaling/tests/unit/test_modules_auto_scaling_computational.py
)

# 2. running again several times the autoscaler, the node does not join
for i in range(7):
The 7 could be set via a fixture.
Well, I guess more like a faker.pyint(min_value=3, max_value=9), but that is not really necessary. How would I call the fixture?
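To answer the "how would I call the fixture?" question, here is a minimal sketch (fixture and test names are hypothetical, and `random.randint` stands in for `faker.pyint` to keep it stdlib-only): pytest injects a fixture's return value into any test that declares a parameter with the fixture's name.

```python
import random

import pytest


def pick_num_runs(rng: random.Random) -> int:
    # Stand-in for faker.pyint(min_value=3, max_value=9).
    return rng.randint(3, 9)


@pytest.fixture
def num_autoscaler_runs() -> int:
    # Hypothetical fixture; in simcore one would likely call
    # faker.pyint(min_value=3, max_value=9) here instead.
    return pick_num_runs(random.Random())


def test_long_pending_node_does_not_join(num_autoscaler_runs: int) -> None:
    # The fixture is "called" simply by naming it as a parameter.
    for _ in range(num_autoscaler_runs):
        ...  # run one autoscaler iteration here
    assert 3 <= num_autoscaler_runs <= 9
```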
+ app_settings.AUTOSCALING_EC2_INSTANCES.EC2_INSTANCES_MAX_START_TIME
- now
).total_seconds() + 1
print(
Question for my learning: in pytest, does one print instead of using loggers? Is there a reason?
Well, when you run pytest -s it shows prints; otherwise you run pytest --log-cli-level=INFO and then you see the logs. Most of the time in simcore we use print in tests, not really sure why.
What do these changes do?
This PR fixes #4880.
Details
The autoscaling service creates EC2 instances on demand when it detects services with the correct service labels that are missing resources.
Normally the EC2 instances start and join either the simcore docker swarm (dynamic mode) or the computational cluster docker swarm (computational mode). Usually this process takes about 30 seconds.
Sometimes, the EC2 instance fails to join the swarm for whatever reason (the instance is broken, e.g. AWS's own checks are failing, or some networking issue arises). These instances then stay there forever until manual intervention. They also tend to block the user from running services, as the autoscaling service still "thinks" the instance will eventually join the party.
This PR now leverages the
EC2_INSTANCES_MAX_START_TIME
ENV variable: if an EC2 instance takes more than this time to join the swarm, it is terminated right away, and if there is still a service waiting, another instance is created through the usual process.
IMPORTANT NOTE: the time it usually takes to start a machine and join the swarm is about 30 seconds. Nevertheless, the value used now is 1 minute and shall remain that way, so that we do not start terminating EC2s that took only slightly more than 30 seconds.
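The detection rule described above can be sketched as follows (this is an illustrative simplification, not the actual simcore implementation; the type and function names are made up):

```python
import datetime
from dataclasses import dataclass


@dataclass
class PendingInstance:
    # Hypothetical minimal model of an EC2 instance that has not yet
    # joined the docker swarm.
    instance_id: str
    launch_time: datetime.datetime  # tz-aware UTC timestamp


def find_long_pending_instances(
    pending: list[PendingInstance],
    max_start_time: datetime.timedelta,  # e.g. EC2_INSTANCES_MAX_START_TIME
    now: datetime.datetime,
) -> list[PendingInstance]:
    # Instances that stayed pending longer than the allowed start time
    # are considered broken and become candidates for termination.
    return [i for i in pending if (now - i.launch_time) > max_start_time]
```

With a 1-minute max start time, an instance launched 2 minutes ago would be flagged for termination, while one launched 20 seconds ago would be left alone to finish joining.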
Driving tests:
test_long_pending_ec2_is_detected_as_broken_terminated_and_restarted
in services/autoscaling/tests/unit/test_modules_auto_scaling_computational.py and services/autoscaling/tests/unit/test_modules_auto_scaling_dynamic.py
Related issue/s
How to test
Dev-ops checklist
- EC2_INSTANCES_MAX_START_TIME
- WORKERS_EC2_INSTANCES_MAX_START_TIME
- EC2_INSTANCES_TIME_BEFORE_TERMINATION
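For deployment, these variables might look roughly like the fragment below in an env file. The values are purely illustrative (the 1-minute figure for EC2_INSTANCES_MAX_START_TIME comes from the PR description above; the others are placeholders), and the exact duration format depends on how the service's settings library parses timedeltas.

```text
EC2_INSTANCES_MAX_START_TIME=00:01:00
WORKERS_EC2_INSTANCES_MAX_START_TIME=00:01:00
EC2_INSTANCES_TIME_BEFORE_TERMINATION=00:03:00
```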