
✨Autoscaling: terminate long pending EC2s #5832

Conversation

@sanderegg sanderegg commented May 15, 2024

What do these changes do?

This PR fixes #4880.

Details

The autoscaling service creates EC2 instances on demand when it detects services with the correct service labels that are missing resources.
Normally the EC2 instances start and join either the simcore docker swarm (dynamic mode) or the computational cluster docker swarm (computational mode). This process usually takes about 30 seconds.
Sometimes an EC2 instance fails to join the swarm for whatever reason (the instance is broken, e.g. AWS's own checks fail, or a networking issue arises). These instances then stay there forever until manual intervention, and they tend to block users from running services because the autoscaling service still "thinks" the instance will eventually join the party.

This PR now leverages the EC2_INSTANCES_MAX_START_TIME ENV variable: if an EC2 instance takes longer than this time to join the swarm, it is terminated right away, and if a service is still waiting, another instance is created through the usual process.
IMPORTANT NOTE: starting a machine and having it join the swarm usually takes about 30 seconds. Nevertheless, the value is now set to 1 minute and shall remain that way, so that we do not start terminating EC2s that take slightly more than 30 seconds.
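The rule described above can be sketched roughly as follows. This is a minimal illustration, not the service's actual code: the function name, the dict shape with a `launch_time` key, and the termination step are all assumptions for clarity.

```python
import datetime

# Mirrors the ENV variable discussed above; the 1-minute value is the one
# chosen in this PR.
EC2_INSTANCES_MAX_START_TIME = datetime.timedelta(minutes=1)


def find_long_pending_instances(pending_instances, now):
    """Return instances that have been pending longer than the allowed start time.

    Such instances are considered broken and would be terminated, letting the
    usual autoscaling process start a replacement for any still-waiting service.
    """
    return [
        instance
        for instance in pending_instances
        if now - instance["launch_time"] > EC2_INSTANCES_MAX_START_TIME
    ]
```

An instance launched two minutes ago would be flagged, while one launched 30 seconds ago would not.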

Driving tests:

test_long_pending_ec2_is_detected_as_broken_terminated_and_restarted in services/autoscaling/tests/unit/test_modules_auto_scaling_computational.py and services/autoscaling/tests/unit/test_modules_auto_scaling_dynamic.py

Related issue/s

How to test

Dev-ops checklist

@sanderegg sanderegg added the a:autoscaling autoscaling service in simcore's stack label May 15, 2024
@sanderegg sanderegg added this to the Leeroy Jenkins milestone May 15, 2024
@sanderegg sanderegg self-assigned this May 15, 2024

codecov bot commented May 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 80.6%. Comparing base (cafbf96) to head (a73aef9).
Report is 206 commits behind head on master.

Additional details and impacted files


@@            Coverage Diff            @@
##           master   #5832      +/-   ##
=========================================
- Coverage    84.5%   80.6%    -4.0%     
=========================================
  Files          10    1366    +1356     
  Lines         214   56724   +56510     
  Branches       25    1284    +1259     
=========================================
+ Hits          181   45755   +45574     
- Misses         23   10702   +10679     
- Partials       10     267     +257     
Flag              Coverage Δ
integrationtests  58.4% <ø> (?)
unittests         79.7% <100.0%> (-4.9%) ⬇️

Flags with carried forward coverage won't be shown.

Files Coverage Δ
...g/src/simcore_service_autoscaling/core/settings.py 100.0% <100.0%> (ø)
...oscaling/src/simcore_service_autoscaling/models.py 100.0% <100.0%> (ø)
...e_service_autoscaling/modules/auto_scaling_core.py 94.8% <100.0%> (ø)
...c/simcore_service_clusters_keeper/core/settings.py 96.1% <ø> (ø)

... and 1338 files with indirect coverage changes

@sanderegg sanderegg force-pushed the autoscaling/terminate-long-pending-instances branch from c555f5b to 80d5512 Compare May 16, 2024 12:05

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

@sanderegg sanderegg marked this pull request as ready for review May 16, 2024 13:00
@matusdrobuliak66 (Contributor) left a comment:

Looking good, thanks!

@GitHK (Contributor) left a comment:

👍

@odeimaiz (Member) left a comment:

👍

@mrnicegyu11 (Member) left a comment:

sorry, what does missing autoscaling ENV variable: EC2_INSTANCES_TIME_BEFORE_TERMINATION in your PR description mean? Should this variable be removed?

@sanderegg (Member, Author) replied:
sorry, what does missing autoscaling ENV variable: EC2_INSTANCES_TIME_BEFORE_TERMINATION in your PR description mean? Should this variable be removed?

@mrnicegyu11 that variable was in osparc-config but not in the docker-compose in osparc-simcore. Weird, not the end of the world, but now it is in.

@mrnicegyu11 (Member) left a comment:

Thanks, looks good.

)

# 2. running again several times the autoscaler, the node does not join
for i in range(7):
A Member left a comment:

7 could be set via a fixture

The Member Author replied:

well I guess more like a faker.pyint(min_value=3, max_value=9) but that is not really necessary. how would I call the fixture?
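To illustrate the suggestion above: a parametrizing fixture could supply the loop count instead of the hard-coded 7. This is only a sketch of the idea; the fixture name `num_retries` is invented here, and the standard `random` module stands in for `faker.pyint(min_value=3, max_value=9)`.

```python
import random

import pytest


@pytest.fixture
def num_retries() -> int:
    # stands in for faker.pyint(min_value=3, max_value=9)
    return random.randint(3, 9)


def test_long_pending_node(num_retries: int):
    # the fixture is requested by name in the test signature;
    # pytest injects its return value, so no direct call is needed
    for _ in range(num_retries):
        ...  # run the autoscaler; the node does not join
```

As the Author notes, this adds little here: the exact number of iterations does not matter to the test, only that the autoscaler runs several times without the node joining.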

+ app_settings.AUTOSCALING_EC2_INSTANCES.EC2_INSTANCES_MAX_START_TIME
- now
).total_seconds() + 1
print(
A Member left a comment:

question for my learning: in pytests one prints instead of using loggers? is there a reason?

The Member Author replied:

well when you run pytest -s it shows prints, otherwise you run pytest --log-cli-level=INFO and then you see the logs. most of the time in simcore we used print in tests. not really sure why.
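For context, the arithmetic in the diff fragment quoted above (waiting just past EC2_INSTANCES_MAX_START_TIME so the instance counts as long-pending) can be sketched like this. The concrete timestamps are made up for illustration.

```python
import datetime

EC2_INSTANCES_MAX_START_TIME = datetime.timedelta(minutes=1)

launch_time = datetime.datetime(2024, 5, 16, 12, 0, 0)
now = datetime.datetime(2024, 5, 16, 12, 0, 30)

# seconds until the instance exceeds the allowed start time,
# plus 1 second of margin so the test lands safely past the threshold
wait_seconds = (
    launch_time + EC2_INSTANCES_MAX_START_TIME - now
).total_seconds() + 1
print(wait_seconds)  # 31.0
```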

@sanderegg sanderegg merged commit 8abd506 into ITISFoundation:master May 16, 2024
56 checks passed
@sanderegg sanderegg deleted the autoscaling/terminate-long-pending-instances branch May 16, 2024 14:00
@matusdrobuliak66 matusdrobuliak66 mentioned this pull request Jun 12, 2024
30 tasks
Labels
a:autoscaling autoscaling service in simcore's stack
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Autoscaling: mark pending instances as failed after they do not join the swarm for too long
5 participants