
✨Autoscaling: terminate long pending EC2s #5832

Conversation

@sanderegg sanderegg commented May 15, 2024

What do these changes do?

This PR fixes #4880.

Details

The autoscaling service creates EC2 instances on demand when it detects services with the correct service labels that are missing resources.
Normally the EC2 instances start and join either the simcore docker swarm (dynamic mode) or the computational cluster docker swarm (computational mode). This process usually takes about 30 seconds.
Sometimes an EC2 instance fails to join the swarm for whatever reason (the instance is broken, e.g. AWS's own checks fail, or a networking issue arises). These instances then stay there forever until manual intervention, and they tend to block users from running services because the autoscaling service still "thinks" the instance will eventually join the party.

This PR now leverages the EC2_INSTANCES_MAX_START_TIME ENV variable: if an EC2 instance takes longer than this time to join the swarm, it is terminated right away, and if a service is still waiting, another instance is created through the usual process.
IMPORTANT NOTE: starting a machine and having it join the swarm usually takes about 30 seconds. Nevertheless, the value is now set to 1 minute and shall remain that way, so that we do not start terminating EC2s that take slightly more than 30 seconds.
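The rule described above can be sketched roughly as follows. This is a minimal illustration, not the service's actual code: the function name, the dict shape with a `launch_time` key, and the termination step are all assumptions for clarity.

```python
import datetime

# Mirrors the ENV variable discussed above; the 1-minute value is the one
# chosen in this PR.
EC2_INSTANCES_MAX_START_TIME = datetime.timedelta(minutes=1)


def find_long_pending_instances(pending_instances, now):
    """Return instances that have been pending longer than the allowed start time.

    Such instances are considered broken and would be terminated, letting the
    usual autoscaling process start a replacement for any still-waiting service.
    """
    return [
        instance
        for instance in pending_instances
        if now - instance["launch_time"] > EC2_INSTANCES_MAX_START_TIME
    ]
```

An instance launched two minutes ago would be flagged, while one launched 30 seconds ago would not.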

Driving tests:

test_long_pending_ec2_is_detected_as_broken_terminated_and_restarted in services/autoscaling/tests/unit/test_modules_auto_scaling_computational.py and services/autoscaling/tests/unit/test_modules_auto_scaling_dynamic.py

Related issue/s

How to test

Dev-ops checklist

@sanderegg sanderegg added the a:autoscaling autoscaling service in simcore's stack label May 15, 2024
@sanderegg sanderegg added this to the Leeroy Jenkins milestone May 15, 2024
@sanderegg sanderegg self-assigned this May 15, 2024

codecov bot commented May 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 80.6%. Comparing base (cafbf96) to head (a73aef9).
Report is 206 commits behind head on master.

Additional details and impacted files


@@            Coverage Diff            @@
##           master   #5832      +/-   ##
=========================================
- Coverage    84.5%   80.6%    -4.0%     
=========================================
  Files          10    1366    +1356     
  Lines         214   56724   +56510     
  Branches       25    1284    +1259     
=========================================
+ Hits          181   45755   +45574     
- Misses         23   10702   +10679     
- Partials       10     267     +257     
Flag              Coverage Δ
integrationtests  58.4% <ø> (?)
unittests         79.7% <100.0%> (-4.9%) ⬇️

Flags with carried forward coverage won't be shown.

Files Coverage Δ
...g/src/simcore_service_autoscaling/core/settings.py 100.0% <100.0%> (ø)
...oscaling/src/simcore_service_autoscaling/models.py 100.0% <100.0%> (ø)
...e_service_autoscaling/modules/auto_scaling_core.py 94.8% <100.0%> (ø)
...c/simcore_service_clusters_keeper/core/settings.py 96.1% <ø> (ø)

... and 1338 files with indirect coverage changes

@sanderegg sanderegg force-pushed the autoscaling/terminate-long-pending-instances branch from c555f5b to 80d5512 Compare May 16, 2024 12:05

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

@sanderegg sanderegg marked this pull request as ready for review May 16, 2024 13:00
@matusdrobuliak66 (Contributor) left a comment:

Looking good, thanks!

@GitHK (Contributor) left a comment:

👍

@odeimaiz (Member) left a comment:

👍

@mrnicegyu11 (Member) left a comment:

sorry, what does missing autoscaling ENV variable: EC2_INSTANCES_TIME_BEFORE_TERMINATION in your PR description mean? Should this variable be removed?

@sanderegg (Member, Author) replied:
sorry, what does missing autoscaling ENV variable: EC2_INSTANCES_TIME_BEFORE_TERMINATION in your PR description mean? Should this variable be removed?

@mrnicegyu11 that variable was in osparc-config but not in the docker-compose in osparc-simcore. Weird, not the end of the world, but now it is in.

@mrnicegyu11 (Member) left a comment:

Thanks, looks good.

)

# 2. running again several times the autoscaler, the node does not join
for i in range(7):
A Member left a comment:

7 could be set via a fixture

The Member Author replied:

well I guess more like a faker.pyint(min_value=3, max_value=9) but that is not really necessary. how would I call the fixture?
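To illustrate the suggestion above: a parametrizing fixture could supply the loop count instead of the hard-coded 7. This is only a sketch of the idea; the fixture name `num_retries` is invented here, and the standard `random` module stands in for `faker.pyint(min_value=3, max_value=9)`.

```python
import random

import pytest


@pytest.fixture
def num_retries() -> int:
    # stands in for faker.pyint(min_value=3, max_value=9)
    return random.randint(3, 9)


def test_long_pending_node(num_retries: int):
    # the fixture is requested by name in the test signature;
    # pytest injects its return value, so no direct call is needed
    for _ in range(num_retries):
        ...  # run the autoscaler; the node does not join
```

As the Author notes, this adds little here: the exact number of iterations does not matter to the test, only that the autoscaler runs several times without the node joining.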

+ app_settings.AUTOSCALING_EC2_INSTANCES.EC2_INSTANCES_MAX_START_TIME
- now
).total_seconds() + 1
print(
A Member left a comment:

question for my learning: in pytests one prints instead of using loggers? is there a reason?

The Member Author replied:

well when you run pytest -s it shows prints, otherwise you run pytest --log-cli-level=INFO and then you see the logs. most of the time in simcore we used print in tests. not really sure why.
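For context, the arithmetic in the diff fragment quoted above (waiting just past EC2_INSTANCES_MAX_START_TIME so the instance counts as long-pending) can be sketched like this. The concrete timestamps are made up for illustration.

```python
import datetime

EC2_INSTANCES_MAX_START_TIME = datetime.timedelta(minutes=1)

launch_time = datetime.datetime(2024, 5, 16, 12, 0, 0)
now = datetime.datetime(2024, 5, 16, 12, 0, 30)

# seconds until the instance exceeds the allowed start time,
# plus 1 second of margin so the test lands safely past the threshold
wait_seconds = (
    launch_time + EC2_INSTANCES_MAX_START_TIME - now
).total_seconds() + 1
print(wait_seconds)  # 31.0
```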

@sanderegg sanderegg merged commit 8abd506 into ITISFoundation:master May 16, 2024
56 checks passed
@sanderegg sanderegg deleted the autoscaling/terminate-long-pending-instances branch May 16, 2024 14:00
@matusdrobuliak66 matusdrobuliak66 mentioned this pull request Jun 12, 2024
30 tasks
Labels
a:autoscaling autoscaling service in simcore's stack
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Autoscaling: mark pending instances as failed after they do not join the swarm for too long
5 participants