This repository has been archived by the owner on Nov 1, 2023. It is now read-only.

Abstract out node disposal #1686

Merged

merged 9 commits into microsoft:main on Mar 8, 2022

Conversation

tevoinea
Member

@tevoinea tevoinea commented Mar 2, 2022

Summary of the Pull Request

Introduces a new way for nodes to be reaped from a scale set, allowing Azure auto scale to scale in nodes when appropriate.

PR Checklist

Info on Pull Request

This PR includes:

  • Abstracting out how nodes are disposed (Azure auto-scale scale-in or node delete/reimage)
  • The capability for onefuzz to correctly handle Azure scaling the nodes in
  • Removal of the resize state that is no longer necessary
  • Tuning auto scale rules according to Azure guidance
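The disposal abstraction in the first bullet can be sketched as a strategy enum. This is an illustrative sketch only; `NodeDisposalStrategy` and `dispose` here are hypothetical names, not necessarily the PR's actual types:

```python
from enum import Enum

# Hypothetical sketch of the abstraction described above: instead of always
# deleting/reimaging a node, it can also be released to Azure auto scale's
# scale-in action.
class NodeDisposalStrategy(Enum):
    scale_in = "scale_in"          # let Azure auto scale remove the node
    decommission = "decommission"  # delete or reimage the node directly

def dispose(node_id: str, strategy: NodeDisposalStrategy) -> str:
    if strategy is NodeDisposalStrategy.scale_in:
        return f"release scale-in protection on {node_id}"
    return f"delete/reimage {node_id}"
```

With this split, the scaleset code can pick the strategy per pool instead of hard-coding delete/reimage.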

Validation Steps Performed

  1. Create a new scale set with some nodes
  2. Do not submit any jobs
  3. Watch auto scale delete the nodes
  4. Verify the size of the scale set is accurately represented in the Azure table
  5. Submit a job
  6. Watch the scale set grow and items from the pool queue being consumed
  7. Repeat step 4
  8. Verify the protection policy on the busy nodes is set to "Scale in actions" in the "Instances" tab of the scale set
  9. Once the job is complete, watch the nodes scale back in

# When there's more than 1 message in the pool queue
operator=ComparisonOperationType.GREATER_THAN,
operator=ComparisonOperationType.GREATER_THAN_OR_EQUAL,
Member Author

This was a bug: with GREATER_THAN, new nodes would not spin up if there was only 1 message in the queue.
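A minimal sketch of why the operator change matters, using a plain Python predicate rather than the Azure trigger machinery (the `should_scale_out` helper is hypothetical):

```python
# With a threshold of 1, GREATER_THAN never fires for a queue holding exactly
# one message, so that lone job would never get a node. GREATER_THAN_OR_EQUAL
# fixes this. inclusive=True models GREATER_THAN_OR_EQUAL, False models
# GREATER_THAN.
def should_scale_out(queue_length: int, threshold: int = 1,
                     inclusive: bool = True) -> bool:
    return queue_length >= threshold if inclusive else queue_length > threshold

# Old rule: one queued message does not trigger scale-out.
assert should_scale_out(1, inclusive=False) is False
# Fixed rule: one queued message triggers scale-out.
assert should_scale_out(1, inclusive=True) is True
```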

@@ -123,23 +123,47 @@ def create_auto_scale_profile(min: int, max: int, queue_uri: str) -> AutoscalePr
metric_trigger=MetricTrigger(
metric_name="ApproximateMessageCount",
metric_resource_uri=queue_uri,
# Check every minute
time_grain=timedelta(minutes=1),
# Check every 15 minutes
Member Author
@tevoinea tevoinea Mar 2, 2022

Some of these timing numbers were tuned after reading this guidance on autoscaling: https://docs.microsoft.com/en-us/azure/architecture/best-practices/auto-scaling
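The tuned timings can be sketched as plain values. The field names mirror azure-mgmt-monitor's `MetricTrigger`; the 15-minute `time_window` is an assumption based on the "Check every 15 minutes" comment in the diff, not confirmed by the visible hunk:

```python
from datetime import timedelta

# Illustrative sketch of the tuned scale-out trigger timings (assumed values):
# sample the queue length every minute, then aggregate those samples over a
# 15-minute window before deciding to scale, per the linked Azure guidance.
scale_out_trigger = {
    "metric_name": "ApproximateMessageCount",
    "time_grain": timedelta(minutes=1),    # sampling interval
    "time_window": timedelta(minutes=15),  # aggregation window
}
```

Aggregating over a window longer than the sampling grain smooths out short queue spikes so the scale set does not flap.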

Contributor

Please add this link as a comment in the code.

Contributor

These are just default values, right? We're (eventually) providing a way for the admin to customize these values? Or do we expect them to use the portal for tweaks?

Member Author

I think the values that we should include in the CLI are:

  • max/default - required to create the scale set in the first place
  • min - optional but defaults to 1
  • scale-out-amount/scale-out-cooldown/scale-in-amount/scale-in-cooldown - these vary by scale set and use case, so it's convenient to have them configurable. For example, busier systems will want bigger scale-{in|out}-amount values; if nodes take long to set up, they'll want longer cooldowns.

We can keep the current values as defaults since I think they're appropriate for a less busy deployment.
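The knobs above could surface in the CLI roughly like this argparse sketch; every option name and default here is illustrative, not onefuzz's actual interface:

```python
import argparse

# Hypothetical sketch of the CLI knobs discussed above. Cooldowns are in
# minutes; defaults are placeholders for a less busy deployment.
parser = argparse.ArgumentParser(prog="scaleset-create-sketch")
parser.add_argument("--max", type=int, required=True,
                    help="maximum (and initial) scale set size")
parser.add_argument("--min", type=int, default=1,
                    help="minimum scale set size")
parser.add_argument("--scale-out-amount", type=int, default=1)
parser.add_argument("--scale-out-cooldown", type=int, default=10)
parser.add_argument("--scale-in-amount", type=int, default=1)
parser.add_argument("--scale-in-cooldown", type=int, default=15)

args = parser.parse_args(["--max", "20"])
```

Only `--max` is required, matching the point that it's needed to create the scale set in the first place; everything else falls back to a default.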

logging.info(
SCALESET_LOG_PREFIX + "unexpected scaleset size, resizing. "
"scaleset_id:%s expected:%d actual:%d",
self.scaleset_id,
self.size,
size,
)
self.set_state(ScalesetState.resize)
Member Author
@tevoinea tevoinea Mar 2, 2022

Synchronizing the number of instances in a scale set with Azure doesn't require resizing.
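A minimal sketch of that idea, with hypothetical names: when Azure auto scale changes the instance count, the stored size is simply updated to match, instead of transitioning into a resize state as the removed code above did:

```python
# Hypothetical sketch: sync the recorded size with Azure's actual instance
# count rather than entering ScalesetState.resize.
class Scaleset:
    def __init__(self, scaleset_id: str, size: int):
        self.scaleset_id = scaleset_id
        self.size = size

    def sync_size(self, actual_size: int) -> None:
        if self.size != actual_size:
            # Azure auto scale grew or shrank the set; record the new size
            # instead of forcing a resize back to the old one.
            self.size = actual_size

scaleset = Scaleset("example-scaleset", size=5)
scaleset.sync_size(3)  # auto scale removed two idle nodes
```

This is what lets the Azure table stay accurate in validation step 4 even though no resize was issued.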

@tevoinea tevoinea marked this pull request as ready for review March 2, 2022 19:12
@tevoinea tevoinea merged commit 4d1c1f5 into microsoft:main Mar 8, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Apr 8, 2022
Successfully merging this pull request may close these issues.

Create abstraction allowing us to choose node disposal strategy
3 participants