This repository has been archived by the owner on Nov 1, 2023. It is now read-only.

Abstract out node disposal #1686

Merged

merged 9 commits into microsoft:main on Mar 8, 2022

Conversation

tevoinea
Member

@tevoinea tevoinea commented Mar 2, 2022

Summary of the Pull Request

Introduces a new way for nodes to be reaped from a scale set, allowing Azure auto scale to scale in nodes when appropriate.

PR Checklist

Info on Pull Request

This PR includes:

  • Abstracting out how nodes are disposed (Azure auto-scale scale-in or node delete/reimage)
  • The capability for onefuzz to correctly handle Azure scaling the nodes in
  • Removal of the resize state that is no longer necessary
  • Tuning auto scale rules according to Azure guidance
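The disposal abstraction in the first bullet can be sketched as a strategy enum. This is an illustrative sketch only; `NodeDisposalStrategy` and `dispose` here are hypothetical names, not necessarily the PR's actual types:

```python
from enum import Enum

# Hypothetical sketch of the abstraction described above: instead of always
# deleting/reimaging a node, it can also be released to Azure auto scale's
# scale-in action.
class NodeDisposalStrategy(Enum):
    scale_in = "scale_in"          # let Azure auto scale remove the node
    decommission = "decommission"  # delete or reimage the node directly

def dispose(node_id: str, strategy: NodeDisposalStrategy) -> str:
    if strategy is NodeDisposalStrategy.scale_in:
        return f"release scale-in protection on {node_id}"
    return f"delete/reimage {node_id}"
```

With this split, the scaleset code can pick the strategy per pool instead of hard-coding delete/reimage.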

Validation Steps Performed

  1. Create a new scale set with some nodes
  2. Do not submit any jobs
  3. Watch auto scale delete the nodes
  4. Verify the size of the scale set is accurately represented in the Azure table
  5. Submit a job
  6. Watch the scale set grow and items from the pool queue being consumed
  7. Repeat step 4
  8. Verify the protection policy on the busy nodes is set to "Scale in actions" in the "Instances" tab of the scale set
  9. Once the job is complete, watch the nodes scale back in

# When there's more than 1 message in the pool queue
operator=ComparisonOperationType.GREATER_THAN,
operator=ComparisonOperationType.GREATER_THAN_OR_EQUAL,
Member Author

This was a bug: with GREATER_THAN, new nodes would not spin up if there was only 1 message in the queue.
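A minimal sketch of why the operator change matters, using a plain Python predicate rather than the Azure trigger machinery (the `should_scale_out` helper is hypothetical):

```python
# With a threshold of 1, GREATER_THAN never fires for a queue holding exactly
# one message, so that lone job would never get a node. GREATER_THAN_OR_EQUAL
# fixes this. inclusive=True models GREATER_THAN_OR_EQUAL, False models
# GREATER_THAN.
def should_scale_out(queue_length: int, threshold: int = 1,
                     inclusive: bool = True) -> bool:
    return queue_length >= threshold if inclusive else queue_length > threshold

# Old rule: one queued message does not trigger scale-out.
assert should_scale_out(1, inclusive=False) is False
# Fixed rule: one queued message triggers scale-out.
assert should_scale_out(1, inclusive=True) is True
```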

@@ -123,23 +123,47 @@ def create_auto_scale_profile(min: int, max: int, queue_uri: str) -> AutoscalePr
metric_trigger=MetricTrigger(
metric_name="ApproximateMessageCount",
metric_resource_uri=queue_uri,
# Check every minute
time_grain=timedelta(minutes=1),
# Check every 15 minutes
Member Author
@tevoinea tevoinea Mar 2, 2022

Some of these timing numbers were tuned after reading this guidance on autoscaling: https://docs.microsoft.com/en-us/azure/architecture/best-practices/auto-scaling
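The tuned timings can be sketched as plain values. The field names mirror azure-mgmt-monitor's `MetricTrigger`; the 15-minute `time_window` is an assumption based on the "Check every 15 minutes" comment in the diff, not confirmed by the visible hunk:

```python
from datetime import timedelta

# Illustrative sketch of the tuned scale-out trigger timings (assumed values):
# sample the queue length every minute, then aggregate those samples over a
# 15-minute window before deciding to scale, per the linked Azure guidance.
scale_out_trigger = {
    "metric_name": "ApproximateMessageCount",
    "time_grain": timedelta(minutes=1),    # sampling interval
    "time_window": timedelta(minutes=15),  # aggregation window
}
```

Aggregating over a window longer than the sampling grain smooths out short queue spikes so the scale set does not flap.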

Contributor

Please add this link as a comment in the code.

Contributor

These are just default values, right? We're (eventually) providing a way for the admin to customize these values? Or do we expect them to use the portal for tweaks?

Member Author

I think the values that we should include in the CLI are:

  • max/default - required to create the scale set in the first place
  • min - optional but defaults to 1
  • scale-out-amount/scale-out-cooldown/scale-in-amount/scale-in-cooldown - these vary by scale set and use case, so it's convenient to have them configurable. For example, busier systems will want bigger scale-{in|out}-amount values; if nodes take long to set up, they'll want longer cooldowns.

We can keep the current values as defaults since I think they're appropriate for a less busy deployment.
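The knobs above could surface in the CLI roughly like this argparse sketch; every option name and default here is illustrative, not onefuzz's actual interface:

```python
import argparse

# Hypothetical sketch of the CLI knobs discussed above. Cooldowns are in
# minutes; defaults are placeholders for a less busy deployment.
parser = argparse.ArgumentParser(prog="scaleset-create-sketch")
parser.add_argument("--max", type=int, required=True,
                    help="maximum (and initial) scale set size")
parser.add_argument("--min", type=int, default=1,
                    help="minimum scale set size")
parser.add_argument("--scale-out-amount", type=int, default=1)
parser.add_argument("--scale-out-cooldown", type=int, default=10)
parser.add_argument("--scale-in-amount", type=int, default=1)
parser.add_argument("--scale-in-cooldown", type=int, default=15)

args = parser.parse_args(["--max", "20"])
```

Only `--max` is required, matching the point that it's needed to create the scale set in the first place; everything else falls back to a default.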

logging.info(
SCALESET_LOG_PREFIX + "unexpected scaleset size, resizing. "
"scaleset_id:%s expected:%d actual:%d",
self.scaleset_id,
self.size,
size,
)
self.set_state(ScalesetState.resize)
Member Author
@tevoinea tevoinea Mar 2, 2022

Synchronizing the number of instances in a scale set with Azure doesn't require resizing.
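A minimal sketch of that idea, with hypothetical names: when Azure auto scale changes the instance count, the stored size is simply updated to match, instead of transitioning into a resize state as the removed code above did:

```python
# Hypothetical sketch: sync the recorded size with Azure's actual instance
# count rather than entering ScalesetState.resize.
class Scaleset:
    def __init__(self, scaleset_id: str, size: int):
        self.scaleset_id = scaleset_id
        self.size = size

    def sync_size(self, actual_size: int) -> None:
        if self.size != actual_size:
            # Azure auto scale grew or shrank the set; record the new size
            # instead of forcing a resize back to the old one.
            self.size = actual_size

scaleset = Scaleset("example-scaleset", size=5)
scaleset.sync_size(3)  # auto scale removed two idle nodes
```

This is what lets the Azure table stay accurate in validation step 4 even though no resize was issued.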

@tevoinea tevoinea marked this pull request as ready for review March 2, 2022 19:12
@tevoinea tevoinea merged commit 4d1c1f5 into microsoft:main Mar 8, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Apr 8, 2022
Successfully merging this pull request may close these issues.

Create abstraction allowing us to choose node disposal strategy
3 participants