Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(aws-asg) Prevent scaling in case of ongoing InstanceRefresh #597

Merged
merged 4 commits into from
Dec 2, 2022

Conversation

tirimia
Copy link
Contributor

@tirimia tirimia commented Aug 11, 2022

As of version 0.3.7, the nomad-autoscaler does not take into account pending or currently running AWS ASG Instance Refresh events. See #596

This has lead to us having the autoscaler constantly try to adjust as the instance refresh was also spinning nodes up and down.

To avoid such problems, this PR extends the Scale action to also check the last Instance Refresh event on the ASG and not go ahead with scaling if it has a status of Pending or InProgress.

Happy to get any feedback you can provide :)

Closes #596

@tirimia tirimia requested review from jrasell and lgfa29 as code owners August 11, 2022 14:45
Copy link
Contributor

@lgfa29 lgfa29 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @tirimia 👋

Thank you so much for the PR, and apologies for the delay in getting this review.

I had some suggestions that I tried to apply myself, but I don't have permission to update your branch (I think it may have been because it's main).

Here are the changes I propose:

diff --git a/CHANGELOG.md b/CHANGELOG.md
index f4b6e0c..e9f8467 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,8 @@
 ## UNRELEASED
 
+IMPROVEMENTS:
+ * plugin/target/aws: Prevent scaling if an instance refresh is in progress [[GH-597](https://github.com/hashicorp/nomad-autoscaler/pull/597)]
+
 ## 0.3.7 (June 10, 2022)
 
 IMPROVEMENTS:
diff --git a/plugins/builtin/target/aws-asg/plugin/plugin.go b/plugins/builtin/target/aws-asg/plugin/plugin.go
index dc8f92e..7beacdc 100644
--- a/plugins/builtin/target/aws-asg/plugin/plugin.go
+++ b/plugins/builtin/target/aws-asg/plugin/plugin.go
@@ -13,6 +13,7 @@ import (
 	"github.com/hashicorp/nomad-autoscaler/plugins/target"
 	"github.com/hashicorp/nomad-autoscaler/sdk"
 	"github.com/hashicorp/nomad-autoscaler/sdk/helper/nomad"
+	"github.com/hashicorp/nomad-autoscaler/sdk/helper/ptr"
 	"github.com/hashicorp/nomad-autoscaler/sdk/helper/scaleutils"
 )
 
@@ -122,13 +123,10 @@ func (t *TargetPlugin) Scale(action sdk.ScalingAction, config map[string]string)
 	}
 
 	// Autoscaling can interfere with a running instance refresh so we
-	// prevent any scaling action while a refresh is Pending or InProgress
-	var maxRecords int32 = 1
+	// prevent any scaling action while a refresh is Pending or InProgress.
 	input := autoscaling.DescribeInstanceRefreshesInput{
 		AutoScalingGroupName: &asgName,
-		InstanceRefreshIds:   []string{},
-		MaxRecords:           &maxRecords,
-		NextToken:            nil,
+		MaxRecords:           ptr.Int32ToPtr(10),
 	}
 
 	refreshes, err := t.asg.DescribeInstanceRefreshes(ctx, &input)
@@ -136,15 +134,15 @@ func (t *TargetPlugin) Scale(action sdk.ScalingAction, config map[string]string)
 		return fmt.Errorf("failed to describe AWS InstanceRefresh: %v", err)
 	}
 
-	// We might get 0 results if no InstanceRefreshes were ever run on the ASG
-	if len(refreshes.InstanceRefreshes) == 1 {
-		refreshID := refreshes.InstanceRefreshes[0].InstanceRefreshId
-		status := refreshes.InstanceRefreshes[0].Status
-		if status == types.InstanceRefreshStatusInProgress ||
-			status == types.InstanceRefreshStatusPending {
+	for _, refresh := range refreshes.InstanceRefreshes {
+		active := refresh.Status == types.InstanceRefreshStatusInProgress ||
+			refresh.Status == types.InstanceRefreshStatusPending
+
+		if active {
 			t.logger.Warn("scaling will not take place due to InstanceRefresh",
 				"asg_name", asgName,
-				"refresh_id", refreshID)
+				"refresh_id", refresh.InstanceRefreshId,
+				"refresh_status", refresh.Status)
 			return nil
 		}
 	}
diff --git a/sdk/helper/ptr/ptr.go b/sdk/helper/ptr/ptr.go
index d6aee38..520b8d2 100644
--- a/sdk/helper/ptr/ptr.go
+++ b/sdk/helper/ptr/ptr.go
@@ -8,6 +8,10 @@ func IntToPtr(i int) *int {
 	return &i
 }
 
+func Int32ToPtr(i int32) *int32 {
+	return &i
+}
+
 func Int64ToPtr(i int64) *int64 {
 	return &i
 }

The reason for this is that it wasn't immediately clear to me from the documentation if it would be possible to have more than one instance refresh action active at the same time.

I would imagine that this would not be possible (otherwise we would have the same problem as we're trying to fix here), but just in case it may be better to loop over a few of them. I picked 10 just because it felt like a good enough number to detect this, hopefully people are not refreshing instances this aggressively 😅

I also added a CHANGELOG entry and a new Int32ToPtr which is what we use to convert things to pointers.

Let me know what you think 🙂

@tirimia
Copy link
Contributor Author

tirimia commented Dec 2, 2022

Hi, @lgfa29 !

Should have done my changes on a different branch, sorry about that

I've integrated your proposed changes, they look great.

Only one instance refresh can happen at a time, but the AWS SDK returns a paginated history of Instance Refresh events, starting with the latest one. That is the reason I will only query for one.
On that note, not sure what I was thinking with checking the length 🤣, the for range is a much more elegant solution.

Copy link
Contributor

@lgfa29 lgfa29 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only one instance refresh can happen at a time

Great, thank you for the confirmation. I read the docs but this wasn't clear to me, so I was wondering if there could be a scenario where you start a refresh, then start a new one that finishes before the previous one.

If that's not possible then querying just the latest make sense 👍

Thanks for the PR!

@lgfa29 lgfa29 merged commit 8486473 into hashicorp:main Dec 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AWS ASG Instance Refresh conflicts with autoscaling
2 participants