
fix: correctly handle lack of capacity of AWS spot ASGs #2008

Closed

Conversation

@piontec (Contributor) commented May 10, 2019

This fixes a situation where the AWS cloud provider is used and at least one of the managed ASGs is using spot instances. Instances requested from such an ASG may never come up, due to a bid that is too low or a lack of capacity in the spot pool. Right now, this triggers a situation where CAS considers such new nodes as "coming up", because coming_up_nodes = requested - real. As there's no instance identity to track, the timeout mechanism for startup time won't kick in.

This patch introduces artificial placeholder IDs for instances requested from any ASG. If such an instance does not become available before the timeout kicks in, the size of the ASG is decreased and the ASG is marked as "backoff".
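A minimal, self-contained sketch of the placeholder idea (illustrative stand-ins only, not the actual patch code; the real change lives in the AWS provider's asgCache):

// Illustrative sketch only: simplified stand-ins for the provider's asgCache types.
package main

import "fmt"

const placeholderInstanceNamePrefix = "i-placeholder-"

// asg is a simplified view of an Auto Scaling Group: its desired size and the
// instances that actually exist.
type asg struct {
	name    string
	desired int
	running []string
}

// createPlaceholders returns the real instance IDs plus synthetic placeholder
// IDs for the gap between desired and running, so the core logic has
// identities it can track and eventually time out.
func createPlaceholders(g *asg) []string {
	instances := append([]string{}, g.running...)
	for i := len(g.running); i < g.desired; i++ {
		instances = append(instances, fmt.Sprintf("%s%d", placeholderInstanceNamePrefix, i))
	}
	return instances
}

func main() {
	g := &asg{name: "spot-asg", desired: 6, running: []string{"i-0aaa", "i-0bbb", "i-0ccc", "i-0ddd"}}
	fmt.Println(createPlaceholders(g)) // 4 real IDs followed by 2 placeholder IDs
}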

Fixes #1996 and most probably also #1133 and #1795

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 10, 2019
@piontec piontec force-pushed the fix/aws_spots_squashed branch 2 times, most recently from 6fa2503 to 31d13f0 Compare May 10, 2019 10:33
// a failed scale up
if wasPlaceholder {
klog.Warningf("Timeout trying to scale node group %s, enabling backoff for the group", nodeGroup.Id())
clusterStateRegistry.RegisterFailedScaleUp(nodeGroup, metrics.Timeout, time.Now())
Contributor:

The failed scale-up logic should happen regardless, triggered from https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/clusterstate/clusterstate.go#L237. At minimum this will mess up metrics (you'll double count failed scale-ups), not sure if it breaks anything else.

@MaciekPytel (Contributor):

I think adding the placeholders alone is enough, all the changes outside AWS cloudprovider are redundant (and mess up metrics). The timeout/backoff mechanism has worked for years in GCP, all the core parts should be there already (it may not be obvious from code because in GCP placeholder instances are created by MIG, so they're already returned by API calls to GCP).

@piontec (Contributor Author) commented May 10, 2019

Hmm, so if tracking the scaleUpRequest is enough, why didn't it work in the first place? After the scaleUpRequest times out, backoffNodeGroup should solve the problem...?

@MaciekPytel (Contributor):

backoffNodeGroup temporarily prevents CA from trying another scale-up on the same NodeGroup, but doesn't make it ignore already upcoming nodes. My guess (no way to test...) would be that NodeGroup was backedOff, but it didn't matter as you were already stuck with forever-upcoming nodes preventing any future scale-up.

@piontec (Contributor Author) commented May 10, 2019

OK, that might be it. Let me remove the code in utils.go and retest - I completely missed that Request status update logic

@aleksandra-malinowska (Contributor):

Hmm, so if tracking the scaleUpRequest is enough, why didn't it work in the first place? After the scaleUpRequest times out, backoffNodeGroup should solve the problem...?

I may be missing some context here, but I think I recall there was a problem with ASG immediately changing its target size back to actual size when the instance fails to start (see #1263). So scaleUpRequest never times out.

@@ -323,6 +350,11 @@ func (m *asgCache) regenerate() error {
return err
}

// If currently any ASG has more Desired than running Instances, introduce placeholders
// for the instances to come up. This is required to track Desired instances that
// will never come up, like with Spot Request that can't be fulfilled
Contributor:

We do see several issues that lead to desired > current. Sometimes on-demand nodes can also run out of capacity, for example when hitting a limit, or for nodes with strict constraints, like being in a single placement group.

@piontec (Contributor Author) commented May 13, 2019

backoffNodeGroup temporarily prevents CA from trying another scale-up on the same NodeGroup, but doesn't make it ignore already upcoming nodes. My guess (no way to test...) would be that NodeGroup was backedOff, but it didn't matter as you were already stuck with forever-upcoming nodes preventing any future scale-up.

OK, I think I know better now what is going on. Indeed, after scaling up a node group that won't increase its size, the Long* counters are not updated, so on every check CAS computes the upcoming nodes as Desired - Running, which gives a positive number and keeps other scaling groups from scaling up.

Still, my first proposal is (almost) good. The problem is that at some point the node group needs to be scaled down by the Increase amount of the ScaleUpRequest. In my patch, I did that one by one: every time a node startup timeout was triggered, I decreased the size of the node group and returned an error indicating that a backoff is necessary.

The above approach is needed, as we can't rely on the ScaleUpRequest timing out. Imagine a situation where CAS crashes and the node group state is already "Wanted = 6, Running = 4". Now CAS starts up and sees 2 nodes coming up in the node group, but there's no ScaleUpRequest object. So the only thing we can rely on is the node startup timeout, after which we have to immediately trigger backoff, as otherwise the same node group will be used to scale up in the next iteration. So the only problem in my original patch is the double logging of failed-scale-up metrics.

Am I right? What do you think?

==edit==
OK, it's even a little bit "funnier" :P Even if we start up CAS with an ASG already at "Wanted = 6, Running = 4", there'll be no ScaleUpRequest. But if we use the node startup timeout to scale the ASG down, then even if CAS selects this node group again for scale-out, a ScaleUpRequest will be created this time. When this one times out, the node group will be backed off, and when the node startup timeout triggers, the group will be scaled down and won't be used again, as it is in "backoff". So, after all, it will work, but consistency will be achieved much later, after at least 2 * NodeStartupTime instead of a single timeout. This seems to be a corner case, but it is pretty real for us, running (almost) the whole cluster on spots. Hence, I'm still in favor of the solution discussed above, where not only is the node group scaled down on node startup timeout, but CAS also immediately marks the group as "backoff", without waiting for the ScaleUpRequest timeout.

@mvisonneau:

Thanks a lot @piontec! I managed to test it and it seems to be working like a charm 👌

logRecorder.Eventf(apiv1.EventTypeWarning, "DeleteUnregisteredFailed",
"Failed to remove node %s: %v", unregisteredNode.Node.Name, err)
return removedAny, err
_, wasPlaceholder := err.(*cloudprovider.PlaceholderDeleteError)
Contributor:

I do not believe it is the right approach to determine the nature of a deleted instance based on the error returned from DeleteNodes() (especially given the fact that DeleteNodes takes multiple nodes as an argument).

Instead, we should base the behavior on what we get from NodeGroup.Nodes().
Actually, it seems you already have all the needed fields in the interface. NodeGroup.Nodes() returns an InstanceStatus for each node, and there is an InstanceState field which can currently be one of Instance{Running,Creating,Deleting}. It seems to me that you want to back off the node group if the deleted node is in the InstanceCreating state. If you think you need to extend InstanceStatus in some way (e.g. introduce more error classes or something), we can discuss it.

Contributor Author:

I don't think the above is right. We want to have feedback about the whole DeleteNodes() call; we don't need this to be instance-by-instance. The thing is: if you requested 4 new instances and 4 came up successfully, but there's no spot capacity for the 5th, the ASG should be "backed off" exactly the same as if no instance was started.
Of course, we can set InstanceState for each instance and then, after the call, iterate over the instances again to check if any is not coming up, but I'd like to introduce a new state for that, like CreationTimeout.

Contributor:

I don't think the above is right. We want to have feedback about the whole DeleteNodes() call; we don't need this to be instance-by-instance. The thing is: if you requested 4 new instances and 4 came up successfully, but there's no spot capacity for the 5th, the ASG should be "backed off" exactly the same as if no instance was started.

I agree that we want to back off in such a case.

Of course, we can set InstanceState for each instance and then, after the call, iterate over the instances again to check if any is not coming up, but I'd like to introduce a new state for that, like CreationTimeout.

InstanceState should still be InstanceCreating. Maybe the wording is not the best, but InstanceCreating covers both:

  • an instance which is being created and for which we do not yet know the outcome of that process
  • an instance which was created but whose creation resulted in a failure

For the latter case, the ErrorInfo in the InstanceStatus is set to a non-nil value, and IMO you should use this field to differentiate between successful and failed creation. It seems that you can still use OtherErrorClass when you set ErrorInfo.
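Roughly, that could look like the sketch below. The type and constant names (Instance, InstanceStatus, InstanceErrorInfo, InstanceCreating, OutOfResourcesErrorClass) are as I recall them from the cloudprovider package; the error code and message strings are made up for illustration.

package example

import "k8s.io/autoscaler/cluster-autoscaler/cloudprovider"

// placeholderInstance reports a requested-but-never-created instance. It stays
// in InstanceCreating state; once it is considered failed, a non-nil ErrorInfo
// lets the core logic treat the creation as failed and back off the group.
func placeholderInstance(id string, failed bool) cloudprovider.Instance {
	status := &cloudprovider.InstanceStatus{State: cloudprovider.InstanceCreating}
	if failed {
		status.ErrorInfo = &cloudprovider.InstanceErrorInfo{
			ErrorClass:   cloudprovider.OutOfResourcesErrorClass, // or OtherErrorClass
			ErrorCode:    "placeholder-cannot-be-fulfilled",      // illustrative code
			ErrorMessage: "AWS cannot provision the requested spot instance",
		}
	}
	return cloudprovider.Instance{Id: id, Status: status}
}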

Contributor Author:

OK, I think I can have a look at that today. BTW, are any of you going to KubeCon Barcelona next week?

Contributor:

Thanks. I will be in BCN, and a bunch of other people from the autoscaling team will be too.

Contributor Author:

There's a problem with this approach. It's not easy to update Node.InstanceState within the AWS provider, as it only keeps track of instance refs (a single string). This is common throughout the whole provider. Only when Nodes() is called is the array of strings (instance IDs) converted to []Nodes, so there's no easy way to keep additional state for nodes without heavily changing the interfaces and return types of the AWS provider.

Contributor:

I understand. Yet I do not believe that changing the signature of DeleteNodes is the way to go. I do not like the idea of making point changes to a general interface just so it is somewhat less effort to support one specific implementation.

If exposing InstanceState for all the instances in AWS is problematic, maybe you can fairly easily set those to the placeholder nodes only?

@piontec (Contributor Author) commented May 17, 2019

I did just that, please check now, although it's still not pretty

@piontec (Contributor Author) commented May 29, 2019

ping @losipiuk @MaciekPytel , what do you think now?

scaleToZeroSupported = true
placeholderInstanceNamePrefix = "i-placeholder-"
// TimeoutedPlaceholderName is used to mark placeholder instances that did not come up with timeout
TimeoutedPlaceholderName = "i-timeouted-placeholder"

s/timeouted/timedout/

return err
// check if the instance is a placeholder - a requested instance that was never created by the node group
// if it is, just decrease the size of the node group, as there's no specific instance we can remove
matched, err := regexp.MatchString(fmt.Sprintf("^%s\\d+$", placeholderInstanceNamePrefix), instance.Name)

We could use regexp.Compile to compile this regular expression just once, given that the prefix string is fixed. The resulting regexp.Regexp value is safe for repeated and concurrent use.

Contributor:

Also, it would be nicer to extract an isPlaceHolderInstance(instance) method instead of inlining regexp.MatchString.
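Something along these lines would cover both suggestions (a precompiled package-level pattern plus a named helper); the names below are suggestions, not existing code:

package aws

import (
	"fmt"
	"regexp"
)

const placeholderInstanceNamePrefix = "i-placeholder-"

// Compiled once; a *regexp.Regexp is safe for concurrent use.
var placeholderInstanceRegexp = regexp.MustCompile(
	fmt.Sprintf(`^%s\d+$`, placeholderInstanceNamePrefix))

// isPlaceholderInstance reports whether the name refers to a placeholder
// synthesized by the ASG cache rather than a real EC2 instance.
func isPlaceholderInstance(name string) bool {
	return placeholderInstanceRegexp.MatchString(name)
}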

klog.V(4).Infof("instance %s is detected as a placeholder, decreasing ASG requested size instead "+
"of deleting instance", instance.Name)
m.decreaseAsgSizeByOneNoLock(commonAsg)
// mark this instance using its name as a timeouted placeholder

s/timeouted/timed out/

for i := real; i < desired; i++ {
id := fmt.Sprintf("%s%d", placeholderInstanceNamePrefix, i)
klog.V(4).Infof("Instance group %s has only %d instances created while requested count is %d."+
"Creating placeholder instance with ID %s", *g.AutoScalingGroupName, real, desired, id)

Either add a leading space before "Creating" or a trailing space after the period on the preceding line. As written, the two concatenated sentences will wind up with no intervening space character.

Also, punctuate the second sentence like the first one.

}
if failedPlaceholder {
// this means only a placeholder instance was deleted - it is an instance, that was requested,
// but was not create before StartUpTimeout. It means something's wrong with this specific

s/create/created/

return err
// check if the instance is a placeholder - a requested instance that was never created by the node group
// if it is, just decrease the size of the node group, as there's no specific instance we can remove
matched, err := regexp.MatchString(fmt.Sprintf("^%s\\d+$", placeholderInstanceNamePrefix), instance.Name)
Contributor:

Also, it would be nicer to extract an isPlaceHolderInstance(instance) method instead of inlining regexp.MatchString.

@@ -41,7 +41,7 @@ import (

const (
// MaxNodeStartupTime is the maximum time from the moment the node is registered to the time the node is ready.
MaxNodeStartupTime = 15 * time.Minute
MaxNodeStartupTime = 1 * time.Minute
Contributor:

Please revert all changes outside of the cloudprovider.
If you correctly set the InstanceStatus, backing off should be handled automatically by ClusterStateRegistry.handleOutOfResourcesErrorsForNodeGroup.

// mark this instance using its name as a timeouted placeholder
asg := m.instanceToAsg[*instance]
delete(m.instanceToAsg, *instance)
instance.Name = TimeoutedPlaceholderName
Contributor:

This seems very much not right.

IMO the instance should be marked as "timed out" in some routine which refreshes the node group state. I guess in the same place where you are creating placeholders you should also check whether some of the placeholders are "old enough" to be treated as timed out. Then you can mark those as such and, when instances are listed, set an OutOfResources error for them.

Then DeleteInstances will be called by the core logic for the timed-out instances, and here you will only remove the placeholder and decrease the ASG size.
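A rough, hypothetical sketch of that flow; none of these names exist in the AWS provider today, they just show where the timeout check could live:

package aws

import "time"

// placeholderTracker is a hypothetical helper the ASG cache could consult on
// every refresh to decide which placeholders have waited too long.
type placeholderTracker struct {
	firstSeen      map[string]time.Time
	timedOut       map[string]bool
	startupTimeout time.Duration
}

func newPlaceholderTracker(timeout time.Duration) *placeholderTracker {
	return &placeholderTracker{
		firstSeen:      map[string]time.Time{},
		timedOut:       map[string]bool{},
		startupTimeout: timeout,
	}
}

// observe is called for every placeholder seen during a cache refresh and
// marks it as timed out once it has been pending longer than startupTimeout.
func (t *placeholderTracker) observe(id string, now time.Time) {
	first, seen := t.firstSeen[id]
	if !seen {
		t.firstSeen[id] = now
		return
	}
	if now.Sub(first) > t.startupTimeout {
		t.timedOut[id] = true
	}
}

// isTimedOut tells Nodes() whether to report the placeholder with an
// out-of-resources error, so the core logic deletes it and the cache only has
// to drop the placeholder and decrease the ASG size.
func (t *placeholderTracker) isTimedOut(id string) bool {
	return t.timedOut[id]
}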

Contributor Author:

I get your point, but the problem is again in how the AWS cloud provider works. Internally, it keeps instance IDs only, and these IDs need to stay constant. It also doesn't track any timeouts - that's on the core's side, specifically in clusterstate.go. The logic that tracks instances and handles registration timeouts lives there.

We either have to reuse/link this logic over here in auto_scaling_groups.go, or assume that if a DeleteInstances call comes for a placeholder, it's always because of a timeout related to a non-existent ScaleUpRequest. Basically, this whole logic is there to guard against a single use case: when CAS is started while already required > running for an ASG and the instances never come up.

I think there are the following possible cases for placeholder instances:

  1. The ScaleUpRequest times out. In that case, we don't have to modify the Nodes state at all, as the timeout handler will scale down the related ASG, mark it as "backed off" and decrease its size, which will make the placeholders go away on the next refresh (which runs at the start of every main loop).
  2. As I wrote above, on CAS start "required > running" already holds; in that case, DeleteNodes will be invoked by the core logic (after detecting that the node is listed by the cloud provider but not registered in k8s), and we assume that the node is deleted because of a timeout. One possible modification is to explicitly pass an "isTimeout" flag here.

What do you think?

Contributor:

I get your point, but the problem is again in how the AWS cloud provider works. Internally, it keeps instance IDs only, and these IDs need to stay constant. It also doesn't track any timeouts - that's on the core's side, specifically in clusterstate.go. The logic that tracks instances and handles registration timeouts lives there.

We either have to reuse/link this logic over here in auto_scaling_groups.go, or assume that if a DeleteInstances call comes for a placeholder, it's always because of a timeout related to a non-existent ScaleUpRequest. Basically, this whole logic is there to guard against a single use case: when CAS is started while already required > running for an ASG and the instances never come up.

My understanding was that you wanted to speed up the reaction time of CA when node instances coming from a spot-backed ASG are not showing up.
If this is just about fixing the bug that the 15-minute timeout mechanism built into CA does not trigger, the situation is simpler (but the benefits are also smaller).

  1. The ScaleUpRequest times out. In that case, we don't have to modify the Nodes state at all, as the timeout handler will scale down the related ASG, mark it as "backed off" and decrease its size, which will make the placeholders go away on the next refresh (which runs at the start of every main loop).

Yes. Should work that way.

  2. As I wrote above, on CAS start "required > running" already holds; in that case, DeleteNodes will be invoked by the core logic (after detecting that the node is listed by the cloud provider but not registered in k8s), and we assume that the node is deleted because of a timeout. One possible modification is to explicitly pass an "isTimeout" flag here.

If we just added the placeholders and did not modify the code in any other way, we would get to a stable state after DeleteNodes is called. The ASG just would not be backed off this time. Is that a problem? Do we care about the (possibly rare) case when CA is restarted while a scale-up is in progress? The state will become stable eventually. I do not believe we want to optimize for the restart scenario. CA was never designed to be restarted very often. It survives restarts, but it does not necessarily behave optimally in such situations.


Is that 15 minute timeout the one that's configured by the --max-node-provision-time command-line flag? There are several flags with default values of 15 minutes, and there's a const MaxNodeStartupTime with the same value. It doesn't look like that one can be changed without rebuilding the program.

15 minutes is too long for us to wait to react to an ASG's failure to acquire a machine, so we'd like to be able to dial this down.

Contributor Author:

@losipiuk OK, I'll try to split this patch in two: first provide a general fix that has a worst case of recovering in 2 * NodeStartupTime, then create another one that can work in just 1 * NodeStartupTime.

@seh It's --max-node-provision-time. And yeah... the general recommendation is "do not change that". Still, in my tests I dialed it down to 2 minutes and it seemed to work. YMMV.

@seh commented May 31, 2019

I've been testing this patch in a cluster running Kubernetes version 1.13.4. I just saw the container crash like this:

I0531 14:40:04.609595       1 static_autoscaler.go:161] Starting main loop
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x50 pc=0x2164a15]

goroutine 99 [running]:
k8s.io/autoscaler/cluster-autoscaler/clusterstate.(*ClusterStateRegistry).updateReadinessStats(0xc0009dd7c0, 0xbf346d1124555c8c, 0x2c0acc8908, 0x4b7e140)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/clusterstate/clusterstate.go:569 +0x9a5
k8s.io/autoscaler/cluster-autoscaler/clusterstate.(*ClusterStateRegistry).UpdateNodes(0xc0009dd7c0, 0xc00132d6c0, 0x5, 0x8, 0xc000aa2270, 0xbf346d1124555c8c, 0x2c0acc8908, 0x4b7e140, 0x0, 0x0)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/clusterstate/clusterstate.go:303 +0x234
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).updateClusterState(0xc000a3c280, 0xc00132d6c0, 0x5, 0x8, 0xc000aa2270, 0xbf346d1124555c8c, 0x2c0acc8908, 0x4b7e140, 0xc00132d700, 0x8)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:541 +0x94
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).RunOnce(0xc000a3c280, 0xbf346d1124555c8c, 0x2c0acc8908, 0x4b7e140, 0x0, 0x0)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:191 +0x5a8
main.run(0xc0002db360)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:322 +0x201
main.main.func2(0x2ef6780, 0xc00040cac0)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:394 +0x2a
created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:182 +0xec

I don't know yet whether that's related to these changes.

@seh commented May 31, 2019

That failure above seems to happen within about two seconds after deleting a bunch of placeholder instances and reducing the desired count on the corresponding ASG. I can reproduce it easily.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 23, 2019
@seh commented Jun 23, 2019

We have indeed been running this version in our clusters for a couple of weeks, and have found that it mostly alleviates the problem with “stuck” ASGs.

The chattiness of the log messages about placeholder instances still makes me anxious, wondering if everything is working as it should, but the net observable behavior is what we asked for.

The backoff delay parameters could be exposed as configuration, but I understand that doing so would then require validating that they all work sensibly together, and we’d have to document several more command-line flags. We don’t need to take that on here. This patch is about enabling behavior that was already supposed to work this way, as opposed to introducing new features.

I hope @piontec agrees with my summary.

@losipiuk (Contributor) left a comment:

Sorry for the huge delay on the review.
Looks good. Just please clean up the unneeded changes remaining from the previous approaches.

@@ -484,7 +484,7 @@ func sanitizeTemplateNode(node *apiv1.Node, nodeGroup string, ignoredTaints tain

// Removes unregistered nodes if needed. Returns true if anything was removed and error if such occurred.
func removeOldUnregisteredNodes(unregisteredNodes []clusterstate.UnregisteredNode, context *context.AutoscalingContext,
currentTime time.Time, logRecorder *utils.LogEventRecorder) (bool, error) {
clusterStateRegistry *clusterstate.ClusterStateRegistry, currentTime time.Time, logRecorder *utils.LogEventRecorder) (bool, error) {
Contributor:

Please revert.

@@ -514,6 +514,7 @@ func removeOldUnregisteredNodes(unregisteredNodes []clusterstate.UnregisteredNod
"Failed to remove node %s: %v", unregisteredNode.Node.Name, err)
return removedAny, err
}

Contributor:

please revert

@@ -451,12 +451,12 @@ func TestRemoveOldUnregisteredNodes(t *testing.T) {
assert.Equal(t, 1, len(unregisteredNodes))

// Nothing should be removed. The unregistered node is not old enough.
removed, err := removeOldUnregisteredNodes(unregisteredNodes, context, now.Add(-50*time.Minute), fakeLogRecorder)
removed, err := removeOldUnregisteredNodes(unregisteredNodes, context, clusterState, now.Add(-50*time.Minute), fakeLogRecorder)
Contributor:

please revert

assert.NoError(t, err)
assert.False(t, removed)

// ng1_2 should be removed.
removed, err = removeOldUnregisteredNodes(unregisteredNodes, context, now, fakeLogRecorder)
removed, err = removeOldUnregisteredNodes(unregisteredNodes, context, clusterState, now, fakeLogRecorder)
Contributor:

please revert

@@ -274,6 +274,14 @@ func (csr *ClusterStateRegistry) updateScaleRequests(currentTime time.Time) {
csr.scaleDownRequests = newScaleDownRequests
}

// BackoffNodeGroup is used to force the specified nodeGroup to go into backoff mode, which
// means it won't be used for scaling out temporarily
func (csr *ClusterStateRegistry) BackoffNodeGroup(nodeGroup cloudprovider.NodeGroup, currentTime time.Time) {
Contributor:

please revert

if err := m.asgCache.DeleteInstances(instances); err != nil {
return err
}
return m.forceRefresh()
Member:

It feels like we should either have a log message here stating that a refresh is being forced, or update the message in forceRefresh, as it could currently be confusing for users, given the log message in forceRefresh states: "Refreshed ASG list, next refresh after %v", m.lastRefresh.Add(refreshInterval))
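For example, the first option could look roughly like this (the types are simplified stand-ins, and the wording of the message is only a suggestion):

package aws

import "k8s.io/klog"

// instanceRef and asgCacheLike are simplified stand-ins for the provider's
// real types; the point of the sketch is only the extra log line.
type instanceRef struct{ Name string }

type asgCacheLike interface {
	DeleteInstances(instances []*instanceRef) error
}

type manager struct {
	asgCache asgCacheLike
}

func (m *manager) forceRefresh() error { return nil } // stand-in

// DeleteInstances deletes the given instances and then forces a cache refresh,
// logging explicitly so the forced refresh is not mistaken for the periodic
// one announced by forceRefresh's own "Refreshed ASG list" message.
func (m *manager) DeleteInstances(instances []*instanceRef) error {
	if err := m.asgCache.DeleteInstances(instances); err != nil {
		return err
	}
	klog.V(2).Info("Forcing ASG list refresh after instance deletion")
	return m.forceRefresh()
}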

@gjtempleton (Member):

Awesome work, really glad to see someone finally getting around to resolving this.

One brief comment in addition to @losipiuk's.

@seh commented Jul 17, 2019

@piontec, do you anticipate being able to address @losipiuk's and @gjtempleton's suggestions soon?

@k8s-ci-robot (Contributor):

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 17, 2019
@jaypipes (Contributor) left a comment:

Besides the code that needs to be reverted per @losipiuk's requests, this code looks reasonable to me. My only question is around e2e testing. I don't see any functional test that would stress the code paths introduced by this patch. Are those functional tests missing because they would be difficult to reproduce? Or was the test absence simply an oversight?

Best,
-jay

@seh commented Jul 31, 2019

I'm willing to take over this patch, and can handle the requests to revert a few lines. I'm not sure what's involved with writing end-to-end tests for this change in behavior.

What would the procedure be? Submit a fresh PR that starts from where this branch leaves off?

@MaciekPytel (Contributor):

Regarding e2e tests: I'm not exactly sure what you mean by that. Currently CA uses unit tests and e2e tests in a real cluster - the latter only exist for GCP. Obviously there is no way to cover this without porting them to AWS. I have no idea how much work that would involve (I expect a lot), but I don't think this PR should be gated on it.

@Jeffwan (Contributor) commented Jul 31, 2019

Regarding e2e tests: I'm not exactly sure what you mean by that. Currently CA uses unit tests and e2e tests in a real cluster - the latter only exist for GCP. Obviously there is no way to cover this without porting them to AWS. I have no idea how much work that would involve (I expect a lot), but I don't think this PR should be gated on it.

Agreed. It's more like an effort we can put into sig-testing separately.

I'm willing to take over this patch, and can handle the requests to revert a few lines. I'm not sure what's involved with writing end-to-end tests for this change in behavior.
What would the procedure be? Submit a fresh PR that starts from where this branch leaves off?

Hi @seh, is there a way we can get in touch with @piontec? Not sure what happened to the missing GitHub user.

@seh commented Jul 31, 2019

@Jeffwan, I see that Lukasz does publish an email address in his GitHub profile, but I expect that he's receiving updates for each message posted here. I'm willing to write to him once, just in case I'm wrong about those notifications. I don't want to bother him if he's had to move on to other projects.

Stay tuned.

@@ -254,7 +254,9 @@ func (a *StaticAutoscaler) RunOnce(currentTime time.Time) errors.AutoscalerError
unregisteredNodes := a.clusterStateRegistry.GetUnregisteredNodes()
if len(unregisteredNodes) > 0 {
klog.V(1).Infof("%d unregistered nodes present", len(unregisteredNodes))
removedAny, err := removeOldUnregisteredNodes(unregisteredNodes, autoscalingContext, currentTime, autoscalingContext.LogRecorder)
removedAny, err := removeOldUnregisteredNodes(unregisteredNodes, autoscalingContext, a.clusterStateRegistry,
Member:

Just twigged that this will need reverting as well whilst playing around with backporting this into a 1.3 version.

@piontec (Contributor Author) commented Aug 3, 2019

Hey! I no longer work at aurea/devfactory and I lost write access there. I need to switch this work to my private github account. Please continue the discussion here: #2235. I did the requested cleanup there already.

@piontec piontec closed this Aug 3, 2019
@Jeffwan (Contributor) commented Aug 3, 2019

Hey! I no longer work at aurea/devfactory and I lost write access there. I need to switch this work to my private github account. Please continue the discussion here: #2235. I did the requested cleanup there already.

Good to know. Thanks!

@JacobHenner:

To those who might be trying to track down issues with speedy transitions from spot to OnDemand, see #3241.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 18, 2020
@k8s-ci-robot (Contributor):

@piontec: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
area/cluster-autoscaler cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Successfully merging this pull request may close these issues.

cluster autoscaler failed to scale up when AWS couldn't start a new instance