
fix: correctly handle lack of capacity of AWS spot ASGs #2008

Closed

Conversation

@piontec (Contributor) commented May 10, 2019

This fixes a situation where the AWS cloud provider is used and at least one of the managed ASGs is using spot instances. Instances requested from such an ASG may never come up, due to a bid that is too low or a lack of capacity in the spot pool. Right now, this triggers a situation where CAS considers such new nodes as "coming up", because coming_up_nodes = requested - real. As there's no instance identity to track, the timeout mechanism for startup time won't kick in.

This patch introduces artificial placeholder IDs for instances requested from any ASG. If such an instance does not become available before the timeout kicks in, the size of the ASG is decreased and the ASG is marked as "backoff".
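A minimal, self-contained sketch of the placeholder idea (illustrative stand-ins only, not the actual patch code; the real change lives in the AWS provider's asgCache):

// Illustrative sketch only: simplified stand-ins for the provider's asgCache types.
package main

import "fmt"

const placeholderInstanceNamePrefix = "i-placeholder-"

// asg is a simplified view of an Auto Scaling Group: its desired size and the
// instances that actually exist.
type asg struct {
	name    string
	desired int
	running []string
}

// createPlaceholders returns the real instance IDs plus synthetic placeholder
// IDs for the gap between desired and running, so the core logic has
// identities it can track and eventually time out.
func createPlaceholders(g *asg) []string {
	instances := append([]string{}, g.running...)
	for i := len(g.running); i < g.desired; i++ {
		instances = append(instances, fmt.Sprintf("%s%d", placeholderInstanceNamePrefix, i))
	}
	return instances
}

func main() {
	g := &asg{name: "spot-asg", desired: 6, running: []string{"i-0aaa", "i-0bbb", "i-0ccc", "i-0ddd"}}
	fmt.Println(createPlaceholders(g)) // 4 real IDs followed by 2 placeholder IDs
}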

Fixes #1996 and most probably also #1133 and #1795

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 10, 2019
@piontec piontec force-pushed the fix/aws_spots_squashed branch 2 times, most recently from 6fa2503 to 31d13f0 Compare May 10, 2019 10:33
// a failed scale up
if wasPlaceholder {
klog.Warningf("Timeout trying to scale node group %s, enabling backoff for the group", nodeGroup.Id())
clusterStateRegistry.RegisterFailedScaleUp(nodeGroup, metrics.Timeout, time.Now())
Contributor:

The failed scale-up logic should happen regardless, triggered from https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/clusterstate/clusterstate.go#L237. At minimum this will mess up metrics (you'll double count failed scale-ups), not sure if it breaks anything else.

@MaciekPytel (Contributor):

I think adding the placeholders alone is enough, all the changes outside AWS cloudprovider are redundant (and mess up metrics). The timeout/backoff mechanism has worked for years in GCP, all the core parts should be there already (it may not be obvious from code because in GCP placeholder instances are created by MIG, so they're already returned by API calls to GCP).

@piontec (Contributor Author) commented May 10, 2019

Hmm, so if tracking the scaleUpRequest is enough, why didn't it work in the first place? After the scaleUpRequest times out, backoffNodeGroup should solve the problem...?

@MaciekPytel (Contributor):

backoffNodeGroup temporarily prevents CA from trying another scale-up on the same NodeGroup, but doesn't make it ignore already upcoming nodes. My guess (no way to test...) would be that NodeGroup was backedOff, but it didn't matter as you were already stuck with forever-upcoming nodes preventing any future scale-up.

@piontec (Contributor Author) commented May 10, 2019

OK, that might be it. Let me remove the code in utils.go and retest - I completely missed that Request status update logic

@aleksandra-malinowska (Contributor):

Hmm, so if tracking the scaleUpRequest is enough, why didn't it work in the first place? After the scaleUpRequest times out, backoffNodeGroup should solve the problem...?

I may be missing some context here, but I think I recall there was a problem with ASG immediately changing its target size back to actual size when the instance fails to start (see #1263). So scaleUpRequest never times out.

@@ -323,6 +350,11 @@ func (m *asgCache) regenerate() error {
return err
}

// If currently any ASG has more Desired than running Instances, introduce placeholders
// for the instances to come up. This is required to track Desired instances that
// will never come up, like with Spot Request that can't be fulfilled
Contributor:

We do see several issues that lead to desired > current. Sometimes on-demand nodes can also run out of capacity, for example when hitting a limit, or for nodes with strict constraints, like being in a single placement group.

@piontec (Contributor Author) commented May 13, 2019

backoffNodeGroup temporarily prevents CA from trying another scale-up on the same NodeGroup, but doesn't make it ignore already upcoming nodes. My guess (no way to test...) would be that NodeGroup was backedOff, but it didn't matter as you were already stuck with forever-upcoming nodes preventing any future scale-up.

OK, I think I know better now what is going on. Indeed, after scaling up a node group that won't increase its size, the Long* counters are not updated, so on every check CAS computes the upcoming nodes as Desired - Running, which gives a positive number and keeps other scaling groups from scaling up.

Still, my first proposal is (almost) good. The problem is that at some point the node group needs to be scaled down by the Increase amount of the ScaleUpRequest. In my patch, I did that one by one: every time a node startup timeout was triggered, I decreased the size of the node group and returned an error indicating that a backoff is necessary.

The above approach is needed, as we can't rely on the ScaleUpRequest timing out. Imagine a situation where CAS crashes and the node group state is already "Wanted = 6, Running = 4". Now CAS starts up and sees 2 nodes coming up in the node group, but there's no ScaleUpRequest object. So the only thing we can rely on is the node startup timeout, after which we have to immediately trigger backoff, as otherwise the same node group will be used to scale up in the next iteration. So the only problem in my original patch is the double logging of failed-scale-up metrics.

Am I right? What do you think?

==edit==
OK, it's even a little bit "funnier" :P Even if we start up CAS with an ASG already at "Wanted = 6, Running = 4", there'll be no ScaleUpRequest. But if we use the node startup timeout to scale the ASG down, then even if CAS selects this node group again for scale-out, a ScaleUpRequest will be created this time. When this one times out, the node group will be backed off, and when the node startup timeout triggers, the group will be scaled down and won't be used again, as it is in "backoff". So, after all, it will work, but consistency will be achieved much later, after at least 2 * NodeStartupTime instead of a single timeout. This seems to be a corner case, but it is pretty real for us, running (almost) the whole cluster on spots. Hence, I'm still in favor of the solution discussed above, where not only is the node group scaled down on node startup timeout, but CAS also immediately marks the group as "backoff", without waiting for the ScaleUpRequest timeout.

@mvisonneau:

Thanks a lot @piontec! I managed to test it and it seems to be working like a charm 👌

logRecorder.Eventf(apiv1.EventTypeWarning, "DeleteUnregisteredFailed",
"Failed to remove node %s: %v", unregisteredNode.Node.Name, err)
return removedAny, err
_, wasPlaceholder := err.(*cloudprovider.PlaceholderDeleteError)
Contributor:

I do not believe it is the right approach to determine the nature of a deleted instance based on the error returned from DeleteNodes() (especially given the fact that DeleteNodes takes multiple nodes as an argument).

Instead, we should base the behavior on what we get from NodeGroup.Nodes().
Actually, it seems you already have all the needed fields in the interface. NodeGroup.Nodes() returns an InstanceStatus for each node, and there is an InstanceState field which can currently be one of Instance{Running,Creating,Deleting}. It seems to me that you want to back off the node group if the deleted node is in the InstanceCreating state. If you think you need to extend InstanceStatus in some way (e.g. introduce more error classes or something), we can discuss it.

Contributor Author:

I don't think the above is right. We want to have feedback about the whole DeleteNodes() call; we don't need this to be instance-by-instance. The thing is: if you requested 4 new instances and 4 came up successfully, but there's no spot capacity for the 5th, the ASG should be "backed off" exactly the same as if no instance was started.
Of course, we can set InstanceState for each instance and then, after the call, iterate over the instances again to check if any is not coming up, but I'd like to introduce a new state for that, like CreationTimeout.

Contributor:

I don't think the above is right. We want to have feedback about the whole DeleteNodes() call; we don't need this to be instance-by-instance. The thing is: if you requested 4 new instances and 4 came up successfully, but there's no spot capacity for the 5th, the ASG should be "backed off" exactly the same as if no instance was started.

I agree that we want to back off in such a case.

Of course, we can set InstanceState for each instance and then, after the call, iterate over the instances again to check if any is not coming up, but I'd like to introduce a new state for that, like CreationTimeout.

InstanceState should still be InstanceCreating. Maybe the wording is not the best, but InstanceCreating covers both:

  • an instance which is being created and for which we do not yet know the outcome of that process
  • an instance which was created but whose creation resulted in a failure

For the latter case, the ErrorInfo in the InstanceStatus is set to a non-nil value, and IMO you should use this field to differentiate between successful and failed creation. It seems that you can still use OtherErrorClass when you set ErrorInfo.
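Roughly, that could look like the sketch below. The type and constant names (Instance, InstanceStatus, InstanceErrorInfo, InstanceCreating, OutOfResourcesErrorClass) are as I recall them from the cloudprovider package; the error code and message strings are made up for illustration.

package example

import "k8s.io/autoscaler/cluster-autoscaler/cloudprovider"

// placeholderInstance reports a requested-but-never-created instance. It stays
// in InstanceCreating state; once it is considered failed, a non-nil ErrorInfo
// lets the core logic treat the creation as failed and back off the group.
func placeholderInstance(id string, failed bool) cloudprovider.Instance {
	status := &cloudprovider.InstanceStatus{State: cloudprovider.InstanceCreating}
	if failed {
		status.ErrorInfo = &cloudprovider.InstanceErrorInfo{
			ErrorClass:   cloudprovider.OutOfResourcesErrorClass, // or OtherErrorClass
			ErrorCode:    "placeholder-cannot-be-fulfilled",      // illustrative code
			ErrorMessage: "AWS cannot provision the requested spot instance",
		}
	}
	return cloudprovider.Instance{Id: id, Status: status}
}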

Contributor Author:

OK, I think I can have a look at that today. BTW, are any of you going to KubeCon Barcelona next week?

Contributor:

Thanks. I will be in BCN, and a bunch of other people from the autoscaling team will be too.

Contributor Author:

There's a problem with this approach. It's not easy to update Node.InstanceState within the AWS provider, as it only keeps track of instance refs (a single string). This is common throughout the whole provider. Only when Nodes() is called is the array of strings (instance IDs) converted to []Nodes, so there's no easy way to keep additional state for nodes without heavily changing the interfaces and return types of the AWS provider.

Contributor:

I understand. Yet I do not believe that changing the signature of DeleteNodes is the way to go. I do not like the idea of making point changes to a general interface just so it is somewhat less effort to support one specific implementation.

If exposing InstanceState for all the instances in AWS is problematic, maybe you can fairly easily set those to the placeholder nodes only?

@piontec (Contributor Author) commented May 17, 2019

I did just that, please check now, although it's still not pretty

@piontec (Contributor Author) commented May 29, 2019

ping @losipiuk @MaciekPytel , what do you think now?

scaleToZeroSupported = true
placeholderInstanceNamePrefix = "i-placeholder-"
// TimeoutedPlaceholderName is used to mark placeholder instances that did not come up with timeout
TimeoutedPlaceholderName = "i-timeouted-placeholder"

s/timeouted/timedout/

return err
// check if the instance is a placeholder - a requested instance that was never created by the node group
// if it is, just decrease the size of the node group, as there's no specific instance we can remove
matched, err := regexp.MatchString(fmt.Sprintf("^%s\\d+$", placeholderInstanceNamePrefix), instance.Name)

We could use regexp.Compile to compile this regular expression just once, given that the prefix string is fixed. The resulting regexp.Regexp value is safe for repeated and concurrent use.

Contributor:

Also, it would be nicer to extract an isPlaceHolderInstance(instance) method instead of inlining regexp.MatchString.
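Something along these lines would cover both suggestions (a precompiled package-level pattern plus a named helper); the names below are suggestions, not existing code:

package aws

import (
	"fmt"
	"regexp"
)

const placeholderInstanceNamePrefix = "i-placeholder-"

// Compiled once; a *regexp.Regexp is safe for concurrent use.
var placeholderInstanceRegexp = regexp.MustCompile(
	fmt.Sprintf(`^%s\d+$`, placeholderInstanceNamePrefix))

// isPlaceholderInstance reports whether the name refers to a placeholder
// synthesized by the ASG cache rather than a real EC2 instance.
func isPlaceholderInstance(name string) bool {
	return placeholderInstanceRegexp.MatchString(name)
}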

klog.V(4).Infof("instance %s is detected as a placeholder, decreasing ASG requested size instead "+
"of deleting instance", instance.Name)
m.decreaseAsgSizeByOneNoLock(commonAsg)
// mark this instance using its name as a timeouted placeholder

s/timeouted/timed out/

for i := real; i < desired; i++ {
id := fmt.Sprintf("%s%d", placeholderInstanceNamePrefix, i)
klog.V(4).Infof("Instance group %s has only %d instances created while requested count is %d."+
"Creating placeholder instance with ID %s", *g.AutoScalingGroupName, real, desired, id)

Either add a leading space before "Creating" or a trailing space after the period on the preceding line. As written, the two concatenated sentences will wind up with no intervening space character.

Also, punctuate the second sentence like the first one.

}
if failedPlaceholder {
// this means only a placeholder instance was deleted - it is an instance, that was requested,
// but was not create before StartUpTimeout. It means something's wrong with this specific

s/create/created/

return err
// check if the instance is a placeholder - a requested instance that was never created by the node group
// if it is, just decrease the size of the node group, as there's no specific instance we can remove
matched, err := regexp.MatchString(fmt.Sprintf("^%s\\d+$", placeholderInstanceNamePrefix), instance.Name)
Contributor:

Also, it would be nicer to extract an isPlaceHolderInstance(instance) method instead of inlining regexp.MatchString.

@@ -41,7 +41,7 @@ import (

const (
// MaxNodeStartupTime is the maximum time from the moment the node is registered to the time the node is ready.
MaxNodeStartupTime = 15 * time.Minute
MaxNodeStartupTime = 1 * time.Minute
Contributor:

Please revert all changes outside of the cloudprovider.
If you correctly set the InstanceStatus, backing off should be handled automatically by ClusterStateRegistry.handleOutOfResourcesErrorsForNodeGroup.

// mark this instance using its name as a timeouted placeholder
asg := m.instanceToAsg[*instance]
delete(m.instanceToAsg, *instance)
instance.Name = TimeoutedPlaceholderName
Contributor:

This seems very much not right.

IMO the instance should be marked as "timed out" in some routine which refreshes the node group state. I guess in the same place where you are creating placeholders you should also check whether some of the placeholders are "old enough" to be treated as timed out. Then you can mark those as such and, when instances are listed, set an OutOfResources error for them.

Then DeleteInstances will be called by the core logic for the timed-out instances, and here you will only remove the placeholder and decrease the ASG size.
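A rough, hypothetical sketch of that flow; none of these names exist in the AWS provider today, they just show where the timeout check could live:

package aws

import "time"

// placeholderTracker is a hypothetical helper the ASG cache could consult on
// every refresh to decide which placeholders have waited too long.
type placeholderTracker struct {
	firstSeen      map[string]time.Time
	timedOut       map[string]bool
	startupTimeout time.Duration
}

func newPlaceholderTracker(timeout time.Duration) *placeholderTracker {
	return &placeholderTracker{
		firstSeen:      map[string]time.Time{},
		timedOut:       map[string]bool{},
		startupTimeout: timeout,
	}
}

// observe is called for every placeholder seen during a cache refresh and
// marks it as timed out once it has been pending longer than startupTimeout.
func (t *placeholderTracker) observe(id string, now time.Time) {
	first, seen := t.firstSeen[id]
	if !seen {
		t.firstSeen[id] = now
		return
	}
	if now.Sub(first) > t.startupTimeout {
		t.timedOut[id] = true
	}
}

// isTimedOut tells Nodes() whether to report the placeholder with an
// out-of-resources error, so the core logic deletes it and the cache only has
// to drop the placeholder and decrease the ASG size.
func (t *placeholderTracker) isTimedOut(id string) bool {
	return t.timedOut[id]
}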

Contributor Author:

I get your point, but the problem is again in how the AWS cloud provider works. Internally, it keeps instance IDs only, and these IDs need to stay constant. It also doesn't track any timeouts - that's on the core's side, specifically in clusterstate.go. The logic that tracks instances and handles registration timeouts lives there.

We either have to reuse/link this logic over here in auto_scaling_groups.go, or assume that if a DeleteInstances call comes for a placeholder, it's always because of a timeout related to a non-existent ScaleUpRequest. Basically, this whole logic is there to guard against a single use case: when CAS is started while already required > running for an ASG and the instances never come up.

I think there are the following possible cases for placeholder instances:

  1. The ScaleUpRequest times out. In that case, we don't have to modify the Nodes state at all, as the timeout handler will scale down the related ASG, mark it as "backed off" and decrease its size, which will make the placeholders go away on the next refresh (which runs at the start of every main loop).
  2. As I wrote above, on CAS start "required > running" already holds; in that case, DeleteNodes will be invoked by the core logic (after detecting that the node is listed by the cloud provider but not registered in k8s), and we assume that the node is deleted because of a timeout. One possible modification is to explicitly pass an "isTimeout" flag here.

What do you think?

Contributor:

I get your point, but the problem is again in how the AWS cloud provider works. Internally, it keeps instance IDs only, and these IDs need to stay constant. It also doesn't track any timeouts - that's on the core's side, specifically in clusterstate.go. The logic that tracks instances and handles registration timeouts lives there.

We either have to reuse/link this logic over here in auto_scaling_groups.go, or assume that if a DeleteInstances call comes for a placeholder, it's always because of a timeout related to a non-existent ScaleUpRequest. Basically, this whole logic is there to guard against a single use case: when CAS is started while already required > running for an ASG and the instances never come up.

My understanding was that you wanted to speed up the reaction time of CA when node instances coming from a spot-backed ASG are not showing up.
If this is just about fixing the bug that the 15-minute timeout mechanism built into CA does not trigger, the situation is simpler (but the benefits are also smaller).

  1. The ScaleUpRequest times out. In that case, we don't have to modify the Nodes state at all, as the timeout handler will scale down the related ASG, mark it as "backed off" and decrease its size, which will make the placeholders go away on the next refresh (which runs at the start of every main loop).

Yes. Should work that way.

  2. As I wrote above, on CAS start "required > running" already holds; in that case, DeleteNodes will be invoked by the core logic (after detecting that the node is listed by the cloud provider but not registered in k8s), and we assume that the node is deleted because of a timeout. One possible modification is to explicitly pass an "isTimeout" flag here.

If we just added the placeholders and did not modify the code in any other way, we would get to a stable state after DeleteNodes is called. The ASG just would not be backed off this time. Is that a problem? Do we care about the (possibly rare) case when CA is restarted while a scale-up is in progress? The state will become stable eventually. I do not believe we want to optimize for the restart scenario. CA was never designed to be restarted very often. It survives restarts, but it does not necessarily behave optimally in such situations.


Is that 15 minute timeout the one that's configured by the --max-node-provision-time command-line flag? There are several flags with default values of 15 minutes, and there's a const MaxNodeStartupTime with the same value. It doesn't look like that one can be changed without rebuilding the program.

15 minutes is too long for us to wait to react to an ASG's failure to acquire a machine, so we'd like to be able to dial this down.

Contributor Author:

@losipiuk OK, I'll try to split this patch in two: first provide a general fix that has a worst case of recovering in 2 * NodeStartupTime, then create another one that can work in just 1 * NodeStartupTime.

@seh It's --max-node-provision-time. And yeah... the general recommendation is "do not change that". Still, in my tests I dialed it down to 2 minutes and it seemed to work. YMMV.

@seh commented May 31, 2019

I've been testing this patch in a cluster running Kubernetes version 1.13.4. I just saw the container crash like this:

I0531 14:40:04.609595       1 static_autoscaler.go:161] Starting main loop
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x50 pc=0x2164a15]

goroutine 99 [running]:
k8s.io/autoscaler/cluster-autoscaler/clusterstate.(*ClusterStateRegistry).updateReadinessStats(0xc0009dd7c0, 0xbf346d1124555c8c, 0x2c0acc8908, 0x4b7e140)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/clusterstate/clusterstate.go:569 +0x9a5
k8s.io/autoscaler/cluster-autoscaler/clusterstate.(*ClusterStateRegistry).UpdateNodes(0xc0009dd7c0, 0xc00132d6c0, 0x5, 0x8, 0xc000aa2270, 0xbf346d1124555c8c, 0x2c0acc8908, 0x4b7e140, 0x0, 0x0)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/clusterstate/clusterstate.go:303 +0x234
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).updateClusterState(0xc000a3c280, 0xc00132d6c0, 0x5, 0x8, 0xc000aa2270, 0xbf346d1124555c8c, 0x2c0acc8908, 0x4b7e140, 0xc00132d700, 0x8)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:541 +0x94
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).RunOnce(0xc000a3c280, 0xbf346d1124555c8c, 0x2c0acc8908, 0x4b7e140, 0x0, 0x0)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:191 +0x5a8
main.run(0xc0002db360)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:322 +0x201
main.main.func2(0x2ef6780, 0xc00040cac0)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:394 +0x2a
created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:182 +0xec

I don't know yet whether that's related to these changes.

@seh commented May 31, 2019

That failure above seems to happen within about two seconds after deleting a bunch of placeholder instances and reducing the desired count on the corresponding ASG. I can reproduce it easily.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 23, 2019
@seh commented Jun 23, 2019

We have indeed been running this version in our clusters for a couple of weeks, and have found that it mostly alleviates the problem with “stuck” ASGs.

The chattiness of the log messages about placeholder instances still makes me anxious, wondering if everything is working as it should, but the net observable behavior is what we asked for.

The backoff delay parameters could be exposed as configuration, but I understand that doing so would then require validating that they all work sensibly together, and we’d have to document several more command-line flags. We don’t need to take that on here. This patch is about enabling behavior that was already supposed to work this way, as opposed to introducing new features.

I hope @piontec agrees with my summary.

@losipiuk (Contributor) left a comment:

Sorry for the huge delay on the review.
Looks good. Just please clean up the unneeded changes remaining from the previous approaches.

@@ -484,7 +484,7 @@ func sanitizeTemplateNode(node *apiv1.Node, nodeGroup string, ignoredTaints tain

// Removes unregistered nodes if needed. Returns true if anything was removed and error if such occurred.
func removeOldUnregisteredNodes(unregisteredNodes []clusterstate.UnregisteredNode, context *context.AutoscalingContext,
currentTime time.Time, logRecorder *utils.LogEventRecorder) (bool, error) {
clusterStateRegistry *clusterstate.ClusterStateRegistry, currentTime time.Time, logRecorder *utils.LogEventRecorder) (bool, error) {
Contributor:

Please revert.

@@ -514,6 +514,7 @@ func removeOldUnregisteredNodes(unregisteredNodes []clusterstate.UnregisteredNod
"Failed to remove node %s: %v", unregisteredNode.Node.Name, err)
return removedAny, err
}

Contributor:

please revert

@@ -451,12 +451,12 @@ func TestRemoveOldUnregisteredNodes(t *testing.T) {
assert.Equal(t, 1, len(unregisteredNodes))

// Nothing should be removed. The unregistered node is not old enough.
removed, err := removeOldUnregisteredNodes(unregisteredNodes, context, now.Add(-50*time.Minute), fakeLogRecorder)
removed, err := removeOldUnregisteredNodes(unregisteredNodes, context, clusterState, now.Add(-50*time.Minute), fakeLogRecorder)
Contributor:

please revert

assert.NoError(t, err)
assert.False(t, removed)

// ng1_2 should be removed.
removed, err = removeOldUnregisteredNodes(unregisteredNodes, context, now, fakeLogRecorder)
removed, err = removeOldUnregisteredNodes(unregisteredNodes, context, clusterState, now, fakeLogRecorder)
Contributor:

please revert

@@ -274,6 +274,14 @@ func (csr *ClusterStateRegistry) updateScaleRequests(currentTime time.Time) {
csr.scaleDownRequests = newScaleDownRequests
}

// BackoffNodeGroup is used to force the specified nodeGroup to go into backoff mode, which
// means it won't be used for scaling out temporarily
func (csr *ClusterStateRegistry) BackoffNodeGroup(nodeGroup cloudprovider.NodeGroup, currentTime time.Time) {
Contributor:

please revert

if err := m.asgCache.DeleteInstances(instances); err != nil {
return err
}
return m.forceRefresh()
Member:

It feels like we should either have a log message here stating that a refresh is being forced, or update the message in forceRefresh, as it could currently be confusing for users, given the log message in forceRefresh states: "Refreshed ASG list, next refresh after %v", m.lastRefresh.Add(refreshInterval))
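For example, the first option could look roughly like this (the types are simplified stand-ins, and the wording of the message is only a suggestion):

package aws

import "k8s.io/klog"

// instanceRef and asgCacheLike are simplified stand-ins for the provider's
// real types; the point of the sketch is only the extra log line.
type instanceRef struct{ Name string }

type asgCacheLike interface {
	DeleteInstances(instances []*instanceRef) error
}

type manager struct {
	asgCache asgCacheLike
}

func (m *manager) forceRefresh() error { return nil } // stand-in

// DeleteInstances deletes the given instances and then forces a cache refresh,
// logging explicitly so the forced refresh is not mistaken for the periodic
// one announced by forceRefresh's own "Refreshed ASG list" message.
func (m *manager) DeleteInstances(instances []*instanceRef) error {
	if err := m.asgCache.DeleteInstances(instances); err != nil {
		return err
	}
	klog.V(2).Info("Forcing ASG list refresh after instance deletion")
	return m.forceRefresh()
}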

@gjtempleton (Member):

Awesome work, really glad to see someone finally getting around to resolving this.

One brief comment in addition to @losipiuk's.

@seh commented Jul 17, 2019

@piontec, do you anticipate being able to address @losipiuk's and @gjtempleton's suggestions soon?

@k8s-ci-robot (Contributor):

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 17, 2019
@jaypipes (Contributor) left a comment:

Besides the code that needs to be reverted per @losipiuk's requests, this code looks reasonable to me. My only question is around e2e testing. I don't see any functional test that would stress the code paths introduced by this patch. Are those functional tests missing because they would be difficult to reproduce? Or was the test absence simply an oversight?

Best,
-jay

@seh commented Jul 31, 2019

I'm willing to take over this patch, and can handle the requests to revert a few lines. I'm not sure what's involved with writing end-to-end tests for this change in behavior.

What would the procedure be? Submit a fresh PR that starts from where this branch leaves off?

@MaciekPytel (Contributor):

Regarding e2e tests: I'm not exactly sure what you mean by that. Currently CA uses unit tests and e2e tests in a real cluster - the latter only exist for GCP. Obviously there is no way to cover this without porting them to AWS. I have no idea how much work that would involve (I expect a lot), but I don't think this PR should be gated on it.

@Jeffwan (Contributor) commented Jul 31, 2019

Regarding e2e tests: I'm not exactly sure what you mean by that. Currently CA uses unit tests and e2e tests in a real cluster - the latter only exist for GCP. Obviously there is no way to cover this without porting them to AWS. I have no idea how much work that would involve (I expect a lot), but I don't think this PR should be gated on it.

Agreed. It's more like an effort we can put into sig-testing separately.

I'm willing to take over this patch, and can handle the requests to revert a few lines. I'm not sure what's involved with writing end-to-end tests for this change in behavior.
What would the procedure be? Submit a fresh PR that starts from where this branch leaves off?

Hi @seh, is there a way we can get in touch with @piontec? Not sure what happened to the missing GitHub user.

@seh commented Jul 31, 2019

@Jeffwan, I see that Lukasz does publish an email address in his GitHub profile, but I expect that he's receiving updates for each message posted here. I'm willing to write to him once, just in case I'm wrong about those notifications. I don't want to bother him if he's had to move on to other projects.

Stay tuned.

@@ -254,7 +254,9 @@ func (a *StaticAutoscaler) RunOnce(currentTime time.Time) errors.AutoscalerError
unregisteredNodes := a.clusterStateRegistry.GetUnregisteredNodes()
if len(unregisteredNodes) > 0 {
klog.V(1).Infof("%d unregistered nodes present", len(unregisteredNodes))
removedAny, err := removeOldUnregisteredNodes(unregisteredNodes, autoscalingContext, currentTime, autoscalingContext.LogRecorder)
removedAny, err := removeOldUnregisteredNodes(unregisteredNodes, autoscalingContext, a.clusterStateRegistry,
Member:

Just twigged that this will need reverting as well whilst playing around with backporting this into a 1.3 version.

@piontec (Contributor Author) commented Aug 3, 2019

Hey! I no longer work at aurea/devfactory and I lost write access there. I need to switch this work to my private github account. Please continue the discussion here: #2235. I did the requested cleanup there already.

@piontec piontec closed this Aug 3, 2019
@Jeffwan (Contributor) commented Aug 3, 2019

Hey! I no longer work at aurea/devfactory and I lost write access there. I need to switch this work to my private github account. Please continue the discussion here: #2235. I did the requested cleanup there already.

Good to know. Thanks!

@JacobHenner:

To those who might be trying to track down issues with speedy transitions from spot to OnDemand, see #3241.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 18, 2020
@k8s-ci-robot (Contributor):

@piontec: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
area/cluster-autoscaler cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Successfully merging this pull request may close these issues.

cluster autoscaler failed to scale up when AWS couldn't start a new instance