WIP: Add status conditions for nodeClass #6438

Open
wants to merge 1 commit into
base: main

jigisha620
Contributor

Fixes #N/A

Description
This PR adds status conditions to EC2NodeClass.

How was this change tested?
/karpenter snapshot

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

jigisha620 requested a review from a team as a code owner on July 1, 2024, 18:45

netlify bot commented Jul 1, 2024

Deploy Preview for karpenter-docs-prod canceled.

Name Link
🔨 Latest commit bb128ee
🔍 Latest deploy log https://app.netlify.com/sites/karpenter-docs-prod/deploys/6687334f8644f20008b20e03

coveralls commented Jul 1, 2024

Pull Request Test Coverage Report for Build 9749476087

Details

  • 173 of 266 (65.04%) changed or added relevant lines in 8 files are covered.
  • 4 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.4%) to 77.934%

Changes Missing Coverage                                         Covered Lines  Changed/Added Lines  %
pkg/controllers/controllers.go                                   0              5                    0.0%
pkg/apis/v1/ec2nodeclass_status.go                               0              6                    0.0%
pkg/controllers/nodeclass/status/subnet/controller.go            30             36                   83.33%
pkg/controllers/nodeclass/status/ami/controller.go               48             65                   73.85%
pkg/controllers/nodeclass/status/launch_template/controller.go   17             36                   47.22%
pkg/controllers/nodeclass/status/instance_profile/controller.go  31             51                   60.78%
pkg/controllers/nodeclass/status/security_group/controller.go    41             61                   67.21%

Files with Coverage Reduction                     New Missed Lines  %
pkg/providers/instanceprofile/instanceprofile.go  2                 80.0%
pkg/apis/v1/ec2nodeclass_status.go                2                 0.0%

Totals
Change from base Build 9739358800: -0.4%
Covered Lines: 5863
Relevant Lines: 7523

💛 - Coveralls

jigisha620 (Contributor Author) left a comment

/karpenter snapshot

github-actions bot commented Jul 1, 2024

Snapshot successfully published to oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter:0-a6d733a4217dfcdd5401ee8cc05b21e92515edec.
To install, you must log in to the ECR repo with an AWS account:

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 021119463062.dkr.ecr.us-east-1.amazonaws.com

helm upgrade --install karpenter oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter --version "0-a6d733a4217dfcdd5401ee8cc05b21e92515edec" --namespace "kube-system" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait

launchTemplateProvider launchtemplate.Provider
}

func (n Readiness) Reconcile(ctx context.Context, nodeClass *v1beta1.EC2NodeClass) (reconcile.Result, error) {

Can we not just move these checks into the reconcile function of each resource, instead of having a separate controller for each field we change?

@@ -0,0 +1,120 @@
/*

Why do we want to have separate controllers for each resource?

coveralls commented Jul 4, 2024

Pull Request Test Coverage Report for Build 9800927619

Details

  • 172 of 271 (63.47%) changed or added relevant lines in 8 files are covered.
  • 4 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.5%) to 78.295%

Changes Missing Coverage                                         Covered Lines  Changed/Added Lines  %
pkg/controllers/controllers.go                                   0              5                    0.0%
pkg/apis/v1/ec2nodeclass_status.go                               0              6                    0.0%
pkg/controllers/nodeclass/status/instance_profile/controller.go  17             25                   68.0%
pkg/controllers/nodeclass/status/ami/controller.go               48             66                   72.73%
pkg/controllers/nodeclass/status/launch_template/controller.go   17             37                   45.95%
pkg/controllers/nodeclass/status/security_group/controller.go    41             62                   66.13%
pkg/controllers/nodeclass/status/subnet/controller.go            43             64                   67.19%

Files with Coverage Reduction                     New Missed Lines  %
pkg/providers/instanceprofile/instanceprofile.go  2                 80.0%
pkg/apis/v1/ec2nodeclass_status.go                2                 0.0%

Totals
Change from base Build 9786416051: -0.5%
Covered Lines: 6017
Relevant Lines: 7685

💛 - Coveralls

func (c *Controller) Reconcile(ctx context.Context, nodeClass *v1beta1.EC2NodeClass) (reconcile.Result, error) {
ctx = injection.WithControllerName(ctx, "nodeclass.ami")

if !controllerutil.ContainsFinalizer(nodeClass, v1beta1.TerminationFinalizer) {

Do we run into a potential race condition here with multiple controllers adding this termination finalizer? That is, if multiple controllers read the object before any of them writes, multiple patches get applied.

if errs != nil {
return reconcile.Result{}, errs
}
if !nodeClass.StatusConditions().IsTrue(v1beta1.ConditionTypeAMIsReady) {

I'm not sure we should re-reconcile immediately in this case. If this is caused by genuine user misconfiguration, we won't be able to resolve any additional AMIs unless the user updates their configuration (or does something like tagging an existing AMI to match). If this is caused by an API outage, retrying immediately doesn't help us either. We're hitting internal caches, so it shouldn't cause a retry storm, but there's still no reason to retry immediately here, IMO.
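
One way to read this suggestion is to requeue after a delay rather than immediately. The sketch below is an illustration only: `Result` is a local stand-in that mirrors the `RequeueAfter` field of controller-runtime's `reconcile.Result`, and both `nextResult` and the five-minute interval are assumptions, not values from the PR.

```go
package main

import (
	"fmt"
	"time"
)

// Result is a local stand-in mirroring the RequeueAfter field of
// controller-runtime's reconcile.Result, so the sketch has no dependencies.
type Result struct {
	RequeueAfter time.Duration
}

// retryInterval is an assumed value chosen for illustration only.
const retryInterval = 5 * time.Minute

// nextResult decides when to reconcile again, based on whether the
// condition (for example, AMIs resolved) is currently true.
func nextResult(conditionReady bool) Result {
	if !conditionReady {
		// Wait before retrying: neither a misconfiguration nor an
		// API outage is likely to clear up within milliseconds.
		return Result{RequeueAfter: retryInterval}
	}
	// Nothing to retry; wait for the next watch event.
	return Result{}
}

func main() {
	fmt.Println(nextResult(false).RequeueAfter, nextResult(true).RequeueAfter)
}
```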

}
stored := nodeClass.DeepCopy()
amis, err := c.amiProvider.List(ctx, nodeClass)
var errs error

This multierr is only used to capture the error from amiProvider.List; can we just drop it?

}
}
if !equality.Semantic.DeepEqual(stored, nodeClass) {
if err = c.kubeClient.Status().Update(ctx, nodeClass); err != nil {

Why does this need to be an update rather than a patch? We're only updating the existing status condition, right? Do we expect any other controllers to update this status condition in a way we could conflict with? Same comment for all other controllers.

Contributor Author

I tried to follow what we have done with most of our controllers for status updates. Is there a benefit of using patch over update?
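
As a rough illustration of the difference being discussed, the sketch below contrasts a full replace with a diff-style patch when the controller's copy is stale. It is a simplified model under stated assumptions: conditions are a flat map keyed by type (real status conditions are a list, which merge patches handle differently), and `diff`, `applyPatch`, `replace`, and `clone` are hypothetical helpers. A real update would normally be rejected with a conflict on a stale resource version rather than silently overwrite.

```go
package main

import "fmt"

// diff returns only the keys whose values differ between before and after,
// similar in spirit to a merge patch (key deletions are ignored for brevity).
func diff(before, after map[string]string) map[string]string {
	patch := map[string]string{}
	for k, v := range after {
		if old, ok := before[k]; !ok || old != v {
			patch[k] = v
		}
	}
	return patch
}

// applyPatch merges the patch into the server copy, leaving other keys alone.
func applyPatch(server, patch map[string]string) {
	for k, v := range patch {
		server[k] = v
	}
}

// replace overwrites the server copy wholesale, as an unguarded full
// update would.
func replace(server, desired map[string]string) {
	for k := range server {
		delete(server, k)
	}
	for k, v := range desired {
		server[k] = v
	}
}

// clone returns a shallow copy of m.
func clone(m map[string]string) map[string]string {
	out := make(map[string]string, len(m))
	for k, v := range m {
		out[k] = v
	}
	return out
}

func main() {
	server := map[string]string{"AMIsReady": "Unknown", "SubnetsReady": "Unknown"}
	stored := clone(server) // this controller's snapshot

	// Another controller sets SubnetsReady after our read.
	server["SubnetsReady"] = "True"

	desired := clone(stored)
	desired["AMIsReady"] = "True"

	viaPatch := clone(server)
	applyPatch(viaPatch, diff(stored, desired))

	viaUpdate := clone(server)
	replace(viaUpdate, desired)

	fmt.Println("patch keeps SubnetsReady =", viaPatch["SubnetsReady"])
	fmt.Println("replace resets SubnetsReady =", viaUpdate["SubnetsReady"])
}
```

In this model the patch only touches the key this controller changed, while the full replace discards the other controller's write.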

res, err := reconciler.Reconcile(ctx, nodeClass)
errs = multierr.Append(errs, err)
results = append(results, res)
if nodeClass.Spec.Role != "" {

nit: IMO it would be cleaner to pull this logic out into a helper function and do away with the multierror. This is going to be a theme in my review, but we should avoid multierrors when they're not needed, since their use implies there will be multiple errors.

func (c *Controller) Reconcile(...) {
    ...
    instanceProfile, err := c.resolveInstanceProfile(ctx, nodeClass)
    if err != nil {
        nodeClass.StatusConditions().SetFalse(v1beta1.ConditionTypeInstanceProfileReady, "InstanceProfileCreateError", "Error creating instance profile")
    } else {
        nodeClass.Status.InstanceProfile = instanceProfile
        nodeClass.StatusConditions().SetTrue(v1beta1.ConditionTypeInstanceProfileReady)
    }

    // Persist the status change before returning any error.
    if !equality.Semantic.DeepEqual(stored, nodeClass) {
        ...
    }
    if err != nil {
        return reconcile.Result{}, err
    }
    ...
}

// resolveInstanceProfile returns the user-specified instance profile when no
// role is set; otherwise it creates an instance profile for the role.
func (c *Controller) resolveInstanceProfile(ctx context.Context, nodeClass *v1beta1.EC2NodeClass) (string, error) {
    if nodeClass.Spec.Role == "" {
        return lo.FromPtr(nodeClass.Spec.InstanceProfile), nil
    }
    return c.instanceProfileProvider.Create(ctx, nodeClass)
}

@@ -12,62 +12,44 @@ See the License for the specific language governing permissions and
limitations under the License.
*/

-package status
+package instance_profile

Package names shouldn't include underscores, same comment for launch_template and security_group.

stored := nodeClass.DeepCopy()

securityGroups, err := c.securityGroupProvider.List(ctx, nodeClass)
var errs error

Can we also rework this to not use multierror, maybe by moving it into a helper function? In general there's a lot of nesting here; moving it into a helper function with some early returns could help readability.

if errs != nil {
return reconcile.Result{}, errs
}
if !nodeClass.StatusConditions().IsTrue(v1beta1.ConditionTypeSecurityGroupsReady) {

As with AMIs, I'm not sure we should re-reconcile immediately here. Is there a reason we would fail to resolve security groups once but succeed immediately after (without any changes to the NodeClass)?

}
stored := nodeClass.DeepCopy()
subnets, err := c.subnetProvider.List(ctx, nodeClass)
var errs error

Same comment I've given elsewhere about multierr and reducing nesting with a helper function.

if errs != nil {
return reconcile.Result{}, errs
}
if !nodeClass.StatusConditions().IsTrue(v1beta1.ConditionTypeSubnetsReady) {

Same comment I've given on other controllers about the immediate requeue.

jigisha620 changed the title from "chore: Add status conditions for nodeClass" to "WIP: Add status conditions for nodeClass" on Jul 5, 2024

4 participants