Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for Kubelet and Linux Node configurations in GKE node pools #2279

Merged
merged 4 commits into from
Aug 19, 2020

Conversation

rmanyari
Copy link
Contributor

@rmanyari rmanyari commented Jul 20, 2020

Per issue 6773, this PR adds support for KubeletConfig and LinuxNodeConfig attributes at GKE cluster and node creation. Updates are also supported at the node pool level.

Signed-off-by: rodrigo@twosigma.com

container: added support for `kubelet_config` and `linux_node_config` to GKE node pools (beta)

@ghost ghost added the size/xl label Jul 20, 2020
@ghost ghost requested review from rileykarson July 20, 2020 18:01
@rmanyari
Copy link
Contributor Author

This is still WIP, the acceptance test for Linux node configs doesn't pass yet but still wanted to get this out to get some early feedback.

Copy link
Collaborator

@rileykarson rileykarson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I shared a few early thoughts! I haven't run the test yet, do you mind sharing what kind of issue you're encountering?

Steps: []resource.TestStep{
{
Config: testAccContainerNodePool_withKubeletConfig(cluster, np, "static"),
Check: resource.ComposeTestCheckFunc(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that you're importing the resource, you can actually just omit all the Check stuff. It's superseded by some default checks done when importing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, will remove.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

google-beta/node_config.go Outdated Show resolved Hide resolved
google-beta/node_config.go Show resolved Hide resolved
@rmanyari
Copy link
Contributor Author

I was running into some de-serialization problem. Fixed that in the process of changing from TypeList to TypeMap.

Just finished running the new acceptance tests:

$ GOOGLE_CREDENTIALS=$(cat ../key.json) GOOGLE_PROJECT=<redacted> GOOGLE_REGION=us-centr
al1 GOOGLE_ZONE=us-central1-a TF_ACC=true go test -v ./... -run withLinuxNodeConfig -timeout 45m
=== RUN   TestAccContainerNodePool_withLinuxNodeConfig
=== PAUSE TestAccContainerNodePool_withLinuxNodeConfig
=== CONT  TestAccContainerNodePool_withLinuxNodeConfig
--- PASS: TestAccContainerNodePool_withLinuxNodeConfig (706.08s)
PASS
ok      github.com/terraform-providers/terraform-provider-google-beta/google-beta       706.121s
$ GOOGLE_CREDENTIALS=$(cat ../key.json) GOOGLE_PROJECT=<redacted> GOOGLE_REGION=us-centr
al1 GOOGLE_ZONE=us-central1-a TF_ACC=true go test -v ./... -run withKubeletConfig -timeout 45m                                                                                                              
=== RUN   TestAccContainerNodePool_withKubeletConfig
=== PAUSE TestAccContainerNodePool_withKubeletConfig
=== CONT  TestAccContainerNodePool_withKubeletConfig
--- PASS: TestAccContainerNodePool_withKubeletConfig (577.83s)
PASS
ok      github.com/terraform-providers/terraform-provider-google-beta/google-beta       577.855s

Also updated a bit the documentation while I'm at it.

It's pretty much ready to go.

@rmanyari rmanyari changed the title [WIP] Support for Kubelet and Linux Node configurations in GKE node pools Support for Kubelet and Linux Node configurations in GKE node pools Jul 21, 2020
Copy link
Collaborator

@rileykarson rileykarson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Glad to hear that helped fix them! I hoped as much 🙂

LGTM- thanks for your contribution @rmanyari!

Prior to merging this PR I've got to upstream the changes to our code generator repo and merge it there. Once that's done, I'll merge it here. It shouldn't be long- probably just a couple hours.

@rileykarson
Copy link
Collaborator

Dropping a brief update given this has taken days and not hours like I expected. I ran into some issues due to a bug introduced in the GKE resource by another PR initially, and resolved those. Now, I'm trying to get a successful test run in order to merge this. Terraform is sending the correct messages to the API, but the API isn't working how we'd expect. I've filed a bug internally against GKE, so I'll see if that gets traction.

Specifically, I've been unsuccessful testing TestAccContainerNodePool_withKubeletConfig. Upstream I added a second stage that performs an update, and that second stage fails when testing in CI. Actually, for my local test project, the first stage doesn't even pass (I suspect due to a network rule or something). I'd be interested in your experience with the same test, if you have a few minutes. If you run off [the auto-pr-3760 branch in the modular-magician fork], I'd expect it to fail when the update hangs, ultimately running for ~50 minutes.

@rmanyari
Copy link
Contributor Author

Welp, sorry for the late reply. Let me try that out.

@rmanyari
Copy link
Contributor Author

rmanyari commented Aug 3, 2020

Yep - I think I'm running into the same prob. The test hangs because the node pool gets stuck in "repairing" I think.

Looking at the output of a gcloud command, I can see these values for kubeletConfig:

$ gcloud beta --project cloud-latency-167422 container node-pools --zone us-central1-a describe tf-test-np-fy3ipqmpfh --cluster tf-test-cluster-w2cqunuw94 | grep -A5 kubelet
  kubeletConfig:
    cpuCfsQuotaPeriod: 100us
    cpuManagerPolicy: static
  machineType: n1-standard-1
  metadata:
    disable-legacy-endpoints: 'true'

But the GCE instance shows:

$ gcloud --project cloud-latency-167422 compute instances describe gke-tf-test-cluster--tf-test-np-fy3ip-e3411f53-ndcg --zone us-central1-a | grep -A10 CFS
      cpuCFSQuota: true
      cpuCFSQuotaPeriod: 200us
      cpuManagerPolicy: default
      enableDebuggingHandlers: true
      evictionHard:
        memory.available: 100Mi
        nodefs.available: 10%
        nodefs.inodesFree: 5%
        pid.available: 10%
      featureGates:
        DynamicKubeletConfig: false
        RotateKubeletServerCertificate: true

Which don't really match... So maybe we wait until the backend fixes are pushed before trying again?

@rileykarson
Copy link
Collaborator

Sorry again for the delay @rmanyari! I've been following up upstream based on the results of the update test. It turns out that while the GKE API accepts [static, default], the actual configuration options that are expected are [static, none]. It passed the option directly on to the nodes & they're failing to register because they're configured with an invalid option.

The GKE team is now aware and working on fixing it. In the meantime, I'll check with my team early next week whether we can merge prior to the API working correctly. The static option works & we'll be able to ensure that none is the correct option available in Terraform even if the underlying API doesn't know it.

@rmanyari
Copy link
Contributor Author

That would be great. We are really eager to start using these features in our clusters. Thanks again for shepherding this!

@rileykarson
Copy link
Collaborator

Spoke with my team- we'll validate for the correct options for kubelet_config.cpu_manager_policy and warn in our documentation that one of them doesn't currently work.

@rmanyari: Do you mind rebasing your change on top of the master branch to resolve the merge conflict that's since cropped up?

It's possible you'll get some weirdness from your Git upstream because we've moved from terraform-providers/terraform-provider-google-beta to hashicorp/terraform-provider-google-beta but I believe the move should be transparent.

@rmanyari
Copy link
Contributor Author

It went pretty well :)

@rileykarson
Copy link
Collaborator

Glad to hear!

Thanks again for your contribution, and my apologies for the extended holding pattern. Merging now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants