
k8s-tiller module #9

Closed
wants to merge 9 commits

Conversation

yorinasub17
Contributor

@yorinasub17 yorinasub17 commented Jan 30, 2019

Depends on gruntwork-io/terratest#231 and gruntwork-io/kubergrunt#7 (merged).

This implements the k8s-tiller module and an accompanying example that let you deploy Tiller following security best practices.

Notes:

  • This uses kubergrunt under the hood.
  • This uses a pattern for module dependencies. This was necessary to force the destroy order: otherwise Terraform started destroying the roles and role bindings for the Tiller service account too early, which prevented Tiller from undeploying itself.

# SET MODULE DEPENDENCY RESOURCE
# This works around a terraform limitation where we can not specify module dependencies natively.
# See https://github.com/hashicorp/terraform/issues/1178 for more discussion.
# ---------------------------------------------------------------------------------------------------------------------
Contributor Author

@yorinasub17 yorinasub17 Jan 31, 2019

@rileykarson Ok, I found this pattern for module dependencies that was posted in December and it appears to be working well. I think it can now be used to chain the namespace module to the role binding.
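
For reference, a minimal sketch of the pattern (the wiring and resource names below are illustrative, condensed from the linked issue discussion, not necessarily what ends up in the module):

variable "dependencies" {
  description = "Outputs from other modules that this module should wait for."
  type        = "list"
  default     = []
}

resource "null_resource" "dependency_getter" {
  # Interpolating the list forces Terraform to wait on everything passed in.
  provisioner "local-exec" {
    command = "echo ${length(var.dependencies)}"
  }
}

# Resources in the module hang off the getter, so they are created after
# (and destroyed before) whatever the caller passed in as dependencies.
resource "kubernetes_service_account" "tiller" {
  depends_on = ["null_resource.dependency_getter"]
  # ...
}

resource "null_resource" "dependency_setter" {
  depends_on = ["kubernetes_service_account.tiller"]
}

output "depended_on" {
  value = "${null_resource.dependency_setter.id}"
}

A caller can then chain modules explicitly, e.g. by passing dependencies = ["${module.namespace.depended_on}"] to the tiller module call.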

Member

Can you explain what this is meant to be used for? That is, what are you depending on?

Contributor Author

The main issue with GKE is that the default permissions are not enough to create additional RBAC roles. You first have to promote the user to cluster admin, and then you can proceed to create these roles. So in Terraform, we need to make sure these roles are applied after the cluster admin role binding is created.
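
To illustrate the ordering (the resource and variable names here are hypothetical, not the module's actual code): bind the deploying user to cluster-admin first, then make the RBAC role resources depend on that binding.

resource "kubernetes_cluster_role_binding" "deployer_is_cluster_admin" {
  metadata {
    name = "deployer-cluster-admin"
  }

  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "ClusterRole"
    name      = "cluster-admin"
  }

  subject {
    api_group = "rbac.authorization.k8s.io"
    kind      = "User"
    name      = "${var.deployer_email}" # hypothetical variable
  }
}

resource "kubernetes_role" "rbac_role_access_all" {
  # Without this ordering GKE rejects the role, because the creating identity
  # does not yet hold a superset of the permissions being granted.
  depends_on = ["kubernetes_cluster_role_binding.deployer_is_cluster_admin"]
  # ...
}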

Contributor

Specifically, your identity needs to have a superset of the permissions of a role to create that role. See https://cloud.google.com/kubernetes-engine/docs/how-to/role-based-access-control#defining_permissions_in_a_role

@yorinasub17 yorinasub17 changed the title WIP NOT READY k8s-tiller module Jan 31, 2019
@yorinasub17
Contributor Author

yorinasub17 commented Jan 31, 2019

@brikis98 I'd like a review from you on this one, specifically around the module dependencies system implemented here. The current implementation works (I had issues with resources being deleted too early which are now resolved), but I am not sure if there are other gotchas that we don't know about in implementing the module dependencies.

@yorinasub17
Contributor Author

@brikis98 Forgot to mention your review is not urgent, given that it currently works. Just wanted to hear your thoughts, eventually.

- [minikube](https://kubernetes.io/docs/tasks/tools/install-minikube/)
- [Kubernetes on Docker for Mac](https://docs.docker.com/docker-for-mac/kubernetes/)
- EKS
- GKE
Member

Link to our modules?

1. Install [Terraform](https://www.terraform.io/), minimum version: `0.9.7`.
1. Install [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/).
1. Install the [helm client](https://docs.helm.sh/using_helm/#install-helm).
1. Install [kubergrunt](https://github.com/gruntwork-io/kubergrunt).
Member

Wow, that's a lot of stuff to install. I think I've asked before, but does it make sense to embed kubectl or helm client in kubergrunt (or call the APIs they are using directly)? Or do their versions/APIs change so often that we'd never be able to keep up?

Contributor Author

I haven't looked into this yet for helm, but for kubectl, we can't embed it because a lot of the logic is baked into the command (package main) as opposed to a library.

That said, based on what I know, I think it is safe to assume that everyone doing anything related to Kubernetes will have kubectl installed. You lose A LOT without it, so that one is more or less a moot point.

I think the helm client might be avoidable, but I will have to look into that more. I would probably file that away as a "nice to have" though. I filed gruntwork-io/kubergrunt#13 so we don't forget this thought.

examples/k8s-tiller/main.tf (outdated review thread, resolved)

resource "null_resource" "dependency_getter" {
provisioner "local-exec" {
command = "echo ${length(var.dependencies)}"
Member

Does echo exist on Windows?

Contributor Author

On PowerShell, yes, I have verified this. It does not exist in CMD though, so it's not the most portable, but probably good enough based on prior conversations.

# SET MODULE DEPENDENCY RESOURCE
# This works around a terraform limitation where we can not specify module dependencies natively.
# See https://github.com/hashicorp/terraform/issues/1178 for more discussion.
# ---------------------------------------------------------------------------------------------------------------------
Member

Can you explain what this is meant to be used for? That is, what are you depending on?

@@ -12,3 +12,8 @@ output "rbac_access_read_only_role" {
  description = "The name of the RBAC role that grants read only permissions on the namespace."
  value       = "${kubernetes_role.rbac_role_access_read_only.metadata.0.name}"
}

output "depended_on" {
  description = "This output can be used to depend on the resources in this module."
Member

Example use case?


resource "null_resource" "tiller" {
triggers {
tiller_namespace = "${var.tiller_namespace}"
Member

Do you ever need to update tiller? I ask as this trigger will only fire on creation.

Contributor Author

Good point. You will most likely want to upgrade Tiller versions at some point. However, you can use the helm client to upgrade:

helm init --upgrade --tiller-namespace TILLER_NAMESPACE

I will add this to the README.

provisioner "local-exec" {
command = "kubergrunt helm undeploy ${local.kubectl_config_options} --home ${local.helm_home_with_default} --tiller-namespace ${var.tiller_namespace} ${local.undeploy_args}"
when = "destroy"
}
Member

Much of this code seems to be calling out to kubergrunt. Sanity check: should this really be Terraform code or just a new command that does a bunch of steps for you built into kubergrunt?

Member

In fact, is there anything in this module that actually uses a real Terraform resource?

Contributor Author

Thanks for the sanity check. You are right that there is nothing in this module that uses a real Terraform resource, so yes I am forcing Terraform where the operator could just call out to kubergrunt. I am not 100% sure if this is a good idea yet (I don't think I have enough Terraform scars to make a judgment call), but I have my reasons for at least attempting it.

There is one concrete use case in my head for doing this, and a few potential use cases / benefits:

Concrete:

  • For EKS, there are what I call "essential" helm charts. These are helm charts that almost all clusters will want to install. I don't know yet if there is an equivalent for GKE, but presumably you will want at least an ingress controller. In this use case, it is nice to be able to chain the modules in one terraform apply: provision EKS => configure kubectl => deploy Tiller => install "essential" helm charts (kube2iam / kiam and alb-controller).

Potential:

  • There is something nice about being able to manage access in a declarative manner. Once I implement the revoke command (deprioritized for now in favor of being able to deploy services to k8s), changing the granted RBAC entity list should do the right thing: revoke entities that are no longer in the list and grant access to new ones. Note that right now this is append only because I am missing the revoke command, but eventually I intend to implement revoke.
  • Similarly, there is something nice about declaratively managing Tiller deployments as code. If we just reduce to a kubergrunt command, how does one find out what Tillers were deployed? Which namespaces does Tiller manage? We could use labels and such to make it easier to discover, but I don't think that beats opening your terraform config and finding the k8s-tiller module calls.

Ideally all of this would be a new Terraform provider, or baked into the helm provider. However, there are a few reasons why I avoided that in this initial implementation:

  • We are already going in the direction of recommending people install kubergrunt for other reasons, so it would be one less thing to install if we could just use kubergrunt.
  • I am not too familiar with provider development (@rileykarson probably has more insight here), but with the limited knowledge I have, I think the advantages of a provider over this approach are being able to implement updates (e.g. change the name or append labels) and better drift tracking via state management. Both are nice to haves, but I am not sure they are worth the overhead of a new provider since what's here gets us 80% of what we need.
  • Eventually I was going to tackle adding some of this functionality into the helm provider (Add tiller resource hashicorp/terraform-provider-helm#134), which would be the best case scenario for implementing the chaining. But I wanted to make sure we had a short term escape hatch in case something went horribly wrong there (the community rejects the idea, yak shaving on the provider, etc.), which is this.
  • Not everyone will want to use Terraform for managing helm; kubergrunt satisfies those people's use case. Some people will want to use Terraform; this satisfies theirs. So there is benefit to having the kubergrunt commands and to wrapping them in Terraform.

Member

Hm, I'm worried that having this as a Terraform module will create the expectation that create (apply), update (apply again—will work for append but probably not replace?), and delete (destroy) will all work correctly, but right now, as far as I can tell, only create works. I agree it's nice to capture, as code, what you did... but if you can't update or clean it up, I worry it will lead to confusion and errors. It's also complicated to use, as you have to install lots of dependencies besides Terraform itself, and we always have the potential for cross-OS portability issues.

Writing our own provider or doing a PR to the helm provider is a good idea, but I agree, likely to be too expensive to be worth doing now.

Would a middle ground be to turn this set of steps into a single kubergrunt command (plus docs), and add another kubergrunt command that shows you exactly what Tillers were deployed and the namespaces they manage (i.e., something like a tiller-status command)?

Contributor Author

> I worry it will lead to confusion and errors

That is probably a fair point. Thinking through a few update scenarios, I realized that you are probably right that this may lead to a lot of confusion.

> It's also complicated to use, as you have to install lots of dependencies besides Terraform itself

Yes, I agree with that. But what if we:

  • Embed helm into kubergrunt to remove the need for the helm client.
  • Verify whether kubectl is actually necessary and, if it isn't, update kubergrunt so it is not needed (I think I only have it as a dependency right now because of the kubeconfig setup and management).
  • Implement a fetch_exec module that fetches kubergrunt and runs it. This is what we talked about; it replaces the kubergrunt dependency with a Python dependency, so slightly better but arguably a wash.

These are probably good things to do regardless of the decision made.

> and we always have the potential for cross-OS portability issues.

I don't quite understand the potential cross-OS portability issue that is specific to this approach. What is in here that leads to that concern?

> only create works.

FWIW, destroy actually works.

> Would a middle ground be to turn this set of steps into a single kubergrunt command (plus docs), and add another kubergrunt command that shows you exactly what Tillers were deployed and the namespaces they manage (i.e., something like a tiller-status command)?

I'm not sure the joined command is necessary (maybe a few more options on deploy to auto-configure the deployer's local helm client?), but tiller-status is definitely good to have regardless.

Alright, you've convinced me. I think with some time we could actually make this work, but it is arguable whether it is worth the effort now, especially considering the ticking time bomb that is Helm v3, where all of this will be wasted effort in the Tiller-less world.

=========================================

So to summarize:

  • Add --local-rbac-user, --local-rbac-group, and --local-rbac-service-account options to kubergrunt helm deploy that can be used to configure the local helm client for that RBAC entity.
  • Implement a tiller-status command in kubergrunt that shows you all the installed Tillers and which RBAC entities have been granted access.
  • We will abandon the effort to modularize Tiller here. The consequence is that this repo needs to be renamed (possibly terraform-kubernetes-namespace or terraform-kubernetes-namespace-factory?). @brikis98 Have we renamed a released repo before? Does GitHub redirect old names?

@robmorgan and @rileykarson does that make sense given the discussion in this thread? What are your thoughts?

Contributor

Also, the argument against storing in state makes enough sense, thanks! I guess our [the Google provider team] view has generally been that Terraform requires access to privileged data and consequently should be able to manage it. There are definitely a lot of places that certs/creds are required to use certain features.

Member

Terraform state and secrets is not a great situation. Since secrets are stored in plain text in state files, we try to minimize what secrets we put in them when possible. Unfortunately, this often isn't possible with Terraform, and we sort of cringe, and hope that whatever protections we put in place for the state storage (e.g., S3 IAM permissions) are enough.

I don't think I'd make this the reason to design our helm integration entirely differently. That said, we need to be practical. How long would it take to get hashicorp/terraform-provider-helm#134 done? That is, to write the code, test it, get it reviewed, merged, and released? And this will likely take multiple rounds of code/test/review/merge/release, as we're unlikely to get a perfect, bug-free implementation on the first try. If this is a few days of work, then that seems like the way to go. If this is multiple weeks of work, with a non-trivial chance of being blocked for months more waiting on reviews or releases, then we probably don't want to stall the entire GCP project waiting on that now.

Contributor

Right: at first glance, resolving hashicorp/terraform-provider-helm#134 would probably take 2-5 eng days to complete, then a week or two for review passes, plus waiting for a release to be cut (although that provider might cut one at will? not sure).

Given the TLS/secret situation and the latency that making provider changes would have, I think that the proposed kubergrunt functionality makes sense 👍

Contributor Author

@yorinasub17 yorinasub17 Feb 1, 2019

Thanks for those sanity checks @brikis98 and @rileykarson! Always good to hear second opinions after being solo on the thoughts for a while :)

Ok to summarize:

  • Ideally we would use Terraform resources for everything, but due to constraints around secrets management and missing provider resources, we will work outside of that.
  • The kubergrunt functionality is a "good enough" solution for now. It has the added bonus of being potentially useful to those who don't use Terraform.
  • The dependency logic for service accounts and namespaces is necessary for GKE, so that should be implemented.

We still have an open question about this module. Here are the Pros and Cons I gather:

Pros:

  • You can feed the outputs of k8s-namespace and k8s-service-account into kubergrunt more easily than copy-pasting from outputs.
  • Implements destroy.
  • Provides documentation, as code, of all the Tiller deployments you have made.

Cons:

  • Is not idiomatic Terraform, and thus is unintuitive:
    • Depends on external tools.
    • Does not implement plan and updates correctly.
    • Could it be disaster-inducing? E.g., can you accidentally undeploy Tiller and all the releases with a typo?

=====

For now, I am leaning towards making this an example with a null_resource calling out to kubergrunt (roughly the shape sketched below) instead of a k8s-tiller submodule, so it is clearer that we don't implement all the Terraform niceties, and documenting our intention to update when the helm provider is updated. If it turns out that it would be nice to have an actual module making the calls to kubergrunt after implementing the examples in EKS and GKE, we can revisit.
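
To make that concrete, the example would boil down to roughly the following (a sketch only; the kubergrunt deploy flags are placeholders for whatever the command actually accepts):

resource "null_resource" "tiller" {
  triggers {
    tiller_namespace = "${var.tiller_namespace}"
  }

  # Deploy Tiller into the given namespace on create.
  provisioner "local-exec" {
    command = "kubergrunt helm deploy --tiller-namespace ${var.tiller_namespace} --service-account ${var.tiller_service_account}"
  }

  # Undeploy Tiller on destroy, mirroring the existing undeploy provisioner.
  provisioner "local-exec" {
    command = "kubergrunt helm undeploy --tiller-namespace ${var.tiller_namespace}"
    when    = "destroy"
  }
}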

So given that, here is what I am going to do:

  • Close this PR and the related feature branch
  • Open a new PR with the dependency logic in k8s-namespace and k8s-service-account so that GKE works.
  • Open a new PR with an example that calls out to kubergrunt and put documentation there about the approach, alternatives we considered, and why it is that way right now.
  • File github issues:

Let me know if there are any objections. I am going to proceed with this plan, but I believe it will be easy to reverse any of these actions, so we can keep discussing even after I have started.

Contributor

sgtm!

  count = "${trimspace(local.grant_args) == "" ? 0 : 1}"

  triggers {
    grant_args = "${local.grant_args}"
Member

If you update these args, will kubergrunt helm grant remove the old ones and grant the new ones? Or is it append-only?

Contributor Author

It is append-only for now, but I plan on implementing revoke, at which point I will change this to a loop and grant / revoke each entity separately (see the rough sketch below).
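
For the record, a rough sketch of what that per-entity loop could look like once revoke exists (the variable names and kubergrunt flags here are assumptions, not the current implementation):

variable "granted_rbac_users" {
  type    = "list"
  default = []
}

# One resource per entity, so each grant gets a matching destroy-time revoke.
resource "null_resource" "grant_access" {
  count = "${length(var.granted_rbac_users)}"

  triggers {
    rbac_user = "${element(var.granted_rbac_users, count.index)}"
  }

  provisioner "local-exec" {
    command = "kubergrunt helm grant --tiller-namespace ${var.tiller_namespace} --rbac-user ${element(var.granted_rbac_users, count.index)}"
  }

  provisioner "local-exec" {
    command = "kubergrunt helm revoke --tiller-namespace ${var.tiller_namespace} --rbac-user ${element(var.granted_rbac_users, count.index)}"
    when    = "destroy"
  }
}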

Member

Worth adding as a comment or to the docs

# INTERPOLATE AND CONSTRUCT COMMAND ARGUMENTS
# ---------------------------------------------------------------------------------------------------------------------

// TODO: When we have package-terraform-utilities available
Member

I think we could open source this. It's not exactly secret sauce...

@yorinasub17 yorinasub17 mentioned this pull request Feb 2, 2019
@yorinasub17
Contributor Author

As discussed, this has been replaced with

@yorinasub17 yorinasub17 closed this Feb 2, 2019