Skip to content
This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

Set up dynamic provisioning of persistent volumes on AWS #423

Merged
merged 2 commits into from
Jun 10, 2020

Conversation

BrainBlasted
Copy link
Contributor

Sets up the plumbing required to support dynamic provisioning on AWS.

For #379

@BrainBlasted
Copy link
Contributor Author

Copy link
Contributor

@iaguis iaguis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, this works after fixing a small issue I mention in the review, great work!

My main concern with this is not using the upstream Helm chart, I think we should use it if possible, see my comment about that.

We should consider exposing a knob to turn this on and off, some users might not be interested in this.

I think this won't be updated when we update the chart in Lokomotive, maybe @invidian has some ideas on how to do this.

Also, I think we should have a test that checks that dynamic provisioning works, the test would deploy a minimal app with a PVC and check it's bound or something similar.

@invidian
Copy link
Member

I agree with @iaguis that this should perhaps be optional behavior. I suggest following changes:

  • move IAM role creation code to controller module and make it optional, using Terraform variable and count = var.set_worker_iam_role ? 1 : 0. I don't think it is trivial to make it optional per worker pool, as then the pods which require storage must be scheduled on the pools, which has storage enabled.
  • produce IAM role name as an Terraform module output in controller module and pass it to all worker module instances in Go template. If role is disabled, perhaps pass null or empty string ("").
  • try to use upstream helm chart to avoid maintaining code on our own. If you have some issues with it @BrainBlasted, please let us know and we are happy to help.
  • add CSI driver as a component, instead of cluster component. This way:
    • we have an easy method to upgrade it (as we currently don't have a way to upgrade platform-specific controlplane components)
    • we don't do so much in Terraform, which should be preferred (see Reduce the amount of tasks handled by Terraform #181)
    • we treat this CSI driver as yet another storage provider for Kubernetes

@iaguis
Copy link
Contributor

iaguis commented May 20, 2020

I tested the new EBS component and it worked fine 👍. What I would like to see to move this forward is address the rest of @invidian's comments in #423 (comment):

  • Use an upstream Helm chart
  • Make setting the IAM role optional by using a Terraform variable and exposing it to the AWS cluster configuration in the lokocfg file.
  • Move IAM role creation to the controller Terraform module and pass its name to all worker module instances so they can use it.

If you have any questions about this let us know.

@invidian
Copy link
Member

One thing, which definitely should be added is a storage class, with parametrized default annotation, without that, prometheus operator will be in pending state until one creates the default storage class manually and removes the PVCs and pods to trigger the recreation.

Sample storage class, which I think can be used:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: ebs-sc
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer

Another thing which should think about is handling orphan volumes. I suspect, that if you create the volumes, then you destroy the cluster, the volumes will remain in AWS account.

@invidian
Copy link
Member

As expected, I can find orphan EBS volumes in AWS console after destroying the cluster 😕
image

We need to find a way to deal with those somehow.

@BrainBlasted BrainBlasted force-pushed the cdavis/dynamic-pvcs-aws branch 2 times, most recently from 2440fda to 4dc2cee Compare May 20, 2020 20:27
@johananl
Copy link
Member

@BrainBlasted I wanted to ensure you've seen #379 (comment). Sorry to add this info now (this came up in today's sprint planning). I hope I'm not adding too much work.

@BrainBlasted
Copy link
Contributor Author

Yes, I saw. Once everything else is finished I'll add a policy similar to the ones in #464 that allows this to work

@BrainBlasted BrainBlasted force-pushed the cdavis/dynamic-pvcs-aws branch 2 times, most recently from ad0721b to a559e06 Compare May 22, 2020 22:52
@BrainBlasted
Copy link
Contributor Author

  • Use an upstream Helm chart
  • Make setting the IAM role optional by using a Terraform variable and exposing it to the AWS cluster configuration in the lokocfg file.
  • Move IAM role creation to the controller Terraform module and pass its name to all worker module instances so they can use it.

I think these are largely done now, save for the third item. However, everything seems to work without the worker nodes using the IAM roles now that the controller is using it (which is needed for the upstream helm chart to work)

@BrainBlasted
Copy link
Contributor Author

I don't quite understand why the CI failed here.

Copy link
Member

@invidian invidian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand why the CI failed here.

About that, I'm not sure either. Better to just run the job again I guess, to see if it fails consistently.

The PR looks very good @BrainBlasted, nice work :)

We need to add documentation of new parameters to docs/configuration-reference/platforms/aws.md for new flag and to docs/configuration-reference/components/ for new component.

If you could also re-organize the commits, perhaps into 2: one for adding IAM-related code and one for new component, it would make it easier for other reviewers to review the PR and it will make our git history a bit simpler after merging.

If the IAM role is only attached to the controller nodes, the severity of blocking access to metadata drops in my opinion. What do you think @iaguis @johananl?

pkg/platform/aws/aws.go Outdated Show resolved Hide resolved
@BrainBlasted BrainBlasted force-pushed the cdavis/dynamic-pvcs-aws branch 2 times, most recently from 07861c7 to 3ac4fe2 Compare May 25, 2020 09:03
@BrainBlasted
Copy link
Contributor Author

Reworked into two commits, as requested.

@BrainBlasted BrainBlasted changed the title WIP: Set up dynamic provisioning of persistent volumes on AWS Set up dynamic provisioning of persistent volumes on AWS May 25, 2020
cli/cmd/component.go Show resolved Hide resolved
docs/configuration-reference/platforms/aws.md Outdated Show resolved Hide resolved
pkg/components/aws-ebs-csi-driver/component.go Outdated Show resolved Hide resolved
pkg/components/aws-ebs-csi-driver/component.go Outdated Show resolved Hide resolved
pkg/components/aws-ebs-csi-driver/component.go Outdated Show resolved Hide resolved
@BrainBlasted BrainBlasted requested a review from pothos as a code owner June 2, 2020 22:14
@BrainBlasted
Copy link
Contributor Author

I made that change, but CI is still hanging on that test for some reason.

@invidian
Copy link
Member

invidian commented Jun 3, 2020

Oops, component "aws-ebs-csi-driver" {} line was missing in ci/aws/aws-cluster.lokocfg.envsubst. I added it now, I hope you don't mind @BrainBlasted.

@BrainBlasted
Copy link
Contributor Author

BrainBlasted commented Jun 3, 2020 via email

invidian
invidian previously approved these changes Jun 3, 2020
Copy link
Member

@invidian invidian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@johananl johananl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the PR @BrainBlasted. I've added some comments.

resource "aws_iam_role_policy" "csi-driver-role-policy" {
count = var.set_csi_driver_iam_role ? 1 : 0
name = "${var.cluster_name}-policy"
role = join("", aws_iam_role.csi-driver-role.*.id)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's safer and more intuitive to do role = aws_iam_role.csi-driver-role.[count.index].id.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't work, because the resource aws_iam_role.csi-driver-role won't exist. Also, if done like this, you will look up on index 1, so 2nd instance of the resource. Or does your syntax magically handles it somehow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned, count.index does not work here if the roles aren't set.

user_data = data.ct_config.controller-ignitions[count.index].rendered
ami = local.ami_id
user_data = data.ct_config.controller-ignitions[count.index].rendered
iam_instance_profile = join("", aws_iam_instance_profile.csi-driver-instance-profile.*.name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use aws_iam_instance_profile.csi-driver-instance-profile.[0].name. There can't be more than one instance profile so join doesn't make sense to me here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This suggestion doesn't work when set_csi_driver_iam_role = false

pkg/components/aws-ebs-csi-driver/component_test.go Outdated Show resolved Hide resolved
Copy link
Member

@invidian invidian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found few more nits, which we could improve, though nothing really blocking.

pkg/platform/aws/template.go Outdated Show resolved Hide resolved
ci/aws/aws-cluster.lokocfg.envsubst Outdated Show resolved Hide resolved
// See the License for the specific language governing permissions and
// limitations under the License.

package awsebscsidriver
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johananl we don't do that for any other components, so maybe to avoid introducing inconsistencies, we can leave it as it is for now to move forward with this PR and we re-visit it later for all components?

return gohcl.DecodeBody(*configBody, evalContext, c)
}

// RenderManifests renders the helm chart templates with values provided.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect @johananl meant https://helm.sh/docs/chart_best_practices/conventions/, even though even https://helm.sh does not respect it, as they often use Helm Charts term.

@BrainBlasted BrainBlasted force-pushed the cdavis/dynamic-pvcs-aws branch 3 times, most recently from 3d6abaa to 8aa2b93 Compare June 9, 2020 01:36
invidian
invidian previously approved these changes Jun 9, 2020
Copy link
Member

@invidian invidian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. @johananl can you have a look as well please?

Copy link
Contributor

@iaguis iaguis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small comments. Good work!

Can you rework your commit messages so they follow our guidelines, right now lines are too long.

The only question I have is whether we should set enable_csi to true by default. I understand we want to make this the default behavior. However, since we need to install the component anyway to have this working it might be fine to have it false by default. When we have a way to install this automatically on AWS we might revisit this to make it work for every AWS cluster.

Happy to hear some other opinions.

## Introduction

The [CSI Driver for Amazon EBS](https://github.com/kubernetes-sigs/aws-ebs-csi-driver)
provides a CSI interface used by Container Orchestrators to manage the lifecycle
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think he means this

Suggested change
provides a CSI interface used by Container Orchestrators to manage the lifecycle
provides a CSI interface used by container orchestrators to manage the lifecycle

// See the License for the specific language governing permissions and
// limitations under the License.

package awsebscsidriver
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's revisit this later.

return gohcl.DecodeBody(*configBody, evalContext, c)
}

// RenderManifests renders the helm chart templates with values provided.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say

Suggested change
// RenderManifests renders the helm chart templates with values provided.
// RenderManifests renders the Helm chart templates with values provided.

@invidian
Copy link
Member

invidian commented Jun 9, 2020

The only question I have is whether we should set enable_csi to true by default. I understand we want to make this the default behavior. However, since we need to install the component anyway to have this working it might be fine to have it false by default. When we have a way to install this automatically on AWS we might revisit this to make it work for every AWS cluster.

Having it false by default now and changing it to true in the future seems reasonable to me.

Optionally sets up an IAM role that is required for dynamic volume provisioning
to work on AWS.

Related to #379
Creates a simple component for aws-ebs-csi-driver, which is required for
dynamic volume provisioning to work on AWS.

Related to #379
@BrainBlasted
Copy link
Contributor Author

Re-worked the commits to follow the guidelines. I think it's fine to leave enable_csi as false by default as well for now.

Copy link
Contributor

@iaguis iaguis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@iaguis iaguis merged commit a1c91a1 into master Jun 10, 2020
@iaguis iaguis deleted the cdavis/dynamic-pvcs-aws branch June 10, 2020 12:40
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants