
Support AWS Instance profiles #637

Open
dvianello opened this issue Oct 1, 2019 · 15 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. priority/low Not that important. sig/cluster-management Denotes a PR or issue as being assigned to SIG Cluster Management.

Comments

@dvianello

Hello machine-controller folks,

we're using kubeone to deploy k8s clusters, and we understand it uses machine-controller behind the scenes to create worker nodes. We're struggling a bit to make this work in our AWS setup, as we rely heavily on assuming roles in different accounts rather than having an IAM user that can directly access an underlying account.

Credentials in the environment where we're running kubeone are thus short-lived STS creds that last 8 hours at most, and they're not very useful to inject into machine-controller since it will stop working once they expire. We were hoping we could fall back on the instance profile - it has enough permissions to create EC2 instances and so on - but editing the secrets out of the machine-controller deployment causes errors like the ones below:

E1001 10:23:25.552776       1 metrics.go:149] failed to call prov.SetInstanceNumberForMachines: errors: [failed to get EC2 instances: EmptyStaticCreds: static credentials are empty]
I1001 10:23:48.785995       1 migrations.go:147] CRD machines.machine.k8s.io not present, no migration needed
I1001 10:23:48.786014       1 migrations.go:53] Starting to migrate providerConfigs to providerSpecs
I1001 10:23:48.819680       1 migrations.go:135] Successfully migrated providerConfigs to providerSpecs
I1001 10:23:48.819734       1 plugin.go:97] looking for plugin "machine-controller-userdata-centos"
I1001 10:23:48.819761       1 plugin.go:125] checking "/usr/local/bin/machine-controller-userdata-centos"
I1001 10:23:48.819848       1 plugin.go:138] found '/usr/local/bin/machine-controller-userdata-centos'
I1001 10:23:48.819858       1 plugin.go:97] looking for plugin "machine-controller-userdata-coreos"
I1001 10:23:48.819870       1 plugin.go:125] checking "/usr/local/bin/machine-controller-userdata-coreos"
I1001 10:23:48.819889       1 plugin.go:138] found '/usr/local/bin/machine-controller-userdata-coreos'
I1001 10:23:48.819897       1 plugin.go:97] looking for plugin "machine-controller-userdata-ubuntu"
I1001 10:23:48.819908       1 plugin.go:125] checking "/usr/local/bin/machine-controller-userdata-ubuntu"
I1001 10:23:48.819926       1 plugin.go:138] found '/usr/local/bin/machine-controller-userdata-ubuntu'
E1001 10:25:18.192498       1 machine.go:360] Failed to reconcile machine "xxxxx-xxxxx-xxxx-5-xx-xxxx-xx-7d49b65947-6kkg5": failed to get instance from provider: failed to list instances from aws, due to EmptyStaticCreds: static credentials are empty

It feels like this is caused by the fact that

config = config.WithCredentials(credentials.NewStaticCredentials(id, secret, token))
goes for static credentials directly, instead of using a credentials chain via ChainProvider (https://docs.aws.amazon.com/sdk-for-go/api/aws/credentials/#ChainProvider) that could fall back to the instance profile.
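For illustration, here is a minimal Go sketch (not the actual machine-controller code; awsConfig and its wiring are assumptions) of a chain that prefers explicitly provided static credentials and falls back to the EC2 instance profile:

```go
// A minimal sketch, not the actual machine-controller code: build an AWS
// credentials chain that prefers static credentials and falls back to the
// EC2 instance profile. awsConfig is a hypothetical helper.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/credentials"
	"github.com/aws/aws-sdk-go/aws/credentials/ec2rolecreds"
	"github.com/aws/aws-sdk-go/aws/ec2metadata"
	"github.com/aws/aws-sdk-go/aws/session"
)

func awsConfig(id, secret, token string) *aws.Config {
	sess := session.Must(session.NewSession())

	// If the static values are empty, StaticProvider errors out and the
	// chain moves on to the instance profile via the metadata service.
	chain := credentials.NewChainCredentials([]credentials.Provider{
		&credentials.StaticProvider{Value: credentials.Value{
			AccessKeyID:     id,
			SecretAccessKey: secret,
			SessionToken:    token,
		}},
		&ec2rolecreds.EC2RoleProvider{Client: ec2metadata.New(sess)},
	})

	return aws.NewConfig().WithCredentials(chain)
}

func main() {
	cfg := awsConfig("", "", "") // empty -> falls back to the instance profile
	creds, err := cfg.Credentials.Get()
	if err != nil {
		fmt.Println("no credentials available:", err)
		return
	}
	fmt.Println("using provider:", creds.ProviderName)
}
```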

Do you have any plans to support instance profiles? It would greatly simplify credentials management when dealing with clusters in AWS!

Happy to help if we can.

Best,
Dario

@alvaroaleman
Contributor

hey @dvianello, while we ourselves do not need this, I am not opposed to adding it. Would you be open to providing a PR and validating it from your side?

The only issue is that I am not sure how we could write a test for this, so there is a chance people may inadvertently break it in the future.

@kdomanski kdomanski added priority/low Not that important. sig/cluster-management Denotes a PR or issue as being assigned to SIG Cluster Management. labels Nov 28, 2019
@kubermatic-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@kubermatic-bot kubermatic-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 27, 2020
@kron4eg
Member

kron4eg commented Mar 19, 2020

An alternative solution could be to integrate vault-injector, point it at the AWS secrets path, and have it re-request credentials. The only thing machine-controller would need to do in such a use case is read credentials from a file (and re-read them on change).
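To make that concrete, here is a rough Go sketch (not existing machine-controller code; the package name, file path, and JSON layout are made up for illustration) of a credentials provider that re-reads an injector-written file on every refresh:

```go
// Rough sketch of a credentials.Provider that re-reads an injected file so
// rotated STS creds are picked up automatically. All names and the file
// layout here are illustrative, not part of machine-controller.
package fileprovider

import (
	"encoding/json"
	"os"

	"github.com/aws/aws-sdk-go/aws/credentials"
)

// FileProvider loads AWS credentials from a file written by an injector.
type FileProvider struct {
	Path string
}

type fileCreds struct {
	AccessKeyID     string `json:"access_key_id"`
	SecretAccessKey string `json:"secret_access_key"`
	SessionToken    string `json:"session_token"`
}

// Retrieve re-reads the file each time the SDK asks for credentials.
func (p *FileProvider) Retrieve() (credentials.Value, error) {
	raw, err := os.ReadFile(p.Path)
	if err != nil {
		return credentials.Value{}, err
	}
	var c fileCreds
	if err := json.Unmarshal(raw, &c); err != nil {
		return credentials.Value{}, err
	}
	return credentials.Value{
		AccessKeyID:     c.AccessKeyID,
		SecretAccessKey: c.SecretAccessKey,
		SessionToken:    c.SessionToken,
		ProviderName:    "FileProvider",
	}, nil
}

// IsExpired always reports true so every request re-reads the file; a real
// implementation would track the actual STS expiry instead.
func (p *FileProvider) IsExpired() bool { return true }
```

It could then be wired in with something like config.WithCredentials(credentials.NewCredentials(&FileProvider{Path: "/vault/secrets/aws"})), where the path is whatever the injector writes to.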

@kron4eg
Member

kron4eg commented Mar 19, 2020

/remove-lifecycle stale

@kubermatic-bot kubermatic-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 19, 2020
@dvianello
Author

@kron4eg, would this add an external dependency to the process, i.e. Vault?

Instance profiles "just work" in AWS, and there's built-in code for using them in the various AWS SDKs.

@kron4eg
Member

kron4eg commented Mar 19, 2020

While KubeOne's example terraform config features quite open AWS permissions in the instance profile for control-plane nodes, that's not a good approach for real production setups, which should lock down those permissions. Relying on instance profile credentials on a VM with multiple shared workloads is a rather vulnerable way of doing AWS business: any accidental or malicious workload that ends up on control-plane nodes would receive the same privileges as the control plane itself. That's one of the main reasons projects such as kube2iam exist (to avoid handing the instance profile to every pod, and to attach different IAM profiles to different pods).

@dvianello
Author

@kron4eg, fair point.

Is there any way machine-controller could be forced to use kube2iam then? My understanding is that it should actually be transparent, with kube2iam intercepting calls to the metadata IP coming from pods and replying with STS creds if authorised to do so.

The above brings us back to the point that building support for instance profiles might work just fine: it would be down to users to either rely on the instance profile directly - with the security downsides you mentioned above - or deploy kube2iam and use that to provide creds to machine-controller.

Or am I missing something?

@kron4eg
Member

kron4eg commented Mar 19, 2020

@dvianello the problem with kube2iam (from machine-controller perspective), is that we can't differentiate between instance profile and kube2iam. The implicit nature of those credentials makes me worry.

Up until now, we were explicit about credentials used by the machine-controller and it should stay this way.

Besides, kube2iam is also an external dependency. So if we had to choose between those two, I'd choose Vault every time. Vault can communicate with the AWS API and request new short-lived credentials, and vault-agent will renew them on a volume shared with machine-controller.

P.S.
You can already "fake" the use of an instance profile in the machine-controller deployment with an init + sidecar container that grabs STS credentials before machine-controller starts and launches it with new ENV vars containing the STS creds.
Of course this comes with the downside that the next kubeone invocation will override it. The Vault injector, on the other hand, can "inject" whatever is needed without any change to kubeone (which creates the machine-controller deployment). The only thing we need to do is "teach" machine-controller to read credentials from the file provided by the injector, and maybe annotate the machine-controller deployment with injector instructions.

@dvianello
Author

Hey,

@dvianello the problem with kube2iam (from machine-controller perspective), is that we can't differentiate between instance profile and kube2iam. The implicit nature of those credentials makes me worry.
Up until now, we were explicit about credentials used by the machine-controller and it should stay this way.

I understand that the process would become a little less transparent - but I can assure you that from our perspective it was quite non-obvious that the current process was grabbing the user's credentials and injecting them behind the scenes into the machines. But again, this might be more of a problem with kubeone and the way it deals with it.

Besides, kube2iam is also an external dependency. So if we'd need to choose between those two I'd choose vault every time. Vault can communicate with AWS API, and request new shortlived credentials, and vault-agent will renew them on a shared with machine-controller volume.

Agreed, kube2iam would be an external dependency, but IMHO it would be a smaller one than an entire Vault setup.

P.S.
You can already "fake" usage of instance profile in machine-controller deployment with an init + sidecar container, that will grab STS credentials before machine-controller starts and launch in with new ENV vars containing STS creds.

Not sure the above would work: as I understand it, machine-controller is a long-lived service, so the STS creds initially grabbed would expire after a set amount of time - max 12 hours, I believe.

Anyway, I understand why you're worried about changing all of this, don't get me wrong - I just believe that, from an AWS usability point of view, support for some sort of near-native AWS credentials delivery mechanism would be nice. For EKS there's more happening behind the scenes for this, see https://aws.amazon.com/about-aws/whats-new/2019/09/amazon-eks-adds-support-to-assign-iam-permissions-to-kubernetes-service-accounts/. So there may come a time when no external dependencies are needed.
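As a rough illustration of that route (assuming a reasonably recent aws-sdk-go and the AWS_ROLE_ARN / AWS_WEB_IDENTITY_TOKEN_FILE env vars injected by the IRSA webhook), not forcing static credentials at all would let the SDK's default chain resolve them:

```go
// A minimal sketch: with IAM Roles for Service Accounts, the webhook injects
// AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE, and the default credential
// chain resolves them - no static credentials need to be passed in.
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws/session"
)

func main() {
	// No explicit credentials: the session falls back to the default chain
	// (env vars, web identity token, shared config, instance profile, ...).
	sess, err := session.NewSession()
	if err != nil {
		log.Fatal(err)
	}

	creds, err := sess.Config.Credentials.Get()
	if err != nil {
		log.Fatalf("no credentials resolved: %v", err)
	}
	fmt.Println("resolved credentials from provider:", creds.ProviderName)
}
```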

HTH!

Thanks,
Dario

@kron4eg
Member

kron4eg commented Mar 20, 2020

Not sure the above would work, as I understand machine-controller would be a long-lived service, so the STS creds initially grabbed would expire after a set amount of time - max 12 hours I believe.

if the sidecar quits (say after 12 hours), the whole pod will be restarted and the process will start over.

@kubermatic-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@kubermatic-bot kubermatic-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 18, 2020
@kubermatic-bot
Contributor

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

@kubermatic-bot kubermatic-bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 23, 2020
@cpuspellcaster

cpuspellcaster commented Oct 15, 2020

This is a concern for us as well; we would like to use the AWS credentials chain, and relying on a static credential pair instead is not workable for us - our security posture does not allow the use of static credentials. Using a sidecar container to refresh the credentials is not desirable either, because the pod restart metric would increase linearly over 12h cycles and prevent our observability infrastructure from using that metric as a failure heuristic. Setting up Vault to cover this use case is not feasible for us either. We've invested in using the Pod Identity Webhook and a mutating admission controller to scope IAM policy permissions down to the pod level, and requiring static credentials here blocks the AWS credentials chain from picking up a pod identity. At a minimum, however, we would still prefer to grant elevated permissions to the whole node via the IAM instance profile of the control-plane EC2 instances, since machine-controller runs on the control-plane nodes.

@xmudrii
Member

xmudrii commented Oct 16, 2020

/remove-lifecycle rotten
/kind feature

@kubermatic-bot kubermatic-bot added kind/feature Categorizes issue or PR as related to a new feature. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale. labels Oct 16, 2020
@cpuspellcaster

Just wanted to provide an update that the lack of this feature continues to be an issue for our organization.
