Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

aws: IAM pod identity #260

Merged
merged 1 commit into from
Sep 21, 2020

Conversation

sjenning
Copy link
Contributor

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 27, 2020
@sjenning sjenning changed the title aws: IAM pod identity [WIP] aws: IAM pod identity Mar 27, 2020
@sjenning sjenning changed the title [WIP] aws: IAM pod identity WIP: aws: IAM pod identity Mar 27, 2020
@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 27, 2020
@sjenning sjenning force-pushed the aws-pod-identity branch 2 times, most recently from 6d29f3f to 4e8da06 Compare March 30, 2020 17:39
@sjenning
Copy link
Contributor Author

/cc @marun

@sjenning sjenning changed the title WIP: aws: IAM pod identity aws: IAM pod identity Mar 30, 2020
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 30, 2020
@sjenning
Copy link
Contributor Author

/assign @derekwaynecarr

@sjenning
Copy link
Contributor Author

/cc @smarterclayton

Copy link
Member

@derekwaynecarr derekwaynecarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to work out tomorrow if we should require a unique provider per cluster and if the user could provide the provider as a day-1 option rather than we create it.

enhancements/aws-pod-identity.md Outdated Show resolved Hide resolved
enhancements/aws-pod-identity.md Outdated Show resolved Hide resolved
enhancements/aws-pod-identity.md Outdated Show resolved Hide resolved
enhancements/aws-pod-identity.md Outdated Show resolved Hide resolved
enhancements/aws-pod-identity.md Outdated Show resolved Hide resolved
enhancements/aws-pod-identity.md Outdated Show resolved Hide resolved
enhancements/aws-pod-identity.md Outdated Show resolved Hide resolved
enhancements/aws-pod-identity.md Outdated Show resolved Hide resolved
@@ -0,0 +1,442 @@
---
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also this needs to pick a domain ??

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean an subdirectory under enhancements? i'm open to suggestions 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm working on a CCO enhancement as well, my best guess was security.


### Goals

1. A pod can assume an IAM role created with a `CredentialRequest` CR using a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this means that a pod can have only ONE identity attached right?

cloud-credential-operator currently supports dual identity.

  • an admin entity that is used to create new roles/users etc.
  • an reader entity that is only allowed to read roles/users to make sure things stay as-is

a lot of the operators today require create/update/delete permissions do that they can create dynamic objects. But some of the operators can be modified to have dual identities.

  • ADMIN_WRITE identity that can be used to create the objects.
  • RESTRICTED_WRITE identity that is restricted to only the objects it created.
    that could allow users to remove/disable the ADMIN_WRITE permissions as day-2 operation.. making the security stance even stronger.

if using the POD identity based on webhook and environment disallows us to ever move in that direction, I think we should maybe not use it and stick to secrets as multiple secrets can be mounted compared to single AWS env variable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this would currently not be suitable for use by the CCO itself. since the pod can only run with one serviceaccount, this mechanism is one-to-one IAM role to pod.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would/could this be solved by having the operator have the ADMIN_WRITE and the operand have RESTRICTED_WRITE (assuming a clean separation between operator and operand)?

i suppose this might also imply that the operator is also creating a new CredentialsRequest so that the perms for the operand only grant perms to the actually created objects...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we should punt on the multi-cred use case for now since serviceaccount-to-pod is 1:1 vs secret-to-pod is N:1

@sjenning
Copy link
Contributor Author

sjenning commented Apr 2, 2020

adding review request from remaining CCO approvers
/cc @abutcher @csrwng @dgoodwin @twiest


### Non-Goals

1. Convert existing IAM user-based components (e.g. ingress, image-registry,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we plan to update the existing CredRequests to use this new mechanism on AWS? And if so, how can we upgrade without triggering something bad here. These CredRequests are delivered via the release manifest and I believe that means the CVO will update them to their new state. Is there a way we can make this manual and opt-in?

In doing so, the operator permissions would then become fixed and need manual intervention in the event an upgraded operator needed new permissions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No we are not planning to update existing CredentialRequests at this time.

For this stage, I recommended we disallow CR changes where the spec changes from a secretRef to a serviceAccountRef or visa versa.

There doesn't seem to be a way to mark CRD fields as immutable atm kubernetes/kubernetes#65973

We'll probably just overwrite the change in the controller is someone tries to do that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll probably just overwrite the change in the controller is someone tries to do that.

Do we have an admission webhook for this? What happens if someone tries to push a change while the cred minter is, for some reason, out of action? The new cred minter comes up and... searches for existing Secrets to decide if it needs to overwrite a serviceAccountRef?

@sjenning
Copy link
Contributor Author

sjenning commented May 1, 2020

@deads2k @derekwaynecarr @joelddiaz @dgoodwin @abhinavdahiya

Ok so I've encountered an issue implementing the API change to the CredentialsRequest CRD

Currently, this is the spec
https://github.com/openshift/cloud-credential-operator/blob/master/pkg/apis/cloudcredential/v1/credentialsrequest_types.go#L49-L56

// CredentialsRequestSpec defines the desired state of CredentialsRequest
type CredentialsRequestSpec struct {
	// SecretRef points to the secret where the credentials should be stored once generated.
	SecretRef corev1.ObjectReference `json:"secretRef,omitempty"`

	// ProviderSpec contains the cloud provider specific credentials specification.
	ProviderSpec *runtime.RawExtension `json:"providerSpec,omitempty"`
}

The issue is two fold depending on what solution are you trying to implement.

  1. The SecretRef is not nullable because it isn't a pointer. So there is not a good way to tell that the user didn't set it (if their intent is to use the serviceaccount injection method)
  2. We could just use the corev1.ObjectReference for both the secret and serviceaccount based injecting and multiplex on the SecretRef.Kind (treating no Kind as Secret for backward compat). However, the name of the field indicates that the object type is Secret even if the actual type is more flexible than that. Though use of type ObjectReference is discouraged

I've thought of 3 solutions and I'm looking for feedback:

  1. Just reuse the SecretRef to hold the serivceaccount reference and multiplex in the code based on SecretRef.Kind. This is the simpliest of the solutions as it doesn't require any change to the CRD at all. However, it is messy to have a field name SecretRef that you set the Kind to ServiceAccount.
  2. Add a ServiceAccountRef *corev1.ObjectReference field to the CredentialsRequestSpec and multiplex based on if SecretRef.Namespace is set. ServiceAccountRef needs to be a pointer since it will not be present in previously created CRs
  3. Possibly the most correct, but also most disruptive way, switch to a union discriminator and deprecate the SecretRef field. Something like this (suggested by @deads2k)
// CredentialsRequestSpec defines the desired state of CredentialsRequest
type CredentialsRequestSpec struct {
	// SecretRef points to the secret where the credentials should be stored once generated.
	SecretRef *corev1.ObjectReference `json:"secretRef,omitempty"`

	StorageDestination *StorageDestination

	// ProviderSpec contains the cloud provider specific credentials specification.
	ProviderSpec *runtime.RawExtension `json:"providerSpec,omitempty"`
}

type StorageDestination struct {
	StorageType    string
	Secret         *SecretStorageDestination
	ServiceAccount *ServiceAccountStorageDestination
}

type SecretStorageDestination struct {
	Namespace string
	Name      string
}

type ServiceAccountStorageDestination struct {
	Namespace string
	Name      string
}

I would apreciate any feedback on these methods or there might be a far superior way that I'm overlooking.

@dgoodwin
Copy link
Contributor

dgoodwin commented May 4, 2020

3 feels the best to me, going from a mandatory field to optional shouldn't cause any harm or break API compat, and I'm not particularly worried about converting that field to a pointer, though you would need to update all usage of it to add in the other implementation. We do not typically use discriminator fields for this kind of thing but it sounds like good practice. I would suggest some kind of name with "Type", maybe "CredentialsType". I don't think we have webhook validation today so this will probably need to be validated at runtime and reflected in a condition?

@sjenning
Copy link
Contributor Author

sjenning commented May 4, 2020

There is going to need to be some intermediate work either way. The core operator and the actuators currently operate directly on the secret. I'm going to need to create an abstraction so we aren't if-else'ing throughout the whole codebase.

@joelddiaz
Copy link
Contributor

What is the advantage of the StorageDestination approach? Is it that the StorageType tells you which of the pointers contains the actual data vs searching through the object for a non-nil value?

Is converting the SecretRef to a pointer and adding the ServiceAccountRef as a pointer not an option, or is the StorageDestination basically the solution you offered covering that?

FWIW, I'm partial to #3 (or basically any option which can take us to a place where the SecretRef field becomes optional).

@deads2k
Copy link
Contributor

deads2k commented May 6, 2020

What is the advantage of the StorageDestination approach? Is it that the StorageType tells you which of the pointers contains the actual data vs searching through the object for a non-nil value?

Yes. UnionDiscriminators allow older clients to know they don't understand a newer format. Imagine adding, "nextCoolThingSethWants". Old clients won't even see the new fields, but they will see the discriminator.

Is converting the SecretRef to a pointer and adding the ServiceAccountRef as a pointer not an option, or is the StorageDestination basically the solution you offered covering that?

You cannot change a field to a pointer because old clients won't be able to deserialize nil and the deserialization will fail, preventing any list operation from working.

@sjenning sjenning force-pushed the aws-pod-identity branch 2 times, most recently from e389f60 to 2f9b54a Compare May 12, 2020 14:55
@sjenning sjenning force-pushed the aws-pod-identity branch 2 times, most recently from 521e99a to c6e57aa Compare May 14, 2020 19:42
@sjenning
Copy link
Contributor Author

Latest push contains significant revisions:

  • breaking the enhancement into phases
    • phase 1: EKS end user experience parity
    • phase 2: integration with CredentialsRequest CR workflow

We also resolved to writing a new controller that will create and reconcile an OIDC discovery endpoint in a public S3 bucket. This is a significant amount of work entering the scope (configuration for enable/disable this functionality, new IAM perms required, etc).

As such, the work on openshift/cloud-credential-operator#187 will be put on hold, and work will begin on the OIDC discovery controller.


Pros:
- Works even with the apiserver is not accessible by AWS STS
(disconnected/private cluster)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the door open to seamlessly transition to ServiceAccountDiscovery later once it's out of alpha?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. We just have to detect if the URL in the Authentication CR is empty or our default and switch it to the ServiceAccountDiscovery URL. If it is not our default (the customer set it purposefully), we would do nothing. However, it was decided to do the s3 bucket because ServiceAccountDiscovery won't work for private clusters.

enhancements/aws-pod-identity.md Outdated Show resolved Hide resolved
to use this method if we work out the ordering i.e. the webhook needs to be
running and the IAM roles need to exist before any of the pods for these
components are created and would require change and coordination with the
operators that manage those components.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CCO does run pretty early on the bootstrap node which may help here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But having in-cluster logic that switches on "am I a fresh install after $INSTALL_APPROACH_PIVOT?" seems pretty complicated.


This allows users to more cleanly express and manage pod identities in IAM when
running on AWS. The use of bound tokens to assume roles (vs user id/key) is
preferred as these tokens can expire and be rotated.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we expire and rotate user IDs and keys?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can, but the expiration is not a direct property of the userID/key, where it is for the token i.e. the token user can discover the token's expiration.

eks.amazonaws.com/role-arn: "arn:aws:iam::111122223333:role/sjenning-abcde-openshift-image-registry"
```

Any pod started with this `serviceAccountName` will have the token injected via
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens when a Pod is defined that references the not-yet-created Service Account (CCO hasn't finished creating the role/ServiceAccount)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the webhook does not inject the projected volume. the annotated SA must exist before the pod is admitted for this to work. This is a known limitation of the upstream implementation i.e. that a pod can be admitted before the service account that runs it exists.

```
Ideally, new 4.5 clusters deployed on platform `aws` would roll out with this set day-1.

For day-2 deployments, changing this results in a redeployment of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the first release of CCO (and all future ones) with support for pod-identiy be responsible for performing this modification of the Authentication object (so that all 4.5+ clusters are in pod-identity-capable-mode)?

@gregsheremeta
Copy link
Contributor

Getting requests for STS to be in 4.8. What is the status of this RFE?

Copy link
Member

@derekwaynecarr derekwaynecarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can make the requested tweaks in follow-up.

/approve
/lgtm


# Support Native (IAM-based) Pod Identity on AWS

## Release Signoff Checklist
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you update which of these items got completed since we are retro merging?

Phase 2 will make the Role creation and `ServiceAccount` annotation
automatic, just as the User creation and `Secret` injection are today. This will
involve changing the API in the CredentialRequest CRD to allow references to
`ServiceAccounts` in addition to `Secerts` as an injection mechanism for the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Secrets

integratedOAuthMetadata:
name: oauth-openshift
```
Ideally, new 4.5 clusters deployed on platform `aws` would roll out with this set day-1.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clarify that this is how we are rolling out now (i see issuer specified)

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 21, 2020
@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, sjenning

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [derekwaynecarr,sjenning]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit c91fc51 into openshift:master Sep 21, 2020

1. An admin can create a IAM Role via `CredentialRequest` CR and inject (via
annotation) to a `ServiceAccount`, just as they can create an IAM user and
inject to a `Secret` currently.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sjenning we've talked about only having done phase 1 but this Phase 2 is actually complete isn't it?

There's really a phase 3 of transitioning a full OCP install and all components to use this functionality, is that right?

Access to the new functionality is gated by two things: the platform being
`aws` and, once phase 2 is complete, the user specifying a `spec.storage.serviceAccount` in the `CredentialsRequests` CR.

## Design Details
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like everything after this is template boilerplate. Maybe a followup PR where you drop the sections? Or explicitly say "We do not need a test plan for this enhancement" and so forth for each section?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.