
Add KEP for Secret/ConfigMap protection #2640

Closed
wants to merge 12 commits

Conversation

mkimuram
Contributor

KEP for secret protection.

Enhancement issue: #2639

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 20, 2021
@k8s-ci-robot k8s-ci-robot requested review from msau42 and saad-ali April 20, 2021 19:35
@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 20, 2021
@msau42
Member

msau42 commented Apr 22, 2021

@kubernetes/sig-storage-proposals

@k8s-ci-robot k8s-ci-robot added the kind/design Categorizes issue or PR as related to design. label Apr 22, 2021
@enj
Member

enj commented Apr 28, 2021

@kubernetes/sig-auth-proposals

/sig auth

@k8s-ci-robot k8s-ci-robot added the sig/auth Categorizes an issue or PR as relevant to SIG Auth. label Apr 28, 2021
@liggitt
Member

liggitt commented Apr 28, 2021

I don't think this is particularly sig-auth-related, since it is not related to the security properties of secrets. Seems more like an apps/node/storage intersection, and would apply equally to configmaps.

@enj
Member

enj commented May 10, 2021

I don't think this is particularly sig-auth-related, since it is not related to the security properties of secrets. Seems more like an apps/node/storage intersection, and would apply equally to configmaps.

/remove-sig auth

@k8s-ci-robot k8s-ci-robot removed the sig/auth Categorizes an issue or PR as relevant to SIG Auth. label May 10, 2021
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 12, 2021
@mkimuram
Contributor Author

@xing-yang
Thank you for your review. Addressed your review comments. PTAL

- The `kubernetes.io/secret-protection` finalizer will be deleted by the newly introduced in-tree `secret-protection-controller`, which checks whether the secret is in use on every change (Create/Update/Delete) event for secrets and related resources.
- Out-of-tree resources (`VolumeSnapshot`):
- The deletion is blocked by using the newly introduced `snapshot.storage.kubernetes.io/secret-protection` finalizer,
- The `snapshot.storage.kubernetes.io/secret-protection` finalizer will be added on creation of the secret by using an admission controller for `VolumeSnapshot`,
Member

this seems better to do with a controller that adds the finalizer as a precondition to using the secret, rather than putting a blocking admission plugin in the secret-writing path

Contributor

@mkimuram have you addressed this comment?

Contributor Author

@xing-yang

Sorry, I missed responding to it. This is the same approach as the existing pv/pvc protection. This logic was introduced here, so the discussion should be here or here; however, I can't find the reason why an admission plugin was used there.

Actually, there is fallback logic to add the finalizer here on the pvc-protection-controller side. I'm not sure that it works well in all cases, so I think it's safer to use the same approach that has worked for a long time.

@liggitt

Do you know the background, or do you think the fallback logic alone is enough? (It seems that you were involved in the discussion.)
I would like to keep this in the KEP as is and remove it later if it really proves to be unnecessary. Does that make sense?

Contributor Author

Changed to use a Lien instead of a Finalizer, and it is only added when the controller finds that the Secret/ConfigMap is in use. So, the concern should be resolved.
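For readers less familiar with the pv/pvc-protection pattern referenced in this thread, here is a minimal, illustrative client-go sketch of a controller adding a protection marker to a Secret. It uses the finalizer name from the KEP text purely as an example; with the Lien approach from KEP-2839 (not yet merged), the marker would live in a different field.

```go
package protection

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const protectionFinalizer = "kubernetes.io/secret-protection"

// addProtectionFinalizer marks a Secret as in use, mirroring how
// pvc-protection-controller guards PVCs. It is a no-op if the marker
// is already present.
func addProtectionFinalizer(ctx context.Context, c kubernetes.Interface, secret *corev1.Secret) error {
	for _, f := range secret.Finalizers {
		if f == protectionFinalizer {
			return nil // already protected
		}
	}
	updated := secret.DeepCopy()
	updated.Finalizers = append(updated.Finalizers, protectionFinalizer)
	if _, err := c.CoreV1().Secrets(updated.Namespace).Update(ctx, updated, metav1.UpdateOptions{}); err != nil {
		return fmt.Errorf("adding finalizer to secret %s/%s: %w", updated.Namespace, updated.Name, err)
	}
	return nil
}
```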


###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, by disabling the feature gate. After the feature gate is disabled, users need to manually delete the `kubernetes.io/secret-protection` finalizer and the `snapshot.storage.kubernetes.io/secret-protection` finalizer so that secrets can be deleted properly.
Member

I would not recommend requiring users to manually clean up finalizers.

Contributor Author

Any ideas for a workaround? Adding a command/script for the purpose?
(Actually, pv-protection/pvc-protection have been there without one.)

Contributor Author

Added my ideas in the Design Details section. Any feedback is welcome. Thank you for raising this.

Contributor Author

@liggitt

I considered a bit more deeply in which cases users would be forced to delete the finalizer manually. With the annotation workaround, users won't need to do it if the controller exists. As for in-tree, the controller has a feature to force-delete the finalizer if the feature is disabled, and the controller exists for at least one version as alpha. Therefore, this happens only in the two cases below:

  1. Users enabled this feature via the feature gate while alpha, then downgrade by 1+ versions to a release where this controller doesn't exist,
  2. After this feature becomes beta, users downgrade by 2+ versions.

It seems very rare to hit these conditions.

As for out-of-tree, the same could be said if the controller has a knob to disable this feature. Therefore, I'm thinking about adding some kind of feature flag to the out-of-tree controller to enable/disable this feature (it doesn't necessarily have to be the same flag as in-tree).

WDYT?

Contributor Author

Just to be clear, I've made the change of "adding some kind of feature flag to enable/disable this feature in the out-of-tree controller" described in the above comment as the --ephemeral-protection=enabled flag. If it turns out not to be necessary, I will remove it. Please give feedback on it.

Contributor Author

Changed to use only an external controller. So, deploying the external controller enables this feature and undeploying it disables it.

@mkimuram
Contributor Author

@xing-yang @liggitt @johnbelamaric

Thank you very much for your reviews. Addressed review comments, again. PTAL

- Create/Update event:
- For all the `Secret`s referenced by the resource:
1. Update the dependency graph (if the resource is using a Secret, mark the Secret as being used by that resource)
2. If there are any updates to the Secret in the dependency graph, add the key for the Secret to add-lien-queue
Member

Technically - it's not any update - only an update that changes the number of references from 0 to 1+ or vice versa matters.

Contributor Author

Updated.

- The controller watches `Pod`, `PersistentVolume`, `VolumeSnapshotContent`, `Deployment`, `StatefulSet`, `DaemonSet`, and `CronJob`
- Create/Update event:
- For all the `Secret`s referenced by the resource:
1. Update the dependency graph (if the resource is using a Secret, mark the Secret as being used by that resource)
Member

Note that an update may potentially also remove a reference to a secret.

So in that sense, update doesn't differ much from the delete event. They should be merged in this text.

Also - it's enough to count the number of references, we don't need to store all of them.

FWIW - such logic is pretty much already implemented in kubelet for pods - you should be able to copy-paste it to a significant extent:
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/util/manager/cache_based_manager.go#L221
[and generalize it for non-pod objects]

Contributor Author

Thank you for sharing the reference. I agree that similar logic will be needed, and it is really helpful.
However, there is a difference in the direction of the reference.

I understand that it gets all the secrets referenced by a pod and creates a list of references from the pod. On the other hand, for this use case, we need to get all the resources that reference a secret, like pods, PVCs, and snapshots, and create a list of references to the secret.

All the references to a secret don't exist on a single resource, unlike the references from a pod to secrets, which all exist on the single pod resource. As a result, we will need to build the list of references to a secret by checking multiple resources (and for this purpose, I think that we will need to track the referencing objects, instead of just a reference count).

Contributor Author

Updated.
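To make the direction-of-reference point above concrete, here is a rough sketch (illustrative only; the type and function names are invented, not from the KEP) of a secret-to-referencers index in which only reference counts crossing 0 ↔ 1+ produce work for the lien queue, as suggested earlier in the review:

```go
package protection

import "sync"

// objectKey identifies a referencing object or a Secret.
type objectKey struct{ Kind, Namespace, Name string }

// refIndex maps each Secret to the set of objects currently referencing it,
// the reverse of kubelet's pod -> secrets cache.
type refIndex struct {
	mu   sync.Mutex
	refs map[objectKey]map[objectKey]struct{} // secret -> referencing objects
	back map[objectKey]map[objectKey]struct{} // object -> secrets it references
}

func newRefIndex() *refIndex {
	return &refIndex{
		refs: map[objectKey]map[objectKey]struct{}{},
		back: map[objectKey]map[objectKey]struct{}{},
	}
}

// setRefs records the Secrets referenced by obj (empty slice on delete) and
// returns the Secrets whose reference count crossed 0 <-> 1+, i.e. the only
// ones that need to be queued for adding or removing the lien.
func (i *refIndex) setRefs(obj objectKey, secrets []objectKey) (changed []objectKey) {
	i.mu.Lock()
	defer i.mu.Unlock()

	next := map[objectKey]struct{}{}
	for _, s := range secrets {
		next[s] = struct{}{}
	}
	// Drop references that obj no longer holds.
	for s := range i.back[obj] {
		if _, still := next[s]; !still {
			delete(i.refs[s], obj)
			if len(i.refs[s]) == 0 {
				delete(i.refs, s)
				changed = append(changed, s) // 1+ -> 0
			}
		}
	}
	// Add new references.
	for s := range next {
		if i.refs[s] == nil {
			i.refs[s] = map[objectKey]struct{}{}
		}
		if _, had := i.refs[s][obj]; !had {
			if len(i.refs[s]) == 0 {
				changed = append(changed, s) // 0 -> 1+
			}
			i.refs[s][obj] = struct{}{}
		}
	}
	if len(next) == 0 {
		delete(i.back, obj)
	} else {
		i.back[obj] = next
	}
	return changed
}
```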

- Specified through [StatefulSetSpec](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.22/#statefulsetspec-v1-apps) -> [PodTemplateSpec](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.22/#podtemplatespec-v1-core)
- From DaemonSet:
- Specified through [DaemonSetSpec](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.22/#daemonsetspec-v1-apps) -> [PodTemplateSpec](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.22/#podtemplatespec-v1-core)
- From CronJob:
Member

FWIW the list isn't complete. You should also have Job itself, and probably also ReplicaSet itself.
And I'm wondering if there is anything else.

This doesn't visibly change the proposal (and probably doesn't even block Alpha), but it's something we should keep in mind.

I would like to see graduation criteria for Beta to figure out the exact set of dependencies we take into account.

Contributor Author

Added Job and ReplicaSet. Also added the reasons for excluding them and graduation criteria for Beta so this can be reviewed later.
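As an illustration of what "Secrets referenced by the resource" means for the pod-template-carrying workloads listed above, a hedged sketch (the helper name is hypothetical, not from the KEP) of collecting Secret names from a PodTemplateSpec:

```go
package protection

import corev1 "k8s.io/api/core/v1"

// secretNamesFromPodTemplate collects the names of Secrets referenced by a
// PodTemplateSpec: secret volumes, env/envFrom references, and imagePullSecrets.
// (CSI/projected volume sources and other indirect references are omitted here.)
func secretNamesFromPodTemplate(tmpl *corev1.PodTemplateSpec) []string {
	seen := map[string]struct{}{}
	add := func(name string) {
		if name != "" {
			seen[name] = struct{}{}
		}
	}
	for _, v := range tmpl.Spec.Volumes {
		if v.Secret != nil {
			add(v.Secret.SecretName)
		}
	}
	containers := append([]corev1.Container{}, tmpl.Spec.InitContainers...)
	containers = append(containers, tmpl.Spec.Containers...)
	for _, c := range containers {
		for _, env := range c.Env {
			if env.ValueFrom != nil && env.ValueFrom.SecretKeyRef != nil {
				add(env.ValueFrom.SecretKeyRef.Name)
			}
		}
		for _, src := range c.EnvFrom {
			if src.SecretRef != nil {
				add(src.SecretRef.Name)
			}
		}
	}
	for _, ref := range tmpl.Spec.ImagePullSecrets {
		add(ref.Name)
	}
	names := make([]string, 0, len(seen))
	for n := range seen {
		names = append(names, n)
	}
	return names
}
```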


Note that
- Periodic checks of referencing resources are needed to handle the case where the annotation is added after the referencing resources are created,
- Periodic checks of `Secret` or `ConfigMap` are needed to handle the case where an update to a referencing resource removes the reference to the `Secret` or `ConfigMap`,
Member

I was thinking about polling regularly to call list API

We shouldn't periodically poll - that's not a scalable design.
We should be watching.

That said - watching (or even listing) all secrets is somewhat tricky from the security POV. So I would like to hear sig-auth input on that. @liggitt maybe?

1. If the `Secret` doesn't have the `kubernetes.io/enable-secret-protection: "yes"` annotation, delete the `kubernetes.io/secret-protection` lien from the `Secret`
2. Check with the API server whether the `Secret` is actually not used by any of the resources, and if it is not used, delete the `kubernetes.io/secret-protection` lien from the `Secret`

- ConfigMap protection controller:
Member

Can we just say that the ConfigMap controller does pretty much exactly the same as the Secret controller?
This whole text doesn't bring any additional value.

Contributor Author

Updated.

2. Check with the API server whether the `ConfigMap` is actually not used by any of the resources, and if it is not used, delete the `kubernetes.io/configmap-protection` lien from the `ConfigMap`

Note that
- Periodical checks for referencing resources are needed for handling a case that the annotation is added after referecing resources are created,
Member

If we are watching, then periodic checks aren't really needed, as an event will be triggered by any change to the secret/configmap.

Contributor Author

Removed periodic checks. The controller-restart case that I was worried about should be covered by the Secret-watching logic and by checking before deleting the lien. Also, the update-that-removes-a-reference case should be covered by keeping the previous references in the dependency graph.
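A rough sketch of the watch-based (no polling) setup being discussed, using client-go shared informers; the handlers below are placeholders rather than the KEP's actual design:

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	stop := make(chan struct{})
	defer close(stop)

	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)

	// Watch referencing resources (Pods shown here; PVs, workloads, etc. would
	// get analogous handlers) instead of periodically listing them.
	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { onPodChange(obj.(*corev1.Pod)) },
		UpdateFunc: func(_, newObj interface{}) { onPodChange(newObj.(*corev1.Pod)) },
		DeleteFunc: func(obj interface{}) { /* re-evaluate the secrets the pod referenced */ },
	})

	// Watching Secrets covers the "annotation added later" case without polling.
	factory.Core().V1().Secrets().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(_, newObj interface{}) { onSecretChange(newObj.(*corev1.Secret)) },
	})

	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // run until killed
}

func onPodChange(pod *corev1.Pod)          { fmt.Println("re-index", pod.Namespace, pod.Name) }
func onSecretChange(secret *corev1.Secret) { fmt.Println("re-check", secret.Namespace, secret.Name) }
```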

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

- [x] Metrics
- Metric name: secret_protection_controller
Member

I guess it's a typo - I think that metric names will be different (it's the controller), right?

[We don't have to fill this in for Alpha though.]

Contributor Author

I followed the naming convention that pvc protection is using, but I agree with dropping "controller". So, removed the "_controller" suffix.
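For context, a hedged sketch of what such a controller metric could look like with Prometheus client_golang; the metric name and labels here are invented, and as noted above the final names are not needed for Alpha:

```go
package protection

import "github.com/prometheus/client_golang/prometheus"

// secretProtectionOps counts lien add/remove operations performed by the
// secret protection controller. The name/labels are illustrative only.
var secretProtectionOps = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "secret_protection_operations_total",
		Help: "Number of lien add/remove operations performed by the secret protection controller.",
	},
	[]string{"operation", "status"},
)

func init() {
	prometheus.MustRegister(secretProtectionOps)
}
```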

### Scalability
###### Will enabling / using this feature result in any new API calls?

- API call type: Update Secret/ConfigMap, List Pod/PV/VolumeSnapshot, and Get PVC/SC
Member

We shouldn't be listing - we should be watching.

Contributor Author

mkimuram commented Sep 9, 2021

With the removal of the poll, removed listing from most parts. However, before deleting the lien, I would like to call list just to be sure (that is how pv/pvc protection do it; unlike a finalizer, deleting the lien won't immediately delete the resource, so subsequent watch events for the related resources would be expected to add the lien again, though).
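To make the "call list just to be sure before deleting the lien" step concrete, a minimal sketch (illustrative only; the real controller would also consult PVs, VolumeSnapshotContents, workloads, and so on) of the live-API double-check:

```go
package protection

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// secretInUseLive re-checks the live API (not just the informer cache) before
// the lien is removed, similar to how pv/pvc protection re-check pods.
// Only the Pod check is shown here.
func secretInUseLive(ctx context.Context, c kubernetes.Interface, secret *corev1.Secret) (bool, error) {
	pods, err := c.CoreV1().Pods(secret.Namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return false, err
	}
	for i := range pods.Items {
		if podUsesSecret(&pods.Items[i], secret.Name) {
			return true, nil
		}
	}
	return false, nil
}

// podUsesSecret reports whether a pod references the named Secret via a
// secret volume, envFrom, or an env var secretKeyRef.
func podUsesSecret(pod *corev1.Pod, name string) bool {
	for _, v := range pod.Spec.Volumes {
		if v.Secret != nil && v.Secret.SecretName == name {
			return true
		}
	}
	for _, c := range pod.Spec.Containers {
		for _, src := range c.EnvFrom {
			if src.SecretRef != nil && src.SecretRef.Name == name {
				return true
			}
		}
		for _, env := range c.Env {
			if env.ValueFrom != nil && env.ValueFrom.SecretKeyRef != nil && env.ValueFrom.SecretKeyRef.Name == name {
				return true
			}
		}
	}
	return false
}
```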

Member

liggitt left a comment

I'm pretty skeptical about the complexity and performance cost of this feature relative to its benefits.

The primary use case seems to be around preventing deletion of secrets used to administrate volumes backing PVCs, but I think those could be placed in alternative namespaces to avoid the issue (and limiting what is placed into the PVC namespace helps preserve the PVC abstraction and avoid exposing unnecessary details about the volume implementation).

If you want to proceed with this feature anyway, this seems like an ideal candidate to be implemented as an admission webhook or external controller, not built-in.

Comment on lines +76 to +77
- [Mounted as data volumes](https://kubernetes.io/docs/concepts/configuration/secret/#using-secrets-as-files-from-a-pod)
- [Exposed as environment variables](https://kubernetes.io/docs/concepts/configuration/secret/#using-secrets-as-environment-variables)
Member

my main concern is protecting a secret used from a PVC/snapshot. This issue is really a serious problem that needs to be handled properly.

I still think placing the secrets required to administrate a volume for a PVC in the same namespace as the PVC is not a good approach. Would placing those secrets in a dedicated namespace separate from the PVC address the main issue this KEP is trying to resolve?

Updated to include Deployment, StatefulSet, DaemonSet, and CronJob in the scope, which use podTemplateSpec (intentionally excluded Job, because the existence of a Job doesn't necessarily mean that a Pod will be created from the template in the future).

I don't think we should spider out to all the pod-template-containing objects. I'm pretty skeptical about the reliability/performance of this feature already, relative to the benefit.

Comment on lines +132 to +133
A user creates Secrets/ConfigMaps and a pod using them. Then, the user mistakenly deletes them while the pod is using them.
They are protected until the pod using them is deleted.
Member

would making the kubelet coast when a volume source is removed be more robust?

Contributor Author

It would be one way, but it would make this built-in and separate from the PV/Snapshot logic (because snapshot requires it to be external). Let's keep it as an option.

Comment on lines +135 to +138
#### Story 2

A user creates a volume that uses a certain secret in the same namespace. Then, the user deletes the namespace.
The secret is protected until the volume using the secret is deleted and the deletion of the volume succeeds.
Member

Is story 2 trying to protect the cluster from actions taken by the user? if so, story 3 seems to allow bypassing/undoing the solution for story 2.

It seems like placing the resources required to administrate a volume backing a PVC in a namespace other than the PVC would resolve this issue and be simpler.

Contributor Author

The per-volume secret use case and the mis-operation use case still won't be resolved this way.


A user really would like to delete a secret even though it is used by other resources.
The user force deletes the secret while it is used by other resources; the secret isn't protected and is actually deleted.
An example of such a use case is updating an immutable secret by deleting and recreating it.
Member

is updating immutable secrets a goal? I thought a primary goal of those was to let consumers of those secrets skip watching for updates, which would mean they would miss a delete/recreate

Contributor Author

Added "ways to dynamically enable/disable" the feature to the goal.

@mkimuram
Contributor Author

mkimuram commented Sep 8, 2021

@saad-ali @xing-yang @msau42 @jsafrane

I'm pretty skeptical about the complexity and performance cost of this feature relative to its benefits.

The primary use case seems to be around preventing deletion of secrets used to administrate volumes backing PVCs, but I think those could be placed in alternative namespaces to avoid the issue (and limiting what is placed into the PVC namespace helps preserve the PVC abstraction and avoid exposing unnecessary details about the volume implementation).

I would like to get opinions from sig-storage on @liggitt's comment above.
(And it is related to @jsafrane's comment here.)

I still think that this feature is helpful for users and even administrators to avoid mis-operation. Even if a Secret is deployed in a different namespace, they might mistakenly delete it. However, we may be able to ask users to manually add the lien that will be introduced in KEP-2839 if they would like to avoid such a situation. So, I agree that it is one of the choices.

If you want to proceed with this feature anyway, this seems like an ideal candidate to be implemented as an admission webhook, not built-in.

Note that we plan to implement this feature as external controllers, so administrators can easily choose to enable/disable it.

@xing-yang
Contributor

Note that we plan to implement this feature as external controllers, therefore administrators can choose to enable/disable easily.

Why not admission webhook? That can be external and optional as well.

@mkimuram
Contributor Author

mkimuram commented Sep 8, 2021

Note that we plan to implement this feature as external controllers, therefore administrators can choose to enable/disable easily.
Why not admission webhook? That can be external and optional as well.

An admission webhook would also be a choice. However, lien is a generic implementation of the deletion-blocking part of the logic, and the controllers decide which resources should be protected. I expect that lien simplifies the logic. So, if the logic becomes too complex with lien, I agree that we should choose an admission webhook instead.
(I just meant to say that it is "not built-in" in the above comment.)

BTW, I would like to get feedback on whether we should implement Secret/ConfigMap protection in the first place or not
(it should have been discussed much earlier).
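For comparison, a bare-bones sketch of the admission-webhook alternative discussed here: a validating webhook that denies DELETE of a Secret carrying an in-use marker. The `example.io/in-use` annotation is invented for illustration; a real design would derive the in-use status from a controller-maintained index, and TLS/deployment details are omitted.

```go
package main

import (
	"encoding/json"
	"net/http"

	admissionv1 "k8s.io/api/admission/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// serveValidate handles AdmissionReview requests for Secret DELETE operations
// and rejects the request while the Secret is marked as in use.
func serveValidate(w http.ResponseWriter, r *http.Request) {
	var review admissionv1.AdmissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil || review.Request == nil {
		http.Error(w, "invalid AdmissionReview", http.StatusBadRequest)
		return
	}

	resp := &admissionv1.AdmissionResponse{UID: review.Request.UID, Allowed: true}
	if review.Request.Operation == admissionv1.Delete {
		var secret corev1.Secret
		// For DELETE, the object being removed arrives in OldObject.
		if err := json.Unmarshal(review.Request.OldObject.Raw, &secret); err == nil {
			if secret.Annotations["example.io/in-use"] == "true" { // hypothetical marker
				resp.Allowed = false
				resp.Result = &metav1.Status{Message: "secret is still in use"}
			}
		}
	}

	review.Response = resp
	_ = json.NewEncoder(w).Encode(&review)
}

func main() {
	http.HandleFunc("/validate", serveValidate)
	// TLS certificates are required for a real webhook; paths are placeholders.
	_ = http.ListenAndServeTLS(":8443", "tls.crt", "tls.key", nil)
}
```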


This KEP proposes a feature to protect Secrets/ConfigMaps while they are in use. Currently, users can delete a secret that is being used by other resources, like Pods, PVs, and VolumeSnapshots. This may have a negative impact on the resources using the secret and may result in data loss.

For example, one of the worst-case scenarios is deletion of a Node Stage Secret for CSI volumes while it is still in use.
Member

The CSI spec specifically does not accept secrets in NodeUnstageVolumeRequest and NodeUnpublishVolumeRequest for this reason. It is expected that the CO is able to clean up volumes on a node regardless of whether the secret exists or not.

Contributor Author

Thank you for your comment. That seems true for NodeUnstageVolumeRequest, so the example is not a good one.
However, ControllerUnpublishVolumeRequest and DeleteVolumeRequest do seem to accept secrets, so the issue is still valid for csi.storage.k8s.io/controller-publish-secret-name and csi.storage.k8s.io/provisioner-secret-name. I will update the KEP.

And I agree that the impact of lacking this feature becomes a bit smaller.

Contributor Author

Updated to describe the issue with a provisioner secret and a controller-publish secret.

@mkimuram
Contributor Author

mkimuram commented Sep 8, 2021

Note that I'm proposing this KEP with this actual issue in mind.
This kind of issue can easily happen when per-volume secrets or per-namespace secrets are used, as I commented here. In such cases, secrets will be managed by namespace admins or namespace users, not cluster admins, so I think that we should provide a foolproof solution (and let admins choose whether to enable it).

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
feature-gates:
- name: InUseProtection
Contributor

xing-yang commented Sep 9, 2021

Please clarify that this feature gate is part of the "liens" KEP which this KEP depends on.
For this KEP, whether the feature is enabled also depends on whether secret-protection-controller and configmap_protection_controller are deployed.

Contributor Author

@wojtek-t @xing-yang Are there good ways to describe the above in kep.yaml? I've grepped other kep.yaml files, but failed to find a good example.

@mkimuram
Contributor Author

mkimuram commented Sep 9, 2021

As a result of the discussion in KEP-2839, it needs some more consideration around permissions issues to be implemented as alpha. Therefore, the lien feature will skip 1.23, and as a result this KEP will also need to skip 1.23 (in line with the conclusion in the SIG Storage meeting today to revisit this after the lien feature is merged).

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mkimuram
To complete the pull request process, please ask for approval from wojtek-t, xing-yang after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor Author

mkimuram left a comment

Updated the KEP. PTAL when you have time (this won't be merged for k8s 1.23, and there are a lot of other KEPs to review today).


Comment on lines +76 to +77
- [Mounted as data volumes](https://kubernetes.io/docs/concepts/configuration/secret/#using-secrets-as-files-from-a-pod)
- [Exposed as environment variables](https://kubernetes.io/docs/concepts/configuration/secret/#using-secrets-as-environment-variables)

Contributor Author

I'm pretty skeptical about the reliability/performance of this feature already, relative to the benefit.

Simplified the logic a bit (it may still be complex). I know that you reviewed/implemented much of the GC, so you are worried that its difficult parts also apply to these controllers. Feel free to point out which parts need to be improved.

As discussed in the sig-storage meeting, we will wait for future releases to get feedback from users on whether this feature is needed. So, we will have plenty of time to improve it.


@mkimuram
Contributor Author

/milestone clear

@k8s-ci-robot
Contributor

@mkimuram: You must be a member of the kubernetes/milestone-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your and have them propose you as an additional delegate for this responsibility.

In response to this:

/milestone clear

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@xing-yang xing-yang removed this from the v1.23 milestone Sep 14, 2021
@MadhavJivrajani
Contributor

/remove-kind design
/kind feature
kind/design is migrated to kind/feature, see kubernetes/community#6144 for more details

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/design Categorizes issue or PR as related to design. labels Oct 11, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 9, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 8, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jingxu97
Contributor

Is this being replaced by KEP #2840?

@jingxu97
Contributor

@mkimuram, is this KEP being replaced by #2840?

@mkimuram
Contributor Author

@jingxu97 The controller proposed in this KEP will automatically add a Lien (defined in KEP #2840) to the referenced Secret/ConfigMap. So, this KEP depends on #2840 and is waiting for it to be implemented.

Even without this KEP, users may be able to manually add Liens to their Secrets/ConfigMaps. If we agree to provide only such a manual approach, this KEP can potentially be replaced by #2840. I think the current agreement in sig-storage is to wait until Lien is implemented and then gather feedback from users on whether the controller in this KEP is needed.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.