diff --git a/keps/sig-storage/2639-secret-protection/README.md b/keps/sig-storage/2639-secret-protection/README.md
new file mode 100644
index 000000000000..eb161294e25a
--- /dev/null
+++ b/keps/sig-storage/2639-secret-protection/README.md
@@ -0,0 +1,428 @@
+# KEP-2639: Secret Protection
+
+
+- [Release Signoff Checklist](#release-signoff-checklist)
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+- [Proposal](#proposal)
+  - [User Stories](#user-stories)
+    - [Story 1](#story-1)
+    - [Story 2](#story-2)
+    - [Story 3](#story-3)
+  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
+  - [Risks and Mitigations](#risks-and-mitigations)
+- [Design Details](#design-details)
+  - [Test Plan](#test-plan)
+  - [Graduation Criteria](#graduation-criteria)
+    - [Alpha -> Beta Graduation](#alpha---beta-graduation)
+    - [Beta -> GA Graduation](#beta---ga-graduation)
+    - [Removing a Deprecated Flag](#removing-a-deprecated-flag)
+  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
+  - [Version Skew Strategy](#version-skew-strategy)
+- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
+  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
+  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
+  - [Monitoring Requirements](#monitoring-requirements)
+  - [Dependencies](#dependencies)
+  - [Scalability](#scalability)
+  - [Troubleshooting](#troubleshooting)
+- [Implementation History](#implementation-history)
+- [Drawbacks](#drawbacks)
+- [Alternatives](#alternatives)
+
+
+## Release Signoff Checklist
+
+Items marked with (R) are required *prior to targeting to a milestone / release*.
+
+- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+- [ ] (R) KEP approvers have approved the KEP status as `implementable`
+- [ ] (R) Design details are appropriately documented
+- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
+- [ ] (R) Graduation criteria is in place
+- [ ] (R) Production readiness review completed
+- [ ] (R) Production readiness review approved
+- [ ] "Implementation History" section is up-to-date for milestone
+- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [ ] Supporting documentation, e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+
+[kubernetes.io]: https://kubernetes.io/
+[kubernetes/enhancements]: https://git.k8s.io/enhancements
+[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
+[kubernetes/website]: https://git.k8s.io/website
+
+## Summary
+
+This KEP proposes a feature to protect Secrets while they are in use. Currently, a user can delete a Secret that is being used by other resources, such as Pods and PVs. This may negatively affect the resources using the Secret and may result in data loss.
+
+Similar features for protecting PV and PVC already exist as [pv-protection](https://github.com/kubernetes/enhancements/issues/499) and [pvc-protection](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/postpone-pvc-deletion-if-used-in-a-pod.md).
+
+
+## Motivation
+
+This feature aims to protect secrets from being deleted while they are in use.
+Secrets can be used in the following ways:
+- From Pod:
+  - [Mounted as data volumes](https://kubernetes.io/docs/concepts/configuration/secret/#using-secrets-as-files-from-a-pod)
+  - [Exposed as environment variables](https://kubernetes.io/docs/concepts/configuration/secret/#using-secrets-as-environment-variables)
+  - [Generic ephemeral volumes](https://kubernetes.io/docs/concepts/storage/ephemeral-volumes/#generic-ephemeral-volumes) (can be handled as CSI PV below)
+- From PV:
+  - [CSI](https://kubernetes-csi.github.io/docs/secrets-and-credentials-storage-class.html):
+    - provisioner secret
+    - controller publish secret
+    - node stage secret
+    - node publish secret
+    - controller expand secret
+  - non-CSI:
+    - dependent on each storage driver and will be deprecated soon (Out of scope)
+- [From Snapshot](https://kubernetes-csi.github.io/docs/secrets-and-credentials-volume-snapshot-class.html):
+  - snapshotter secret
+
+### Goals
+
+- Protect secrets from being deleted while they are in use.
+
+### Non-Goals
+
+- Protect important secrets that aren't in use from being deleted.
+- Protect resources other than secrets from being deleted.
+
+## Proposal
+
+A new controller that protects secrets is introduced.
+
+### User Stories
+
+#### Story 1
+
+A user creates a secret and a pod using the secret. Then, the user mistakenly deletes the secret while the pod is using it.
+The secret is protected until the pod using the secret is deleted.
+
+#### Story 2
+
+A user creates a volume that uses a certain secret in the same namespace. Then, the user deletes the namespace.
+The secret is protected until the volume using the secret is deleted and the deletion of the volume succeeds.
+
+#### Story 3
+
+A user really wants to delete a secret even though it is used by other resources.
+The user force-deletes the secret while it is used by other resources; the secret isn't protected and is actually deleted.
+
+### Notes/Constraints/Caveats (Optional)
+
+- Compatibility:
+  - Many existing scripts may not care about the order of deletion. Such scripts might get stuck on secret deletion if the resources using the secrets are deleted later.
+- Usability:
+  - The use of a secret by other resources will not be obvious to users. Therefore, users might not easily understand why the secret is not deleted.
+  - Users might need to force-delete a secret, or might want to avoid protection for certain secrets, whether they already exist or are newly created.
+
+### Risks and Mitigations
+
+
+
+## Design Details
+
+This feature can be implemented in the same way as pv-protection/pvc-protection:
+- Deletion is blocked by a newly introduced `kubernetes.io/secret-protection` finalizer.
+- The finalizer is added on creation of the secret by an admission controller.
+- The finalizer is removed by a newly introduced `secret-protection-controller`, which checks whether the secret is in use on every change (Create/Update/Delete) event for secrets and related resources.
+
+### Test Plan
+
+- For Alpha, unit tests and e2e tests verifying that a secret used by other resources is protected by this feature are added.
+- For Beta, scalability tests are added to exercise this feature.
+- For GA, the introduced e2e tests will be promoted to conformance.
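The finalizer flow described under Design Details can be sketched roughly as follows. This is a minimal illustration, not the proposed implementation: the `Secret` and `Pod` types are simplified, hypothetical stand-ins for the real API objects, the in-use check covers only Pod references (PV and Snapshot references are omitted), and removing the finalizer only after deletion has been requested is an assumption modeled on how pvc-protection behaves.

```go
package main

import "fmt"

// Finalizer name as given in the KEP.
const protectionFinalizer = "kubernetes.io/secret-protection"

// Secret is a simplified, hypothetical stand-in for the real API type.
type Secret struct {
	Namespace  string
	Name       string
	Finalizers []string
	Deleting   bool // stands in for a set deletionTimestamp
}

// Pod is a simplified, hypothetical stand-in for the real API type.
type Pod struct {
	Namespace   string
	SecretNames []string // Secrets referenced via volumes or env vars
}

// AddFinalizer models the admission step: attach the finalizer once.
func AddFinalizer(s *Secret) {
	for _, f := range s.Finalizers {
		if f == protectionFinalizer {
			return
		}
	}
	s.Finalizers = append(s.Finalizers, protectionFinalizer)
}

// InUse reports whether any Pod in the Secret's namespace references it.
func InUse(s *Secret, pods []Pod) bool {
	for _, p := range pods {
		if p.Namespace != s.Namespace {
			continue
		}
		for _, n := range p.SecretNames {
			if n == s.Name {
				return true
			}
		}
	}
	return false
}

// Reconcile models the controller step (assumed behavior): once deletion has
// been requested and no user remains, drop the finalizer. It returns true if
// the finalizer was removed.
func Reconcile(s *Secret, pods []Pod) bool {
	if !s.Deleting || InUse(s, pods) {
		return false
	}
	kept := make([]string, 0, len(s.Finalizers))
	removed := false
	for _, f := range s.Finalizers {
		if f == protectionFinalizer {
			removed = true
			continue
		}
		kept = append(kept, f)
	}
	s.Finalizers = kept
	return removed
}

func main() {
	s := &Secret{Namespace: "default", Name: "db-password"}
	AddFinalizer(s)
	pods := []Pod{{Namespace: "default", SecretNames: []string{"db-password"}}}

	s.Deleting = true
	fmt.Println(Reconcile(s, pods)) // false: a Pod still references the Secret
	fmt.Println(Reconcile(s, nil))  // true: no user remains, finalizer removed
}
```

A real controller would watch Pods, PVs, and Snapshots through informers and re-run this check on each event; the sketch only shows the decision itself.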
+
+### Graduation Criteria
+#### Alpha -> Beta Graduation
+
+- Gather feedback from developers and surveys
+- Tests are in Testgrid and linked in KEP
+
+#### Beta -> GA Graduation
+
+- Allowing time for feedback
+
+#### Removing a Deprecated Flag
+
+- Announce deprecation and support policy of the existing flag
+- Two versions passed since introducing the functionality that deprecates the flag (to address version skew)
+- Address feedback on usage/changed behavior, provided on GitHub issues
+- Deprecate the flag
+
+### Upgrade / Downgrade Strategy
+
+
+- Upgrade: Secrets that don't have the `kubernetes.io/secret-protection` finalizer will have it added on Update/Delete events, so no additional user operation is needed.
+- Downgrade:
+  - Feature disabled case: If the secret-protection-controller exists and the feature is disabled, the `kubernetes.io/secret-protection` finalizer will always be removed, so no additional user operation is needed.
+  - Downgraded to no secret-protection-controller case: If no secret-protection-controller exists but the `kubernetes.io/secret-protection` finalizer has been added to secrets, nothing removes the finalizer. Therefore, the user needs to remove the `kubernetes.io/secret-protection` finalizer from all secrets manually.
+
+### Version Skew Strategy
+
+- As for components, this feature involves only the admission controller and the secret-protection-controller, so version skew won't happen unless these components are running at different versions.
+- As for resources, CSI Volume and CSI Snapshot are involved; changes in the API/CRD of these resources, especially to their secret fields, might cause issues. However, this would be a compatibility issue for those APIs/CRDs.
+
+## Production Readiness Review Questionnaire
+
+
+
+### Feature Enablement and Rollback
+###### How can this feature be enabled / disabled in a live cluster?
+
+- [x] Feature gate (also fill in values in `kep.yaml`)
+  - Feature gate name: SecretInUseProtection
+  - Components depending on the feature gate:
+    - secret-protection-controller
+    - storageobjectinuseprotection admission plugin
+
+###### Does enabling the feature change any default behavior?
+
+Secrets aren't deleted until the resources using them are deleted.
+
+###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
+
+Yes, by disabling the feature gate.
+
+###### What happens if we reenable the feature if it was previously rolled back?
+
+Secrets are again not deleted until the resources using them are deleted.
+
+###### Are there any tests for feature enablement/disablement?
+
+Yes, unit tests for the secret-protection-controller cover scenarios where the feature is disabled or enabled.
+
+### Rollout, Upgrade and Rollback Planning
+
+
+
+###### How can a rollout fail? Can it impact already running workloads?
+
+
+
+###### What specific metrics should inform a rollback?
+
+
+
+###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
+
+
+
+###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
+
+
+
+### Monitoring Requirements
+
+
+
+###### How can an operator determine if the feature is in use by workloads?
+
+There will be secrets that have the `kubernetes.io/secret-protection` finalizer.
+
+###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
+
+- [x] Metrics
+  - Metric name: secret_protection_controller
+  - [Optional] Aggregation method: prometheus
+  - Components exposing the metric: secret-protection-controller
+
+###### What are the reasonable SLOs (Service Level Objectives) for the above SLIs?
+
+
+
+###### Are there any missing metrics that would be useful to have to improve observability of this feature?
+
+
+### Dependencies
+
+
+
+###### Does this feature depend on any specific services running in the cluster?
+
+
+
+### Scalability
+###### Will enabling / using this feature result in any new API calls?
+
+- API call type: Update Secret, List Pod/PV/Snapshot, Get PVC/SC
+- estimated throughput: TBD
+- originating component: secret-protection-controller
+- API calls are triggered by changes to Secrets, Pods, PVs, and Snapshots
+
+###### Will enabling / using this feature result in introducing new API types?
+
+No.
+
+###### Will enabling / using this feature result in any new calls to the cloud provider?
+
+No.
+
+###### Will enabling / using this feature result in increasing size or count of the existing API objects?
+
+- API type(s): Secret
+- Estimated increase in size: the size of the `kubernetes.io/secret-protection` finalizer per secret
+- Estimated amount of new objects: N/A
+
+###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
+
+
+
+###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
+
+API:
+  - Size: The increase in size of each Secret is very limited.
+  - Number of calls: A rate limit is set for the number of API calls.
+Disk/IO:
+  - No disk I/O is performed other than through API calls and log output.
+CPU/RAM:
+  - It follows the common controller pattern. Therefore, only the number of resources to process and the logic for detecting in-use secrets need to be examined.
+
+### Troubleshooting
+
+
+
+###### How does this feature react if the API server and/or etcd is unavailable?
+
+###### What are other known failure modes?
+
+
+
+###### What steps should be taken if SLOs are not being met to determine the problem?
+
+## Implementation History
+
+
+
+## Drawbacks
+
+
+
+## Alternatives
+
+- Manually adding/deleting a finalizer to/from secrets that shouldn't be deleted during a certain life cycle
+- Introduce a new kind of reference, like usedReference, and leave addition/deletion of it to users
+  (Similar to ownerReference, but it only blocks deletion and won't try to delete referenced resources through GC, like deleting a child on its parent's deletion).
+
+The above approaches force users to do additional work to protect secrets. Also, they are inconsistent with the pv-protection/pvc-protection concepts.
diff --git a/keps/sig-storage/2639-secret-protection/kep.yaml b/keps/sig-storage/2639-secret-protection/kep.yaml
new file mode 100644
index 000000000000..268286e1b7a6
--- /dev/null
+++ b/keps/sig-storage/2639-secret-protection/kep.yaml
@@ -0,0 +1,47 @@
+title: secret protection
+kep-number: 2639
+authors:
+  - "@mkimuram"
+owning-sig: sig-storage
+participating-sigs:
+  - sig-storage
+status: implementable
+creation-date: 2021-04-20
+reviewers:
+  - TBD
+approvers:
+  - TBD
+prr-approvers:
+  - TBD
+#see-also:
+#  - "/keps/sig-aaa/1234-we-heard-you-like-keps"
+#  - "/keps/sig-bbb/2345-everyone-gets-a-kep"
+#replaces:
+#  - "/keps/sig-ccc/3456-replaced-kep"
+
+# The target maturity stage in the current dev cycle for this KEP.
+stage: alpha
+
+# The most recent milestone for which work toward delivery of this KEP has been
+# done. This can be the current (upcoming) milestone, if it is being actively
+# worked on.
+latest-milestone: "v1.22"
+
+# The milestone at which this feature was, or is targeted to be, at each stage.
+milestone:
+  alpha: "v1.22"
+  beta: "v1.23"
+  stable: "v1.24"
+
+# The following PRR answers are required at alpha release
+# List the feature gate name and the components for which it must be enabled
+feature-gates:
+  - name: SecretInUseProtection
+    components:
+      - secret-protection-controller
+      - storageobjectinuseprotection admission plugin
+disable-supported: true
+
+# The following PRR answers are required at beta release
+#metrics:
+#  - my_feature_metric