
kubelet does not reload recreated immutable configmap, unclear documentation #42359

Open
juliantaylor opened this issue Aug 2, 2023 · 26 comments
Labels: kind/documentation, sig/architecture, sig/docs, sig/node, triage/accepted

@juliantaylor commented Aug 2, 2023

What happened?

Immutable configmaps and secrets are not watched by the kubelet. This also means the kubelet does not refresh the loaded content when they are deleted and recreated.

The documentation of immutable configmaps, however, is misleading about how this needs to be handled by pods:
https://kubernetes.io/docs/concepts/configuration/configmap/#configmap-immutable

Once a ConfigMap is marked as immutable, it is not possible to revert this change nor to mutate the contents of the data or the binaryData field. You can only delete and recreate the ConfigMap. Because existing Pods maintain a mount point to the deleted ConfigMap, it is recommended to recreate these pods.

To me this implies that deleting and recreating the configmap and restarting the pods using it is sufficient, but it is not, as the kubelet will still have the old version loaded and won't update it until it is restarted.

This behavior is more or less expected but the documentation should be more explicit about this pitfall.

What did you expect to happen?

The documentation should state that immutable objects mounted into pods need to be deleted and recreated with a different name in order for the (recreated) pods to see the new version reliably.

How can we reproduce it (as minimally and precisely as possible)?

Start a deployment with an immutable configmap mounted. Change the configmap, delete and recreate it with the same name, and restart the deployment.
Pods on that kubelet's node will still see the old version until the kubelet is restarted and the pods are then recreated.
see #42359 (comment) for an example deployment

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3", GitCommit:"25b4e43193bcda6c7328a6d147b1fb73a33f1598", GitTreeState:"clean", BuildDate:"2023-06-15T00:36:28Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

bare metal

OS version

No response

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@juliantaylor juliantaylor added the kind/bug Categorizes issue or PR as related to a bug. label Aug 2, 2023
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 2, 2023
@vaibhav2107 (Member)

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 2, 2023
@tzneal (Contributor) commented Aug 2, 2023

/kind documentation
/remove-kind bug
/cc @sftim
/triage accepted

@k8s-ci-robot k8s-ci-robot added kind/documentation Categorizes issue or PR as related to documentation. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 2, 2023
@sftim (Contributor) commented Aug 2, 2023

/uncc

/sig docs

@k8s-ci-robot k8s-ci-robot added the sig/docs Categorizes an issue or PR as relevant to SIG Docs. label Aug 2, 2023
@sftim (Contributor) commented Aug 2, 2023

/transfer website

@k8s-ci-robot k8s-ci-robot transferred this issue from kubernetes/kubernetes Aug 2, 2023
@sftim (Contributor) commented Aug 2, 2023

I believe from the evidence here that we may be able to build a test case where, starting from a namespace that contains:

  • no ConfigMaps
  • no Pods

we can create a ConfigMap and a Pod that references the ConfigMap, and (because of what was previously in that namespace) see the old ConfigMap loaded into that Pod, even if the new ConfigMap is not marked as immutable.

The period where that's possible might be quite short though.


This sounds more plausible; starting from a namespace that contains:

  • no ConfigMaps
  • some running Pods that reference a previous immutable ConfigMap

create a new ConfigMap (same name) and a new Pod. You may observe that the new Pod appears to have loaded the wrong data, from the old ConfigMap.

We might need to define that you should delete every object that could have had a reference to that old ConfigMap - potentially including cluster-scoped objects such as IngressClasses and ValidatingAdmissionPolicies.

Advising people to create the replacement ConfigMap with a different name is not enough; we cannot assume that a different user is aware of the potential problems around use of the old name.

@sftim (Contributor) commented Aug 2, 2023

We might be able to detect in-use immutable ConfigMaps (within the control plane), and do something around finalizing them. That mechanism could eventually enable sending a warning when an immutable ConfigMap is deleted whilst still in use.

@sftim (Contributor) commented Aug 2, 2023

I recommend that SIG Architecture chip in here.

/sig architecture

@k8s-ci-robot k8s-ci-robot added the sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. label Aug 2, 2023
@pegasas (Contributor) commented Aug 2, 2023

/assign @pegasas

@pegasas (Contributor) commented Aug 2, 2023

Hi @sftim,
I ran the test case you mentioned on Azure Kubernetes 1.25.11, referring to this doc: https://kubernetes.io/docs/tasks/configure-pod-container/configure-pod-configmap/

apiVersion: v1
kind: ConfigMap
metadata:
  name: special-config
  namespace: default
data:
  special.how: very
immutable: true

and

apiVersion: v1
kind: Pod
metadata:
  name: ubuntu
  labels:
    app: ubuntu
spec:
  containers:
  - image: ubuntu
    command:
      - "sleep"
      - "3650d"
    env:
      - name: SPECIAL_LEVEL_KEY
        valueFrom:
          configMapKeyRef:
            name: special-config
            key: special.how
    imagePullPolicy: IfNotPresent
    name: ubuntu
  restartPolicy: Always

I create these two objects in the cluster and echo the key:

kubectl exec -it ubuntu -- /bin/sh
# echo $SPECIAL_LEVEL_KEY
very

After I delete the ConfigMap and recreate it as:

apiVersion: v1
kind: ConfigMap
metadata:
  name: special-config
  namespace: default
data:
  special.how: test
immutable: true

the pod keeps the same value (note that environment variables are resolved only when the container starts, so a running Pod would not pick up changes even from a mutable ConfigMap):

kubectl exec -it ubuntu -- /bin/sh
# echo $SPECIAL_LEVEL_KEY
very

@tengqm (Contributor) commented Aug 3, 2023

The motivation for immutable ConfigMaps/Secrets is to avoid watching them. Since they are immutable, you get what is in them and treat it as constant. You don't care about future changes to them; they are immutable after all. If you want to watch them, especially for updates, you wouldn't use immutable ConfigMaps/Secrets.

Protecting ConfigMap/Secret in use is another topic. See KEP 2840.

@sftim (Contributor) commented Aug 3, 2023

The test case explained in #42359 (comment) covers already documented behavior; if that's the test case, I don't think we need to make a docs change.

I thought you were concerned about ConfigMap removal and recreation with a new uid @juliantaylor, where a new Pod might see the old ConfigMap.

@juliantaylor (Author) commented Aug 3, 2023

Protecting a configmap from being deleted while it is in use is not sufficient; a configmap or secret name cannot be reused after it has been deleted until all kubelets have been restarted, or you risk a pod seeing old content.

This likely also happens long after all pods using the old version are gone (create a deployment and an immutable configmap, delete them, wait some time, redeploy with different configmap content), though I have only explicitly verified rolling updates with the same configmap name, not full deletes and longer timescales.

The practical solution is to include a checksum of the content in the name of each immutable configmap and secret. That way it is almost impossible for wrong data to end up in a pod, even when the kubelet runs for a long time.
This practice should be mentioned in the documentation.
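
A minimal sketch of that naming scheme (hypothetical file and object names; any stable hash would do):

# Derive the ConfigMap name from a checksum of its content, so a content
# change always produces a name no kubelet can have cached before.
HASH=$(md5sum script.py | cut -c1-8)
kubectl create configmap "testcm-${HASH}" --from-file=script.py
# Then point the Deployment's volume at "testcm-${HASH}" and re-apply it.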

If you can fix the underlying issue, all the better, but as stated the point of immutable objects is to reduce API requests and watches, so watching for deletes is not a sensible option.

Possibly you could have the kubelet refetch the object when it is required but only if it already exists as immutable in its caches.

@sftim (Contributor) commented Aug 3, 2023

The practical solution is to include a checksum of the content in the name of each immutable configmap and secret.

Let's make sure we know what (replicable) problem we're helping people to avoid, before we triage this as accepted.

@juliantaylor (Author) commented Aug 3, 2023

OK, here is the reproducer I stated at the beginning, written out:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
  name: cm-test
spec:
  replicas: 7
  selector:
    matchLabels:
      app: cm-test
  template:
    metadata:
      labels:
        app: cm-test
    spec:
      terminationGracePeriodSeconds: 1
      volumes:
        - name: cm
          configMap:
            name: testcm
      containers:
      - image: debian:bookworm
        command:
          - bash
          - "-c"
          - 'while true; do echo -n "$NODENAME "; md5sum /cm/script.py || true; sleep 20; done'
        imagePullPolicy: Always
        name: test
        securityContext:
          runAsUser: 1000
        volumeMounts:
          - name: cm
            mountPath: /cm
        env:
          - name: NODENAME
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: spec.nodeName
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: testcm
data:
  script.py: "content7"
immutable: true

Apply it and check the logs:

kubectl apply -f deployment.yaml

logs:
cm-test-69864cf46f-smjf5 kworker-be-sandbox-iz2-004 f4a40d590cbb53290e7ba914e24a346c  /cm/script.py

Delete the ConfigMap, change its content, re-apply, and restart the deployment:

kubectl delete cm testcm
sed -i -e "s/content7/content8/" deployment.yaml
kubectl apply -f deployment.yaml
kubectl rollout restart deployment cm-test

logs (same checksum):
cm-test-8b5456fc-gc2l4 kworker-be-sandbox-iz2-004 f4a40d590cbb53290e7ba914e24a346c  /cm/script.py

Restart the kubelet, or launch the pod on a node it was not running on yet, and you get the new content.
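
(For completeness: on a systemd-managed node, restarting the kubelet is typically

sudo systemctl restart kubelet

though the unit name may vary by distribution.)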

@juliantaylor (Author)

A full delete of the deployment does seem to refresh the configmap, so the kubelet may have some mechanism to clean up unused configmaps; in that case, protecting in-use objects from deletion would work.
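
For reference, a sketch of the sequence that appeared to refresh the content, based on the reproducer above (not exhaustively tested):

kubectl delete deployment cm-test   # removes every Pod referencing testcm
kubectl delete cm testcm
sed -i -e "s/content7/content8/" deployment.yaml
kubectl apply -f deployment.yaml    # recreate ConfigMap and Deployment together
# new Pods now log the new checksum, suggesting the kubelet dropped its cached copy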

@juliantaylor (Author)

Any updates? Unless it can be trivially fixed and released fast, I just want the documentation to be updated to make the existing semantics of immutable objects clearer.

@sftim (Contributor) commented Aug 30, 2023

@juliantaylor the Kubernetes project is open source and the documentation is largely maintained by volunteers.

Depending on how much you want a fix, there are a few ways you could help.

However, we first want to make sure we understand what needs fixing, or what should be made clearer. Can you help with that part?
I saw that you'd added more detail. Thanks.

@pegasas (Contributor) commented Aug 30, 2023

Hi @juliantaylor,

The practical solution is to include a checksum of the content in the name of each immutable configmap and secret. That way it is almost impossible for wrong data to end up in a pod, even when the kubelet runs for a long time.
This practice should be mentioned in the documentation.

You aim to add this to the docs as a supplement, right?
I think the community's view is that immutable secrets are designed to be treated as constant: you don't care about future changes to them.

I think your scenario may go against that design rule.

What do you think?

@juliantaylor (Author) commented Aug 30, 2023

I would suggest updating the following from:

Once a ConfigMap is marked as immutable, it is not possible to revert this change nor to mutate the contents of the data or the binaryData field. You can only delete and recreate the ConfigMap. Because existing Pods maintain a mount point to the deleted ConfigMap, it is recommended to recreate these pods.

to

Once a ConfigMap is marked as immutable, it is not possible to revert this change nor to mutate the contents of the data or the binaryData field. Also, after deleting and recreating the object with new content, Kubernetes components may retain the original content in their caches and continue to provide the original content to new pods.
It is recommended to ensure that immutable object names are not reused, for example by adding a checksum of the contents to the object name, to ensure pods receive the intended content.

I can open a doc PR if you agree this represents intended behavior, or won't be fixed soon and is thus worth documenting for the time being.
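
For what it's worth, existing tooling already implements the checksum naming scheme; kustomize's configMapGenerator, for example, appends a content-hash suffix to the generated ConfigMap's name and rewrites references to it:

# kustomization.yaml — generates e.g. "testcm-5h9gg7k2tf" (suffix computed
# from the content) and updates every reference in deployment.yaml
resources:
- deployment.yaml
configMapGenerator:
- name: testcm
  files:
  - script.py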

@pegasas (Contributor) commented Aug 30, 2023

Yeah, indeed it matches the behavior you mentioned.
However, I think it may mislead users into deleting and recreating immutable secrets.
If you want to modify the secret, you should use a regular secret rather than an immutable one.
We do not encourage this behavior because it is not by design.

The motivation for immutable ConfigMaps/Secrets is to avoid watching them. Since they are immutable, you get what is in them and treat it as constant. You don't care about future changes to them; they are immutable after all. If you want to watch them, especially for updates, you wouldn't use immutable ConfigMaps/Secrets.

Protecting ConfigMap/Secret in use is another topic. See KEP 2840.

The test case explained in #42359 (comment) covers already documented behavior; if that's the test case, I don't think we need to make a docs change.

I thought you were concerned about ConfigMap removal and recreation with a new uid @juliantaylor, where a new Pod might see the old ConfigMap.

I believe that this is what @tengqm and @sftim trying to clarify in the comments above.

@juliantaylor (Author) commented Aug 30, 2023

I am only concerned about the existing documentation implying that deleting and recreating an immutable configmap/secret and restarting pods is safe, which it is not, due to the potential serving of stale content.

I'm also fine with updating it to say do not ever delete and recreate an immutable secret or configmap, and removing the whole section about doing so and restarting pods. But in this case it should also be mentioned why, as it may not be immediately obvious to the reader that this is unsafe even though deletion is allowed by the API.

As for why one would use immutable configmaps: in large clusters with many large nodes and pods, the numerous configmap watches do put some load on the API. The majority of configmaps never or barely ever change (e.g. root CA certificates), and many deployments use versioned configmaps anyway (each change creates a new configmap).
So we mark as many configmaps as we can immutable, even if they do occasionally change, as it's just free load reduction.
While updating some components that did not use versioned configmaps we discovered this behaviour. Why it behaves like this is clear after the fact, but a reader of the immutable documentation should not have to discover it the hard way, which is why I would like to see the docs fixed.

@juliantaylor (Author)

Maybe this is better at discouraging changing immutable objects:

Once a ConfigMap is marked as immutable, it is not possible to revert this change nor to mutate the contents of the data or the binaryData field. Deletion of an immutable ConfigMap should be avoided, as when one recreates an immutable ConfigMap with the same name at a later time, the old deleted content may still be served from caches.

Possibly one could also clarify how long the deleted configmap/secret can be cached, which I assume is as long as something in the cluster references it (e.g. a still-running pod with the deleted configmap mounted).

@pegasas (Contributor) commented Aug 30, 2023

Possibly one could also clarify how long the deleted configmap/secret can be cached, which I assume is as long as something in the cluster references it (e.g. a still-running pod with the deleted configmap mounted).

I think this is a good point: clarifying how long the deleted configmap/secret can be cached.
That way the guidance would not depend on this particular behavior.
We could find a suitable place to add this to the documentation if there is no such description right now.

@sftim (Contributor) commented Aug 30, 2023

Let's clarify that you should prefer to use a new name for your update to an immutable ConfigMap or Secret. If you want to use the same name, you have to first make sure that every existing reference to the old object (not just Pods, any reference) has been cleaned up.
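
A rough illustration of that check for Pods (a sketch only; it covers volumes, env, and envFrom, but not other object kinds that may embed the name):

# list Pods in all namespaces that still reference the ConfigMap "testcm"
kubectl get pods -A -o json | jq -r '
  .items[]
  | select(
      any(.spec.volumes[]?; .configMap.name == "testcm")
      or any(.spec.containers[].env[]?; .valueFrom.configMapKeyRef.name == "testcm")
      or any(.spec.containers[].envFrom[]?; .configMapRef.name == "testcm")
    )
  | "\(.metadata.namespace)/\(.metadata.name)"'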

@pisto commented May 27, 2024

As of today, the official docs are not clear on this subject. In both the ConfigMap and Secrets concepts pages, it states:

Once a Secret or ConfigMap is marked as immutable, it is not possible to revert this change nor to mutate the contents of the data field. You can only delete and recreate the Secret. Existing Pods maintain a mount point to the deleted Secret - it is recommended to recreate these pods.

This does not hint to the user that recreated pods may still get the old stale value, at least when it is consumed as environment variables. The message as written actually suggests (IMO clearly states) that when an immutable secret changes, it is sufficient to recreate it and the consuming pods, which is not correct.

It took us quite some time to figure this out, as we had been seeing inconsistent behavior when scheduling a pod across nodes (some nodes would see the new value, others the old value). For an end user, this does not make any sense. From an administrator's perspective, it took some time to isolate the issue and the particular level in the stack (kubelet? container runtime? race issues?) where the problem was originating. This also breaks the assumption that the state of the cluster is seen consistently across command line users, kubelets, and other observers such as operators (we are using the ExternalSecrets operator, so that's one extra layer we had to debug and exclude from the loop to understand the problem).

IMO it should be worded roughly this way:

Once a Secret or ConfigMap is marked as immutable, it is not possible to revert this change nor to mutate the contents of the data field. You can only delete and recreate the Secret. Even so, the kubelet does not watch for changes on immutable Secrets and ConfigMaps, so newly created pods may still get stale data. If one wants to ensure that the new data values are used, the suggested course of action is to recreate the Secret or ConfigMap under a new name.

However, I would strongly prefer a solution that fixes the root issue: have the kubelet check the uid (or similar) of the requested secret/configmap and purge its internal cache based on that. Maybe this API can help in requesting only the metadata, if receiving the full object is a performance concern.
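
One way to see that the recreated object really is a different instance (a diagnostic sketch, not a fix):

# the uid changes across delete+recreate even though the name stays the same,
# which suggests the kubelet's cache does not key on it
kubectl get configmap special-config -o jsonpath='{.metadata.uid}'
kubectl delete configmap special-config
kubectl apply -f configmap.yaml   # recreate under the same name
kubectl get configmap special-config -o jsonpath='{.metadata.uid}'   # different uid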

@sftim (Contributor) commented May 28, 2024

Repeating #42359 (comment) for clarity
Let's clarify that you should prefer to use a new name for your update to an immutable ConfigMap or Secret. If you want to use the same name, you have to first make sure that every existing reference to the old object (not just Pods, any reference) has been cleaned up.


It may be worse than this: you may also need to reboot nodes that have previously run Pods using the old ConfigMap, or go fix Kubernetes bugs, or whatever. However, we should nonetheless be recommending a straightforward approach: stop using the name of a ConfigMap or Secret that was once marked as immutable; make a new one instead.
