A configuration change in the TLS security profile of the APIServer can cause a missed reconciliation loop #2032
If not explicitly set on its CR, the HCO webhook consumes the TLS configuration from the OpenShift cluster-wide APIServer CR. For performance reasons it does not read it on every request to the HCO CR; it consumes a cached representation instead. The cache was only refreshed by a controller based on an informer. We got reports that, due to the nature of changes in the APIServer CR, the connection to the API server itself could become stuck:

```
W1025 13:50:16.898592 1 reflector.go:424] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262: failed to list *v1.APIServer: Get "https://172.30.0.1:443/apis/config.openshift.io/v1/apiservers?resourceVersion=1572273": dial tcp 172.30.0.1:443: connect: connection refused
E1025 13:50:16.898683 1 reflector.go:140] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262: Failed to watch *v1.APIServer: failed to list *v1.APIServer: Get "https://172.30.0.1:443/apis/config.openshift.io/v1/apiservers?resourceVersion=1572273": dial tcp 172.30.0.1:443: connect: connection refused
I1025 13:50:43.182360 1 trace.go:205] Trace[621733159]: "Reflector ListAndWatch" name:sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262 (25-Oct-2022 13:50:19.338) (total time: 23843ms):
Trace[621733159]: ---"Objects listed" error:<nil> 23843ms (13:50:43.182)
Trace[621733159]: [23.843677488s] [23.843677488s] END
I1025 13:50:43.716723 1 trace.go:205] Trace[255710357]: "Reflector ListAndWatch" name:sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262 (25-Oct-2022 13:50:12.260) (total time: 31456ms):
Trace[255710357]: ---"Objects listed" error:<nil> 31456ms (13:50:43.716)
Trace[255710357]: [31.45666834s] [31.45666834s] END
I1025 13:50:43.968506 1 trace.go:205] Trace[2001360213]: "Reflector ListAndWatch" name:sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262 (25-Oct-2022 13:50:11.520) (total time: 32447ms):
Trace[2001360213]: ---"Objects listed" error:<nil> 32447ms (13:50:43.968)
Trace[2001360213]: [32.44785055s] [32.44785055s] END
```

In controller-runtime the default SyncPeriod, at which all the watched resources are refreshed, is 10 hours (see kubernetes-sigs/controller-runtime#521 for the reasons), but that appears too long for this specific use case. Let's ensure we read the APIServer CR at least once every minute.

Make the logs less verbose.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2137896

Remove this once kubernetes-sigs/controller-runtime#2032 is properly addressed

Signed-off-by: Simone Tiraboschi <stirabos@redhat.com>
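The workaround described above amounts to shortening the controller-runtime cache resync interval. Below is a minimal sketch of what that looks like, not the actual HCO code, assuming a controller-runtime version of that era where SyncPeriod is a field of manager.Options (newer releases move it under the cache options):

```go
// A minimal sketch, not the actual HCO change: shorten the controller-runtime
// cache resync period so the APIServer CR is re-listed at least once a minute.
// Assumes SyncPeriod is still a field of manager.Options in the version used.
package main

import (
	"os"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

func main() {
	// Re-list all watched resources every minute instead of the default
	// 10 hours, so a watch event missed during an API server restart is
	// recovered quickly.
	resync := 1 * time.Minute

	mgr, err := manager.New(ctrl.GetConfigOrDie(), manager.Options{
		SyncPeriod: &resync,
	})
	if err != nil {
		os.Exit(1)
	}

	// ... register controllers and webhooks on mgr here ...

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```

The trade-off is a full re-list of every watched resource each minute, which is exactly why controller-runtime defaults to 10 hours (see kubernetes-sigs/controller-runtime#521).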
That looks to me like your apiserver is being restarted, which results in a temporary disconnect. I am not sure I understand what controller-runtime can or should do about that?
Yes, I'm pretty sure that the APIServer got restarted due to a change in its TLS configuration.
So you are saying the restart of the apiserver causes you to miss events? Can you provide a reproducer for this, please? Also, we just re-use the upstream informers in our cache, and this seems like a bug in the informer; would you mind reporting it on kubernetes/kubernetes?
It seems this isn't reproducible without an OpenShift cluster, and if it does indeed happen, it is a bug in the underlying informer that we just import from client-go. Please create an issue in upstream kubernetes/kubernetes if you continue seeing this.
Our operator implements a controller that watches the APIServer CR on OpenShift to follow tlsSecurityProfile changes there and propagate them to our operands.
See https://docs.openshift.com/container-platform/4.11/security/tls-security-profiles.html for more details.
During our tests, we hit more cases where a change to the tlsSecurityProfile configuration of the APIServer CR caused some temporary (?) connection trouble on the informer used by controller-runtime.
In the logs of our operator we see connection errors like the ones quoted above: the change is completely lost, and our operator will not retry reconciliation until the resync period elapses (and the default resync period is 10 hours).
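For context, here is a minimal sketch of the kind of controller described here, watching the cluster-scoped APIServer CR through controller-runtime. This is not the reporter's actual operator code: the OpenShift API import path and the Spec.TLSSecurityProfile field are assumptions based on the openshift/api module, and registering configv1 into the manager's scheme is omitted.

```go
// A minimal sketch, not the reporter's actual operator: a controller that
// watches the cluster-scoped APIServer CR so tlsSecurityProfile changes can
// be propagated to operands. Import path and field names for the OpenShift
// API are assumptions; scheme registration for configv1 is omitted.
package main

import (
	"context"

	configv1 "github.com/openshift/api/config/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/log"
)

type apiServerReconciler struct {
	client.Client
}

func (r *apiServerReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	apiServer := &configv1.APIServer{}
	if err := r.Get(ctx, req.NamespacedName, apiServer); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Propagate apiServer.Spec.TLSSecurityProfile to the operands here.
	log.FromContext(ctx).Info("observed TLS security profile",
		"profile", apiServer.Spec.TLSSecurityProfile)

	// If the watch event for this change is lost (e.g. the API server restarts
	// while applying the new TLS configuration), nothing triggers Reconcile
	// again until the cache resync period, which defaults to 10 hours.
	return ctrl.Result{}, nil
}

// SetupWithManager wires the reconciler to watch the APIServer CR.
func (r *apiServerReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&configv1.APIServer{}).
		Complete(r)
}
```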