A configuration change in the TLS security profile of the APIServer can cause a missed reconciliation loop #2032

Closed
tiraboschi opened this issue Oct 26, 2022 · 4 comments


@tiraboschi

Our operator implements a controller that watches the APIServer CR on OpenShift to follow tlsSecurityProfile changes there and propagate them to our operands.
See https://docs.openshift.com/container-platform/4.11/security/tls-security-profiles.html for more details.
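
For context, a watch on that CR can be wired up with controller-runtime roughly as follows (a minimal sketch, not our operator's actual code; the reconciler name is hypothetical and the openshift/api config/v1 types are assumed to be registered in the manager's scheme):

```go
package controller

import (
	"context"

	configv1 "github.com/openshift/api/config/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// tlsProfileReconciler is a hypothetical reconciler that mirrors the
// cluster-wide TLS security profile into the operands.
type tlsProfileReconciler struct {
	client.Client
}

func (r *tlsProfileReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	apiServer := &configv1.APIServer{}
	if err := r.Get(ctx, req.NamespacedName, apiServer); err != nil {
		// The APIServer CR is cluster-scoped (named "cluster"); ignore not-found races.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// Propagate apiServer.Spec.TLSSecurityProfile to the operands here.
	return ctrl.Result{}, nil
}

func setupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&configv1.APIServer{}).
		Complete(&tlsProfileReconciler{Client: mgr.GetClient()})
}
```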

During our tests, we hit several cases where a change to the tlsSecurityProfile configuration of the APIServer CR caused some (apparently temporary) connection trouble on the informer used by controller-runtime.

In the logs of our operator we see:

W1025 13:50:16.898592       1 reflector.go:424] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262: failed to list *v1.APIServer: Get "https://172.30.0.1:443/apis/config.openshift.io/v1/apiservers?resourceVersion=1572273": dial tcp 172.30.0.1:443: connect: connection refused
E1025 13:50:16.898683       1 reflector.go:140] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262: Failed to watch *v1.APIServer: failed to list *v1.APIServer: Get "https://172.30.0.1:443/apis/config.openshift.io/v1/apiservers?resourceVersion=1572273": dial tcp 172.30.0.1:443: connect: connection refused
I1025 13:50:43.182360       1 trace.go:205] Trace[621733159]: "Reflector ListAndWatch" name:sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262 (25-Oct-2022 13:50:19.338) (total time: 23843ms):
Trace[621733159]: ---"Objects listed" error:<nil> 23843ms (13:50:43.182)
Trace[621733159]: [23.843677488s] [23.843677488s] END
I1025 13:50:43.716723       1 trace.go:205] Trace[255710357]: "Reflector ListAndWatch" name:sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262 (25-Oct-2022 13:50:12.260) (total time: 31456ms):
Trace[255710357]: ---"Objects listed" error:<nil> 31456ms (13:50:43.716)
Trace[255710357]: [31.45666834s] [31.45666834s] END
I1025 13:50:43.968506       1 trace.go:205] Trace[2001360213]: "Reflector ListAndWatch" name:sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262 (25-Oct-2022 13:50:11.520) (total time: 32447ms):
Trace[2001360213]: ---"Objects listed" error:<nil> 32447ms (13:50:43.968)
Trace[2001360213]: [32.44785055s] [32.44785055s] END

The change is completely lost, and our operator will not retry reconciliation until the resync period elapses (and the default is 10 hours).
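
For reference, that 10-hour default is controller-runtime's manager-wide SyncPeriod. A hedged sketch of shortening it (to one hour here, an arbitrary value chosen only for illustration) on the controller-runtime versions current at the time of this issue, where SyncPeriod was still a top-level manager option (newer releases moved it under the cache options):

```go
import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func newManager() (ctrl.Manager, error) {
	// Force a full re-list/resync of every watched resource each hour
	// instead of the ~10h default, at the cost of extra apiserver load.
	resync := time.Hour
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		SyncPeriod: &resync,
	})
}
```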

tiraboschi added a commit to tiraboschi/hyperconverged-cluster-operator that referenced this issue Oct 26, 2022
If not explicitly set on its CR,
the HCO webhook consumes its TLS configuration
from the OpenShift cluster-wide APIServer CR.
For performance reasons it does not read it on each request
to the HCO CR but consumes a cached representation.
The cache was only refreshed by a controller
based on an informer.
We got reports that, due to the nature
of changes in the APIServer CR, the connection
to the API server itself could become stuck:
```
W1025 13:50:16.898592       1 reflector.go:424] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262: failed to list *v1.APIServer: Get "https://172.30.0.1:443/apis/config.openshift.io/v1/apiservers?resourceVersion=1572273": dial tcp 172.30.0.1:443: connect: connection refused
E1025 13:50:16.898683       1 reflector.go:140] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262: Failed to watch *v1.APIServer: failed to list *v1.APIServer: Get "https://172.30.0.1:443/apis/config.openshift.io/v1/apiservers?resourceVersion=1572273": dial tcp 172.30.0.1:443: connect: connection refused
I1025 13:50:43.182360       1 trace.go:205] Trace[621733159]: "Reflector ListAndWatch" name:sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262 (25-Oct-2022 13:50:19.338) (total time: 23843ms):
Trace[621733159]: ---"Objects listed" error:<nil> 23843ms (13:50:43.182)
Trace[621733159]: [23.843677488s] [23.843677488s] END
I1025 13:50:43.716723       1 trace.go:205] Trace[255710357]: "Reflector ListAndWatch" name:sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262 (25-Oct-2022 13:50:12.260) (total time: 31456ms):
Trace[255710357]: ---"Objects listed" error:<nil> 31456ms (13:50:43.716)
Trace[255710357]: [31.45666834s] [31.45666834s] END
I1025 13:50:43.968506       1 trace.go:205] Trace[2001360213]: "Reflector ListAndWatch" name:sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262 (25-Oct-2022 13:50:11.520) (total time: 32447ms):
Trace[2001360213]: ---"Objects listed" error:<nil> 32447ms (13:50:43.968)
Trace[2001360213]: [32.44785055s] [32.44785055s] END
```

On controller-runtime the default SyncPeriod, after which all
watched resources are refreshed, is 10 hours
(see kubernetes-sigs/controller-runtime#521
for the rationale), but it appears
too long for this specific use case.

Let's ensure we read the APIServer CR at least once every minute.

Make the logs less verbose.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2137896

Remove this once kubernetes-sigs/controller-runtime#2032
is properly addressed

Signed-off-by: Simone Tiraboschi <stirabos@redhat.com>
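
One way to implement the "read it at least once every minute" workaround described in the commit above is a periodic requeue from the reconciler. A minimal sketch, extending the hypothetical tlsProfileReconciler from the earlier snippet (same imports plus "time"); this is not necessarily what the referenced commit actually does:

```go
func (r *tlsProfileReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	apiServer := &configv1.APIServer{}
	if err := r.Get(ctx, req.NamespacedName, apiServer); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// Refresh the cached TLS profile from apiServer.Spec.TLSSecurityProfile here.

	// Requeue unconditionally so the CR is re-read at least once a minute,
	// even if a watch event was dropped while the apiserver restarted.
	return ctrl.Result{RequeueAfter: time.Minute}, nil
}
```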
@alvaroaleman
Member

That looks to me like your apiserver is being restarted, which results in a temporary disconnect. I am not sure I understand what controller-runtime can or should do about that?

@tiraboschi
Author

Yes, I'm pretty sure that the API server got restarted due to a change in its TLS configuration.
On the other hand, we are watching the APIServer CR on OpenShift and the change event there got completely lost, so we will probably only get it resynced after N hours (if I'm not wrong, the resync default is 10 hours).
I'm absolutely fine with the idea of a level-based controller, but in this case, due to the error on the informer, we completely missed the level change.
I don't really know the internals, but I'd expect some better error handling once the API server is back online.

@alvaroaleman
Member

So you are saying the restart of the apiserver causes you to miss events? Can you provide a reproducer for this please?

Also, we just re-use the upstream informers in our cache, and this seems like a bug in the informer. Would you mind reporting it on kubernetes/kubernetes?

kubevirt-bot pushed a commit to kubevirt/hyperconverged-cluster-operator that referenced this issue Nov 2, 2022
kubevirt-bot pushed a commit to kubevirt-bot/hyperconverged-cluster-operator that referenced this issue Nov 2, 2022
kubevirt-bot added a commit to kubevirt/hyperconverged-cluster-operator that referenced this issue Nov 7, 2022
@alvaroaleman
Member

It seems this isn't reproducible without an OpenShift cluster. If it does indeed happen, it is a bug in the underlying informer that we just import from client-go. Please create an issue in upstream kubernetes/kubernetes if you continue seeing this.
