A configuration change in the TLS security profile of the APIServer can cause a missed reconciliation loop #2032

Closed
tiraboschi opened this issue Oct 26, 2022 · 4 comments


@tiraboschi

Our operator implements a controller that watches the APIServer CR on OpenShift to follow tlsSecurityProfile changes there and propagate them to our operands.
See https://docs.openshift.com/container-platform/4.11/security/tls-security-profiles.html for more details.
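
For context, a watch on that CR can be wired up with controller-runtime roughly as follows (a minimal sketch, not our operator's actual code; the reconciler name is hypothetical and the openshift/api config/v1 types are assumed to be registered in the manager's scheme):

```go
package controller

import (
	"context"

	configv1 "github.com/openshift/api/config/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// tlsProfileReconciler is a hypothetical reconciler that mirrors the
// cluster-wide TLS security profile into the operands.
type tlsProfileReconciler struct {
	client.Client
}

func (r *tlsProfileReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	apiServer := &configv1.APIServer{}
	if err := r.Get(ctx, req.NamespacedName, apiServer); err != nil {
		// The APIServer CR is cluster-scoped (named "cluster"); ignore not-found races.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// Propagate apiServer.Spec.TLSSecurityProfile to the operands here.
	return ctrl.Result{}, nil
}

func setupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&configv1.APIServer{}).
		Complete(&tlsProfileReconciler{Client: mgr.GetClient()})
}
```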

During our tests, we hit several cases where a change to the tlsSecurityProfile configuration of the APIServer CR caused some (apparently temporary) connection trouble on the informer used by controller-runtime.

In the logs of our operator we see:

W1025 13:50:16.898592       1 reflector.go:424] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262: failed to list *v1.APIServer: Get "https://172.30.0.1:443/apis/config.openshift.io/v1/apiservers?resourceVersion=1572273": dial tcp 172.30.0.1:443: connect: connection refused
E1025 13:50:16.898683       1 reflector.go:140] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262: Failed to watch *v1.APIServer: failed to list *v1.APIServer: Get "https://172.30.0.1:443/apis/config.openshift.io/v1/apiservers?resourceVersion=1572273": dial tcp 172.30.0.1:443: connect: connection refused
I1025 13:50:43.182360       1 trace.go:205] Trace[621733159]: "Reflector ListAndWatch" name:sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262 (25-Oct-2022 13:50:19.338) (total time: 23843ms):
Trace[621733159]: ---"Objects listed" error:<nil> 23843ms (13:50:43.182)
Trace[621733159]: [23.843677488s] [23.843677488s] END
I1025 13:50:43.716723       1 trace.go:205] Trace[255710357]: "Reflector ListAndWatch" name:sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262 (25-Oct-2022 13:50:12.260) (total time: 31456ms):
Trace[255710357]: ---"Objects listed" error:<nil> 31456ms (13:50:43.716)
Trace[255710357]: [31.45666834s] [31.45666834s] END
I1025 13:50:43.968506       1 trace.go:205] Trace[2001360213]: "Reflector ListAndWatch" name:sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262 (25-Oct-2022 13:50:11.520) (total time: 32447ms):
Trace[2001360213]: ---"Objects listed" error:<nil> 32447ms (13:50:43.968)
Trace[2001360213]: [32.44785055s] [32.44785055s] END

The change is completely lost, and our operator will not retry reconciliation until the resync period elapses (and the default is 10 hours).
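
For reference, that 10-hour default is controller-runtime's manager-wide SyncPeriod. A hedged sketch of shortening it (to one hour here, an arbitrary value chosen only for illustration) on the controller-runtime versions current at the time of this issue, where SyncPeriod was still a top-level manager option (newer releases moved it under the cache options):

```go
import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func newManager() (ctrl.Manager, error) {
	// Force a full re-list/resync of every watched resource each hour
	// instead of the ~10h default, at the cost of extra apiserver load.
	resync := time.Hour
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		SyncPeriod: &resync,
	})
}
```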

tiraboschi added a commit to tiraboschi/hyperconverged-cluster-operator that referenced this issue Oct 26, 2022
If not explicitly set on its CR,
the HCO webhook consumes its TLS configuration
from the OpenShift cluster-wide APIServer CR.
For performance reasons it does not read it on each request
to the HCO CR but consumes a cached representation.
The cache was only refreshed by a controller
based on an informer.
We got reports that, due to the nature
of changes in the APIServer CR, the connection
to the API server itself could become stuck:
```
W1025 13:50:16.898592       1 reflector.go:424] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262: failed to list *v1.APIServer: Get "https://172.30.0.1:443/apis/config.openshift.io/v1/apiservers?resourceVersion=1572273": dial tcp 172.30.0.1:443: connect: connection refused
E1025 13:50:16.898683       1 reflector.go:140] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262: Failed to watch *v1.APIServer: failed to list *v1.APIServer: Get "https://172.30.0.1:443/apis/config.openshift.io/v1/apiservers?resourceVersion=1572273": dial tcp 172.30.0.1:443: connect: connection refused
I1025 13:50:43.182360       1 trace.go:205] Trace[621733159]: "Reflector ListAndWatch" name:sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262 (25-Oct-2022 13:50:19.338) (total time: 23843ms):
Trace[621733159]: ---"Objects listed" error:<nil> 23843ms (13:50:43.182)
Trace[621733159]: [23.843677488s] [23.843677488s] END
I1025 13:50:43.716723       1 trace.go:205] Trace[255710357]: "Reflector ListAndWatch" name:sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262 (25-Oct-2022 13:50:12.260) (total time: 31456ms):
Trace[255710357]: ---"Objects listed" error:<nil> 31456ms (13:50:43.716)
Trace[255710357]: [31.45666834s] [31.45666834s] END
I1025 13:50:43.968506       1 trace.go:205] Trace[2001360213]: "Reflector ListAndWatch" name:sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262 (25-Oct-2022 13:50:11.520) (total time: 32447ms):
Trace[2001360213]: ---"Objects listed" error:<nil> 32447ms (13:50:43.968)
Trace[2001360213]: [32.44785055s] [32.44785055s] END
```

On controller-runtime the default SyncPeriod, after which all
watched resources are refreshed, is 10 hours
(see kubernetes-sigs/controller-runtime#521
for the rationale), but it appears
too long for this specific use case.

Let's ensure we read the APIServer CR at least once every minute.

Make the logs less verbose.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2137896

Remove this once kubernetes-sigs/controller-runtime#2032
is properly addressed

Signed-off-by: Simone Tiraboschi <stirabos@redhat.com>
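
One way to implement the "read it at least once every minute" workaround described in the commit above is a periodic requeue from the reconciler. A minimal sketch, extending the hypothetical tlsProfileReconciler from the earlier snippet (same imports plus "time"); this is not necessarily what the referenced commit actually does:

```go
func (r *tlsProfileReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	apiServer := &configv1.APIServer{}
	if err := r.Get(ctx, req.NamespacedName, apiServer); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// Refresh the cached TLS profile from apiServer.Spec.TLSSecurityProfile here.

	// Requeue unconditionally so the CR is re-read at least once a minute,
	// even if a watch event was dropped while the apiserver restarted.
	return ctrl.Result{RequeueAfter: time.Minute}, nil
}
```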
@alvaroaleman
Member

That looks to me like your apiserver is being restarted, which results in a temporary disconnect. I am not sure I understand what controller-runtime can or should do about that?

@tiraboschi
Author

Yes, I'm pretty sure that the API server got restarted due to a change in its TLS configuration.
On the other hand, we are watching the APIServer CR on OpenShift and the change event there got completely lost, so we will probably only get it resynced after N hours (if I'm not wrong, the resync default is 10 hours).
I'm absolutely fine with the idea of a level-based controller, but in this case, due to the error on the informer, we completely missed the level change.
I don't really know the internals, but I'd expect some better error handling once the API server is back online.

@alvaroaleman
Member

So you are saying the restart of the apiserver causes you to miss events? Can you provide a reproducer for this please?

Also, we just re-use the upstream informers in our cache, and this seems like a bug in the informer. Would you mind reporting it on kubernetes/kubernetes?

kubevirt-bot pushed a commit to kubevirt/hyperconverged-cluster-operator that referenced this issue Nov 2, 2022
kubevirt-bot pushed a commit to kubevirt-bot/hyperconverged-cluster-operator that referenced this issue Nov 2, 2022
kubevirt-bot added a commit to kubevirt/hyperconverged-cluster-operator that referenced this issue Nov 7, 2022
@alvaroaleman
Member

It seems this isn't reproducible without an OpenShift cluster. If it does indeed happen, it is a bug in the underlying informer that we just import from client-go. Please create an issue in upstream kubernetes/kubernetes if you continue seeing this.
