
Error upgrading to v0.32.0 #646

Closed
clementblaise opened this issue Apr 3, 2023 · 6 comments · Fixed by #648
Labels
bug (Something isn't working), needs:triage

Comments

@clementblaise

What happened?

I upgraded from v0.31.0 to v0.32.0, and the controller shuts down after a couple of minutes. The Pod ends up in CrashLoopBackOff:

2023-04-03T05:32:55Z    INFO    Wait completed, proceeding to shutdown the manager                                                                                                                                                                                                      
provider: error: Cannot start controller manager: failed to wait for managed/acmpca.aws.upbound.io/v1beta1, kind=certificateauthority caches to sync: timed out waiting for cache to be synced

On the initial sync I am seeing the following permission errors:

W0403 04:57:09.398062       1 reflector.go:424] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: failed to list *v1beta1.Fleet: fleet.gamelift.aws.upbound.io is forbidden: User "system:serviceaccount:crossplane-system:provider-aws-ab8a0c8dcadd" cannot list resource "fleet" in API group "gamelift.aws.upbound.io" at the cluster scope                                                                                                                                                                                                                            
E0403 04:57:09.398094       1 reflector.go:140] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: Failed to watch *v1beta1.Fleet: failed to list *v1beta1.Fleet: fleet.gamelift.aws.upbound.io is forbidden: User "system:serviceaccount:crossplane-system:provider-aws-ab8a0c8dcadd" cannot list resource "fleet" in API group "gamelift.aws.upbound.io" at the cluster scope                                                                                                                                                                                            
W0403 04:57:09.594464       1 reflector.go:424] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: failed to list *v1beta1.DeviceFleet: devicefleet.sagemaker.aws.upbound.io is forbidden: User "system:serviceaccount:crossplane-system:provider-aws-ab8a0c8dcadd" cannot list resource "devicefleet" in API group "sagemaker.aws.upbound.io" at the cluster scope                                                                                                                                                                                                        
E0403 04:57:09.594497       1 reflector.go:140] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: Failed to watch *v1beta1.DeviceFleet: failed to list *v1beta1.DeviceFleet: devicefleet.sagemaker.aws.upbound.io is forbidden: User "system:serviceaccount:crossplane-system:provider-aws-ab8a0c8dcadd" cannot list resource "devicefleet" in API group "sagemaker.aws.upbound.io" at the cluster scope

I am wondering whether these permission errors trigger the timeout and shut down the controller.
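
For what it's worth, the missing permissions can be confirmed directly against the provider's ServiceAccount (resource and ServiceAccount names taken from the errors above); both checks return "no" while the errors are occurring:

kubectl auth can-i list fleet.gamelift.aws.upbound.io \
  --as=system:serviceaccount:crossplane-system:provider-aws-ab8a0c8dcadd
kubectl auth can-i list devicefleet.sagemaker.aws.upbound.io \
  --as=system:serviceaccount:crossplane-system:provider-aws-ab8a0c8dcadd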

The ClusterRole has been modified between the two versions:

4,5c4,5
<   creationTimestamp: "2023-03-13T08:10:45Z"
<   name: crossplane:provider:provider-aws-90b53987357c:system
---
>   creationTimestamp: "2023-04-03T03:45:22Z"
>   name: crossplane:provider:provider-aws-ab8a0c8dcadd:system
11,14c11,14
<     name: provider-aws-90b53987357c
<     uid: 1fcc7933-be96-4579-9c37-c39039b106cd
<   resourceVersion: "61729946"
<   uid: f213651a-6765-49b4-a192-55e09c1840e0
---
>     name: provider-aws-ab8a0c8dcadd
>     uid: 73ae4e94-420f-4bba-9383-aea37b8aa5c8
>   resourceVersion: "76774061"
>   uid: b9c4a0a3-5548-4c3c-a7b5-633f58ff02b0
124a125,126
>   - replicationconfigurations
>   - replicationconfigurations/status
449,450c451,452
<   - fleet
<   - fleet/status
---
>   - fleets
>   - fleets/status
881,882c883,884
<   - devicefleet
<   - devicefleet/status
---
>   - devicefleets
>   - devicefleets/status
978a981,992
>   - emrserverless.aws.upbound.io
>   resources:
>   - applications
>   - applications/status
>   verbs:
>   - get
>   - list
>   - watch
>   - update
>   - patch
>   - create
> - apiGroups:
1782a1797,1798
>   - s3endpoints
>   - s3endpoints/status
2138a2155,2156
>   - shareddirectories
>   - shareddirectories/status
2206a2225,2226
>   - tablereplicas
>   - tablereplicas/status
2423,2424c2443,2444
<   - fleet
<   - fleet/status
---
>   - fleets
>   - fleets/status
3137c3157
<   - cloudcontrol.aws.upbound.io
---
>   - ram.aws.upbound.io
3139,3140c3159,3162
<   - resources
<   - resources/status
---
>   - resourceassociations
>   - resourceassociations/status
>   - resourceshares
>   - resourceshares/status
3149c3171
<   - ram.aws.upbound.io
---
>   - cloudcontrol.aws.upbound.io
3151,3152c3173,3174
<   - resourceshares
<   - resourceshares/status
---
>   - resources
>   - resources/status
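
For reference, a diff like the one above can be produced by dumping the ClusterRole of each provider revision while both still exist in the cluster (or from a copy saved before the upgrade); the ClusterRole names below are the ones from the diff:

kubectl get clusterrole crossplane:provider:provider-aws-90b53987357c:system -o yaml > clusterrole-v0.31.0.yaml
kubectl get clusterrole crossplane:provider:provider-aws-ab8a0c8dcadd:system -o yaml > clusterrole-v0.32.0.yaml
diff clusterrole-v0.31.0.yaml clusterrole-v0.32.0.yaml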

How can we reproduce it?

Upgrade from v0.31.0 to v0.32.0

What environment did it happen in?

  • Crossplane Version: 1.11.0
  • Provider Version: 0.32.0
  • Kubernetes Version: v1.24.10
  • Kubernetes Distribution: EKS
clementblaise added the bug (Something isn't working) and needs:triage labels on Apr 3, 2023
clementblaise changed the title from "Error running v0.32.0" to "Error upgrading to v0.32.0" on Apr 3, 2023
ulucinar (Collaborator) commented Apr 3, 2023

Hi @clementblaise,
Thank you for reporting this issue. Could you please share the Crossplane RBAC manager logs (from the crossplane-rbac-manager Kubernetes deployment) and its status while you are trying to upgrade from provider-aws@v0.31.0 to provider-aws@v0.32.0? I suspect that if, for some reason, the RBAC manager cannot do its job and prepare the relevant RBAC rules, we would get those authorization errors. I also agree that the shutdown of the provider's controller manager is due to the timeouts in the individual controllers, which in turn look to be caused by the RBAC issues, possibly on the RBAC manager side.

Another question: how long did you wait before downgrading back to v0.31.0? If this really is an issue with the RBAC manager, I am trying to understand whether it would eventually prepare the RBAC rules for the new version of the provider if we waited longer.
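
Something along these lines should capture what I am after (adjust the namespace to wherever Crossplane is installed, e.g. crossplane-system):

kubectl -n crossplane-system get deployment crossplane-rbac-manager
kubectl -n crossplane-system get pods
kubectl -n crossplane-system logs deployment/crossplane-rbac-manager --tail=200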

@project-administrator

Here are my answers:

  1. The RBAC manager is in a Running state and it's pretty quiet (do I need to enable debug logging for it?):
> k logs crossplane-rbac-manager-858886f4dc-mbs86
Defaulted container "crossplane" out of: crossplane, crossplane-init (init)
I0403 10:47:39.175079       1 request.go:690] Waited for 1.01958716s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/wafv2.aws.upbound.io/v1beta1?timeout=32s
I0403 10:47:41.792466       1 leaderelection.go:248] attempting to acquire leader lease operators/crossplane-leader-election-rbac...
I0403 10:47:57.627201       1 leaderelection.go:258] successfully acquired lease operators/crossplane-leader-election-rbac
  1. "how long did you wait before downgrading back to v0.31.0" - At least 30 minutes. It's already running with --max-reconcile-rate=200 which helped me resolve some issues during the previous upgrade. Increasing this even further to 2000 does not seem to help in any way.

ulucinar (Collaborator) commented Apr 3, 2023

Hi @project-administrator,
Are these the RBAC manager logs from while the v0.32.0 provider-aws pod was crashing, and are there any restarts of the RBAC manager pod (e.g., what does kubectl get pods show for that namespace)?

We can also try enabling debug logs by passing -d to the RBAC manager, with something similar to the following in its deployment (crossplane-rbac-manager):

...
    spec:
      containers:
      - args:
        - rbac
        - start
        - --manage=Basic
        - --provider-clusterrole=crossplane:allowed-provider-permissions
        - -d
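
The same change can also be applied in place with a JSON patch, e.g. (adjust the namespace to where the RBAC manager runs):

kubectl -n crossplane-system patch deployment crossplane-rbac-manager \
  --type=json \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "-d"}]'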

ulucinar (Collaborator) commented Apr 3, 2023

Another observation (that's also captured in the diff given in the description) is that the plural resource name of Fleet.gamelift has changed from fleet in v0.31.0 to fleets in v0.32.0.

This results in two CRDs being installed in the cluster after upgrading to v0.32.0:

❯ k get crds | grep 'fleet.*gamelift.aws.upbound.io'
fleet.gamelift.aws.upbound.io                                             2023-04-03T10:44:47Z
fleets.gamelift.aws.upbound.io                                            2023-04-03T11:02:32Z

In my case the controller manager was able to start the controllers:

...
❯ grep gamelift.aws.upbound.io provider-aws-logs_v0.32.0.txt | grep -i fleet
2023-04-03T11:02:53Z    INFO    Starting EventSource    {"controller": "managed/gamelift.aws.upbound.io/v1beta1, kind=fleet", "controllerGroup": "gamelift.aws.upbound.io", "controllerKind": "Fleet", "source": "kind source: *v1beta1.Fleet"}
2023-04-03T11:02:53Z    INFO    Starting Controller     {"controller": "managed/gamelift.aws.upbound.io/v1beta1, kind=fleet", "controllerGroup": "gamelift.aws.upbound.io", "controllerKind": "Fleet"}
2023-04-03T11:02:53Z    INFO    Starting workers        {"controller": "managed/gamelift.aws.upbound.io/v1beta1, kind=fleet", "controllerGroup": "gamelift.aws.upbound.io", "controllerKind": "Fleet", "worker count": 10}
...

But I think this is not a good situation:

status:
...
  conditions:
  - lastTransitionTime: "2023-04-03T11:02:32Z"
    message: '"FleetList" is already in use'
    reason: ListKindConflict
    status: "False"
    type: NamesAccepted
  - lastTransitionTime: "2023-04-03T11:02:32Z"
    message: not all names are accepted
    reason: NotAccepted
    status: "False"
    type: Established
  storedVersions:
  - v1beta1

These status conditions belong to the fleets.gamelift.aws.upbound.io CRD from v0.32.0 of the provider. Although the list kind has been rejected and the CRD was not successfully established, the provider revision is healthy:

❯ k get providerrevisions
NAME                        HEALTHY   REVISION   IMAGE                                          STATE      DEP-FOUND   DEP-INSTALLED   AGE
provider-aws-90b53987357c   True      1          xpkg.upbound.io/upbound/provider-aws:v0.31.0   Inactive                               163m
provider-aws-ab8a0c8dcadd   True      2          xpkg.upbound.io/upbound/provider-aws:v0.32.0   Active                                 146m
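
For reference, the conditions quoted above can be checked directly on the new CRD; the wait below times out because the CRD never becomes Established:

❯ k get crd fleets.gamelift.aws.upbound.io -o yaml
❯ k wait --for=condition=Established crd/fleets.gamelift.aws.upbound.io --timeout=30s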

ulucinar (Collaborator) commented Apr 3, 2023

This issue could be related to:
kubernetes-sigs/controller-tools#660

ulucinar added a commit to ulucinar/upbound-provider-aws that referenced this issue Apr 3, 2023
- Please see: crossplane-contrib#646

Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
ulucinar added a commit to ulucinar/upbound-provider-aws that referenced this issue Apr 3, 2023
- Please see: crossplane-contrib#646

Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
github-actions bot pushed a commit that referenced this issue Apr 3, 2023
- Please see: #646

Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
(cherry picked from commit de90966)
ulucinar (Collaborator) commented Apr 3, 2023

Thank you @clementblaise & @project-administrator for reporting this issue and for the great feedback and help. It turns out that an old clusterrolebinding to the cluster-role role in my test cluster was hiding this issue from me. @sergenyalcin has successfully validated v0.32.1 on a separate cluster and upgrades from v0.31.0 to v0.32.1 should now be fine.

Managed resources whose kind names end in Fleet were affected by this issue in v0.32.0. The plural names of these three resources (Fleet.gamelift, Fleet.appstream, and DeviceFleet.sagemaker) acquired an "s" due to the controller-tools PR mentioned above. This is a breaking change that we cannot currently handle automatically (i.e., without manual intervention), so we have reverted these plural name changes for now.
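
Concretely, the difference is in the generated CRD's spec.names; for Fleet.gamelift it looks roughly like this (abridged):

# v0.31.0 (and again after the revert in v0.32.1):
  names:
    kind: Fleet
    listKind: FleetList
    plural: fleet        # CRD name: fleet.gamelift.aws.upbound.io
    singular: fleet
# v0.32.0 instead generated plural: fleets, i.e. a new CRD named fleets.gamelift.aws.upbound.io,
# while the existing fleet.gamelift.aws.upbound.io CRD stayed behind.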

We are planning to extend crddiff (a tool capable of detecting breaking spec.forProvider API changes in managed resources) to also check for path changes. We are also planning to add some upgrade tests in the provider repos.
