
Error upgrading to v0.32.0 #646

Closed
clementblaise opened this issue Apr 3, 2023 · 6 comments · Fixed by #648
Labels
bug (Something isn't working), needs:triage

Comments

@clementblaise

What happened?

I upgraded from v0.31.0 to v0.32.0, and the controller shuts down after a couple of minutes. The Pod ends up in CrashLoopBackOff:

2023-04-03T05:32:55Z    INFO    Wait completed, proceeding to shutdown the manager                                                                                                                                                                                                      
provider: error: Cannot start controller manager: failed to wait for managed/acmpca.aws.upbound.io/v1beta1, kind=certificateauthority caches to sync: timed out waiting for cache to be synced

On the initial sync I am seeing the following permission errors:

W0403 04:57:09.398062       1 reflector.go:424] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: failed to list *v1beta1.Fleet: fleet.gamelift.aws.upbound.io is forbidden: User "system:serviceaccount:crossplane-system:provider-aws-ab8a0c8dcadd" cannot list resource "fleet" in API group "gamelift.aws.upbound.io" at the cluster scope                                                                                                                                                                                                                            
E0403 04:57:09.398094       1 reflector.go:140] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: Failed to watch *v1beta1.Fleet: failed to list *v1beta1.Fleet: fleet.gamelift.aws.upbound.io is forbidden: User "system:serviceaccount:crossplane-system:provider-aws-ab8a0c8dcadd" cannot list resource "fleet" in API group "gamelift.aws.upbound.io" at the cluster scope                                                                                                                                                                                            
W0403 04:57:09.594464       1 reflector.go:424] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: failed to list *v1beta1.DeviceFleet: devicefleet.sagemaker.aws.upbound.io is forbidden: User "system:serviceaccount:crossplane-system:provider-aws-ab8a0c8dcadd" cannot list resource "devicefleet" in API group "sagemaker.aws.upbound.io" at the cluster scope                                                                                                                                                                                                        
E0403 04:57:09.594497       1 reflector.go:140] k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: Failed to watch *v1beta1.DeviceFleet: failed to list *v1beta1.DeviceFleet: devicefleet.sagemaker.aws.upbound.io is forbidden: User "system:serviceaccount:crossplane-system:provider-aws-ab8a0c8dcadd" cannot list resource "devicefleet" in API group "sagemaker.aws.upbound.io" at the cluster scope

I am wondering whether these permission errors trigger the timeout and shut down the controller.
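
For what it's worth, the missing permissions can be confirmed directly against the provider's ServiceAccount (resource and ServiceAccount names taken from the errors above); both checks return "no" while the errors are occurring:

kubectl auth can-i list fleet.gamelift.aws.upbound.io \
  --as=system:serviceaccount:crossplane-system:provider-aws-ab8a0c8dcadd
kubectl auth can-i list devicefleet.sagemaker.aws.upbound.io \
  --as=system:serviceaccount:crossplane-system:provider-aws-ab8a0c8dcadd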

The ClusterRole has been modified between the two versions:

4,5c4,5
<   creationTimestamp: "2023-03-13T08:10:45Z"
<   name: crossplane:provider:provider-aws-90b53987357c:system
---
>   creationTimestamp: "2023-04-03T03:45:22Z"
>   name: crossplane:provider:provider-aws-ab8a0c8dcadd:system
11,14c11,14
<     name: provider-aws-90b53987357c
<     uid: 1fcc7933-be96-4579-9c37-c39039b106cd
<   resourceVersion: "61729946"
<   uid: f213651a-6765-49b4-a192-55e09c1840e0
---
>     name: provider-aws-ab8a0c8dcadd
>     uid: 73ae4e94-420f-4bba-9383-aea37b8aa5c8
>   resourceVersion: "76774061"
>   uid: b9c4a0a3-5548-4c3c-a7b5-633f58ff02b0
124a125,126
>   - replicationconfigurations
>   - replicationconfigurations/status
449,450c451,452
<   - fleet
<   - fleet/status
---
>   - fleets
>   - fleets/status
881,882c883,884
<   - devicefleet
<   - devicefleet/status
---
>   - devicefleets
>   - devicefleets/status
978a981,992
>   - emrserverless.aws.upbound.io
>   resources:
>   - applications
>   - applications/status
>   verbs:
>   - get
>   - list
>   - watch
>   - update
>   - patch
>   - create
> - apiGroups:
1782a1797,1798
>   - s3endpoints
>   - s3endpoints/status
2138a2155,2156
>   - shareddirectories
>   - shareddirectories/status
2206a2225,2226
>   - tablereplicas
>   - tablereplicas/status
2423,2424c2443,2444
<   - fleet
<   - fleet/status
---
>   - fleets
>   - fleets/status
3137c3157
<   - cloudcontrol.aws.upbound.io
---
>   - ram.aws.upbound.io
3139,3140c3159,3162
<   - resources
<   - resources/status
---
>   - resourceassociations
>   - resourceassociations/status
>   - resourceshares
>   - resourceshares/status
3149c3171
<   - ram.aws.upbound.io
---
>   - cloudcontrol.aws.upbound.io
3151,3152c3173,3174
<   - resourceshares
<   - resourceshares/status
---
>   - resources
>   - resources/status
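
For reference, a diff like the one above can be produced by dumping the ClusterRole of each provider revision while both still exist in the cluster (or from a copy saved before the upgrade); the ClusterRole names below are the ones from the diff:

kubectl get clusterrole crossplane:provider:provider-aws-90b53987357c:system -o yaml > clusterrole-v0.31.0.yaml
kubectl get clusterrole crossplane:provider:provider-aws-ab8a0c8dcadd:system -o yaml > clusterrole-v0.32.0.yaml
diff clusterrole-v0.31.0.yaml clusterrole-v0.32.0.yaml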

How can we reproduce it?

Upgrade from v0.31.0 to v0.32.0

What environment did it happen in?

  • Crossplane Version: 1.11.0
  • Provider Version: 0.32.0
  • Kubernetes Version: v1.24.10
  • Kubernetes Distribution: EKS
clementblaise added the bug (Something isn't working) and needs:triage labels on Apr 3, 2023
clementblaise changed the title from "Error running v0.32.0" to "Error upgrading to v0.32.0" on Apr 3, 2023
ulucinar (Collaborator) commented Apr 3, 2023

Hi @clementblaise,
Thank you for reporting this issue. Could you please share the Crossplane RBAC manager logs (from the crossplane-rbac-manager Kubernetes deployment) and its status while you are trying to upgrade from provider-aws@v0.31.0 to provider-aws@v0.32.0? I suspect that if, for some reason, the RBAC manager cannot do its job and prepare the relevant RBAC rules, we would get those authorization errors. I also agree that the shutdown of the provider's controller manager is due to the timeouts in the individual controllers, which in turn look to be caused by the RBAC issues, possibly on the RBAC manager side.

Another question: how long did you wait before downgrading back to v0.31.0? If this really is an issue with the RBAC manager, I am trying to understand whether it would eventually prepare the RBAC rules for the new version of the provider if we waited longer.
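
Something along these lines should capture what I am after (adjust the namespace to wherever Crossplane is installed, e.g. crossplane-system):

kubectl -n crossplane-system get deployment crossplane-rbac-manager
kubectl -n crossplane-system get pods
kubectl -n crossplane-system logs deployment/crossplane-rbac-manager --tail=200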

@project-administrator

Here are my answers:

  1. The RBAC manager is in a Running state and it's pretty quiet (do I need to enable debug logging for it?):
> k logs crossplane-rbac-manager-858886f4dc-mbs86
Defaulted container "crossplane" out of: crossplane, crossplane-init (init)
I0403 10:47:39.175079       1 request.go:690] Waited for 1.01958716s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/wafv2.aws.upbound.io/v1beta1?timeout=32s
I0403 10:47:41.792466       1 leaderelection.go:248] attempting to acquire leader lease operators/crossplane-leader-election-rbac...
I0403 10:47:57.627201       1 leaderelection.go:258] successfully acquired lease operators/crossplane-leader-election-rbac
  1. "how long did you wait before downgrading back to v0.31.0" - At least 30 minutes. It's already running with --max-reconcile-rate=200 which helped me resolve some issues during the previous upgrade. Increasing this even further to 2000 does not seem to help in any way.

ulucinar (Collaborator) commented Apr 3, 2023

Hi @project-administrator,
Are these the RBAC manager logs from while the v0.32.0 provider-aws pod was crashing, and are there any restarts of the RBAC manager pod (e.g., what does kubectl get pods show for that namespace)?

We can also try enabling debug logs by passing -d to the RBAC manager, with something similar to the following in its deployment (crossplane-rbac-manager):

...
    spec:
      containers:
      - args:
        - rbac
        - start
        - --manage=Basic
        - --provider-clusterrole=crossplane:allowed-provider-permissions
        - -d
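
The same change can also be applied in place with a JSON patch, e.g. (adjust the namespace to where the RBAC manager runs):

kubectl -n crossplane-system patch deployment crossplane-rbac-manager \
  --type=json \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "-d"}]'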

ulucinar (Collaborator) commented Apr 3, 2023

Another observation (that's also captured in the diff given in the description) is that the plural resource name of Fleet.gamelift has changed from fleet in v0.31.0 to fleets in v0.32.0.

This results in two CRDs being installed in the cluster after upgrading to v0.32.0:

❯ k get crds | grep 'fleet.*gamelift.aws.upbound.io'
fleet.gamelift.aws.upbound.io                                             2023-04-03T10:44:47Z
fleets.gamelift.aws.upbound.io                                            2023-04-03T11:02:32Z

In my case the controller manager was able to start the controllers:

...
❯ grep gamelift.aws.upbound.io provider-aws-logs_v0.32.0.txt | grep -i fleet
2023-04-03T11:02:53Z    INFO    Starting EventSource    {"controller": "managed/gamelift.aws.upbound.io/v1beta1, kind=fleet", "controllerGroup": "gamelift.aws.upbound.io", "controllerKind": "Fleet", "source": "kind source: *v1beta1.Fleet"}
2023-04-03T11:02:53Z    INFO    Starting Controller     {"controller": "managed/gamelift.aws.upbound.io/v1beta1, kind=fleet", "controllerGroup": "gamelift.aws.upbound.io", "controllerKind": "Fleet"}
2023-04-03T11:02:53Z    INFO    Starting workers        {"controller": "managed/gamelift.aws.upbound.io/v1beta1, kind=fleet", "controllerGroup": "gamelift.aws.upbound.io", "controllerKind": "Fleet", "worker count": 10}
...

But I think this is not a good situation:

status:
...
  conditions:
  - lastTransitionTime: "2023-04-03T11:02:32Z"
    message: '"FleetList" is already in use'
    reason: ListKindConflict
    status: "False"
    type: NamesAccepted
  - lastTransitionTime: "2023-04-03T11:02:32Z"
    message: not all names are accepted
    reason: NotAccepted
    status: "False"
    type: Established
  storedVersions:
  - v1beta1

These status conditions belong to the fleets.gamelift.aws.upbound.io CRD from v0.32.0 of the provider. Although the list kind has been rejected and the CRD was not successfully established, the provider revision is healthy:

❯ k get providerrevisions
NAME                        HEALTHY   REVISION   IMAGE                                          STATE      DEP-FOUND   DEP-INSTALLED   AGE
provider-aws-90b53987357c   True      1          xpkg.upbound.io/upbound/provider-aws:v0.31.0   Inactive                               163m
provider-aws-ab8a0c8dcadd   True      2          xpkg.upbound.io/upbound/provider-aws:v0.32.0   Active                                 146m
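
For reference, the conditions quoted above can be checked directly on the new CRD; the wait below times out because the CRD never becomes Established:

❯ k get crd fleets.gamelift.aws.upbound.io -o yaml
❯ k wait --for=condition=Established crd/fleets.gamelift.aws.upbound.io --timeout=30s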

ulucinar (Collaborator) commented Apr 3, 2023

This issue could be related to:
kubernetes-sigs/controller-tools#660

ulucinar added a commit to ulucinar/upbound-provider-aws that referenced this issue Apr 3, 2023
- Please see: crossplane-contrib#646

Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
ulucinar added a commit to ulucinar/upbound-provider-aws that referenced this issue Apr 3, 2023
- Please see: crossplane-contrib#646

Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
github-actions bot pushed a commit that referenced this issue Apr 3, 2023
- Please see: #646

Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
(cherry picked from commit de90966)
ulucinar (Collaborator) commented Apr 3, 2023

Thank you @clementblaise & @project-administrator for reporting this issue and for the great feedback and help. It turns out that an old clusterrolebinding to the cluster-role role in my test cluster was hiding this issue from me. @sergenyalcin has successfully validated v0.32.1 on a separate cluster and upgrades from v0.31.0 to v0.32.1 should now be fine.

Managed resources whose kind names end in Fleet were affected by this issue in v0.32.0. The plural names of these three resources (Fleet.gamelift, Fleet.appstream, and DeviceFleet.sagemaker) acquired an "s" due to the controller-tools PR mentioned above. This is a breaking change that we cannot currently handle automatically (i.e., without manual intervention), so we have reverted these plural name changes for now.
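
Concretely, the difference is in the generated CRD's spec.names; for Fleet.gamelift it looks roughly like this (abridged):

# v0.31.0 (and again after the revert in v0.32.1):
  names:
    kind: Fleet
    listKind: FleetList
    plural: fleet        # CRD name: fleet.gamelift.aws.upbound.io
    singular: fleet
# v0.32.0 instead generated plural: fleets, i.e. a new CRD named fleets.gamelift.aws.upbound.io,
# while the existing fleet.gamelift.aws.upbound.io CRD stayed behind.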

We are planning to extend crddiff (a tool capable of detecting breaking spec.forProvider API changes in managed resources) to also check for path changes. We are also planning to add some upgrade tests in the provider repos.
