Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition between reading the ack-role-account-map ConfigMap and resyncing breaks CARM #2011

Closed
nampnguyen opened this issue Feb 9, 2024 · 15 comments
Assignees
Labels
area/carm Issues or PRs related to CARM (Cross Account Resource Management) area/runtime Issues or PRs as related to controller runtime, common reconciliation logic, etc kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@nampnguyen
Copy link

Describe the bug
When a controller starts while using CARM, particularly if the pod has a CPU limit specified (~250m), the account config map can sometimes be read after reconciliation starts. When this happens, the controller no longer assumes the appropriate cross account role and results in AccessDenied errors because the IRSA role does not have permissions to resources in other accounts.

The problem seems to be getting worse over time and may be related to the number of namespaces.

Even after the ack-role-account-map is read, the controller does not use the correct cross account role.

Debug logs (with some redactions):

{"level":"info","ts":"2024-02-09T19:54:29.148Z","logger":"setup","msg":"initializing service controller","aws.service":"dynamodb"}
{"level":"debug","ts":"2024-02-09T19:54:29.148Z","logger":"cache.namespace","msg":"Starting namespace cache","watchScope":[],"ignored":["kube-system","kube-public","kube-node-lease"]}
{"level":"debug","ts":"2024-02-09T19:54:29.158Z","logger":"cache.namespace","msg":"created namespace","name":"namespace-1"}
{"level":"debug","ts":"2024-02-09T19:54:29.158Z","logger":"cache.namespace","msg":"created namespace","name":"namespace-2"}
...
{"level":"debug","ts":"2024-02-09T19:54:29.158Z","logger":"cache.namespace","msg":"created namespace","name":"namespace-39"}
{"level":"debug","ts":"2024-02-09T19:54:29.158Z","logger":"cache.namespace","msg":"created namespace","name":"namespace-40"}
{"level":"debug","ts":"2024-02-09T19:54:29.170Z","logger":"ackrt","msg":"Initiating reconciler","reconciler kind":"Backup","resync period seconds":36000}
{"level":"debug","ts":"2024-02-09T19:54:29.170Z","logger":"ackrt","msg":"Initiating reconciler","reconciler kind":"GlobalTable","resync period seconds":36000}
{"level":"debug","ts":"2024-02-09T19:54:29.170Z","logger":"ackrt","msg":"Initiating reconciler","reconciler kind":"Table","resync period seconds":36000}
{"level":"info","ts":"2024-02-09T19:54:29.171Z","logger":"setup","msg":"starting manager","aws.service":"dynamodb"}
{"level":"info","ts":"2024-02-09T19:54:29.171Z","logger":"controller-runtime.metrics","msg":"Starting metrics server"}
...
{"level":"info","ts":"2024-02-09T19:54:29.171Z","msg":"Starting Controller","controller":"table","controllerGroup":"dynamodb.services.k8s.aws","controllerKind":"Table"}
{"level":"info","ts":"2024-02-09T19:54:29.273Z","msg":"Starting workers","controller":"adoptedresource","controllerGroup":"services.k8s.aws","controllerKind":"AdoptedResource","worker count":1}
{"level":"info","ts":"2024-02-09T19:54:29.274Z","msg":"Starting workers","controller":"table","controllerGroup":"dynamodb.services.k8s.aws","controllerKind":"Table","worker count":1}
{"level":"info","ts":"2024-02-09T19:54:29.274Z","msg":"Starting workers","controller":"globaltable","controllerGroup":"dynamodb.services.k8s.aws","controllerKind":"GlobalTable","worker count":1}
{"level":"info","ts":"2024-02-09T19:54:29.274Z","msg":"Starting workers","controller":"globaltable","controllerGroup":"dynamodb.services.k8s.aws","controllerKind":"GlobalTable","worker count":1}
{"level":"info","ts":"2024-02-09T19:54:29.274Z","msg":"Starting workers","controller":"fieldexport","controllerGroup":"services.k8s.aws","controllerKind":"FieldExport","worker count":1}
{"level":"info","ts":"2024-02-09T19:54:29.274Z","msg":"Starting workers","controller":"table","controllerGroup":"dynamodb.services.k8s.aws","controllerKind":"Table","worker count":1}
{"level":"info","ts":"2024-02-09T19:54:29.274Z","msg":"Starting workers","controller":"backup","controllerGroup":"dynamodb.services.k8s.aws","controllerKind":"Backup","worker count":1}
{"level":"info","ts":"2024-02-09T19:54:29.274Z","msg":"Starting workers","controller":"backup","controllerGroup":"dynamodb.services.k8s.aws","controllerKind":"Backup","worker count":1}
{"level":"debug","ts":"2024-02-09T19:54:29.277Z","logger":"exporter.field-export-reconciler","msg":"error did not need requeue","error":"the source resource is not synced yet"}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":"> r.Sync","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":">> r.resetConditions","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":"<< r.resetConditions","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":">> rm.ResolveReferences","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":"<< rm.ResolveReferences","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":">> rm.EnsureTags","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":"<< rm.EnsureTags","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":">> rm.ReadOne","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.279Z","logger":"ackrt","msg":">>> rm.sdkFind","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3}
{"level":"debug","ts":"2024-02-09T19:54:29.311Z","logger":"cache.account","msg":"created account config map","name":"ack-role-account-map"}
{"level":"debug","ts":"2024-02-09T19:54:29.349Z","logger":"ackrt","msg":"<<< rm.sdkFind","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3,"error":"AccessDeniedException: User: arn:aws:sts::111111111111:assumed-role/IRSARole/1707508469279464158 is not authorized to perform: dynamodb:DescribeTable on resource: arn:aws:dynamodb:us-east-1:111111111111:table/myapp because no identity-based policy allows the dynamodb:DescribeTable action\n\tstatus code: 400, request id: MH8E64L3D7BR9OJRI11BQ75FBNVV4KQNSO5AEMVJF66Q9ASUAAJG"}
{"level":"debug","ts":"2024-02-09T19:54:29.349Z","logger":"ackrt","msg":"<< rm.ReadOne","account":"222222222222","role":"","region":"us-east-1","kind":"Table","namespace":"namespace-5","name":"myapp-dynamodb","is_adopted":false,"generation":3,"error":"AccessDeniedException: User: arn:aws:sts::111111111111:assumed-role/IRSARole/1707508469279464158 is not authorized to perform: dynamodb:DescribeTable on resource: arn:aws:dynamodb:us-east-1:111111111111:table/myapp because no identity-based policy allows the dynamodb:DescribeTable action\n\tstatus code: 400, request id: MH8E64L3D7BR9OJRI11BQ75FBNVV4KQNSO5AEMVJF66Q9ASUAAJG"}

Steps to reproduce

  1. Configure CARM with a large number of namespaces (~30-40) and a CPU limit on the controller pod
  2. Create an ACK resource
  3. Restart the controller pod to trigger reloading of the config map and reconciliation

Expected outcome
Resource resyncing to wait until after the ack-role-account-map ConfigMap is read or future resyncs pivots to the cross account role if the ConfigMap is read after resyncing starts.

Environment

  • Kubernetes version: 1.27
  • Using EKS (yes/no), if so version? Yes, 1.27
  • AWS service targeted (S3, RDS, etc.): Multiple (observed with DynamoDB, IAM, S3, SQS)
@a-hilaly a-hilaly added area/runtime Issues or PRs as related to controller runtime, common reconciliation logic, etc kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Feb 9, 2024
@a-hilaly a-hilaly self-assigned this Feb 9, 2024
@a-hilaly
Copy link
Member

@nampnguyen Is this happening only during controller startup, or have you observed it persisting even after multiple reconciliations?

We can change a few things in the ACK runtime to mitigate this:

  • On startup, we could wait the for CARM ShareInformers to sync, before spinning the reconcilers.
  • If a controllers is trying to use a CARM entry that wasn't populated/observed by the CARM sharedinfomers... we need to error and requeue (until it's properly propagated)

Quick note: We really need to start stress and load testing for ACK controllers, especially for components like CARM watchers.. i'm pretty sure there is a lot we can catch in there.

@nampnguyen
Copy link
Author

@a-hilaly In our experience, if the reconciler starts before the ack-role-account-map ConfigMap is read, the issue persists for the life of the controller. The only option is to restart the controller and roll the dice again.

When we originally saw this issue when we had a CPU limit of 250m, but then increased it to 2000m. After the limit increase, the frequency of reconciler errors went down significantly. We implemented a cronjob to check for reconciler errors and restarted the controller if that happened which took care of the few times it popped back up. However, since that time the number of namespaces and accounts used by ACK has grown and we now see some controllers continue to experience reconciler errors even after dozens of restarts.

ack-prow bot pushed a commit to aws-controllers-k8s/runtime that referenced this issue Feb 22, 2024
…tated namespaces (#138)

Addresses aws-controllers-k8s/community#2011

In certain scenarios, where a user deploys a resource to a namespace
annotated with a specific ownner accountID, a race condition was
identified between the reconciler and the CARM (Cross Account Resource
Management) `ConfigMap`. This race condition resulted in the controller
setting an empty roleARN, preventing the aws-sdk-go client from pivoting
(calling `STS::AssumeRole`) and managing resourecs in the correct
account. Instead, resources were inadvertently managed in the default
account instead of the namespace assigned account.

This issue stemmed from the initial implementation of the CARM feature,
where the method responsible for retrieving the accountID from the
cache, didn't not properly verify the existance and content of the CARM
configMap and instead returned an empty stringy when these conditions
were not satisfied. This led to selection of the default account (when an
empty `RoleARN` is returned )for resource management.

Although these scenarios are rare, they can occur in clusters with a
significantly high number of namespaces, causing a delay between
naemsapce/configmap events and the informer's event handlers.

This patch addresses the race issue by implementing two main things:
- Proper error propagation: an error is no propagated when a `ConfigMap`
  is missing or when an accountID entry is missing in the `ConfigMap`.
  This helps the reconciler make the right decision on how to handle
  these cases.
- Improved error handling: The reconciler now carefully handles these
  errors and requeues whenever a user has issued an
  owneraccountid-annotated namespace but the Configmap is not create or
  properly propagated.

Signed-off-by: Amine Hilaly <hilalyamine@gmail.com>

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
@a-hilaly
Copy link
Member

a-hilaly commented Feb 22, 2024

@nampnguyen we starting rolling out a fix for all the controllers - i'll ping here once dynamodb controller is patched.

ack-prow bot pushed a commit to aws-controllers-k8s/code-generator that referenced this issue Feb 22, 2024
Pin ACK runtime to v0.31.0. Mainly addressing two github issues:
- Healthz, liveness and readiness probes aws-controllers-k8s/community#2012
- CARM Race condition and scaling issues aws-controllers-k8s/community#2011

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
@nampnguyen
Copy link
Author

@a-hilaly Saw the controllers PRs that included the 0.31.0 runtime get closed, is the fix being re-worked? Thanks!

@a-hilaly
Copy link
Member

a-hilaly commented Mar 4, 2024

@nampnguyen Looks like we also needed aws-controllers-k8s/runtime@861f7ed - we're rolling a new patch today :)

@nampnguyen
Copy link
Author

@a-hilaly We tried the latest controllers using runtime 0.32.0, however are still seeing the issue. Debug logs from iam-controller 1.3.5:

{"level":"info","ts":"2024-03-11T13:47:53.770Z","logger":"setup","msg":"initializing service controller","aws.service":"iam"}
{"level":"debug","ts":"2024-03-11T13:47:53.770Z","logger":"cache.account","msg":"Starting shared informer for accounts cache","targetConfigMap":"ack-role-account-map"}
{"level":"debug","ts":"2024-03-11T13:47:53.770Z","logger":"cache.namespace","msg":"Starting namespace cache","watchScope":[],"ignored":["kube-system","kube-public","kube-node-lease"]}
{"level":"debug","ts":"2024-03-11T13:47:53.788Z","logger":"cache.namespace","msg":"created namespace","name":"namespace-1"}
{"level":"debug","ts":"2024-03-11T13:47:53.788Z","logger":"cache.namespace","msg":"created namespace","name":"namespace-13"}
{"level":"debug","ts":"2024-03-11T13:47:53.816Z","logger":"ackrt","msg":"Initiating reconciler","reconciler kind":"OpenIDConnectProvider","resync period seconds":36000}
{"level":"debug","ts":"2024-03-11T13:47:53.816Z","logger":"ackrt","msg":"Initiating reconciler","reconciler kind":"Policy","resync period seconds":36000}
{"level":"debug","ts":"2024-03-11T13:47:53.816Z","logger":"ackrt","msg":"Initiating reconciler","reconciler kind":"Role","resync period seconds":36000}
{"level":"debug","ts":"2024-03-11T13:47:53.817Z","logger":"ackrt","msg":"Initiating reconciler","reconciler kind":"User","resync period seconds":36000}
{"level":"debug","ts":"2024-03-11T13:47:53.817Z","logger":"ackrt","msg":"Initiating reconciler","reconciler kind":"Group","resync period seconds":36000}
{"level":"debug","ts":"2024-03-11T13:47:53.817Z","logger":"ackrt","msg":"Initiating reconciler","reconciler kind":"InstanceProfile","resync period seconds":36000}
{"level":"info","ts":"2024-03-11T13:47:53.817Z","logger":"setup","msg":"starting manager","aws.service":"iam"}
...
{"level":"info","ts":"2024-03-11T13:47:53.928Z","msg":"Starting workers","controller":"policy","controllerGroup":"iam.services.k8s.aws","controllerKind":"Policy","worker count":1}
{"level":"debug","ts":"2024-03-11T13:47:53.936Z","logger":"exporter.field-export-reconciler","msg":"error did not need requeue","error":"the source resource is not synced yet"}
{"level":"debug","ts":"2024-03-11T13:47:53.938Z","logger":"ackrt","msg":"> r.Sync","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","generation":4}
{"level":"debug","ts":"2024-03-11T13:47:53.938Z","logger":"ackrt","msg":">> r.resetConditions","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","generation":4}
{"level":"debug","ts":"2024-03-11T13:47:53.938Z","logger":"ackrt","msg":"<< r.resetConditions","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","generation":4}
{"level":"debug","ts":"2024-03-11T13:47:53.938Z","logger":"ackrt","msg":">> rm.ResolveReferences","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:53.938Z","logger":"ackrt","msg":"<< rm.ResolveReferences","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:53.938Z","logger":"ackrt","msg":">> rm.EnsureTags","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:53.938Z","logger":"ackrt","msg":"<< rm.EnsureTags","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:53.938Z","logger":"ackrt","msg":">> rm.ReadOne","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:53.938Z","logger":"ackrt","msg":">>> rm.sdkFind","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:53.944Z","logger":"exporter.field-export-reconciler","msg":"error did not need requeue","error":"the source resource is not synced yet"}
// Config map read here
{"level":"debug","ts":"2024-03-11T13:47:53.972Z","logger":"cache.account","msg":"created account config map","name":"ack-role-account-map"}
// Role is empty
{"level":"debug","ts":"2024-03-11T13:47:54.016Z","logger":"ackrt","msg":"<<< rm.sdkFind","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4,"error":"AccessDenied: User: arn:aws:sts::123456789012:assumed-role/ACKControlerIRSARole/1710164873938759450 is not authorized to perform: iam:GetRole on resource: role ACKManagedRole1 because no identity-based policy allows the iam:GetRole action\n\tstatus code: 403, request id: 899dd5aa-2b83-49b1-87cf-818457adce10"}
{"level":"debug","ts":"2024-03-11T13:47:54.016Z","logger":"ackrt","msg":"<< rm.ReadOne","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4,"error":"AccessDenied: User: arn:aws:sts::123456789012:assumed-role/ACKControlerIRSARole/1710164873938759450 is not authorized to perform: iam:GetRole on resource: role ACKManagedRole1 because no identity-based policy allows the iam:GetRole action\n\tstatus code: 403, request id: 899dd5aa-2b83-49b1-87cf-818457adce10"}
{"level":"debug","ts":"2024-03-11T13:47:54.016Z","logger":"ackrt","msg":">> r.ensureConditions","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.016Z","logger":"ackrt","msg":">>> rm.IsSynced","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.016Z","logger":"ackrt","msg":"<<< rm.IsSynced","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.016Z","logger":"ackrt","msg":"<< r.ensureConditions","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.016Z","logger":"ackrt","msg":"< r.Sync","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4,"error":"AccessDenied: User: arn:aws:sts::123456789012:assumed-role/ACKControlerIRSARole/1710164873938759450 is not authorized to perform: iam:GetRole on resource: role ACKManagedRole1 because no identity-based policy allows the iam:GetRole action\n\tstatus code: 403, request id: 899dd5aa-2b83-49b1-87cf-818457adce10"}
{"level":"debug","ts":"2024-03-11T13:47:54.016Z","logger":"ackrt","msg":"> r.patchResourceStatus","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.016Z","logger":"ackrt","msg":">> kc.Patch (status)","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.038Z","logger":"ackrt","msg":"patched resource status","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4,"json":"{\"metadata\":{\"resourceVersion\":\"462345426\"},\"spec\":{...},\"status\":{\"conditions\":[{\"message\":\"AccessDenied: User: arn:aws:sts::123456789012:assumed-role/ACKControlerIRSARole/1710164873938759450 is not authorized to perform: iam:GetRole on resource: role ACKManagedRole1 because no identity-based policy allows the iam:GetRole action\\n\\tstatus code: 403, request id: 899dd5aa-2b83-49b1-87cf-818457adce10\",\"status\":\"True\",\"type\":\"ACK.Recoverable\"},{\"lastTransitionTime\":\"2024-03-11T13:47:54Z\",\"message\":\"Unable to determine if desired resource state matches latest observed state\",\"reason\":\"AccessDenied: User: arn:aws:sts::123456789012:assumed-role/ACKControlerIRSARole/1710164873938759450 is not authorized to perform: iam:GetRole on resource: role ACKManagedRole1 because no identity-based policy allows the iam:GetRole action\\n\\tstatus code: 403, request id: 899dd5aa-2b83-49b1-87cf-818457adce10\",\"status\":\"Unknown\",\"type\":\"ACK.ResourceSynced\"}]}}"}
{"level":"debug","ts":"2024-03-11T13:47:54.038Z","logger":"ackrt","msg":"<< kc.Patch (status)","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.038Z","logger":"ackrt","msg":"< r.patchResourceStatus","kind":"Role","namespace":"namespace-1","name":"iam-role-1","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"error","ts":"2024-03-11T13:47:54.038Z","msg":"Reconciler error","controller":"role","controllerGroup":"iam.services.k8s.aws","controllerKind":"Role","Role":{"name":"iam-role-1","namespace":"namespace-1"},"namespace":"namespace-1","name":"iam-role-1","reconcileID":"9d3c3474-c71e-4a60-a544-efe1bfe759c9","error":"AccessDenied: User: arn:aws:sts::123456789012:assumed-role/ACKControlerIRSARole/1710164873938759450 is not authorized to perform: iam:GetRole on resource: role ACKManagedRole1 because no identity-based policy allows the iam:GetRole action\n\tstatus code: 403, request id: 899dd5aa-2b83-49b1-87cf-818457adce10","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:227"}
{"level":"debug","ts":"2024-03-11T13:47:54.044Z","logger":"exporter.field-export-reconciler","msg":"error did not need requeue","error":"the source resource is not synced yet"}
// Role is empty
{"level":"debug","ts":"2024-03-11T13:47:54.044Z","logger":"ackrt","msg":"> r.Sync","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.044Z","logger":"ackrt","msg":">> r.resetConditions","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.044Z","logger":"ackrt","msg":"<< r.resetConditions","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.044Z","logger":"ackrt","msg":">> rm.ResolveReferences","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.044Z","logger":"ackrt","msg":"<< rm.ResolveReferences","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.044Z","logger":"ackrt","msg":">> rm.EnsureTags","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.044Z","logger":"ackrt","msg":"<< rm.EnsureTags","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.044Z","logger":"ackrt","msg":">> rm.ReadOne","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.044Z","logger":"ackrt","msg":">>> rm.sdkFind","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.074Z","logger":"ackrt","msg":"<<< rm.sdkFind","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4,"error":"AccessDenied: User: arn:aws:sts::123456789012:assumed-role/ACKControlerIRSARole/1710164873938759450 is not authorized to perform: iam:GetRole on resource: role ACKManagedRole2 because no identity-based policy allows the iam:GetRole action\n\tstatus code: 403, request id: ad1e102f-2db9-488a-9dbe-6b1894a15626"}
{"level":"debug","ts":"2024-03-11T13:47:54.074Z","logger":"ackrt","msg":"<< rm.ReadOne","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4,"error":"AccessDenied: User: arn:aws:sts::123456789012:assumed-role/ACKControlerIRSARole/1710164873938759450 is not authorized to perform: iam:GetRole on resource: role ACKManagedRole2 because no identity-based policy allows the iam:GetRole action\n\tstatus code: 403, request id: ad1e102f-2db9-488a-9dbe-6b1894a15626"}
{"level":"debug","ts":"2024-03-11T13:47:54.074Z","logger":"ackrt","msg":">> r.ensureConditions","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.074Z","logger":"ackrt","msg":">>> rm.IsSynced","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.074Z","logger":"ackrt","msg":"<<< rm.IsSynced","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.074Z","logger":"ackrt","msg":"<< r.ensureConditions","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.074Z","logger":"ackrt","msg":"< r.Sync","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4,"error":"AccessDenied: User: arn:aws:sts::123456789012:assumed-role/ACKControlerIRSARole/1710164873938759450 is not authorized to perform: iam:GetRole on resource: role ACKManagedRole2 because no identity-based policy allows the iam:GetRole action\n\tstatus code: 403, request id: ad1e102f-2db9-488a-9dbe-6b1894a15626"}
{"level":"debug","ts":"2024-03-11T13:47:54.074Z","logger":"ackrt","msg":"> r.patchResourceStatus","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.074Z","logger":"ackrt","msg":">> kc.Patch (status)","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.091Z","logger":"ackrt","msg":"patched resource status","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4,"json":"{\"metadata\":{\"resourceVersion\":\"462345427\"},\"spec\":{...},\"status\":{\"conditions\":[{\"message\":\"AccessDenied: User: arn:aws:sts::123456789012:assumed-role/ACKControlerIRSARole/1710164873938759450 is not authorized to perform: iam:GetRole on resource: role ACKManagedRole2 because no identity-based policy allows the iam:GetRole action\\n\\tstatus code: 403, request id: ad1e102f-2db9-488a-9dbe-6b1894a15626\",\"status\":\"True\",\"type\":\"ACK.Recoverable\"},{\"lastTransitionTime\":\"2024-03-11T13:47:54Z\",\"message\":\"Unable to determine if desired resource state matches latest observed state\",\"reason\":\"AccessDenied: User: arn:aws:sts::123456789012:assumed-role/ACKControlerIRSARole/1710164873938759450 is not authorized to perform: iam:GetRole on resource: role ACKManagedRole2 because no identity-based policy allows the iam:GetRole action\\n\\tstatus code: 403, request id: ad1e102f-2db9-488a-9dbe-6b1894a15626\",\"status\":\"Unknown\",\"type\":\"ACK.ResourceSynced\"}]}}"}
{"level":"debug","ts":"2024-03-11T13:47:54.091Z","logger":"ackrt","msg":"<< kc.Patch (status)","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"debug","ts":"2024-03-11T13:47:54.091Z","logger":"ackrt","msg":"< r.patchResourceStatus","kind":"Role","namespace":"namespace-2","name":"iam-role-2","account":"123456789012","role":"","region":"us-east-1","is_adopted":false,"generation":4}
{"level":"error","ts":"2024-03-11T13:47:54.091Z","msg":"Reconciler error","controller":"role","controllerGroup":"iam.services.k8s.aws","controllerKind":"Role","Role":{"name":"iam-role-2","namespace":"namespace-2"},"namespace":"namespace-2","name":"iam-role-2","reconcileID":"58482afd-89b1-45eb-8630-724deacf338c","error":"AccessDenied: User: arn:aws:sts::123456789012:assumed-role/ACKControlerIRSARole/1710164873938759450 is not authorized to perform: iam:GetRole on resource: role ACKManagedRole2 because no identity-based policy allows the iam:GetRole action\n\tstatus code: 403, request id: ad1e102f-2db9-488a-9dbe-6b1894a15626","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:227"}

Even waiting several minutes, I'm not seeing the re-queuing behavior described in the PRs.

@a-hilaly
Copy link
Member

@nampnguyen I will try to reproduce locally, maybe i'm missing something in here...
qq: does the namespace have an owner-account-id annotation? the only case where controllers will skip the requeue is if the namespaces aren't annotated..

@a-hilaly a-hilaly added the area/carm Issues or PRs related to CARM (Cross Account Resource Management) label Mar 13, 2024
@nampnguyen
Copy link
Author

nampnguyen commented Mar 13, 2024

@a-hilaly Yes, the namespaces do have the services.k8s.aws/owner-account-id annotation. Looking at the fix, I think the issue we're currently experiencing may have been introduced by aws-controllers-k8s/runtime@24070d9

If I'm reading this correctly, this line here assumes that if the resource's Account ID is the same as the controller's IRSA account ID, then CARM is not used.

https://github.com/aws-controllers-k8s/runtime/blob/20e1c0d714a90e806d885082922fc3cc8d5358c1/pkg/runtime/reconciler.go#L1029

In the most recent logs I shared, the resource's Account ID and Controller account are in fact the same, but we still need the CARM role pivot. I think this can be fixed by always checking for the namespace annotation and if present use CARM.

@a-hilaly
Copy link
Member

@nampnguyen makes sense~ working on a fix! I think we might also want to enable CARM even in single-namespace watch mode.

If I'm reading this correctly, this line here assumes that if the resource's Account ID is the same as the controller's IRSA account ID, then CARM is not used.

Correct. Not sure, but this might have been the case even before.

ndbhat pushed a commit to ndbhat/ack-runtime that referenced this issue Apr 16, 2024
…tated namespaces (aws-controllers-k8s#138)

Addresses aws-controllers-k8s/community#2011

In certain scenarios, where a user deploys a resource to a namespace
annotated with a specific ownner accountID, a race condition was
identified between the reconciler and the CARM (Cross Account Resource
Management) `ConfigMap`. This race condition resulted in the controller
setting an empty roleARN, preventing the aws-sdk-go client from pivoting
(calling `STS::AssumeRole`) and managing resourecs in the correct
account. Instead, resources were inadvertently managed in the default
account instead of the namespace assigned account.

This issue stemmed from the initial implementation of the CARM feature,
where the method responsible for retrieving the accountID from the
cache, didn't not properly verify the existance and content of the CARM
configMap and instead returned an empty stringy when these conditions
were not satisfied. This led to selection of the default account (when an
empty `RoleARN` is returned )for resource management.

Although these scenarios are rare, they can occur in clusters with a
significantly high number of namespaces, causing a delay between
naemsapce/configmap events and the informer's event handlers.

This patch addresses the race issue by implementing two main things:
- Proper error propagation: an error is no propagated when a `ConfigMap`
  is missing or when an accountID entry is missing in the `ConfigMap`.
  This helps the reconciler make the right decision on how to handle
  these cases.
- Improved error handling: The reconciler now carefully handles these
  errors and requeues whenever a user has issued an
  owneraccountid-annotated namespace but the Configmap is not create or
  properly propagated.

Signed-off-by: Amine Hilaly <hilalyamine@gmail.com>

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
@mattzech
Copy link

@a-hilaly @nampnguyen We are also facing this exact issue when using most recent version of iam-controller. Glad to hear a fix is being worked on. Thanks!

@mumlawski
Copy link

@a-hilaly we do still face this same issue even after your pull request from April. Just wondering if someone is looking at this problem still?

@a-hilaly
Copy link
Member

@mattzech @mumlawski @nampnguyen Sorry folks I somehow got side tracked from this issue. This is a high priority issue and I i'll ship a fix for it ASAP.

@a-hilaly
Copy link
Member

a-hilaly commented Aug 6, 2024

@mumlawski @mattzech @nampnguyen we released a new iam-controller version fix the CARM/CM race condition issue

@a-hilaly
Copy link
Member

a-hilaly commented Aug 7, 2024

/close

Copy link

ack-prow bot commented Aug 7, 2024

@a-hilaly: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ack-prow ack-prow bot closed this as completed Aug 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/carm Issues or PRs related to CARM (Cross Account Resource Management) area/runtime Issues or PRs as related to controller runtime, common reconciliation logic, etc kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.
Projects
None yet
Development

No branches or pull requests

4 participants