Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolve race condition between CARM ConfigMap and reconciler for annotated namespaces #138

Merged
merged 1 commit into from
Feb 22, 2024

Conversation

a-hilaly
Copy link
Member

@a-hilaly a-hilaly commented Feb 17, 2024

Addresses aws-controllers-k8s/community#2011

In certain scenarios, where a user deploys a resource to a namespace
annotated with a specific ownner accountID, a race condition was
identified between the reconciler and the CARM (Cross Account Resource
Management) ConfigMap. This race condition resulted in the controller
setting an empty roleARN, preventing the aws-sdk-go client from pivoting
(calling STS::AssumeRole) and managing resourecs in the correct
account. Instead, resources were inadvertently managed in the default
account instead of the namespace assigned account.

This issue stemmed from the initial implementation of the CARM feature,
where the method responsible for retrieving the accountID from the
cache, didn't not properly verify the existance and content of the CARM
configMap and instead returned an empty stringy when these conditions
were not satisfied. This led to selection of the default account (when an
empty RoleARN is returned )for resource management.

Although these scenarios are rare, they can occur in clusters with a
significantly high number of namespaces, causing a delay between
naemsapce/configmap events and the informer's event handlers.

This patch addresses the race issue by implementing two main things:

  • Proper error propagation: an error is no propagated when a ConfigMap
    is missing or when an accountID entry is missing in the ConfigMap.
    This helps the reconciler make the right decision on how to handle
    these cases.
  • Improved error handling: The reconciler now carefully handles these
    errors and requeues whenever a user has issued an
    owneraccountid-annotated namespace but the Configmap is not create or
    properly propagated.

Signed-off-by: Amine Hilaly hilalyamine@gmail.com

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@ack-prow ack-prow bot requested review from jlbutler and jljaco February 17, 2024 22:55
@ack-prow ack-prow bot added the approved label Feb 17, 2024
@a-hilaly a-hilaly force-pushed the carm-race-condition/fix branch 2 times, most recently from 63afd91 to eb73742 Compare February 18, 2024 20:38
…tated namespaces

In certain scenarios, where a user deploys a resource to a namespace
annotated with a specific ownner accountID, a race condition was
identified between the reconciler and the CARM (Cross Account Resource
Management) `ConfigMap`. This race condition resulted in the controller
setting an empty roleARN, preventing the aws-sdk-go client from pivoting
(calling `STS::AssumeRole`) and managing resourecs in the correct
account. Instead, resources were inadvertently managed in the default
account instead of the namespace assigned account.

This issue stemmed from the initial implementation of the CARM feature,
where the method responsible for retrieving the accountID from the
cache, didn't not properly verify the existance and content of the CARM
configMap and instead returned an empty stringy when these conditions
were not satisfied. This led to selection of the default account (empty
`RoleARN` for resource management.

Although these scenarios are rare, they can occur in clusters with a
significantly high number of namespaces, causing a delay between
naemsapce/configmap events and the informer's event handlers.

This patch addresses the race issue by implementing two main things:
- Proper error propagation: an error is no propagated when a `ConfigMap`
  is missing or when an accountID entry is missing in the `ConfigMap`.
  This helps the reconciler make the right decision on how to handle
  these cases.
- Improved error handling: The reconciler now carefully handles these
  errors and requeues whenever a user has issued an
  owneraccountid-annotated namespace but the Configmap is not create or
  properly propagated.

Signed-off-by: Amine Hilaly <hilalyamine@gmail.com>
@yuxiang-zhang
Copy link
Member

/lgtm

@ack-prow ack-prow bot added the lgtm Indicates that a PR is ready to be merged. label Feb 22, 2024
Copy link

ack-prow bot commented Feb 22, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: a-hilaly, yuxiang-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ack-prow ack-prow bot merged commit 24070d9 into aws-controllers-k8s:main Feb 22, 2024
5 checks passed
ndbhat pushed a commit to ndbhat/ack-runtime that referenced this pull request Apr 16, 2024
…tated namespaces (aws-controllers-k8s#138)

Addresses aws-controllers-k8s/community#2011

In certain scenarios, where a user deploys a resource to a namespace
annotated with a specific ownner accountID, a race condition was
identified between the reconciler and the CARM (Cross Account Resource
Management) `ConfigMap`. This race condition resulted in the controller
setting an empty roleARN, preventing the aws-sdk-go client from pivoting
(calling `STS::AssumeRole`) and managing resourecs in the correct
account. Instead, resources were inadvertently managed in the default
account instead of the namespace assigned account.

This issue stemmed from the initial implementation of the CARM feature,
where the method responsible for retrieving the accountID from the
cache, didn't not properly verify the existance and content of the CARM
configMap and instead returned an empty stringy when these conditions
were not satisfied. This led to selection of the default account (when an
empty `RoleARN` is returned )for resource management.

Although these scenarios are rare, they can occur in clusters with a
significantly high number of namespaces, causing a delay between
naemsapce/configmap events and the informer's event handlers.

This patch addresses the race issue by implementing two main things:
- Proper error propagation: an error is no propagated when a `ConfigMap`
  is missing or when an accountID entry is missing in the `ConfigMap`.
  This helps the reconciler make the right decision on how to handle
  these cases.
- Improved error handling: The reconciler now carefully handles these
  errors and requeues whenever a user has issued an
  owneraccountid-annotated namespace but the Configmap is not create or
  properly propagated.

Signed-off-by: Amine Hilaly <hilalyamine@gmail.com>

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved lgtm Indicates that a PR is ready to be merged.
Projects
None yet
2 participants