FTL Unable to find the AWS region to process queue events.
At that point, this value isn't defined, but we could probably set it explicitly:
```yaml
{{- with .Values.awsRegion }}
- name: AWS_REGION
  value: {{ . | quote }}
{{- end }}
```
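For illustration, assuming the chart exposed an `awsRegion` value matching the template above (the value name is hypothetical), the rendered container env would look like this:

```yaml
# values.yaml -- hypothetical value name, matching the template above
awsRegion: eu-west-2

# rendered container env, as emitted by the {{- with .Values.awsRegion }} block:
# - name: AWS_REGION
#   value: "eu-west-2"
```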
Also, we set a queue name, not a full ARN, as `QUEUE_URL`, which NTH may use to parse the region:
```go
// Populate the aws region if available from node metadata and not already explicitly configured
if nthConfig.AWSRegion == "" && nodeMetadata.Region != "" {
	nthConfig.AWSRegion = nodeMetadata.Region
} else if nthConfig.AWSRegion == "" && nthConfig.QueueURL != "" {
	nthConfig.AWSRegion = getRegionFromQueueURL(nthConfig.QueueURL)
	log.Debug().Msgf("Retrieved AWS region from queue-url: \"%s\"", nthConfig.AWSRegion)
}
if nthConfig.AWSRegion == "" && nthConfig.EnableSQSTerminationDraining {
	nthConfig.Print()
	log.Fatal().Msgf("Unable to find the AWS region to process queue events.")
}
```
I'm not sure yet how and when NTH gets the region. At first glance, it seems to start working by pure chance, as a by-product of the IRSA environment variables being injected into `Pod/aws-node-termination-handler-*`, but not into `Deployment/aws-node-termination-handler`:
With that, NTH uses `AWS_REGION`, which luckily is the correct region of the SQS queue. This should be improved to specify the region explicitly, so the handler can operate immediately.
When explicitly setting the region, the crashloop and fatal error are gone. But until IRSA injects the IAM role and token, it can still take a few minutes during which the application tries the node's instance profile instead of having the right credentials:
```
2024/12/17 12:59:45 WRN There was a problem monitoring for events error="AccessDenied: User: arn:aws:sts::590184012814:assumed-role/nodes-nodepool-0-t-eigx6dsrti1rcrok3a/i-04096072154be0e20 is not authorized to perform: sqs:receivemessage on resource: arn:aws:sqs:eu-west-2:590184012814:t-eigx6dsrti1rcrok3a-nth because no identity-based policy allows the sqs:receivemessage action\n\tstatus code: 403, request id: a4a508f0-dd10-5e21-9e50-572ba5119f2f" event_type=SQS_MONITOR
```
So startup is still not any faster this way.
Deploying aws-pod-identity-webhook-app faster would be part of a solution, but that requires improving app-operator reconciliation. I'll open the crashloop fix PR first, and then we can decide whether to improve this further.