aws-node-termination-handler comes up late because it doesn't immediately determine the AWS region #3802

Open
AndiDog opened this issue Dec 17, 2024 · 1 comment
Labels
team/phoenix Team Phoenix

AndiDog commented Dec 17, 2024

NTH starts a few minutes late because it first crashloops on this fatal error:

FTL Unable to find the AWS region to process queue events.

At that time, the chart value below isn't defined, so AWS_REGION never ends up in the container environment. We could probably set it explicitly:

{{- with .Values.awsRegion }}
- name: AWS_REGION
  value: {{ . | quote }}
{{- end }}

Also, we set only a queue name as QUEUE_URL rather than a full queue URL, which NTH could otherwise use to parse the region:

// Populate the aws region if available from node metadata and not already explicitly configured
if nthConfig.AWSRegion == "" && nodeMetadata.Region != "" {
	nthConfig.AWSRegion = nodeMetadata.Region
} else if nthConfig.AWSRegion == "" && nthConfig.QueueURL != "" {
	nthConfig.AWSRegion = getRegionFromQueueURL(nthConfig.QueueURL)
	log.Debug().Str("Retrieved AWS region from queue-url: \"%s\"", nthConfig.AWSRegion)
}
if nthConfig.AWSRegion == "" && nthConfig.EnableSQSTerminationDraining {
	nthConfig.Print()
	log.Fatal().Msgf("Unable to find the AWS region to process queue events.")
}
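To illustrate why a bare queue name can't yield a region, here is a minimal sketch of host-based region parsing. It is not NTH's actual getRegionFromQueueURL; the helper name regionFromQueueURL, the queue name example-nth and the account ID 123456789012 are made up for illustration, and the sketch assumes the usual sqs.<region>.amazonaws.com host layout.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// regionFromQueueURL is a hypothetical helper (not NTH's implementation).
// It assumes a host of the form sqs.<region>.amazonaws.com, optionally
// followed by further labels, and returns "" for anything else,
// including a bare queue name.
func regionFromQueueURL(queueURL string) string {
	u, err := url.Parse(queueURL)
	if err != nil || u.Host == "" {
		// A bare queue name parses as a path with no host.
		return ""
	}
	labels := strings.Split(u.Hostname(), ".")
	if len(labels) < 4 || labels[0] != "sqs" || labels[2] != "amazonaws" {
		return ""
	}
	return labels[1]
}

func main() {
	for _, q := range []string{
		"https://sqs.eu-west-1.amazonaws.com/123456789012/example-nth", // full URL
		"example-nth", // queue name only, as in our setup
	} {
		fmt.Printf("%q -> region %q\n", q, regionFromQueueURL(q))
	}
}
```

With only a queue name, a parser along these lines has nothing to extract, so the region has to come from node metadata or from an explicit setting.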

I'm not sure yet how and when NTH gets the region. At first glance, it seems to start working by pure chance, as a by-product of IRSA environment variable injection. These variables show up in Pod/aws-node-termination-handler-*, but not in Deployment/aws-node-termination-handler:

      - name: AWS_STS_REGIONAL_ENDPOINTS
        value: regional
      - name: AWS_DEFAULT_REGION
        value: eu-west-1
      - name: AWS_REGION
        value: eu-west-1
      - name: AWS_ROLE_ARN
        value: arn:aws:iam::622574275803:role/byoa1-nth
      - name: AWS_WEB_IDENTITY_TOKEN_FILE
        value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token

With that, NTH uses AWS_REGION, which luckily is the correct region of the SQS queue. We should specify the region explicitly so that NTH can operate immediately.


AndiDog commented Dec 17, 2024

With the region set explicitly, the crashlooping and the fatal error are gone. But until IRSA injects the IAM role and token, there can still be a few minutes during which the application tries the node's instance profile instead of using the right credentials:

2024/12/17 12:59:45 WRN There was a problem monitoring for events error="AccessDenied: User: arn:aws:sts::590184012814:assumed-role/nodes-nodepool-0-t-eigx6dsrti1rcrok3a/i-04096072154be0e20 is not authorized to perform: sqs:receivemessage on resource: arn:aws:sqs:eu-west-2:590184012814:t-eigx6dsrti1rcrok3a-nth because no identity-based policy allows the sqs:receivemessage action\n\tstatus code: 403, request id: a4a508f0-dd10-5e21-9e50-572ba5119f2f" event_type=SQS_MONITOR

So setting the region alone doesn't get NTH working any faster.
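To tell the two states apart from inside a container, a small check along these lines could help. This is only a sketch: credentialSource is a hypothetical helper, and it deliberately simplifies the AWS SDK's credential chain by looking only at the two IRSA variables shown earlier (AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE), ignoring static keys and shared config files.

```go
package main

import (
	"fmt"
	"os"
)

// credentialSource is a hypothetical helper. It guesses which credentials
// will be used, based only on whether the IRSA variables are present.
// Assumption: without them, the SDK ends up on the node's instance profile;
// static keys and shared config files are ignored in this sketch.
func credentialSource(getenv func(string) string) string {
	if getenv("AWS_ROLE_ARN") != "" && getenv("AWS_WEB_IDENTITY_TOKEN_FILE") != "" {
		return "irsa"
	}
	return "instance-profile"
}

func main() {
	// Report both conditions discussed in this issue: region and credentials.
	fmt.Println("AWS_REGION set:", os.Getenv("AWS_REGION") != "")
	fmt.Println("likely credential source:", credentialSource(os.Getenv))
}
```

A pod created before the webhook is available would report instance-profile here, which matches the AccessDenied warning above.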

Deploying aws-pod-identity-webhook-app faster would be part of a solution, but that requires improving app-operator reconciliation. I guess I'll just open the crashloop fix PR and then we can see if we want to improve this further.
