aws-node-termination-handler comes up late because it doesn't immediately determine the AWS region #3802

AndiDog · 2024-12-17T11:37:52Z

NTH starts a few minutes late because it crash-loops with this fatal error:

FTL Unable to find the AWS region to process queue events.

At that time, the following value isn't defined. We could probably set it explicitly:

{{- with .Values.awsRegion }}
- name: AWS_REGION
  value: {{ . | quote }}
{{- end }}

Also, we set only a queue name as QUEUE_URL, not a full queue URL, so NTH has nothing to parse the region from:

// Populate the aws region if available from node metadata and not already explicitly configured
if nthConfig.AWSRegion == "" && nodeMetadata.Region != "" {
	nthConfig.AWSRegion = nodeMetadata.Region
} else if nthConfig.AWSRegion == "" && nthConfig.QueueURL != "" {
	nthConfig.AWSRegion = getRegionFromQueueURL(nthConfig.QueueURL)
	log.Debug().Str("Retrieved AWS region from queue-url: \"%s\"", nthConfig.AWSRegion)
}
if nthConfig.AWSRegion == "" && nthConfig.EnableSQSTerminationDraining {
	nthConfig.Print()
	log.Fatal().Msgf("Unable to find the AWS region to process queue events.")
}
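
To illustrate why a bare queue name leaves the region empty, here is a minimal Go sketch of the same precedence. `regionFromQueueURL` and `resolveRegion` are hypothetical stand-ins written for this example, not NTH's actual `getRegionFromQueueURL`. The sketch assumes the common `https://sqs.<region>.amazonaws.com/<account-id>/<queue-name>` URL form, and the account ID and queue name are placeholders.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// regionFromQueueURL is a hypothetical helper (not NTH's implementation).
// It expects https://sqs.<region>.amazonaws.com/<account-id>/<queue-name>
// and returns "" when the input has no host to parse a region from.
func regionFromQueueURL(queueURL string) string {
	u, err := url.Parse(queueURL)
	if err != nil || u.Host == "" {
		return "" // a bare queue name ends up here
	}
	parts := strings.Split(u.Hostname(), ".")
	if len(parts) < 3 || parts[0] != "sqs" {
		return ""
	}
	return parts[1]
}

// resolveRegion mirrors the precedence of the quoted snippet:
// explicit setting, then node metadata, then the queue URL.
func resolveRegion(explicit, metadataRegion, queueURL string) string {
	switch {
	case explicit != "":
		return explicit
	case metadataRegion != "":
		return metadataRegion
	case queueURL != "":
		return regionFromQueueURL(queueURL)
	}
	return ""
}

func main() {
	// Full queue URL: the region can be parsed.
	fmt.Printf("%q\n", resolveRegion("", "", "https://sqs.eu-west-1.amazonaws.com/123456789012/example-nth"))
	// Bare queue name: nothing to parse, region stays empty.
	fmt.Printf("%q\n", resolveRegion("", "", "example-nth"))
	// Explicit region: wins regardless of the queue value.
	fmt.Printf("%q\n", resolveRegion("eu-west-1", "", "example-nth"))
}
```

With only a queue name and no metadata region, the result is empty, which corresponds to the fatal path in the quoted code when SQS termination draining is enabled.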

I'm not sure yet how and when NTH gets the region. At first glance, it seems to start working by pure chance, as a by-product of IRSA injecting environment variables. They are injected into Pod/aws-node-termination-handler-*, but are not present in Deployment/aws-node-termination-handler:

      - name: AWS_STS_REGIONAL_ENDPOINTS
        value: regional
      - name: AWS_DEFAULT_REGION
        value: eu-west-1
      - name: AWS_REGION
        value: eu-west-1
      - name: AWS_ROLE_ARN
        value: arn:aws:iam::622574275803:role/byoa1-nth
      - name: AWS_WEB_IDENTITY_TOKEN_FILE
        value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token

With that, NTH picks up AWS_REGION, which luckily is the correct region of the SQS queue. We should specify the region explicitly instead, so that NTH can operate immediately.
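
As a rough illustration of this accidental dependency, the Go sketch below looks for a region in the two region variables from the Pod spec above. `regionFromEnv` is a hypothetical helper for this example only; the lookup order is an assumption and is not taken from NTH or the AWS SDK.

```go
package main

import (
	"fmt"
	"os"
)

// regionFromEnv returns the first non-empty value among the region
// variables seen in the injected Pod environment. The order is an
// assumption made for this sketch. lookup is a parameter so the
// function can be exercised without touching the real environment.
func regionFromEnv(lookup func(string) string) string {
	for _, key := range []string{"AWS_REGION", "AWS_DEFAULT_REGION"} {
		if v := lookup(key); v != "" {
			return v
		}
	}
	return ""
}

func main() {
	region := regionFromEnv(os.Getenv)
	if region == "" {
		fmt.Println("no region variable set in the environment")
		return
	}
	fmt.Println("region from environment:", region)
}
```

A Pod created without the injected variables would report no region here, whereas setting AWS_REGION in the chart makes the result independent of the injection.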


AndiDog · 2024-12-17T13:02:28Z

With the region set explicitly, the crash loop and the fatal error are gone. But until IRSA injects the IAM role and token, the application can still spend a few minutes trying the node's instance profile instead of the right credentials:

2024/12/17 12:59:45 WRN There was a problem monitoring for events error="AccessDenied: User: arn:aws:sts::590184012814:assumed-role/nodes-nodepool-0-t-eigx6dsrti1rcrok3a/i-04096072154be0e20 is not authorized to perform: sqs:receivemessage on resource: arn:aws:sqs:eu-west-2:590184012814:t-eigx6dsrti1rcrok3a-nth because no identity-based policy allows the sqs:receivemessage action\n\tstatus code: 403, request id: a4a508f0-dd10-5e21-9e50-572ba5119f2f" event_type=SQS_MONITOR

So setting the region alone doesn't make startup any faster.
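
For debugging this state, a small Go sketch can report which of the IRSA variables from the Pod spec in the first comment are missing. `missingIRSAEnv` is a hypothetical helper written for this example. It only inspects the environment and assumes that an absent AWS_ROLE_ARN or AWS_WEB_IDENTITY_TOKEN_FILE means the process falls back to the node's instance profile, as the log above suggests.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// missingIRSAEnv returns the IRSA-related variables that are unset or empty.
// lookup is a parameter so the check can run against a fake environment.
func missingIRSAEnv(lookup func(string) string) []string {
	var missing []string
	for _, key := range []string{"AWS_ROLE_ARN", "AWS_WEB_IDENTITY_TOKEN_FILE"} {
		if lookup(key) == "" {
			missing = append(missing, key)
		}
	}
	return missing
}

func main() {
	if missing := missingIRSAEnv(os.Getenv); len(missing) > 0 {
		// Assumption: without these, credentials come from the instance profile.
		fmt.Println("IRSA variables not set:", strings.Join(missing, ", "))
		return
	}
	fmt.Println("IRSA role and token variables are present")
}
```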

Deploying aws-pod-identity-webhook-app faster would be part of a solution, but that requires improving app-operator reconciliation. I guess I'll just open the crashloop fix PR and then we can see if we want to improve this further.

github-project-automation bot added this to Roadmap Dec 17, 2024

github-project-automation bot moved this to Inbox 📥 in Roadmap Dec 17, 2024

AndiDog self-assigned this Dec 17, 2024

architectbot added the team/phoenix Team Phoenix label Dec 17, 2024

AndiDog mentioned this issue Dec 17, 2024

Explicitly set aws-node-termination-handler queue region so crash-loops are avoided, allowing faster startup giantswarm/cluster-aws#977

Open

1 task
