Upgrade from 5.5.1 up hangs on creation of the Autoscaling nested stack #938
Comments
Sorry to hear this has caused issues. Is it still stuck in this state? Are you using any advanced features in your stack, like a CloudFormation Service Role? If you create a new stack from the v5.7.0 template in the same account and region, does it apply successfully or stall at the same point?
Just getting back to attempting the upgrade
Yes, I waited multiple hours then canceled the upgrade so builds could run again
I'm not sure if it's "advanced" but the only potentially unusual thing we're doing is using a secretsmanager secret for the agent token
Just tried this, and it does seem to still stall.
Thank you for these details 🙇 I’m afraid I’ve been unable to reproduce this 🤔 would you be able to share your AWS Region and a redacted set of CloudFormation parameters for your stack?
Are you using the SSM Reference Syntax to retrieve the Secrets Manager Secret value via SSM? I can’t see that causing CloudFormation deploy-time issues with the scaling Lambda because it is retrieved at runtime.
I apologise again for the interruption, though I think this is good news in so far as the issue is reproducible 😅
Hi, I'm seeing this exact issue when trying to upgrade from 5.5.1 to 5.6.1 or 5.7.1. I tested deleting the stack and recreating it, and I see the same issue: it hangs on the nested Autoscaling job and eventually errors with
@darrenwhighamfd could you share the event log for the Autoscaling sub-stack that fails to create successfully?
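For anyone collecting the same information, the failing entries can be filtered out of a sub-stack's event log. This is a minimal sketch, assuming events shaped like the `StackEvents` list returned by CloudFormation's DescribeStackEvents API. The helper name `failed_events` and the sample data are hypothetical and not taken from this thread.

```python
from typing import Dict, List


def failed_events(events: List[Dict]) -> List[Dict]:
    """Return events whose ResourceStatus ends in _FAILED, oldest first."""
    failures = [
        e for e in events
        if e.get("ResourceStatus", "").endswith("_FAILED")
    ]
    # Timestamps only need to be mutually comparable: datetime objects
    # (as boto3 returns) or ISO 8601 strings (as in the sample below).
    return sorted(failures, key=lambda e: e["Timestamp"])


# Hypothetical sample data, for illustration only.
sample_events = [
    {
        "Timestamp": "2021-01-01T00:05:00Z",
        "LogicalResourceId": "Autoscaling",
        "ResourceStatus": "CREATE_FAILED",
        "ResourceStatusReason": "Internal Failure",
    },
    {
        "Timestamp": "2021-01-01T00:00:00Z",
        "LogicalResourceId": "Autoscaling",
        "ResourceStatus": "CREATE_IN_PROGRESS",
    },
]

for event in failed_events(sample_events):
    print(event["LogicalResourceId"], event.get("ResourceStatusReason", ""))
```

With boto3 installed and credentials configured, the input could come from `boto3.client("cloudformation").describe_stack_events(StackName=...)["StackEvents"]`, where the stack name is the nested stack's name or ID.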
Other than any account-level IAM restriction, do your AWS organisations use any service control policies that could be restricting access to the Serverless Application Repository?
We're in us-east-2. We create the stack via a Terraform module; here's the relevant resource for the actual stack. I can fill in what any of the other
locals {
  stack_version = "5.7.0"
}

resource "aws_cloudformation_stack" "main" {
  name         = local.stack_name
  template_url = "https://s3.amazonaws.com/buildkite-aws-stack/v${local.stack_version}/aws-stack.yml"
  capabilities = ["CAPABILITY_NAMED_IAM", "CAPABILITY_AUTO_EXPAND"]

  parameters = {
    "ArtifactsBucket"                       = var.artifacts_bucket
    "SecretsBucket"                         = var.secrets_bucket
    "BuildkiteQueue"                        = var.buildkite_queue
    "BuildkiteAgentTokenParameterStorePath" = var.buildkite_agent_token_parameter_store_path
    "InstanceType"                          = var.instance_type
    "ECRAccessPolicy"                       = "poweruser"
    "MaxSize"                               = var.max_size
    "MinSize"                               = var.min_size
    "VpcId"                                 = data.aws_vpc.vpc.id
    "Subnets"                               = join(",", data.aws_subnets.public_subnets.ids)
  }
}
Also confirmed no service control policies
Thank you for those details, I appreciate the time spent working to resolve this problem. It’s frustrating that the stack events don’t include more details here. I have found some AWS guidance on identifying a CloudFormation "Internal Failure" that involves using CloudTrail to identify the failing API operations. Would you be able to configure CloudTrail logging to help identify why this stack operation is failing?
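Once CloudTrail logging is enabled, the failing API operations are the records that carry an `errorCode` field. This is a minimal sketch, assuming records already parsed from CloudTrail log files into dictionaries. The helper name `failing_api_calls` and the sample records are hypothetical and do not reproduce the failure from this thread.

```python
from typing import Dict, List


def failing_api_calls(records: List[Dict]) -> List[Dict]:
    """Summarise CloudTrail records that report an error."""
    return [
        {
            "source": r.get("eventSource"),
            "name": r.get("eventName"),
            "error": r["errorCode"],
            "message": r.get("errorMessage", ""),
        }
        for r in records
        if r.get("errorCode")  # successful calls have no errorCode
    ]


# Hypothetical sample records, for illustration only.
sample_records = [
    {"eventSource": "cloudformation.amazonaws.com", "eventName": "DescribeStacks"},
    {
        "eventSource": "cloudformation.amazonaws.com",
        "eventName": "CreateStack",
        "errorCode": "AccessDenied",
        "errorMessage": "example message",
    },
]

for call in failing_api_calls(sample_records):
    print(call["source"], call["name"], call["error"])
```

Narrowing the input to the time window of the stalled stack operation keeps the output short enough to read by eye.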
Hi @keithduncan, I think the issue relates to the BuildkiteAgentTokenParameter trying to pull I asked our AWS support if they could shine a light. They believe the The Secret Manager will encrypt the data, so GetParameter requires AWS Support recommended using Secrets Manager purely, and this dynamic Environment: I've confirmed that the rest of the CF template works by deleting my stack and recreating fresh using the
That’s good research @darrenwhighamfd! We could be on to something as @glittershark also showed using AWS Secrets Manager SSM Parameter Store reference syntax.
Our template doesn’t resolve the value of the parameter, so this isn’t applicable. Instead the template passes the value of the provided SSM Parameter Path to the EC2 instances via The Lambda template has both ssm:GetParameter and kms:Decrypt permissions, and passes WithDecryption when fetching the value of the SSM parameter. The launch template’s IAM Role also has ssm:GetParameter and kms:Decrypt permissions, and passes I have previously tested and documented AWS Secrets Manager SSM reference syntax in our template.
Were you passing a value for the
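To illustrate the runtime lookup described above, here is a minimal sketch of fetching a parameter with decryption requested. It is written in Python against the boto3 `get_parameter` call and is not the project's own code. The function name `fetch_agent_token`, the `StubSSM` class, and the example path are hypothetical. The stub stands in for `boto3.client("ssm")` so the sketch runs without AWS credentials.

```python
def fetch_agent_token(ssm_client, parameter_path: str) -> str:
    """Read a parameter value, asking SSM to decrypt it.

    WithDecryption=True is what makes SSM return the plaintext of an
    encrypted value; the caller also needs ssm:GetParameter and
    kms:Decrypt permissions, as noted in the comment above.
    """
    response = ssm_client.get_parameter(Name=parameter_path, WithDecryption=True)
    return response["Parameter"]["Value"]


class StubSSM:
    """Hypothetical offline stand-in for boto3.client("ssm")."""

    def __init__(self):
        self.calls = []

    def get_parameter(self, **kwargs):
        # Record the request so the arguments can be inspected.
        self.calls.append(kwargs)
        return {"Parameter": {"Name": kwargs["Name"], "Value": "example-token"}}


stub = StubSSM()
token = fetch_agent_token(stub, "/example/buildkite/agent-token")
print(token, stub.calls[0]["WithDecryption"])
```

Swapping the stub for a real `boto3.client("ssm")` sends the same request to AWS. Whether a given path resolves still depends on the IAM and KMS setup discussed in this thread.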
Thanks @keithduncan,
Which would suggest that WithDecryption is not set. I have set BuildkiteAgentTokenParameterStoreKMSKey within the options and can see this gets passed to the nested stack with the key ID
Thank you for all your help isolating this issue and the steps to reproduce. I have reproduced this and to fix it I have removed the We’ll release an updated version of the agent scaler to the Serverless Application Repository shortly, and will incorporate it into the next release of the Elastic CI Stack. I’ll keep this open to track that and close it when that release goes out 🙇
Version 5.7.2 of the Elastic CI Stack has just been released which includes a fix for this issue, thank you again for your patience. I’m going to close this issue, but please don’t hesitate to comment or re-open it should you experience any further issues.
I'm trying to upgrade a stack from version 5.5.1 to 5.7.0 right now, and that upgrade is currently hanging on creating the nested Autoscaling stack. That stack just has the following two events in the "Events" log: and no resources at all: