Upgrade from 5.5.1 up hangs on creation of the Autoscaling nested stack #938

Closed
glittershark opened this issue Oct 6, 2021 · 16 comments

Comments

@glittershark

I'm trying to upgrade a stack from version 5.5.1 to 5.7.0 right now, and that upgrade is currently hanging on creating the nested Autoscaling stack. That stack just has the following two events in the "Events" log:

image

and no resources at all:

image

@keithduncan
Contributor

Sorry to hear this has caused issues. Is it still stuck in this state? Are you using any advanced features in your stack like a CloudFormation Service Role? If you create a new stack from the v5.7.0 template in the same account region does it apply successfully or stall at the same point?

@glittershark
Author

glittershark commented Oct 11, 2021

Just getting back to attempting the upgrade

Is it still stuck in this state?

Yes, I waited multiple hours then canceled the upgrade so builds could run again

Are you using any advanced features in your stack like a CloudFormation Service Role?

I'm not sure if it's "advanced" but the only potentially unusual thing we're doing is using a secretsmanager secret for the agent token

If you create a new stack from the v5.7.0 template in the same account region does it apply successfully or stall at the same point?

Just tried this, and it does seem to still stall.

@keithduncan
Contributor

Thank you for these details 🙇

I’m afraid I’ve been unable to reproduce this 🤔 would you be able to share your AWS Region and a redacted set of CloudFormation parameters for your stack?

I'm not sure if it's "advanced" but the only potentially unusual thing we're doing is using a secretsmanager secret for the agent token

Are you using the SSM Reference Syntax to retrieve the Secrets Manager Secret value via SSM? I can’t see that causing CloudFormation deploy-time issues with the scaling Lambda because it is retrieved at runtime.

Just tried this, and it does seem to still stall.

I apologise again for the interruption, though I think this is good news in so far as the issue is reproducible 😅

@darrenwhighamfd

darrenwhighamfd commented Oct 15, 2021

Hi, I'm seeing this exact issue when trying to upgrade from 5.5.1 to 5.6.1 or 5.7.1. I tested deleting the stack and recreating it and saw the same issue: it hangs on the nested Autoscaling stack and eventually errors with the following in the CloudFormation events:
2021-10-14 12:56:26 UTC+0100 Autoscaling CREATE_FAILED Embedded stack arn:aws:cloudformation:us-east-1:stack/buildkite-canary-Autoscaling-3349UQASS4SA/b19dfba0-2ce0-11ec-b434-1222c22090c1 was not successfully created: Internal Failure

@keithduncan
Contributor

@darrenwhighamfd could you share the event log for the Autoscaling sub-stack that fails to create successfully?

@keithduncan
Contributor

Other than any account level IAM restriction, do your AWS organisations use any service control policies that could be restricting access to the Serverless Application Repository?

@darrenwhighamfd

darrenwhighamfd commented Oct 18, 2021

Attached is a screenshot of the nested task that fails; there are no really helpful errors.
Screenshot 2021-10-18 at 09 30 19

No service control policies that affect anything like that.

@glittershark
Author

glittershark commented Oct 18, 2021

would you be able to share your AWS Region and a redacted set of CloudFormation parameters for your stack?

We're in us-east-2. We create the stack via a Terraform module; here's the relevant resource for the actual stack. I can fill in what any of the other vars are if necessary, but at the very least there's:

buildkite_agent_token_parameter_store_path = "/aws/reference/secretsmanager/buildkite/AGENT_TOKEN"
locals {
  stack_version = "5.7.0"
}

resource "aws_cloudformation_stack" "main" {
  name = local.stack_name
  template_url = "https://s3.amazonaws.com/buildkite-aws-stack/v${local.stack_version}/aws-stack.yml"
  capabilities = ["CAPABILITY_NAMED_IAM", "CAPABILITY_AUTO_EXPAND"]
  parameters = {
    "ArtifactsBucket" = var.artifacts_bucket
    "SecretsBucket"   = var.secrets_bucket
    "BuildkiteQueue" = var.buildkite_queue
    "BuildkiteAgentTokenParameterStorePath" = var.buildkite_agent_token_parameter_store_path
    "InstanceType" = var.instance_type
    "ECRAccessPolicy" = "poweruser"
    "MaxSize" = var.max_size
    "MinSize" = var.min_size
    "VpcId"   = data.aws_vpc.vpc.id
    "Subnets" = join(",", data.aws_subnets.public_subnets.ids)
  }
}

@glittershark
Author

Also confirmed no service control policies

@keithduncan
Contributor

Thank you for those details, I appreciate the time spent working to resolve this problem.

It’s frustrating that the stack events don’t include more details here. I have found some AWS guidance on identifying a CloudFormation "Internal Failure" that involves using CloudTrail to identify the failing API operations. Would you be able to configure CloudTrail logging to help identify why this stack operation is failing?

@darrenwhighamfd

I had a look at our CloudTrail logs. A lot of messages are generated and it is very hard to find anything, but I managed to find these two errors, which may be related.
Screenshot 2021-10-19 at 10 28 47

Screenshot 2021-10-19 at 10 28 32

@darrenwhighamfd

Hi @keithduncan,

I think the issue relates to the BuildkiteAgentTokenParameter trying to pull
from Secrets Manager, versus Systems Manager.

I asked our AWS support if they could shed some light. They believe the
failure's due to the BuildkiteAgentTokenParameter having been defined as
AWS::SSM::Parameter::Name, which is a Systems Manager parameter type, not a
Secrets Manager one.

Secrets Manager encrypts the data, so GetParameter requires
'WithDecryption' to retrieve the value from Secrets Manager. When our
parameter type is AWS::SSM::Parameter::Name, CF doesn't include
WithDecryption in its GetParameter invocation, as
AWS::SSM::Parameter::Name is a Systems Manager type.
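To make the WithDecryption point concrete, here is a minimal Python sketch of a caller that always asks SSM to decrypt. It is only an illustration, not code from the Elastic CI Stack or the agent scaler: the helper names `is_secrets_manager_reference` and `fetch_agent_token` are hypothetical, and `ssm_client` is assumed to behave like `boto3.client("ssm")`, whose `get_parameter` call accepts `Name` and `WithDecryption`.

```python
# Prefix that marks an SSM parameter path as a Secrets Manager reference,
# as seen in the paths quoted in this thread.
SECRETS_MANAGER_PREFIX = "/aws/reference/secretsmanager/"


def is_secrets_manager_reference(path: str) -> bool:
    """Return True when an SSM parameter path points at a Secrets Manager secret."""
    return path.startswith(SECRETS_MANAGER_PREFIX)


def fetch_agent_token(ssm_client, path: str) -> str:
    """Fetch a parameter value, always asking SSM to decrypt it.

    `ssm_client` is assumed to behave like boto3.client("ssm").
    Secrets Manager references are rejected by SSM unless
    WithDecryption is True, so the flag is passed unconditionally.
    """
    response = ssm_client.get_parameter(Name=path, WithDecryption=True)
    return response["Parameter"]["Value"]
```

With real credentials this would be called as `fetch_agent_token(boto3.client("ssm"), "/aws/reference/secretsmanager/buildkite/AGENT_TOKEN")`. A caller that omits the flag, which is what support believes CloudFormation does for this parameter type, gets a validation error from SSM instead of the value.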

AWS Support recommended using Secrets Manager purely, and this dynamic
reference[1] below.

    Environment:
      Variables:
        BUILDKITE_AGENT_TOKEN_SSM_KEY: '{{resolve:secretsmanager:SecretName:SecretString:SecretKey}}'

I've confirmed that the rest of the CF template works by deleting my stack and recreating fresh using the
previous BuildkiteAgentTokenParameter option, and was able to install a
v5.7.1 stack.

[1] https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/dynamic-references.html#dynamic-references-secretsmanager

@keithduncan
Contributor

That’s good research @darrenwhighamfd! We could be on to something as @glittershark also showed using AWS Secrets Manager SSM Parameter Store reference syntax.

CF doesn't include WithDecryption in its GetParameter invocation

Our template doesn’t resolve the value of the parameter, so this isn’t applicable. Instead, the template passes the value of the provided SSM Parameter Path to the EC2 instances via UserData and to the scaling Lambda. Both resolve the value of this parameter at runtime.

The Lambda template has both ssm:GetParameter and kms:Decrypt permissions, and passes WithDecryption when fetching the value of the SSM parameter.

The launch template’s IAM Role also has ssm:GetParameter and kms:Decrypt permissions, and passes WithDecryption in both the Linux and Windows install scripts.

I have previously tested and documented AWS Secrets Manager SSM reference syntax in our template.

I've confirmed that the rest of the CF template works by deleting my stack and recreating fresh using the
previous BuildkiteAgentTokenParameter option, and was able to install a
v5.7.1 stack.

Were you passing a value for the BuildkiteAgentTokenParameterStoreKMSKey when using a secrets manager reference path for BuildkiteAgentTokenParameterStorePath? NB you must pass the Key ID, not the Key Name for this parameter.

@darrenwhighamfd

Thanks @keithduncan,
Looking into it, the key error from CF is:

    },
    "eventTime": "2021-10-21T09:26:08Z",
    "eventSource": "ssm.amazonaws.com",
    "eventName": "GetParameter",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "cloudformation.amazonaws.com",
    "userAgent": "cloudformation.amazonaws.com",
    "errorCode": "ValidationException",
    "errorMessage": "WithDecryption flag must be True for retrieving a Secret Manager secret.",
    "requestParameters": {
        "name": "/aws/reference/secretsmanager/bk_agent"
    },

Which would suggest that WithDecryption is not set. I have set BuildkiteAgentTokenParameterStoreKMSKey within the options and can see this gets passed to the nested stack with the key ID
Screenshot 2021-10-21 at 10 31 29
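For anyone else digging through a noisy CloudTrail export for an event like the one above, a small filter can help. This is a rough sketch only: it assumes the log file has already been downloaded and decompressed to JSON with a top-level "Records" list, and the function names (`load_records`, `failed_calls`, `summarize`) are made up for illustration. The field names are the ones visible in the event quoted above.

```python
import json


def load_records(path):
    """Load CloudTrail records from a decompressed JSON log file."""
    with open(path) as f:
        return json.load(f).get("Records", [])


def failed_calls(records, caller="cloudformation.amazonaws.com"):
    """Keep only records that carry an errorCode and were made by `caller`.

    `caller` is matched against sourceIPAddress, which holds the service
    name when an AWS service makes the call on your behalf.
    """
    return [
        r for r in records
        if r.get("errorCode") and r.get("sourceIPAddress") == caller
    ]


def summarize(record):
    """Build a one-line summary of a failed CloudTrail record."""
    return "{} {} {}: {}: {}".format(
        record.get("eventTime", "?"),
        record.get("eventSource", "?"),
        record.get("eventName", "?"),
        record.get("errorCode", "?"),
        record.get("errorMessage", ""),
    )
```

Running `for r in failed_calls(load_records("trail.json")): print(summarize(r))` (the file name is a placeholder) narrows the output to failed API calls made by CloudFormation itself, which is where the GetParameter error showed up.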

@keithduncan
Contributor

Thank you for all your help isolating this issue and the steps to reproduce.

I have reproduced this and to fix it I have removed the AWS::SSM::Parameter::Name annotation in buildkite/buildkite-agent-scaler#53

We’ll release an updated version of the agent scaler to the Serverless Application repository shortly, and will incorporate it into the next release of the Elastic CI Stack. I’ll keep this open to track that and close it when that release goes out 🙇

@keithduncan added this to the v5.7.2 milestone Oct 25, 2021
@keithduncan
Contributor

Version 5.7.2 of the Elastic CI Stack has just been released which includes a fix for this issue, thank you again for your patience. I’m going to close this issue, but please don’t hesitate to comment or re-open it should you experience any further issues.
