amazon-ssm-agent service fails connecting to SSM due to eventual consistency #503

Open: @gnought wants to merge 5 commits into main


gnought
Contributor

@gnought gnought commented Aug 25, 2024

With the sample config below, the policy defined in temporary_iam_instance_profile_policy_document may not be visible immediately after an EC2 instance starts, because PutRolePolicy and AddRoleToInstanceProfile are eventually consistent. As a result, the amazon-ssm-agent service may fail to connect to SSM because the required SSM role is not available yet. Recovering requires either logging into the instance to restart the service manually or waiting about 30 minutes for the agent to retry on its own. (Please see the packer build log and the EC2 amazon-ssm-agent log below.)
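
For context, a common way to cope with eventual consistency is to poll until the resource becomes visible. The sketch below illustrates that general pattern only; it is not the approach taken in this PR, and `waitUntilVisible`, `check`, and `errNotVisible` are hypothetical names introduced for illustration. In a real build, `check` would wrap whatever call confirms that the role or policy is readable.

```go
package main

import (
	"errors"
	"time"
)

// errNotVisible is a hypothetical sentinel error for this sketch.
var errNotVisible = errors.New("resource not visible after all attempts")

// waitUntilVisible calls check up to attempts times, sleeping delay between
// tries. It returns nil as soon as check reports true, the first error that
// check returns, or errNotVisible once the attempts are exhausted.
func waitUntilVisible(check func() (bool, error), attempts int, delay time.Duration) error {
	for i := 0; i < attempts; i++ {
		ok, err := check()
		if err != nil {
			return err
		}
		if ok {
			return nil
		}
		// Skip the sleep after the final attempt.
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}
	return errNotVisible
}
```

Polling like this only narrows the window, since a successful read does not guarantee that every other endpoint already sees the change.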

This PR automatically creates a custom instance profile with the AmazonSSMManagedInstanceCore managed policy attached when session_manager is used without the iam_instance_profile attribute. If the user defines temporary_iam_instance_profile_policy_document, it is added as an inline policy to the custom profile. This resolves the race condition, so the amazon-ssm-agent service can consistently connect to SSM on its first start.
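
The decision described above can be sketched roughly as follows. This is a minimal illustration of the rule as stated in this description, not the code in the PR; `builderConfig`, `profilePlan`, and `planInstanceProfile` are hypothetical names, and cases outside the one described here are simply left as "no change".

```go
package main

// builderConfig holds only the settings relevant to this sketch (hypothetical type).
type builderConfig struct {
	SSHInterface       string // value of ssh_interface
	IamInstanceProfile string // value of iam_instance_profile, empty if unset
	HasPolicyDocument  bool   // true if temporary_iam_instance_profile_policy_document is defined
}

// profilePlan describes what the builder would set up (hypothetical type).
type profilePlan struct {
	CreateCustomProfile    bool
	AttachSSMManagedPolicy bool
	AddInlinePolicy        bool
}

// planInstanceProfile applies the rule from the description: with
// session_manager and no iam_instance_profile, create a custom profile with
// AmazonSSMManagedInstanceCore attached, and add the user's policy document
// as an inline policy when one is defined.
func planInstanceProfile(c builderConfig) profilePlan {
	if c.SSHInterface != "session_manager" || c.IamInstanceProfile != "" {
		// Outside the case described here; this sketch plans nothing.
		return profilePlan{}
	}
	return profilePlan{
		CreateCustomProfile:    true,
		AttachSSMManagedPolicy: true,
		AddInlinePolicy:        c.HasPolicyDocument,
	}
}
```

For example, `planInstanceProfile(builderConfig{SSHInterface: "session_manager"})` plans a custom profile with the managed policy and no inline policy.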

As a bonus, this PR also supports the AWS China regions, closing #50.
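
One reason China regions need special handling is that ARNs there use the `aws-cn` partition instead of `aws`. The sketch below shows how a partition-aware ARN for the managed policy could be built. It is an assumption about the kind of change involved, not the PR's actual code; `partitionForRegion` and `ssmManagedPolicyARN` are hypothetical helpers, and the prefix matching covers only the standard, China, and GovCloud partitions.

```go
package main

import "strings"

// partitionForRegion maps a region name to its AWS partition.
// Assumption: only the "aws", "aws-cn", and "aws-us-gov" partitions are handled.
func partitionForRegion(region string) string {
	switch {
	case strings.HasPrefix(region, "cn-"):
		return "aws-cn"
	case strings.HasPrefix(region, "us-gov-"):
		return "aws-us-gov"
	default:
		return "aws"
	}
}

// ssmManagedPolicyARN returns the ARN of the AmazonSSMManagedInstanceCore
// managed policy in the partition that the given region belongs to.
func ssmManagedPolicyARN(region string) string {
	return "arn:" + partitionForRegion(region) + ":iam::aws:policy/AmazonSSMManagedInstanceCore"
}
```

With this helper, `ssmManagedPolicyARN("cn-north-1")` yields `arn:aws-cn:iam::aws:policy/AmazonSSMManagedInstanceCore`, while `us-east-1` yields the familiar `arn:aws:...` form.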

Sample config:

ssh_interface           = "session_manager"
temporary_key_pair_type = "ed25519"
temporary_key_pair_bits = 384
// copy from AmazonSSMManagedInstanceCore managed policy
temporary_iam_instance_profile_policy_document {
  Version = "2012-10-17"
  Statement {
    Action = [
      "ssm:DescribeAssociation",
      "ssm:GetDeployablePatchSnapshotForInstance",
      "ssm:GetDocument",
      "ssm:DescribeDocument",
      "ssm:GetManifest",
      "ssm:GetParameter",
      "ssm:GetParameters",
      "ssm:ListAssociations",
      "ssm:ListInstanceAssociations",
      "ssm:PutInventory",
      "ssm:PutComplianceItems",
      "ssm:PutConfigurePackageResult",
      "ssm:UpdateAssociationStatus",
      "ssm:UpdateInstanceAssociationStatus",
      "ssm:UpdateInstanceInformation"
    ]
    Effect   = "Allow"
    Resource = ["*"]
  }
  Statement {
    Action = [
      "ssmmessages:CreateControlChannel",
      "ssmmessages:CreateDataChannel",
      "ssmmessages:OpenControlChannel",
      "ssmmessages:OpenDataChannel"
    ]
    Effect   = "Allow"
    Resource = ["*"]
  }
  Statement {
    Action = [
      "ec2messages:AcknowledgeMessage",
      "ec2messages:DeleteMessage",
      "ec2messages:FailMessage",
      "ec2messages:GetEndpoint",
      "ec2messages:GetMessages",
      "ec2messages:SendReply"
    ]
    Effect   = "Allow"
    Resource = ["*"]
  }
}

packer build log:

2024/08/26 00:56:29 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/08/26 00:56:29 Retryable error: TargetNotConnected: i-011a46c740a76676e is not connected.
2024/08/26 00:56:31 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/08/26 00:56:31 [DEBUG] TCP connection to SSH ip/port failed: dial tcp [::1]:8973: connect: connection refused
2024/08/26 00:56:36 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/08/26 00:56:36 [DEBUG] TCP connection to SSH ip/port failed: dial tcp [::1]:8973: connect: connection refused
2024/08/26 00:56:41 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/08/26 00:56:41 [DEBUG] TCP connection to SSH ip/port failed: dial tcp [::1]:8973: connect: connection refused

The EC2 amazon-ssm-agent log:

status code: 404, request id:
2024-08-25 16:54:21 ERROR EC2RoleProvider Failed to connect to Systems Manager with SSM role credentials. error calling RequestManagedInstanceRoleToken: AccessDeniedException: Systems Manager's instance management role is not configured for account: 1234567890
	status code: 400, request id: 906a00a0-9eec-42b7-b385-xxxxxxxxx
2024-08-25 16:54:21 ERROR [CredentialRefresher] Retrieve credentials produced error: no valid credentials could be retrieved for ec2 identity. Default Host Management Err: error calling RequestManagedInstanceRoleToken: AccessDeniedException: Systems Manager's instance management role is not configured for account: 1234567890
	status code: 400, request id: 906a00a0-9eec-42b7-b385-xxxxxxxxx
2024-08-25 16:54:21 INFO [CredentialRefresher] Sleeping for 27m6s before retrying retrieve credentials

@gnought gnought requested a review from a team as a code owner August 25, 2024 19:04
@gnought gnought force-pushed the fix/ssm_race_condition branch from 98a5b76 to f630f5c Compare October 21, 2024 07:26
@gnought gnought force-pushed the fix/ssm_race_condition branch from f630f5c to ffbf88b Compare January 10, 2025 17:31