Skip to content

(3.0.0 3.1.2) build image stack deletion failed after image successfully created

Giacomo Marciani edited this page Apr 20, 2022 · 3 revisions

The issue

Starting from 2021-03-03 the build-image command may fail during stack deletion after image created successfully. From the result of pcluster describe-image --image-id <your-image-id> , imageBuildStatus is in BUILD_COMPLETE state and image state is AVAILABLE, but the CloudFormation stack deletion failed with the following error messages:

Resource handler returned message: 
"User: arn:aws:sts::1234567:assumed-role/ParallelClusterImageCleanup-xxxx-xxxx-xxxx/ParallelClusterImage-xxxxx is not authorized to perform: 
imagebuilder:GetImage on resource: arn:aws:imagebuilder:us-east-1:1234567:image/parallelclusterimage-build-image/3.1.2/1 
(Service: Imagebuilder, Status Code: 403, Request ID: xxx)" 
(RequestToken: xxxxx, HandlerErrorCode: GeneralServiceException)

Resource handler returned message: 
"User: arn:aws:sts::1234567:xxxx is not authorized to perform: 
imagebuilder:CancelImageCreation on resource: arn:aws:imagebuilder:sa-east-1:1234567:image/parallelclusterimage-integ-tests-build-image-xxx/3.1.2/1 
(Service: Imagebuilder, Status Code: 403, Request ID: xxx)" 
(RequestToken: xxxx, HandlerErrorCode: GeneralServiceException)

You can verify whether you are impacted by the issue by executing the following command:

pcluster get-image-stack-events --image-id <your-image-id> --region <image-build-region> --query "events[0]"

Then you can check from the output if the resourceStatus and resourceStatusReason match the following:

"resourceStatusReason": "The following resource(s) failed to delete: [ParallelClusterImage]. "
"resourceStatus": "DELETE_FAILED",

If you are not impacted by the issue, the following output is expected:

{
  "message": "CloudFormation Stack for Image build-image does not exist."
}

The root cause of the problem is that the new version of EC2 Image Builder service, which is used by ParallelCluster build-image command, requires two new additional IAM policies imagebuilder:GetImage and imagebuilder:CancelImageCreation.

Affected versions

This issue affects ParallelCluster versions from 3.0.0 to 3.1.2 when using build-image functionality.

Mitigation

There are two options to mitigate to problem. Option 1 can be used to manually delete a stack in DELETE_FAILED. Option 2 can be used if you want to use build-image command in the future but want to avoid delete the stack by hand after the image build successfully.

Option1: Manually delete the stack after image become available

Use pcluster describe-image command to make sure the image is already created:

>pcluster describe-image --image-id <your-image-id> --region <image-region>
{
  "imageConfiguration": {
    "url":...
  },
  "imageId": "your-image-id",
  "creationTime": "2022-03-06T21:20:52.000Z",
  "imageBuildStatus": "BUILD_COMPLETE",
  "region": "us-east-1",
  "ec2AmiInfo": {
    "amiName": "...",
    "amiId": "ami-xxx",
    "description": "...",
    "state": "AVAILABLE",
    "tags": [...],
    "architecture": "x86_64"
  },
  "version": "3.1.2"
}

After seeing imageBuildStatus is in BUILD_COMPLETE and image state is AVAILABLE, you can manually delete the stack by using command:

aws cloudformation delete-stack --stack-name <your-image-id>

Or you can go to AWS CloudFormation console, find the stack named with image_id and delete it.

Option2: Add AdditionalIamPolicies and CleanupLambdaRole to build-image config

This option can be used if you want to use build-image command in the future and you want to avoid to delete the stack by hand after the image build successfully.

You can use the CleanupLambdaRole from ParallelCluster documentation. The CleanupLambdaRole should contains policies imagebuilder:GetImage and imagebuilder:CancelImageCreation.

After creating the custom CleanupLambdaRole, add the role to build-image configuration:

Build:
  Iam:
    CleanupLambdaRole: <custom cleanup_lambda_role>