
(aws-logs): log retention failing cdk deployment with OperationAbortedException #17546

Closed
horsmand opened this issue Nov 17, 2021 · 6 comments · Fixed by #17688
Labels: @aws-cdk/aws-logs (Related to Amazon CloudWatch Logs), bug, needs-triage, p1

@horsmand (Contributor)

What is the problem?

The LogRetention custom resource is causing CDK deployments to fail due to a race condition between the log group that the custom resource is trying to create and the log group that is created (and has its retention period set) for the LogRetention Lambda itself.

An issue was previously filed for this and a fix was completed and released, but many people are still reporting that the problem persists. The previous issue is #15709.

Reproduction Steps

This is difficult to reproduce consistently, but the more a CDK app makes use of the LogRetention custom resource, the more likely it is to happen.

The RFDK integration tests (RFDK is a CDK construct library) deploy multiple stacks in parallel and have seen the failure on 7 of the last 10 runs.
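
For context, the failure becomes more likely as more Custom::LogRetention resources are deployed at once: every function or log group that opts into retention management adds another invocation of the shared LogRetention provider Lambda, and parallel stack deployments multiply that further. A minimal sketch of such an app (purely illustrative, not taken from RFDK; the construct IDs and retry values are made up):

import * as cdk from '@aws-cdk/core';
import * as lambda from '@aws-cdk/aws-lambda';
import * as logs from '@aws-cdk/aws-logs';

class ReproStack extends cdk.Stack {
  constructor(scope: cdk.Construct, id: string) {
    super(scope, id);

    // Each function with `logRetention` set adds a Custom::LogRetention resource,
    // all backed by the same provider Lambda, so they can run concurrently.
    for (let i = 0; i < 5; i++) {
      new lambda.Function(this, `Fn${i}`, {
        runtime: lambda.Runtime.NODEJS_14_X,
        handler: 'index.handler',
        code: lambda.Code.fromInline('exports.handler = async () => {};'),
        logRetention: logs.RetentionDays.THREE_DAYS,
        // These retry options map to the maxRetries/base values seen in the
        // handler code below; per this issue, they only cover log group creation.
        logRetentionRetryOptions: {
          base: cdk.Duration.millis(200),
          maxRetries: 10,
        },
      });
    }
  }
}

const app = new cdk.App();
new ReproStack(app, 'ReproStack1');
new ReproStack(app, 'ReproStack2');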

What did you expect to happen?

We were expecting the CDK app to deploy successfully.

What actually happened?

The CDK app failed with the following error:

CREATE_FAILED        | Custom::LogRetention                        | RenderStructRQ2/RenderQueue/LogGroupWrapper (RenderStructRQ2RenderQueueLogGroupWrapper18972F11) Received response status [FAILED] from custom resource. Message returned: A conflicting operation is currently in progress against this resource. Please try again. (RequestId: 5a7d9dda-94e2-4d20-ad9b-eb66d61746fa)
        new LogRetention (/local/home/horsmand/workspace/rfdk/integ/node_modules/@aws-cdk/aws-logs/lib/log-retention.ts:53:22)
        \_ Function.createOrFetch (/local/home/horsmand/workspace/rfdk/integ/packages/aws-rfdk/lib/core/lib/log-group-factory.ts:46:9)
        \_ new RenderQueue (/local/home/horsmand/workspace/rfdk/integ/packages/aws-rfdk/lib/deadline/lib/render-queue.ts:398:37)
        \_ new RenderStruct (/local/home/horsmand/workspace/rfdk/integ/integ/lib/render-struct.ts:103:24)
        \_ Object.<anonymous> (/local/home/horsmand/workspace/rfdk/integ/integ/components/deadline/deadline_02_renderQueue/bin/deadline_02_renderQueue.ts:52:3)
        \_ Module._compile (internal/modules/cjs/loader.js:1085:14)
        \_ Module.m._compile (/local/home/horsmand/workspace/rfdk/integ/node_modules/ts-node/src/index.ts:1365:23)
        \_ Module._extensions..js (internal/modules/cjs/loader.js:1114:10)
        \_ Object.require.extensions.<computed> [as .ts] (/local/home/horsmand/workspace/rfdk/integ/node_modules/ts-node/src/index.ts:1368:12)
        \_ Module.load (internal/modules/cjs/loader.js:950:32)
        \_ Function.Module._load (internal/modules/cjs/loader.js:790:14)
        \_ Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:76:12)
        \_ main (/local/home/horsmand/workspace/rfdk/integ/node_modules/ts-node/src/bin.ts:331:12)
        \_ Object.<anonymous> (/local/home/horsmand/workspace/rfdk/integ/node_modules/ts-node/src/bin.ts:482:3)
        \_ Module._compile (internal/modules/cjs/loader.js:1085:14)
        \_ Object.Module._extensions..js (internal/modules/cjs/loader.js:1114:10)
        \_ Module.load (internal/modules/cjs/loader.js:950:32)
        \_ Function.Module._load (internal/modules/cjs/loader.js:790:14)
        \_ Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:76:12)
        \_ /local/home/horsmand/.nvm/versions/node/v14.17.1/lib/node_modules/npm/node_modules/libnpx/index.js:268:14

CDK CLI Version

1.129.0

Framework Version

No response

Node.js Version

14.17.1

OS

Amazon Linux 2

Language

Typescript

Language Version

TypeScript ~4.4.4

Other information

I modified the LogRetention Lambda to include a few more log statements and was able to reproduce the error.
This is my modified createLogGroupSafe function:

var _a;

let retryCount = (options === null || options === void 0 ? void 0 : options.maxRetries) == undefined ? 10 : options.maxRetries;
console.log(`retryCount: ${retryCount}`); // DEBUGGING
const delay = ((_a = options === null || options === void 0 ? void 0 : options.retryOptions) === null || _a === void 0 ? void 0 : _a.base) == undefined ? 10 : options.retryOptions.base;
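// The two assignments above are transpiled output; they amount to:
// retryCount = options?.maxRetries ?? 10 and delay = options?.retryOptions?.base ?? 10.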
do {
    try {
        const cloudwatchlogs = new AWS.CloudWatchLogs({ apiVersion: '2014-03-28', region, ...options });
        await cloudwatchlogs.createLogGroup({ logGroupName }).promise();
        console.log(`LogGroup ${logGroupName} created successfully. Returning.`); // DEBUGGING
        return;
    }
    catch (error) {
        console.log('Caught error'); // DEBUGGING
        console.log(error); // DEBUGGING
        console.log(error.code); // DEBUGGING
        if (error.code === 'ResourceAlreadyExistsException') {
            // The log group is already created by the lambda execution
            console.log('ResourceAlreadyExistsException'); // DEBUGGING
            return;
        }
        if (error.code === 'OperationAbortedException') {
            console.log(`Caught OperationAbortedException. Retry: ${retryCount}`); // DEBUGGING
            if (retryCount > 0) {
                retryCount--;
                console.log(`Retrying with new retry count: ${retryCount}`); // DEBUGGING
                await new Promise(resolve => setTimeout(resolve, delay));
                continue;
            }
            else {
                console.log('Out of retry attempts'); // DEBUGGING
                // The log group is still being created by another execution but we are out of retries
                throw new Error('Out of attempts to create a logGroup');
            }
        }
        // Any other error
        console.error(error);
        throw error;
    }
} while (true); // exit happens on retry count check

This produced the following log:

START RequestId: d0d569a4-4171-468a-988f-b09e041e0644 Version: $LATEST
2021-11-17T02:55:41.181Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    
{
    "RequestType": "Create",
    "ServiceToken": "arn:aws:lambda:us-west-2:##########:function:RFDKInteg-RQ-ComponentTie-LogRetentionaae0aa3c5b4d-Wm8HvYy4jYOD",
    "ResponseURL": "https://cloudformation-custom-resource-response-uswest2.s3-us-west-2.amazonaws.com/arn%3Aaws%3Acloudformation%3Aus-west-2%3A##########%3Astack/RFDKInteg-RQ-ComponentTier1637117337359895769/98ae8520-4751-11ec-a61d-0284b18a205f%7CRenderStructRQ2RenderQueueLogGroupWrapper18972F11%7C5a7d9dda-94e2-4d20-ad9b-eb66d61746fa?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20211117T025540Z&X-Amz-SignedHeaders=host&X-Amz-Expires=7200&X-Amz-Credential=AKIA54RCMT6SDABHSHLV%2F20211117%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Signature=49cb653700d2f4d9d1a165e48f90cfade037da40c77ab2b356dfa6d73bb83ced",
    "StackId": "arn:aws:cloudformation:us-west-2:##########:stack/RFDKInteg-RQ-ComponentTier1637117337359895769/98ae8520-4751-11ec-a61d-0284b18a205f",
    "RequestId": "5a7d9dda-94e2-4d20-ad9b-eb66d61746fa",
    "LogicalResourceId": "RenderStructRQ2RenderQueueLogGroupWrapper18972F11",
    "ResourceType": "Custom::LogRetention",
    "ResourceProperties": {
        "ServiceToken": "arn:aws:lambda:us-west-2:##########:function:RFDKInteg-RQ-ComponentTie-LogRetentionaae0aa3c5b4d-Wm8HvYy4jYOD",
        "RetentionInDays": "3",
        "LogGroupName": "RFDKInteg-RQ-ComponentTier1637117337359895769-RenderStructRQ2RenderQueue"
    }
}

2021-11-17T02:55:41.188Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    retryCount: 10
2021-11-17T02:55:41.867Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    LogGroup RFDKInteg-RQ-ComponentTier1637117337359895769-RenderStructRQ2RenderQueue created successfully. Returning.
2021-11-17T02:55:42.006Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    retryCount: 10
2021-11-17T02:55:42.130Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    Caught error
2021-11-17T02:55:42.170Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    ResourceAlreadyExistsException: The specified log group already exists
    at Request.extractError (/var/runtime/node_modules/aws-sdk/lib/protocol/json.js:52:27)
    at Request.callListeners (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:106:20)
    at Request.emit (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:78:10)
    at Request.emit (/var/runtime/node_modules/aws-sdk/lib/request.js:688:14)
    at Request.transition (/var/runtime/node_modules/aws-sdk/lib/request.js:22:10)
    at AcceptorStateMachine.runTo (/var/runtime/node_modules/aws-sdk/lib/state_machine.js:14:12)
    at /var/runtime/node_modules/aws-sdk/lib/state_machine.js:26:10
    at Request.<anonymous> (/var/runtime/node_modules/aws-sdk/lib/request.js:38:9)
    at Request.<anonymous> (/var/runtime/node_modules/aws-sdk/lib/request.js:690:12)
    at Request.callListeners (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:116:18) {
  code: 'ResourceAlreadyExistsException',
  time: 2021-11-17T02:55:42.129Z,
  requestId: '43e6312e-25b3-4d54-a95d-01c393e66280',
  statusCode: 400,
  retryable: false,
  retryDelay: 63.22735256810377
}
2021-11-17T02:55:42.207Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    ResourceAlreadyExistsException
2021-11-17T02:55:42.207Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    ResourceAlreadyExistsException
2021-11-17T02:55:42.290Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    OperationAbortedException: A conflicting operation is currently in progress against this resource. Please try again.
    at Request.extractError (/var/runtime/node_modules/aws-sdk/lib/protocol/json.js:52:27)
    at Request.callListeners (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:106:20)
    at Request.emit (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:78:10)
    at Request.emit (/var/runtime/node_modules/aws-sdk/lib/request.js:688:14)
    at Request.transition (/var/runtime/node_modules/aws-sdk/lib/request.js:22:10)
    at AcceptorStateMachine.runTo (/var/runtime/node_modules/aws-sdk/lib/state_machine.js:14:12)
    at /var/runtime/node_modules/aws-sdk/lib/state_machine.js:26:10
    at Request.<anonymous> (/var/runtime/node_modules/aws-sdk/lib/request.js:38:9)
    at Request.<anonymous> (/var/runtime/node_modules/aws-sdk/lib/request.js:690:12)
    at Request.callListeners (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:116:18) {
  code: 'OperationAbortedException',
  time: 2021-11-17T02:55:42.290Z,
  requestId: '16d429d1-1e27-4575-921e-237d508d1318',
  statusCode: 400,
  retryable: false,
  retryDelay: 28.945744461931877
}
2021-11-17T02:55:42.307Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    Responding 
{
    "Status": "FAILED",
    "Reason": "A conflicting operation is currently in progress against this resource. Please try again.",
    "PhysicalResourceId": "RFDKInteg-RQ-ComponentTier1637117337359895769-RenderStructRQ2RenderQueue",
    "StackId": "arn:aws:cloudformation:us-west-2:##########:stack/RFDKInteg-RQ-ComponentTier1637117337359895769/98ae8520-4751-11ec-a61d-0284b18a205f",
    "RequestId": "5a7d9dda-94e2-4d20-ad9b-eb66d61746fa",
    "LogicalResourceId": "RenderStructRQ2RenderQueueLogGroupWrapper18972F11",
    "Data": {
        "LogGroupName": "RFDKInteg-RQ-ComponentTier1637117337359895769-RenderStructRQ2RenderQueue"
    }
}

END RequestId: d0d569a4-4171-468a-988f-b09e041e0644

From this we can see that:

  1. It successfully created the log group it was asked to create.
  2. It hit a ResourceAlreadyExistsException, likely when trying to create the log group for the Lambda itself, which is expected if the Lambda has previously run.
  3. After that, it goes straight to an OperationAbortedException without hitting the "Caught error" logging line I added or flowing through any of the retry logic.

I believe this logging indicates that the OperationAbortedException might not be coming from createLogGroupSafe and might instead be coming from the setRetentionPolicy call that happens afterwards.

@horsmand (Contributor, Author)

I've added more logging to the setRetentionPolicy method, and it confirms my suspicion that this is where the remaining failures come from.
The logging I added to the method was:

async function setRetentionPolicy(logGroupName, region, options, retentionInDays) {
    const cloudwatchlogs = new AWS.CloudWatchLogs({ apiVersion: '2014-03-28', region, ...options });
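    // Note: unlike createLogGroupSafe above, this function has no retry loop,
    // so an OperationAbortedException from the SDK surfaces immediately.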
    if (!retentionInDays) {
        console.log(`Attempting to delete retention policy for ${logGroupName}`); // DEBUGGING
        try {
            await cloudwatchlogs.deleteRetentionPolicy({ logGroupName }).promise();
        } catch (error) {
            console.log(`Failed to delete retention policy with error code ${error.code}`); // DEBUGGING
            throw error;
        }
    }
    else {
        console.log(`Attempting to update retention policy for ${logGroupName}`); // DEBUGGING
        try {
            await cloudwatchlogs.putRetentionPolicy({ logGroupName, retentionInDays }).promise();
        } catch (error) {
            console.log(`Failed to update retention policy with error code ${error.code}`); // DEBUGGING
            throw error;
        }
    }
}

This resulted in:

2021-11-18T16:47:09.700Z    0de1ecd3-d830-43b6-9c32-a9e6ca47130c    INFO    Attempting to update retention policy for /aws/lambda/RFDKInteg-SM-ComponentTie-LogRetentionaae0aa3c5b4d-6syHb7QJHP0y
2021-11-18T16:47:09.824Z    0de1ecd3-d830-43b6-9c32-a9e6ca47130c    INFO    Failed to update retention policy with error code OperationAbortedException
2021-11-18T16:47:09.839Z    0de1ecd3-d830-43b6-9c32-a9e6ca47130c    INFO    OperationAbortedException: A conflicting operation is currently in progress against this resource. Please try again.
    at Request.extractError (/var/runtime/node_modules/aws-sdk/lib/protocol/json.js:52:27)
    at Request.callListeners (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:106:20)
    at Request.emit (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:78:10)
    at Request.emit (/var/runtime/node_modules/aws-sdk/lib/request.js:688:14)
    at Request.transition (/var/runtime/node_modules/aws-sdk/lib/request.js:22:10)
    at AcceptorStateMachine.runTo (/var/runtime/node_modules/aws-sdk/lib/state_machine.js:14:12)
    at /var/runtime/node_modules/aws-sdk/lib/state_machine.js:26:10
    at Request.<anonymous> (/var/runtime/node_modules/aws-sdk/lib/request.js:38:9)
    at Request.<anonymous> (/var/runtime/node_modules/aws-sdk/lib/request.js:690:12)
    at Request.callListeners (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:116:18) {
  code: 'OperationAbortedException',
  time: 2021-11-18T16:47:09.824Z,
  requestId: '1c19ab9b-9f64-4b0c-8200-4510846a165b',
  statusCode: 400,
  retryable: false,
  retryDelay: 55.24597932481656
}

@jaapvanblaaderen (Contributor) commented Nov 25, 2021

I have the exact same issue with 1.133.0 and your fix in #17688 solves it for me.

@jonashdown commented Dec 8, 2021

Could the speed of the machine that CDK is running on be a factor? For us it works when running CDK from developers' MacBooks, but fails when running CDK through GitHub Actions.

@thiagomeireless

Confirming that #17688 also fixes the problem for me, hope it gets merged soon!

@mergify bot closed this as completed in #17688 on Dec 10, 2021
mergify bot pushed a commit that referenced this issue on Dec 10, 2021
Fixes: #17546

This adds to the fix in #16083 that was addressing the issue where the LogRetention Lambda can be executed concurrently and create a race condition where multiple invocations are trying to create or modify the same log group.

The previous fix addressed the issue if it occurred during log group creation, in the `createLogGroupSafe` method, but did not account for the same problem happening when modifying a log group's retention period in the `setRetentionPolicy` method. This fix applies the same logic from the last fix to the other method.

----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
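
For illustration, a minimal sketch of that approach applied to the retention-policy call, reusing the maxRetries/retryOptions.base values that createLogGroupSafe already consumes (the function name and exact structure here are illustrative, not the merged code from #17688):

// Sketch only: assumes const AWS = require('aws-sdk'); as in the handler code above.
async function setRetentionPolicySafe(logGroupName, region, options, retentionInDays) {
    const cloudwatchlogs = new AWS.CloudWatchLogs({ apiVersion: '2014-03-28', region, ...options });
    let retryCount = options?.maxRetries ?? 10;
    const delay = options?.retryOptions?.base ?? 10;
    do {
        try {
            if (!retentionInDays) {
                await cloudwatchlogs.deleteRetentionPolicy({ logGroupName }).promise();
            } else {
                await cloudwatchlogs.putRetentionPolicy({ logGroupName, retentionInDays }).promise();
            }
            return;
        } catch (error) {
            if (error.code === 'OperationAbortedException' && retryCount > 0) {
                // Another invocation is operating on the same log group; back off and retry.
                retryCount--;
                await new Promise(resolve => setTimeout(resolve, delay));
                continue;
            }
            // Out of retries, or a different error: surface it.
            throw error;
        }
    } while (true);
}

As in createLogGroupSafe, only OperationAbortedException is retried; any other error is rethrown immediately.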
@github-actions

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

@mtsoltan

Still occurs as of v1.136.0

TikiTDO pushed a commit to TikiTDO/aws-cdk that referenced this issue Feb 21, 2022