
(aws-logs): log retention failing cdk deployment with OperationAbortedException #17546

Closed
horsmand opened this issue Nov 17, 2021 · 6 comments · Fixed by #17688
Labels: @aws-cdk/aws-logs (Related to Amazon CloudWatch Logs), bug, needs-triage, p1

@horsmand (Contributor)

What is the problem?

The LogRetention custom resource is causing CDK deployments to fail due to a race condition between the log group that the custom resource is trying to create and the log group that is created (and has its retention period set) for the LogRetention Lambda itself.

An issue was previously filed for this and a fix was completed and released, but many people are still reporting that the problem persists. The previous issue is #15709.

Reproduction Steps

This is difficult to reproduce consistently, but the more a CDK app makes use of the LogRetention custom resource, the more likely it is to happen.

The RFDK integration tests (RFDK is a CDK construct library) deploy multiple stacks in parallel and have seen the failure on 7 of the last 10 runs.
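
For context, the failure becomes more likely as more Custom::LogRetention resources are deployed at once: every function or log group that opts into retention management adds another invocation of the shared LogRetention provider Lambda, and parallel stack deployments multiply that further. A minimal sketch of such an app (purely illustrative, not taken from RFDK; the construct IDs and retry values are made up):

import * as cdk from '@aws-cdk/core';
import * as lambda from '@aws-cdk/aws-lambda';
import * as logs from '@aws-cdk/aws-logs';

class ReproStack extends cdk.Stack {
  constructor(scope: cdk.Construct, id: string) {
    super(scope, id);

    // Each function with `logRetention` set adds a Custom::LogRetention resource,
    // all backed by the same provider Lambda, so they can run concurrently.
    for (let i = 0; i < 5; i++) {
      new lambda.Function(this, `Fn${i}`, {
        runtime: lambda.Runtime.NODEJS_14_X,
        handler: 'index.handler',
        code: lambda.Code.fromInline('exports.handler = async () => {};'),
        logRetention: logs.RetentionDays.THREE_DAYS,
        // These retry options map to the maxRetries/base values seen in the
        // handler code below; per this issue, they only cover log group creation.
        logRetentionRetryOptions: {
          base: cdk.Duration.millis(200),
          maxRetries: 10,
        },
      });
    }
  }
}

const app = new cdk.App();
new ReproStack(app, 'ReproStack1');
new ReproStack(app, 'ReproStack2');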

What did you expect to happen?

We were expecting the CDK app to deploy successfully.

What actually happened?

The CDK app failed with the following error:

CREATE_FAILED        | Custom::LogRetention                        | RenderStructRQ2/RenderQueue/LogGroupWrapper (RenderStructRQ2RenderQueueLogGroupWrapper18972F11) Received response status [FAILED] from custom resource. Message returned: A conflicting operation is currently in progress against this resource. Please try again. (RequestId: 5a7d9dda-94e2-4d20-ad9b-eb66d61746fa)
        new LogRetention (/local/home/horsmand/workspace/rfdk/integ/node_modules/@aws-cdk/aws-logs/lib/log-retention.ts:53:22)
        \_ Function.createOrFetch (/local/home/horsmand/workspace/rfdk/integ/packages/aws-rfdk/lib/core/lib/log-group-factory.ts:46:9)
        \_ new RenderQueue (/local/home/horsmand/workspace/rfdk/integ/packages/aws-rfdk/lib/deadline/lib/render-queue.ts:398:37)
        \_ new RenderStruct (/local/home/horsmand/workspace/rfdk/integ/integ/lib/render-struct.ts:103:24)
        \_ Object.<anonymous> (/local/home/horsmand/workspace/rfdk/integ/integ/components/deadline/deadline_02_renderQueue/bin/deadline_02_renderQueue.ts:52:3)
        \_ Module._compile (internal/modules/cjs/loader.js:1085:14)
        \_ Module.m._compile (/local/home/horsmand/workspace/rfdk/integ/node_modules/ts-node/src/index.ts:1365:23)
        \_ Module._extensions..js (internal/modules/cjs/loader.js:1114:10)
        \_ Object.require.extensions.<computed> [as .ts] (/local/home/horsmand/workspace/rfdk/integ/node_modules/ts-node/src/index.ts:1368:12)
        \_ Module.load (internal/modules/cjs/loader.js:950:32)
        \_ Function.Module._load (internal/modules/cjs/loader.js:790:14)
        \_ Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:76:12)
        \_ main (/local/home/horsmand/workspace/rfdk/integ/node_modules/ts-node/src/bin.ts:331:12)
        \_ Object.<anonymous> (/local/home/horsmand/workspace/rfdk/integ/node_modules/ts-node/src/bin.ts:482:3)
        \_ Module._compile (internal/modules/cjs/loader.js:1085:14)
        \_ Object.Module._extensions..js (internal/modules/cjs/loader.js:1114:10)
        \_ Module.load (internal/modules/cjs/loader.js:950:32)
        \_ Function.Module._load (internal/modules/cjs/loader.js:790:14)
        \_ Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:76:12)
        \_ /local/home/horsmand/.nvm/versions/node/v14.17.1/lib/node_modules/npm/node_modules/libnpx/index.js:268:14

CDK CLI Version

1.129.0

Framework Version

No response

Node.js Version

14.17.1

OS

Amazon Linux 2

Language

Typescript

Language Version

TypeScript ~4.4.4

Other information

I modified the LogRetention Lambda to include a few more log statements and was able to reproduce the error.
This is my modified createLogGroupSafe function:

var _a;

let retryCount = (options === null || options === void 0 ? void 0 : options.maxRetries) == undefined ? 10 : options.maxRetries;
console.log(`retryCount: ${retryCount}`); // DEBUGGING
const delay = ((_a = options === null || options === void 0 ? void 0 : options.retryOptions) === null || _a === void 0 ? void 0 : _a.base) == undefined ? 10 : options.retryOptions.base;
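// The two assignments above are transpiled output; they amount to:
// retryCount = options?.maxRetries ?? 10 and delay = options?.retryOptions?.base ?? 10.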
do {
    try {
        const cloudwatchlogs = new AWS.CloudWatchLogs({ apiVersion: '2014-03-28', region, ...options });
        await cloudwatchlogs.createLogGroup({ logGroupName }).promise();
        console.log(`LogGroup ${logGroupName} created successfully. Returning.`); // DEBUGGING
        return;
    }
    catch (error) {
        console.log('Caught error'); // DEBUGGING
        console.log(error); // DEBUGGING
        console.log(error.code); // DEBUGGING
        if (error.code === 'ResourceAlreadyExistsException') {
            // The log group is already created by the lambda execution
            console.log('ResourceAlreadyExistsException'); // DEBUGGING
            return;
        }
        if (error.code === 'OperationAbortedException') {
            console.log(`Caught OperationAbortedException. Retry: ${retryCount}`); // DEBUGGING
            if (retryCount > 0) {
                retryCount--;
                console.log(`Retrying with new retry count: ${retryCount}`); // DEBUGGING
                await new Promise(resolve => setTimeout(resolve, delay));
                continue;
            }
            else {
                console.log('Out of retry attempts'); // DEBUGGING
                // The log group is still being created by another execution but we are out of retries
                throw new Error('Out of attempts to create a logGroup');
            }
        }
        // Any other error
        console.error(error);
        throw error;
    }
} while (true); // exit happens on retry count check

This produced the following log:

START RequestId: d0d569a4-4171-468a-988f-b09e041e0644 Version: $LATEST
2021-11-17T02:55:41.181Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    
{
    "RequestType": "Create",
    "ServiceToken": "arn:aws:lambda:us-west-2:##########:function:RFDKInteg-RQ-ComponentTie-LogRetentionaae0aa3c5b4d-Wm8HvYy4jYOD",
    "ResponseURL": "https://cloudformation-custom-resource-response-uswest2.s3-us-west-2.amazonaws.com/arn%3Aaws%3Acloudformation%3Aus-west-2%3A##########%3Astack/RFDKInteg-RQ-ComponentTier1637117337359895769/98ae8520-4751-11ec-a61d-0284b18a205f%7CRenderStructRQ2RenderQueueLogGroupWrapper18972F11%7C5a7d9dda-94e2-4d20-ad9b-eb66d61746fa?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20211117T025540Z&X-Amz-SignedHeaders=host&X-Amz-Expires=7200&X-Amz-Credential=AKIA54RCMT6SDABHSHLV%2F20211117%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Signature=49cb653700d2f4d9d1a165e48f90cfade037da40c77ab2b356dfa6d73bb83ced",
    "StackId": "arn:aws:cloudformation:us-west-2:##########:stack/RFDKInteg-RQ-ComponentTier1637117337359895769/98ae8520-4751-11ec-a61d-0284b18a205f",
    "RequestId": "5a7d9dda-94e2-4d20-ad9b-eb66d61746fa",
    "LogicalResourceId": "RenderStructRQ2RenderQueueLogGroupWrapper18972F11",
    "ResourceType": "Custom::LogRetention",
    "ResourceProperties": {
        "ServiceToken": "arn:aws:lambda:us-west-2:##########:function:RFDKInteg-RQ-ComponentTie-LogRetentionaae0aa3c5b4d-Wm8HvYy4jYOD",
        "RetentionInDays": "3",
        "LogGroupName": "RFDKInteg-RQ-ComponentTier1637117337359895769-RenderStructRQ2RenderQueue"
    }
}

2021-11-17T02:55:41.188Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    retryCount: 10
2021-11-17T02:55:41.867Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    LogGroup RFDKInteg-RQ-ComponentTier1637117337359895769-RenderStructRQ2RenderQueue created successfully. Returning.
2021-11-17T02:55:42.006Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    retryCount: 10
2021-11-17T02:55:42.130Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    Caught error
2021-11-17T02:55:42.170Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    ResourceAlreadyExistsException: The specified log group already exists
    at Request.extractError (/var/runtime/node_modules/aws-sdk/lib/protocol/json.js:52:27)
    at Request.callListeners (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:106:20)
    at Request.emit (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:78:10)
    at Request.emit (/var/runtime/node_modules/aws-sdk/lib/request.js:688:14)
    at Request.transition (/var/runtime/node_modules/aws-sdk/lib/request.js:22:10)
    at AcceptorStateMachine.runTo (/var/runtime/node_modules/aws-sdk/lib/state_machine.js:14:12)
    at /var/runtime/node_modules/aws-sdk/lib/state_machine.js:26:10
    at Request.<anonymous> (/var/runtime/node_modules/aws-sdk/lib/request.js:38:9)
    at Request.<anonymous> (/var/runtime/node_modules/aws-sdk/lib/request.js:690:12)
    at Request.callListeners (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:116:18) {
  code: 'ResourceAlreadyExistsException',
  time: 2021-11-17T02:55:42.129Z,
  requestId: '43e6312e-25b3-4d54-a95d-01c393e66280',
  statusCode: 400,
  retryable: false,
  retryDelay: 63.22735256810377
}
2021-11-17T02:55:42.207Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    ResourceAlreadyExistsException
2021-11-17T02:55:42.207Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    ResourceAlreadyExistsException
2021-11-17T02:55:42.290Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    OperationAbortedException: A conflicting operation is currently in progress against this resource. Please try again.
    at Request.extractError (/var/runtime/node_modules/aws-sdk/lib/protocol/json.js:52:27)
    at Request.callListeners (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:106:20)
    at Request.emit (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:78:10)
    at Request.emit (/var/runtime/node_modules/aws-sdk/lib/request.js:688:14)
    at Request.transition (/var/runtime/node_modules/aws-sdk/lib/request.js:22:10)
    at AcceptorStateMachine.runTo (/var/runtime/node_modules/aws-sdk/lib/state_machine.js:14:12)
    at /var/runtime/node_modules/aws-sdk/lib/state_machine.js:26:10
    at Request.<anonymous> (/var/runtime/node_modules/aws-sdk/lib/request.js:38:9)
    at Request.<anonymous> (/var/runtime/node_modules/aws-sdk/lib/request.js:690:12)
    at Request.callListeners (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:116:18) {
  code: 'OperationAbortedException',
  time: 2021-11-17T02:55:42.290Z,
  requestId: '16d429d1-1e27-4575-921e-237d508d1318',
  statusCode: 400,
  retryable: false,
  retryDelay: 28.945744461931877
}
2021-11-17T02:55:42.307Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    Responding 
{
    "Status": "FAILED",
    "Reason": "A conflicting operation is currently in progress against this resource. Please try again.",
    "PhysicalResourceId": "RFDKInteg-RQ-ComponentTier1637117337359895769-RenderStructRQ2RenderQueue",
    "StackId": "arn:aws:cloudformation:us-west-2:##########:stack/RFDKInteg-RQ-ComponentTier1637117337359895769/98ae8520-4751-11ec-a61d-0284b18a205f",
    "RequestId": "5a7d9dda-94e2-4d20-ad9b-eb66d61746fa",
    "LogicalResourceId": "RenderStructRQ2RenderQueueLogGroupWrapper18972F11",
    "Data": {
        "LogGroupName": "RFDKInteg-RQ-ComponentTier1637117337359895769-RenderStructRQ2RenderQueue"
    }
}

END RequestId: d0d569a4-4171-468a-988f-b09e041e0644

From this we can see that:

  1. It successfully created the log group it was asked to create.
  2. It hit a ResourceAlreadyExistsException, likely when trying to create the log group for the Lambda itself, which is expected if the Lambda has previously run.
  3. After that, it goes straight to an OperationAbortedException without hitting the "Caught error" logging line I added or flowing through any of the retry logic.

I believe this logging indicates that the OperationAbortedException might not be coming from createLogGroupSafe and might instead be coming from the setRetentionPolicy call that happens afterwards.

@horsmand (Contributor, Author)

I've added more logging to the setRetentionPolicy method, and it confirms my suspicion that this is where the remaining failures come from.
The logging I added to the method was:

async function setRetentionPolicy(logGroupName, region, options, retentionInDays) {
    const cloudwatchlogs = new AWS.CloudWatchLogs({ apiVersion: '2014-03-28', region, ...options });
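    // Note: unlike createLogGroupSafe above, this function has no retry loop,
    // so an OperationAbortedException from the SDK surfaces immediately.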
    if (!retentionInDays) {
        console.log(`Attempting to delete retention policy for ${logGroupName}`); // DEBUGGING
        try {
            await cloudwatchlogs.deleteRetentionPolicy({ logGroupName }).promise();
        } catch (error) {
            console.log(`Failed to delete retention policy with error code ${error.code}`); // DEBUGGING
            throw error;
        }
    }
    else {
        console.log(`Attempting to update retention policy for ${logGroupName}`); // DEBUGGING
        try {
            await cloudwatchlogs.putRetentionPolicy({ logGroupName, retentionInDays }).promise();
        } catch (error) {
            console.log(`Failed to update retention policy with error code ${error.code}`); // DEBUGGING
            throw error;
        }
    }
}

This resulted in:

2021-11-18T16:47:09.700Z    0de1ecd3-d830-43b6-9c32-a9e6ca47130c    INFO    Attempting to update retention policy for /aws/lambda/RFDKInteg-SM-ComponentTie-LogRetentionaae0aa3c5b4d-6syHb7QJHP0y
2021-11-18T16:47:09.824Z    0de1ecd3-d830-43b6-9c32-a9e6ca47130c    INFO    Failed to update retention policy with error code OperationAbortedException
2021-11-18T16:47:09.839Z    0de1ecd3-d830-43b6-9c32-a9e6ca47130c    INFO    OperationAbortedException: A conflicting operation is currently in progress against this resource. Please try again.
    at Request.extractError (/var/runtime/node_modules/aws-sdk/lib/protocol/json.js:52:27)
    at Request.callListeners (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:106:20)
    at Request.emit (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:78:10)
    at Request.emit (/var/runtime/node_modules/aws-sdk/lib/request.js:688:14)
    at Request.transition (/var/runtime/node_modules/aws-sdk/lib/request.js:22:10)
    at AcceptorStateMachine.runTo (/var/runtime/node_modules/aws-sdk/lib/state_machine.js:14:12)
    at /var/runtime/node_modules/aws-sdk/lib/state_machine.js:26:10
    at Request.<anonymous> (/var/runtime/node_modules/aws-sdk/lib/request.js:38:9)
    at Request.<anonymous> (/var/runtime/node_modules/aws-sdk/lib/request.js:690:12)
    at Request.callListeners (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:116:18) {
  code: 'OperationAbortedException',
  time: 2021-11-18T16:47:09.824Z,
  requestId: '1c19ab9b-9f64-4b0c-8200-4510846a165b',
  statusCode: 400,
  retryable: false,
  retryDelay: 55.24597932481656
}

@jaapvanblaaderen (Contributor) commented Nov 25, 2021

I have the exact same issue with 1.133.0 and your fix in #17688 solves it for me.

@jonashdown commented Dec 8, 2021

Could the speed of the machine that CDK is running on be a factor? For us it works when running CDK from developers' MacBooks, but fails when running CDK through GitHub Actions.

@thiagomeireless

Confirming that #17688 also fixes the problem for me, hope it gets merged soon!

@mergify bot closed this as completed in #17688 on Dec 10, 2021
mergify bot pushed a commit that referenced this issue on Dec 10, 2021
Fixes: #17546

This adds to the fix in #16083 that was addressing the issue where the LogRetention Lambda can be executed concurrently and create a race condition where multiple invocations are trying to create or modify the same log group.

The previous fix addressed the issue if it occurred during log group creation, in the `createLogGroupSafe` method, but did not account for the same problem happening when modifying a log group's retention period in the `setRetentionPolicy` method. This fix applies the same logic from the last fix to the other method.

----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
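
For illustration, a minimal sketch of that approach applied to the retention-policy call, reusing the maxRetries/retryOptions.base values that createLogGroupSafe already consumes (the function name and exact structure here are illustrative, not the merged code from #17688):

// Sketch only: assumes const AWS = require('aws-sdk'); as in the handler code above.
async function setRetentionPolicySafe(logGroupName, region, options, retentionInDays) {
    const cloudwatchlogs = new AWS.CloudWatchLogs({ apiVersion: '2014-03-28', region, ...options });
    let retryCount = options?.maxRetries ?? 10;
    const delay = options?.retryOptions?.base ?? 10;
    do {
        try {
            if (!retentionInDays) {
                await cloudwatchlogs.deleteRetentionPolicy({ logGroupName }).promise();
            } else {
                await cloudwatchlogs.putRetentionPolicy({ logGroupName, retentionInDays }).promise();
            }
            return;
        } catch (error) {
            if (error.code === 'OperationAbortedException' && retryCount > 0) {
                // Another invocation is operating on the same log group; back off and retry.
                retryCount--;
                await new Promise(resolve => setTimeout(resolve, delay));
                continue;
            }
            // Out of retries, or a different error: surface it.
            throw error;
        }
    } while (true);
}

As in createLogGroupSafe, only OperationAbortedException is retried; any other error is rethrown immediately.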
@github-actions

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

@mtsoltan

Still occurs as of v1.136.0

TikiTDO pushed a commit to TikiTDO/aws-cdk that referenced this issue Feb 21, 2022