
MIGRATION ISSUE: Lambdas seemingly not cleaning up/disposing of RDS connections properly after upgrade #6369

Open
gcv-epalmer opened this issue Aug 8, 2024 · 6 comments
Labels
v2-v3-inconsistency Behavior has changed from v2 to v3, or feature is missing altogether

Comments

@gcv-epalmer

gcv-epalmer commented Aug 8, 2024

Pre-Migration Checklist

Which JavaScript Runtime is this issue in?

Node.js (includes AWS Lambda)

AWS Lambda Usage

  • Yes, my application is running on AWS Lambda.

Describe the Migration Issue

Update: FWIW, looking into this being a possible issue with Node20: nodejs/node#47130
We'll be testing with Node18 tonight

We have a nightly process that runs a large number of concurrent Step Functions executions containing lambdas which open connections to RDS to get some data. These lambdas use SDK v3 to talk to S3, Secrets Manager, and DynamoDB. We are using AWS RDS Proxy and read replicas to help manage the high number of concurrent connections, and our code always closes the connection after getting data. This process had been running fine for months and months at our current scale, until we upgraded to AWS SDK v3 and moved our lambdas to Node 20 (from Node 16). Suddenly, after updating, the lambdas in the step function are erroring out from RDS in a few ways:

unnamed prepared statement does not exist
could not send data to client: Connection reset by peer
Timed-out waiting to acquire database connection
canceling statement due to conflict with recovery

After looking at some of the performance charts in RDS, it's clear there is a change in behavior on the day our updated SDK lambdas were deployed:

[Screenshots: RDS performance charts (rds1, rds2, rds3)]

You can see our lambda function duration has also increased after this update:
[Screenshot: Lambda duration chart (duration1)]

I know the SDK should have nothing to do with connecting to RDS, but after lots of testing the only thing that alleviates the issue is going back to SDK v2 on Node 16. So I don't know where the source of the issue is; it's a chicken-or-egg scenario. Is the longer lambda duration causing us to keep RDS connections open longer, therefore bogging down RDS to the point where it breaks? Or is there some behind-the-scenes change in how the RDS connections are being opened and closed that is causing lambda environments to stay alive longer? I've done a good amount of testing with various logical changes and optimizations, with no effect. Only by rolling back the code to SDK v2 and Node 16 does our performance come back to normal.

Code Comparison

Previously, our v2 code initialized clients outside of handlers, like so:
V2 Code Snippet:

const AWS = AwsProvider.getAws();
const s3 = new AWS.S3({ apiVersion: "2006-03-01", signatureVersion: "v4" });

After upgrading and doing some investigation, reading issues other folks have been having led me to implement a 'lazy-loading' strategy.
V3 Code Snippet:

let s3Client: S3Client | undefined;

export const s3 = () => {
  if (!s3Client) {
    s3Client = new S3Client({
      apiVersion: "2006-03-01",
    });
  }
  return s3Client;
};

After reading about the changes to keep-alive and maxSockets, I tried to configure our S3 client to behave more like SDK v2, since I've seen some GitHub issues related specifically to the S3 client, which our lambdas use heavily.

let s3Client: S3Client | undefined;

const connectionTimeout = 5000;
const socketTimeout = 120000;
const maxSockets = 1000;
const keepAlive = false;

export const s3 = () => {
  if (!s3Client) {
    s3Client = new S3Client({
      apiVersion: "2006-03-01",
      requestHandler: new NodeHttpHandler({
        connectionTimeout,
        socketTimeout,
        httpAgent: new HttpAgent({ maxSockets, keepAlive }),
        httpsAgent: new HttpsAgent({ maxSockets, keepAlive }),
      }),
    });
  }
  return s3Client;
};

And here is an example of the RDS connection and query logic that we use within our lambdas:

  let client: Client = {} as Client;
  try {
    client = await SqlClient.getReadClient();
    await SqlClient.openConnection(client);
    let data = await getDataFromRDS(client);

    return data;
  } catch (e) {
    throw e
  } finally {
    await SqlClient.close(client);
  }

Observed Differences/Errors

It's hard to provide specific error messages because this isn't a simple reproducible issue. The issue seems to be something to do with concurrent lambda executions that are talking to RDS. I took some performance metrics from CloudWatch logs on the lambda to try to see if any one particular client or piece of logic is taking longer now, and honestly it seems like our code execution is faster now on SDK v3, but something behind the scenes is still causing the lambdas to take longer.

Here are the metrics I gathered from a test with the SDK v3 / Node 20 code vs. the SDK v2 / Node 16 code:

On 8/6, with SDK v2 and node 16, we had 586 total executions. Here is a breakdown of avg # of milliseconds to complete the different steps within our lambda function:

{
  init: 39.62,
  streamRead: 58.19,
  customerHistory: 4507.16,
  evalr: 11.54,
  rules (combined customerHistory and evalr): 4652.07,
  lltest: 9.96,
  streamwrite: 286.58,
  savedynamo: 3415.17,
  finish: 0.89
}

Here are average performance metrics for that same set of executions

{
  cpu_system_time: 320.68,
  total_network: 5915693.03,
  billed_duration: 9126.73,
  cold_start: 12.11%,
  billed_mb_ms: 18691557.78,
  threads_max: 10.018,
  rx_bytes: 3854474.40,
  agent_memory_max: 9.49,
  memory_utilization: 19.95,
  duration: 9125.73,
  used_memory_max: 418.99,
  cpu_user_time: 2367.51,
  cpu_total_time: 2688.21,
  agent_memory_avg: 9.42
}

And compared to AWS SDK v3 on node 20, the same set of metrics for 620 total executions

{
  init: 53.40,
  streamRead: 67.41,
  customerHistory: 2586.70,
  evalr: 9.72,
  rules (combined customerHistory and evalr): 2681.34,
  lltest: 8.65,
  streamwrite: 338.54,
  savedynamo: 1897.37,
  finish: 0.98
}
{
  cpu_system_time: 569.48,
  total_network: 3532959.68,
  billed_duration: 49140.67,
  cold_start: 28.87%,
  billed_mb_ms: 100640107.35,
  threads_max: 9.84,
  rx_bytes: 1919114.32,
  agent_memory_max: 8.62,
  memory_utilization: 16.22,
  duration: 49139.67,
  used_memory_max: 343.27,
  cpu_user_time: 2241.47,
  cpu_total_time: 2810.95,
  agent_memory_avg: 8.59
}

On SDK v2, our lambda logic on average took longer to complete; looking at the steps that take the longest:
rules v2: 4652.07 ms avg
rules v3: 2681.34 ms avg

savedynamo v2: 3415.17 ms avg
savedynamo v3: 1897.37 ms avg

Yet the lambda performance logs indicate the duration to be greater on v3:
v2 duration avg 9125.73
v3 duration avg 49139.67

And there is also a significant increase in cold starts, going from 12.11% on SDK v2 to 28.87% on SDK v3.

I opened an AWS Support ticket for this, and they pointed out that errors in the lambda could explain the higher duration and increased cold starts. But the errors are coming from RDS as a result of the high number of concurrent connections, and I can't explain why that is happening.

Additional Context

AWS services involved:

  • "@aws-sdk/client-s3"
  • import { Upload } from "@aws-sdk/lib-storage";
  • import { getSignedUrl as awsGetSignedUrl } from "@aws-sdk/s3-request-presigner";
  • "@aws-sdk/client-ssm"
  • "@aws-sdk/client-dynamodb"
  • "@aws-sdk/lib-dynamodb"
  • "@aws-sdk/util-dynamodb"

All Project AWS Dependencies:
"@aws-sdk/client-cloudwatch-logs": "3.490.0",
"@aws-sdk/client-dynamodb": "3.614.0",
"@aws-sdk/client-lambda": "3.614.0",
"@aws-sdk/client-location": "3.614.0",
"@aws-sdk/client-s3": "3.614.0",
"@aws-sdk/client-secrets-manager": "3.614.0",
"@aws-sdk/client-ses": "3.614.0",
"@aws-sdk/client-sfn": "3.614.0",
"@aws-sdk/client-sns": "3.614.0",
"@aws-sdk/client-sqs": "3.614.0",
"@aws-sdk/client-ssm": "3.614.0",
"@aws-sdk/client-sts": "3.614.0",
"@aws-sdk/lib-dynamodb": "3.614.0",
"@aws-sdk/lib-storage": "3.614.0",
"@aws-sdk/s3-request-presigner": "3.614.0",
"@aws-sdk/util-dynamodb": "3.614.0",

Our postgres library to connect to RDS:
"pg": "8.12.0",
"pg-promise": "11.9.1",

Notably, we currently aren't bundling our app. I've seen that as a recommendation, so I'm going to try it even though it feels unrelated to this issue.
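
For reference, here's a minimal sketch of what bundling could look like for us with esbuild (the entry path and options are illustrative, not our real build; it keeps @aws-sdk/* external so we'd still use the SDK preinstalled in the Lambda runtime):

// build.ts, a hypothetical esbuild script (paths and options are illustrative)
import { build } from "esbuild";

build({
  entryPoints: ["src/lambdas/workflows/ProblemLambda.ts"], // hypothetical source path
  outfile: "dist/lambdas/workflows/ProblemLambda.js",      // matches the Handler path in our template below
  bundle: true,
  platform: "node",
  target: "node20",
  format: "cjs",
  sourcemap: true,
  // don't bundle the SDK; rely on the version preinstalled in the Lambda runtime
  external: ["@aws-sdk/*"],
}).catch(() => process.exit(1));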

We have all of the AWS client packages installed as dev dependencies to utilize the preinstalled SDK in the lambda environments. I tried doing some tests where we would have some AWS SDK v2 code that we bundle and deploy alongside v3, but the bundle size becomes too large. We either have to go full Node 16 with SDK v2, or Node 20 with SDK v3; there's no middle ground for us without tediously splitting apart our services and repositories and maintaining separate copies.

We deploy our lambdas through CloudFormation serverless templates; here's how the particular lambda in question is set up:

  ProblemLambda:
    Type: AWS::Serverless::Function
    Properties:
      Description: SalesIngestionV4
      Handler: dist/lambdas/workflows/ProblemLambda.handler
      Runtime: nodejs20.x
      Architectures:
        - arm64
      FunctionName: "ProblemLambda"
      CodeUri: ./
      LoggingConfig:
        LogFormat: JSON
        SystemLogLevel: WARN
      VpcConfig:
        SecurityGroupIds:
          - !Ref RDSSecurityGroup
        SubnetIds:
          - !Ref RDSSubnet1
          - !Ref RDSSubnet2
      Role: ProblemLambda-role
      Timeout: 900
      Tracing: Active
      MemorySize: 2048
      Tags:
        Name: !Sub ProblemLambda
@gcv-epalmer gcv-epalmer added needs-triage This issue or PR still needs to be triaged. v2-v3-inconsistency Behavior has changed from v2 to v3, or feature is missing altogether labels Aug 8, 2024
@RanVaknin
Copy link
Contributor

Hi @gcv-epalmer

Previously our V2 initialized clients outside of handlers as such
V2 Code Snippet:

const AWS = AwsProvider.getAws();
const s3 = new AWS.S3({ apiVersion: "2006-03-01", signatureVersion: "v4" });

After upgrading and doing some investigation, reading issue other folks have been having led me to implement a 'lazy-loading' strategy
V3 Code Snippet:

let s3Client: S3Client | undefined;

export const s3 = () => {
  if (!s3Client) {
    s3Client = new S3Client({
      apiVersion: "2006-03-01",
    });
  }
  return s3Client;
};

It's not clear how you are instantiating the client though. This reads as if in v2 you were instantiating the client outside of the handler, but in v3 you don't? The lazy-loading method here does not tell us where the client is actually being instantiated.

Please make sure you read this

Thanks,
Ran~

@gcv-epalmer
Copy link
Author

gcv-epalmer commented Aug 12, 2024

My apologies @RanVaknin. In both cases, we are instantiating the client outside of the handler. The difference between v2 and v3 is that in v3 we only instantiate the client when a method is first called, whereas previously clients were instantiated whenever the file was imported. Here's our full setup to make things clear:

We have a core library that has all of our AWS clients with some helper methods. Our s3 client for example looks like this:
s3-util.ts

import {
  ...
  S3Client,
  ...
} from "@aws-sdk/client-s3";
import { Agent as HttpAgent } from "node:http";
import { Agent as HttpsAgent } from "node:https";
import { NodeHttpHandler } from "@smithy/node-http-handler";

const connectionTimeout = 5000;
const socketTimeout = 120000;
const maxSockets = 1000;
const keepAlive = false;

let s3Client: S3Client | undefined;

export const s3 = () => {
  if (!s3Client) {
    s3Client = new S3Client({
      apiVersion: "2006-03-01",
      requestHandler: new NodeHttpHandler({
        connectionTimeout,
        socketTimeout,
        httpAgent: new HttpAgent({ maxSockets, keepAlive }),
        httpsAgent: new HttpsAgent({ maxSockets, keepAlive }),
      }),
    });
  }
  return s3Client;
};

export const doGetObjectAsString = async (bucket: string, key: string): Promise<string | null> => {
  try {
    const data = await s3().send(new GetObjectCommand({ Bucket: bucket, Key: key }));

    return data && data.Body ? await data.Body.transformToString() : null;
  } catch (err) {
    throw err;
  }
};

Then our handler function looks like this:


import { S3Util } from "@our-shared-library";

export const handler: Handler = async (event: eventInput, context: Context, callback: Callback) => {
  let pool: Pool;
  try {
    ... get some data from dynamo, get a file from s3
    const S3data = await S3Util.doGetObjectAsString(myBucket, myKey); 

     ... get some data from RDS
    pool = await SqlClient.getReadClientPool();
    const rdsData = await getRdsData(pool);

    ... do some work on that data

    return rdsData;
  } catch (e) {
    throw e;
  } finally {
    if (pool) {
      await SqlClient.closePool(pool);
    }
  }
};

Also, to give an update: we tested this on Node 18 with no success. The next thing we're going to test is setting keepAlive to false on all of the AWS clients (not just S3, which we started with).
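
For example, roughly like this for the DynamoDB client (just a sketch mirroring the S3 config above; the timeout and socket values are the same illustrative ones):

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { NodeHttpHandler } from "@smithy/node-http-handler";
import { Agent as HttpAgent } from "node:http";
import { Agent as HttpsAgent } from "node:https";

const connectionTimeout = 5000;
const socketTimeout = 120000;
const maxSockets = 1000;
const keepAlive = false; // try matching the old v2/Node 16 behavior of not reusing sockets

// Same requestHandler shape as our S3 client, applied to DynamoDB.
export const dynamo = new DynamoDBClient({
  requestHandler: new NodeHttpHandler({
    connectionTimeout,
    socketTimeout,
    httpAgent: new HttpAgent({ maxSockets, keepAlive }),
    httpsAgent: new HttpsAgent({ maxSockets, keepAlive }),
  }),
});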

@aBurmeseDev aBurmeseDev removed the needs-triage This issue or PR still needs to be triaged. label Aug 12, 2024
@aBurmeseDev aBurmeseDev self-assigned this Aug 12, 2024
@gcv-epalmer
Copy link
Author

gcv-epalmer commented Aug 13, 2024

Another update, I did the following things:

  • Reverted the initialization to ensure AWS Clients are instantiated outside of the handler. The code now looks like this for example:
const socketTimeout = 120000;
const maxSockets = 1000;
const keepAlive = false;

export const s3 = new S3Client({
  apiVersion: "2006-03-01",
  requestHandler: new NodeHttpHandler({
    socketTimeout,
    httpAgent: new HttpAgent({ maxSockets, keepAlive }),
    httpsAgent: new HttpsAgent({ maxSockets, keepAlive }),
  }),
});

export const doGetObjectAsString = async (bucket: string, key: string): Promise<string | null> => {
  try {
    const data = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));

    return data && data.Body ? await data.Body.transformToString() : null;
  } catch (err) {
    throw err;
  }
};

import { S3Util } from "@our-shared-library";

export const handler: Handler = async (event: eventInput, context: Context, callback: Callback) => {
  let pool: Pool;
  try {
    ... get some data from dynamo, get a file from s3
    const S3data = await S3Util.doGetObjectAsString(myBucket, myKey); 

     ... get some data from RDS
    pool = await SqlClient.getReadClientPool();
    const rdsData = await getRdsData(pool);

    ... do some work on that data

    return rdsData;
  } catch (e) {
    throw e;
  } finally {
    if (pool) {
      await SqlClient.closePool(pool);
    }
  }
};
  • Applied keepAlive to false for all of our AWS clients
  • Reduced the concurrency on the given lambda to alleviate the source of the problem

After making these changes, I started getting different errors from other processes that run with a high level of parallelism via Promise.all:
Error: getaddrinfo EMFILE dynamodb.us-east-1.amazonaws.com

After googling this, I'm seeing all sorts of similar issues with connections being kept alive:
https://www.serverlessguru.com/blog/emfile-issue
https://bahr.dev/2021/06/03/lambda-emfile/

Ultimately the best approach is to reduce the concurrency of our functions to reduce the total number of connections, but it's still troubling to me that all of this behavior changed in such a way just after updating the clients. I'm now trying to figure out, for each of our AWS clients, what the best strategy is for initialization, keepAlive, and the various socket/connection parameters so that our business functions return to normal.
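
As an illustration of the kind of configuration I'm experimenting with (the client choice and numbers are just examples, not a conclusion):

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { NodeHttpHandler } from "@smithy/node-http-handler";
import { Agent as HttpsAgent } from "node:https";

// Experiment: leave keepAlive on (the v3 default) but cap maxSockets much lower,
// so highly parallel Promise.allSettled work can't exhaust file descriptors
// (the getaddrinfo EMFILE error above).
export const dynamo = new DynamoDBClient({
  requestHandler: new NodeHttpHandler({
    connectionTimeout: 5000,
    httpsAgent: new HttpsAgent({ keepAlive: true, maxSockets: 50 }),
  }),
});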

@kuhe
Copy link
Contributor

kuhe commented Aug 13, 2024

  • keepAlive=true is supposed to help with higher throughput. Other than matching the JS v2 SDK default, do you need to set it to false?
    • you can also call s3Client.destroy() after you're done with it to ensure sockets it's using close immediately.
    • note: s3Client.destroy() does not render the client unusable or anything like that, it only closes sockets. You can continue to make new requests with the client, which will open new sockets. It should be safe to call this at the end of the Lambda invocation even without initializing a new client instance to replace it (rough sketch after this list).
  • as an aside, what credentials are you using? Lambda ENV static credentials with no expiry?
  • what does the parallel dynamodb code look like?
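
Here is a rough sketch of the destroy()-at-the-end idea from the list above (the handler shape and event fields are placeholders, not a prescription):

import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import type { Handler } from "aws-lambda";

const s3 = new S3Client({});

export const handler: Handler = async (event) => {
  try {
    const res = await s3.send(
      new GetObjectCommand({ Bucket: event.bucket, Key: event.key }) // placeholder input shape
    );
    return res.Body ? await res.Body.transformToString() : null;
  } finally {
    // Close the sockets this client opened during the invocation. The client
    // stays usable and will open new sockets on the next request.
    s3.destroy();
  }
};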

@gcv-epalmer
Copy link
Author

  • keepAlive=true is supposed to help with higher throughput. Other than matching the JS v2 SDK default, do you need to set it to false?

I'm really not sure; networking is not my strong suit. I came to testing this parameter because I have run into similar issues before with RDS dropping connections from high-concurrency processes and causing them all to fail. Prior to updating our SDK to v3, our nightly process was churning through everything fine, and we made no adjustments to that process. After updating the SDK, suddenly the RDS issues are cropping up from it, and they are eerily similar to what I've seen before. Since nothing changed but the SDK, I started looking at the differences in the SDK, and the main ones that stood out to me were the keepAlive change and the ability to override the requestHandler. There was definitely room for us to optimize our process, but I wanted to attempt to 'go back to the way it used to work' by turning keepAlive off on all our clients.

  • you can also call s3Client.destroy() after you're done with it to ensure sockets it's using close immediately.
  • note: s3Client.destroy() does not render the client unusable or anything like that, it only closes sockets. You can continue to make new requests with the client, which will open new sockets. It should be safe to call this at the end of the Lambda invocation even without initializing a new client instance to replace it.

That's interesting, and something I might want to try. Though would there be a risk of multiple lambda invocations in a single context destroying that socket while another invocation is using it? Or would each lambda invocation have its own sockets?

  • as an aside, what credentials are you using? Lambda ENV static credentials with no expiry?

Not 100% sure; we deploy our Lambdas with a Serverless CloudFormation template. The lambdas are given an IAM role with policies that grant permissions to other AWS services under our account. The SDK seems to handle credentials under the hood.

  • what does the parallel dynamodb code look like?

Somewhat pseudo-coded, but:

export const parallelJob: Handler = async (event: any, context: Context) => {
  try {
    const allOrganizations = await getAllOrgs(); // we have a few thousand of these

    const promises = allOrganizations.map(doWork);

    // all error handling and logging takes place inside the functions
    await Promise.allSettled(promises);
  } catch (e) {
    throw e;
  }
};

const doWork = async (org) => {
  const records = await DynamoUtil.getRecords(table, org.id);

  for (const record of records) {
    ... get more dynamo records

    ... conditionally invoke another lambda and use response

    ... conditionally more calls to dynamo
  }
};

I've since gone through and put this behind a limiter to limit the concurrency. It was just very telling that as soon as I started playing with the settings of our various AWS service clients, different issues came up, all related to concurrency and TCP connections. And again, I'm terrible at networking stuff; I'm a code person just trying my best without fully understanding the issue at hand, so I apologize and appreciate all of your help :)
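
The limiter is roughly this shape (a hand-rolled sketch, not our exact code):

// Process the orgs in fixed-size batches instead of firing every doWork() call at once.
const mapWithConcurrency = async <T>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<void>
): Promise<void> => {
  for (let i = 0; i < items.length; i += limit) {
    const batch = items.slice(i, i + limit);
    // errors are handled and logged inside fn, matching how doWork already behaves
    await Promise.allSettled(batch.map(fn));
  }
};

// usage inside the handler, replacing the unbounded Promise.allSettled:
// await mapWithConcurrency(allOrganizations, 25, doWork);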

@kuhe
Copy link
Contributor

kuhe commented Aug 16, 2024

Though would there be risk of multiple lambda invocations in a single context destroying that socket while another invocation is using it? Or would each lambda invoc have its own socket?

Yes, I believe it's possible for invocations to close each other's sockets. Ideally, this function call shouldn't be necessary since node:https manages sockets, and it sounds like it would be dangerous to use it in the middle of other invocations.


Nothing stands out as looking like a bottleneck.

Some possibilities to explore:

  • Always consume S3 getObject streams. It looks like you already do this.
const getObject = await s3.getObject({ ... });

await getObject.Body?.transformToString(); // consumes the stream and allows the socket to close.
  • Are you initializing a lot of DynamoDB clients?
    You can share requestHandlers between clients.
import { NodeHttpHandler } from "@smithy/node-http-handler";

const requestHandler = NodeHttpHandler.create({
  httpsAgent: { maxSockets: 1000, keepAlive: true }
});

const client1 = new DynamoDB({ requestHandler });
const client2 = new DynamoDB({ requestHandler });

... await requests ...

requestHandler.destroy(); // close sockets for both clients.
