Idle Connections are not being cleaned up under certain conditions. #3020

robert-pitt-foodhub · 2024-09-06T20:37:56Z

Setup

Environment: AWS Lambda (x86)
mysql2 version: v3.11.0 (latest as of post)
MySQL Versions Tested: 8 & 9

Description:

Hey, I have been noticing some unusual connection errors and have traced them back to the connection cleanup process within the Pool class. Below I have outlined my setup, the steps I took to find the issue, and how to replicate the problem using an existing test case.

A fairly sizeable percentage of our requests are failing due to socket timeout or disconnection issues, which typically show up as PROTOCOL_CONNECTION_LOST, as shown below:

Error: Connection lost: The server closed the connection.
    at PromisePoolConnection.query (/opt/nodejs/node_modules/mysql2/promise.js:93:22)
    at /var/task/index.js:1435:25
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async Runtime.handler (/var/task/index.js:2659:15) {
  code: 'PROTOCOL_CONNECTION_LOST',
  errno: undefined,
  sql: undefined,
  sqlState: undefined,
  sqlMessage: undefined
}

The configuration we had set up was {"maxIdle": 3, connectionLimit: 3, idleTimeout: 30000 }. After digging around I found the following line of code, which implies that maxIdle MUST be less than connectionLimit for the cleanup to run. In this configuration the connection cleanup process is therefore not enabled, meaning that once a connection is opened, the pool will never remove it.

node-mysql2/lib/pool.js

Line 30 in dee0c08

if (this.config.maxIdle < this.config.connectionLimit) {
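A minimal sketch of what that guard implies for the two configurations discussed in this post. `cleanupTimerStarts` is a hypothetical helper written for illustration; it is not part of mysql2 and only evaluates the same comparison as the line quoted above.

```javascript
// Hypothetical helper: mirrors the comparison quoted above from lib/pool.js.
// It is not a mysql2 API; it only answers "would the guard pass?".
function cleanupTimerStarts(config) {
  return config.maxIdle < config.connectionLimit;
}

const original = { maxIdle: 3, connectionLimit: 3, idleTimeout: 30000 };
const adjusted = { maxIdle: 1, connectionLimit: 3, idleTimeout: 30000 };

console.log(cleanupTimerStarts(original)); // false: the guard fails, no timer
console.log(cleanupTimerStarts(adjusted)); // true: the guard passes
```

With maxIdle equal to connectionLimit the guard fails, so idleTimeout has no effect in that configuration.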

So, with excitement, I updated one of our production lambdas to use the config {"maxIdle": 1, connectionLimit: 3, idleTimeout: 30000 }, in the hope that activating the connection cleanup process would resolve the connection errors we were seeing. I went to bed hoping to wake up to no red blobs on our dashboard. To my surprise, the connection errors continued, following a similar pattern to the previous nights, as if the change had no effect.

I looked at some of the logs and noticed that the Timer object / reference was present on the pool properties once I had dropped maxIdle below connectionLimit.

Before:

After:

So I was certain the timer was now running, but I was still getting the same connection errors 😠. I added some more logs to track how long each connection's thread id was lasting and what the time delays were between queries.

At this point I knew there was a possibility that the bug was within the node-mysql2 library, so I cloned the repository and set up a dockerised MySQL sandbox to run the integration test suite, which thankfully was a piece of cake to get up and running. I ran the full suite to validate that everything behaved as expected, and then I took a look at the test/integration/test-pool-release-idle-connection.test.cjs test file, as that was closest to what I was doing.

https://github.com/sidorares/node-mysql2/blob/dee0c0813854658baf5f73055e5de36a1a78e7bf/test/integration/test-pool-release-idle-connection.test.cjs

I set up my MySQL instance, ensured that there were no connections from anything else, and executed the test file as is, without making any changes.

MySQL before the tests:

Executed the tests via the following command:

yarn poku ./test/integration/test-pool-release-idle-connection.test.cjs

The test PASSES as expected 🎉

Within the MySQL console I was refreshing the process list every second. I could see 3 connections created, as expected, and I saw those same three connections drop off after around 7-8 seconds, which again is what I expect given the setTimeout within this test scenario.

After the test completed, MySQL was back down to the initial two base connections:

OK, here's where it seems to go wrong, and I think this could be the cause of the connection socket timeouts I am experiencing. When I change the configuration to {connectionLimit: 3, maxIdle: 2}, the test just hangs and never finishes (even after 10/15+ minutes).

When checking MySQL, it seems as though there are outstanding connections that have not been closed.

I'm pretty certain that the cause of this bug is somewhere near this code. However, I don't have a huge amount of experience with this library, so I am looking for some support here, from replicating the bug to offering a solution that won't cause any regressions.

Thanks

References:


robert-pitt-foodhub · 2024-09-07T12:19:35Z

Hey, I just wanted to follow up with some more findings, as I really want to get this bug pinned down if possible. I was thinking last night that the test might be hanging because the assertions within the setTimeout were failing after I changed connectionLimit and maxIdle. I commented those assertions out this morning and ran the test again, and it passed. I tried a few configurations and they all passed, but I still think that there are connections that are not being destroyed under certain conditions.

I've just pulled some logs from one of our lambdas that is having the issue, and wanted to outline a scenario where this occurs:

Setup

Environment: AWS Lambda (x86)
mysql2 version: v3.11.0 (latest as of post)
MySQL Version: 8
connectionLimit = 5
maxIdle = 1
idleTimeout = 5000

Timeline

I have taken a screenshot of my logs showing a single failure scenario within a single lambda instance. The screenshot shows the timeline of 5 requests against the same pool instance. For each request I am logging the connection.threadId and the delta since the connection was last used (not since it was created).

Here is each request separated (for better readability):

Request 1

Request 2

Request 3

Request 4

Request 5
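The per-request logging described in the timeline could look roughly like the sketch below. `lastUsedAt` and `recordUse` are hypothetical names, and the timestamps are made-up values chosen to reproduce a 344 second gap; in real code the thread id would come from connection.threadId and the timestamp from Date.now().

```javascript
// Hypothetical tracker: remembers when each connection (keyed by threadId)
// was last used, so a later request can log how long it sat unused.
const lastUsedAt = new Map();

function recordUse(threadId, now = Date.now()) {
  const previous = lastUsedAt.get(threadId);
  lastUsedAt.set(threadId, now);
  // null means this threadId has not been seen before.
  return previous === undefined ? null : now - previous;
}

// Simulated timestamps in milliseconds instead of real queries:
console.log(recordUse(42, 0));      // null (first use)
console.log(recordUse(42, 2000));   // 2000
console.log(recordUse(42, 346000)); // 344000, far beyond a 5000 ms idleTimeout
```

Logging the delta since last use rather than since creation is what makes a stale connection stand out: any value above idleTimeout means the pool handed back a connection that should already have been removed.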

Observations

Given that I have set a 5000ms idleTimeout, it should not be possible for a connection that has exceeded the idleTimeout to be returned to userland code. However, with this configuration of maxIdle = 1, I think the logic within the connection cleanup process is not cleaning up connections once the free count is at or below maxIdle.

You can observe that the connection was last used over 344 seconds earlier, which is backed up by the lambda timings column on the left, which also shows the large gap.
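To illustrate the suspicion, here is a small model on a plain array. It assumes the cleanup only removes a connection when the free list is larger than maxIdle and the connection has also exceeded idleTimeout; that condition is an assumption made for this sketch, not code copied from mysql2, and `sweepRequiringBoth` is a hypothetical name.

```javascript
// Assumed model of the suspected behaviour (not copied from mysql2):
// a free connection is removed only while the free list is LARGER than
// maxIdle AND the oldest connection has exceeded idleTimeout.
function sweepRequiringBoth(free, config, now) {
  const kept = [...free]; // oldest connection first
  while (
    kept.length > config.maxIdle &&
    now - kept[0].lastActiveTime > config.idleTimeout
  ) {
    kept.shift();
  }
  return kept;
}

const config = { connectionLimit: 5, maxIdle: 1, idleTimeout: 5000 };
const now = 400000; // arbitrary simulated clock, in ms
// One free connection that was last used 344 s ago.
const free = [{ threadId: 101, lastActiveTime: now - 344000 }];

console.log(sweepRequiringBoth(free, config, now).length); // 1: the stale connection survives
```

Under that assumption the last maxIdle connections are never evicted, no matter how long they have been idle, which would match the 344 second gap in the logs.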

So far, I am looking at this function in Pool.js as the area that needs adjusting. However, I am not sure whether using maxIdle and connectionLimit is enough to do this, or whether we need to introduce a new config option such as maxConnectionIdleTimeout.

Here is my current suggestion:

  _removeIdleTimeoutConnections() {
    if (this._removeIdleTimeoutConnectionsTimer) {
      clearTimeout(this._removeIdleTimeoutConnectionsTimer);
    }

    this._removeIdleTimeoutConnectionsTimer = setTimeout(() => {
      try {
        const now = Date.now();
        while (this._freeConnections.length > 0) {
          const connection = this._freeConnections.get(0);
          const idleTime = now - connection.lastActiveTime;

          if (
            this._freeConnections.length > this.config.maxIdle ||
            idleTime > this.config.idleTimeout
          ) {
            connection.destroy();
          } else {
            // If we reach a connection that shouldn't be removed, we can stop checking
            break;
          }
        }
      } finally {
        this._removeIdleTimeoutConnections();
      }
    }, 1000);
  }

This approach loops over the free connections starting from the front of the free list. A connection is removed if the free list currently holds more than maxIdle connections, or if the connection has been idle for longer than idleTimeout. The loop stops at the first connection that meets neither condition.

Thoughts?
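The condition in the suggestion can be exercised without a database by running it against a plain array. This is only a sketch: `sweepEitherCondition` is a hypothetical stand-in for the timer callback, the connection objects are fake, and shift() takes the place of connection.destroy().

```javascript
// Plain-array stand-in for the pool's free list (oldest connection first).
// Mirrors the condition in the suggestion above; shift() replaces destroy()
// because these objects are not real connections.
function sweepEitherCondition(free, config, now) {
  const kept = [...free];
  const removed = [];
  while (kept.length > 0) {
    const idleTime = now - kept[0].lastActiveTime;
    if (kept.length > config.maxIdle || idleTime > config.idleTimeout) {
      removed.push(kept.shift());
    } else {
      break; // front connection is fresh and the list is within maxIdle
    }
  }
  return { kept, removed };
}

const config = { connectionLimit: 5, maxIdle: 1, idleTimeout: 5000 };
const now = 400000; // arbitrary simulated clock, in ms
const free = [
  { threadId: 101, lastActiveTime: now - 344000 }, // stale
  { threadId: 102, lastActiveTime: now - 1000 },   // fresh
];

const result = sweepEitherCondition(free, config, now);
console.log(result.removed.map((c) => c.threadId)); // [ 101 ]
console.log(result.kept.map((c) => c.threadId));    // [ 102 ]
```

Note that with this condition a connection above maxIdle is removed on the next tick even if it was used recently, while a connection within maxIdle is removed only once it exceeds idleTimeout.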

robert-pitt-foodhub · 2024-09-08T18:43:09Z

Hey @wellwelwel, I'm just wondering what the next steps are in moving from PR to Release. I did attempt to deploy from my GitHub clone via npm, but during our CI it failed saying that the lru-cache module was not found, so I am assuming that there may be a build step on your side to get a working package.

If you're able to provide me with an npm-installable package, I will be happy to deploy it to our staging infra across our 200+ lambdas that use the mysql2 library. We can update all the lambdas relatively easily and monitor the results.

Thanks

wellwelwel · 2024-09-08T18:52:00Z

I'm just wondering what the next steps are in moving from PR to Release

I'd like to take a closer look before merging (if @sidorares doesn't do it first).

during our CI it failed saying that the lru-cache module was not found, so I am assuming that there may be a build step on your side to get a working package.

To better understand the lru-cache conflicts, see #2988 and its tracks 🙋🏻‍♂️

robert-pitt-foodhub · 2024-09-08T19:40:57Z

Thanks @wellwelwel, yes, please take a closer look. Regarding #2988, I have created a branch with that change and the connection cleanup change, and I am currently deploying it to our SIT environment, where I can leave it running for a few hours to see if there are any regressions.

I'll update here soon.

robert-pitt-foodhub · 2024-09-08T21:46:56Z

My updates so far:

I am using AWS CDK to manage my infrastructure, so by making this version change and deploying, I have redeployed hundreds of lambda functions that now use the latest code, the latest code being the change from this issue plus the patch I applied from #2988.

A full deployment like this took close to 2 hours to fully roll out; below are my observations.

The deployment started around 19:35 UTC, when the CloudFormation deployment process began updating the lambda functions of 40+ stacks/services. During this window CloudFormation deploys new instances/versions of the lambda functions running the new code, and once it sees that all changes are deployed, it shuts down the previous deployment of lambdas.

During this time, I expected the number of database connections to increase, as more lambda instances are running during the deployment, and then to reduce back to the running average.

Connections During Deployment:

Total Connections:

The above screenshot shows the pattern I expected, and also shows that the number of connections reduced back to the expected amount.

I have also attached a screenshot of the Aborted Clients metric, showing that during the deployment window the number of aborted clients jumped around as connections were being made from new lambdas and connections were being aborted by lambdas being terminated.

Aborted Clients during deployment

So far I am not observing any issues, I selected the sum of Errors for 500 lambdas over the past 12 hours, and I see no difference in the volume of errors before vs after the release

I am still monitoring. I also have a metric for the Socket Timeout error, but as it's intermittent I will collect data for several hours to see whether we are still experiencing the errors.

Will post back soon

robert-pitt-foodhub · 2024-09-09T08:52:44Z

Hey @wellwelwel, as far as regressions are concerned, there's nothing on our side statistically that shows any increase in errors or any significant change in the number of connections for our RDS, which tells me that the newly deployed code is working as expected.

Here's a screenshot of the errors we were tracking for the Socket Timeout error. As you can see, after I released the patch to our staging environment at around 8:35pm, the Socket Timeout error seems to have stopped.

robert-pitt-foodhub · 2024-09-09T09:31:06Z

@wellwelwel I think we still need to resolve the issue where the connection cleanup process doesn't start to tick unless maxIdle is lower than connectionLimit. My suggestion is that we just remove the if condition and start the timeout ticks regardless of the value, so that all connections are handled based on my logic change in the _removeIdleTimeoutConnections function.

node-mysql2/lib/pool.js

Line 30 in dee0c08

if (this.config.maxIdle < this.config.connectionLimit) {

Thoughts?

wellwelwel added the needs investigation label Sep 7, 2024

robert-pitt-foodhub mentioned this issue Sep 7, 2024: Updated connection cleanup process to handle expired connections and those exceeding config.maxIdle #3022 (Merged)

wellwelwel linked a pull request Sep 7, 2024 that will close this issue: Updated connection cleanup process to handle expired connections and those exceeding config.maxIdle #3022 (Merged)

robert-pitt-foodhub mentioned this issue Sep 8, 2024: Receiving "Connection lost: The server closed the connection." with createPool #2247 (Open)

wellwelwel added bug and removed needs investigation labels Sep 9, 2024

wellwelwel closed this as completed in #3022 Sep 10, 2024
