
Socket Hang Up / ECONNRESET / ECONNREFUSED on consecutive http requests (>= 100) and temporary FIX #55330

Closed

IbrahimTanyalcin opened this issue Oct 9, 2024 · 6 comments

Labels: http (Issues or PRs related to the http subsystem), v20.x, v22.x

IbrahimTanyalcin commented Oct 9, 2024

Version

v20.18.0

Platform

WSL2/ Ubuntu 20.04 / amd64

Subsystem

node:http

What steps will reproduce the bug?

Using node:http, make consecutive

http.get(`http://${serverDetails.host}:${serverDetails.port}`, (res) => ...)

calls to a standard expressjs/<http.Server> server.
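For concreteness, a minimal sketch of the kind of burst that triggers it (the host/port values are placeholders; the real test uses the serverDetails object shown further below):

const http = require('node:http');

// Placeholder target; in the real tests this comes from serverDetails below
const serverDetails = { host: "localhost", port: 3000 };

// Fire 100 GETs at once; under Node 20 a few of these randomly fail
// with ECONNRESET / ECONNREFUSED / "socket hang up"
Promise.allSettled(
    [...Array(100)].map(() => new Promise((r, j) => {
        http.get(`http://${serverDetails.host}:${serverDetails.port}`, (res) => {
            res.resume(); // drain the response so the socket is released
            r(res.statusCode);
        }).on('error', j);
    }))
).then(results => {
    const failed = results.filter(d => d.status === 'rejected');
    console.log(`${failed.length} of ${results.length} requests failed`);
    failed.slice(0, 3).forEach(d => console.log(d.reason.code, d.reason.message));
});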

How often does it reproduce? Is there a required condition?

Randomly: roughly 1 out of every 3 runs, the Jest tests that stress test the server fail with ECONNRESET / ECONNREFUSED.

What is the expected behavior? Why is that the expected behavior?

Like in Node 18, Node 16, or before, Node should be able to handle concurrent requests and not hand the client a socket that has already been closed.

What do you see instead?

If you attach error handlers to the request object returned from http.get and log the errors, you will end up with one of:

  • ECONNRESET
  • ECONNREFUSED

with an error message of "connect ECONNREFUSED ..." or "socket hang up"; see the sketch below.
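A minimal sketch of how these errors surface (assuming node:http is required and serverDetails is the object shown further below):

const req = http.get(`http://${serverDetails.host}:${serverDetails.port}`, (res) => {
    res.resume();
});
req.on('error', (err) => {
    // Logs one of:
    //   ECONNRESET   "socket hang up"
    //   ECONNREFUSED "connect ECONNREFUSED 127.0.0.1:3000"
    console.error(err.code, err.message);
});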

Additional information

I recently updated my application from Node 16 to 20, along with the dependencies, and all my integration tests were passing except the server stress test, which makes 100 requests at the same time and feeds data over Server-Sent Events. I am using nvm, so I can quickly test different versions.

I initially thought this was because one of these packages was updated:

...
    "express": "^4.21.0",
    "express-session": "^1.18.0",
    "memcached": "^2.2.2",
    "nodemon": "^3.1.7",
    ...
  },
  "devDependencies": {
    "jest": "^29.7.0"
  }
...

but switching back to Node 16 with NEW or OLD versions of the packages consistently passed the tests, whereas Node 20 randomly failed the stress test with either NEW or OLD versions of the packages. So I searched the issues here and around the net and found that many people have similar issues where the server hands the client/agent a closed TCP socket; here are other threads:

ChainSafe/lodestar#5765
axios/axios#5929
#47130
node-fetch/node-fetch#1735
#39810
#43456

My original test, which consistently passes with Node 16, is as follows (setup before and after is redacted):

test(`are we able to handle ${nClients} clients?`, async () => {
        expect.assertions(2);
        return Promise.allSettled(
            [...Array(nClients)]
            .map(d => simulClient(serverDetails))
        ).then(async (results) => {
            expect(results.map(d => d.value?.statusCode))
            .toEqual([...Array(nClients)].map(d => 200));
            const [{value:lastCli}] = results.slice(-1);
            let rawData = '';
            lastCli.setEncoding('utf8');
            lastCli.on('data', chunk => rawData += chunk)
            //check if we really have n clients live
            return await until(function(){
                return this.test(rawData)
            },{
                thisArg: new RegExp(
                    `data:\\s*\\x22?size\\x22?\\s*:\\s*${nClients}`,
                    "mi"
                ), 
                interval: 100
            })
        }).then(val => expect(val).toBe(true))
    }, serverDetails.timeout);

Above, nClients is 100 and serverDetails is:

const serverDetails = {
    started: true,
    host: "localhost",
    port: 3000,
    oProcess: nodeProcess,
    timeout: 20000,
    ...
}

The simulClient function mimics a user: it requests the base URL, gets the cookie, and starts an event stream (a live response that is never closed) over which the server sends data via Server-Sent Events ({size: /* number of clients currently receiving the feed */}):

const http = require('node:http');

/**
@description simulates a client navigating to '/',
saving a session cookie and opening
a live TCP connection for server-sent events
@param {Object} serverDetails an object describing the target server
@param {string} serverDetails.host host
@param {string|number} serverDetails.port port
@param {number} serverDetails.timeout timeout in milliseconds
@returns {Promise<http.IncomingMessage>} an `http.IncomingMessage` type from
Node that has an extra `__cliCookie` array as property
@example
const cliResponse = await simulClient(serverDetails);
*/
const simulClient = (serverDetails) => new Promise((r, j) => {
   let timeout = setTimeout(
       ()=> j(new Error (
           "Client request simulation timedout"
       )),
       serverDetails?.timeout ?? 5000
   );
   try {
       http.get(`http://${serverDetails.host}:${serverDetails.port}`, (res) => {
           if (res.statusCode !== 200) {
               throw new Error("Base route non 200 status")
           }
           const 
               headers = new Map(Object.entries(res.headers)),
               cookie = headers.get("set-cookie")?.map(d => d.split(';')[0]);
           http.get(
               `http://${serverDetails.host}:${serverDetails.port}/estream/subscribe`, 
               {headers: {cookie}},
               (res) => {
                   if (res.statusCode !== 200) {
                       throw new Error("Event stream non 200 status")
                   } else if (
                       !res.headers
                       ?.["content-type"]
                       ?.toString()
                       ?.toLowerCase()
                       ?.includes("text/event-stream")
                   ) {
                       throw new Error("Event stream has wrong content-type")
                   }
                   res.__cliCookie = cookie;
                    clearTimeout(timeout);
                   r(res)
               }
           )
       })
   } catch (err) {
       j(err)
   }
});
module.exports = simulClient;

I know the throw statements inside the promise callbacks won't reject anything; please ignore them.
One thing that I want to address is that, in the other threads I pasted above, people seem to think keepAlive: false was to be set on the http.Agent, but I think this is not the case. I think the default Node 16 behavior might come from this http.createServer option:

keepAlive <boolean> If set to true, it enables keep-alive functionality on the socket immediately after a new incoming connection is received, similarly on what is done in socket.setKeepAlive([enable][, initialDelay]). Default: false.

so this is set on http.createServer and not on the agent. Although the Node 20.18 docs say the default is still false, I think this might not be the case in practice.
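To make the two knobs concrete, here is a sketch of where each one lives (the API shapes are from the docs; whether the server-side default effectively changed is the open question above):

const http = require('node:http');

// Server side: this keepAlive toggles TCP keep-alive (SO_KEEPALIVE) on
// incoming sockets, as in the docs quote above; documented default: false
const server = http.createServer({ keepAlive: false }, (req, res) => {
    res.end('ok');
});

server.listen(3000, () => {
    // Client side: keepAlive on http.Agent is a different setting entirely;
    // it controls pooling/reuse of outgoing sockets between requests.
    // Note the *global* agent ships with keepAlive enabled since Node 19.
    const agent = new http.Agent({ keepAlive: false });
    http.get('http://localhost:3000', { agent }, (res) => {
        res.resume();
        server.close();
    });
});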

'FIX'

My fix for this behavior was as follows:

for StressTest:

test(`are we able to handle ${nClients} clients?`, async () => {
        expect.assertions(2);
        return Promise.allSettled(
            [...Array(nClients)]
            .map((d, i) => simulClient({...serverDetails, delay: i * 10}))
        ).then(async (results) => {
            expect(results.map(d => d.value?.statusCode))
            .toEqual([...Array(nClients)].map(d => 200));
            const [{value:lastCli}] = results.slice(-1);
            let rawData = '';
            lastCli.setEncoding('utf8');
            lastCli.on('data', chunk => rawData += chunk)
            //check if we really have n clients live
            return await until(function(){
                return this.test(rawData)
            },{
                thisArg: new RegExp(
                    `data:\\s*\\x22?size\\x22?\\s*:\\s*${nClients}`,
                    "mi"
                ), 
                interval: 100
            })
        }).then(val => expect(val).toBe(true))
    }, serverDetails.timeout);

and simulClient became:

const simulClient = (serverDetails) => new Promise(async (r, j) => {
    try {
        let retry = serverDetails?.retry ?? 3;
        if (!retry){
            throw new Error("Maximum amount of retries were reached")
        }
        let delay = serverDetails?.delay ?? 0,
            iteration = serverDetails?.iteration ?? 0,
            retryPenalty = serverDetails?.retryPenalty ?? 100;
        await new Promise((r) => setTimeout(r, delay));
        await new Promise((r) => setTimeout(r, iteration * retryPenalty));
        let timeout = setTimeout(
                ()=> j(new Error (
                    "Client request simulation timedout"
                )),
                serverDetails?.timeout ?? 5000
            ),
            agent = serverDetails?.agent ?? new http.Agent({keepAlive: false, timeout: serverDetails?.timeout ?? 100});
        const handleRetry = (err) => {
            clearTimeout(timeout); // Clear timeout before retrying
            console.log(`Retrying due to error: ${err.message}. Attempts left: ${retry - 1}`);
            // Retry with decremented retry count and backoff
            r(simulClient({ ...serverDetails, agent, iteration: ++iteration, retry: --retry, delay: 0}));
        };
        http.get(`http://${serverDetails.host}:${serverDetails.port}`, {agent}, (res) => {
            if (res.statusCode !== 200) {
                return handleRetry(new Error("Base route non 200 status"));
            }
            const 
                headers = new Map(Object.entries(res.headers)),
                cookie = headers.get("set-cookie")?.map(d => d.split(';')[0]);
            http.get(
                `http://${serverDetails.host}:${serverDetails.port}/estream/subscribe`, 
                {headers: {cookie}, agent},
                (res) => {
                    if (res.statusCode !== 200) {
                        return handleRetry(new Error("Event stream non 200 status"));
                    } else if (
                        !res.headers
                        ?.["content-type"]
                        ?.toString()
                        ?.toLowerCase()
                        ?.includes("text/event-stream")
                    ) {
                        return handleRetry(new Error("Event stream has wrong content-type"));
                    }
                    res.__cliCookie = cookie;
                    clearTimeout(timeout);
                    r(res)
                }
            ).on("error", (err) => {
                handleRetry(err)
            })
        }).on("error", (err) => {
            handleRetry(err)
        })
    } catch (err) {
        j(err)
    }
});

So the above combines three things:

  • a retry mechanism via handleRetry
  • a small delay in the stress test (delay: i * 10) to force incoming requests to fall into different cycles of the event loop
  • an agent where keepAlive is explicitly set to false

With all three in place, running the test on a cold start or many times back to back consistently passed. Removing any one of them made the socket hang up error randomly crop up again.

IMO this breaks user space. I understand people pointing out limitations and race conditions in HTTP/1.1, but Node did not suddenly remember the spec in Node 20. This behavior seems to break user space, and I think it is about more than just setting keepAlive to false on http.Server. Something in Node's event loop handling since Node 19 must have changed for this type of behavior to emerge.

If the keepAlive on http.Server is the true culprit, shouldn't it be set back to what it was in Node 16?
What can be done about it so that user space is prioritized above spec compliance?

IbrahimTanyalcin changed the title Socket Hang Up / ECONNRESET / ECONNREFUSED on consecutive http requests (>= 100) Socket Hang Up / ECONNRESET / ECONNREFUSED on consecutive http requests (>= 100) and temporary FIX Oct 9, 2024
avivkeller (Member) commented:

This might be an issue with the server. Can you reproduce when fetching something else? (Try nodejs.org)

IbrahimTanyalcin (Author) commented Oct 9, 2024

@redyetidev if it is an issue with the server, why are Node 16 and Node 18 consistently passing the tests?
I have attached at least 5 threads related to this issue. Did they all have an issue with the server?
@anonrig Did anything change in Node's event loop handling since Node:18 ?

avivkeller (Member) commented Oct 9, 2024

Like in Node 18, Node 16, or before, Node should be able to handle concurrent requests and not hand the client a socket that has already been closed.

Oh, I didn't see that statement, can you try and reproduce in v22?

avivkeller added the http, v20.x, and v22.x labels Oct 9, 2024
IbrahimTanyalcin (Author) commented:

Like in Node 18, Node 16, or before, Node should be able to handle concurrent requests and not hand the client a socket that has already been closed.

Oh, I didn't see that statement, can you try and reproduce in v22?

I will try with 22, will let you know.

IbrahimTanyalcin (Author) commented:

@redyetidev I tested with 22 and I get the same symptoms. HOWEVER, I am starting to think this is because memcached connections are being closed too frequently and Node 20/22 is hitting a dead socket. I can see that Node 18 and before processes the socket connections in bursts of 50, whereas 20/22 is more transient and continuous, which could increase the chances of hitting a dead socket. So to test it out, I commented out cache.end() to memcached (which is not necessary, since memcached closes idle sockets after a timeout) and now 20/22 is passing the stress test.

I am not sure; I am now inspecting with NODE_DEBUG=net and will report back once I have more conviction. I think the remedy is not to call cache.end() and to let the memcached timeout clear connections out on its own. Will report back.

IbrahimTanyalcin (Author) commented Oct 11, 2024

Closing this issue as it is only indirectly related to what's changed since Node 16. I was able to debug what's going on with NODE_DEBUG=net; if someone is interested I can attach the logs. Here is a summary of what was going on (a sketch for capturing these traces follows the list):

  • Under Node 20 and onwards, if you do call cache.end(), you will see a lot more connections than usual, like NET NNNNN: SERVER handle? true connections? 150
  • If you do not clear the current memcached connections, the server can handle connections well.
  • In contrast, Node 16 closes connections more frequently.
  • You will see a lot of NET NNNNN: destroy under Node 16, which hints that Node 16 explicitly closes connections.
  • Node 20 does a lot more work with connections; for example you will see NET NNNNN: afterConnect and NET NNNNN: SERVER _emitCloseIfDrained.
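For reference, one way to capture the traces above, equivalent to running NODE_DEBUG=net node server.js from a shell (server.js is a placeholder for your actual entry point):

const { spawn } = require('node:child_process');

// Relaunch the server with NODE_DEBUG=net so that every socket event is
// traced to stderr as "NET <pid>: ..." lines (server.js is hypothetical)
spawn(process.execPath, ['server.js'], {
    env: { ...process.env, NODE_DEBUG: 'net' },
    stdio: 'inherit'
});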

So as a result:

  • Node 16 seems to handle connections in batches and aggressively closes them, which avoids connection buildup.
  • Node 20 does more work with connections, such as afterConnect or _emitCloseIfDrained.
  • Node 20 handles connections NOT in batches but more continuously, and is more eager to accept new connections, as opposed to Node 16 working in batches and being more eager to close connections.

I do not know if these differences are due to changes in the event loop, but they nonetheless increase your chances of hitting a dead socket. So the FIX is to NOT call cache.end() on the currently active pool of memcached connections and to let them expire on their own; a sketch of the change follows. I am leaving this here in case anyone bumps into the same problem!
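For anyone applying the same workaround, the change amounts to this (a sketch assuming the memcached npm client listed in the dependencies above; the key and idle value are illustrative):

const Memcached = require('memcached');

// Configure an idle timeout instead of tearing the pool down manually;
// idle sockets are then reaped by the client on its own schedule
const cache = new Memcached('localhost:11211', {
    idle: 5000 // close idle connections after 5 seconds
});

cache.get('session:abc', (err, data) => {
    if (err) return console.error(err);
    console.log(data);
    // cache.end(); // removed: under Node 20+ closing the pool here made
    //              // later requests land on dead sockets
});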
