Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UND_ERR_CONNECT_TIMEOUT errors thrown when there is CPU intensive code on the event loop #3410

Closed
mknichel opened this issue Jul 17, 2024 · 16 comments · Fixed by #3495 or #3552
Closed
Labels
bug Something isn't working

Comments

@mknichel
Copy link

Bug Description

Work on the event loop can interrupt the Undici lifecycle for making requests, causing errors to be thrown even when there is no problem with the underlying connection. For example, if a fetch request is started and then work on the event loop takes more than 10 seconds (default connect timeout), Undici will throw a UND_ERR_CONNECT_TIMEOUT error even if the connection could be established very quickly.

I believe what is happening is:

  1. When the fetch request is started, Undici starts the work to make a connection. Undici calls setTimeout with the value of the connectTimeoutMs to throw an error and cancel the connection if it takes too long (https://github.com/nodejs/undici/blob/main/lib/core/connect.js). It makes a call to GetAddrInfoReqWrap (https://github.com/nodejs/node/blob/main/lib/dns.js#L221), but this is asynchronous and processing of the callback will be delayed until the next event loop.
  2. User tasks block the event loop for a long period of time.
  3. The onConnectTimeout timer is run because the previous task took longer than the timeout. onConnectTimeout calls setImmediate with a function to destroy the socket and throw the error. https://github.com/nodejs/undici/blob/main/lib/core/connect.js
  4. The GetAddrInfoReq lookup callback (emitLookup in node:net) is run. This code begins the TCP connection (internalConnect is called in https://github.com/nodejs/node/blob/main/lib/net.js#L1032) but that is also asynchronous, so it won't finish in this round of the event loop.
  5. The setImmediate function is run in the next phase which destroys the socket and throws the UND_ERR_CONNECT_TIMEOUT error.
  6. Undici never gets a chance to handle the TCP connection response.

Internally at Vercel, we have been seeing a high number of these UND_ERR_CONNECT_TIMEOUT issues while pre-rendering pages in our Next.js application. I can't run this task on my local machine so it's harder to debug, but it's a CPU intensive task and moving fetch requests to a worker thread eliminated the Undici errors. We tried other suggestions (like --dns-result-order=ipv4first and verified that we were not seeing any packet loss) that did not resolve the issue. Increasing the connect timeout resolves the issue in the reproduction but not the issue in our Next.js build (which I can't explain).

Reproducible By

A minimal reproduction is available at https://github.com/mknichel/undici-connect-timeout-errors.

We can reproduce the behavior on Node 18.x and 20.x and with the 5.24.0 and the latest version of Undici (6.19.2)

Expected Behavior

The Undici request lifecycle could operate on a separate thread that does not get blocked by user code. By separating it out from the user code, this would remove impact of any user code on requests.

To test this theory, we created a dispatcher that proxied the fetch request to a dedicated worker thread (new Worker from worker_threads). This eliminated all the Undici errors that we were seeing in our Next.js build.

Logs & Screenshots

In the minimal reproduction, the error is:

TypeError: fetch failed
    at fetch (/Users/mknichel/code/tmp/undici-connect-timeout-errors/node_modules/.pnpm/undici@6.19.2/node_modules/undici/index.js:112:13)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at fetchExample (/Users/mknichel/code/tmp/undici-connect-timeout-errors/index.ts:21:20)
    at main (/Users/mknichel/code/tmp/undici-connect-timeout-errors/index.ts:66:3) {
  [cause]: ConnectTimeoutError: Connect Timeout Error
      at onConnectTimeout (/Users/mknichel/code/tmp/undici-connect-timeout-errors/node_modules/.pnpm/undici@6.19.2/node_modules/undici/lib/core/connect.js:190:24)
      at /Users/mknichel/code/tmp/undici-connect-timeout-errors/node_modules/.pnpm/undici@6.19.2/node_modules/undici/lib/core/connect.js:133:46
      at Immediate._onImmediate (/Users/mknichel/code/tmp/undici-connect-timeout-errors/node_modules/.pnpm/undici@6.19.2/node_modules/undici/lib/core/connect.js:174:9)
      at process.processImmediate (node:internal/timers:478:21) {
    code: 'UND_ERR_CONNECT_TIMEOUT'

In our Next.js builds, the error is:

TypeError: fetch failed
    at node:internal/deps/undici/undici:12618:11
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async s (elided path)
    at async elided path {
  cause: ConnectTimeoutError: Connect Timeout Error
      at onConnectTimeout (node:internal/deps/undici/undici:7760:28)
      at node:internal/deps/undici/undici:7716:50
      at Immediate._onImmediate (node:internal/deps/undici/undici:7748:13)
      at process.processImmediate (node:internal/timers:478:21)
      at process.callbackTrampoline (node:internal/async_hooks:130:17) {
    code: 'UND_ERR_CONNECT_TIMEOUT'
    }
  }

Environment

The reproduction repo was erroring for me on Mac OS 14.4, while internally we are seeing issues on AWS EC2 Intel machines.

Additional context

Vercel/Next.js users have reported UND_ERR_CONNECT_TIMEOUT issues to us:

@mknichel mknichel added the bug Something isn't working label Jul 17, 2024
@ronag
Copy link
Member

ronag commented Jul 18, 2024

I think something like this would also do the trick:

diff --git a/lib/util/timers.js b/lib/util/timers.js
index d0091cc1..fa56be8b 100644
--- a/lib/util/timers.js
+++ b/lib/util/timers.js
@@ -2,13 +2,13 @@
 
 const TICK_MS = 499
 
-let fastNow = Date.now()
+let fastNow = 0
 let fastNowTimeout
 
 const fastTimers = []
 
 function onTimeout () {
-  fastNow = Date.now()
+  fastNow += TICK_MS
 
   let len = fastTimers.length
   let idx = 0

Though it needs some work to update tests.

@dipesh112
Copy link

@mknichel any progress or updates on this one? My Vercel/Next.js based projects are facing bad experiences.

@Uzlopak
Copy link
Contributor

Uzlopak commented Aug 11, 2024

I am currently working on this

@thisislvca
Copy link

Any updates?

@Uzlopak
Copy link
Contributor

Uzlopak commented Aug 18, 2024

I am on it ;)

@Uzlopak
Copy link
Contributor

Uzlopak commented Aug 27, 2024

this bug is potentially still existing. we need more investigating.

@philipsolarz
Copy link

Hi, I've been getting the UND_ERR_CONNECT_TIMEOUT for a few months now. For my use case I have a default Next.js 14 App, using next-auth@beta with GitHub as provider. I go to http://localhost:3000/api/auth/signin and attempt to sign in.

  1. npm run dev (Fail)
  2. npm run build && npm run start (Fail)
  3. docker build (Success) using: https://nextjs.org/docs/pages/building-your-application/deploying#docker-image
  4. same image running in k8s (Fail)

Code: https://github.com/philipsolarz/my-next-app/

Any ideas or hints?

@Uzlopak
Copy link
Contributor

Uzlopak commented Aug 28, 2024

ah, a new reproduction. let me check it

@Uzlopak
Copy link
Contributor

Uzlopak commented Aug 28, 2024

@philipsolarz

I dont see how i can provoke the issue. Do you have discord? twitter? something were we could chat?

@Uzlopak
Copy link
Contributor

Uzlopak commented Aug 28, 2024

@philipsolarz
I added you on linkedin. Maybe we can chat there?

@thisislvca
Copy link

Hi, I've been getting the UND_ERR_CONNECT_TIMEOUT for a few months now. For my use case I have a default Next.js 14 App, using next-auth@beta with GitHub as provider. I go to http://localhost:3000/api/auth/signin and attempt to sign in.

  1. npm run dev (Fail)
  2. npm run build && npm run start (Fail)
  3. docker build (Success) using: https://nextjs.org/docs/pages/building-your-application/deploying#docker-image
  4. same image running in k8s (Fail)

What OS are you running & what's your hardware configuration? I've been having the issue in prod for the past three-ish months, but never had it locally...

@kaitok
Copy link

kaitok commented Sep 5, 2024

I also got same issue.
node uses ipv6 by default, I specified ipv4 in next.config.js and it didn't solve the problem.

MacOS
Next.js version: 14.0.1

const dns = require('dns')
dns.setDefaultResultOrder('ipv4first')

@Uzlopak
Copy link
Contributor

Uzlopak commented Sep 5, 2024

It is not because of ipv6 or ipv4

@2-fly-4-ai
Copy link

Same issue here.

@2-fly-4-ai
Copy link

It is not because of ipv6 or ipv4

Has anyone figured out how to solve this issue yet?

@LikeDreamwalker
Copy link

I can reproduce this issue sometimes when I am trying to update next itself or installed some new dependencies.
So if you are facing this issue and you sure that your network and service are fine, you could try to remove the node_modules and .next folders content, to re-install and re-build again.
That works for me and I can be sure that this issue is not related to next itself, so you won't need to worry that if your updates on next cause some break issues on your server request.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
9 participants