Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revert "net: enable autoSelectFamily by default" #48403

Closed
wants to merge 1 commit into from

Conversation

mcollina
Copy link
Member

@mcollina mcollina commented Jun 9, 2023

This reverts commit 8b51c1a.

I think we were too eager to turn this on. In the Fastify community, we have been trying to get our tests to pass on Node v20 for a month and a half with minimal luck. The only solution was to use NODE_OPTIONS=no-network-family-autoselection.

Note that tests spuriously fail even with v20.3.0, so the fixes are not complete. I was not able to reproduce the problem individually. My 2 cents is that there is a race condition somewhere triggered by many open sockets in a short period of time.

This reverts commit 8b51c1a.

Signed-off-by: Matteo Collina <hello@matteocollina.com>
@nodejs-github-bot
Copy link
Collaborator

Review requested:

  • @nodejs/net

@nodejs-github-bot nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. lib / src Issues and PRs related to general changes in the lib or src directory. needs-ci PRs that need a full CI run. labels Jun 9, 2023
@mcollina
Copy link
Member Author

mcollina commented Jun 9, 2023

cc @nodejs/releasers @nodejs/tsc

This is technically semver-major but it's useful only if it is backported to Node v20.

@mcollina
Copy link
Member Author

mcollina commented Jun 9, 2023

We might want to land this only to v20.x.

@mcollina mcollina added the request-ci Add this label to start a Jenkins CI on a PR. label Jun 9, 2023
@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Jun 9, 2023
@nodejs-github-bot
Copy link
Collaborator

@RafaelGSS RafaelGSS added dont-land-on-v16.x dont-land-on-v18.x PRs that should not land on the v18.x-staging branch and should not be released in v18.x. labels Jun 9, 2023
@tniessen
Copy link
Member

tniessen commented Jun 9, 2023

For context, there are related discussions in #47644, #48028, and #48145. @silverwind suggested reverting in #48145, and other collaborators have mentioned that they'd have preferred not to enable Happy Eyeballs in the first place. However, according to @ShogunPanda, all known bugs (aside from deviating from the RFC) are fixed in 20.3.0. Do the fastify issues persist with 20.3.0 @mcollina?

@ShogunPanda
Copy link
Contributor

Yes, it persist with 20.3.0.
But it seems to be a different (possibly partially unrelated) issue as none of the known bugs seems to surface up. I'm debugging this as we speak.

@ShogunPanda
Copy link
Contributor

ShogunPanda commented Jun 9, 2023

About having this enabled by default: I'm a bit conflicted on this. Regardless of the implementation, I wonder if will cause more harm than good to users.

To clarify, it is good that we had lot of bugs reports and we were able to fix them. But I start thinking a feature like this has too many network implications that probably we should have it disable by default and always have user explicitly opt-in when they need it.

@silverwind
Copy link
Contributor

My preference would be to revert only for v20 and working out the remaining issues on v21+ (#48145 would be a must-have for me).

@ShogunPanda
Copy link
Contributor

I don't understand the parallel connection request. According to Happy Eyeballs RFC 8305, section 5, connections attempts SHOULD NOT be made in parallel.
Why do we want to deviate from this?

@silverwind
Copy link
Contributor

I think you are misinterpreting the RFC. It just says they should not start in parallel (meaning true concurrency, e.g. multiple CPUs), but they can and should be pending at the same time. Essentially the connections are raced against each other. At least that is the whole point of Happy Eyeballs to my understanding and Wikipedia seems to agree.

@ShogunPanda
Copy link
Contributor

I'm not quite sure about it. The RFC complete sentence is:

In order to avoid unreasonable network load, connection attempts SHOULD NOT be made simultaneously.

So it's not referring to just resources on the originating machine. If we implement like that, each new connection will result in at least two connection attempts. This will affect source machine, routers, ISP and so forth.

@tniessen
Copy link
Member

tniessen commented Jun 9, 2023

I'm not quite sure about it. The RFC complete sentence is:

In order to avoid unreasonable network load, connection attempts SHOULD NOT be made simultaneously.

I think the RFC authors mean that multiple connection attempts should not be started simultaneously. A good implementation should base the delay between attempts on the SYN retransmission, so at most two SYN packets would be sent at a time in a proper parallel implementation.

@silverwind is right. The important sentence from the RFC is this:

Starting a new connection attempt does not affect previous attempts, as multiple connection attempts may occur in parallel.

@richardlau
Copy link
Member

The discussion around parallel connections should happen in #48145. As far as this PR goes, there are test failures, so presumably either more commits need to be reverted or additional commits to fix the tests.

@ShogunPanda
Copy link
Contributor

@richardlau Sure, we will move the discussion there.

@mcollina I checked your fastify failure and I made several tests using main branch. The problems seems not to be in family autoselection (as I manually disabled in the code) but in the presence of NODE_OPTIONS being set to any value. I know it makes no sense and I kinda hope somebody proves me wrong. As a proof of this, you can have a green run by running NODE_OPTIONS=whatever tap

@ShogunPanda
Copy link
Contributor

I confirmed in private with @mcollina. The fastify failures are unrelated to family auto selection and it seems to be a weird combination between the tap ecosystem, Node and NODE_OPTIONS.

Do we want to close this PR or do we still want to proceed?

@tniessen
Copy link
Member

I don't have a strong opinion. The current implementation is flawed, apparently both due to technical reasons and because of disagreement w.r.t. how the RFC should be interpreted, and most importantly, it does introduce new network problems. On the other hand, it also solves some historic (and somehow still relevant) network problems. Myself and others would have preferred autoSelectFamily to be opt-in instead of opt-out to keep the stable and predictable pre-20 behavior by default, but others approved #46790 so 🤷.

@tniessen
Copy link
Member

According to KararTY/dank-twitch-irc#13 (comment), the ERR_INTERNAL_ASSERTION bug still exists in 20.3.0 despite repeated claims that this release would fix all known bugs. Every day, users open new duplicate bug reports (three within the last two hours alone). @ShogunPanda suggested in #47644 (comment):

The best course of action, IMVHO, would be to wait for 20.3.0 to settle a little bit and see if bug reports stop. If not we disable it.

Does this imply that you are in favor of disabling the feature @ShogunPanda?

@ShogunPanda
Copy link
Contributor

@tniessen To be honest no.

I checked KararTY/dank-twitch-irc#13 (comment) and I could not reproduce the issue. All the other bug reports you mentioned are lacking the version so it's impossible to see whether they were in 20.3.0 or not.

Other than the annoying duplicate bug reports, may I ask you why are you so eager in disabling this?

@tniessen
Copy link
Member

Other than the annoying duplicate bug reports, may I ask you why are you so eager in disabling this?

Because each "annoying duplicate bug report", as you put it, represents at least one but potentially even a large number of negatively affected users, whereas we have no reliable metrics for the benefits of the feature. (To be fair, I am no expert in this matter so I tend to rely on others' expertise, and I have yet to meet someone more knowledgeable than Ben, for example, who was against enabling this by default.)

@ShogunPanda
Copy link
Contributor

I see.
In my opinion we should try to solve the bugs around the issue rather than "hiding it under the carpet" (which is what I fear will happen if we disabled it by default, as I'm not able to reproduce these issues locally and thus those report are really valuable).

@joyeecheung
Copy link
Member

joyeecheung commented Jun 13, 2023

In this case I think it might still be okay to wait for maybe a week to confirm that this still reproduces on v20.3.0. But in retrospect I think we should've reverted this a month ago. When we notice any large-scale regression, it's usually never really worth waiting for a non-trivial fix. Reverting and do a patch release ASAP and then wait for a fix to reland is almost always better. The more we wait the more users get frustrated and will have to add workarounds to their code/environment which are meant to go away.

@tniessen
Copy link
Member

I couldn't agree more @joyeecheung. We should have reverted much earlier and postponed to Node.js 21, which also won't become LTS, or perhaps indefinitely.

In this case I think it might still be okay to wait for maybe a week to confirm that this still reproduces on v20.3.0

It has been confirmed, see KararTY/dank-twitch-irc#13 (comment). But I guess the intention is to fix yet another bug and hope that that finally resolves the issue.

@ShogunPanda
Copy link
Contributor

@tniessen I already fixed that bug. AFAIK there is only npm/cli#6409 (which I'm working on at the moment) and then I have nothing else reported. Am I missing any bug?

@tniessen
Copy link
Member

I'm talking about the bug discussed in KararTY/dank-twitch-irc#13, which was supposed to be fixed in 20.3.0 :)

@ShogunPanda
Copy link
Contributor

Lol, yes. I know. Coincidentally, two bugs failed in the same assertion. I fixed the first one and now I'm taking care of the second one.

@joyeecheung
Copy link
Member

joyeecheung commented Jun 14, 2023

If we still have bugs that are pending a fix, I think the best thing to do is have a release ASAP that turns it off by default again. The more we wait the more users are going to be bothered by regressions. Improving the implementation of a new feature is not a good justification for regressions, IMO.

(Do we really need a full revert like what this PR does though? It seems to me we just need to set the CLI option to false by default again)

@ShogunPanda
Copy link
Contributor

@joyeecheung The option was renamed when I turned it on so yes, we need a full revert unfortunately.

Anyway, just for context, I'd like to emphasize that the option existed until the full course of 19 and no bug reports came in. I'm afraid that if we turn this off we will never be able to turn it on reliably anymore.

@RafaelGSS
Copy link
Member

I don't have a strong opinion either, but It's unlikely to have this PR ready for this week and have it released due to the upcoming security release. Considering the main purpose of this issue was to prevent connectivity issues within IPV6, I feel this is something we'll need to have at some moment.

As stated earlier in this thread, if we had to revert, we should have rolled back weeks ago. Reverting it now into a breaking change isn't great IMO.

@silverwind
Copy link
Contributor

silverwind commented Jun 14, 2023

@joyeecheung The option was renamed when I turned it on so yes, we need a full revert unfortunately.

Anyway, just for context, I'd like to emphasize that the option existed until the full course of 19 and no bug reports came in. I'm afraid that if we turn this off we will never be able to turn it on reliably anymore.

How about only disabling in on the v20 branch? I guess v21 can accept a bit more instability as you have seen that v19 barely got any real testing, and that will give plenty of time to fix the issues before v22.

@ShogunPanda
Copy link
Contributor

ShogunPanda commented Jun 14, 2023

That might be an acceptable solution, but I'm still pretty skeptical about the approach. I think we might be able to solve all the (major) issues in few weeks.
So far, AFAIK, I only have two left, one already solved locally.

I totally agree that we should have reverted this long ago probably (and I apologize about this), but now the pro and cons of a breaking change lean towards the cons.

@ShogunPanda
Copy link
Contributor

FYI: I fixed all issue I'm aware of in #48464.

@joyeecheung
Copy link
Member

joyeecheung commented Jun 15, 2023

The option was renamed when I turned it on so yes, we need a full revert unfortunately.

Why do we need a full revert just because it's renamed? We could also keep the name but just turn it off by default again?

I'm afraid that if we turn this off we will never be able to turn it on reliably anymore.

Why would you think that would be the case? We could just turn it off and quickly release a patch, then land the fix and reenable it. If bug reports still come in, we do that again. Turning it on and off isn't ideal, but probably better than having a large-scale regression frustrating users for weeks. This happens a lot in Chromium/V8, for example, though they have a different release model (mostly Canary + Stable, but not too different from our Current + LTS I would say).

That said now that the security release is in progress, it's probably too late to do this cycle now. But we should probably try something similar for large-scale regressions in the future.

@ShogunPanda
Copy link
Contributor

Why do we need a full revert just because it's renamed? We could also keep the name but just turn it off by default again?

It's not just the rename of the options. We also have to revert some tests that changed error behavior.

Why would you think that would be the case? We could just turn it off and quickly release a patch, then land the fix and reenable it. If bug reports still come in, we do that again. Turning it on and off isn't ideal, but probably better than having a large-scale regression frustrating users for weeks. This happens a lot in Chromium/V8, for example, though they have a different release model (mostly Canary + Stable, but not too different from our Current + LTS I would say)

Well, all flipping is technically a breaking change. If we don't care too much then probably is not an issue.
But the real problem is that when we flip it off we can't really reproduce any error. The large scale userbase is, saying it really cynically, the best resource to quickly tackle problems.

@joyeecheung
Copy link
Member

It's not just the rename of the options. We also have to revert some tests that changed error behavior.

I think we could just enable the flag in those tests?

Well, all flipping is technically a breaking change.

The point about not breaking is so that we don't have large-scale regressions. As this feature is very new I doubt anyone is currently relying on it being the default, meanwhile judging from the bug reports, there are definitely more people relying on it not regressing.

But the real problem is that when we flip it off we can't really reproduce any error.

I don't see that as a problem when we are already sure about the errors. Again, we could just revert it while we work on known and certain bugs, instead of leaving users frustrated while we work on the bugs that we already know. We can just flip it back when the fix lands to see if the error still reproduces, Chromium/V8 has been doing this for many years and they have a much larger user base than ours.

@tniessen
Copy link
Member

The large scale userbase is, saying it really cynically, the best resource to quickly tackle problems.

And yet, these problems have not been tackled quickly. Node.js 20 was released more than two months ago and #47644 was opened just a day later. Still, every 20.x release since then has contained bugs related to this feature that affect a large number of users.

@ShogunPanda
Copy link
Contributor

The large scale userbase is, saying it really cynically, the best resource to quickly tackle problems.

And yet, these problems have not been tackled quickly. Node.js 20 was released more than two months ago and #47644 was opened just a day later. Still, every 20.x release since then has contained bugs related to this feature that affect a large number of users.

I'm not sure where you are going with this.

As soon as bugs started popping I promptly tried to fix all of them at my maximum speed. Unfortunately, unless we went for patch release (and, TBH, I don't even know what is the procedure for this), we had to wait for new release to be cut for people to benefit for it.

@ShogunPanda
Copy link
Contributor

It's not just the rename of the options. We also have to revert some tests that changed error behavior.

I think we could just enable the flag in those tests?

Good point. Yes, we can.

I don't see that as a problem when we are already sure about the errors. Again, we could just revert it while we work on known and certain bugs, instead of leaving users frustrated while we work on the bugs that we already know. We can just flip it back when the fix lands to see if the error still reproduces, Chromium/V8 has been doing this for many years and they have a much larger user base than ours.

We can definitely do that. At the moment I've fixed all the bugs I had a report for.
We can make a choice for 20.4.0 to flip it off or give 20.4.0 a last chance to see if all bugs are finally settled.

@tniessen
Copy link
Member

I'm not sure where you are going with this.

I think we all agree that we should have reverted many weeks ago. That's all I wanted to say :)

@lpinca
Copy link
Member

lpinca commented Jul 23, 2023

It seems things are better now. I'm closing this. We can reopen it if needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c++ Issues and PRs that require attention from people who are familiar with C++. dont-land-on-v18.x PRs that should not land on the v18.x-staging branch and should not be released in v18.x. lib / src Issues and PRs related to general changes in the lib or src directory. needs-ci PRs that need a full CI run.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants