Freeze releases and website changes, pending cache fixes? #1416

ljharb · 2023-07-13T16:15:10Z

Every time a commit is pushed to the website, or a release is done, I'm told the CloudFlare cache of nodejs.org/dist is purged, which causes a lot of server churn as the cache is repopulated, which also causes both nodejs.org/dist and iojs.org/dist to break.

During this time, anyone trying to install node may encounter 5xx errors; anyone using nvm to do anything remote may encounter 5xx errors (nvm relies on both index.tab files to list available versions to install); and any CI based on dynamically building a matrix from index.tab is likely to encounter 5xx errors.

I would offer my opinion that "changes to the website" are likely never more important than "people's ability to install node", and "a new release of node" is, modulo security fixes, almost never more important than that ability either.

Fixing the problem requires people with all of access, ability, and time, and one or more of those has been lacking for awhile - and to be clear, I'm not complaining about this fact: everyone involved in node is doing their best to volunteer (or wrangle from an employer) what time they can. However, I think it's worth considering ways to avoid breakage until such time as a fix can be implemented.

Additionally, this seems like very critical infrastructure work that perhaps @openjs-foundation could help with - cc @rginn, @bensternthal for thoughts on prioritizing this work (funding and/org person-hours) for DESTF?

I'd love to hear @nodejs/build, @nodejs/releasers, and @nodejs/tsc's thoughts on this.

Related: nodejs/nodejs.org#5302 nodejs/nodejs.org#4495 and many more

The text was updated successfully, but these errors were encountered:

MattIPv4 · 2023-07-13T16:35:27Z

Major +1 here, the frequent releases and website updates cause a full cache purge in Cloudflare every single time, putting a massive load on the origin that it is currently unable to handle, leading to Node.js becoming essentially unavailable to download.

Until time is put into reworking the origin (likely moving it primarily to R2 with a Worker that handles fallback to the origin server), and into reworking how cache purging happens for releases, I would agree that doing a freeze of releases + website updates makes sense to ensure the Cloudflare cache is retained so Node.js is actually available for folks to even download (there's no point releasing new versions if folks can't download them or read the docs for them).

ovflowd · 2023-07-13T16:38:53Z

Major +1 here,

A random unordered mental dump:

A lot of threads on Slack regarding these subjects
A lot of man-hours (volunteer, non-paid) of the Build WG + some folks (like @MattIPv4 and me included)
A lot of discussions happening
The website Team receiving most of the complaints, and we're kind of trying to manage things
A lot of flakiness and issues boiling up
Users being affected from time to time. Including other open-source projects and CI systems such as GitHub Actions and more.

Few (potential) suggestions:

Freeze pushes on nodejs.org on US hours (keep pushes on non-working hours)
- Enable GitHub to merge queues to reduce the number of actual pushes
It's ultimately the OpenJS members (sponsors) interests in us having a reliable infra
We can't keep having our volunteers stepping it that much; It's draining them.
Honestly speaking, we should get either the Foundation to hire someone to take care of Build/Infra or have one (or more) member companies hire someone to help us with infra. (Some companies such as Postman, RedHat, Cloudflare, and many others have full-time dedicated people for Open Source; we could get them to open a role to support us. AsyncAPI for example, is fully backed by Postman)

nschonni · 2023-07-13T16:41:46Z

Linking to the cache purge tracking issue nodejs/build#3410

jasnell · 2023-07-13T16:58:05Z

I don't have a lot of context here but it sounds like there's pain and it's great to relieve pain so I'm all for whatever needs to be done here.

mcollina · 2023-07-14T07:08:23Z

There are no good option here. The best outcome would be to have somebody redesign these pipelines and only purges the URLs that are needed, or possibly nothing at all (and use stale-while-revalidate semantics).

Given that this requires a volunteer to lead that effort or funds, possibly the least bad options would be to:

limit cache invalidations to Node.js releases (not canary or nightly)
release the website only once a week / ideally on weekends

I'm not happy with any of these, but I don't think we can do much better right now.

MoLow · 2023-07-14T07:34:24Z

I have raised this before at https://github.com/nodejs/build, but since we are discussing hosting static files, is there a reason why we are managing the infrastructure ourselves, and not using a managed service for this? (e.g amazon s3/azure blob storage/github pages/cloudflare pages etc)
is there some security or integrity consideration that has led to the decision to host files ourselves?

I think the solutions suggested above are fine for the short term, but if there is no other reason we should probably consider a managed solution for the long term.

if this makes sense, I am glad to lead such an effort

mcollina · 2023-07-14T07:48:36Z

@MoLow mostly money. Node.js infrastructure consumes very little of the foundation money.

Moreover, all of this was put in place a long time ago and there were fewer options at the time.

targos · 2023-07-14T08:01:49Z

I think it's also related to the fact that Node.js was a lot less downloaded when this was put into place many years ago.

ovflowd · 2023-07-14T08:35:19Z

Hey @mcollina, just to mention that your proposed solutions will not solve the situation (I guess that's why you mentioned bad options?) but just reduce the problems (from what you've explained, they would reduce somewhat significantly already, but it's just a patch I'd say, because the moment we do cache invalidations the issue happens, because our servers are just unable to handle)

ovflowd · 2023-07-14T08:37:26Z

I have raised this before at nodejs/build, but since we are discussing hosting static files, is there a reason why we are managing the infrastructure ourselves, and not using a managed service for this? (e.g amazon s3/azure blob storage/github pages/cloudflare pages etc)

We are having talks over adopting Cloudflare R2, they offered us the R2 service (similar to AWS S3) for free with all the traffic and needs we have. It is a path we're exploring!

ovflowd · 2023-07-14T08:39:06Z

I think the solutions suggested above are fine for the short term, but if there is no other reason we should probably consider a managed solution for the long term.

A managed solution still requires someone to "manage" them or at least maintain them. In the case of R2 we need to write Cloudflare Workers and do a lot of initial configs just to mirror our current www-standalone server to R2 (at least the binaries and assets)

FYI a lot of discussion is happening on #nodejs-build, our issue right now is definitely not the lack of good plans/ideas, but the lack of someone able to execute them.

ovflowd · 2023-07-14T08:41:16Z

if this makes sense, I am glad to lead such an effort

I think it's better to leave the people at Node.js Build WG that understand the situation completely to lead this initiative technically. What we need is an ack from the TSC about this issue and that we're able to dedicate resources into this.

Not to mention, what @ljharb suggested would already be a temporary "workaround" to improve the user experience, by reducing Website builds and releasing "promotions". We still need someone (or a bunch of people) to be able to do the long term-plan...

mhdawson · 2023-07-14T13:21:31Z

I'd be ok with not invalidating for nightly and canary releases or possibly doing them less often. For the others I think that releases are not often enough that we should slow down releases of Current and LTS lines.

mhdawson · 2023-07-14T14:30:23Z

As @ovflowd mentions the key question is what we do in the mid to long term in terms of "We still need someone (or a bunch of people) to be able to do the long term-plan...".

Sounds like @MoLow who is a member of the build WG has offered to lead work on the mid to longer term plan in #1416 (comment) and I think that it would be great to start working on that.

I also think that in terms of keeping things up/running even after we have a new/better infrastructure we need people who can drop everything else when needed to address problems with the downloads, OR set the expectation that it's a best effort and there is no SLA. The downloads may not be available at any point in time and people should plan for that. On this front I've asked for help from the Foundation in the past on the build side, presented to the board, worked with Foundation staff on summaries of work etc. but unfortunately that did not result in resources to let us be more proactive. It may be a different time, and or the situation more urgent now so looking at that again might make sense.

ovflowd · 2023-07-14T21:18:53Z

Sounds like @MoLow who is a member of the build WG has offered to lead work on the mid to longer term plan in #1416 (comment) and I think that it would be great to start working on that.

I completely forgot @MoLow was on the build team, +1 for him to lead the initiative!

danielleadams · 2023-07-17T21:34:53Z

Thanks for bringing this up. Currently, there is an LTS release in flight that I'd like to get out because it has a lot of anticipated changes (nodejs/node#48694). I had planned to get it out around 1:00 UTC to accommodate a "low activity" time, but that doesn't look like it's going to happen.

Instead, I'm just going to get this release out as soon as possible (hopefully in the next 12 hours), and then in the next release meeting we can discuss optimal time frames for promoting builds.

ovflowd · 2023-07-17T22:04:31Z

Thanks @danielleadams, I'll be monitoring our infra, I'll let you know if anything weird happens 👀

richardlau · 2023-07-18T17:13:14Z

In terms of actual releases we're not doing them that often (for example, the last non-security 18.x release prior to the one @danielleadams is working on was back in April) -- I don't think freezing releases would actually gain much. The last actual release, for example, was 20.4.0 on 5 July and we've had plenty of issues since then without a new release being put out.

We are purging the CloudFlare cache perhaps three or more times a day for the nightly and v8-canary builds -- as far as the current tooling/scripts are concerned there is no difference in how those are treated vs releases (so it's one thing saying that maybe they should not be, but another to do the remedial work). And while frequent cache purges are certainly not helping the situation, I'm not convinced that the problem is entirely related to the CloudFlare cache.

MattIPv4 · 2023-07-18T17:44:46Z

I think perhaps the wording here, regarding freezing of releases, was intended to also capture the release of nightly/canary builds, as those also cause cache purges.

While I agree I don't think cache purging itself is probably the issue at its core here, the origin seems to just be rather unhappy, avoiding purging the cache many times a day is definitely going to massively improve the situation, as Cloudflare will be able to actually serve stuff from their cache, rather than it being repeatedly wiped and forcing traffic to be served from the struggling origin.

ovflowd · 2023-07-18T17:57:05Z

I think perhaps the wording here, regarding freezing of releases, was intended to also capture the release of nightly/canary builds, as those also cause cache purges.

☝️ exactly this!

While I agree I don't think cache purging itself is probably the issue at its core here, the origin seems to just be rather unhappy, avoiding purging the cache many times a day is definitely going to massively improve the situation,

Same here. If we can avoid purging caches for nightly/canary releases as it might not be that much needed, that'd be great!

mhdawson · 2024-04-15T21:50:37Z

I think we chose not to freeze release or changes to the website so this can be closed. Unless there are objections to be closing this in the next few days I'll go ahead and do that.

MattIPv4 · 2024-04-15T22:34:00Z

+1, the website is no longer served out of NGINX and releases are now served from R2 afaik, so I think this is no longer a problem.

richardlau mentioned this issue Jul 14, 2023

DigitalOcean www server nodejs/build#3424

Open

mhdawson mentioned this issue Jul 18, 2023

Node.js Technical Steering Committee (TSC) Meeting 2023-07-19 #1417

Closed

mhdawson added the tsc-agenda label Jul 18, 2023

This was referenced Jul 24, 2023

Node.js Technical Steering Committee (TSC) Meeting 2023-07-26 #1419

Closed

Node.js Technical Steering Committee (TSC) Meeting 2023-08-02 #1421

Closed

This was referenced Aug 7, 2023

Node.js Technical Steering Committee (TSC) Meeting 2023-08-09 #1426

Closed

Node.js Technical Steering Committee (TSC) Meeting 2023-08-16 #1427

Closed

mcollina removed the tsc-agenda label Aug 16, 2023

UlisesGascon mentioned this issue Aug 22, 2023

Integrity checks for R2 migration nodejs/build#3469

Open

8 tasks

mcollina mentioned this issue Sep 28, 2023

Regular errors downloading Node.js tarballs nodejs/help#4254

Closed

shadowspawn mentioned this issue Sep 28, 2023

Retry when downloading/extraction of Node.js tar fails tj/n#784

Closed

mhdawson closed this as completed Apr 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Freeze releases and website changes, pending cache fixes? #1416

Freeze releases and website changes, pending cache fixes? #1416

ljharb commented Jul 13, 2023 •

edited

Loading

MattIPv4 commented Jul 13, 2023

ovflowd commented Jul 13, 2023 •

edited

Loading

nschonni commented Jul 13, 2023

jasnell commented Jul 13, 2023

mcollina commented Jul 14, 2023

MoLow commented Jul 14, 2023 •

edited

Loading

mcollina commented Jul 14, 2023

targos commented Jul 14, 2023

ovflowd commented Jul 14, 2023 •

edited

Loading

ovflowd commented Jul 14, 2023

ovflowd commented Jul 14, 2023

ovflowd commented Jul 14, 2023

mhdawson commented Jul 14, 2023

mhdawson commented Jul 14, 2023

ovflowd commented Jul 14, 2023

danielleadams commented Jul 17, 2023

ovflowd commented Jul 17, 2023

richardlau commented Jul 18, 2023

MattIPv4 commented Jul 18, 2023

ovflowd commented Jul 18, 2023

mhdawson commented Apr 15, 2024

MattIPv4 commented Apr 15, 2024

Freeze releases and website changes, pending cache fixes? #1416

Freeze releases and website changes, pending cache fixes? #1416

Comments

ljharb commented Jul 13, 2023 • edited Loading

MattIPv4 commented Jul 13, 2023

ovflowd commented Jul 13, 2023 • edited Loading

nschonni commented Jul 13, 2023

jasnell commented Jul 13, 2023

mcollina commented Jul 14, 2023

MoLow commented Jul 14, 2023 • edited Loading

mcollina commented Jul 14, 2023

targos commented Jul 14, 2023

ovflowd commented Jul 14, 2023 • edited Loading

ovflowd commented Jul 14, 2023

ovflowd commented Jul 14, 2023

ovflowd commented Jul 14, 2023

mhdawson commented Jul 14, 2023

mhdawson commented Jul 14, 2023

ovflowd commented Jul 14, 2023

danielleadams commented Jul 17, 2023

ovflowd commented Jul 17, 2023

richardlau commented Jul 18, 2023

MattIPv4 commented Jul 18, 2023

ovflowd commented Jul 18, 2023

mhdawson commented Apr 15, 2024

MattIPv4 commented Apr 15, 2024

ljharb commented Jul 13, 2023 •

edited

Loading

ovflowd commented Jul 13, 2023 •

edited

Loading

MoLow commented Jul 14, 2023 •

edited

Loading

ovflowd commented Jul 14, 2023 •

edited

Loading