SLOW OR FAILED (500 ERROR) NODE.JS DOWNLOADS #1993

Closed
watson opened this issue Oct 23, 2019 · 65 comments

@watson
Member

watson commented Oct 23, 2019

Edited by the Node.js Website Team

⚠️ PLEASE AVOID CREATING DUPLICATE ISSUES

Learn more about this incident at https://nodejs.org/en/blog/announcements/node-js-march-17-incident

tl;dr: The Node.js website team is aware of ongoing, intermittent download instability.

More Details: #1993 (comment)

Original Issue Below


All binary downloads from nodejs.org are currently very slow, e.g. https://nodejs.org/dist/v13.0.1/node-v13.0.1-darwin-x64.tar.xz.

In many cases they time out, which affects CI systems like Travis that rely on nvm to install Node.js.

According to @targos, the server's CPU is pegged at 100% by nginx, but we can't figure out what's wrong.
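(For reference, a quick client-side way to quantify the slowdown is curl's write-out timers; the loop below is only an illustrative sketch using the URL above, not part of the original report.)

# Time a few download attempts and report status, size, speed and duration
for i in 1 2 3; do
  echo "attempt $i"
  curl -sSL -o /dev/null \
    -w '  status %{http_code}, %{size_download} bytes, %{speed_download} B/s, %{time_total}s total\n' \
    https://nodejs.org/dist/v13.0.1/node-v13.0.1-darwin-x64.tar.xz
done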

@targos
Member

targos commented Oct 23, 2019

Top of the top:
[screenshot: top output]

@targos
Member

targos commented Oct 23, 2019

@nodejs/build-infra

@atulmy

atulmy commented Oct 23, 2019

Same here. Using tj/n, the download just terminates after a while.

[screenshot, 2019-10-23 8:09 PM]

@kachkaev

kachkaev commented Oct 23, 2019

Azure DevOps pipelines also suffer, because the standard Node.js setup task downloads the binary from nodejs.org.

[screenshot, 2019-10-23 15:43]

Kudos to those who are looking into the CPU usage ✌️ You'll nail it! 🙌

@sam-github
Contributor

@targos what machine is the top output from?

FWIW, the only infra team members are listed at https://github.com/orgs/nodejs/teams/build-infra/members, and Gibson isn't current; he should be removed.

@MylesBorins
Contributor

I wonder what the traffic looks like... could be a DoS.

We've talked about it in the past, but we really should be serving these resources via a CDN. I believe the blocker was maintaining our metrics. I can try to find some time to help out with the infrastructure here if others don't have the time.

@sam-github
Contributor

It could also be that 13.x is just super popular? Hard to know without access to the access logs.

@MylesBorins the infrastructure team is even less staffed than the build team as a whole, so if you wanted to pick that up, it would be 🎉 fabulous.
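(For whoever does have shell access: a rough sketch of tallying downloads per file from the nginx access log; the log path and the standard combined log format are assumptions here, not the actual server layout.)

# Count requests per /dist/v13 path, most requested first (assumed log location/format)
awk '{print $7}' /var/log/nginx/access.log | grep '^/dist/v13' | sort | uniq -c | sort -rn | head -20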

@mhdawson
Member

The metrics are the reason the downloads are not on a CDN yet; @joaocgreis and @rvagg do have an active discussion on what we might do on that front.

@MylesBorins if you can get help for this specific issue, that would be good (please check with @rvagg and @joaocgreis). Better still would be ongoing, sustained engagement and contribution to the overall build work.

@targos
Member

targos commented Oct 23, 2019

@sam-github it's from direct.nodejs.org. I don't have root access to the machine, so top is basically the only thing I could do to help.

@sam-github
Contributor

I don't even have non-root access; I guess you do because you're on the release team.

We'll have to wait for an infra person.

@mhdawson
Member

I can help take a look after the TSC meeting.

@michaelsbradleyjr

In addition to downloads only intermittently succeeding, when they do succeed it seems that sometimes (?) the bundled node-gyp is broken. I'm seeing that behavior in CI builds on Azure Pipelines.

For Linux builds with successful downloads of Node.js v8.x and v10.x, node-gyp fails to build sources that were previously unproblematic.

For Windows builds with successful downloads of Node.js v8.x and v10.x, I'm seeing errors like this:

gyp ERR! stack Error: win-x86/node.lib local checksum 629a559d347863902b81fad9b09ba12d08d939d20a0762578e904c6ee1a7adab not match remote fea7d0c5a94fc834a78b44b7da206f0e60ef11dd20d84dc0d49d57ee77e20e16

That is, on Windows node-gyp doesn't even attempt to compile; it fails outright owing to the checksum error.
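(Side note: a suspect artifact can be checked by hand against the release's published SHASUMS256.txt; a minimal sketch below, using v10.17.0 and win-x86/node.lib purely as an example.)

# Fetch the published checksums for the (example) release and compare with the local file
curl -fsSLO https://nodejs.org/dist/v10.17.0/SHASUMS256.txt
grep 'win-x86/node.lib' SHASUMS256.txt   # expected checksum
sha256sum node.lib                       # checksum of what was actually downloaded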

@mhdawson
Member

@michaelsbradleyjr I suspect that the node-gyp failures would have been related to downloading header files.

@mhdawson
Member

Was looking into this with @targos. It seems to have resolved itself at this point.

Looking at the access logs, it was a bit hard to tell whether there was more traffic or not, as the files for the previous day are compressed. We should be able to compare more easily tomorrow.
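(For the record, the rotated logs can be inspected without decompressing them to disk; a sketch, with the log paths and rotation naming assumed.)

# Requests per hour, yesterday (compressed) vs. today (assumed paths)
zcat /var/log/nginx/access.log.1.gz | awk '{print substr($4, 2, 14)}' | uniq -c
awk '{print substr($4, 2, 14)}' /var/log/nginx/access.log | uniq -c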

@mhdawson
Member

top on the machine looks pretty much the same now as it did while there were issues.

@rvagg
Member

rvagg commented Oct 23, 2019

Hm, that top output doesn't look exceptional (the load average would have been nice to see though, @targos; it should be in the top-right of top's display). Cloudflare also monitors our servers for overload and switches between them as needed, and they're not recording any incidents. It's true that this is a weak point in our infra and that getting downloads fully CDN-fronted is a high priority (we're working on it, but are having trouble finding solid solutions that cover all our needs). Still, I'm not sure we can fully blame this on our main server; there might be something network-related at play that we don't have insight into.

BTW, it's @jbergstroem and me having the discussion about this, @mhdawson, not so much @joaocgreis, although you and he are also on the email chain.

@MylesBorins one thing that has come up that might be relevant to you: if we can get access to Cloudflare's Logpush feature (still negotiating), we'd need a place to put the logs that they support. GCP Storage is an option for that, so getting hold of some credits could be handy in this instance.

@mhdawson
Member

@rvagg sorry right @jbergstroem, got the wrong name.

@mhdawson
Member

mhdawson commented Oct 23, 2019

Possibly unrelated, but I did see a complaint of a completely different site reporting slow traffic around the same time.

@watson
Member Author

watson commented Oct 24, 2019

I can ask my company if they'd be interested in sponsoring us with a hosted Elasticsearch and Kibana setup that can pull all our logs and server metrics and make them easily accessible. This will also give us the ability to send out alerts etc. Would we be interested in that?

@rvagg
Member

rvagg commented Oct 24, 2019

Thanks @watson. It's not the metrics-gathering process that's difficult; it's gathering the logs in the first place, and doing it reliably enough that we can be confident it will keep working even when we're not checking it regularly. We don't have dedicated infra staff to do that level of monitoring.
Our Cloudflare sponsorship got a minor bump to include Logpull, normally an enterprise feature, but so far we haven't come up with a solution that we're confident enough to set and forget, the way we currently can with the nginx logs. Logpush, a newer feature, would be nice because they take care of storing the logs in a place we nominate. Part of this is also the people-time to invest in doing it: access to server logs is not something we can hand off to just anyone, so there's only a small number of individuals with enough access and trust to implement all of this.

@bvitale

bvitale commented Oct 24, 2019

Hi folks, this seems to be cropping up again this morning.

@mhdawson
Member

Current load average:

cat /proc/loadavg
2.12 2.15 1.96 1/240 15965

@mhdawson
Member

Load average seems fairly steady:

cat /proc/loadavg
1.76 1.90 1.91 3/252 26955

@renaud-dev

Hello,
we're having the issue again on our Azure pipelines (West Europe):
[screenshot]

Thank you and good luck to the team!

@mhdawson
Member

captured ps -ef to /root/processes-oct-24-11am so that we can compare later on.

@chrizzo84

Hi all,
the problem still exists... Tested on an Azure DevOps agent:
gzip: stdin: unexpected end of file
/bin/tar: Unexpected EOF in archive
/bin/tar: Unexpected EOF in archive
/bin/tar: Error is not recoverable: exiting now
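(Those tar errors are the classic signature of a truncated download. Below is a defensive sketch for CI jobs that verifies the archive against the published checksums before extracting; the version and filename are placeholders.)

set -e
VERSION=v12.13.0                                   # example version
FILE=node-$VERSION-linux-x64.tar.xz
curl -fsSL --retry 5 --retry-delay 5 -O "https://nodejs.org/dist/$VERSION/$FILE"
curl -fsSL -O "https://nodejs.org/dist/$VERSION/SHASUMS256.txt"
grep " $FILE\$" SHASUMS256.txt | sha256sum -c -    # fails the job if the download is truncated or corrupted
tar -xJf "$FILE"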

@jbergstroem
Member

jbergstroem commented Nov 4, 2019

Small update: we are now carefully testing a caching strategy for the most common (artifact) file formats. If things continue to go well (as they seem to), we will start covering more parts of the site and be more generous with TTLs.
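(Whether a given artifact is already being served from the cache shows up in the CF-Cache-Status response header; a quick check:)

# HIT means Cloudflare answered from cache; MISS/EXPIRED means the request went to the origin
curl -sI https://nodejs.org/dist/v13.0.1/node-v13.0.1-linux-x64.tar.xz | grep -iE 'cf-cache-status|cache-control|age:'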

@rvagg
Member

rvagg commented Nov 6, 2019

I don't believe we'll see this particular incarnation of problems with nodejs.org from now on. Future problems are likely to be of a different nature, so I'm closing this issue.

We're now fully fronting our downloads with the CDN and it appears to be working well. It has taken considerable load off our backend servers, of course, and we're unlikely to hit the bandwidth peaks at the network interface that appear to have been the problem previously.

Primary server bandwidth:

[screenshot, 2019-11-06 12:02]

Cloudflare caching:

[screenshot, 2019-11-06 12:01]

@rvagg closed this as completed Nov 6, 2019
@OriginalEXE

Thank you to everyone for their hard work, sending virtual 🤗.

@matsaman

Apparently encountering this type of issue still. For the past two days I get around a 20% failure rate consistently:

$ for i in {01..10}; do echo "$i"' '"$(date -u)"; curl https://nodejs.org/dist/v12.13.1/node-v12.13.1-linux-x64.tar.xz -o node-v12.13.1-linux-x64.tar.xz; done
01 Wed Nov 20 19:46:43 UTC 2019
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 13.4M  100 13.4M    0     0   217k      0  0:01:03  0:01:03 --:--:-- 1620k
02 Wed Nov 20 19:47:46 UTC 2019
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 13.4M  100 13.4M    0     0  4237k      0  0:00:03  0:00:03 --:--:-- 4236k
03 Wed Nov 20 19:47:49 UTC 2019
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 13.4M  100 13.4M    0     0  1651k      0  0:00:08  0:00:08 --:--:-- 1247k
04 Wed Nov 20 19:47:57 UTC 2019
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 62 13.4M   62 8527k    0     0  28569      0  0:08:12  0:05:05  0:03:07 19759
curl: (18) transfer closed with 5335120 bytes remaining to read
05 Wed Nov 20 19:53:02 UTC 2019
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 13.4M  100 13.4M    0     0  3489k      0  0:00:03  0:00:03 --:--:-- 3488k
06 Wed Nov 20 19:53:06 UTC 2019
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 83 13.4M   83 11.2M    0     0  25695      0  0:09:07  0:07:39  0:01:28 19389
curl: (18) transfer closed with 2255352 bytes remaining to read
07 Wed Nov 20 20:00:46 UTC 2019
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 13.4M  100 13.4M    0     0  4398k      0  0:00:03  0:00:03 --:--:-- 4398k
08 Wed Nov 20 20:00:49 UTC 2019
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 13.4M  100 13.4M    0     0  2798k      0  0:00:04  0:00:04 --:--:-- 2798k
09 Wed Nov 20 20:00:54 UTC 2019
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 13.4M  100 13.4M    0     0  2340k      0  0:00:05  0:00:05 --:--:-- 2282k
10 Wed Nov 20 20:01:00 UTC 2019
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 13.4M  100 13.4M    0     0  2149k      0  0:00:06  0:00:06 --:--:-- 2197k

@rvagg
Member

rvagg commented Nov 21, 2019

Where are you located, @matsaman? It could be a Cloudflare problem in your region. See https://www.cloudflarestatus.com/ and check whether one of the edges close to you is having trouble. Without further reports of problems, or something we can reproduce ourselves, our best guess is going to be that it's a problem somewhere between Cloudflare and your computer, since CF should have the file cached.
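(One way to see which Cloudflare edge is serving you is the /cdn-cgi/trace endpoint that Cloudflare exposes on proxied domains; the colo field is the IATA code of the data centre that answered. A small sketch:)

curl -s https://nodejs.org/cdn-cgi/trace | grep -E '^(colo|loc|ip)='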

@matsaman

Items on that page in my region say 'Operational' at this time. Should I encounter the issue again I will check that page after my tests and include my findings here.

@appu24

appu24 commented Jan 14, 2020

Hi,

[screenshot]

I am still facing this problem on Azure DevOps.
Do we have a fix for this by any chance?

Thanks in advance.

@mhdawson
Member

I'm getting fast downloads. Please open a new issue if you believe there is still a problem, rather than commenting on this closed issue.

@rvagg
Member

rvagg commented Jan 14, 2020

@appu24 a tar failure doesn't necessarily indicate a download error. Are you able to get more log output for that, to see exactly what it's failing on? I would assume something in the chain is generating at least some stderr logging.

@tbsvttr

tbsvttr commented Aug 21, 2023

Very slow downloads for me.

@saschanos

Experiencing issues for weeks now, from different IPs and servers. Is there any known reason for this?

@MattIPv4
Member

Is there any known reason for this?

Yes, the origin server is being overloaded with traffic, in part due to every release of Node.js purging the Cloudflare cache for the entire domain.

There are a number of open issues that aim to help rectify this.
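(For context, Cloudflare's cache-purge API can target individual URLs instead of the whole zone, which is the general direction for avoiding full-domain purges on each release; a hypothetical sketch, with the zone ID, API token and purged URLs as placeholders.)

# Purge only the files that actually changed, rather than the entire cache (placeholders throughout)
curl -sX POST "https://api.cloudflare.com/client/v4/zones/$CF_ZONE_ID/purge_cache" \
  -H "Authorization: Bearer $CF_API_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"files":["https://nodejs.org/dist/index.json","https://nodejs.org/dist/index.tab"]}'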

@hellmediadev

nodejs.org file delivery is currently not working again. This might be a working mirror:
https://mirrors.cloud.tencent.com/nodejs-release/

@stevenmettler

Hi there, I can't download Node. I keep getting "Failed - Network Error" on my Mac.
