INCIDENT: Primary nodejs.org web server taken offline by hosting provider #1659
Comments
Server is back online. We don't have full information (yet), but it appears to be a false positive: we've run afoul of some heuristic in their safety system.

Unfortunately, in the process of restoring, the SSH host keys got reset (I don't have a solid explanation for this yet, but my guess is that it's related to their snapshot logic), and I don't have a backup of the old ones. So anyone (and anything) using SSH access will get host key mismatch errors. @nodejs/releasers are going to have to fix this up before they can run releases again. If you want to get this sorted out now, run

I need to manually do the same process for all release machines, unfortunately, so we'll need to keep an eye on nightly builds to make sure we are getting everything we're supposed to. ci-release should go red on upload failures, so it should be obvious if it's not fixed.
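The exact command is not reproduced above. As a rough illustration of what clearing a stale host key involves, here is a minimal Python sketch that drops a host's entries from the text of a `known_hosts` file, so that the next connection prompts for the new key. The function name `drop_host_entries` and the example hostnames are hypothetical, and the sketch assumes plain (unhashed) entries only. In practice `ssh-keygen -R <hostname>` is the usual tool for this and also handles hashed entries.

```python
def drop_host_entries(known_hosts_text: str, host: str) -> str:
    """Return known_hosts text without entries that list `host`.

    Assumption: entries are unhashed, i.e. the host field is a
    comma-separated list of names/addresses. Hashed entries ("|1|...")
    are left untouched because they cannot be matched by plain text.
    """
    kept = []
    for line in known_hosts_text.splitlines():
        stripped = line.strip()
        # Keep blank lines and comments as they are.
        if not stripped or stripped.startswith("#"):
            kept.append(line)
            continue
        fields = stripped.split()
        # An optional marker such as @cert-authority precedes the host field.
        if fields[0].startswith("@") and len(fields) > 1:
            host_field = fields[1]
        else:
            host_field = fields[0]
        if host in host_field.split(","):
            continue  # stale entry for this host: drop it
        kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")


if __name__ == "__main__":
    # Hypothetical file contents, not real keys.
    sample = (
        "example.org,192.0.2.1 ssh-ed25519 AAAA1\n"
        "other.example ssh-rsa AAAA2\n"
    )
    print(drop_host_entries(sample, "example.org"), end="")
```

After removing the old entry, verify the new host key fingerprint through a trusted channel before accepting it, since a mismatch warning is also what a real interception would look like.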
We're over the hump on this one and our new server seems to be operating just fine. The experience has demonstrated that our current redundancy strategy is acceptable, but not perfect. We've identified a few problems with the setup, but for the vast majority of people there was no noticeable impact.

We don't know exactly what heuristic we hit. I still suspect it was just the volume of traffic being all outgoing and focused on a small number of hosts (Cloudflare edge locations). This isn't a typical pattern; it happens because we still rely on our nginx logs for metric collection, so we serve all binaries from our own servers even with Cloudflare in front. The effort to replace that with Cloudflare logging has stalled, since it's not a straightforward process, but this incident is a good prompt to continue that work. Once that's done, the traffic out of DigitalOcean (and Joyent, where our backup lives) will be considerably smaller and we could even consider downsizing the servers we use.

There are also other architectural concerns that this incident highlights. We have been having ongoing discussions about our architecture and about re-engineering it to distribute resources, tools and access better, rather than doing so much on a single server that ends up having "crown jewels" status, with only a few people able to properly administer it.

Overall though, not a terrible experience.
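To make the suspected traffic pattern concrete, here is a minimal Python sketch that tallies bytes served per client address from nginx access log lines and reports how much of the total goes to the top few addresses. It assumes nginx's default `combined` log format. The function names and sample lines are hypothetical, and this is neither the project's actual metrics pipeline nor the hosting provider's heuristic.

```python
import re
from collections import Counter

# Assumed layout (nginx "combined" format):
# $remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent ...
LINE_RE = re.compile(
    r'^(?P<addr>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) (?P<bytes>\d+)'
)


def bytes_by_client(log_lines):
    """Sum body bytes sent per remote address, skipping unparseable lines."""
    totals = Counter()
    for line in log_lines:
        match = LINE_RE.match(line)
        if match:
            totals[match.group("addr")] += int(match.group("bytes"))
    return totals


def top_share(totals, n):
    """Fraction of all bytes that went to the n busiest addresses."""
    total = sum(totals.values())
    if total == 0:
        return 0.0
    return sum(count for _, count in totals.most_common(n)) / total


if __name__ == "__main__":
    # Hypothetical log lines using documentation-reserved addresses.
    lines = [
        '198.51.100.7 - - [01/Jan/2019:00:00:00 +0000] "GET /dist/a.tar.gz HTTP/1.1" 200 900 "-" "curl"',
        '198.51.100.7 - - [01/Jan/2019:00:00:01 +0000] "GET /dist/b.tar.gz HTTP/1.1" 200 600 "-" "curl"',
        '203.0.113.9 - - [01/Jan/2019:00:00:02 +0000] "GET /dist/c.tar.gz HTTP/1.1" 200 500 "-" "curl"',
    ]
    totals = bytes_by_client(lines)
    print(f"share of bytes to the busiest address: {top_share(totals, 1):.2f}")
```

With a CDN in front, the remote addresses in the origin's log are mostly the CDN's edge nodes, so a high share for a small `n` is expected here rather than a sign of abuse.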
Finally got a response, confirms the theory:
DigitalOcean have taken our primary nodejs.org server offline due to suspicious traffic patterns. From the information they have provided, it looks like a simple false positive and we're trying to get it resolved ASAP.
In the meantime, our backup server is handling the load and will hopefully provide appropriate continuity. Please report anything that looks out of shape here and we'll try to get it addressed.
Until resolved: