Scaling registry updates #2452

Closed
SimonSapin opened this issue Mar 8, 2016 · 6 comments

@SimonSapin
Contributor

TL;DR: This is a problem we don’t have yet. I mostly want to record some information in case we do in the long term.


This comment, CocoaPods/CocoaPods#4989 (comment), explains how the CocoaPods/Specs repository gets so much traffic that GitHub rate-limits it severely, causing fetches to take a very long time or fail:

> We understand that part of the CocoaPods workflow is that its end users (i.e., not just the people contributing to CocoaPods/Specs) fetch regularly from GitHub.

This sounds exactly like rust-lang/crates.io-index.

Rate-limiting from GitHub has not been a problem for us as far as I know, but there may be some precautions we can take to avoid it.

> Apparently, most of the initial clones are shallow, meaning that not the whole history is fetched, but just the top commit. But then subsequent fetches don't use the --depth=1 option. Ironically, this practice can be much more expensive than full fetches/clones, especially over the long term. It is usually preferable to pay the price of a full clone once, then incrementally fetch into the repository, because then Git is better able to negotiate the minimum set of changes that have to be transferred to bring the clone up to date.

I think we’re OK here since Cargo uses libgit2 which does not support shallow clones anyway.

> Finally, the layout of the repo itself doesn't help. Specifically, the Specs directory, which contains 16k+ subdirectories, causes some Git operations to be unexpectedly expensive, further driving up CPU usage.

Here as well we’re doing pretty well, since rust-lang/crates.io-index already has two levels of directory nesting, each named (roughly) after two characters from the start of the crate’s name. Counting only letters, that allows up to 26^4 = 456,976 leaf directories; for comparison, npm has 249,825 packages right now.
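As a rough illustration of that layout, here is a minimal sketch of how a crate name could map to its path in the index. The function name `index_path` is hypothetical, and the handling of names shorter than four characters and the lowercasing are assumptions about the index layout that are not stated in this thread.

```rust
/// Sketch: map a crate name to its (assumed) path inside the index.
/// Returns None for empty or non-ASCII names, which this sketch does not handle.
fn index_path(name: &str) -> Option<String> {
    if name.is_empty() || !name.is_ascii() {
        return None;
    }
    // Assumption: directory and file names are lowercased.
    let lower = name.to_ascii_lowercase();
    Some(match lower.len() {
        // Assumption: short names get their own top-level buckets.
        1 => format!("1/{}", lower),
        2 => format!("2/{}", lower),
        3 => format!("3/{}/{}", &lower[..1], lower),
        // Two levels of nesting, two characters each.
        _ => format!("{}/{}/{}", &lower[..2], &lower[2..4], lower),
    })
}

fn main() {
    println!("{:?}", index_path("rustc-serialize"));
    // Upper bound on leaf directories if names used only the 26 letters.
    println!("{}", 26u32.pow(4));
}
```

With this sketch, `index_path("rustc-serialize")` gives `Some("ru/st/rustc-serialize")`, and `26u32.pow(4)` is 456976.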

Another comment, CocoaPods/CocoaPods#4989 (comment), suggests:

> this new, preview API should help: https://developer.github.com/changes/2016-02-24-commit-reference-sha-api/. It's helped Homebrew dramatically reduce the number of no-op git fetches which also will make things better for your users as a no-op API HTTP call is significantly faster for you (and less expensive for GitHub) than a no-op git fetch.

This sounds beneficial even if we don’t hit rate-limiting. I’ve filed #2451 separately.
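To make the idea concrete, here is a minimal sketch of the decision logic only, with no networking. It assumes the request is a conditional GET against something like `https://api.github.com/repos/rust-lang/crates.io-index/commits/master`, that the locally known SHA is sent as a quoted ETag in `If-None-Match`, and that a 304 response means the remote head has not moved. The media type, the header usage, and all names below are assumptions from memory; the linked announcement is the authority.

```rust
/// What the client should do after asking the API about the remote head.
#[derive(Debug, PartialEq)]
enum Action {
    SkipFetch,
    Fetch,
}

/// Assumed media type for "return only the SHA"; the preview announcement
/// may name a different one.
const SHA_MEDIA_TYPE: &str = "application/vnd.github.v3.sha";

/// Headers for the conditional request (hypothetical helper).
/// Assumption: the local SHA goes into If-None-Match as a quoted ETag.
fn request_headers(local_sha: &str) -> Vec<(String, String)> {
    vec![
        ("Accept".to_string(), SHA_MEDIA_TYPE.to_string()),
        ("If-None-Match".to_string(), format!("\"{}\"", local_sha)),
    ]
}

/// Only an explicit "not modified" lets us skip the git fetch; any other
/// status (including errors) falls back to fetching.
fn action_for_status(status: u16) -> Action {
    match status {
        304 => Action::SkipFetch,
        _ => Action::Fetch,
    }
}

fn main() {
    for (name, value) in request_headers("0123abcd") {
        println!("{}: {}", name, value);
    }
    println!("{:?}", action_for_status(304));
    println!("{:?}", action_for_status(200));
}
```

Falling back to a normal fetch on every status other than 304 keeps the API call a pure optimization: if the API is unavailable or rate-limited, behavior is the same as today.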

@alexcrichton
Member

Thanks for the issue @SimonSapin! I've been thinking about this as well after seeing that post.

I believe your TL;DR is correct in that we're fine here. We've already implemented almost all the mitigation strategies pointed out by the GitHub staff, namely:

  • We don't have everything in one directory; everything is sharded by name. We have two levels of sharding, each with two characters, so unless all crates start with "foo" we're covered.
  • We don't store one file per version, which massively cuts down on the number of files in the registry.
  • All updates to the registry are basically append-only, so downloading and calculating incremental updates should be fast.
  • No shallow fetches are performed (because libgit2 doesn't support them).
  • If the registry ever takes too long to clone, we can roll the entire history into one commit, force push, and start again from scratch. We're not tied to an ever-long history.
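The "one file per crate" and "append-only" points can be sketched together as follows. This is a simplified model, assuming each crate has a single index file with one JSON record per line and that publishing a version appends one line; the helper name `append_version` is hypothetical, and real records carry more fields than the two shown.

```rust
/// Sketch: publishing a version appends one line to the crate's index file.
/// Simplified record with only a name and a version.
fn append_version(file: &mut String, name: &str, vers: &str) {
    file.push_str(&format!("{{\"name\":\"{}\",\"vers\":\"{}\"}}\n", name, vers));
}

fn main() {
    let mut file = String::new();
    append_version(&mut file, "example", "0.1.0");
    append_version(&mut file, "example", "0.1.1");
    let before = file.clone();
    append_version(&mut file, "example", "0.2.0");

    // Existing content is never rewritten, so the old file is a prefix of
    // the new one and the change is a single added line.
    println!("old is a prefix of new: {}", file.starts_with(&before));
    println!("records in file: {}", file.lines().count());
}
```

Because earlier lines never change in this model, an incremental update only has to carry the added lines.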

Using a special API to detect whether the repository has any new commits seems like it'll be useful regardless, though (as it's faster). Probably best discussed in a separate issue (as you've done)!

I'm going to close this for now as there's not really anything for us to do. We're already employing basically all of the mitigation strategies outlined in that thread, and we have other mitigation strategies in place in case operations on the registry become a pain in the future.

@SimonSapin
Contributor Author

> If the registry ever takes too long to clone, we can roll the entire history into one commit, force push, and start again from scratch. We're not tied to an ever-long history.

You may want to make sure that Cargo doesn’t freak out if you do that. The git command line tool rejects non-fast-forward fetches unless you use --force, but I don’t know about libgit2. Many people don’t use Cargo Nightly, so you want "old" Cargo versions to support this by the time you actually need to force push.
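For readers unfamiliar with the term, here is a small sketch of what "non-fast-forward" means, using a toy commit graph rather than git or libgit2. All commit ids and the helper `is_fast_forward` are made up for illustration: an update is a fast-forward only when the old tip is an ancestor of the new tip, which is never the case after the history is squashed into a fresh root commit.

```rust
use std::collections::{HashMap, HashSet};

/// Toy history: maps a commit id to the ids of its parents.
type History<'a> = HashMap<&'a str, Vec<&'a str>>;

/// True if `old` is reachable from `new` by following parent links,
/// i.e. moving a ref from `old` to `new` is a fast-forward.
fn is_fast_forward<'a>(history: &History<'a>, old: &str, new: &'a str) -> bool {
    let mut stack = vec![new];
    let mut seen = HashSet::new();
    while let Some(commit) = stack.pop() {
        if commit == old {
            return true;
        }
        // Visit each commit once, then continue with its parents.
        if seen.insert(commit) {
            if let Some(parents) = history.get(commit) {
                stack.extend(parents.iter().copied());
            }
        }
    }
    false
}

fn main() {
    let mut history: History = HashMap::new();
    history.insert("c1", vec![]);
    history.insert("c2", vec!["c1"]);
    history.insert("c3", vec!["c2"]);
    // A squashed history: one new root commit with no parents.
    history.insert("squashed", vec![]);

    println!("c2 -> c3: {}", is_fast_forward(&history, "c2", "c3"));
    println!("c3 -> squashed: {}", is_fast_forward(&history, "c3", "squashed"));
}
```

In this model the ordinary update from `c2` to `c3` is a fast-forward, while replacing `c3` with `squashed` is not, so a client that only accepts fast-forwards would refuse it.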

Also, it seems like removing the git history would mean losing some data. For example in #2326 (comment) I relied on commit dates. I couldn’t have done that analysis without git history.

Does the PostgreSQL database behind crates.io have more data than what’s in the index? Could that data be made more readily available?

@alexcrichton
Member

Yes, support for rolling the history into one commit has been in Cargo since day 1. And yes, it would break scripts that rely on git history, but that's not really something we can work around. And no, I don't think the crates.io database can be used to rebuild the index.

@SimonSapin
Contributor Author

I’m getting off-topic here, but I mentioned the database not to rebuild the index but to make it available, so that anyone can do all kinds of unforeseen analysis like "which crates/versions were uploaded in this date range and might have been in GNU tarball format" (#2326 (comment)) or "make a distribution graph of crates by download count".

@alexcrichton
Member

Yes, the database should contain enough information to do something like that. To make it more accessible we'd likely want to just enhance the JSON API.

@telotortium

@alexcrichton Hopefully not too many people decide to start their project names with "rust" :). Specifically, as of 8551e70, /ru/st/ contains 118 entries (and /go/og/ contains 113).
