introduce new archive index format based on SQLite #2068
Conversation
I would love some feedback on this improvement, cc @rust-lang/docs-rs |
The other interesting metric would be size, both before and after compression; I assume it'll be a little larger to store the index. Another option discussed ages ago was |
I'll answer in more detail later; one thought already was that SQLite is much easier to debug, since you have DB and CLI tools that can easily edit / change the database. |
I like this idea a lot. |
Some sizes:
Which in my mind is OK, given the performance advantage + the ease of maintaining / fixing the indexes.
In my mind the debuggability of SQLite is just so much better, since we can just download the DB and find command-line tools everywhere, on top of the ability to just add new tables / columns if we need to. |
What do you believe is missing? I intentionally would only use the new format for new releases, so we can roll back and rebuild. The next step, if that works, would be adding some migration for old releases, so we are safer against crawlers / high load. |
Overall I'm in favor! 👍🏻 I had to rehydrate some mental state on this, so writing it down for others who might find it useful: The archive index was first implemented as part of #1342. My first question was "doesn't the zip central directory solve this?" And in fact that's discussed in that PR:
Also, doing a little more research, one problem with the zip central directory is that it sits at the end of the file and its starting position is not known: you have to seek backwards from the end of the file to find it. And if the problem we're dealing with here is that the index itself is so large that decompressing / parsing the whole index is too expensive, then the zip central directory has the same problem: we would have to parse the whole thing to find the file we want. It would be possible to generate a zip central directory in sorted order and binary search across it, but because directory entries are not of constant size this would get very messy very quickly. Similarly, we could make the CBOR index sorted and do binary search, but it presumably has the same problem: variable-length entries due to the path names. Though we could maybe solve that by storing filename hashes instead of filenames?

Still, I see why it makes sense to switch to SQLite, which has indexing of data by strings as a first-class feature, rather than reinventing index files for ourselves from scratch. 👍🏻 Another longstanding and simple technology that fills the same niche is SSTables, which has a Rust implementation: https://crates.io/crates/sstable.

For thoroughness, another approach we could take is to turn the problem inside out. Right now we generate zip files as our first-class storage, and offer them for direct download (cheap for us) and serve webpages from them (expensive for us, particularly for large crates). But fetching webpages is a frequent thing, and the world expects it to be cheap (thus, crawlers). Downloading zip files is a less frequent thing, and while it's good for it to be cheap, it's not so bad if it's a little expensive. So what if we changed our archive storage format directly to something like SQLite or SSTable that has good performance and a straightforward implementation when fetching single files? Then when serving a request for a zip file we would convert the SQLite or SSTable to zip on the fly, possibly with a caching layer. |
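To make the "indexing by strings as a first-class feature" point concrete, here is a minimal sketch of what a lookup against a SQLite-backed archive index could look like, using the `rusqlite` crate. The table and column names are assumptions for illustration, not the PR's actual schema:

```rust
use rusqlite::{params, Connection, OptionalExtension};

/// Hypothetical index entry: where a file lives inside the zip archive.
struct FileLocation {
    start: i64,
    length: i64,
    compression: i32,
}

// Assumed schema:
//   CREATE TABLE files (path TEXT PRIMARY KEY, start INTEGER, length INTEGER, compression INTEGER);
fn find_in_archive_index(conn: &Connection, path: &str) -> rusqlite::Result<Option<FileLocation>> {
    // `path` is the primary key, so SQLite's B-tree gives us the string-keyed
    // lookup directly, without a hand-rolled sorted index or binary search.
    conn.query_row(
        "SELECT start, length, compression FROM files WHERE path = ?1",
        params![path],
        |row| {
            Ok(FileLocation {
                start: row.get(0)?,
                length: row.get(1)?,
                compression: row.get(2)?,
            })
        },
    )
    .optional()
}
```

The point of the sketch is only that the "find one file by path" operation becomes a single indexed query instead of decompressing and parsing the whole index.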
Also, what's the distribution of the next-biggest crates after stm32ral? Is it reasonable to use /robots.txt to tell people not to bother crawling stm32ral and some other handful of mega-crates? It seems like we could spend a lot of engineering trying to make it fast to serve single pages from mega-crates in various ways. |
I once did this analysis based on a (~2021) inventory of our S3 bucket, when trying to debug bigger-index performance :) It came down to ~200 crates with more than 100k files in their docs, but only ~10 crates with more than 500k files.
This is definitely a valid idea. The main advantage of using the archive as we have it is that we can do S3 range requests to fetch single files out of the archive. I assume that with SQLite for the docs, or SSTables, we would have to fetch the whole doc-package on the first request, especially when we account for the new ECS setup where we will scale up / down all the time. In all the design discussions about archive-storage, downloadable docs were actually only second priority, more of a nice-to-have.
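As a rough illustration of that range-request pattern (not docs.rs's actual code; the URL and helper function here are made up for the example), fetching a single file out of a stored archive once the index has told us its byte range might look like this, using `reqwest`:

```rust
use std::error::Error;

/// Hypothetical helper: fetch only `length` bytes starting at `start`
/// from the remote archive, instead of downloading the whole zip.
fn fetch_file_from_archive(
    archive_url: &str,
    start: u64,
    length: u64,
) -> Result<Vec<u8>, Box<dyn Error>> {
    // HTTP / S3 byte ranges are inclusive on both ends.
    let range = format!("bytes={}-{}", start, start + length - 1);
    let response = reqwest::blocking::Client::new()
        .get(archive_url)
        .header(reqwest::header::RANGE, range)
        .send()?
        .error_for_status()?;
    Ok(response.bytes()?.to_vec())
}
```

The archive only needs to be downloaded in full when someone asks for the whole zip, which is the less frequent case.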
thank you for taking the time for the thorough comment & research! |
SSTables have an index, which should make it possible to fetch just the index and use range requests to get a specific file. But looking again at the sstable crate, it was last released over a year ago, so maybe not a great choice. It's a bit of a bummer that the current approach requires reading the whole SQLite index before making other fetches, since 291MB is still a lot of data to process. Speaking of which, your benchmark shows that fetching an entry from the 291MB index (including decompression?) takes about 60 _micro_seconds? That seems impossible. |
I didn't know about the SSTable index, TIL :)
About decompression: same as with the current CBOR index, we only compress it on S3. When we don't find the index locally for a requested crate, we download it and store it decompressed in a local directory. So for a rustdoc request we only have to:
Which is what the two benchmarks cover, once with the SQLite file already opened, and once with an existing open connection. Since I haven't worked much with criterion until now, I did a sanity check: in a short command-line program, 1 million file-fetches on the same 290MB DB take 8.2 seconds. I assume that SSTables, or other formats, would be even faster, but IMO this is not the focus. It should be fast enough, it should be scalable for bigger crates (that's definitely a learning), and IMO easy to debug. As a side-note: |
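The sanity check described above could look roughly like this; this is a sketch under assumptions (the `index.sqlite` filename, table layout, and looked-up path are placeholders), not the actual benchmark code:

```rust
use rusqlite::{params, Connection};
use std::time::Instant;

fn main() -> rusqlite::Result<()> {
    // Reuse one open connection, matching the "existing open connection" case.
    let conn = Connection::open("index.sqlite")?;
    let mut stmt = conn.prepare("SELECT start, length FROM files WHERE path = ?1")?;

    let started = Instant::now();
    for _ in 0..1_000_000 {
        // The same lookup repeated; real traffic would hit many different paths.
        let _row: (i64, i64) = stmt.query_row(params!["stm32ral/index.html"], |row| {
            Ok((row.get(0)?, row.get(1)?))
        })?;
    }
    println!("1M lookups took {:?}", started.elapsed());
    Ok(())
}
```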
Ah, thanks for explaining. So the 60 microseconds is to open a file whose contents are already uncompressed and in the buffer cache, and do a handful of random reads? That seems more reasonable. And interesting about the archive indexes being cached on local disk. It sounds like we never clean them up? Do we just have a massive disk, or are the archive indexes that small? One other thing: what about storing hashes of filenames instead of full filenames? Might make the index size smaller. |
Most are relatively small; right now the whole folder is ~31 GB, for all releases since we introduced archive storage.
It makes the local index a little smaller, but the compressed ( = remotely stored) index bigger.
|
Ah, interesting. I guess compressibility makes those repetitive filenames more efficient to represent than the unrepetitive md5sums. The unique part of each filename is generally less than 16 bytes. I agree with your conclusion, then. Just to confirm, this is storing the md5 as a blob of bytes, not a hex string, right? |
Fun fact: no, it was a string, seemingly evening tiredness ;) Thanks!
|
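To make the blob-versus-hex-string distinction above concrete, here is a minimal sketch (using the `md5` and `rusqlite` crates; the table name and layout are assumptions, not the PR's code) of binding the digest as a 16-byte BLOB rather than a 32-character hex string:

```rust
use rusqlite::{params, Connection};

fn main() -> rusqlite::Result<()> {
    let conn = Connection::open_in_memory()?;
    conn.execute_batch(
        "CREATE TABLE files_by_hash (path_hash BLOB PRIMARY KEY, start INTEGER, length INTEGER);",
    )?;

    let digest = md5::compute("settings.html"); // 16 raw bytes

    // Storing the raw digest bytes: 16 bytes per key ...
    conn.execute(
        "INSERT INTO files_by_hash (path_hash, start, length) VALUES (?1, ?2, ?3)",
        params![&digest.0[..], 0i64, 1024i64],
    )?;

    // ... versus the hex string `format!("{:x}", digest)`, which doubles the
    // stored key size but is easier to eyeball in a CLI.
    Ok(())
}
```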
Read speed is quite similar with or without MD5. This is an example with a more regular-sized crate (17k files):
So it feels like using MD5 stored as bytes would be a micro-optimization here, at the cost of the ease of working with the data directly. |
Yep, I concur. |
Coming from the feedback I'll
and then ping someone for the final review :) |
(branch force-pushed: 240be38 → 0c44456, a365acf → 67741c6, b2e76e3 → ce868cc)
thank you for the review & the input @Nemo157 ! |
Tested a build on my server, along with loading docs from old archives, and grabbed the index from minio to inspect with sqlite manually; it all seems to Just Work 👍.
thank you for the review! ❤️ |
Coming from the performance issues with the archive index, I took some time to revisit the whole topic.
While I started to dig into optimizing the CBOR reader, I realized that CBOR will never be scalable for bigger crates, so I started looking for alternatives that would include some indexing.
I did some high-level tests with LMDB, sled and RocksDB, but settled on a boring, old technology: SQLite.
(Yes, I'm definitely biased towards older but perhaps more reliable things.)
In my mind, this would give us a reliable, fast, single-file format, which would also be easily extensible if we need to add more information. File-size is slightly higher, but IMO not an issue.
This PR would only use the new format for new builds, while I would plan for a slow (lazy or active) migration of the other archive-stored releases.
micro-benchmarks:
using the archive index for `stm32ral` (~1.2 million files).
That speed doesn't fully scale with parallelism, but IMO it should be good enough, especially compared to the current speed.
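For reference, a micro-benchmark in this spirit could be written with criterion roughly like this; this is only a sketch, and the index filename, table layout, and looked-up path are placeholders rather than what the PR actually benchmarks:

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use rusqlite::{params, Connection};

// Sketch of the "existing open connection" case: open the index once,
// then measure repeated single-file lookups against it.
fn bench_archive_index(c: &mut Criterion) {
    let conn = Connection::open("stm32ral-index.sqlite").unwrap();
    c.bench_function("find file in sqlite archive index", |b| {
        b.iter(|| {
            conn.query_row(
                "SELECT start, length FROM files WHERE path = ?1",
                params!["stm32ral/index.html"],
                |row| Ok((row.get::<_, i64>(0)?, row.get::<_, i64>(1)?)),
            )
            .unwrap()
        })
    });
}

criterion_group!(benches, bench_archive_index);
criterion_main!(benches);
```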
deployment / next steps
This PR intentionally only uses the new format for new builds, so if anything goes wrong we'll be able to revert & rebuild.
A next step could be to write an index repack command that converts the old existing indexes, so we can remove the old CBOR code & benefit from the performance improvements.
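A very rough sketch of what such a repack step could do, assuming the old index is one big CBOR map of path to location (the entry shape, crate choices, and schema here are all assumptions, not the actual docs.rs code):

```rust
use rusqlite::{params, Connection};
use serde::Deserialize;
use std::{collections::HashMap, error::Error, fs::File};

/// Hypothetical shape of an entry in the old CBOR index.
#[derive(Deserialize)]
struct OldEntry {
    start: u64,
    length: u64,
    compression: i32,
}

fn repack_index(cbor_path: &str, sqlite_path: &str) -> Result<(), Box<dyn Error>> {
    // Old format: one big CBOR map of path -> location, read in one go.
    let old: HashMap<String, OldEntry> = serde_cbor::from_reader(File::open(cbor_path)?)?;

    // New format: the same data, but indexed by SQLite.
    let mut conn = Connection::open(sqlite_path)?;
    conn.execute_batch(
        "CREATE TABLE files (path TEXT PRIMARY KEY, start INTEGER, length INTEGER, compression INTEGER);",
    )?;
    let tx = conn.transaction()?;
    for (path, entry) in old {
        tx.execute(
            "INSERT INTO files (path, start, length, compression) VALUES (?1, ?2, ?3, ?4)",
            params![path, entry.start as i64, entry.length as i64, entry.compression],
        )?;
    }
    tx.commit()?;
    Ok(())
}
```

Running something like this lazily (on first request for an old release) or as a batch job would be the point where the old CBOR reader could finally be removed.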