
introduce new archive index format based on SQLite #2068

Merged
syphar merged 11 commits into rust-lang:master from syphar:faster-archive-index on May 14, 2023

Conversation

@syphar (Member) commented Mar 3, 2023

Coming from the performance issues with the archive index, I took some time to revisit the whole topic.

I started by digging into optimizing the CBOR reader, but realized that CBOR will never scale for bigger crates, so I started looking for alternatives that include some form of indexing.

I did some high-level tests with LMDB, sled and rocksdb, but settled on a boring, old technology: SQLite.

(Yes, I'm definitely biased towards older but perhaps more reliable things.)

In my mind, this would give us a reliable, fast, single-file format that would also be easy to extend if we need to add more information. The file size is slightly higher, but IMO not an issue.
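
For illustration, the kind of schema I have in mind looks roughly like this (a minimal sketch; the column names and types are assumptions for this sketch, not necessarily what the PR ends up with):

use rusqlite::Connection;

// sketch only: create the single-file index with one table for the file entries
fn create_index(path: &std::path::Path) -> rusqlite::Result<Connection> {
    let conn = Connection::open(path)?;
    conn.execute_batch(
        "CREATE TABLE files (
             path         TEXT PRIMARY KEY,  -- also gives us a btree index for lookups by path
             start_offset INTEGER NOT NULL,  -- start of the compressed entry inside the .zip
             end_offset   INTEGER NOT NULL,  -- end of the compressed entry
             compression  INTEGER NOT NULL   -- compression algorithm id
         );",
    )?;
    Ok(conn)
}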

This PR would only use the new format for new builds, while I would plan for a slow (lazy or active) migration of the other archive-stored releases.

micro-benchmarks:

using the archive index for stm32ral (~1.2 million files)


with open               time:   [59.434 µs 59.604 µs 59.810 µs]
                        change: [-1.5928% -0.5698% +0.3278%] (p = 0.27 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe

without open            time:   [7.0644 µs 7.1273 µs 7.2172 µs]
                        change: [+2.2772% +3.9894% +5.9668%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
  7 (7.00%) high mild
  9 (9.00%) high severe
  • "with open" -> including creating a new SQLite connection opening the file
  • "without open" -> reusing an existing connection.
  • for comparison: reading an old cbor-index of that size takes ~20 seconds

That speed doesn't fully scale with parallelism, but IMO should be good enough, especially compared to the current speed.
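
For context, the two variants can be expressed with criterion roughly like this (a sketch only, assuming the files table from the schema sketch above, not the exact benchmark code in this PR):

use criterion::{criterion_group, criterion_main, Criterion};
use rusqlite::{params, Connection};

// look up one file's byte range by path, using the btree index on `path`
fn find_file(conn: &Connection, path: &str) -> rusqlite::Result<(i64, i64)> {
    conn.query_row(
        "SELECT start_offset, end_offset FROM files WHERE path = ?1",
        params![path],
        |row| Ok((row.get(0)?, row.get(1)?)),
    )
}

fn archive_index_benches(c: &mut Criterion) {
    // hypothetical paths, just for the sketch
    let db = "output.db";
    let file = "stm32ral/index.html";

    // "with open": create a new SQLite connection (i.e. open the file) for every lookup
    c.bench_function("with open", |b| {
        b.iter(|| {
            let conn = Connection::open(db).unwrap();
            find_file(&conn, file).unwrap()
        })
    });

    // "without open": reuse one existing connection
    let conn = Connection::open(db).unwrap();
    c.bench_function("without open", |b| b.iter(|| find_file(&conn, file).unwrap()));
}

criterion_group!(benches, archive_index_benches);
criterion_main!(benches);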

deployment / next steps

This PR intentionally only uses the new format for new builds, so if anything goes wrong we'll be able to revert & rebuild.

A next step could be to write an index repack command that converts the old existing indexes, so we can remove the old CBOR code & benefit from the performance improvements.

@github-actions github-actions bot added the S-waiting-on-review Status: This pull request has been implemented and needs to be reviewed label Mar 3, 2023
@syphar (Member, Author) commented Mar 3, 2023

I would love some feedback on this improvement.

cc @rust-lang/docs-rs

@Nemo157 (Member) commented Mar 3, 2023

The other interesting metric would be size, both before and after compression, I assume it'll be a little larger to store the index.

Another option discussed ages ago was fst, which is essentially just the index part without any data storage (it provides an API sort of like Map<Path, u64>, but you can use it via mmap without deserialization). We could combine that with something like a series of individually CBOR-encoded FileInfos and have fst look up the offset where the FileInfo starts.
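
Roughly like this (a sketch only: the fst calls are the crate's real API, but FileInfo's fields and the surrounding file layout are assumptions):

use fst::Map;
use serde::Deserialize;

// assumed shape of the per-file metadata, stored as a series of CBOR records
#[derive(Debug, Deserialize)]
struct FileInfo {
    start: u64,
    end: u64,
    compression: u8,
}

fn lookup(index_bytes: &[u8], cbor_records: &[u8], path: &str) -> Option<FileInfo> {
    // Map::new only validates the bytes, so index_bytes could come straight from an mmap
    let map = Map::new(index_bytes).ok()?;
    // the u64 stored per path would be the byte offset where that file's CBOR record starts
    let offset = map.get(path)? as usize;
    // decode exactly one CBOR record starting at that offset (ciborium used here as an example)
    ciborium::de::from_reader(cbor_records.get(offset..)?).ok()
}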

@syphar (Member, Author) commented Mar 3, 2023

I'll answer in more detail later; one thought already: SQLite is much easier to debug, since there are DB and CLI tools everywhere that can easily inspect / change the database.

@jyn514 (Member) commented Mar 3, 2023

I like this idea a lot.

@syphar syphar marked this pull request as ready for review March 7, 2023 16:22
@syphar (Member, Author) commented Mar 7, 2023

@Nemo157

The other interesting metric would be size, both before and after compression, I assume it'll be a little larger to store the index.

Some sizes:

# old format
.rw-r--r--@ 134M syphar 24 Oct  2021  archive_index_1_million.zip.index
.rw-r--r--@  29M syphar 24 Oct  2021  archive_index_1_million.zip.index.zst
# new format 
.rw-r--r--  291M syphar  3 Mar 07:40  output.db
.rw-r--r--   62M syphar  7 Mar 17:23  output.db.zstd

Which in my mind is OK, given the performance advantage + the ease of maintaining / fixing the indexes.

Another option discussed ages ago was fst, which is essentially just the index part without any data storage (it provides an API sort of like Map<Path, u64>, but you can use it via mmap without deserialization). We could combine that with something like a series of individually CBOR-encoded FileInfos and have fst look up the offset where the FileInfo starts.

In my mind the "debuggability" of SQLite is just so much easier, since we can just download the DB and find command-line tools everywhere. On top of that, we can just add new tables / columns if we need to.

@syphar (Member, Author) commented Mar 7, 2023

What do you believe is missing? I intentionally would only use the new format for new releases, so we can roll back and rebuild.

The next step, if that works, would be adding some migration for old releases, so we are safer against crawlers / high load.

@jsha (Contributor) commented Mar 7, 2023

Overall I'm in favor! 👍🏻

I had to rehydrate some mental state on this, so writing it down for others who might find it useful:

The archive index was first implemented as part of #1342. My first question was "doesn't the zip central directory solve this?" And in fact that's discussed in that PR:

in the beginning I was actually thinking about using the zip central directory as the index, but that led to some complexity (while keeping the file hashes). Here in this project the custom index, integrated with the existing compression logic, made more sense.

The ZIP directory is just the end of the file. It contains all the filenames, and also the positions to fetch. I had a working prototype that had a kind of virtual buffer which prefetched the end of the archive and then the range for the file. After that worked, it seemed more complex than just using our own index, and decompressing the stream ourselves.

Also, doing a little more research, one problem with the zip central directory is that it occurs at the end of the file and its beginning position is not known - you have to seek backwards from the end of the file to find it.
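
For concreteness, a sketch of that backwards search (assuming the tail of the archive is already in a buffer):

// scan backwards for the end-of-central-directory signature "PK\x05\x06" (0x06054b50),
// since a variable-length archive comment can sit between it and the end of the file
fn find_eocd(tail: &[u8]) -> Option<usize> {
    const SIG: [u8; 4] = [0x50, 0x4b, 0x05, 0x06];
    (0..tail.len().saturating_sub(3)).rev().find(|&i| tail[i..i + 4] == SIG)
}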

Also, if the problem we're dealing with here is that the index itself is so large that decompressing / parsing the whole index is too expensive, then the zip central directory has the same problem - we would have to parse the whole thing to find the file we want. It would be possible to generate a zip central directory in sorted order and binary search across it, but because directory entries are not of constant size this would get very messy very quickly.

Similarly, we could make the CBOR index sorted and do binary search, but it presumably has the same problem: variable-length entries due to the path name. Though we could maybe solve that by storing filename hashes instead of filenames? Still, I see why it makes sense to switch to SQLite, which has indexing of data by strings as a first class feature, rather than reinventing index files for ourselves from scratch. 👍🏻 Another longstanding and simple technology that fills the same niche is SSTables, which has a Rust implementation: https://crates.io/crates/sstable.

For thoroughness, another approach we could take is to turn the problem inside-out. Right now we generate zip files as our first-class storage, and offer them for direct download (cheap for us) and serve webpages from them (expensive for us, particularly for large crates).

But fetching webpages is a frequent thing, and the world expects it to be cheap (thus, crawlers). Downloading zip files is a less frequent thing, and while it's good for it to be cheap, it's not so bad if it's a little expensive.

So what if we changed our archive storage format directly to something like SQLite or SSTable that has good performance and straightforward implementation when fetching single files? Then when serving a request for a zip file we would convert the SQLite or SSTable to zip on the fly, possibly with a caching layer.

@jsha (Contributor) commented Mar 7, 2023

Also, what's the distribution of the next-biggest crates after stm32ral? Is it reasonable to use /robots.txt to tell people not to bother crawling stm32ral and some other handful of mega-crates? It seems like we could spend a lot of engineering trying to make it fast to serve single pages from mega-crates in various ways.

@syphar (Member, Author) commented Mar 8, 2023

Also, what's the distribution of the next-biggest crates after stm32ral? Is it reasonable to use /robots.txt to tell people not to bother crawling stm32ral and some other handful of mega-crates? It seems like we could spend a lot of engineering trying to make it fast to serve single pages from mega-crates in various ways.

I once did this analysis based on a (~2021) inventory of our S3 bucket, when trying to debug bigger index performance :) It came down to ~200 crates with more than 100k files in their docs, and only ~10 crates with more than 500k files.

For thoroughness, another approach we could take is to turn the problem inside-out. Right now we generate zip files as our first-class storage, and offer them for direct download (cheap for us) and serve webpages from them (expensive for us, particularly for large crates).

But fetching webpages is a frequent thing, and the world expects it to be cheap (thus, crawlers). Downloading zip files is a less frequent thing, and while it's good for it to be cheap, it's not so bad if it's a little expensive.

So what if we changed our archive storage format directly to something like SQLite or SSTable that has good performance and straightforward implementation when fetching single files? Then when serving a request for a zip file we would convert the SQLite or SSTable to zip on the fly, possibly with a caching layer.

This is definitely a valid idea.

The main advantage of using the archive as we have it is that we can do S3 range requests to fetch single files out of the archive. I assume with SQLite for the docs, or SSTables, we would have to fetch the whole doc-package on the first request, especially when we account for the new ECS setup where we will scale up / down all the time.
The bigger doc-archives can reach multiple gigabytes.
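
For illustration, that access pattern is roughly the following (a sketch using the aws-sdk-s3 crate; the names are made up and this is not the exact docs.rs storage code):

use aws_sdk_s3::Client;

// fetch only the compressed bytes of a single file out of a release's .zip,
// using the byte range we got from the archive index
async fn fetch_range(
    client: &Client,
    bucket: &str,
    key: &str,
    start: u64,
    end: u64,
) -> Result<Vec<u8>, Box<dyn std::error::Error>> {
    let resp = client
        .get_object()
        .bucket(bucket)
        .key(key)
        .range(format!("bytes={start}-{end}")) // HTTP Range header
        .send()
        .await?;
    Ok(resp.body.collect().await?.into_bytes().to_vec())
}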

In all the design discussions about archive storage, downloadable docs were actually only a second priority, more of a nice-to-have.

Overall I'm in favor! 👍🏻

thank you for taking the time for the thorough comment & research!

@jsha (Contributor) commented Mar 8, 2023

I assume with SQLite for the docs, or SSTables, we would have to fetch the whole doc-package on the first request,

SSTables have an index, which should make it possible to fetch just the index and use range requests to get a specific file. But looking again at the sstable crate, it was last released over a year ago, so maybe not a great choice.

It's a bit of a bummer that the current approach requires reading the whole SQLite index before making other fetches, since 291MB is still a lot of data to process. Speaking of which, your benchmark shows that fetching an entry from the 291MB index (including decompression?) takes about 60 microseconds? That seems impossible.

@syphar (Member, Author) commented Mar 8, 2023

I assume with SQLite for the docs, or SSTables, we would have to fetch the whole doc-package on the first request,

SSTables have an index, which should make it possible to fetch just the index and use range requests to get a specific file. But looking again at the sstable crate, it was last released over a year ago, so maybe not a great choice.

I didn't know about the SSTable index, TIL :)
But I agree with your assessment, probably not the best choice.

It's a bit of a bummer that the current approach requires reading the whole SQLite index before making other fetches, since 291MB is still a lot of data to process. Speaking of which, your benchmark shows that fetching an entry from the 291MB index (including decompression?) takes about 60 microseconds? That seems impossible.

About decompression: same as the current CBOR index, we only compress it on S3. When we don't find the index locally for a requested crate, we download it, and store it decompressed in a local directory.

So for a rustdoc request we only have to:

  • open the SQLite file
  • use the BTree index on the files table to find the file.

This is what the two benchmarks cover: once including opening the SQLite file, and once with an already open connection.

Since I haven't worked much with criterion until now, I did a sanity check: in a short command-line program, 1 million file fetches on the same 290MB DB take 8.2 seconds.

I assume that SSTables or other formats would be even faster, but IMO this is not the focus.

It should be fast enough, it should be scalable for bigger crates (that's definitely a learning), and IMO easy to debug.

As a side note:
Currently we put the index directly onto the disk after the build, so we never have to download the index from S3.
With the new ECS setup this will change, and I assume every time we reboot the servers / scale etc. an archive index potentially has to be downloaded, once per webserver.

@jsha (Contributor) commented Mar 8, 2023

Ah, thanks for explaining. So the 60 microseconds is to open a file whose contents are already uncompressed and in the buffer cache, and do a handful of random reads? That seems more reasonable.

And interesting about the archive indexes being cached on local disk. It sounds like we never clear them up? Do we just have a massive disk, or are the archive indexes that small?

One other thing: what about storing hashes of filenames instead of full filenames? Might make the index size smaller.

@syphar (Member, Author) commented Mar 8, 2023

And interesting about the archive indexes being cached on local disk. It sounds like we never clear them up? Do we just have a massive disk, or are the archive indexes that small?

Most are relatively small; right now the whole folder has ~31 GB, for all releases since we introduced archive storage.
(Everything before that is still stored as single HTML files, pending a migration or rebuild.)

One other thing: what about storing hashes of filenames instead of full filenames? Might make the index size smaller.

It makes the local index a little smaller, but the compressed ( = remotely stored) index bigger.
Since the size is not that different, IMO debuggability wins, but happy to discuss:

.rw-r--r--  281M syphar         1 day output.db
.rw-r--r--   62M syphar         1 day output.db.zstd
.rw-r--r--  168M syphar    37 seconds output_md5.db
.rw-r--r--   94M syphar     2 seconds output_md5.db.zstd

@jsha (Contributor) commented Mar 8, 2023

Ah, interesting. I guess compressibility makes those repetitive filenames more efficient to represent than the unrepetitive md5sums. The unique part of each filename is generally less than 16 bytes. I agree with your conclusion, then.

Just to confirm, this is storing the md5 as a blob of bytes, not a hex string, right?

@syphar (Member, Author) commented Mar 8, 2023

Ah, interesting. I guess compressibility makes those repetitive filenames more efficient to represent than the unrepetitive md5sums. The unique part of each filename is generally less than 16 bytes. I agree with your conclusion, then.

Just to confirm, this is storing the md5 as a blob of bytes, not a hex string, right?

Fun fact: no, it was a string, seemingly evening tiredness ;) Thanks!

.rw-r--r--  281M syphar         1 day  output.db
.rw-r--r--   62M syphar         1 day  output.db.zstd
.rw-r--r--  168M syphar    43 minutes  output_md5.db
.rw-r--r--   94M syphar    42 minutes  output_md5.db.zstd
.rw-r--r--  107M syphar    41 seconds  output_md5_bytes.db
.rw-r--r--   82M syphar     2 seconds  output_md5_bytes.db.zstd
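
For reference, the difference boils down to this (a minimal self-contained sketch with made-up table / column names, using the md5 and rusqlite crates):

use rusqlite::{params, Connection};

fn main() -> rusqlite::Result<()> {
    let conn = Connection::open_in_memory()?;
    conn.execute_batch(
        "CREATE TABLE files_md5       (hash TEXT PRIMARY KEY, start_offset INTEGER, end_offset INTEGER);
         CREATE TABLE files_md5_bytes (hash BLOB PRIMARY KEY, start_offset INTEGER, end_offset INTEGER);",
    )?;

    let digest = md5::compute("settings.html");

    // hex string: 32 bytes per hash in the database
    conn.execute(
        "INSERT INTO files_md5 (hash, start_offset, end_offset) VALUES (?1, ?2, ?3)",
        params![format!("{digest:x}"), 0, 100],
    )?;

    // raw digest: 16 bytes per hash, stored as a SQLite BLOB
    conn.execute(
        "INSERT INTO files_md5_bytes (hash, start_offset, end_offset) VALUES (?1, ?2, ?3)",
        params![digest.0.to_vec(), 0, 100],
    )?;

    Ok(())
}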

@syphar (Member, Author) commented Mar 8, 2023

Read speed is quite similar with or without MD5.

This is an example with a more regular-sized crate (17k files):

.rw-r--r--  3.8M syphar    15 seconds  small.db
.rw-r--r--  726k syphar      1 second  small.db.zst
.rw-r--r--  1.5M syphar      1 minute  small_md5_bytes.db
.rw-r--r--  830k syphar    45 seconds  small_md5_bytes.db.zst

Which feels like using md5 with bytes would be a micro-optimization here, costing the ease of working with the data directly.

@jsha (Contributor) commented Mar 9, 2023

Which feels like using md5 with bytes would be a micro-optimization here, costing the ease of working with the data directly.

Yep, I concur.

@syphar syphar added S-waiting-on-author Status: This PR is incomplete or needs to address review comments and removed S-waiting-on-review Status: This pull request has been implemented and needs to be reviewed labels Mar 11, 2023
@syphar syphar changed the title from "introduce new archive index format based on SQLite (RFC)" to "introduce new archive index format based on SQLite" Mar 11, 2023
@syphar (Member, Author) commented Mar 11, 2023

Coming from the feedback, I'll

  • fix the conflicts
  • write tests for the SQLite connection pool
  • write more tests

and then ping someone for the final review :)

@syphar syphar added S-waiting-on-review Status: This pull request has been implemented and needs to be reviewed S-waiting-on-author Status: This PR is incomplete or needs to address review comments and removed S-waiting-on-author Status: This PR is incomplete or needs to address review comments S-waiting-on-review Status: This pull request has been implemented and needs to be reviewed labels Apr 16, 2023
@syphar syphar added S-waiting-on-review Status: This pull request has been implemented and needs to be reviewed and removed S-waiting-on-author Status: This PR is incomplete or needs to address review comments labels May 3, 2023
@syphar (Member, Author) commented May 3, 2023

I think we can proceed here. Does anyone have time for a review? @jyn514 @Nemo157 @jsha?

I'm happy to add anything missing.

(review threads on src/storage/archive_index.rs, resolved)
@syphar (Member, Author) commented May 12, 2023

Thank you for the review & the input, @Nemo157!

@syphar syphar requested a review from Nemo157 May 12, 2023 18:12
@Nemo157 (Member) left a comment

Tested a build on my server, along with loading docs from old archives, and grabbed the index from minio to inspect with sqlite manually; it all seems to Just Work 👍.

@syphar (Member, Author) commented May 14, 2023

thank you for the review! ❤️

@syphar syphar merged commit a502896 into rust-lang:master May 14, 2023
@syphar syphar deleted the faster-archive-index branch May 14, 2023 08:17
@github-actions github-actions bot added S-waiting-on-deploy This PR is ready to be merged, but is waiting for an admin to have time to deploy it and removed S-waiting-on-review Status: This pull request has been implemented and needs to be reviewed labels May 14, 2023
@syphar syphar removed the S-waiting-on-deploy This PR is ready to be merged, but is waiting for an admin to have time to deploy it label May 14, 2023