
introduce new archive index format based on SQLite #2068

Merged
syphar merged 11 commits into rust-lang:master from syphar:faster-archive-index on May 14, 2023

Conversation

@syphar (Member) commented Mar 3, 2023

Coming from the performance issues with the archive index, I took some time to revisit the whole topic.

I started by digging into optimizing the CBOR reader, but realized that CBOR will never scale for bigger crates, so I started looking for alternatives that include some form of indexing.

I did some high-level tests with LMDB, sled and rocksdb, but settled on a boring, old technology: SQLite.

(Yes, I'm definitely biased towards older but perhaps more reliable things.)

In my mind, this would give us a reliable, fast, single-file format that would also be easy to extend if we need to add more information. The file size is slightly higher, but IMO not an issue.
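
For illustration, the kind of schema I have in mind looks roughly like this (a minimal sketch; the column names and types are assumptions for this sketch, not necessarily what the PR ends up with):

use rusqlite::Connection;

// sketch only: create the single-file index with one table for the file entries
fn create_index(path: &std::path::Path) -> rusqlite::Result<Connection> {
    let conn = Connection::open(path)?;
    conn.execute_batch(
        "CREATE TABLE files (
             path         TEXT PRIMARY KEY,  -- also gives us a btree index for lookups by path
             start_offset INTEGER NOT NULL,  -- start of the compressed entry inside the .zip
             end_offset   INTEGER NOT NULL,  -- end of the compressed entry
             compression  INTEGER NOT NULL   -- compression algorithm id
         );",
    )?;
    Ok(conn)
}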

This PR would only use the new format for new builds, while I would plan for a slow (lazy or active) migration of the other archive-stored releases.

micro-benchmarks:

using the archive index for stm32ral (~1.2 million files)


with open               time:   [59.434 µs 59.604 µs 59.810 µs]
                        change: [-1.5928% -0.5698% +0.3278%] (p = 0.27 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe

without open            time:   [7.0644 µs 7.1273 µs 7.2172 µs]
                        change: [+2.2772% +3.9894% +5.9668%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
  7 (7.00%) high mild
  9 (9.00%) high severe
  • "with open" -> including creating a new SQLite connection opening the file
  • "without open" -> reusing an existing connection.
  • for comparison: reading an old cbor-index of that size takes ~20 seconds

That speed doesn't fully scale with parallelism, but IMO should be good enough, especially compared to the current speed.
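
For context, the two variants can be expressed with criterion roughly like this (a sketch only, assuming the files table from the schema sketch above, not the exact benchmark code in this PR):

use criterion::{criterion_group, criterion_main, Criterion};
use rusqlite::{params, Connection};

// look up one file's byte range by path, using the btree index on `path`
fn find_file(conn: &Connection, path: &str) -> rusqlite::Result<(i64, i64)> {
    conn.query_row(
        "SELECT start_offset, end_offset FROM files WHERE path = ?1",
        params![path],
        |row| Ok((row.get(0)?, row.get(1)?)),
    )
}

fn archive_index_benches(c: &mut Criterion) {
    // hypothetical paths, just for the sketch
    let db = "output.db";
    let file = "stm32ral/index.html";

    // "with open": create a new SQLite connection (i.e. open the file) for every lookup
    c.bench_function("with open", |b| {
        b.iter(|| {
            let conn = Connection::open(db).unwrap();
            find_file(&conn, file).unwrap()
        })
    });

    // "without open": reuse one existing connection
    let conn = Connection::open(db).unwrap();
    c.bench_function("without open", |b| b.iter(|| find_file(&conn, file).unwrap()));
}

criterion_group!(benches, archive_index_benches);
criterion_main!(benches);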

deployment / next steps

This PR intentionally only uses the new format for new builds, so if anything goes wrong we'll be able to revert & rebuild.

A next step could be to write an index repack command that converts the old existing indexes, so we can remove the old CBOR code & benefit from the performance improvements.

@github-actions github-actions bot added the S-waiting-on-review Status: This pull request has been implemented and needs to be reviewed label Mar 3, 2023
@syphar (Member, Author) commented Mar 3, 2023

I would love some feedback on this improvement.

cc @rust-lang/docs-rs

@Nemo157 (Member) commented Mar 3, 2023

The other interesting metric would be size, both before and after compression, I assume it'll be a little larger to store the index.

Another option discussed ages ago was fst, which is essentially just the index part without any data storage (it provides an API sort of like Map<Path, u64>, but you can use it via mmap without deserialization). We could combine that with something like a series of individually CBOR-encoded FileInfos and have fst look up the offset where the FileInfo starts.
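
Roughly like this (a sketch only: the fst calls are the crate's real API, but FileInfo's fields and the surrounding file layout are assumptions):

use fst::Map;
use serde::Deserialize;

// assumed shape of the per-file metadata, stored as a series of CBOR records
#[derive(Debug, Deserialize)]
struct FileInfo {
    start: u64,
    end: u64,
    compression: u8,
}

fn lookup(index_bytes: &[u8], cbor_records: &[u8], path: &str) -> Option<FileInfo> {
    // Map::new only validates the bytes, so index_bytes could come straight from an mmap
    let map = Map::new(index_bytes).ok()?;
    // the u64 stored per path would be the byte offset where that file's CBOR record starts
    let offset = map.get(path)? as usize;
    // decode exactly one CBOR record starting at that offset (ciborium used here as an example)
    ciborium::de::from_reader(cbor_records.get(offset..)?).ok()
}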

@syphar (Member, Author) commented Mar 3, 2023

I'll answer in more detail later; one thought already: SQLite is much easier to debug, since there are DB and CLI tools everywhere that can easily inspect / change the database.

@jyn514 (Member) commented Mar 3, 2023

I like this idea a lot.

@syphar syphar marked this pull request as ready for review March 7, 2023 16:22
@syphar (Member, Author) commented Mar 7, 2023

@Nemo157

The other interesting metric would be size, both before and after compression, I assume it'll be a little larger to store the index.

Some sizes:

# old format
.rw-r--r--@ 134M syphar 24 Oct  2021  archive_index_1_million.zip.index
.rw-r--r--@  29M syphar 24 Oct  2021  archive_index_1_million.zip.index.zst
# new format 
.rw-r--r--  291M syphar  3 Mar 07:40  output.db
.rw-r--r--   62M syphar  7 Mar 17:23  output.db.zstd

Which in my mind is OK, given the performance advantage + the ease of maintaining / fixing the indexes.

Another option discussed ages ago was fst, which is essentially just the index part without any data storage (it provides an API sort of like Map<Path, u64>, but you can use it via mmap without deserialization). We could combine that with something like a series of individually CBOR-encoded FileInfos and have fst look up the offset where the FileInfo starts.

In my mind the "debuggability" of SQLite is just so much easier, since we can just download the DB and find command-line tools everywhere. On top of that, we can just add new tables / columns if we need to.

@syphar (Member, Author) commented Mar 7, 2023

What do you believe is missing? I intentionally would only use the new format for new releases, so we can roll back and rebuild.

The next step, if that works, would be adding some migration for old releases, so we are safer against crawlers / high load.

@jsha (Contributor) commented Mar 7, 2023

Overall I'm in favor! 👍🏻

I had to rehydrate some mental state on this, so writing it down for others who might find it useful:

The archive index was first implemented as part of #1342. My first question was "doesn't the zip central directory solve this?" And in fact that's discussed in that PR:

in the beginning I was actually thinking about using the zip central directory as the index, but that led to some complexity (while keeping the file hashes). Here in this project the custom index, integrated with the existing compression logic, made more sense.

The ZIP directory is just the end of the file. It contains all the filenames, and also the positions to fetch. I had a working prototype that had a kind of virtual buffer which prefetched the end of the archive and then the range for the file. After that worked, it seemed more complex than just using our own index, and decompressing the stream ourselves.

Also, doing a little more research, one problem with the zip central directory is that it occurs at the end of the file and its beginning position is not known - you have to seek backwards from the end of the file to find it.
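
For concreteness, a sketch of that backwards search (assuming the tail of the archive is already in a buffer):

// scan backwards for the end-of-central-directory signature "PK\x05\x06" (0x06054b50),
// since a variable-length archive comment can sit between it and the end of the file
fn find_eocd(tail: &[u8]) -> Option<usize> {
    const SIG: [u8; 4] = [0x50, 0x4b, 0x05, 0x06];
    (0..tail.len().saturating_sub(3)).rev().find(|&i| tail[i..i + 4] == SIG)
}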

Also, if the problem we're dealing with here is that the index itself is so large that decompressing / parsing the whole index is too expensive, then the zip central directory has the same problem - we would have to parse the whole thing to find the file we want. It would be possible to generate a zip central directory in sorted order and binary search across it, but because directory entries are not of constant size this would get very messy very quickly.

Similarly, we could make the CBOR index sorted and do binary search, but it presumably has the same problem: variable-length entries due to the path name. Though we could maybe solve that by storing filename hashes instead of filenames? Still, I see why it makes sense to switch to SQLite, which has indexing of data by strings as a first class feature, rather than reinventing index files for ourselves from scratch. 👍🏻 Another longstanding and simple technology that fills the same niche is SSTables, which has a Rust implementation: https://crates.io/crates/sstable.

For thoroughness, another approach we could take is to turn the problem inside-out. Right now we generate zip files as our first-class storage, and offer them for direct download (cheap for us) and serve webpages from them (expensive for us, particularly for large crates).

But fetching webpages is a frequent thing, and the world expects it to be cheap (thus, crawlers). Downloading zip files is a less frequent thing, and while it's good for it to be cheap, it's not so bad if it's a little expensive.

So what if we changed our archive storage format directly to something like SQLite or SSTable that has good performance and straightforward implementation when fetching single files? Then when serving a request for a zip file we would convert the SQLite or SSTable to zip on the fly, possibly with a caching layer.

@jsha (Contributor) commented Mar 7, 2023

Also, what's the distribution of the next-biggest crates after stm32ral? Is it reasonable to use /robots.txt to tell people not to bother crawling stm32ral and some other handful of mega-crates? It seems like we could spend a lot of engineering trying to make it fast to serve single pages from mega-crates in various ways.

@syphar (Member, Author) commented Mar 8, 2023

Also, what's the distribution of the next-biggest crates after stm32ral? Is it reasonable to use /robots.txt to tell people not to bother crawling stm32ral and some other handful of mega-crates? It seems like we could spend a lot of engineering trying to make it fast to serve single pages from mega-crates in various ways.

I once did this analysis based on a (~2021) inventory of our S3 bucket, when trying to debug bigger index performance :) It came down to ~200 crates with more than 100k files in their docs, and only ~10 crates with more than 500k files.

For thoroughness, another approach we could take is to turn the problem inside-out. Right now we generate zip files as our first-class storage, and offer them for direct download (cheap for us) and serve webpages from them (expensive for us, particularly for large crates).

But fetching webpages is a frequent thing, and the world expects it to be cheap (thus, crawlers). Downloading zip files is a less frequent thing, and while it's good for it to be cheap, it's not so bad if it's a little expensive.

So what if we changed our archive storage format directly to something like SQLite or SSTable that has good performance and straightforward implementation when fetching single files? Then when serving a request for a zip file we would convert the SQLite or SSTable to zip on the fly, possibly with a caching layer.

This is definitely a valid idea.

The main advantage of using the archive as we have it is that we can do S3 range requests to fetch single files out of the archive. I assume with SQLite for the docs, or SSTables, we would have to fetch the whole doc-package on the first request, especially when we account for the new ECS setup where we will scale up / down all the time.
The bigger doc-archives can reach multiple gigabytes.
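
For illustration, that access pattern is roughly the following (a sketch using the aws-sdk-s3 crate; the names are made up and this is not the exact docs.rs storage code):

use aws_sdk_s3::Client;

// fetch only the compressed bytes of a single file out of a release's .zip,
// using the byte range we got from the archive index
async fn fetch_range(
    client: &Client,
    bucket: &str,
    key: &str,
    start: u64,
    end: u64,
) -> Result<Vec<u8>, Box<dyn std::error::Error>> {
    let resp = client
        .get_object()
        .bucket(bucket)
        .key(key)
        .range(format!("bytes={start}-{end}")) // HTTP Range header
        .send()
        .await?;
    Ok(resp.body.collect().await?.into_bytes().to_vec())
}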

In all the design discussions about archive storage, downloadable docs were actually only a second priority, more of a nice-to-have.

Overall I'm in favor! 👍🏻

thank you for taking the time for the thorough comment & research!

@jsha (Contributor) commented Mar 8, 2023

I assume with SQLite for the docs, or SSTables, we would have to fetch the whole doc-package on the first request,

SSTables have an index, which should make it possible to fetch just the index and use range requests to get a specific file. But looking again at the sstable crate, it was last released over a year ago, so maybe not a great choice.

It's a bit of a bummer that the current approach requires reading the whole SQLite index before making other fetches, since 291MB is still a lot of data to process. Speaking of which, your benchmark shows that fetching an entry from the 291MB index (including decompression?) takes about 60 microseconds? That seems impossible.

@syphar (Member, Author) commented Mar 8, 2023

I assume with SQLite for the docs, or SSTables, we would have to fetch the whole doc-package on the first request,

SSTables have an index, which should make it possible to fetch just the index and use range requests to get a specific file. But looking again at the sstable crate, it was last released over a year ago, so maybe not a great choice.

I didn't know about the SSTable index, TIL :)
But I agree with your assessment, probably not the best choice.

It's a bit of a bummer that the current approach requires reading the whole SQLite index before making other fetches, since 291MB is still a lot of data to process. Speaking of which, your benchmark shows that fetching an entry from the 291MB index (including decompression?) takes about 60 microseconds? That seems impossible.

About decompression: same as the current CBOR index, we only compress it on S3. When we don't find the index locally for a requested crate, we download it, and store it decompressed in a local directory.

So for a rustdoc request we only have to:

  • open the SQLite file
  • use the BTree index on the files table to find the file.

This is what the two benchmarks cover: once including opening the SQLite file, and once with an already open connection.

Since I haven't worked much with criterion until now, I did a sanity check: in a short command-line program, 1 million file fetches on the same 290MB DB take 8.2 seconds.

I assume that SSTables or other formats would be even faster, but IMO this is not the focus.

It should be fast enough, it should be scalable for bigger crates (that's definitely a learning), and IMO easy to debug.

As a side note:
Currently we put the index directly onto the disk after the build, so we never have to download the index from S3.
With the new ECS setup this will change, and I assume every time we reboot the servers / scale etc. an archive index potentially has to be downloaded, once per webserver.

@jsha (Contributor) commented Mar 8, 2023

Ah, thanks for explaining. So the 60 microseconds is to open a file whose contents are already uncompressed and in the buffer cache, and do a handful of random reads? That seems more reasonable.

And interesting about the archive indexes being cached on local disk. It sounds like we never clear them up? Do we just have a massive disk, or are the archive indexes that small?

One other thing: what about storing hashes of filenames instead of full filenames? Might make the index size smaller.

@syphar (Member, Author) commented Mar 8, 2023

And interesting about the archive indexes being cached on local disk. It sounds like we never clear them up? Do we just have a massive disk, or are the archive indexes that small?

Most are relatively small; right now the whole folder has ~31 GB, for all releases since we introduced archive storage.
(Everything before that is still stored as single HTML files, pending a migration or rebuild.)

One other thing: what about storing hashes of filenames instead of full filenames? Might make the index size smaller.

It makes the local index a little smaller, but the compressed ( = remotely stored) index bigger.
Since the size is not that different, IMO debuggability wins, but happy to discuss:

.rw-r--r--  281M syphar         1 day output.db
.rw-r--r--   62M syphar         1 day output.db.zstd
.rw-r--r--  168M syphar    37 seconds output_md5.db
.rw-r--r--   94M syphar     2 seconds output_md5.db.zstd

@jsha (Contributor) commented Mar 8, 2023

Ah, interesting. I guess compressibility makes those repetitive filenames more efficient to represent than the unrepetitive md5sums. The unique part of each filename is generally less than 16 bytes. I agree with your conclusion, then.

Just to confirm, this is storing the md5 as a blob of bytes, not a hex string, right?

@syphar (Member, Author) commented Mar 8, 2023

Ah, interesting. I guess compressibility makes those repetitive filenames more efficient to represent than the unrepetitive md5sums. The unique part of each filename is generally less than 16 bytes. I agree with your conclusion, then.

Just to confirm, this is storing the md5 as a blob of bytes, not a hex string, right?

Fun fact: no, it was a string, seemingly evening tiredness ;) Thanks!

.rw-r--r--  281M syphar         1 day  output.db
.rw-r--r--   62M syphar         1 day  output.db.zstd
.rw-r--r--  168M syphar    43 minutes  output_md5.db
.rw-r--r--   94M syphar    42 minutes  output_md5.db.zstd
.rw-r--r--  107M syphar    41 seconds  output_md5_bytes.db
.rw-r--r--   82M syphar     2 seconds  output_md5_bytes.db.zstd
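
For reference, the difference boils down to this (a minimal self-contained sketch with made-up table / column names, using the md5 and rusqlite crates):

use rusqlite::{params, Connection};

fn main() -> rusqlite::Result<()> {
    let conn = Connection::open_in_memory()?;
    conn.execute_batch(
        "CREATE TABLE files_md5       (hash TEXT PRIMARY KEY, start_offset INTEGER, end_offset INTEGER);
         CREATE TABLE files_md5_bytes (hash BLOB PRIMARY KEY, start_offset INTEGER, end_offset INTEGER);",
    )?;

    let digest = md5::compute("settings.html");

    // hex string: 32 bytes per hash in the database
    conn.execute(
        "INSERT INTO files_md5 (hash, start_offset, end_offset) VALUES (?1, ?2, ?3)",
        params![format!("{digest:x}"), 0, 100],
    )?;

    // raw digest: 16 bytes per hash, stored as a SQLite BLOB
    conn.execute(
        "INSERT INTO files_md5_bytes (hash, start_offset, end_offset) VALUES (?1, ?2, ?3)",
        params![digest.0.to_vec(), 0, 100],
    )?;

    Ok(())
}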

@syphar (Member, Author) commented Mar 8, 2023

Read speed is quite similar with or without MD5.

This is an example with a more regular-sized crate (17k files):

.rw-r--r--  3.8M syphar    15 seconds  small.db
.rw-r--r--  726k syphar      1 second  small.db.zst
.rw-r--r--  1.5M syphar      1 minute  small_md5_bytes.db
.rw-r--r--  830k syphar    45 seconds  small_md5_bytes.db.zst

Which feels like using md5 with bytes would be a micro-optimization here, costing the ease of working with the data directly.

@jsha (Contributor) commented Mar 9, 2023

Which feels like using md5 with bytes would be a micro-optimization here, costing the ease of working with the data directly.

Yep, I concur.

@syphar syphar added S-waiting-on-author Status: This PR is incomplete or needs to address review comments and removed S-waiting-on-review Status: This pull request has been implemented and needs to be reviewed labels Mar 11, 2023
@syphar syphar changed the title from "introduce new archive index format based on SQLite (RFC)" to "introduce new archive index format based on SQLite" Mar 11, 2023
@syphar (Member, Author) commented Mar 11, 2023

Coming from the feedback, I'll

  • fix the conflicts
  • write tests for the SQLite connection pool
  • write more tests

and then ping someone for the final review :)

@syphar syphar added S-waiting-on-review Status: This pull request has been implemented and needs to be reviewed S-waiting-on-author Status: This PR is incomplete or needs to address review comments and removed S-waiting-on-author Status: This PR is incomplete or needs to address review comments S-waiting-on-review Status: This pull request has been implemented and needs to be reviewed labels Apr 16, 2023
@syphar syphar added S-waiting-on-review Status: This pull request has been implemented and needs to be reviewed and removed S-waiting-on-author Status: This PR is incomplete or needs to address review comments labels May 3, 2023
@syphar (Member, Author) commented May 3, 2023

I think we can proceed here. Does anyone have time for a review? @jyn514 @Nemo157 @jsha?

I'm happy to add anything missing.

(review threads on src/storage/archive_index.rs, resolved)
@syphar (Member, Author) commented May 12, 2023

Thank you for the review & the input, @Nemo157!

@syphar syphar requested a review from Nemo157 May 12, 2023 18:12
@Nemo157 (Member) left a comment

Tested a build on my server, along with loading docs from old archives, and grabbed the index from minio to inspect with sqlite manually; it all seems to Just Work 👍.

@syphar (Member, Author) commented May 14, 2023

thank you for the review! ❤️

@syphar syphar merged commit a502896 into rust-lang:master May 14, 2023
@syphar syphar deleted the faster-archive-index branch May 14, 2023 08:17
@github-actions github-actions bot added S-waiting-on-deploy This PR is ready to be merged, but is waiting for an admin to have time to deploy it and removed S-waiting-on-review Status: This pull request has been implemented and needs to be reviewed labels May 14, 2023
@syphar syphar removed the S-waiting-on-deploy This PR is ready to be merged, but is waiting for an admin to have time to deploy it label May 14, 2023