Compress documentation per-crate, not per-file #1004

Closed
jyn514 opened this issue Aug 25, 2020 · 17 comments · Fixed by #1342
Labels
A-builds (Area: Building the documentation for a crate), C-enhancement (Category: This is a new feature)

Comments

@jyn514 (Member) commented Aug 25, 2020

Currently, docs.rs stores each generated HTML file individually on S3. This has the advantage that downloading a single page is fast and efficient, but it means that it's very expensive to download all files for a crate (cc #174), particularly because some crates can have many thousands of generated files. It also makes uploads more expensive, since S3 charges per object uploaded.

Docs.rs should instead store a single archive per crate and compress the entire archive. That would decrease storage costs and upload times, and allow retrieving a crate's entire documentation efficiently. It would have the downside that, for crates with many gigabytes of documentation, loading a single page would take much longer - perhaps crates over a certain size could be exempted from archives?

This would also make it more feasible to implement #464, since the upload costs would be greatly decreased.

@jyn514 added the A-builds and C-enhancement labels on Aug 25, 2020
@jyn514 (Member, Author) commented Aug 25, 2020

Another idea @pietroalbini had was to have split archives for very large crates: one file storing an 'index' of the byte offset in the archive for each file. That would allow making a range request for just the needed file, without having to download the whole archive.

This would require compressing each individual file and not compressing the archive, but should make it scalable even to crates with many gigabytes of documentation. For small crates (say, < 3 MB), we could still have the index as part of the archive itself.
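To make the index idea concrete, here is a minimal sketch (hypothetical types and values, not an actual docs.rs API): a map from file path to byte offset and length, from which the web server can build the `Range` header for an S3/HTTP range request that fetches just that one file.

```rust
use std::collections::HashMap;

/// Hypothetical index entry: where one (individually compressed) file's
/// bytes live inside the per-crate archive.
#[derive(Debug, Clone, Copy)]
struct IndexEntry {
    offset: u64,
    length: u64,
}

/// Build the `Range` header value needed to fetch exactly one file out of
/// the archive with an HTTP/S3 range request (byte ranges are inclusive).
fn range_header(index: &HashMap<String, IndexEntry>, path: &str) -> Option<String> {
    let entry = index.get(path)?;
    Some(format!("bytes={}-{}", entry.offset, entry.offset + entry.length - 1))
}

fn main() {
    let mut index = HashMap::new();
    index.insert(
        "regex/struct.Regex.html".to_string(),
        IndexEntry { offset: 4096, length: 10_240 },
    );
    assert_eq!(
        range_header(&index, "regex/struct.Regex.html").as_deref(),
        Some("bytes=4096-14335")
    );
}
```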

@pietroalbini (Member) commented:

I'm not sure it was my idea (I remember reading it on Discord a while ago) :)

@Nemo157 (Member) commented Oct 31, 2020

Recent stats from the #1019 metrics:

[three screenshots of the metrics graphs]

The drops are from when the service restarted during a deploy yesterday. I'm not sure what caused the spike; it seems likely to be some kind of crawler that was active for an hour.

One interesting stat we can draw from this: non-default platforms are used, but relatively rarely. Over the last hour before the screenshots there were 6130 different versions of crates accessed, and 7340 different platforms of those versions, so ~1.19 platforms per version (compared with the 5 platforms per version that are built). That does imply that we definitely want to compress documentation per-platform, since the majority of alternative platforms are unlikely to be loaded and we don't want to waste space caching their indexes locally (maybe also relevant for #343 @jyn514).

The main thing we can draw from these stats: with a 10k-item MRU cache we would get an ~1 hour eviction period; with 5k items, a ~30 minute eviction period.

I've started experimenting with a library + CLI to handle the archive and indexing at https://github.com/Nemo157/oubliette.

@Nemo157 (Member) commented Nov 5, 2020

From discussion on discord:

  • It's likely worth deduping all platforms into the same archive file
  • But that would make it more likely to hit the threshold for caching the full archive locally
  • We could build multiple indexes pointing into the same archive file to avoid bloating the index size
  • We could decide per-crate whether to split across multiple archives based on how close the default target is to the caching threshold
  • Or we could always put the default target into a separate archive, and archive the other platforms together
  • Or a more complex scheme where the default target isn't deduped against, but other targets can point to its files, and the archive can be truncated after one target (this is hairy, but it's the kind of thing you can do if it's a totally custom format)

    You could imagine the archive format as basically a bunch of concatenated "target archives" one after the other, where a target archive is only allowed to reference files either in itself or in an earlier one. Then, if you put the default target first, you can produce a valid archive by truncating it at the end (rough sketch of such a layout below).
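A rough sketch of how such a truncatable, concatenated layout could be modelled (purely illustrative; none of these types exist in docs.rs):

```rust
/// One "target archive" section inside the concatenated file. A section may
/// only reference files stored in itself or in an earlier section, so
/// truncating the file right after the first (default-target) section still
/// yields a valid archive.
#[allow(dead_code)]
struct TargetSection {
    target: String, // e.g. "x86_64-unknown-linux-gnu"
    files: Vec<FileEntry>,
}

#[allow(dead_code)]
enum FileEntry {
    /// File bytes stored in this section, at `offset` within the whole file.
    Inline { path: String, offset: u64, length: u64 },
    /// Deduplicated reference to an identical file in an *earlier* section.
    BackRef { path: String, section: usize, entry: usize },
}

fn main() {}
```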

@syphar (Member) commented Feb 1, 2021

I was intrigued by this topic and have been digging into it. I'm not sure about the goals here.

After talking to @jyn514 and @pietroalbini, I think

primary goals

  • reducing the maintenance burden of many files on S3 (coming from @pietroalbini)
  • downloadable docs for offline readers (coming from Downloadable docs #174)
  • constraints: not needing more storage space, and keeping roughly the current speed (coming from @jyn514)

secondary goals

  • better response times for rustdoc pages (caching a full archive for the version/target and serving from the local archive)
  • needing less storage space on S3

Given these, I'm not sure why the previous comments settled on those approaches. IMHO inventing a custom archive format or deduping only serves the purpose of an even smaller size, and we would then need to recompress to offer downloadable docs (or ask users to deal with a custom format).

Wouldn't a simple approach be better to start with?

  • a simple archive format which supports range-requests (ZIP for example)
  • just an archive per crate/version/target
  • storing an index (file name + offset) next to it (can be regenerated from the ZIP whenever we want)
  • caching the indexes locally (then we could even answer existence queries directly from the index)
  • downloadable docs can directly use the ZIP.

Wouldn't that give us all the primary goals? (I understand that it would not use much less space, since ZIP only compresses per file, which is what gives us the ability to download single files out of the archive.)
With a little optimisation we could even skip storing a separate index on S3 and just fetch the last bytes of the ZIP to read its central directory.
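A minimal sketch of that index-building step, assuming the `zip` crate (its `data_start()` / `compressed_size()` accessors expose exactly what the central directory already knows):

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::BufReader;

/// Map of file name -> (offset of the compressed data, compressed size),
/// built once per archive and cacheable locally; single files can then be
/// fetched from S3 with plain HTTP range requests.
fn build_index(path: &str) -> zip::result::ZipResult<HashMap<String, (u64, u64)>> {
    let mut archive = zip::ZipArchive::new(BufReader::new(File::open(path)?))?;
    let mut index = HashMap::new();
    for i in 0..archive.len() {
        let entry = archive.by_index(i)?;
        index.insert(
            entry.name().to_owned(),
            (entry.data_start(), entry.compressed_size()),
        );
    }
    Ok(index)
}
```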

Even using zstd and a custom archive format wouldn't benefit much from the overlap between all the HTML files across versions / targets, because to be able to access single files without decompressing the whole archive we always have to compress the files on their own.

What am I missing?

@Nemo157 (Member) commented Feb 1, 2021

> downloadable docs for offline readers (coming from #174)

This was not a primary goal AIUI, just something that this could potentially make possible. I would personally rate improving response times higher on the list than it.

> Even using zstd and a custom archive format wouldn't benefit much from the overlap between all the HTML files across versions / targets, because to be able to access single files without decompressing the whole archive we always have to compress the files on their own.

Using zstd with a custom dictionary and per-file compression gives a large space saving, something around 1/4 or 1/5 of the total compressed size (benchmark results here). That doesn't matter so much in terms of S3 usage, but might help a little if data transfer rates from S3 are slowing us down (I assume it's all lookup overhead and the actual data transfer is minuscule). The place it really helps is if it can shrink some archives enough that we can trivially cache them locally on the web server and avoid the remote lookup entirely. According to grafana, that S3 lookup is currently about 82ms of the 105ms it takes on average to render a rustdoc page at the 95th percentile; with a locally cached archive I would expect that to be sub-ms, reducing the total to something like 23ms.
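For illustration, a rough sketch of that per-file-compression-with-a-shared-dictionary scheme (not the benchmark code; it assumes the `zstd` crate's `dict` and `bulk` modules, and the trained dictionary would of course have to be stored alongside the archive for decompression):

```rust
/// Compress every generated file of a release individually, but against one
/// dictionary trained on all of them, so the shared rustdoc boilerplate is
/// factored out of each entry while single files stay independently readable.
/// Returns the dictionary plus the per-file compressed bodies.
fn compress_release(
    files: &[(String, Vec<u8>)],
) -> std::io::Result<(Vec<u8>, Vec<(String, Vec<u8>)>)> {
    let samples: Vec<&[u8]> = files.iter().map(|(_, body)| body.as_slice()).collect();
    // ~110 KiB dictionary, the default size used by zstd's own training tool.
    let dict = zstd::dict::from_samples(&samples, 112_640)?;

    let mut compressor = zstd::bulk::Compressor::with_dictionary(9, &dict)?;
    let compressed = files
        .iter()
        .map(|(path, body)| Ok((path.clone(), compressor.compress(body)?)))
        .collect::<std::io::Result<Vec<_>>>()?;
    Ok((dict, compressed))
}
```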

@jyn514 (Member, Author) commented Feb 1, 2021

> maintenance burden of many files on S3

In particular, if we stored files in a single archive, it would be feasible to re-upload docs for old crates (#464). Right now that costs several thousand dollars.

@syphar (Member) commented Feb 2, 2021

>> downloadable docs for offline readers (coming from #174)
>
> This was not a primary goal AIUI, just something that this could potentially make possible. I would personally rate improving response times higher on the list than it.

At least I was right in thinking that I'm hearing conflicting goals on this topic 😄. Or that what I heard from @jyn514 and @pietroalbini didn't match the discussion here on this issue. (If I misunderstood, please correct me.)

>> Even using zstd and a custom archive format wouldn't benefit much from the overlap between all the HTML files across versions / targets, because to be able to access single files without decompressing the whole archive we always have to compress the files on their own.
>
> Using zstd with a custom dictionary and per-file compression gives a large space saving, something around 1/4 or 1/5 of the total compressed size (benchmark results here). That doesn't matter so much in terms of S3 usage, but might help a little if data transfer rates from S3 are slowing us down (I assume it's all lookup overhead and the actual data transfer is minuscule). The place it really helps is if it can shrink some archives enough that we can trivially cache them locally on the web server and avoid the remote lookup entirely. According to grafana, that S3 lookup is currently about 82ms of the 105ms it takes on average to render a rustdoc page at the 95th percentile; with a locally cached archive I would expect that to be sub-ms, reducing the total to something like 23ms.

Thinking world-wide, by far the biggest lever on site speed is IMHO not the S3 download but using a CDN.
For a normal rustdoc page, most of the time is not spent on the server but on the round trip to the US (100-150ms) and the content download (100-600ms). Add another round trip for every redirect users hit, depending on where they come from.

I've done multiple setups with Fastly (they have a special open source program, which CloudFront doesn't), and even without the OS program they were all cheaper and faster than CloudFront.

We could have worldwide, stable response times of <30ms for almost all pages, with <1s between a release and the page being updated, for perhaps a day of work (mostly around returning correct caching headers and purging the right parts automatically), while likely saving money. Fastly can also serve stale content and fetch the new page in the background, so new content is still live after 1-2 seconds.

Even when optimising the hell out of server-side response times, for most of the world it would only make a 10-20% difference in response time, while requiring us to build the server-local caching.

So to sum it up: IMHO making speed improvements a secondary goal for this issue would reduce effort and risk (by using a standard archive format), while still supporting #464 and #174 and reducing the maintenance burden.

@jyn514 (Member, Author) commented Feb 2, 2021

+1 for focusing on shrinking the number of files rather than improving response times. I think it would be nice to improve response times, but that's not the primary focus. I don't think the difference in size between zstd and DEFLATE is worth giving up range requests. We haven't had issues with storage size in quite a while; it's been between 3 and 4 TB since about a year ago, which seems reasonable (it dropped off quite a bit after we started compressing files).

@syphar (Member) commented Feb 2, 2021

If I remember the code by @Nemo157 correctly, it also would have allowed range requests, since the files were compressed separately in the archive. It was compressing with zstd file-by-file and concatenating the compressed streams into a single archive (rough sketch below).

With the focus on file numbers, we could start with compressing after the build plus range requests, while of course keeping the option to add webserver-local archive caching later.
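A rough sketch of that "compress per file, concatenate the streams, record offsets" shape (illustrative only, not the actual oubliette code; `zstd::bulk::compress` here is the plain one-shot API, without a dictionary):

```rust
/// Concatenate per-file zstd streams into one archive blob and record
/// (path, offset, compressed length) so that single files can later be
/// served via range requests without touching the rest of the archive.
fn build_archive(
    files: &[(String, Vec<u8>)],
) -> std::io::Result<(Vec<u8>, Vec<(String, u64, u64)>)> {
    let mut archive = Vec::new();
    let mut index = Vec::new();
    for (path, body) in files {
        let compressed = zstd::bulk::compress(body, 9)?;
        index.push((path.clone(), archive.len() as u64, compressed.len() as u64));
        archive.extend_from_slice(&compressed);
    }
    Ok((archive, index))
}
```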

@jsha (Contributor) commented Jun 3, 2021

Out of curiosity, what are the crates with gigabytes of documentation?

@jyn514 (Member, Author) commented Jun 3, 2021

@jsha most of the stm32* crates are enormous

@beautifulentropy commented Jun 4, 2021

You could also set up a CloudFront distribution for the S3 bucket and serve docs from cached edge locations. Just keep in mind that you'll be increasing complexity on the AWS side of things. When an object in the S3 bucket is replaced, the cached copy in the CloudFront distribution persists until its path is invalidated. Each path invalidation call can take 60-300 seconds; the first 1000 are free, $0.005 USD each after that. A Lambda job can "push" these path invalidations when an object is updated (nice writeup on this approach here: https://kupczynski.info/2019/01/09/invalidate-cloudfront-with-lambda-s3.html).

@syphar (Member) commented Jun 4, 2021

> You could also set up a CloudFront distribution for the S3 bucket and serve docs from cached edge locations.

Currently docs.rs rewrites the rustdoc HTML, for example to add the header and footer. Since this happens in the web server and not in the build process, serving directly from S3 via CloudFront won't work for us.

@jsha (Contributor) commented Jun 8, 2021

Doing some back-of-the-envelope calculations, according to https://aws.amazon.com/s3/pricing/, uploading costs $0.005 per 1k objects uploaded, and storage is $0.023 per GB/month. So if the average crate is 200MB and 1000 files, that's $0.0046 in monthly storage costs, and $0.005 in per-upload costs. So uploading all crates every 6 weeks would (assuming these numbers, which I don't have a good basis for) approximately double costs.
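Spelling that arithmetic out (same assumed 200 MB / 1000 objects per crate):

$$
\begin{aligned}
\text{storage} &\approx 0.2\ \text{GB} \times \$0.023/\text{GB-month} \approx \$0.0046\ \text{per month}\\
\text{re-upload} &\approx 1000 \times \$0.005/1000 = \$0.005\ \text{per upload} \approx \$0.005/1.4\ \text{months} \approx \$0.0036\ \text{per month}
\end{aligned}
$$

so re-uploading on a 6-week cycle adds roughly as much per month as storage itself, which is where the "approximately double" comes from.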

@syphar is this still something you're interested in working on? I'd love to get #464 unblocked. We've been making a bunch of UI changes to rustdoc output and I'm worried folks will be confused seeing a variety of subtly different interfaces on docs.rs.

@jyn514 (Member, Author) commented Jun 8, 2021

@jsha see #1342

@syphar (Member) commented Jun 9, 2021

Yes, I'm still working on it, though progress hasn't been as fast as I'd planned over the last few weeks.

I have a working prototype that still needs some work, but I would say I'm halfway there.
