
Make it possible to use docs.rs offline for pages that have been visited at least once #845

Open
jyn514 opened this issue Jun 21, 2020 · 31 comments
Labels
A-frontend Area: Web frontend E-medium Effort: This requires a fair amount of work help-wanted P-low Low priority issues

Comments

@jyn514
Member

jyn514 commented Jun 21, 2020

It'd be great to turn docs.rs into an offline-first PWA (Progressive Web App). So the user would still be able to browse the docs they have already visited before even when offline, without having to use a separate website or app.

The same could be done for doc.rust-lang.org.

Originally posted by @teohhanhui in #174 (comment)

@jyn514
Member Author

jyn514 commented Jun 21, 2020

The same could be done for doc.rust-lang.org.

You can open an issue on https://github.com/rust-lang/www.rust-lang.org for that site, it's managed by a different team. I imagine they would be very receptive since it's a completely static site.

Another alternative is a browser extension to redirect online version -> offline version, similar to what the IPFS Companion extension does. For example:
https://doc.rust-lang.org/std/sync/struct.RwLock.html -> file:///home/teohhanhui/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/share/doc/rust/html/std/sync/struct.RwLock.html

Hmm, this is an interesting idea. I don't think it would work with relative links though; .. would send you to doc/rust/html/std/sync/index.html, which might not exist yet. Also I'm not sure this would work correctly if we used trailing slashes on any page instead of /index.html.

That can be achieved with cargo doc to build local crates and rustup doc for the book, std, and everything else on doc.rust-lang.org

The whole point of docs.rs is that you don't have to build the docs yourself, so while it's a useful tip, I don't think it should replace being able to use docs.rs offline.

I don't know very much about PWAs. If we set pages to be cached for a longer time, would that meet this use case? That way you could visit the cached page even when you lost internet.

@Kixiron
Member

Kixiron commented Jun 21, 2020

In regards to relative links, I wouldn't be sad if they went away, as they're not really a great thing in the first place. Replacing relative links would probably help simplify a good portion of the code while also being less finicky and harder to mess up.

@jyn514
Member Author

jyn514 commented Jun 21, 2020

In regards to relative links, I wouldn't be sad if they went away, as they're not really a great thing in the first place. Replacing relative links would probably help simplify a good portion of the code while also being less finicky and harder to mess up.

I strongly disagree. Without relative links we'd have to hardcode https://docs.rs at the start of every url, which would break this anyway.

@jyn514
Member Author

jyn514 commented Jun 21, 2020

Also, rustdoc heavily uses relative links for documentation; I don't see a good way to change that, since it doesn't know the absolute URL it will be used with.

@Kixiron
Member

Kixiron commented Jun 21, 2020

Oh, I thought you meant relative links in reference to how we do our own source browser, with .. being "up", but it seems we already use a canonical link for that

@teohhanhui

I don't know very much about PWAs. If we set pages to be cached for a longer time, would that meet this use case? That way you could visit the cached page even when you lost internet.

See https://developer.mozilla.org/en-US/docs/Web/Progressive_web_apps/Offline_Service_workers#Offline_First

Changing the cache expiry would help, however that requires the user to manually toggle offline mode in their browser (which is a very hidden thing nowadays, if not impossible altogether...)

@jyn514
Member Author

jyn514 commented Jun 22, 2020

Changing the cache expiry would help, however that requires the user to manually toggle offline mode in their browser

That seems to defeat the point of caching :(

Glancing through the page you linked, it seems like the main idea is to have some JavaScript that checks if the page is cached before making a network request. I agree that should be the behavior, but I'm not comfortable enough with JavaScript to implement it / don't have the time. If someone is interested in working on this I'd be happy to mentor, though :) Almost all of the site can be cached except the home page, /releases, and redirects.
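
For anyone picking this up, the core of that check is only a few lines with the standard Cache API. A rough sketch only (the cache name is made up, and none of this exists in docs.rs today):

```js
// Sketch only: look in the Cache API before going to the network.
// "docs-rs-pages" is a hypothetical cache name, not anything docs.rs uses today.
async function loadPage(url) {
  const cache = await caches.open('docs-rs-pages');
  const cached = await cache.match(url);
  if (cached) {
    return cached; // already cached: no network request needed
  }
  const fresh = await fetch(url);       // otherwise fetch it...
  await cache.put(url, fresh.clone());  // ...and keep a copy for next time
  return fresh;
}
```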

@jyn514 jyn514 added A-frontend Area: Web frontend wishlist P-low Low priority issues and removed wishlist labels Jun 22, 2020
@wanderrful

wanderrful commented Sep 20, 2020

Angular apps have service workers built into them implicitly, so if you guys are willing to upgrade this from a Python Jinja-like Tera front-end (https://crates.io/crates/tera) to an Angular front-end then you can get the Service Worker caching for free.

Here's some more info: https://angular.io/guide/service-worker-intro

As for the rust-lang website itself, it has a Handlebars front-end (https://github.com/rust-lang/www.rust-lang.org/blob/master/templates/index.hbs), which could also be replaced with an Angular front-end.

However, I think it'd probably be more on-brand for these Rust websites to have a Rust-based front-end that compiles to WebAssembly rather than be Javascript-based. The only such crate I'm aware of that might do this is Yew, but it doesn't have Service Workers built into it as far as I know. It's not "production-ready", but since these websites are just static pages I don't think that's a concern.

Angular could potentially be overkill since these sites are just static pages, but just because it has a bunch of bells and whistles doesn't mean you have to use them.

@jyn514
Member Author

jyn514 commented Sep 20, 2020

I'd strongly prefer for docs.rs to remain a static site first and foremost, and especially remain usable with JavaScript disabled. I'm fine with JS adding features on top, but the JS shouldn't be necessary just to use the site.

That said I don't know much about frontend, so maybe Angular can do that?

@wanderrful

wanderrful commented Sep 20, 2020

Service workers themselves are implemented on the front-end via Javascript, so I'm not sure that we can have our cake and eat it, too, in this situation.

With that design constraint, I'm not sure we can make this website offline-first. All we could do is just ask users to use their browser's "make available offline" feature if they want to use the site while offline.

Edit: Even WebAssembly requires Javascript to be enabled, so I'm not sure that any Rust-based WASM solution would work either.

@jyn514
Member Author

jyn514 commented Sep 20, 2020

Let me approach this from a different angle (I really like the framing in https://internals.rust-lang.org/t/pre-rfc-user-namespaces-on-crates-io/12851/96 to discuss things as problems to solve and not solutions to implement).

docs.rs currently is a dynamic site which serves static HTML. It does not have caching for rustdoc pages, which means the site is not available when you're offline. The goal of this issue is to be able to use docs.rs offline if you've already visited the relevant pages at least once.

If I'd never heard of PWAs, the way I'd imagine implementing this is something like the following:

  • When you visit a page for the first time it loads as HTML. The HTML is cached indefinitely and never expires.
  • The HTML has a link to a .js file. Every time you refresh the page, the JS reaches out to the server to ask for a new version; if there's a newer version it replaces the HTML on the page (preferably with lazy-loading so this doesn't block)

What this gets docs.rs is three things:

  1. JS is completely optional. If you have it disabled, then you just have to force-refresh the page once in a while to get newer versions, which I'm ok with (the rustdoc pages are very rarely updated, only when there's a docs.rs bug).
  2. Pages are viewable offline. If you are offline, the HTML is cached and the JS just doesn't run, so you can view the page fine.
  3. If JS is enabled, the page will always be up-to-date.
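
As a very rough sketch of the second bullet above (the script that asks the server for a newer version and swaps in the HTML), assuming the HTML itself is served with a long cache lifetime; the DOM swap here is deliberately naive:

```js
// A naive sketch of the "JS asks the server for a newer version" step.
async function revalidate() {
  try {
    const response = await fetch(location.href, { cache: 'no-cache' });
    if (!response.ok) return;  // server error: keep the page we already have
    const html = await response.text();
    const fresh = new DOMParser().parseFromString(html, 'text/html');
    if (fresh.body.innerHTML !== document.body.innerHTML) {
      document.body.replaceWith(fresh.body);  // newer build: swap it in
    }
  } catch (_) {
    // offline: the cached HTML stays up, which is exactly what we want
  }
}
window.addEventListener('load', revalidate);
```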

Regardless of the technologies or frameworks used, does that basic idea sound feasible?

@jyn514 jyn514 changed the title Turn docs.rs into an offline-first PWA (Progressive Web App) Make it possible to use docs.rs offline for pages that have been visited at least once Nov 13, 2021
@GuillaumeGomez
Member

Won't this be an issue for pages like doc.rust-lang.org/nightly/std/whatever.html? I don't think we have an equivalent on docs.rs except when arriving on the crate page (but then it redirects to the latest version).

@jyn514
Member Author

jyn514 commented Nov 22, 2021

@GuillaumeGomez are you saying that this breaks once latest no longer redirects to another page (#1527)? I think we can avoid that by just having a much shorter cache expiration date on those pages.

@GuillaumeGomez
Member

Yes, that's what I meant.

@jsha
Contributor

jsha commented Nov 25, 2021

I think this is probably feasible. Some questions to figure out: should all of docs.rs be one big PWA, which manages a cache of all the various docs you've visited? Or should each crate's docs be a separate PWA? Ideally we'd like the same behavior on doc.rust-lang.org, which means the functionality should be in rustdoc, which argues for a PWA per crate.

Also, it looks like Service Workers allow us to actually prefetch resources that the user hasn't visited yet. So for instance if you visit one page of a crate's docs, it could download all the pages of that crate's docs. The storage could add up fast, though, so we'd need heuristics about when or if to do that.
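
For reference, the prefetch piece is mostly Cache API calls from the Service Worker. A sketch, where the message shape, cache name, and URL list are invented for illustration:

```js
// Sketch: prefetching pages from a Service Worker with the Cache API.
// The message shape, cache name, and URL list are made up for illustration.
self.addEventListener('message', (event) => {
  if (event.data && event.data.type === 'prefetch-crate') {
    event.waitUntil(
      caches.open('docs-rs-prefetch')
        .then((cache) => cache.addAll(event.data.urls))
    );
  }
});
```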

@jsha
Contributor

jsha commented Nov 26, 2021

I have a local prototype of this that's kinda neat, and plan to work on it some more and will share results when they're good enough. I had high hopes of precaching a whole crate / the whole stdlib, but fetching that many files individually (30,847 for the stdlib) was prohibitively slow. And users probably wouldn't thank us for using that much data without a more explicit opt-in anyhow.

Here's my current thinking:

  • On Service Worker install, preload the static assets (fonts, JS, CSS, images), and always serve those from the Cache API
  • Whenever a user navigates to an HTML page:
    • If it's not in the Cache API, fetch it, store it, and serve it.
    • If it's in the Cache API
      • load from the local copy
      • fire off a background fetch to see if there's an updated version (new crate version, or rebuild with different rustdoc version).
      • if there is an updated version:
        • store it in the Cache API
        • flush all local storage of the outdated version
        • start preloading the latest version of pages that were locally cached before?
        • update the current page to tell the user there's a newer version available, with a button/link that reloads the page.

Note that in this scenario, nothing changes for users without JS; they never load the Service Worker.
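
A rough sketch of that fetch handler (cache first, revalidate in the background); the cache name is illustrative, and the "tell the user there's a newer version" part is omitted:

```js
// Sketch of the cache-first flow above; cache name and update handling are illustrative.
self.addEventListener('fetch', (event) => {
  event.respondWith((async () => {
    const cache = await caches.open('docs-rs-html');
    const cached = await cache.match(event.request);
    const network = fetch(event.request).then((response) => {
      if (response.ok) cache.put(event.request, response.clone());
      return response;
    });
    if (cached) {
      event.waitUntil(network.catch(() => {}));  // revalidate quietly; ignore offline
      return cached;                             // serve the local copy immediately
    }
    return network;  // first visit: fetch, store, and serve
  })());
});
```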

Alternately, we could prefer freshness:

  • Whenever a user navigates to an HTML page:
  • Attempt to fetch it from the network
  • If the network is offline, or after a 2-second timeout, serve from cache (if available).
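
A sketch of that freshness-first variant, again with illustrative names:

```js
// Sketch of the network-first alternative: try the network, fall back to
// the cache when offline or after a ~2-second timeout.
self.addEventListener('fetch', (event) => {
  event.respondWith((async () => {
    const cache = await caches.open('docs-rs-html');
    const timeout = new Promise((resolve) => setTimeout(resolve, 2000));
    try {
      const response = await Promise.race([fetch(event.request), timeout]);
      if (response) {
        if (response.ok) cache.put(event.request, response.clone());
        return response;
      }
    } catch (_) {
      // network error: fall through to the cache
    }
    return (await cache.match(event.request)) || Response.error();
  })());
});
```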

The first approach is quite similar to the Cache-Control stale-while-revalidate directive. As a simpler approach, we could try changing the headers on HTML pages. Right now they have no Cache-Control header. We could add max-age=0, stale-while-revalidate=5260000. I think that would make the page available offline for up to 2 months, and if there is a newer version available it would get fetched in the background and be ready on the user's next page load. I need to do some testing on this - none of the docs for Cache-Control stale-while-revalidate explicitly mention offline.

Advantage for the Cache-Control approach: much easier to deploy and reason about.

Advantages of the Service Worker approach:

  • deeply customizable. We can provide an interface for people to preload whole crates for offline use. When a new version is available we can purge all URLs from the old version. That avoids a potentially frustrating experience under Cache-Control stale-while-revalidate where each page you load shows the outdated version at first. For docs.rs we could even have an origin-wide background JS task that looks for newer versions of all crates where you have a local cache of some pages, and proactively purges / refreshes.
  • we can provide a custom offline page for things we don't have in cache, offering information on what we do have cached.
  • it can work regardless of Cache-Control headers. So when someone deploys docs on their own server, they don't need to get all the headers just right - we can still offer offline via Service Workers.
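
Regarding the custom offline page in the second point: that part is also only a few lines in the Service Worker. A sketch, where /offline is a hypothetical route that docs.rs does not serve today:

```js
// Sketch: when neither the network nor the cache has the page, answer with a
// pre-cached offline page instead of the browser's error screen.
// "/offline" is a hypothetical route, not something docs.rs serves today.
async function respondOrFallback(request) {
  try {
    return await fetch(request);
  } catch (_) {
    const cache = await caches.open('docs-rs-html');
    return (await cache.match(request))
        || (await cache.match('/offline'))  // hypothetical fallback page
        || Response.error();
  }
}
```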

One of the exciting things about both approaches is that they have the potential to dramatically speed up repeat visits even when online.

@jsha
Contributor

jsha commented Nov 26, 2021

By the way, to be able to readily experiment with this without the possibility of breaking docs.rs, it should be possible to run some totally third-party site that has a Service Worker and fetches / serves pages from docs.rs as if those pages were on its own origin. But that would require setting Access-Control-Allow-Origin on all/most docs.rs pages. Is that reasonable to do?

@jyn514
Member Author

jyn514 commented Nov 26, 2021

I had high hopes of precaching a whole crate / the whole stdlib, but fetching that many files individually (30,847 for the stdlib) was prohibitively slow.

This should be possible once we finally implement downloadable docs :) which would serve the docs as one big zipfile for the whole crate.

By the way, to be able to readily experiment with this without the possibility of breaking docs.rs, it should be possible to run some totally third-party site that has a Service Worker and fetches / serves pages from docs.rs as if those pages were on its own origin. But that would require setting Access-Control-Allow-Origin on all/most docs.rs pages. Is that reasonable to do?

I would be worried about doing this on docs.rs in prod, but it shouldn't be terribly difficult to run a fork of docs.rs somewhere and add Access-Control-Allow-Origin there.

Hmm, I guess that doesn't let you test how it interacts with cloudfront though.

@jyn514
Member Author

jyn514 commented Nov 26, 2021

Advantage for the Cache-Control approach: much easier to deploy and reason about.

This is very tempting 😆 it sounds like you're volunteering to do much of the work, which I really appreciate ❤️ but simpler to write also means simpler to review.

How hard would it be to switch between the two ideas at a later time? It sounds like a lot of the work is hooking the service worker up to the Cache API and actually changing the page, which is the same between both, right?

@jsha
Contributor

jsha commented Nov 26, 2021

Switching at any point would be the same work as doing either change from scratch. If we use the Cache-Control: max-age=0, stale-while-revalidate=N approach, it's a one-liner. We don't touch Service Worker or Cache API at all. If we do the Service Worker approach, it's a decent amount of work - and as you say, involves at least one other person learning enough about Service Worker to adequately review. :-)

The thing I worry about with stale-while-revalidate is this:

  • You load /regex/latest/regex on Nov 24. It's serving version 1.0.
  • You load /regex/latest/regex on Nov 26. You know the crate was updated to 2.0 yesterday, renaming a bunch of structs. Because of stale-while-revalidate, your browser shows you version 1.0. That's confusing! Of course, if you reload, you'll get 2.0.
  • If you don't reload (for instance, you don't know about the revision, or don't care, or actively want to look at 1.0 docs), when you click a link to one of the renamed structs, you will get a 404 (because that struct doesn't exist in 2.0).

Of course, now that I write these out I see these are also a problem for the /latest/ change in general. For instance, you could have /latest/ (version 1.0) loaded in your browser when 2.0 is released, and click a link to one of the now-renamed structs.

The problem also exists for versioned URLs. For instance, visit https://docs.rs/rustls/0.19.0/rustls/trait.Session.html and click "Go to latest version" (Session was renamed to Connection in 0.20). I see somebody has already thought of the problem, and that link takes you to a search page across 0.20. That's pretty neat! Maybe that's adequate?

The other problem with stale-while-revalidate is: say you load the root page, see it's outdated, and reload. Then you click to another page you've visited before. That's also outdated. You have to reload that too. It would get frustrating pretty fast.

@jyn514
Member Author

jyn514 commented Nov 26, 2021

I see somebody has already thought of the problem, and that link takes you to a search page across 0.20. That's pretty neat! Maybe that's adequate?

Haha, yeah I spent a while on that :)

Of course, now that I write these out I see these are also a problem for the /latest/ change in general. For instance, you could have /latest/ (version 1.0) loaded in your browser when 2.0 is released, and click a link to one of the now-renamed structs.

Hmm, this should only be a problem if you have the page open for a long time, right? Because (with caching as current, but with #1527) the second you reload the page you'll get the newer version. I think the combination of a page open for a long time + an intervening release + a renamed struct is rare enough that just having search is fine.

You load /regex/latest/regex on Nov 26. You know the crate was updated to 2.0 yesterday, renaming a bunch of structs. Because of stale-while-revalidate, your browser shows you version 1.0. That's confusing! Of course, if you reload, you'll get 2.0.

Yeah, that seems confusing. I'm not sure that "if you reload you'll get 2.0" is true though - don't you need to do a hard refresh to ignore the cache directive? I don't think we should do that for the /latest/ page. It seems ok for pages other than /latest/ though, they should only change if a bug in rustdoc itself was fixed and the crate was rebuilt.

@jyn514
Member Author

jyn514 commented Nov 26, 2021

That said, I'm fairly familiar with service workers from working at Cloudflare so if that sounds fun I say go for it 😁

@jsha
Contributor

jsha commented Nov 26, 2021

I'm not sure that "if you reload you'll get 2.0" is true though - don't you need to do a hard refresh to ignore the cache directive?

With max-age=0, stale-while-revalidate, I think it's true. The first load will serve from cache. During the ~dozen seconds you spend looking at the page, the browser will refresh the cache from origin, so by the time you reload there should be a fresh copy in cache.

it shouldn't be terribly difficult to run a fork of docs.rs somewhere and add Access-Control-Allow-Origin there.

Wouldn't it require a lot of CPU and storage to store all the crates? I'm thinking of something that would exist for a period of months, where we'd invite testers to try using it as their daily driver version of docs.rs, to see what weird cases would come out of real-life browsing patterns.

@jyn514
Member Author

jyn514 commented Nov 26, 2021

During the ~dozen seconds you spend looking at the page, the browser will refresh the cache from origin, so by the time you reload there should be a fresh copy in cache.

Ahh, that makes sense, I didn't realize that's what the directive did.

Wouldn't it require a lot of CPU and storage to store all the crates? I'm thinking of something that would exist for a period of months, where we'd invite testers to try using it as their daily driver version of docs.rs, to see what weird cases would come out of real-life browsing patterns.

I don't see a realistic way to do this. Either we experiment with it in prod (maybe with a feature flag?) or we can write more tests; it's just not feasible to replicate docs.rs at scale.

@syphar
Member

syphar commented Nov 27, 2021

I admit I never worked with this kind of frontend caching, but I'm excited to see it if it works.

Since caching is hard, this feels like there might be edge cases with confusing mixtures of cached and uncached pages (and assets), so IMHO having an (even user-visible) feature flag / testing phase would be a great idea.

Or building a second setup. I mean, having a staging platform is not a terrible idea :)

@jyn514
Member Author

jyn514 commented Nov 27, 2021

Yes, I definitely want to set up a staging server at some point where people can try things out interactively. I just want to set reasonable expectations for it; it's going to end up like staging.crates.io where maybe 5 people a week visit, it won't let us see problems that only appear at scale.

@jsha
Contributor

jsha commented Nov 28, 2021

I just tested stale-while-revalidate, and it does make the page nicely available when the network is offline, at least in Chrome.

Proposal: Let's add Cache-Control: max-age=0, stale-while-revalidate=N for all versioned URLs, but not yet for /latest/ URLs (#1527) since things are a little trickier there. I propose N = 2 months to start.

@jyn514
Member Author

jyn514 commented Nov 28, 2021

Sounds like a plan! :)

@jsha
Contributor

jsha commented Dec 1, 2021

A little hiccup: Iron doesn't seem to support stale-while-revalidate, and doesn't allow setting custom strings for the cache-control header: https://docs.rs/iron/0.6.1/iron/headers/enum.CacheDirective.html

@jyn514
Member Author

jyn514 commented Dec 1, 2021

@jsha does Extension not support custom headers?

Anyway, Iron hasn't had a release in 3 years, so I wouldn't get your hopes up too high. @syphar has been working on and off on switching to Axum.

@jyn514 jyn514 added the S-blocked Status: marked as blocked ❌ on something else such as an RFC or other implementation work. label Dec 1, 2021
@syphar syphar added E-medium Effort: This requires a fair amount of work and removed S-blocked Status: marked as blocked ❌ on something else such as an RFC or other implementation work. labels Oct 24, 2023
@syphar
Member

syphar commented Oct 24, 2023

Note that the Axum migration has been done for some time now.
