
Add guidelines for when to (dis)allow 404 mdn_url values #11291

Closed
ddbeck opened this issue Jun 28, 2021 · 14 comments
Labels
docs ✍️ Issues or pull requests regarding the documentation of this project.

Comments


ddbeck commented Jun 28, 2021

Summary

We ought to have a guideline that will help PR reviewers know when to expect mdn_url to point to a URL that doesn't actually exist, and when they should reject them. This would apply to new data and might provide some pointers toward cleaning up existing data, or automating changes to it.

Background

From @sideshowbarker on #5297 (comment)

But I think what’s actually happened instead is, it’s just been arbitrary — that is, the difference between which non-existent URLs got added and which didn’t comes down to who happened to have written and reviewed the BCD patches that added them, and their individual style/preferences just being different from that of the people whose patches didn’t add any.

Questions that guidelines might help answer

  • Should we ever permit new 404s?
  • Which features should not have an mdn_url right now? Which features should never have one?
  • If we do allow 404s, what features must have an mdn_url?
  • If we don't allow 404s, what should happen when a feature ought to have an mdn_url, but can't because it's missing? What should the amelioration process be?

I don't have answers to these questions, but I feel like any answers would be fine as an interim practice, while we work through this problem.

Related issues

The PR #5297 has some extended discussion about automation and eliminating/fixing 404s with tools support.

ddbeck added the docs ✍️ Issues or pull requests regarding the documentation of this project. label Jun 28, 2021

teoli2003 commented Jun 28, 2021

When creating a new MDN reference page, it is convenient to be able to create the bcd PR (with mdn_url) at the same time as the content PR, that is before the PR on mdn/content lands. The mdn_url will then be a 404 until this happens.

Note that if both repos become strict – that is if mdn/bcd strictly prevents 404s in mdn_url and if mdn/content strictly prevents non-existent browser-compat keys – authors will need to create 3 PRs to get everything set up (1 PR without mdn_url, then 1 content PR, then the mdn_url PR). I think we all want to prevent this burden.

As I don't see any other concrete use of a 404 mdn_url, I think we should forbid them, unless there is an mdn/content PR in review adding the missing page; that is, unless it is a temporary, technical 404. (And if the content PR doesn't get merged in the end, the mdn_url will need to be removed from mdn/bcd.)


ddbeck commented Jun 30, 2021

Note that if both repos become strict – that is if mdn/bcd strictly prevents 404s in mdn_url and if mdn/content strictly prevents non-existent browser-compat keys – authors will need to create 3 PRs to get everything set up (1 PR without mdn_url, then 1 content PR, then the mdn_url PR). I think we all want to prevent this burden.

Yes, this makes sense. It's also consistent with how we've been handling PRs which require content changes on MDN so far (which is, the PR being open on mdn/content is enough to satisfy the requirements to merge on BCD). 👍


foolip commented Jun 30, 2021

cc @vinyldarkscratch


foolip commented Jun 30, 2021

Quoting a chat between me and Vinyl from yesterday:

Vinyl:

(Doing so also gave me the idea to check all of the MDN URLs for API entries, and if anything's a 404, either open an issue to get the content written or remove the BCD entry!)

Philip:

You should run that past sideshowbarker, he’s complained about 404 URLs and probably has a draft PR removing them all
Unfortunately it’s hard to lint for because it should be possible to have PRs for both repos that land around the same time.
That could be solved I think, but that’s why there isn’t a lint for it already

Vinyl:

I've got a PR for a linter that checks for broken links throughout BCD (#5228), but it's admittedly super slow (I'm guessing it's due to a lack of async linting). I'm letting that run in the background while browsing through GitHub issues/PRs to see what comes up. I do remember sideshowbarker working on this issue as well, I'll reach out to him and see if we can coordinate efforts there!

Philip:

I think a lint for this should use an in-repo list of all non-redirect MDN URLs that exist, perhaps filtered to ones that include a BCD table. To make two-sided PRs possible, one could add a manually maintained list of extra allowed URLs, but it would be a bit annoying I think :)

Vinyl:

I agree, especially since then someone would need to go through and clean up those allowed URLs every once in a while. Now you've got me wondering how such a linter could be pulled off effectively... :P

Philip:

You’d just have a cron job that updates the list and removes any unneeded manual entries. The nuisance is having to add to the manual list whenever adding a new URL for a page that doesn’t exist yet, which will be the norm when MDN and BCD cover most existing features


foolip commented Jun 30, 2021

To distill out what I think we should do, it's to disallow 404 mdn_urls, but with a mechanism to get around the problem of BCD+content patches:

  • A generated and auto-updated list of MDN URLs that can be used as mdn_url
  • A manually maintained but auto-pruned list of additional URLs that will soon exist
  • Disallow any URL not on one of those two lists

The auto-updating and auto-pruning would be a scheduled job on GitHub Actions sending PRs to BCD.
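A minimal sketch of how such a lint could work, assuming two checked-in allowlist files kept fresh by the scheduled job (the data and names here are hypothetical, not BCD's actual tooling):

```javascript
// An mdn_url is valid only if it appears in the generated list of live MDN
// pages, or in the manually maintained allowlist of soon-to-exist URLs.
// In the real repo these sets would be loaded from checked-in JSON files
// updated/pruned by a scheduled GitHub Actions job; they are inlined here.
const liveUrls = new Set([
  'https://developer.mozilla.org/docs/Web/API/Blob',
]);
const pendingUrls = new Set([
  'https://developer.mozilla.org/docs/Web/API/SoonToExistAPI', // hypothetical
]);

// Returns null if the URL is allowed, or an error message for the linter.
function lintMdnUrl(mdnUrl) {
  if (liveUrls.has(mdnUrl) || pendingUrls.has(mdnUrl)) {
    return null;
  }
  return `mdn_url is not a live or pending MDN page: ${mdnUrl}`;
}

console.log(lintMdnUrl('https://developer.mozilla.org/docs/Web/API/Blob')); // null
console.log(lintMdnUrl('https://developer.mozilla.org/docs/Web/API/Missing'));
```

The point of the two-set design is that a normal BCD PR never hits the network: the lint is a pure lookup, and all freshness concerns live in the cron job that regenerates the lists.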

dominiccooney (Contributor) commented:

@sideshowbarker can I see the source of the CI job which takes a long time processing MDN 404s?

A generated and auto-updated list of MDN URLs that can be used as mdn_url

@sideshowbarker mentioned MDN sitemaps were not supported, but I just fetched https://developer.mozilla.org/sitemaps/en-us/sitemap.xml.gz and it was updated 2021-06-23 and it seems okay?

@foolip Your proposal sounds good. Could you achieve the same effect with fewer moving parts by using the blame age of the 404ing URL in BCD? For example, 404ing URLs are OK until they're ~3 weeks old or something? You can start the time limit at +Inf and dial it back to work through the backlog of old 404s.


foolip commented Jul 1, 2021

@dominiccooney that sounds interesting! How would one keep track of how long a URL has been 404 in this scenario? Is there a cron job that keeps trying all URLs and adds the 404s to a list with a date, perhaps?


dominiccooney commented Jul 2, 2021

@foolip The naive, stateless way to do it: Fetch the URL. If it 404s, run git blame on that line. If the blamed commit is older than the limit then flag the URL. This doesn't tell you how long the URL has been 404ing, but it sounds like what you need?
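The decision logic of that stateless check can be sketched as a pure function (the grace period and names are hypothetical; fetching the URL and running `git blame` are left out so the rule itself is easy to test):

```javascript
// Given that an mdn_url currently 404s, flag it only if the commit that
// introduced it (which `git blame` would report) is older than a grace
// period. A fresh 404 is assumed to have an mdn/content PR in flight.
const GRACE_PERIOD_MS = 3 * 7 * 24 * 60 * 60 * 1000; // ~3 weeks, hypothetical

function shouldFlag404(blamedCommitDate, now = new Date()) {
  return now - blamedCommitDate > GRACE_PERIOD_MS;
}

const now = new Date('2021-07-02');
console.log(shouldFlag404(new Date('2021-07-01'), now)); // false: added yesterday
console.log(shouldFlag404(new Date('2021-05-01'), now)); // true: two months old
```

As the thread notes, this trades state for network access: the lint must fetch every URL on each run, which is what foolip's in-tree-list proposal avoids.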


foolip commented Jul 2, 2021

Ah, so one unstated constraint I had in mind is to not hit the network while running the lint. Mostly to keep it fast, but also because depending on the network would make it almost certainly flaky.

That's why I was thinking of in-tree lists that are updated by a cron job. Maybe there are other ways to make it fast and reliable though.


ddbeck commented Jul 2, 2021

A manually maintained but auto-pruned list of additional URLs that will soon exist

Hmm. I wonder if the mdn/content "review companion" (example) could be upgraded to produce a new URL manifest as an artifact. Then it might be possible to produce a complete list of actual URLs plus URLs in pending pull requests.

sideshowbarker (Collaborator) commented:

@sideshowbarker can I see the source of the CI job which takes a long time processing MDN 404s?

Yes — but lemme say first that after having spent some time thinking more about the 404s, I think I’m coming around to deciding to quit worrying about them, and so I think I’m probably going to end up deleting or archiving my fork.

And that’s mainly because although I’m still annoyed by the idea of BCD having bad/fake/misleading data, the effects from the 404s for my build times aren’t significant in practice — especially relative to the time it costs me to maintain/update the fork.


Here anyway are the details:

The main source of the CI job that processes the MDN URLs is this:

https://github.com/w3c/mdn-spec-links/blob/master/.browser-compat-data-process.js is the actual script.

You could run it locally by doing:

git clone https://github.com/w3c/mdn-spec-links.git && cd mdn-spec-links && make

That builds using the upstream BCD repo, with all the MDN 404s, rather than my fork without the MDN 404s.

In my local environment, that takes about 12 to 13 minutes to complete. (In the CI environment, it takes maybe a minute less.)

You can then re-run it like this:

make unsafe-dangerously-and-destructively-clean
BCD_REPO=https://github.com/w3c/browser-compat-data.git make -e

That will use my fork, without the MDN 404s, and should take a minute or two less than however long it took in your environment the first time you ran it.

Anyway, it’s at most a two-minute difference. That might bring the build time down to no lower than 10 minutes locally, and no lower than 9 minutes in CI.

And so, as I said at the beginning of this comment: after some thought about this — given that building without the MDN 404s doesn’t provide a huge decrease in build time — I’ve decided to quit worrying about it, and I think I’m probably going to end up deleting or archiving my fork.

A generated and auto-updated list of MDN URLs that can be used as mdn_url

@sideshowbarker mentioned MDN sitemaps were not supported, but I just fetched developer.mozilla.org/sitemaps/en-us/sitemap.xml.gz and it was updated 2021-06-23 and it seems okay?

Yeah that seems very useful as far as getting a sitemap of the production site (and I now vaguely recall finding out from a discussion a few months back that we had that from Yari).

But for CI stuff we ideally don’t want the sitemap from the production site but instead from the state of the local repo the CI is being run against. And I’m not sure how that sitemap file gets built, but I suspect it’s by running yarn build. And in my environment, running yarn build takes fully 10 minutes to complete. So making any CI rely on building that file would not be helpful for trimming down CI times.


foolip commented Jul 13, 2021

Since it has not been stated explicitly in this issue, the main reason I care about 404s in BCD is that I don't enjoy clicking links in compat tables and seeing a 404. https://developer.mozilla.org/en-US/docs/Web/API/PaymentItem#browser_compatibility is an example of this currently.


Elchi3 commented Dec 7, 2021

mdn/yari#5015 proposes that MDN exposes a content/frontmatter inventory which we could query to find out which mdn_urls exist.
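If such an inventory existed, BCD tooling could derive the set of valid mdn_url values directly from it. A sketch, assuming a JSON shape of `{ path, frontmatter }` entries (this shape is an assumption for illustration, not yari's actual output):

```javascript
// Hypothetical content inventory, as the kind of data mdn/yari#5015
// might expose. In practice this would be fetched or checked in, not inlined.
const inventory = [
  { path: '/en-US/docs/Web/API/Blob', frontmatter: { 'browser-compat': 'api.Blob' } },
  { path: '/en-US/docs/Web/API/PaymentItem', frontmatter: {} },
];

// Build the set of mdn_url values that correspond to real pages.
const validMdnUrls = new Set(
  inventory.map((page) => `https://developer.mozilla.org${page.path}`)
);

console.log(validMdnUrls.has('https://developer.mozilla.org/en-US/docs/Web/API/Blob'));   // true
console.log(validMdnUrls.has('https://developer.mozilla.org/en-US/docs/Web/API/Nope'));   // false
```

This would give the "generated and auto-updated list of live URLs" half of the earlier proposal without scraping the production sitemap.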

queengooborg (Collaborator) commented:

Since #23431 has landed, which just flat-out disallows 404 mdn_urls altogether, I'm going to close this as resolved!
