Add guidelines for when to (dis)allow 404 `mdn_url` values
#11291
Comments
When creating a new MDN reference page, it is convenient to be able to create the bcd PR (with Note that if both repos become strict – that is, if mdn/bcd strictly prevents 404s in As I don't see any other concrete use of a 404
Yes, this makes sense. It's also consistent with how we've been handling PRs which require content changes on MDN so far (which is, the PR being open on mdn/content is enough to satisfy the requirements to merge on BCD). 👍
cc @vinyldarkscratch
Quoting a chat between me and Vinyl from yesterday: Vinyl:
Philip:
Vinyl:
Philip:
Vinyl:
Philip:
To distill out what I think we should do, it's to disallow 404 `mdn_url` values.
The auto-updating and auto-pruning would be a scheduled job on GitHub Actions sending PRs to BCD.
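The core of such a pruning pass could be quite small. Here is a rough sketch; the function names and the injected `is_404` check are illustrative rather than actual BCD tooling, and a real scheduled job would plug in an HTTP check (or a cached 404 list) and commit the result as a PR:

```python
def collect_mdn_urls(data, path=()):
    """Recursively yield (path, mdn_url) pairs from a BCD-style JSON tree."""
    if isinstance(data, dict):
        if "mdn_url" in data:
            yield path, data["mdn_url"]
        for key, value in data.items():
            yield from collect_mdn_urls(value, path + (key,))

def prune_404s(data, is_404):
    """Drop mdn_url entries whose URL 404s. The is_404 predicate is
    injected so the check (live HTTP or an in-tree 404 list) can vary."""
    if isinstance(data, dict):
        if "mdn_url" in data and is_404(data["mdn_url"]):
            del data["mdn_url"]
        for value in data.values():
            prune_404s(value, is_404)
    return data
```

The same traversal could serve the auto-updating case by rewriting rather than deleting the value when a redirect target is known.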
@sideshowbarker can I see the source of the CI job which takes a long time processing MDN 404s?
@sideshowbarker mentioned MDN sitemaps were not supported, but I just fetched https://developer.mozilla.org/sitemaps/en-us/sitemap.xml.gz and it was updated 2021-06-23, and it seems okay?

@foolip Your proposal sounds good. Could you achieve the same effect with fewer moving parts by using the blame age of the 404ing URL in BCD? For example, 404ing URLs are OK until they're ~3 weeks old or something? You can start the time limit at +Inf and dial it back to work through the backlog of old 404s.
@dominiccooney that sounds interesting! How would one keep track of how long a URL has been 404ing in this scenario? Is there a cron job that keeps trying all URLs and adds the 404s to a list with a date, perhaps?
@foolip The naive, stateless way to do it: Fetch the URL. If it 404s, run git blame on that line. If the blamed commit is older than the limit, then flag the URL. This doesn't tell you how long the URL has been 404ing, but it sounds like what you need?
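The decision rule in that stateless approach might look something like the sketch below; `GRACE_PERIOD` and the function name are made up for illustration, and the git plumbing is only hinted at in a comment:

```python
from datetime import datetime, timedelta, timezone

# Assumed grace period; per the suggestion above, it could start at
# +Inf and be dialed back to work through the backlog of old 404s.
GRACE_PERIOD = timedelta(weeks=3)

def should_flag(url_404s, blamed_commit_date, now, grace=GRACE_PERIOD):
    """Flag a URL only if it 404s AND the line that introduced it is
    older than the grace period."""
    return url_404s and (now - blamed_commit_date) > grace

# The blame date would come from something like:
#   git blame -L <line>,<line> --porcelain <file>
# and parsing the author-time field of the blamed commit.
```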
Ah, so one unstated constraint I had in mind is to not hit the network while running the lint. Mostly to keep it fast, but also because depending on the network would make it almost certainly flaky. That's why I was thinking of in-tree lists that are updated by a cron job. Maybe there are other ways to make it fast and reliable though.
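One way the in-tree list idea could work, sketched below. The file shape and function names are assumptions, not an existing BCD convention: the cron job merges freshly observed 404s while keeping each URL's first-seen date, and the lint then runs entirely offline against that file.

```python
from datetime import date

def update_404_list(known, current_404s, today):
    """Cron-job side: merge today's observed 404s into the in-tree list
    (a url -> first-seen ISO date mapping), keeping each URL's earliest
    first-seen date and dropping URLs that no longer 404."""
    return {url: known.get(url, today) for url in current_404s}

def lint_offline(urls_in_tree, known_404s, today, max_age_days=21):
    """Lint side: no network access; just flag URLs that have been on
    the 404 list longer than the allowance."""
    stale = []
    for url in urls_in_tree:
        first_seen = known_404s.get(url)
        if first_seen and (today - date.fromisoformat(first_seen)).days > max_age_days:
            stale.append(url)
    return stale
```

Because `update_404_list` rebuilds the mapping from the current set of 404s, URLs that start resolving again fall off the list automatically.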
Hmm. I wonder if the mdn/content "review companion" (example) could be upgraded to produce a new URL manifest as an artifact. Then it might be possible to produce a complete list of actual URLs and URLs in pending pull requests.
Yes — but lemme say first that after having spent some time thinking more about the 404s, I think I'm coming around to deciding to quit worrying about them, and so I think I'm probably going to end up deleting or archiving my fork. And that's mainly because although I'm still annoyed by the idea of BCD having bad/fake/misleading data, the effects from the 404s on my build times aren't significant in practice — especially relative to the time it costs me to maintain/update the fork. Here anyway are the details: The actual script for the CI job that processes the MDN URLs is https://github.com/w3c/mdn-spec-links/blob/master/.browser-compat-data-process.js. You could run it locally by doing:
That builds using the upstream BCD repo, with all the MDN 404s, rather than my fork without the MDN 404s. In my local environment, that takes about 12 to 13 minutes to complete. (In the CI environment, it takes maybe a minute less.) You can then re-run it like this:
That will use my fork, without the MDN 404s, and should take a minute or two less than however long it took in your environment the first time you ran it. Anyway, it's at most a two-minute difference. That might bring the build time down to no lower than 10 minutes locally, and no lower than 9 minutes in CI. And so, as I said at the beginning of this comment: after some thought about this — given that building without the MDN 404s doesn't provide a huge decrease in build time — I've decided to quit worrying about it, and I think I'm probably going to end up deleting or archiving my fork.
Yeah, that seems very useful as far as getting a sitemap of the production site (and I now vaguely recall finding out from a discussion a few months back that we had that from Yari). But for CI stuff we ideally don't want the sitemap from the production site but instead from the state of the local repo the CI's being run against. And I'm not sure how that sitemap file gets built, but I suspect it's by running
Since it has not been stated explicitly in this issue, the main reason I care about 404s in BCD is that I don't enjoy clicking links in compat tables and seeing a 404. https://developer.mozilla.org/en-US/docs/Web/API/PaymentItem#browser_compatibility is an example of this currently.
mdn/yari#5015 proposes that MDN expose a content/frontmatter inventory which we could query to find out which mdn_urls exist.
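If such an inventory existed, the lint could check `mdn_url` values against it without fetching anything. A sketch, assuming the inventory is a list of records with a `url` path field; the actual format proposed in mdn/yari#5015 may differ:

```python
# Assumed prefix and inventory shape, for illustration only.
PREFIX = "https://developer.mozilla.org"

def build_slug_set(inventory):
    """inventory: iterable of records with a 'url' path field, e.g. a
    published JSON artifact listing every content page."""
    return {item["url"] for item in inventory}

def missing_urls(mdn_urls, slugs):
    """Return mdn_url values with no corresponding content page,
    ignoring any #fragment (BCD links often point at
    #browser_compatibility anchors)."""
    return [u for u in mdn_urls
            if u[len(PREFIX):].split("#")[0] not in slugs]
```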
Since #23431 has landed, which just flat-out disallows 404 `mdn_url` values
Summary

We ought to have a guideline that will help PR reviewers know when to expect an `mdn_url` to point to a URL that doesn't actually exist, and when they should reject it. This would apply to new data and might provide some pointers toward cleaning up existing data, or automating changes to it.

Background

From @sideshowbarker on #5297 (comment)
Questions that guidelines might help answer

- What features should have an `mdn_url`? What features should never have an `mdn_url`?
- When is it OK for a feature to have a 404ing `mdn_url`?
- What should we do when a feature should have an `mdn_url`, but can't because it's missing? What should the amelioration process be?

I don't have answers to these questions, but I feel like any answers would be fine as an interim practice, while we work through this problem.
Related issues
The PR #5297 has some extended discussion about automation and eliminating/fixing 404s with tools support.