Migration 007: Remove all mdn_url 404s #5297

sideshowbarker · 2019-12-07T01:20:41Z

This pull requests adds a migrations script that checks all mdn_url values in the repository, and removes any that result in a 404 response.

ghost · 2019-12-07T01:20:47Z

🤖 Note: This PR contains more 300 or more files. Some automatic labels may not have been applied.

sideshowbarker · 2019-12-07T01:29:51Z

https://github.com/w3c/browser-compat-data/blob/remove-dead-mdn-urls/remove-dead-mdn-urls.py is the script I ran to create this patch. The set of mdn_url 404s it finds is consistent with the set found by the linter in #5228 (modulo three false positives that linter finds, due to a minor bug it currently has that causes it to not handle URLs containing *).

sideshowbarker · 2019-12-07T01:30:43Z

cc @vinyldarkscratch and @bershanskiy

bershanskiy · 2019-12-07T01:41:38Z

339 files, 1200+ lines changed

Wow... this is huge. However, since this script is in Python, may be, it would make more sense to fix the bug in JS script from #5228 and then just run that and follow the "bulk update" process?

queengooborg · 2019-12-07T22:30:56Z

Seeing this PR got me wondering whether removing the MDN URLs is the right way to do this. I applied a couple of MDN URL removals a while back, however now that I’m thinking about it, I realize that we’ll just have to re-link it when we get the docs written, and not having a linter or something telling us “hey, this doc page doesn’t exist” may further hide the real issue that we just need to write up those docs...

Pinging @Elchi3 and @chrisdavidmills to assist with a decision on this one.

sideshowbarker · 2019-12-07T23:39:46Z

… I realize that we’ll just have to re-link it when we get the docs written

It seems like adding a new mdn_url to BCD ought to be part of the process that should happen anyway when new docs get written.

and not having a linter or something telling us “hey, this doc page doesn’t exist” may further hide the real issue that we just need to write up those docs...

I’d think the absence of an mdn_url for a BCD feature is a good indicator docs need to be written for it.

As far as linting for it, if a BCD feature lacks an mdn_url, the linter could anyway just construct a URL for what the mdn_url should be, and then check to see if that URL is 404 or not. (And if the constructed URL is not 404, then the linter message could be, “mdn_url with [url] needs to be added for this feature”.)

So it seems having an mdn_url in BCD for a non-existing MDN article doesn’t really buy us much. It just gives the misleading appearance we have more BCD features documented in MDN than we actually do.

There’s also a consistency issue with the existing state of things: There are some undocumented BCD features that have no mdn_url, but some others that do. If we are to keep each existing mdn_url that’s a 404, then it seems like for consistency, we’d want to add an mdn_url (with the URL for a not-yet-existing MDN article) to each BCD feature that currently lacks an mdn_url.

So I’d think there’s a bulk update that should rightly happen here regardless: We should either identify each mdn_url that’s 404, and remove it — or else, add an mdn_url to each BCD feature that lacks one.

But the idea of bulk-adding a bunch of mdn_url URLs to BCD that are all just 404 doesn’t seem ideal.

Along with the fact it gives the misleading appearance we have more BCD features documented than we actually do, there’s also a cost to downstream consumers of the data and to the MDN infrastructure. As one of those consumers, I can give evidence that it costs me, because my consuming code makes a request to MDN for each mdn_url it finds in BCD. (My code tries to load each MDN article it finds an mdn_url for, and then parses the Specifications section of the article to get any spec URLs the article has.)

My code runs at regular intervals (on the order of once a week or so, right now) — so that if any new spec URLs are added to MDN articles, or any spec URLs change, I can update my downstream data. So each 404 mdn_url has both a cost to me on my side, as well as the cost of the MDN infrastructure regularly receiving spurious requests for MDN articles that don’t exist.

In practice, I deal with it now by eliminating that cost through applying the patch in this PR to my BCD fork. But I’d rather not need to maintain a patch if the mdn_url 404s could be removed in upstream BCD itself.

All that said, if we were to add a spec_url to all BCD features that have one (as we already did for all BCD features in the javascript subtree), then my code wouldn’t actually need to regularly scrape MDN any more to get the spec URLs. And I wouldn’t even need to maintain a BCD fork any more — because essentially the only difference my fork has is the addition of those spec_url values —

https://github.com/w3c/browser-compat-data

Elchi3 · 2019-12-09T14:03:00Z

Pinging @Elchi3 and @chrisdavidmills to assist with a decision on this one.

Thanks Vinyl!

I agree with what @sideshowbarker says and I would like to work on further integrating his work into the main repo, so that there is no need to fork.

I would like to see a (n automated) way in which we would be able to add in mdn_urls when the pages are created. Removing them now is probably the right call, but we need to have them back when they're available.
I think we've worked on removing 404 links on the wiki itself, too, because Google penalizes us for linking to too many 404 pages or something.

Also, cc'ing @jpmedley who sounded like he has thoughts on this.

And if @chrisdavidmills has thoughts, I'm happy to hear them, too :)

sideshowbarker · 2019-12-10T03:43:28Z

339 files, 1200+ lines changed

Wow... this is huge. … follow the "bulk update" process?

To be clear, is the “bulk update” process the same as what’s documented at https://github.com/mdn/browser-compat-data/blob/master/docs/migrations.md#migrations?

However, since this script is in Python

I can port the https://github.com/w3c/browser-compat-data/blob/remove-dead-mdn-urls/remove-dead-mdn-urls.py script to JavaScript (and move it to the /scripts/migrations subdirectory).

may be, it would make more sense to fix the bug in JS script from #5228 and then just run that

Well, c7f0e38 fixed that bug — but:

at Check for dead/broken URLs (disabled by default) #5228 (comment) @vinyldarkscratch says “not like this PR will be merged anytime soon since it builds off of another”
Check for dead/broken URLs (disabled by default) #5228 anyway doesn’t actually replace any 404s (instead it adds a lint check that reports them)

Given the above, since per #5297 (comment), it seems like we are getting close to agreement on going ahead to actually remove the mdn_url 404s, then I will go ahead and update the patch in this PR to:

use a JavaScript script (rather than a Python script) to automate the mdn_url removals
add the script under the /scripts/migrations subdirectory
follow the other steps in the documented Migrations process

chrisdavidmills · 2019-12-10T08:05:43Z

I am ok with this. The above arguments make sense.

queengooborg · 2019-12-12T02:52:41Z

Thanks, @Elchi3 and @chrisdavidmills! Alrighty then, let's proceed to remove 404-ing MDN URLs. 😉

sideshowbarker · 2020-03-04T11:47:36Z

@Elchi3, @ddbeck This PR is ready for review as a a proposed migration per the guidelines at https://github.com/mdn/browser-compat-data/blob/master/docs/migrations.md#migrations. I’ve (re)tested the migration script and confirmed that it produces the expected results, without any false positives.

Elchi3 · 2020-03-05T12:25:59Z

hey @sideshowbarker, I will try to schedule a review for this in an upcoming sprint. Thanks for your patience.

sideshowbarker · 2020-03-05T14:13:49Z

Hi @Elchi3, I understand, and no rush; I just hadn’t heard from anybody on this, so I had just been hoping for an update, which you’ve now given me — so, thanks, and I’ll just wait to hear back from you when this makes its way up the queue

sideshowbarker · 2021-06-22T22:54:50Z

We currently have 1170 mdn_url values in BCD that are 404s; see w3c@HEAD.

But that number continues to grow — because basically every week, we continue to add new mdn_url values to BCD for MDN articles that don’t actually exist.

So I’m closing this, since I guess it’s time for me to give up on the idea that we’ll agree to put a policy in place upstream to stop doing that. But that means I’ll need to keep maintaining the W3C fork of BCD at https://github.com/w3c/browser-compat-data so that there’s a version of BCD somewhere without all the mdn_url 404s.

Elchi3 · 2021-06-23T07:53:07Z

Thanks Mike! Can you put the 404 URLs in a gist or somewhere where we can take a better look at them?
I wonder if we can cluster them to identify content gaps. It seems like there are a lot of 404 of HTML* APIs and WebDriver, for example. So a follow up here could be to file issues like:

"Document WebDriver APIs (xx missing pages)"
"Document HTML DOM APIs (yy missing pages)"

Does that make sense to you?

sideshowbarker · 2021-06-23T08:37:57Z

Thanks Mike! Can you put the 404 URLs in a gist or somewhere where we can take a better look at them?

The are all in the HEAD commit of https://github.com/w3c/browser-compat-data — which doesn’t always have the same hash, because the way I manage that repo is to amend and force-push with that same commit at the HEAD.

So https://github.com/w3c/browser-compat-data/commit/HEAD.diff is the stable link that’ll always give the current diff of all removed mdn_url values.

And so to bash out from that just a sorted list of the URLs themselves, something like the following should always work —

curl -s https://github.com/w3c/browser-compat-data/commit/HEAD.diff | grep mdn_url | cut -d '"' -f4- \
    | rev | cut -c 3- | rev | sort | uniq

I wonder if we can cluster them to identify content gaps. It seems like there are a lot of 404 of HTML* APIs and WebDriver, for example. So a follow up here could be to file issues like:

"Document WebDriver APIs (xx missing pages)"

"Document HTML DOM APIs (yy missing pages)"

Does that make sense to you?

I guess I wouldn’t object to somebody doing it that way — if they were motivated to take time to look through the list and try do some analysis and identify some patterns and raise issues.

But I can say that personally my motivation for doing that is so non-existent that I have never even once actually looked through the list. I guess I have zero curiosity about how any of those URLs got into the data to begin with — and so, as far as identifying content gaps, IMHO the right way to do that would be this:

Start by just removing all those MDN URLs that are 404s
Write some automation that spits out the complete list of features that yet have no mdn_url.
Do analysis on that complete set of missing URLs, to identify overall patterns of areas we’re lacking documentation for, and then raise issues based on that.

To instead do some similar steps based just on the starting point of looking at the subset of features for which somebody at some time for some unknown reason chose to add an mdn_url for an article that doesn’t exist — that does not seem to me personally at least like the approach that’d be the best use of anybody’s time.

I think among the flaws in doing it that way are that it assumes there was sound logic and rationale to those particular URLs getting added when others weren’t.

But I think what’s actually happened instead is, it’s just been arbitrary — that is, the difference between which non-existent URLs got added and which didn’t comes down to who happened to have who written and reviewed the BCD patches that added them, and their individual style/preferences just being different from that of the people whose patches didn’t add any.

teoli2003 · 2021-06-23T08:51:37Z

Independently of what Mike is proposing, I see that HTML*Element returns 285 missing pages (and likely more have not been catched if they don't have the mdn_url set).

Given that it is unlikely that contracting writers will be commissionned to work on these old (but not outdated) APIs, I'm pondering if it can be a good crowdsourced activity. A step toward MDN being feature-complete, with the side effect of solving 25% of the problem here (even if it is not the right canonical way to solve it).

A lot of these pages are likely suitable for beginner writers, and with our review process, we should be able to ensure the quality (which wasn't possible with the wiki).

Elchi3 · 2021-06-23T08:56:04Z

Good point, Mike, I forgot there are BCD features that have no mdn_url. However, I think certain sub features would never get an mdn_url. So, the steps you list make sense to me, just adding a filter step.

Remove 404 URLs
List features that have no mdn_url
Filter that list to remove sub features that generally don't require a dedicated MDN doc.
Analyse and raise issues for patterns/clusters of documentation that is missing.

Also agree with @teoli2003 about crowd sourcing after a cluster has been identified.

teoli2003 · 2021-06-23T09:13:53Z

(I'm all for removing the 404 URLs, btw.)

sideshowbarker · 2021-06-23T10:39:29Z

So, the steps you list make sense to me, just adding a filter step.

Remove 404 URLs

List features that have no mdn_url

OK, as far as step 2 there, I already have some code written into my spec_url-checking script that attempts to do that. You can try it like this:

curl -s -O https://raw.githubusercontent.com/w3c/mdn-spec-links/master/.check-spec-urls.js \
    && node .check-spec-urls.js 2>&1 | tee LOG && grep "no mdn_url" LOG

When I run that from my fork, it lists 2703 features. When I run it from the upstream repo, it lists 1801 features.

Filter that list to remove sub features that generally don't require a dedicated MDN doc.

I agree any automation needs to do that in order to be useful. But I’m not sure how to do that programmatically.

Whether a subfeature of a feature is one that doesn’t require a dedicated MDN doc seems to depend on what kind of feature it’s a subfeature of, and which data file it’s in and what all else is in that data, and how it’s structured — and also how the MDN docs for it are structured.

But there are probably some clever ways to approach it that would narrow things down a bit.

sideshowbarker · 2021-06-24T17:18:30Z

Whether a subfeature of a feature is one that doesn’t require a dedicated MDN doc seems to depend on what kind of feature it’s a subfeature of, and which data file it’s in and what all else is in that data, and how it’s structured — and also how the MDN docs for it are structured.

But there are probably some clever ways to approach it that would narrow things down a bit.

This discussion got me remembering that could help with the above need is: We would add a new key name to BCD to explicitly indicate the feature or subfeature type — with the possible values including, for example:

interface
method
property
element
attribute
javascript-object
javascript-operator
http-header
http-method
http-status

I guess some (or most) of those we could determine heuristically, based on the (sub)directory name and filename. But in my experience, when writing code to walk a filesystem tree and do JSON parsing of file contents, you really would rather not need to keep state information about what the current directory name happens to be, or what the current filename happens to be — instead, you really want the feature type information to be an explicit part of the data itself.

(And incidentally, if we did end up adding this, the type field should be added as a child of the __compat key, rather than as a sibling of __compat key. The reason I say that is for a reason similar to why it’s preferable not to need to keep state info about directory names and filenames: when you run a JSON parser with a reviver function, that returns the keys in reverse tree order — it ascends through he keys rather than descending through the keys. And so unless the type field is at the same level as the rest of the useful keys you’re doing stuff with, you’d have to end up keeping some state and re-checking it as you ascend.)

Elchi3 · 2021-06-28T09:01:48Z

Thanks Mike. I think over in mdn/content we have the idea of adding "page-types" to the files which would pretty much be what you describe here. And I think the mdn/content front matter/markdown files will contain more interesting data points in the future, so I suspect more stuff would rather live in mdn/content than in BCD. So, my sense is that we need to make sure BCD maps nicely into mdn/content rather than putting more data into BCD that isn't directly compat data. Generally, I agree though that such type classification is very useful.

ddbeck · 2021-06-28T17:38:20Z

I'm not exactly sure what the next step should be for removing/changing/fixing 404s, but this caught my attention:

But I think what’s actually happened instead is, it’s just been arbitrary — that is, the difference between which non-existent URLs got added and which didn’t comes down to who happened to have who written and reviewed the BCD patches that added them, and their individual style/preferences just being different from that of the people whose patches didn’t add any.

We can stop making things worse in this regard. I've opened #11291 to discuss what data guidelines we might have as an interim measure, to make this less inconsistent going forward. That would be a good place to settle an important question right now: should we allow new 404s to appear in BCD?

sideshowbarker requested review from Elchi3 and rebloor as code owners December 7, 2019 01:20

sideshowbarker changed the title ~~Remove dead mdn urls~~ Migration: Remove dead mdn urls Dec 10, 2019

Elchi3 removed the request for review from rebloor January 10, 2020 13:16

Elchi3 mentioned this pull request Jan 27, 2020

MediaStream API: onactive has invalid link Fyrd/caniuse#5248

Open

sideshowbarker mentioned this pull request Feb 28, 2020

Add BCD for TextDecoderStream/TextEncoderStream #5764

Merged

bershanskiy mentioned this pull request Mar 8, 2020

Linter for MDN URL structure #5273

Closed

5 tasks

ddbeck mentioned this pull request Nov 24, 2020

Rename api.HTMLElement.formEncType -> formEnctype #7471

Merged

sideshowbarker mentioned this pull request Dec 8, 2020

Add missing URLs. #7352

Merged

ddbeck mentioned this pull request Dec 17, 2020

Issue with "HTMLIFrameElement": … mdn/content#328

Closed

queengooborg added the rebase needed 🚧 label Feb 13, 2021

Elchi3 removed their request for review February 18, 2021 17:17

Base automatically changed from master to main March 24, 2021 12:53

sideshowbarker closed this Jun 22, 2021

sideshowbarker deleted the remove-dead-mdn-urls branch June 22, 2021 22:54

ddbeck mentioned this pull request Jun 28, 2021

Add guidelines for when to (dis)allow 404 mdn_url values #11291

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Migration 007: Remove all mdn_url 404s #5297

Migration 007: Remove all mdn_url 404s #5297

sideshowbarker commented Dec 7, 2019 •

edited

Loading

ghost commented Dec 7, 2019

sideshowbarker commented Dec 7, 2019

sideshowbarker commented Dec 7, 2019

bershanskiy commented Dec 7, 2019

queengooborg commented Dec 7, 2019

sideshowbarker commented Dec 7, 2019 •

edited

Loading

Elchi3 commented Dec 9, 2019

sideshowbarker commented Dec 10, 2019

chrisdavidmills commented Dec 10, 2019

queengooborg commented Dec 12, 2019

sideshowbarker commented Mar 4, 2020

Elchi3 commented Mar 5, 2020

sideshowbarker commented Mar 5, 2020

sideshowbarker commented Jun 22, 2021 •

edited

Loading

Elchi3 commented Jun 23, 2021

sideshowbarker commented Jun 23, 2021

teoli2003 commented Jun 23, 2021 •

edited

Loading

Elchi3 commented Jun 23, 2021

teoli2003 commented Jun 23, 2021

sideshowbarker commented Jun 23, 2021 •

edited

Loading

sideshowbarker commented Jun 24, 2021

Elchi3 commented Jun 28, 2021

ddbeck commented Jun 28, 2021

Migration 007: Remove all mdn_url 404s #5297

Migration 007: Remove all mdn_url 404s #5297

Conversation

sideshowbarker commented Dec 7, 2019 • edited Loading

ghost commented Dec 7, 2019

sideshowbarker commented Dec 7, 2019

sideshowbarker commented Dec 7, 2019

bershanskiy commented Dec 7, 2019

queengooborg commented Dec 7, 2019

sideshowbarker commented Dec 7, 2019 • edited Loading

Elchi3 commented Dec 9, 2019

sideshowbarker commented Dec 10, 2019

chrisdavidmills commented Dec 10, 2019

queengooborg commented Dec 12, 2019

sideshowbarker commented Mar 4, 2020

Elchi3 commented Mar 5, 2020

sideshowbarker commented Mar 5, 2020

sideshowbarker commented Jun 22, 2021 • edited Loading

Elchi3 commented Jun 23, 2021

sideshowbarker commented Jun 23, 2021

teoli2003 commented Jun 23, 2021 • edited Loading

Elchi3 commented Jun 23, 2021

teoli2003 commented Jun 23, 2021

sideshowbarker commented Jun 23, 2021 • edited Loading

sideshowbarker commented Jun 24, 2021

Elchi3 commented Jun 28, 2021

ddbeck commented Jun 28, 2021

sideshowbarker commented Dec 7, 2019 •

edited

Loading

sideshowbarker commented Dec 7, 2019 •

edited

Loading

sideshowbarker commented Jun 22, 2021 •

edited

Loading

teoli2003 commented Jun 23, 2021 •

edited

Loading

sideshowbarker commented Jun 23, 2021 •

edited

Loading