
Migration 007: Remove all mdn_url 404s #5297

Closed
wants to merge 1 commit into from


sideshowbarker
Collaborator

@sideshowbarker sideshowbarker commented Dec 7, 2019

This pull request adds a migration script that checks all mdn_url values in the repository and removes any that result in a 404 response.

@ghost

ghost commented Dec 7, 2019

🤖 Note: This PR contains 300 or more files. Some automatic labels may not have been applied.

@ghost ghost added data:api 🐇 Compat data for Web APIs. https://developer.mozilla.org/docs/Web/API data:css 🎨 Compat data for CSS features. https://developer.mozilla.org/docs/Web/CSS data:html 📄 Compat data for HTML elements. https://developer.mozilla.org/docs/Web/HTML data:http 🚠 Compat data for HTTP features. https://developer.mozilla.org/docs/Web/HTTP data:js 📟 Compat data for JS/ECMAScript features. https://developer.mozilla.org/docs/Web/JavaScript data:svg 🖌️ Compat data for SVG features. https://developer.mozilla.org/docs/Web/SVG data:webdriver 🏎️ Compat data for WebDriver features. https://developer.mozilla.org/docs/Web/WebDriver data:webext 🎲 Compat data for Browser Extensions. https://developer.mozilla.org/Add-ons/WebExtensions labels Dec 7, 2019
@sideshowbarker
Collaborator Author

https://github.com/w3c/browser-compat-data/blob/remove-dead-mdn-urls/remove-dead-mdn-urls.py is the script I ran to create this patch. The set of mdn_url 404s it finds is consistent with the set found by the linter in #5228 (modulo three false positives that linter finds, due to a minor bug it currently has that causes it to not handle URLs containing *).

@sideshowbarker
Collaborator Author

cc @vinyldarkscratch and @bershanskiy

@bershanskiy
Contributor

339 files, 1200+ lines changed

Wow... this is huge. However, since this script is in Python, maybe it would make more sense to fix the bug in the JS script from #5228, then just run that and follow the "bulk update" process?

@queengooborg
Collaborator

Seeing this PR got me wondering whether removing the MDN URLs is the right way to do this. I applied a couple of MDN URL removals a while back; however, now that I’m thinking about it, I realize that we’ll just have to re-link them when we get the docs written, and not having a linter or something telling us “hey, this doc page doesn’t exist” may further hide the real issue that we just need to write up those docs...

Pinging @Elchi3 and @chrisdavidmills to assist with a decision on this one.

@queengooborg queengooborg added bulk_update 📦 An update to a mass amount of data, or scripts/linters related to such changes and removed data:api 🐇 Compat data for Web APIs. https://developer.mozilla.org/docs/Web/API data:css 🎨 Compat data for CSS features. https://developer.mozilla.org/docs/Web/CSS data:html 📄 Compat data for HTML elements. https://developer.mozilla.org/docs/Web/HTML data:http 🚠 Compat data for HTTP features. https://developer.mozilla.org/docs/Web/HTTP data:js 📟 Compat data for JS/ECMAScript features. https://developer.mozilla.org/docs/Web/JavaScript data:svg 🖌️ Compat data for SVG features. https://developer.mozilla.org/docs/Web/SVG data:webdriver 🏎️ Compat data for WebDriver features. https://developer.mozilla.org/docs/Web/WebDriver data:webext 🎲 Compat data for Browser Extensions. https://developer.mozilla.org/Add-ons/WebExtensions labels Dec 7, 2019
@sideshowbarker
Collaborator Author

sideshowbarker commented Dec 7, 2019

… I realize that we’ll just have to re-link it when we get the docs written

It seems like adding a new mdn_url to BCD ought to be part of the process that should happen anyway when new docs get written.

and not having a linter or something telling us “hey, this doc page doesn’t exist” may further hide the real issue that we just need to write up those docs...

I’d think the absence of an mdn_url for a BCD feature is a good indicator docs need to be written for it.

As far as linting for it: if a BCD feature lacks an mdn_url, the linter could in any case just construct a URL for what the mdn_url should be, and then check whether that URL is a 404 or not. (And if the constructed URL is not a 404, then the linter message could be, “mdn_url with [url] needs to be added for this feature”.)
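A rough sketch of that check, assuming a hypothetical constructMdnUrl() helper that maps a BCD feature path to the URL its mdn_url would be expected to have:

// Sketch only: flag features that lack an mdn_url but whose expected
// MDN article already exists (i.e. the constructed URL is not a 404).
const https = require('https');

function checkMissingMdnUrl(featurePath, compat) {
  if (compat.mdn_url) return;
  const candidate = constructMdnUrl(featurePath); // hypothetical helper
  const req = https.request(candidate, { method: 'HEAD' }, (res) => {
    res.resume();
    if (res.statusCode !== 404) {
      console.log(`mdn_url with ${candidate} needs to be added for ${featurePath}`);
    }
  });
  req.end();
}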

So it seems having an mdn_url in BCD for a non-existing MDN article doesn’t really buy us much. It just gives the misleading appearance we have more BCD features documented in MDN than we actually do.

There’s also a consistency issue with the existing state of things: There are some undocumented BCD features that have no mdn_url, but some others that do. If we are to keep each existing mdn_url that’s a 404, then it seems like for consistency, we’d want to add an mdn_url (with the URL for a not-yet-existing MDN article) to each BCD feature that currently lacks an mdn_url.

So I’d think there’s a bulk update that should rightly happen here regardless: We should either identify each mdn_url that’s 404, and remove it — or else, add an mdn_url to each BCD feature that lacks one.

But the idea of bulk-adding a bunch of mdn_url URLs to BCD that are all just 404 doesn’t seem ideal.

Along with the fact that it gives the misleading appearance that we have more BCD features documented than we actually do, there’s also a cost to downstream consumers of the data and to the MDN infrastructure. As one of those consumers, I can attest that it costs me, because my consuming code makes a request to MDN for each mdn_url it finds in BCD. (My code tries to load each MDN article it finds an mdn_url for, and then parses the Specifications section of the article to get any spec URLs the article has.)

My code runs at regular intervals (on the order of once a week or so, right now) — so that if any new spec URLs are added to MDN articles, or any spec URLs change, I can update my downstream data. So each 404 mdn_url has a cost to me on my side, as well as a cost to the MDN infrastructure, which regularly receives spurious requests for MDN articles that don’t exist.
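(For reference, a stripped-down sketch of that per-article loop; my real script is more involved, and the parsing below is purely illustrative.)

// Illustrative only: request each mdn_url and pull spec links out of the
// article; every mdn_url that 404s is a wasted request on both ends.
const https = require('https');

function fetchSpecUrls(mdnUrl, callback) {
  https.get(mdnUrl, (res) => {
    if (res.statusCode === 404) {
      res.resume();
      return callback(mdnUrl, []); // spurious request; nothing to parse
    }
    let html = '';
    res.on('data', (chunk) => (html += chunk));
    res.on('end', () => {
      // naive extraction of spec-looking links; the real code scopes this
      // to the article's Specifications section
      const specUrls = [...html.matchAll(/href="(https:\/\/[^"]+)"/g)]
        .map((m) => m[1])
        .filter((url) => /w3\.org|whatwg\.org|tc39\.es/.test(url));
      callback(mdnUrl, specUrls);
    });
  });
}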

In practice, I deal with it now by eliminating that cost through applying the patch in this PR to my BCD fork. But I’d rather not need to maintain a patch if the mdn_url 404s could be removed in upstream BCD itself.

All that said, if we were to add a spec_url to all BCD features that have one (as we already did for all BCD features in the javascript subtree), then my code wouldn’t actually need to regularly scrape MDN any more to get the spec URLs. And I wouldn’t even need to maintain a BCD fork any more — because essentially the only difference my fork has is the addition of those spec_url values —

https://github.com/w3c/browser-compat-data

@Elchi3
Member

Elchi3 commented Dec 9, 2019

Pinging @Elchi3 and @chrisdavidmills to assist with a decision on this one.

Thanks Vinyl!

I agree with what @sideshowbarker says and I would like to work on further integrating his work into the main repo, so that there is no need to fork.

I would like to see an (automated) way in which we would be able to add in mdn_urls when the pages are created. Removing them now is probably the right call, but we need to have them back when they're available.
I think we've worked on removing 404 links on the wiki itself, too, because Google penalizes us for linking to too many 404 pages or something.

Also, cc'ing @jpmedley who sounded like he has thoughts on this.

And if @chrisdavidmills has thoughts, I'm happy to hear them, too :)

@sideshowbarker
Collaborator Author

339 files, 1200+ lines changed

Wow... this is huge. … follow the "bulk update" process?

To be clear, is the “bulk update” process the same as what’s documented at https://github.com/mdn/browser-compat-data/blob/master/docs/migrations.md#migrations?

However, since this script is in Python

I can port the https://github.com/w3c/browser-compat-data/blob/remove-dead-mdn-urls/remove-dead-mdn-urls.py script to JavaScript (and move it to the /scripts/migrations subdirectory).

may be, it would make more sense to fix the bug in JS script from #5228 and then just run that

Well, c7f0e38 fixed that bug — but:

Given the above, and since per #5297 (comment) it seems like we are getting close to agreement on going ahead and actually removing the mdn_url 404s, I will update the patch in this PR to:

  • use a JavaScript script (rather than a Python script) to automate the mdn_url removals (a rough sketch of such a script follows after this list)
  • add the script under the /scripts/migrations subdirectory
  • follow the other steps in the documented Migrations process
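For illustration, the core of that JavaScript migration might look roughly like this (a sketch only, not the final scripts/migrations code):

// Sketch: walk a BCD JSON file, HEAD-request every mdn_url it contains,
// delete the ones that come back 404, and rewrite the file.
const fs = require('fs');
const https = require('https');

function headStatus(url) {
  return new Promise((resolve) => {
    const req = https.request(url, { method: 'HEAD' }, (res) => {
      res.resume();
      resolve(res.statusCode);
    });
    req.end();
  });
}

async function removeDeadMdnUrls(file) {
  const data = JSON.parse(fs.readFileSync(file, 'utf8'));
  const queue = [data];
  while (queue.length) {
    const node = queue.pop();
    for (const value of Object.values(node)) {
      if (!value || typeof value !== 'object') continue;
      if (value.mdn_url && (await headStatus(value.mdn_url)) === 404) {
        delete value.mdn_url;
      }
      queue.push(value);
    }
  }
  fs.writeFileSync(file, JSON.stringify(data, null, 2) + '\n');
}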

@sideshowbarker sideshowbarker changed the title Remove dead mdn urls Migration: Remove dead mdn urls Dec 10, 2019
@chrisdavidmills
Collaborator

I am ok with this. The above arguments make sense.

@queengooborg
Collaborator

Thanks, @Elchi3 and @chrisdavidmills! Alrighty then, let's proceed to remove 404-ing MDN URLs. 😉

@queengooborg queengooborg removed data:js 📟 Compat data for JS/ECMAScript features. https://developer.mozilla.org/docs/Web/JavaScript data:svg 🖌️ Compat data for SVG features. https://developer.mozilla.org/docs/Web/SVG data:webdriver 🏎️ Compat data for WebDriver features. https://developer.mozilla.org/docs/Web/WebDriver data:webext 🎲 Compat data for Browser Extensions. https://developer.mozilla.org/Add-ons/WebExtensions labels Dec 22, 2019
@Elchi3 Elchi3 removed the request for review from rebloor January 10, 2020 13:16
@sideshowbarker
Collaborator Author

@Elchi3, @ddbeck This PR is ready for review as a proposed migration per the guidelines at https://github.com/mdn/browser-compat-data/blob/master/docs/migrations.md#migrations. I’ve (re)tested the migration script and confirmed that it produces the expected results, without any false positives.

@Elchi3
Member

Elchi3 commented Mar 5, 2020

hey @sideshowbarker, I will try to schedule a review for this in an upcoming sprint. Thanks for your patience.

@sideshowbarker
Collaborator Author

Hi @Elchi3, I understand, and there’s no rush; I just hadn’t heard from anybody on this, so I had been hoping for an update, which you’ve now given me. So thanks, and I’ll wait to hear back from you when this makes its way up the queue.

@bershanskiy bershanskiy mentioned this pull request Mar 8, 2020
5 tasks
@sideshowbarker sideshowbarker mentioned this pull request Dec 8, 2020
@Elchi3 Elchi3 removed their request for review February 18, 2021 17:17
Base automatically changed from master to main March 24, 2021 12:53
@sideshowbarker
Collaborator Author

sideshowbarker commented Jun 22, 2021

We currently have 1170 mdn_url values in BCD that are 404s; see w3c@HEAD.

But that number continues to grow — because basically every week, we continue to add new mdn_url values to BCD for MDN articles that don’t actually exist.

So I’m closing this, since I guess it’s time for me to give up on the idea that we’ll agree to put a policy in place upstream to stop doing that. But that means I’ll need to keep maintaining the W3C fork of BCD at https://github.com/w3c/browser-compat-data so that there’s a version of BCD somewhere without all the mdn_url 404s.

@sideshowbarker sideshowbarker deleted the remove-dead-mdn-urls branch June 22, 2021 22:54
@Elchi3
Member

Elchi3 commented Jun 23, 2021

Thanks Mike! Can you put the 404 URLs in a gist or somewhere where we can take a better look at them?
I wonder if we can cluster them to identify content gaps. It seems like there are a lot of 404s for HTML* APIs and WebDriver, for example. So a follow-up here could be to file issues like:

  • "Document WebDriver APIs (xx missing pages)"
  • "Document HTML DOM APIs (yy missing pages)"

Does that make sense to you?

@sideshowbarker
Collaborator Author

Thanks Mike! Can you put the 404 URLs in a gist or somewhere where we can take a better look at them?

They are all in the HEAD commit of https://github.com/w3c/browser-compat-data — which doesn’t always have the same hash, because the way I manage that repo is to amend and force-push that same commit at the HEAD.

So https://github.com/w3c/browser-compat-data/commit/HEAD.diff is the stable link that’ll always give the current diff of all removed mdn_url values.

And so to bash out from that just a sorted list of the URLs themselves, something like the following should always work —

curl -s https://github.com/w3c/browser-compat-data/commit/HEAD.diff | grep mdn_url | cut -d '"' -f4- \
    | rev | cut -c 3- | rev | sort | uniq

I wonder if we can cluster them to identify content gaps. It seems like there are a lot of 404 of HTML* APIs and WebDriver, for example. So a follow up here could be to file issues like:

  • "Document WebDriver APIs (xx missing pages)"

  • "Document HTML DOM APIs (yy missing pages)"

Does that make sense to you?

I guess I wouldn’t object to somebody doing it that way — if they were motivated to take time to look through the list and try to do some analysis and identify some patterns and raise issues.

But I can say that personally my motivation for doing that is so non-existent that I have never even once actually looked through the list. I guess I have zero curiosity about how any of those URLs got into the data to begin with — and so, as far as identifying content gaps, IMHO the right way to do that would be this:

  1. Start by just removing all those MDN URLs that are 404s
  2. Write some automation that spits out the complete list of features that still have no mdn_url. (A rough sketch follows after this list.)
  3. Do analysis on that complete set of missing URLs, to identify overall patterns of areas we’re lacking documentation for, and then raise issues based on that.
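For step 2, a rough sketch of that automation (walking the BCD tree and printing the dotted path of every feature whose __compat block has no mdn_url) could be:

// Sketch: list features that have a __compat block but no mdn_url.
function listMissingMdnUrls(node, prefix = '') {
  for (const [key, value] of Object.entries(node)) {
    if (!value || typeof value !== 'object') continue;
    if (key === '__compat') {
      if (!value.mdn_url) console.log(prefix);
    } else {
      listMissingMdnUrls(value, prefix ? `${prefix}.${key}` : key);
    }
  }
}

// e.g.: listMissingMdnUrls(require('@mdn/browser-compat-data'));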

To instead do some similar steps based just on the starting point of looking at the subset of features for which somebody at some time for some unknown reason chose to add an mdn_url for an article that doesn’t exist — that does not seem to me personally at least like the approach that’d be the best use of anybody’s time.

I think among the flaws in doing it that way is that it assumes there was sound logic and rationale to those particular URLs getting added when others weren’t.

But I think what’s actually happened instead is that it’s just been arbitrary — that is, the difference between which non-existent URLs got added and which didn’t comes down to who happened to have written and reviewed the BCD patches that added them, and their individual style/preferences just being different from those of the people whose patches didn’t add any.

@teoli2003
Contributor

teoli2003 commented Jun 23, 2021

Independently of what Mike is proposing, I see that HTML*Element returns 285 missing pages (and likely more have not been caught if they don't have the mdn_url set).

curl -s https://github.com/w3c/browser-compat-data/commit/HEAD.diff | grep mdn_url | cut -d '"' -f4- | rev | cut -c 3- | rev | sort | uniq | grep Web/API/HTML.*Element | wc -l

Given that it is unlikely that contract writers will be commissioned to work on these old (but not outdated) APIs, I'm pondering whether this could be a good crowdsourced activity. A step toward MDN being feature-complete, with the side effect of solving 25% of the problem here (even if it is not the right canonical way to solve it).

A lot of these pages are likely suitable for beginner writers, and with our review process, we should be able to ensure the quality (which wasn't possible with the wiki).

@Elchi3
Member

Elchi3 commented Jun 23, 2021

Good point, Mike, I forgot there are BCD features that have no mdn_url. However, I think certain sub features would never get an mdn_url. So, the steps you list make sense to me, just adding a filter step.

  1. Remove 404 URLs
  2. List features that have no mdn_url
  3. Filter that list to remove sub features that generally don't require a dedicated MDN doc.
  4. Analyse and raise issues for patterns/clusters of documentation that is missing.

Also agree with @teoli2003 about crowd sourcing after a cluster has been identified.

@teoli2003
Contributor

(I'm all for removing the 404 URLs, btw.)

@sideshowbarker
Collaborator Author

sideshowbarker commented Jun 23, 2021

So, the steps you list make sense to me, just adding a filter step.

  1. Remove 404 URLs

  2. List features that have no mdn_url

OK, as far as step 2 there, I already have some code written into my spec_url-checking script that attempts to do that. You can try it like this:

curl -s -O https://raw.githubusercontent.com/w3c/mdn-spec-links/master/.check-spec-urls.js \
    && node .check-spec-urls.js 2>&1 | tee LOG && grep "no mdn_url" LOG

When I run that from my fork, it lists 2703 features. When I run it from the upstream repo, it lists 1801 features.

  3. Filter that list to remove sub features that generally don't require a dedicated MDN doc.

I agree any automation needs to do that in order to be useful. But I’m not sure how to do that programmatically.

Whether a subfeature of a feature is one that doesn’t require a dedicated MDN doc seems to depend on what kind of feature it’s a subfeature of, and which data file it’s in and what all else is in that data, and how it’s structured — and also how the MDN docs for it are structured.

But there are probably some clever ways to approach it that would narrow things down a bit.

@sideshowbarker
Collaborator Author

Whether a subfeature of a feature is one that doesn’t require a dedicated MDN doc seems to depend on what kind of feature it’s a subfeature of, and which data file it’s in and what all else is in that data, and how it’s structured — and also how the MDN docs for it are structured.

But there are probably some clever ways to approach it that would narrow things down a bit.

This discussion got me remembering something that could help with the above need: we could add a new key name to BCD to explicitly indicate the feature or subfeature type — with the possible values including, for example:

  • interface
  • method
  • property
  • element
  • attribute
  • javascript-object
  • javascript-operator
  • http-header
  • http-method
  • http-status

I guess some (or most) of those we could determine heuristically, based on the (sub)directory name and filename. But in my experience, when writing code to walk a filesystem tree and do JSON parsing of file contents, you really would rather not need to keep state information about what the current directory name happens to be, or what the current filename happens to be — instead, you really want the feature type information to be an explicit part of the data itself.

(And incidentally, if we did end up adding this, the type field should be added as a child of the __compat key, rather than as a sibling of the __compat key. The reason I say that is similar to the reason it’s preferable not to need to keep state info about directory names and filenames: when you run a JSON parser with a reviver function, it returns the keys in reverse tree order — it ascends through the keys rather than descending through them. And so unless the type field is at the same level as the rest of the useful keys you’re doing stuff with, you’d end up having to keep some state and re-check it as you ascend.)
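To illustrate that reviver point, with a made-up feature and a hypothetical type field inside __compat:

// JSON.parse calls the reviver on child objects before their parents, so
// a type field inside __compat is available in the same reviver call as
// mdn_url, support, etc.; no state from the enclosing keys is needed.
const json = `{
  "api": {
    "SomeInterface": {
      "__compat": {
        "type": "interface",
        "mdn_url": "https://developer.mozilla.org/docs/Web/API/SomeInterface",
        "support": {}
      }
    }
  }
}`;

JSON.parse(json, (key, value) => {
  if (key === '__compat') {
    console.log(value.type, value.mdn_url); // both already revived here
  }
  return value;
});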

@Elchi3
Member

Elchi3 commented Jun 28, 2021

Thanks Mike. I think over in mdn/content we have the idea of adding "page-types" to the files, which would pretty much be what you describe here. And I think the mdn/content front matter/markdown files will contain more interesting data points in the future, so I suspect more of this stuff would rather live in mdn/content than in BCD. So, my sense is that we need to make sure BCD maps nicely onto mdn/content rather than putting more data into BCD that isn't directly compat data. Generally, I agree, though, that such a type classification is very useful.

@ddbeck
Collaborator

ddbeck commented Jun 28, 2021

I'm not exactly sure what the next step should be for removing/changing/fixing 404s, but this caught my attention:

But I think what’s actually happened instead is that it’s just been arbitrary — that is, the difference between which non-existent URLs got added and which didn’t comes down to who happened to have written and reviewed the BCD patches that added them, and their individual style/preferences just being different from those of the people whose patches didn’t add any.

We can stop making things worse in this regard. I've opened #11291 to discuss what data guidelines we might have as an interim measure, to make this less inconsistent going forward. That would be a good place to settle an important question right now: should we allow new 404s to appear in BCD?
