Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core(jsonld): add structured data validation #6750

Merged
merged 33 commits into from
Apr 8, 2019
Merged

Conversation

patrickhulce
Copy link
Collaborator

@patrickhulce patrickhulce commented Dec 8, 2018

Summary
Split out the structured validation library portion from #5446.

Key changes from Konrad's implementation

  1. Ignoring native dependency instead of modifying our build environment Updated to 1.5.0 so we don't need it anymore! 🎉
  2. Generated files are now a single-line, I'm not 100% sure these even need to be checked in if we had a prepublish/postinstall hook to generate them? I suppose it's safer though considering they're pulled from URLs Reverted through review
  3. Added a monkeypatch for console.error because jsonlint-mod is annoying Updated to 1.7.4 so also not needed! 🎉

Native Dependency Notes
jsonld depends on rdf-canonize which in versions <0.2.0 had a native C++ dependency. Personally I consider the existence of this native dependency a major PR blocker. I don't think all users that install Lighthouse should now be blocked on having C++11 on their system. We should try and get jsonld published with rdf-canonize@0.3.0 which was the first version that made the native portion optional. Alternatives include just pulling in their UMD library from a CDN instead of using the NPM version.

This has been resolved and they've removed their native dependency!! Hurray!!

Related Issues/PRs
#5446

@patrickhulce
Copy link
Collaborator Author

Good news: jsonld published a version that uses the new optional native dependency! 🎉

Bad news: npm and yarn both install optional deps by default, which means everyone depending on LH today would still need to change their build configs to continue installing.

--

What do we do?

  • Pull in jsonld from unpkg instead https://unpkg.com/jsonld@1.2.0/dist/jsonld.min.js
  • Go back and try to thread the dep chain of making it a peer or not a dependency at all, assuming the package owners support this.
  • Document this new dep requirement and consumers have to live with --ignore-optional if they want to use LH and don't have a C++11 compiler
  • Something else I'm missing?

package.json Outdated
@@ -32,8 +32,9 @@
"unit-cli": "jest --runInBand \"lighthouse-cli/\"",
"unit-cli:ci": "jest --runInBand --coverage --ci \"lighthouse-cli/\"",
"unit-viewer": "mocha --reporter dot \"lighthouse-viewer/test/**/*-test.js\"",
"unit": "yarn unit-core && yarn unit-cli && yarn unit-viewer",
"unit:ci": "yarn unit-core:ci && yarn unit-cli:ci && yarn unit-viewer",
"unit-sd-validation": "mocha --reporter dot \"sd-validation/test/**/*-test.js\"",

This comment was marked as resolved.

This comment was marked as resolved.


jsonld.expand(inputObject, {
documentLoader,
}, (/** @type {string} */e, /** @type {Object} **/expanded) => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like we need a few additions to tsconfig?

diff --git a/tsconfig.json b/tsconfig.json
index 6d57b103..dc5e1b14 100644
--- a/tsconfig.json
+++ b/tsconfig.json
@@ -21,11 +21,14 @@
     "lighthouse-core/**/*.js",
     "clients/**/*.js",
     "build/**/*.js",
+    "sd-validation/**/*.js",
     "./types/*.d.ts",
   ],
   "exclude": [
     "lighthouse-cli/test/**/*.js",
     "lighthouse-core/test/**/*.js",
+    "sd-validation/test/**/*.js",
+    "sd-validation/scripts/**/*.js",
     "clients/test/**/*.js",
   ]
 }

/** @type {function(string):void} */
let reject;
const promise = new Promise((res, rej) => {
resolve = res; reject = rej;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we dont really follow this pattern much so i tried it using our more typical approach:

module.exports = function expand(inputObject) {
  return new Promise((resolve, reject) => {
    function documentLoader( /** @type {string} **/ schemaUrl, /** @type {function(null, Object):void} **/ callback) {
      try {
        return loadDocument(schemaUrl, callback);
      } catch (e) {
        reject(e.message);
      }
    }

    jsonld.expand(inputObject, {documentLoader}, (/** @type {string} */e, /** @type {Object} **/expanded) => {
      if (e) reject('Expansion error: ' + e.toString());
      else resolve(expanded);
    });
  });
};

but I can't say its a clear win so w/e

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I'm inclined to leave as-is, I actually prefer removing the layer of nesting :)

try {
return loadDocument(schemaUrl, callback);
} catch (e) {
reject(e.message);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return reject IMO

documentLoader,
}, (/** @type {string} */e, /** @type {Object} **/expanded) => {
if (e) {
reject('Expansion error: ' + e.toString());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return reject

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

turns out they have a promise-api, so I'm just switching to that instead

if (e) {
reject('Expansion error: ' + e.toString());
} else {
resolve(expanded);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return resolve

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@paulirish
Copy link
Member

paulirish commented Dec 12, 2018

What do we do?

We discussed offline that peerDep doesn't totally make semantic sense and will spam warnings when installing. The unpkg solution scares me a bit.

I'm favoring this one:

Document this new dep requirement and consumers have to live with --ignore-optional if they want to use LH and don't have a C++11 compiler

Copy link
Member

@brendankenny brendankenny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Document this new dep requirement and consumers have to live with --ignore-optional if they want to use LH and don't have a C++11 compiler

The problem is that almost no one is going to read this or do it. A lot of hard work has gone into this, but it still seems backwards to make lighthouse harder to use (and slower by default) to add this audit.

Maybe we should have a manual SEO audit pointing to a new @lighthouse/structured-data plugin?

@patrickhulce
Copy link
Collaborator Author

patrickhulce commented Dec 12, 2018

pointing to a new @lighthouse/structured-data plugin?

😍 😍 😍 😍 😍 😍 😍

I love this for so many reasons.

@rviscomi
Copy link
Member

@brendankenny is your suggestion to omit the SD audit from running in the default LH SEO category and include a link under "Manual audits" for users to install a new extension which will run the SD audit?

@brendankenny
Copy link
Member

brendankenny commented Dec 12, 2018

@brendankenny is your suggestion to omit the SD audit from running in the default LH SEO category and include a link under "Manual audits" for users to install a new extension which will run the SD audit?

Yeah.

I realize the issues with that, but in the past we haven't pursued audits that would introduce a new reload pass, or wouldn't work in all clients, or would require calls out to third-party APIs.

Landing this as-is is going to break downstream users who won't be reading the install instructions when they do a simple update and are deploying from a container or whatever, and going to slow down everyone else's install as we compile a native library.

This isn't a make-or-break audit (there are already existing tools for structured data), and the fact that there's a lot of hard work here and that it would be useful for developers doesn't seem sufficient to justify that downside at this point.

That said, plugins aren't really ready yet and we haven't figured out how any will be bundled (if they will be), so maybe there are other solutions we can look at in the meantime.

This is the kind of thing third-party/ can be useful for. What would it take to prebundle jsonld and put it in there? In theory a lot of https://npm.anvaka.com/#/view/2d/jsonld can be pruned since we won't need request.

@patrickhulce
Copy link
Collaborator Author

Alright now that the dependency issue has been resolved, I think this is ready for another go around! 🎉

@mattzeunert
Copy link
Contributor

Alright now that the dependency issue has been resolved, I think this is ready for another go around! 🎉

Waiting for this to get merged to move forward with the structured data audit.

@brendankenny @paulirish Could you take another look? ❤️

try {
// jsonlint-mod always calls console.error when there's an error
// We don't want this behavior, so we stash console.error while it's executing
console.error = () => undefined; // eslint-disable-line no-console
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should no longer be needed 🙂 circlecell/jsonlint-mod#2

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hurray that's awesome! :D

@patrickhulce
Copy link
Collaborator Author

@danbri
Copy link

danbri commented Feb 14, 2019

I see some snapshotted schema.org data in https://github.com/GoogleChrome/lighthouse/tree/75b7a51a0eb19fc289338c799bae9074341f466c/sd-validation/assets ... what is the plan for keeping this up to date?

(we've got into trouble before when people have generated artifacts from now-obsolete snapshots, e.g. see https://github.com/google/schemaorg-java )

/cc @vholland @rvguha

@danbri
Copy link

danbri commented Feb 14, 2019

(to be clear, this is all awesome, any nitpicking is in the spirit of wanting to help make it better, sustainable, and to fit in with our development plans for Schema.org)

@patrickhulce
Copy link
Collaborator Author

🏏 🏏

round 2?

Copy link
Member

@brendankenny brendankenny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, this is looking good!

I see the value in doing this, and so would like to avoid coupling it too much to LH code, but I also recognize that we don't need to be overly optimizing for this hypothetical future case now that there's no clear permanent owner.

yeah, it feels a bit like that's not going to happen at this point :/

I think it could do fine in lib/sd-validation/. It's confusing to have in a top level directory (I think), and really for now it's no more an independent library than manifest-parser, which is just a little longer in loc and was actually also intended to "someday" be an independent module :)

We will have to be vigilant about keeping it componentized, but it has the benefit of being pretty naturally its own thing, so ideally it'll mostly take care of itself.

types/structured-data.d.ts Outdated Show resolved Hide resolved
types/structured-data.d.ts Outdated Show resolved Hide resolved
types/isomorphic-fetch.d.ts Show resolved Hide resolved
sd-validation/scripts/generate-schema-tree.js Outdated Show resolved Hide resolved
sd-validation/scripts/generate-schema-tree.js Outdated Show resolved Hide resolved
sd-validation/json.js Outdated Show resolved Hide resolved
sd-validation/json.js Outdated Show resolved Hide resolved
sd-validation/json.js Outdated Show resolved Hide resolved
* @param {Array<string>} path
*/
module.exports = function walkObject(obj, callback, path = []) {
if (obj === null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we care about guarding against non-objects passed in? we may not, but the type won't help prevent

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

everything being passed in here is already * or has been filtered out for null already, so I don't think it matters

sd-validation/expand.js Outdated Show resolved Hide resolved
const path = require('path');
const fs = require('fs');

const SCHEMA_ORG_URL = 'https://schema.org/docs/jsonldcontext.json';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you fetch with "Accept: application/ld+json" headers, you don't need to use this magic URL. It has been documented before that this URL is for convenience only and should not be treated as an API. Tools will generally fetch the @context value exactly as specified.

curl -H "Accept: application/ld+json" https://schema.org/

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah neat thanks!! done 👍

@mmocny
Copy link
Member

mmocny commented Apr 2, 2019

Folks, I'm also interested in seeing LH have support for structured data validation. This patch is really great! I left a few small comments, and I'm happy to help review more in-depth.

Question: this patch is very hard-coded to support only core schema.org vocab. If I was willing to put in the effort to update, would you accept changes to also support hosted/external schema.org extensions? I think it would be better to do this as a follow-up patch.

@patrickhulce
Copy link
Collaborator Author

patrickhulce commented Apr 2, 2019

Thanks @mmocny! I'd love to have someone that knows what they're doing invested in this :D (core team currently inherited this without complete context, so expertise would be very much appreciated)

I'd definitely like to land this before further increasing the scope since it's been outstanding for quite some time, but supporting beyond schema.org if it's being actively consumed by search engines/crawlers/others sounds good to me! (so long as the tradeoff between usage and cost to Lighthouse users is sufficient of course 😄 )

@mmocny
Copy link
Member

mmocny commented Apr 2, 2019

I'm not sure I qualify as "knows whats they are doing", but I can try. How can I help get this finished?

(Also, it looks like this patch has some issues with original authors not signing CLA. I don't know the history here...)

@patrickhulce
Copy link
Collaborator Author

(Also, it looks like this patch has some issues with original authors not signing CLA. I don't know the history here...)

That terminal CLA state happens when @brendankenny or @paulirish's suggested changes get applied to a non-googler's patches for some reason. We haven't yet been able to figure out why 🤷‍♂️

@patrickhulce
Copy link
Collaborator Author

How can I help get this finished?

As soon as @brendankenny signs off IMO it's landable. If you'd like to review it for anything that you think would be a showstopper to landing please feel free to help him out :)

@davidlehn
Copy link

That googlebot CLA warning happened immediately after my trivial doc fix suggestion got applied [1]. So perhaps my fault? I think I've done some Google CLA thing in the past but if I need to do it again here for a one-liner just to clear things up, let me know how.

[1] #6750 (comment)

@patrickhulce
Copy link
Collaborator Author

patrickhulce commented Apr 2, 2019

Oh thanks @davidlehn just a simple "I consent to those edits being released under the Apache 2.0 license in this repo" is fine :)

@mmocny
Copy link
Member

mmocny commented Apr 8, 2019

@davidlehn May you please give the OK to unblock this merge?

@davidlehn
Copy link

I consent to my edits being released under the Apache 2.0 license in this repo.

@mmocny
Copy link
Member

mmocny commented Apr 8, 2019

Thanks!

I also think that @brendankenny comments from last review are addressed. I think this patch has addressed issues that have been brought up.

I'm new to this feature, but anxious to see it land.

Questions:

  • It looks like there were various earlier efforts on this feature, and a couple of earlier PR are still open. Should we audit those to make sure all the work and comments have been addressed in this patch as well? Or, has that already been done? (I wasn't sure how this now relates to the original PR new_audit(seo): JSON-LD validation (not included in UI for now) #5446 -- but it looks like [SEO Audits] Structured data is valid #4359 describes this in the most recent comment)
  • The work in this patch adds json-ld and schema.org validation and tests, but is it actually exposed to Lighthouse audit already? Will that have to come next? Should there be a process for running the scripts to update the schema definition files?
  • I'd like to write a manual audit which uses much of the work from this patch, but to help validate a schema.org extension (not core schema). I believe I could do this either by adding support for extensions directly into lib/sd-validation, or just by doing it entirely in my manual audit. I'll file a separate issue once this patch lands to ask if there is support for the former.

@patrickhulce
Copy link
Collaborator Author

It looks like there were various earlier efforts on this feature, and a couple of earlier PR are still open. Should we audit those to make sure all the work and comments have been addressed in this patch as well? Or, has that already been done?

Yeah this was split out from #5446 and addressed anything outstanding at the time.

is it actually exposed to Lighthouse audit already? Will that have to come next?

No it won't be, that will be followup work by @mattzeunert.

Should there be a process for running the scripts to update the schema definition files?

Probably, but we don't have one proposed. If you have any insight to share on how frequently these are likely to become out-of-date we'd love to hear it :)

I'll file a separate issue once this patch lands to ask if there is support for the former.

Sounds great! :)

@brendankenny
Copy link
Member

That googlebot CLA warning happened immediately after my trivial doc fix suggestion got applied [1]. So perhaps my fault? I think I've done some Google CLA thing in the past but if I need to do it again here for a one-liner just to clear things up, let me know how.

[1] #6750 (comment)

can confirm that @davidlehn does indeed have a Google CLA

@patrickhulce
Copy link
Collaborator Author

sweet thanks @brendankenny! would you mind confirming your intention to review or not review? :)

Copy link
Member

@brendankenny brendankenny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just the index.js filename change, otherwise let's do this!

@@ -0,0 +1,75 @@
/**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename sd-validator.js(or whatever) so we don't add another index.js file :)

@patrickhulce
Copy link
Collaborator Author

let's do this!

🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉

@rviscomi
Copy link
Member

rviscomi commented Apr 8, 2019

itshappening.gif

Huge thank you to @patrickhulce @brendankenny and everyone else who helped get this out!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants