Is there a need for both an authored and a canonical manifests? #31
Comments
Two relevant comments from the other issue:
My own opinion: as a technical person, I would be happy to use a canonical manifest only, i.e., to remove the notion of an authored manifest. It would undeniably simplify the spec and make the technical content cleaner. This is, however, a usability issue, and the question really depends on what the community would accept or not; I do not have the necessary experience to have an informed opinion on this.
IMO, I don't see "people typing these by hand" (@BigBlueHat).
For reading systems, the mapping from the authored manifest to the canonical one is undoubtedly some work. We're doing exactly that, from the "authored" Readium Webpub manifest to the in-memory object processed by reading software. This is not a huge amount of work, however. Those curious can look at the Readium Mobile / Kotlin code. A JSON schema capable of dealing with flexible structures is also doable; this is done e.g. in https://github.com/readium/webpub-manifest/blob/master/schema/contributor-object.schema.json. In conclusion, yes, there is a price to this flexibility, but developers can overcome it. Re. other shortcuts in the authored manifest, there is also the question of JSON-LD types (e.g. "type": "LinkedResource"). The @type associated with the property definition in the context file is enough to define the type, and I don't see why "type": "LinkedResource" should be added to each resource.
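(For illustration, a hypothetical TypeScript sketch of that kind of mapping; this is not the Readium code, and the `Person` shape and function name are assumptions made for this example.)

```typescript
// Hypothetical canonical shape for a contributor; an assumption for this
// sketch, not a definition taken from the draft.
interface Person {
  type: "Person";
  name: string;
}

// An authored value may be a bare string, a single object, or an array of either.
type AuthoredContributor = string | Person | (string | Person)[];

// Fold every authored form into a uniform array of Person objects.
function normalizeContributor(value: AuthoredContributor): Person[] {
  const items = Array.isArray(value) ? value : [value];
  return items.map((item): Person =>
    typeof item === "string" ? { type: "Person", name: item } : item
  );
}

// "Jane Doe" and [{ type: "Person", name: "Jane Doe" }] end up identical.
console.log(normalizeContributor("Jane Doe"));
```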
Exactly. That is the core question.
Wasn't the main motivator for the canonicalization steps (relating to expanding values, at least) that schema.org metadata values aren't rigidly enforced? So whether machines generate the manifest or not, there's the problem of whatever creates the file following general conventions. I thought the idea was to be practical and have the steps sanitize the data for the user agent rather than fail on it or not process it, since the compact forms don't seem to hinder seo processing? But if we go this route, we should be thorough about it and also remove the steps about obtaining information from the html document that references/embeds the manifest. We should figure out why we want/need full json-ld conformance, too, when we don't expect user agents to be json-ld processors.
That was certainly part of it. I suspect schema.org has a similar procedure to our canonicalization. That being said, if we decide we are stricter than schema.org, this does not "harm" our data vis-à-vis schema.org.
Oops, that is true. If we do not do any canonicalization at all, this is a consequence... The biggest "loss" would be the potential reuse of the
By "full conformance" I presume you mean being able to use all JSON-LD features, right? At this moment I do not see any reason for having it…
Right, we've never said that a user agent must be a fully-conforming json-ld processor, only that it be able to process the json in the manifest into an internal structure. Maybe I'm wrong, but json-ld only seemed to enter the equation as a means of allowing search crawlers to get at the information without duplicating the metadata. (Is that (still) a primary use case?) In terms of discrepancies between the graphs, sure, I don't think it's ideal, either, but I thought the point was finding a balance? If you don't care about the seo angle, or it's not relevant to your format, you aren't burdened with strict authoring of json-ld. If it does matter, you can choose to be stricter in authoring. But maybe this is where layering comes in if we want to formalize this duality.
That is certainly the main motivation, yes. One could imagine a full JSON-LD engine which would then also allow a simple and frictionless extensibility of the metadata by adding new terms, vocabularies, etc., using the JSON-LD facilities. This is what, e.g., Verifiable Claims do. But I do not see that as a use case for the publication manifest, at least not in this version.
I'm admittedly not too keen on another major overhaul of the specification at this stage. We got where we are because there wasn't consensus on requiring strict authoring, so reversing course now seems a bit fraught. We'd be undoing a lot.
Schema.org provides no algorithm for processing--it's only a vocabulary definition and documentation. However, SEO bots and tools (e.g. SDTT) which consume it attempt to "clean up" things. We, however, have a host of consumers for these manifest documents: SEO bots, publisher metadata/management systems, CMSs, metadata management for distribution, personal archives, reading systems, etc. Consequently, the clearest and most complete manifest is the one that should go into the publication itself. Prior "author-friendly" formats can exist (like YAML => JSON or Markdown => HTML), but have a pre-publication use (vs. post-publication existence).
Publishers want this very thing, and many of us use graph-based data formats (JSON-LD chief among them) to accomplish this extensibility while still maintaining interoperability (see also Web Annotations, VCs, etc).
Could the presence/absence of the context, and/or the media type of the manifest, be a trigger for canonicalization? For example:
If a context is set, data is strictly interpreted and no canonicalization occurs. If the context is not set, the data is assumed to be json and the canonicalization steps have to be run to obtain the common data structure. It would then be up to implementations to allow one or both serializations, and authors to decide which they prefer. Right now we seem to be stuck on trying to force some measure of json-ld on simple authoring, and then arguing over the inevitable lack of perfection that results. If people actually want something less than json-ld, let's just give it to them. Otherwise, abandon the idea. It would also be good to hear who prefers which option either on a call or by email survey, so we have a practical idea of where the group is on this. This discussion could go on a long time if we're just discussing pros and cons.
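(A minimal sketch of that trigger idea; the function names are hypothetical and the check is deliberately simplistic, so treat it as an illustration rather than a proposal for spec text.)

```typescript
// Hypothetical dispatch: the presence of @context decides whether the document
// is interpreted strictly or put through the canonicalization steps.
type Manifest = Record<string, unknown>;

function processManifest(doc: Manifest): Manifest {
  const context = doc["@context"];
  const hasContext =
    typeof context === "string" ||
    (Array.isArray(context) && context.length > 0);

  if (hasContext) {
    // Context set: data is strictly interpreted, no canonicalization.
    return interpretStrictly(doc);
  }
  // No context: treat it as plain JSON and run the canonicalization steps.
  return canonicalize(doc);
}

// Placeholders so the sketch stands alone; real implementations do the work.
function interpretStrictly(doc: Manifest): Manifest { return doc; }
function canonicalize(doc: Manifest): Manifest { return doc; }
```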
+1 |
(Summarizing, to help WG members who were not part of the discussions so far but whose opinion is necessary at this point) There are two questions that need a final and urgent decision in order to move forward.
At the moment, the draft:
Giving a clear yes or no answer to both questions is necessary at this point to move ahead with the draft; otherwise we are stuck. (Note that our primary goal is audiobooks at this point, although we should look forward to other usages of the manifest.)
My preference is still to leave this alone given the work it's taken to get here. I only wonder if the idea of generating a "canonical manifest" needs some additional clarification. We describe the process as though an actual json-ld document has to be the end result, but I don't believe this is required. It's just a way of explaining the process and resulting data structure. In other words, where we say in the canonical manifest definition:
Isn't what we really mean more like:
A user agent should have the option to do things differently than the lifecycle algorithm, like read the manifest into an internal data structure and then sanitize the data by the rules, never creating a "canonical manifest" in the sense of there still being a json-ld representation. That's at least what confuses me about the idea that canonicalization represents an unnecessary step. Even if we had full json-ld, the process can't go away, as that only removes some expansion steps. You still have to get the manifest, internalize it and sanitize the data (checking properties are set, values are conforming, etc.).
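(A sketch of what that could look like; the field names and the loader are illustrative assumptions, not the spec's WebIDL or algorithm.)

```typescript
// Illustrative internal representation; field names are assumptions for the sketch.
interface InternalPublication {
  name: string;
  readingOrder: { url: string }[];
}

// Parse and sanitize straight into the internal structure; no intermediate
// "canonical manifest" document is ever serialized.
function loadPublication(json: string): InternalPublication {
  const raw = JSON.parse(json);
  const name = typeof raw.name === "string" ? raw.name : "";
  const entries: unknown[] = Array.isArray(raw.readingOrder)
    ? raw.readingOrder
    : [raw.readingOrder];
  const readingOrder = entries.map((entry) =>
    typeof entry === "string" ? { url: entry } : (entry as { url: string })
  );
  return { name, readingOrder };
}
```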
You're right in that the result will be an in-memory object reflecting the structure of the "canonical" manifest. |
No matter how we do things, life is going to be complicated for an ordinary web developer:
baz should be:
But I wouldn't call processing json all that complicated if I can do it. :) It's just a bit tedious in that you always have to test what kind of data you've encountered:
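(The original snippet is not reproduced here, but a rough stand-in for the kind of type testing being described might look like this; the property name is chosen only for illustration.)

```typescript
// The tedium in question: a value may arrive as a string, an object with a
// name, or an array of either, so every read starts with type tests.
function readNames(value: unknown): string[] {
  const items = Array.isArray(value) ? value : [value];
  const names: string[] = [];
  for (const item of items) {
    if (typeof item === "string") {
      names.push(item);
    } else if (typeof item === "object" && item !== null && "name" in item) {
      names.push(String((item as { name: unknown }).name));
    }
  }
  return names;
}
```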
@dauwhe @mattgarrish good to see your developer skills. Did you have a look at the code linked from #31 (comment)? |
Yes, and I agree that processing isn't the big challenge here. I also agree that there isn't a lot of value to having to author information that can be inferred, like types. The JSON-LD specification appears to agree, too. But I've maybe become jaded by experiences in epub where no matter how elaborate we've tried to make the metadata, the extra information just hasn't proven useful. How many systems need to know whether an author is a person, organization or thing? |
The processing is not a big deal. I have done this in https://github.com/iherman/WPManifest (it includes stuff that is irrelevant by now, i.e., getting hold of the manifest itself as described in WPUB, and it may not be up-to-date).
As far as I could see, schema.org processors (at least the structured data testing tool) complain if the type is not explicit. I believe this was the only reason we required it. But, as @llemeurfr said in another comment, whether the type is required or not is another issue; let us not mix it with the fundamental questions above...
I really want to avoid that sort of thing. Aside from questions of syntax, what benefit do we get from this extra information? Are we obligated to research the name of every author? What does it mean for a proper name to have a language associated with it? Is Yann Martel fr-CA or en-CA?
@dauwhe nobody will have to write this by hand. |
From what I see the testing tool just defaults to Thing whereas we default to Person. The concerns about different graphs are real, but I'm just not swayed that the compact forms will cause much real harm in practice, and you can always avoid them. We were trying to be flexible by going this route, just as schema.org metadata processors have to be. But I agree we don't need to get into all the details. I'm only raising this to agree with your second point that keeping the simplifications allowed in the authored manifest is fine with me. |
I have personally edited probably thousands of EPUB package files. Even if the majority of manifests are created by tools, I still think it's important to maintain as much human-readability as possible. It helps with troubleshooting and makes it easier for developers, who are also human :) |
I would phrase this more as it's important to maintain an authoring syntax that people are already familiar with when authoring schema.org metadata. I believe we've achieved that in allowing strings and defining how to make objects from them. Let's not go further astray. |
As a comparison:
Re. allowing "property":"value" AND "property":["value","value"], did you spot that JSON-LD itself gives the bad example, e.g. in https://www.w3.org/TR/json-ld11/#specifying-the-type? A clarification on @mattgarrish's reference to the JSON-LD spec: from what I understand, the referenced section is about how a JSON-LD processor can infer the type of an object from the properties it contains. This is not the use case I was talking about: in fact, I was thinking about "type coercion" but discovered that JSON-LD does not support it for complex types, ref. w3c/json-ld-syntax#31. This would have made the JSON-LD context an equivalent of an RDF Schema, which IMHO would have been smart.
Indeed. JSON-LD is "only" an RDF serialization, and such inferences are in RDF's purview. |
If the context is not set (i.e. it doesn't include Also, that discovery step would be required or a unique MIME media type (beyond
Perhaps this is the core of the confusion/tension. The canonicalization algorithm is meant to provide a "canonical manifest," but that "manifest" isn't actually ever...manifest. It only exists (according to the spec currently) as an "internal representation of the data structure." That's not what JSON-LD is for...it's what WebIDL and internal APIs are for. From the introduction:
So, I'd conclude (per the aim of this issue) that...
Alternatively, there might only be one "manifest" format/style and UA's can define whatever internal representation they want/need. |
That example's correct, @llemeurfr: JSON-LD supports multiple @type values. You are correct about JSON-LD not supporting node "type coercion" (i.e. you can't turn a "Thing" into a "Person"). Anything making those sorts of additions or changes to the original data (like the canonicalization algorithm or the Structured Data Testing Tool) is doing so beyond what can be understood via a JSON-LD context or even a JSON Schema--as they both describe expectations or understandings of the data...not transformations.
That is a viable approach indeed, although I am not yet sure how to change what is there editorially. But surely that can be done, @mattgarrish and I can look into this editorial option... |
Related to the possible consensus on re-branding the canonicalization: we use WebIDL as some sort of data structure description language. It is not ideal; nobody is/was fully happy with it, because WebIDL is usually used to describe an API rather than a data structure. However... does anybody have a better idea? Something that is clean and easily readable by programmers and does not give the impression that we bind this to a single programming language (although the latter can be mitigated by a clever explanation...). If I give up the programming-language-independent view, then I would consider TypeScript (but I am not very familiar with it): it is close enough to Javascript that people should understand it, but it has information about the datatypes, which Javascript does not have. Any ideas? @danielweck @rdeltour @llemeurfr (as users, afaik, of TypeScript in Readium...)
TypeScript, sure why not. But if "canonicalization" is an algorithm, then why not use pseudo-code in the same way that HTML5 defines processing model / parser logic? |
Well... the way I understand it, HTML's logic is how to produce a DOM entry, which is defined by... WebIDL. The current algorithm describes how the representation of the (JSON) manifest is transformed into a representation of another JSON document. Instead, we can explicitly refer to a data structure defined in some language. We could keep that target as the WebIDL explicitly in the algorithm, and that would work. (This is analogous to the HTML spec.) Except that WebIDL was not liked, so if we have a better alternative, then it may be worth taking it. It is the first type I hear of ReasonML, to be honest...
For some reason, this type typo makes me smile every time :) |
We're not making web publications, though, so I'm not sure how important a use case this is, at least at this time. But I'll try to make some changes to reflect the discussions in this thread before Monday. On the road for a couple of more days. |
I think that it is important, no matter how it is used, that our vocabulary reuses another well-known vocabulary such as schema.org. It allows the usage of the manifest in an SEO setting, if needed...
If our WebIDL were describing an internal data model / API for user agents, then WebIDL would be the right choice--and would make sense (I'd reckon) to the TAG and developers alike. Essentially, once a manifest is consumed by a UA, developers should expect compatible data representations within that UA to match the expressed WebIDL--which I think maps to what the canonical manifest was attempting to state/provide (afaict). |
Sure, I'm not suggesting we drop it. All I asked was whether it was necessary for every implementation to author json-ld, or whether the context can be inferred, and processing carried out, if a pure json file is authored. But it sounds like if we clarify the canonical "manifest" then there isn't such an issue with using compact expressions, in which case I don't care anymore. :) |
Here's my take on this:
I think that does reflect the current consensus we are heading for, except for:
I think the spec does require defining precisely how the WebIDL is created from the manifest, because there are a number of cases where different manifest forms produce the same WebIDL expression (whether or not an array is used is a typical case). This is what the current canonicalization spec section does, and it should be reformulated as producing the WebIDL instead. (I believe @mattgarrish will, eventually, change that part of the spec when he is back from traveling.)
Re. @iherman's request, I think we would have a hard time finding a better way than WebIDL for expressing the object model of a manifest in a clear way. UML class diagrams are not widely understood and are not accessible; TypeScript is a specific language, and we don't want to be language-specific. WebIDL was not created for such a thing, but it does the trick correctly.
That is correct... Nevertheless: does anybody know a precedent (not necessarily in W3C land) for a specification that uses TypeScript as such a specification language?
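(For what it's worth, a taste of what a TypeScript-based description might look like; the member names are illustrative only and are not taken from the document's actual WebIDL.)

```typescript
// A data-structure description in TypeScript, playing the role the WebIDL
// dictionaries play in the draft; names here are illustrative assumptions.
interface LocalizableString {
  value: string;
  language?: string;
}

interface Entity {
  type?: string[];
  name: LocalizableString[];
  identifier?: string[];
}

interface LinkedResource {
  type?: string[];
  url: string;
  encodingFormat?: string;
  name?: LocalizableString[];
}
```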
If we go with a "plain JSON" document, it will need its own media type so that the terms (and meaning) can be mapped to the vocabulary it's using (which would only be defined in prose in a spec). It will also need to provide its own extension mechanism--if we expect publishers to extend it with their own idiosyncratic data (and they will). In the end, there will be much repeating of what JSON-LD provides--which is why so many specifications ship JSON-based specs with JSON-LD contexts +/- their own media types when additional processing semantics are required (beyond either JSON or JSON-LD processing).
Sure, I'm not suggesting that we drop JSON-LD, just pointing out that these documents won't get processed as such. IMO using JSON-LD and having a specific media type is the way to go.
This issue was discussed in a meeting.
Manifest discussion issue

Wendy Reid: See Issue #31 "Is there a need for both an authored and a canonical manifests"
Wendy Reid: See Issue #32 "Should we use JSON schemas as part of the spec?"
Wendy Reid: See the summary of the discussions and decisions to be taken: Do we allow the usage of full JSON-LD for the publication manifest, or only a restricted "subset" (or shape) thereof? Put another way, do we expect reading systems that use the manifest to include a full JSON-LD processor? (This is, in fact, issue #32.) Do we need the differentiation (and corresponding conversion method) between an "authored" manifest and a "canonical" manifest, where the former is a simplified version of the latter (e.g., allowing the author to use simple conventions to express the manifest information in its full complexity)? (See @llemeurfr's example in #31 (comment) to illustrate it.)
Ivan Herman: we moved a bit last week and are getting to a consensus among those who discussed all this. The proposed answer to the first question is no, i.e., we would use just a specific "shape" of JSON-LD. There would be an informal reference to a JSON Schema to define that shape.
George Kerscher: A fully implemented JSON processor would be able to process this subset, and so would a reading system, so we have this covered, yes?
Ivan Herman: yes.
… the other issue is the canonical manifest. The proposed consensus is to get rid of the term 'canonical' manifest.
… There is already a (WebIDL) definition in the document that is used by the processor. Matt is working with the conversion algorithm that says here is the manifest and here is how I use it to convert it into the data structure defined by WebIDL. I think that the discussion on the issue shows that we are in agreement.
… Matt's work is not yet done, and so this will not be in the first public working draft.
… the question is if there is a consensus.
Wendy Reid: are there any comments?
Benjamin Young: the overall direction is the right one, but we do need to review the writing when it's done because it will clarify some of the confusion.
… shape is a better term in this case than subset.
Proposed resolution: (1) only a shape of JSON-LD is required; this will be further defined through an (informative) reference to a JSON schema. This should close issue #32. (2) instead of the canonical manifest only an internal data structure is used, and the canonicalization algorithm maps onto this. This closes issue #31. (Ivan Herman)
Ivan Herman: +1
Wendy Reid: +1
Luc Audrain: +1
Nellie McKesson: +1
Deborah Kaplan: +1
Matt Garrish: +1
Rachel Comerford: +1
Tim Cole: +1
Romain Deltour: +1
Mateus Teixeira: +1
Geoff Jukes: +1
Benjamin Young: +1
Marisa DeMeglio: +1
Laurent Le Meur: +1
Bill Kasdorf: +1
Joshua Pyle: +1
Resolution #5: (1) only a shape of JSON-LD is required; this will be further defined through an (informative) reference to a JSON schema. This should close issue #32. (2) instead of the canonical manifest only an internal data structure is used, and the canonicalization algorithm maps onto this. This closes issue #31.
At the moment, there is an authored and a canonical manifest, with a separate canonicalization step to transform the authored manifest into the canonical one. The goal is to allow the author to express data more succinctly (e.g., use only simple file names instead of complete LinkedResource instances, or person names instead of Person structures). It was raised, in #11, that the price being paid for having this is too high:
Question: do we want to simplify the manifest by removing this extra step and defining the manifest purely in terms of what is currently called the canonical manifest?
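(To make the contrast concrete, a hedged example of the two styles; the exact expanded form shown here is an assumption for illustration, not the draft's normative output.)

```typescript
// Authored style: shortcuts everywhere (bare strings, no arrays).
const authored = {
  name: "Flatland",
  author: "Edwin Abbott",
  readingOrder: ["chapter1.html", "chapter2.html"],
};

// Roughly what the canonicalization step would expand it into: shortcuts
// become full objects and single values become arrays.
const canonical = {
  name: [{ value: "Flatland" }],
  author: [{ type: "Person", name: "Edwin Abbott" }],
  readingOrder: [
    { type: "LinkedResource", url: "chapter1.html" },
    { type: "LinkedResource", url: "chapter2.html" },
  ],
};
```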