$schema can change across embedded resources #914

Merged
5 commits merged into json-schema-org:master on Jul 3, 2020

Conversation

handrews
Contributor

@handrews handrews commented May 9, 2020

Closes #808, closes #850

$schema is now definitively resource-scoped rather than
document-scoped, as crossing a resource boundary is the same as
following a $ref to an external resource.

jsonschema-core.xml (outdated)
resulting behavior is implementation-defined.
The "$schema" keyword SHOULD be used in the document root schema object,
and MAY be used in the root schema objects of embedded schema resources.
It MUST NOT appear in subschemas. If absent from the document root schema,
Contributor

@notEthan notEthan May 9, 2020

the wording of "subschemas" is a bit confusing to me. if I can attempt to clarify with an example

$schema: draftN
$id: root
items:
  $schema: draftM
  $id: items

items is a root schema object because it's got an $id. but is items no longer a "subschema" of the root? I feel like saying it's not a subschema isn't consistent with how the term subschema is used in the rest of the spec.

Contributor Author

Maybe "non-root schemas"? Or just "other schemas"?

Member

(IMO) "subschema" only has context when in reference to another schema. the schema at id "items" is a subschema of the schema at id "root". The schema at id "root" is not a subschema of anything (it has no parent schema).

Member

Historically, we've used "subschema" to indicate containment but not reference. Referenced schemas have not been historically labelled "subschemas."

Member

Maybe it comes down to how we define a "schema document." Is that the specific file, and that's it? Or is it the file, and all of its external references? (This pertains to the change below as well.)

Member

Agreed. Some implementations provide an interface to extract these - either as multiple documents or "bundled" together in one (potentially renaming conflicting $refs if needed). In one of my web apps I do this in a GET /json_schema/:schema_name endpoint.

Contributor Author

@karenetheridge yeah, the bundling use case was ultimately what we decided to use when figuring out what to do with $id (splitting the $anchor case out and cutting a bunch of nonsensical but syntactically legal values). And that led to the idea of $id as identifying resources as opposed to just random otherwise unremarkable schema objects.

Member

it might be useful to have a name for that "document plus all external references, transitively" concept

I would call this a "transcluded de-referenced bundle".

  • Transclusion is what is done to the schemas
  • De-referenced is the result of the process
  • Bundle is the end product descriptor

Contributor Author

@Relequestual I'll think on this. Does it need to go in now or can we file an issue for this terminology? If we adopt it (and I'm cautiously supportive), it should probably go in everywhere and I'd rather not add all of that in this PR.

Contributor Author

OK I've gone back over all of this. I am about to push a commit to address @notEthan's original question about the usage of "subschema" (I agree it is unclear).

For the transcluded de-referenced bundle thing, I have filed issue #935. Note that the discussion involved a lot more than the bundle use case, so it really needs to be discussed separately from this PR. We can add more terminology later, assuming my most recent commit addresses the actual confusion in the PR text.

@notEthan
Contributor

notEthan commented May 9, 2020

this seems to contradict the resolution of the schemas which describe a schema, as defined by the metaschema.

oof, this is hard to pick all the right words to describe properly. I'll try not to get it too wrong.

referring to this schema with two specifications (draftN, draftM) describing bits of it:

$id: bar
$schema: draftN
$defs:
  foo:
    $id: foo
    $schema: draftM

when I refer to the object at #/$defs/foo - before I know what it is or that it's even a schema - I start at the root (#), described by the draftN metaschema. in there I see that #/properties/$defs/additionalProperties is a reference resolving to draftN itself, so the schema #/$defs/foo is an instance of the draftN metaschema. but I have a $schema saying it is draftM. the metaschema has been made incorrect.

maybe the metaschema needs to change in some manner to allow either a schema which instantiates the metaschema itself, or a thing with a $schema which does not instantiate the metaschema.

$id: draftN
properties:
  $defs:
    additionalProperties:
      oneOf:
        - $recursiveRef: '#'
        - required: ['$schema']

that's not perfect; it doesn't actually say what kind of thing the object with a $schema is. but I think it's at least an improvement in that it's not giving an incorrect reference to a schema (the draftN metaschema) which does not apply to the instance (the draftM schema).

and you can't do a $ref to the $schema since that's in the instance (the schema instantiating the metaschema), and schemas (or metaschemas) can't reference instance data.

@notEthan
Contributor

notEthan commented May 9, 2020

except, of course, the keyword $schema is only defined for a schema, so if the subschema takes the second oneOf option in my weird modification above (where it has a $schema but isn't an instance of the metaschema), it's not recognized as a schema at all and $schema has no meaning. I think I must retract that idea ...

in order for the implementation to recognize that subschema #/$defs/foo is an instance of the draftM metaschema, it first has to be recognized as an instance of the draftN metaschema, and then change to no longer be that.

@handrews
Contributor Author

handrews commented May 9, 2020

@notEthan What you learn about #/$defs/foo from the draftN schema is that it is a schema object.

ALL meta-schemas, even if they don't explicitly list it, ALWAYS include the core vocabulary. This is in part because the core vocabulary is the bootstrapping vocabulary.

Processing a schema should always start with a check for $id, and if present, next a check for $schema. That is under the rules of the core vocabulary so it is correct regardless of the meta-schema.

$schema basically executes a "switch rules" step, and evaluation continues.

If that doesn't help, think of it this way: We defined $id-containing subschemas to be a schema resource, just as if it were a separate document. When it is a separate document, you already had to check for a $schema after resolving the reference.

If we're supporting changing meta-schemas while following a $ref in the middle of processing, we can do it when the schema is inline. There's no real difference.
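As an illustration of the "check for $id, then $schema, then switch rules" order described above, here is a minimal Python sketch. It is not taken from the spec or from any implementation: the function name `walk_with_dialect` and the `dialect` fallback argument are hypothetical, JSON Pointer escaping is ignored, and it naively visits every JSON object rather than only the locations where a schema can actually appear.

```python
from typing import Any, Iterator, Tuple


def walk_with_dialect(
    node: Any,
    dialect: str,
    pointer: str = "",
    is_document_root: bool = True,
) -> Iterator[Tuple[str, str]]:
    """Yield (pointer, active_dialect) for every object in the document.

    Simplification: every JSON object is treated as a potential schema.
    """
    if isinstance(node, dict):
        # The document root is always a resource root; anywhere else,
        # the presence of "$id" is what marks a resource root.
        is_resource_root = is_document_root or "$id" in node
        if is_resource_root and "$schema" in node:
            dialect = node["$schema"]  # switch rules, then keep evaluating
        yield pointer, dialect
        for key, value in node.items():
            yield from walk_with_dialect(value, dialect, f"{pointer}/{key}", False)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from walk_with_dialect(item, dialect, f"{pointer}/{index}", False)
```

Run against the `bar` / `foo` example from earlier in the thread, this reports draftN for `#` and draftM for `#/$defs/foo` and everything beneath it, which is the same answer you would get if `foo` lived in a separate document behind a $ref.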

@karenetheridge
Member

"Processing a schema should always start with a check for $id, and if present, next a check for $schema" - @handrews

Two counter-points to this:

  • $id is not required, even at the schema root - but $schema can still occur there (yes? my understanding was that $schema can only occur in subschemas where there is an $id, but either or both of $schema and $id can appear at the root)
  • $id used to be known as id in earlier drafts, so one might have to peek at $schema first to know whether id or $id should be looked at

@handrews
Contributor Author

handrews commented May 9, 2020

@karenetheridge

If you're at the document root you don't need those rules to figure out if it's a resource root because you already know it's the document root- that's why $id is not required. The check for $id is how you tell if a subschema is a resource root.

So technically not all schemas, true 😛

Regarding id, ugh. I hate draft-04. I would not be averse to saying that you can't use draft-04 or earlier in an embedded resource. That's not a horrible restriction. OpenAPI doesn't use id or $id so they won't be embeddable exactly as-is either.

Otherwise I'd say implementations would have to opt-in to supporting id as a resource identifier because it's not acceptable to reserve id in other drafts' rules.

@handrews
Contributor Author

handrews commented May 9, 2020

@karenetheridge thinking about id and draft-04, if we wanted to we could require a check for $id and/or $schema, and if $schema is present and one of the standard draft-04 or earlier meta-schemas (regular or hyper-schema), check for an id. This is convoluted and annoying but could be made to work without otherwise reserving id. It would not work with custom draft-04 or earlier meta-schemas, but custom meta-schemas were never all that useful in draft-04 and at some point we have to give up on cobbling together support for things that old. id without the dollar was an endless source of confusion given the prominence of properties named "id".
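A rough Python sketch of that check, for illustration only. The helper name `resource_identifier` is hypothetical, and the `LEGACY_META_SCHEMAS` set is an assumption that lists only the draft-03 and draft-04 standard meta-schema URIs rather than every earlier draft.

```python
from typing import Optional

# Assumption: an illustrative, non-exhaustive list of "standard draft-04 or
# earlier" meta-schema URIs, stored without the trailing "#".
LEGACY_META_SCHEMAS = {
    "http://json-schema.org/draft-03/schema",
    "http://json-schema.org/draft-03/hyper-schema",
    "http://json-schema.org/draft-04/schema",
    "http://json-schema.org/draft-04/hyper-schema",
}


def resource_identifier(schema: dict) -> Optional[str]:
    """Return the resource identifier of a schema object, if it has one.

    "id" is only consulted when "$schema" names a known legacy meta-schema,
    so "id" is not reserved under any other draft's rules.
    """
    declared = schema.get("$schema")
    if isinstance(declared, str) and declared.rstrip("#") in LEGACY_META_SCHEMAS:
        legacy_id = schema.get("id")
        return legacy_id if isinstance(legacy_id, str) else None
    modern_id = schema.get("$id")
    return modern_id if isinstance(modern_id, str) else None
```

As noted above, this cannot work for custom draft-04 meta-schemas, since their URIs would not be in the known set.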

@gregsdennis
Member

Hang on... is this explicitly allowing $schema to be used internally (not at the root), or is this only across an external reference boundary?

@notEthan's comment suggests the former, and that worries me. I think a single resource should follow a single schema draft, allowing $schema only at the root.

@handrews
Contributor Author

@gregsdennis it can be used in the resource root, which can be "internal" in the sense of not being a document root. But an embedded resource is still a different resource, it's just stuffed into the document. Here is the use case:

I have some large number of schemas. They look like this (assume that they $ref each other somewhere, as well):

{
    "$id": "https://example.com/schema/aaa",
    "$schema": "https://json-schema.org/draft/2020-06",
    ...
}
{
    "$id": "https://example.com/schema/bbb",
    "$schema": "http://json-schema.org/draft-06",
    ...
}
{
    "$id": "https://example.com/schema/ccc",
    "$schema": "http://json-schema.org/draft-07",
    ...
}

etc.

I want to bundle them in a single document for ease of distribution, which (as @karenetheridge notes) is something that there are tools for now. The result would be:

{
    "$id": "https://example.com/schema/bundled",
    "$schema": "https://json-schema.org/draft/2020-06",
    "$defs": { 
        "aaa": {
            "$id": "https://example.com/schema/aaa",
            "$schema": "https://json-schema.org/draft/2020-06",
            ...
        },
        "bbb": {
            "$id": "https://example.com/schema/bbb",
            "$schema": "http://json-schema.org/draft-06",
            ...
        },
        "ccc": {
            "$id": "https://example.com/schema/ccc",
            "$schema": "http://json-schema.org/draft-07",
            ...
        }
    }
}

This should work. The presence of an $id in a non-document-root schema means that that schema is a resource root, and therefore $schema is usable. @karenetheridge identified a problem with supporting draft-04 in such a context, but I think we can sort that out separately.
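For illustration, the bundling step can be sketched in a few lines of Python. This is not how any particular bundling tool works; the function name `bundle` and the choice to name each `$defs` entry after the last path segment of its $id are assumptions made for the example.

```python
from typing import Dict, List


def bundle(schemas: List[dict], bundle_id: str, bundle_schema: str) -> dict:
    """Embed standalone schema documents under "$defs" of a new document."""
    defs: Dict[str, dict] = {}
    for schema in schemas:
        # Without "$id" the embedded schema would not be a resource root,
        # so its "$schema" would not be usable once embedded.
        if "$id" not in schema:
            raise ValueError("only schemas with an $id can be embedded as resources")
        # Hypothetical naming scheme: last path segment of the $id.
        name = schema["$id"].rstrip("/").rsplit("/", 1)[-1]
        if name in defs:
            raise ValueError(f"duplicate $defs name: {name}")
        # Each embedded schema keeps its own "$schema" untouched.
        defs[name] = schema
    return {"$id": bundle_id, "$schema": bundle_schema, "$defs": defs}
```

Because each embedded schema keeps its own $id and $schema, existing $refs to `https://example.com/schema/bbb` and friends still resolve to the same resources after bundling.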

Note, however, that this is NOT VALID:

{
    "$id": "https://example.com/schema/whatever",
    "$schema": "https://json-schema.org/draft/2020-06",
    "properties": {
        "foo": {
            "$schema": "http://json-schema.org/draft-07",
            ...
        }
    }
}

In this example, there is no $id indicating that "#/properties/foo" is a resource root. I would consider it a bad practice to embed a resource there in the first place, but you could if you really wanted to.
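The rule illustrated by these two examples can be checked mechanically. Below is a hedged Python sketch; the function name `misplaced_schema_keywords` is hypothetical, and the walk is naive in that it inspects every JSON object, so a `const` or `enum` value that happens to contain a string-valued "$schema" member would be reported as well.

```python
from typing import Any, List


def misplaced_schema_keywords(
    node: Any, pointer: str = "", is_document_root: bool = True
) -> List[str]:
    """Return pointers to objects that use "$schema" without being a resource root."""
    found: List[str] = []
    if isinstance(node, dict):
        # Only string values count, so a property *named* "$schema" inside
        # a "properties" object is not mistaken for the keyword.
        has_schema = isinstance(node.get("$schema"), str)
        if has_schema and not is_document_root and "$id" not in node:
            found.append(pointer)
        for key, value in node.items():
            found.extend(misplaced_schema_keywords(value, f"{pointer}/{key}", False))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            found.extend(misplaced_schema_keywords(item, f"{pointer}/{index}", False))
    return found
```

For the bundled document this returns an empty list, and for the NOT VALID document it returns the pointer `/properties/foo`.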

I was fairly sure we had an extensive conversation around this stuff but admittedly it would have been quite a while ago. But it's all about the bundling use case. If we need to have the whole discussion on this again then we should do it in slack. PRs are not the place to debate fundamental direction- I wrote a PR because it had been settled.

@jdesrosiers
Member

I think allowing $schema to change across referenced or embedded schemas is absolutely the right way to go, but it occurred to me that meta-schemas can't fully express such a thing. Meta-schema references are recursive. In other words, they reference themselves. This means that the meta-schema will validate sub-schemas the same as the root schema.

{
  "$id": "https://example.com/schema1",
  "$schema": "http://json-schema.org/draft-06/schema#",
  "type": "object",
  "properties": {
    "foo": {
      "$id": "https://example.com/schema2",
      "$schema": "http://json-schema.org/draft-07/schema#",
      "if": "asdf"
    }
  }
}

When you validate this schema against the draft-06 meta-schema, semantically, it's not a schema, it's just arbitrary JSON. Neither the inner $id nor the inner $schema has any meaning, so the meta-schema doesn't know to validate the if as draft-07.

@handrews
Contributor Author

@jdesrosiers People have in the past asked for a way to restrict the draft of the $ref target as well. There are a couple of options:

  • meta-schemas don't capture everything, and once you notice an $id you should treat that the same as if you hit a $ref and tried to validate the referenced schema, and validate it separately against its own meta-schema. This would formalize the notion of crossing a resource boundary.
  • Include a special case for root subschemas that, if $id is present, only further validates $schema, and doesn't check anything else. This also formalizes the resource crossing and indicates that you need to start processing separately. But it would avoid a naive application of the meta-schema to the entire document from causing a failure.
  • Do some weird anyOf to however many past meta-schemas we care to support, pinning $schema on each branch with const (but this doesn't help with future unknown custom meta-schemas, so it's not really a very good option)
  • Hope that #911 (An alternate approach to meta-schemas) or #918 (Declare keyword use with $schema) produces a better option before we publish 😝

Note that part of the point of #849 is to give a clear description of how to process schemas and meta-schemas. Which is why all of that sort of stuff has been pulled out of where it was scattered all over and consolidated into what's now section 9.

So if we want to formalize stuff around crossing a resource boundary, that's where that goes. And to some degree I'm doing that anyway. If I can ever get back to that issue, which I've been trying to do for 2 weeks now.

For reference, #850 is the issue for this change, and #808 is the $ref-oriented one that @gregsdennis mentioned.

@jdesrosiers
Member

The only reason meta-validation works in my implementation is because I modify the schemas when they are loaded to separate embedded schemas. The schemas then get validated separately and there is no problem. But, if I validate that schema I gave before as the instance and meta-schema as the schema, it doesn't work as expected. That's why I don't think just giving guidance on how to process schemas sufficiently solves the problem.

#918 introduces a way for a schema to declare that a value is a schema without saying what kind of schema. It adds a new type value "schema". In Hyperjump Validation, I'm currently using a new keyword, validation, as a boolean flag to indicate that a value is a schema. I'm not fond of either of those options, but they do solve the problem.

@Relequestual
Member

Relequestual commented May 13, 2020

I've agreed with one suggested change.
I need a little more time (2 days) to review all of the comments. Broadly I think this is good.

@handrews
Contributor Author

The push just now is a rebase to fix conflicts- nothing else has changed, waiting on Relequestual's feedback.

@handrews
Contributor Author

@jdesrosiers I somehow didn't notice this before:

"The only reason meta-validation works in my implementation is because I modify the schemas when they are loaded to separate embedded schemas. The schemas then get validated separately and there is no problem. But, if I validate that schema I gave before as the instance and meta-schema as the schema, it doesn't work as expected. That's why I don't think just giving guidance on how to process schemas sufficiently solves the problem."

I'd actually say that's exactly how it should be processed. The embedding is essentially a... I dunno, ?transport layer? convenience. The real unit of schema-ness is the schema resource, which we didn't quite settle on for 2019-09 b/c the schema resource idea appeared near the end of that process as a solution to other things.

So that makes loading schemas a little more challenging, but once that is done, working with schema resources works just fine. In this way, the schema document with embedded schema resources is not, itself, really a schema. It's a package containing schemas, and the ideal option might be:

  • separate documents into schema resources, propagating $schema values when they are not explicitly set
  • validate schema resources separately based on their own meta-schema

There's something in there about resolving relative $id against base URIs when splitting out embedded schemas, which might make it slightly more complex, but I think a "how to load a schema document" guidance could formalize what you're doing here in a way that doesn't contradict anything else, or require a specific implementation. You could copy the split schemas, or maintain some sort of external map into the original structures, etc.
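To make the brainstorm concrete, here is a minimal Python sketch of the "external map into the original structures" variant. It is an illustration under stated assumptions, not a proposed algorithm: the name `index_resources` and its parameters are hypothetical, it assumes http(s)-style URIs that `urllib.parse.urljoin` can resolve, it treats every object with a string "$id" as a resource root (so fragment-only $id values from older drafts are not special-cased), and it walks every JSON object rather than only schema locations.

```python
from typing import Any, Dict, Optional, Tuple
from urllib.parse import urljoin


def index_resources(
    document: dict, retrieval_uri: str, default_dialect: Optional[str] = None
) -> Dict[str, Tuple[dict, Optional[str]]]:
    """Map each resource's absolute URI to (original object, effective $schema)."""
    resources: Dict[str, Tuple[dict, Optional[str]]] = {}

    def visit(node: Any, base_uri: str, dialect: Optional[str], is_root: bool) -> None:
        if isinstance(node, dict):
            node_id = node.get("$id")
            if is_root or isinstance(node_id, str):
                if isinstance(node_id, str):
                    # Relative $id values resolve against the enclosing base URI.
                    base_uri = urljoin(base_uri, node_id)
                # Propagate the enclosing $schema when none is set explicitly.
                dialect = node.get("$schema", dialect)
                resources[base_uri] = (node, dialect)
            for value in node.values():
                visit(value, base_uri, dialect, False)
        elif isinstance(node, list):
            for item in node:
                visit(item, base_uri, dialect, False)

    visit(document, retrieval_uri, default_dialect, True)
    return resources
```

Each entry in the returned map can then be validated separately against the meta-schema named by its effective $schema, without copying or rewriting the original document.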

Just brainstorming on this.

@Relequestual
Member

I can understand @jdesrosiers' concerns about meta-schema validation.

My feeling on this is we should add requirements to the following:

  • When processing a schema document, embedded schema resources (which I think you've covered how they are identified) which provide a different JSON Schema feature set identifier using $schema MUST cause an error to be thrown if the JSON Schema feature set is unknown or not supported.
  • When processing a schema document with any embedded schema resources, for the purposes of schema validation against meta-schemas (confirming the JSON Schema document is likely to be processable), embedded schema resources SHOULD be validated within their own JSON Schema feature set (using the appropriate meta-schema). For enclosing schema resources (which is likely the document root schema), an embedded resource SHOULD be considered as a valid schema document, with the value of true, for the purposes of validating the enclosing schema resource as a valid JSON Schema.
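One possible reading of the second point, sketched in Python for illustration. The function name `mask_embedded_resources` is hypothetical, and the sketch assumes the enclosing resource's meta-schema accepts boolean schemas, which is not the case for draft-04 and earlier.

```python
from typing import Any


def mask_embedded_resources(node: Any, is_resource_root: bool = True) -> Any:
    """Return a copy of one schema resource with embedded resources replaced by True.

    The copy can be checked against the enclosing resource's meta-schema,
    while each embedded resource is validated separately against its own.
    """
    if isinstance(node, dict):
        # Any nested object carrying a string "$id" starts another resource.
        if not is_resource_root and isinstance(node.get("$id"), str):
            return True
        return {key: mask_embedded_resources(value, False) for key, value in node.items()}
    if isinstance(node, list):
        return [mask_embedded_resources(item, False) for item in node]
    return node
```

Applied to the schema1 / schema2 example earlier in the thread, the copy of schema1 has `true` at `#/properties/foo`, so the draft-07-only `if` keyword is never shown to the draft-06 meta-schema.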

@Relequestual
Member

I think @jdesrosiers' approach in #918 is interesting, but we need a LOT more time to flesh that out, and we have time pressures to deliver THIS draft sooner.

@handrews
Contributor Author

@Relequestual at this point all of the small change requests have been addressed. Regarding the main conversation about how to handle a switched meta-schema:

"When processing a schema document, embedded schema resources (which I think you've covered how they are identified) which provide a different JSON Schema feature set identifier using $schema MUST cause an error to be thrown if the JSON Schema feature set is unknown or not supported."

JSON Schema has never, in any draft, required an error on an unrecognized meta-schema. As of 2019-09, you can cause an error through $vocabulary in the meta-schema, but there is already language from 2019-09 that preserved the pre-existing behavior when a schema is not recognized. This PR does not change that, so it should be out of scope for the PR.

The key principle here is that $ref-ing a schema resource and embedding it produces functionally identical behavior. (The error reporting / annotation output will look slightly different to show that a $ref was crossed, so it's technically not completely identical, but that's the only difference).

We cannot have a scenario where bundling an external reference as an embedded schema resource changes the behavior from best effort ("I have no idea what this is but I'll pretend it's the standard core+validation and give it a shot") to an error.

"When processing a schema document with any embedded schema resources, for the purposes of schema validation against meta-schemas (confirming the JSON Schema document is likely to be processable), embedded schema resources SHOULD be validated within their own JSON Schema feature set (using the appropriate meta-schema). For enclosing schema resources (which is likely the document root schema), an embedded resource SHOULD be considered as a valid schema document, with the value of true, for the purposes of validating the enclosing schema resource as a valid JSON Schema."

I'm not entirely sure that I follow, and I think we should be having this debate in an issue so I'm going to file that. This PR is effectively blocked.

@Relequestual
Member

...but there is already language from 2019-09... - @handrews

Yeah, I mean that should have been obvious. Of course.
I'll take a look at the associated issue in relation to the other parts of your response to avoid further chatter here.

@Relequestual
Member

My feeling is @jdesrosiers broadly approved of the suggested change.
It looks like all other commenters felt their concerns have been answered (or at least they haven't followed up).

This is a small change, identifying behaviour which was previously just 🤷‍♂️ (not defined), so I'm merging it.

@Relequestual Relequestual merged commit b615488 into json-schema-org:master Jul 3, 2020
@handrews handrews deleted the schema-res branch September 26, 2020 19:39
Successfully merging this pull request may close these issues.

Allow different $schema values in same document if different resources
$ref (and kin) across schema versions
6 participants