Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"$id": Eliminate base URI shadowing #726

Closed
handrews opened this issue Mar 1, 2019 · 40 comments · Fixed by #780
Closed

"$id": Eliminate base URI shadowing #726

handrews opened this issue Mar 1, 2019 · 40 comments · Fixed by #780
Labels
clarification Items that need to be clarified in the specification core

Comments

@handrews
Copy link
Contributor

handrews commented Mar 1, 2019

"URI shadowing" (h/t to @johandorland for the term) refers to the scenario when there are multiple $ids in a schema, with at least one $id in a subschema of another object containing $id (such a the root schema object).

When did this become a thing?

Here is the schema and various resolved URI examples, copied and pasted directly from the current spec:

{
       "$id": "http://example.com/root.json",
       "definitions": {
           "A": { "$id": "#foo" },
           "B": {
               "$id": "other.json",
               "definitions": {
                   "X": { "$id": "#bar" },
                   "Y": { "$id": "t/inner.json" }
               }
           },
           "C": {
               "$id": "urn:uuid:ee564b8a-7a87-4125-8c96-e9f123d6766f"
           }
       }
   }

   The schemas at the following URI-encoded JSON Pointers [RFC6901]
   (relative to the root schema) have the following base URIs, and are
   identifiable by any listed URI in accordance with Section 5 above:

   # (document root)

         http://example.com/root.json

         http://example.com/root.json#

   #/definitions/A

         http://example.com/root.json#foo

         http://example.com/root.json#/definitions/A

   #/definitions/B

         http://example.com/other.json

         http://example.com/other.json#

         http://example.com/root.json#/definitions/B

   #/definitions/B/definitions/X

         http://example.com/other.json#bar

         http://example.com/other.json#/definitions/X

         http://example.com/root.json#/definitions/B/definitions/X

   #/definitions/B/definitions/Y

         http://example.com/t/inner.json

         http://example.com/t/inner.json#

         http://example.com/other.json#/definitions/Y

         http://example.com/root.json#/definitions/B/definitions/Y

   #/definitions/C

         urn:uuid:ee564b8a-7a87-4125-8c96-e9f123d6766f

         urn:uuid:ee564b8a-7a87-4125-8c96-e9f123d6766f#

         http://example.com/root.json#/definitions/C

Notice that all of the locations (the list section headers) are done as JSON Pointer fragments, relative to the root of the entire example schema.

Prior to draft-handrews-json-schema-01, the analogous example section did not show any examples with JSON Pointer fragments. Using the exact same schema, it showed these resolved URIs only:

   # (document root)  http://example.com/root.json#

   #/definitions/A  http://example.com/root.json#foo

   #/definitions/B  http://example.com/other.json

   #/definitions/B/definitions/X  http://example.com/other.json#bar

   #/definitions/B/definitions/Y  http://example.com/t/inner.json

   #/definitions/C  urn:uuid:ee564b8a-7a87-4125-8c96-e9f123d6766f

But you can see that the locations were done the same way- as a relative JSON Pointer fragment from the overall document root.

The key point here is that we added examples of URI shadowing in handrews-*-01, which was a clarification of draft-07 (draft-07 was originally published as handrews-*00). So we didn't even introduce these with draft-07. There was some confusion over it (and other aspects of $id and $ref) and we added the examples when we clarified the text.

Why did we do this?

Good question. The change was made in PR #550, which was primarily about removing the terminology around "internal references" vs "external references". Which had itself been a great improvement by @awwright over the draft-04 language around "inline resolution" and "canonical resolution". None of these prior approaches directly explained how JSON Pointer fragments work across the presence of $id.

The issue mentioned by that PR only mentions JSON Pointer fragments once, at the end, after the PR was submitted: #545 (comment)

I know I had questions over how that should be handled when I tried implementing what were then the new draft-06 proposals. And I know that I ended up implementing URI shadowing. I don't remember why.

Quite a few people weighed in on and reviewed #545 and #550, so this isn't something that slipped in by accident.

Did anyone rely on this before handrews-*-01?

No clue.

Did anyone actually implement what's in handrews-*-01?

I think @johandorland did? He commented on the issue/pr. Not sure if anyone else did. @Julian?

Does the test suite test this?

No.

So, really, why did we do this?

I think the rationale was that since the locations in the existing example were given as JSON Pointer fragments, that must mean that those fragments were valid, which would definitely imply that every parent $id introduces a base URI that must be tracked throughout all subschemas, even if another $id appears.

Can we just not?

I think so. JSON Pointer fragment evaluation is just defined in terms of JSON document structure, without any thought given to changing base URIs or embedding one document in another. But media types get to specify fragment syntax and semantics, so I think we can reasonably say that it stops when the base URI is reset, and from that point on, the prior base URI no longer applies at all.

That is actually how we interpret plain name fragments. Note in this part of the example:

   #/definitions/B/definitions/X

         http://example.com/other.json#bar

         http://example.com/other.json#/definitions/X

         http://example.com/root.json#/definitions/B/definitions/X

the #bar plain name fragment can be used with the innermost base http://example.com/other.json, but not with the outer base http://example.com/root.json.

So restricting the use of JSON Pointer fragments to not cross $id boundaries would make their behavior more consistent with how we handle plain name fragments.

Similar to #724, this considers $id to establish a document boundary, with the new base URI it creates applying to the document within that boundary. Unlike #724, what is proposed here does not make any other changes to $id's behavior. In particular, the effect of an $id with a JSON Pointer fragment remains undefined, although we can address that separately later if we want to.

So what would this look like?

There are two options WITHIN THE SCOPE OF THIS ISSUE. If you want to talk about other options, file them yourself :-) Comments that go off-topic here will be deleted.

Here's the example schema again so you don't have to scroll so much:

{
       "$id": "http://example.com/root.json",
       "definitions": {
           "A": { "$id": "#foo" },
           "B": {
               "$id": "other.json",
               "definitions": {
                   "X": { "$id": "#bar" },
                   "Y": { "$id": "t/inner.json" }
               }
           },
           "C": {
               "$id": "urn:uuid:ee564b8a-7a87-4125-8c96-e9f123d6766f"
           }
       }
   }
}

Option 1: Exactly one valid URI involving a JSON Pointer fragment for each location:

   # (document root)

         http://example.com/root.json

         http://example.com/root.json#

   #/definitions/A

         http://example.com/root.json#foo

         http://example.com/root.json#/definitions/A

   #/definitions/B

         http://example.com/other.json

         http://example.com/other.json#

   #/definitions/B/definitions/X

         http://example.com/other.json#bar

         http://example.com/other.json#/definitions/X

   #/definitions/B/definitions/Y

         http://example.com/t/inner.json

         http://example.com/t/inner.json#

   #/definitions/C

         urn:uuid:ee564b8a-7a87-4125-8c96-e9f123d6766f

         urn:uuid:ee564b8a-7a87-4125-8c96-e9f123d6766f#

This is a lot more straightforward, and all of these addresses work whether the schemas were loaded as one document (as in this example), or if each of root.json, other.json, and inner.json had been loaded separately, and $ref-ed each other.

Option 2: But what about the URI used to fetch the document?

If it's possible to retrieve this example from http://example.com/alias.json (as opposed to or in addition to its declared $id of http://example.com/root.json), then should fragments also work from that base?

I can see it either way.

One option is to consider fetching this from alias.json and discovering that its declared $id is root.json is to conceptualize this as a redirect, in which case the fragment is applied to the redirect target (at least in HTTP). h/t to @jdesrosiers for this mental model. In this sense, anywhere we have root.json we would also require supporting alias.json

On the other hand, JSON Schema repeatedly talks about URIs as identifiers rather than locators, so I think we could rationally say that the $id is how the schema needs to be referenced. It's fine if it's fetched from elsewhere outside of the reference process (like being pre-loaded from a local filesystem) but does that mean that we should support those loading URIs when an $id is provided? (obviously if no $id, then the loading URI is the only available base).

Supporting the loading URI as a shadowed base, but nothing else, might be a bit odd, or it might be more intuitive. I really have no idea at this point 😆

@awwright
Copy link
Member

awwright commented Mar 1, 2019

I'm not sure this improves schema authoring or writing implementations. As far as I can tell, this would mean I'd have to add a condition to implementations saying "Evaluate JSON Pointer fragments until an $id is encountered, so as to make anything under a subschema with $id inaccessible."

At that point, it seems like we just want to get rid of JSON Pointer fragments altogether.

If I've got this right, why would this be desirable?

@handrews
Copy link
Contributor Author

handrews commented Mar 1, 2019

@awwright I found having to manage a stack of in-scope base URIs and the current relative position to each to be one of the most complicated things to implement. But @johandorland can perhaps speak to a more real-world scenario.

@handrews
Copy link
Contributor Author

handrews commented Mar 1, 2019

@awwright so far what I have established with recent issues is that there are people who think every possible thing to do or not do with $id is horrible and obviously wrong, and no one particularly likes any of them more than barely tolerating the idea.

I should probably just ignore all comments/requests/complaints regarding $id, I guess. I have no idea at this point.

@handrews
Copy link
Contributor Author

handrews commented Mar 2, 2019

@awwright

At that point, it seems like we just want to get rid of JSON Pointer fragments altogether.

Do you mean in $id or just in general? Getting rid of fragments in $id was part of #719 but I filed this more modest version due in part to pushback there. Although I suppose I should re-file the full version for proper comparison, separated from the back-and-forth in #719.

If I've got this right, why would this be desirable?

The most important question, I think, is "was this ever intended to be a feature at all?"

We added it to examples just in the most recent raft for... um... reasons? Honestly I cannot tell why. It's not clear to me that it was ever intentionally made a feature.

I basically filed this issue because I think I made a mistake putting the examples in. Do you remember why that seemed like a good idea at the time?

@johandorland
Copy link
Collaborator

The way I look at this shadowed base URIs are not really a feature, but more of an unintended consequence. In the past I've had a hard time explaining to people in the JSON schema slack that a single part of the schema can be addressed by an arbitrary number of $refs, based on how many levels of nested $ids there are.

Conceptually it'd be nice if every part of a schema only has a single URI it can be $reffed by. However even in your example this is not the case as http://example.com/other.json#bar also remains accessible by http://example.com/other.json#definitions/X as it has the same base URI. Even so as long as every part of the schema is only allowed to have one base URI it'd make it easier to parse the JSON by walking over the entire JSON structure and register its $reffable URI in a lookup map, so when you encounter a $ref in the second pass you know every valid $ref is in that lookup map.

@awwright has a very good point though. Currently the way I and I suspect most others to this first pass is to register all different base URIs contained in the JSON and when a $ref is encountered it looks at its base URI and then gets the proper JSON by using the JSON pointer fragment. It's kind of odd to say use JSON pointer for the fragment, but then amend it by saying it doesn't always work across $id boundaries.

So if anyone thinks Option 1 is a good idea, we should just say that the behaviour of shadowed base URIs is undefined. That way an implementation is a bit more free in how it wants to implement behind the scenes, but others do not have to go out of their way to explicitly disable it.

Even though I suggested this I'm not actually sure it's a good idea. Even though I don't like shadowed base URIs, the cure might be worse than the problem itself as I'm not sure how we disallow it in a proper manner without making it more complicated.

All I have to say about option 2 is that I don't get the mental model described, so I can't really comment on that.

@handrews
Copy link
Contributor Author

handrews commented Mar 2, 2019

@johandorland

It's kind of odd to say use JSON pointer for the fragment, but then amend it by saying it doesn't always work across $id boundaries.

I think it's odd that if I embed a document using $id, it suddenly becomes accessible by one or more sets of URIs that are not available if it is hosted separately.

However even in your example this is not the case as http://example.com/other.json#bar also remains accessible by http://example.com/other.json#definitions/X as it has the same base URI.

But that's intentional with a clear use case. There are two defined fragment syntaxes, and the plain name syntax is for labeling schemas as kind of analogous to exported function from a module. They are independent of the schema's location within the document, so you can refer to them from outside the document without worrying if the internal structure gets rearranged somehow.

Note that you cannot use plain name fragments across $id boundaries, and I still think that is the more sensible behavior, but 🤷‍♂️

@jdesrosiers
Copy link
Member

👍 for Option 1

As far as I can tell, this would mean I'd have to add a condition to implementations saying "Evaluate JSON Pointer fragments until an $id is encountered, so as to make anything under a subschema with $id inaccessible."

I'd be ok with calling this behavior "undefined" as @johandorland suggests. We can let linters check for document boundary violations. It'd be no worse than what JSON does with calling the behavior of duplicate properties "undefined".

At that point, it seems like we just want to get rid of JSON Pointer fragments altogether.

I have no idea what you mean by this. I'm sure it can't mean what it sounds like.


Option 2

I've always found it awkward to put $id at the root of a document. (Apparently I'm the only one because it seems to be accepted as best practice today.) Coupling a representation's identifier with it's representation makes it harder to change that identifier or have alias identifiers. HTML documents don't declare their identifiers. They are identified by the URI used to retrieve them. I don't see why JSON Schema should be any different.

The approach I prefer is that it just doesn't make sense to put an $id at the root. Doing so would be like having a schema that is nothing but a $ref. Any pointer would pointer across a document boundary and the value of the document would be meaningless.

However, the approach that seems to make most sense given the path JSON Schema has chosen, is to consider $id the one and only way to identify a schema.

I think defining an $id at the root as a redirect or some other kind of alias is a reasonable compromise.

@awwright
Copy link
Member

awwright commented Mar 3, 2019

Do you mean in $id or just in general?

Not using JSON Pointer at all, only named schemas. (I'm not sure it's viable to remove outright, but at least for authoring I don't see any reason to use pointer fragments anymore.)

@johandorland
Copy link
Collaborator

I've always found it awkward to put $id at the root of a document. (Apparently I'm the only one because it seems to be accepted as best practice today.) Coupling a representation's identifier with it's representation makes it harder to change that identifier or have alias identifiers. HTML documents don't declare their identifiers. They are identified by the URI used to retrieve them. I don't see why JSON Schema should be any different.

The major difference between HTML and JSON schema is that HTML is online by default whilst JSON schema is not. It's even discouraged that remote references are downloaded by default. Having identifiers within each JSON document makes it very helpful to keep track of them. In my opinion https://json-schema.org/latest/json-schema-core.html#rfc.section.8.3.1 explains this very well. You can just pass a schema to an implementation and it will sort out all the references and you don't have to worry about telling which schema is identified by what URI.

@jdesrosiers
Copy link
Member

@johandorland

The major difference between HTML and JSON schema is that HTML is online by default whilst JSON schema is not. It's even discouraged that remote references are downloaded by default.

I don't think it makes any difference whether it's remote or local. Either way you are retrieving schemas from somewhere.

@awwright

Not using JSON Pointer at all, only named schemas.

I'm still not entirely clear, but I'm assuming that by "named schemas" you are referring to the location-independent identifiers as currently defined in the spec. If so, that's interesting. It's a little less efficient than using pointers because you would have to scan the document for all the named schemas before you could process the schema. But if you support named schemas at all, you have to do that anyway. Without pointers, schemas wouldn't be able to arbitrarily reference any part of a schema. Especially with regard to remote references, this might be a good thing. It makes only part of a schema "public" (can be $reffed from another schema). Schema authors can change their schemas with confidence that they aren't breaking another schema that depends on it.

As for removing JSON Pointer support entirely, I'll point out that JSON Pointer (or something similar) is required for my use-case of JSON Reference. My use case doesn't involve JSON Schema, so it's not relevant by itself, but with the introduction of vocabularies we might not want to keep options open. If my use-case requires JSON Pointer, some vocabulary might need it as well.

@awwright
Copy link
Member

awwright commented Mar 3, 2019

But if you support named schemas at all, you have to do that anyway.

Also consider: All parsers have to do some sort of indexing. If you're a JSON parser, you're indexing a byte stream into a set of properties or items. We're just so used to the JSON parser doing it for us that when we add an additional thing to index ("$id"), it seems foreign: We have to take the parsed result, and scan over it again.

If you were parsing a JSON Schema document using a SAX-like event stream instead, it wouldn't look like an additional step at all. It would just be an event you handle, just like all the others.

@handrews
Copy link
Contributor Author

handrews commented Mar 3, 2019

@awwright

If you were parsing a JSON Schema document using a SAX-like event stream instead, it wouldn't look like an additional step at all. It would just be an event you handle, just like all the others.

🐡 Mind blown! (there's no github exploding head emoji, so... blowfish 🤷‍♂️ )

@handrews
Copy link
Contributor Author

handrews commented Mar 3, 2019

@johandorland wrote:

The major difference between HTML and JSON schema is that HTML is online by default whilst JSON schema is not. It's even discouraged that remote references are downloaded by default. Having identifiers within each JSON document makes it very helpful to keep track of them. In my opinion https://json-schema.org/latest/json-schema-core.html#rfc.section.8.3.1 explains this very well. You can just pass a schema to an implementation and it will sort out all the references and you don't have to worry about telling which schema is identified by what URI.

Yes, this!

@jdesrosiers without root $ids, I would be unable to make use of JSON Schema.

Most of my career has been in enterprise infrastructure appliances and/or cloud-hosted services. The appliances are frequently behind firewalls with limited external access, and a strong resistance to opening additional routes through the firewall.

With managed service providers and the largest enterprise customers, even with cloud-hosted products they want their own instance in their private cloud with their own security controls.

This is an even more stringent requirement when selling to the government or other highly regulated spaces, where the product must be "air-gapped", meaning that it cannot rely on a connection to the outside internet in any way.

So for all of these use cases, while the schemas are hosted on the manufacturer's public web site, and their $id URIs match those URLs, many customers cannot access those URLs at all. Their schemas are either pre-packaged (the easy case), or they are directed to download them from their local product instance (using an HTTP describedBy link). Which is slightly more complicated because if you have documented that $refs should be downloaded as needed for your particular application, then those need to be relative URIs so that the different retrieval base URI does not cause problems.

Furthermore, during development, schemas are often written as YAML or JavaScript and loaded from a local filesystem, from files with a .yaml or .js (or .json5, etc.) extension so that editors work properly. In that case as well, it is important that the root $id (which has the intended public URL and no file extension) override the retrieval URI.

Root $ids are critically important to decouple identity from location, which is a mandatory feature for many important use cases.

@handrews
Copy link
Contributor Author

handrews commented Mar 3, 2019

@awwright @johandorland @jdesrosiers my preference at this point would be to make the behavior of shadowed JSON Pointer fragments undefined. The philosophical justification is:

It should (SHOULD?) be possible to move any document embedded using $id out into its own resource, replacing it with a $ref in its original location, without breaking references to that document.

This is not possible if you are using shadowed JSON Pointer references, so the behavior of such references is necessarily undefined.

This avoids imposing a further burden on implementors, but makes clear why schema authors should not do this regardless of any one implementation's support. For use cases where schemas are sometimes bundled into a single file, this ensures that the same references will work in both bundled and unbundled states.

How does that sound?

@awwright
Copy link
Member

awwright commented Mar 3, 2019

@handrews Consider this brief aside for a moment: The W3C TAG published this a long time ago saying (in brief) you shouldn't parse arbitrary data as a certain media type unless you know it was given that media type by the server: https://www.w3.org/2001/tag/doc/mime-respect

Well there's a similar thing here, should we know affirmatively that the inner instance being pointed to is, in fact, a schema (and not a random value)?

For example, should this be legal (even though the $ref isn't pointing to an item known to be a schema):

{
   $id: "#",
   properties: {
      "foo": { const: { type: "string" } },
      "bar": { $ref: "#/properties/foo/const" }
   }
}

@jdesrosiers
Copy link
Member

@handrews

Most of my career has been in enterprise infrastructure appliances and/or cloud-hosted services. The appliances are frequently behind firewalls with limited external access, and a strong resistance to opening additional routes through the firewall ...

That's all well and good, but it's completely irrelevant to the point I was making. My point applies to whether schemas are network accessible, or on a local file system, or in memory. It doesn't matter. There's always some sort of locator. Even if the $id identifies the document, it has to go into some kind of storage that has some sort of index that allows users to retrieve it based on that identifier. You have to retrieve the schema from somewhere. How we address those schemas is irrelevant to whether the system is air-gapped or on a public network.

Ultimately, which addressing approach you think is better is going to depend on your use-case. Every use-case I want to use JSON Schema for involves the schema being part of a web-based system (such as a Hypermedia API). It might be a closed network, but it's still web based. Web addressability is essential in these use cases and $ids don't make sense any more than an HTML representation identifying it self. So of course I prefer web addressability over self identification. Your use-cases may differ and lead you to prefer the other.

It's probably best that JSON Schema continue to support and encourage the use of both depending on what makes the most sense for the use case. Therefore, I would argue against ...

we could rationally say that the $id is how the schema needs to be referenced

@handrews
Copy link
Contributor Author

handrews commented Mar 3, 2019

@awwright You're packaging up several things in an effort to come up with a counter-example. It's a good counter-example, and a good reference drawing a parallel with Content-Type, but we actually just sorted this out :-)

The way you tell whether the "server" (schema author) as labeled an object or boolean as a schema is by noting whether it is being applied by an applicator keyword, or stored in a schema placeholder keyword ($defs / definitions).

Obviously, in the case of nested subschemas, they all need to be chained together by applicators / schema placeholders (I just made up that term, it's not great, everyone please roll with it rather than derailing onto the topic of what to call $defs thanks).

In PR #713 makes it clear that being a $ref target does not automatically coerce something into being a schema. It explicitly addresses the case of a $ref target being under an unknown keyword (you can't tell whether the target is really supposed to be a schema, so the behavior is undefined). That should imply that a target that is known not to be a schema (in this case const is known to not be an applicator) is not treated as a schema. I can add a sentence to make that explicit.

There is more discussion on this in issue #687, and there was even more on slack.

If we continue in the current direction for another draft, we will formalize how an implementation can recognize extension applicators/placeholders, as that is the only piece missing. Currently, it either needs to be a standard keyword, or one that an implementation understands as a known extension.

@handrews
Copy link
Contributor Author

handrews commented Mar 3, 2019

@jdesrosiers I don't know how to respond to an objection of "I don't care about your use case or your entire industry's use case." You're far from being one of the most difficult offenders here, but this week has been a constant stream of people telling me that only their use case matters and everything else should be thrown out, so I'm probably way more angered by this position today than I otherwise would be (in general, not at you personally).

At this point, no one is pushing hard to forbid using the retrieval URI in addition to the root $id, and we can sort out the details of that next. For now, I am just trying to get consensus on the rest of it.

@handrews
Copy link
Contributor Author

handrews commented Mar 3, 2019

I'm seriously going to write a code of conduct and it will include THOU SHALT NOT DISPARAGE OR DISREGARD ANYONE ELSE'S USE CASE. 😝

@jdesrosiers
Copy link
Member

@handrews
Wow, you completely misunderstood me. I'm not even sure how you got all that from what I wrote. I was not dismissing anyone's use-case. In fact, part of my response was that everyone's use-case is valid and that we should support all of them. I have preferences based on my use-cases and so do you.

When I said your response was irrelevant, I wasn't saying that your use-case was irrelevant. I was saying that you were making a counter-argument for something that wasn't what I was arguing. It's like I said, "Dogs are great companions" and you responded with, "No way! Carrots taste great and they're good for you too. I eat carrots all the time". That's the kind of irrelevant I was talking about.

To attempt again, to clarify, I am not suggesting that JSON Schema should always be network accessible. Networks had nothing to do with what I was talking about.

At this point, no one is pushing hard to forbid using the retrieval URI in addition to the root $id

I never thought anyone was. You mentioned it as an option. It was relevant to the discussion, so I mentioned it. If I came off as defensive or combative, I'm sorry.

@handrews
Copy link
Contributor Author

handrews commented Mar 4, 2019

@jdesrosiers

Wow, you completely misunderstood me. I'm not even sure how you got all that from what I wrote.

Sigh. I do apologize. It's been a really long week for me with this project and I'm more than a little on edge. And trying to push through anyway because I know at least some people want this draft despite recent assertions by a vocal minority to the contrary. I've been told a lot of stuff I've spent a lot of time and effort trying to get right over the past two years is irrelevant or harmful in the last week, so that was not a fun sentence to read right now. Not your fault.

That said, I do not see my response as irrelevant. The presence or absence of a network was not the point I was trying to make. The point is that when you say:

There's always some sort of locator

I am responding that, based on my numerous use cases with real, shipping products, there is no usable locator. The problem is not that there is no network-accessible locator. It's that no possible locator is usable in place of $id.

@Relequestual
Copy link
Member

The way I look at this shadowed base URIs are not really a feature, but more of an unintended consequence. In the past I've had a hard time explaining to people in the JSON schema slack that a single part of the schema can be addressed by an arbitrary number of $refs, based on how many levels of nested $ids there are.

@johandorland

To me, this is key, and is the main issue various people have been talking about with regard to $ref / $id support, as I understand.

It's been created as a dynmaic system to allow flexibility, allowing people to create converluted situations. It's not clear that a specific subset of those situations have a use case, which creates authoring issues, and may create implementation issues.

@Relequestual
Copy link
Member

I'm in favour of reducing the complexity and possible situations here, and waiting to see if anyone says this would cause issues for them.

I would note, I would prefer to avoid any more "...behaviour is undefined" situations. If I understand correctly, that isn't the objective here, but just in case I haven't understood everything quite right.

@Relequestual
Copy link
Member

I'd like to give one or two more (different) people a chance to comment, before moving this to Accepted status.

@handrews
Copy link
Contributor Author

handrews commented Mar 4, 2019

@Relequestual

I would note, I would prefer to avoid any more "...behaviour is undefined" situations. If I understand correctly, that isn't the objective here, but just in case I haven't understood everything quite right.

What is the motivation for requiring implementations to error out? It's definitely nicer, but @awwright did raise a concern that that might be harder than just letting it work for some implementations.

@johandorland
Copy link
Collaborator

johandorland commented Mar 5, 2019

@handrews Consider this brief aside for a moment: The W3C TAG published this a long time ago saying (in brief) you shouldn't parse arbitrary data as a certain media type unless you know it was given that media type by the server: https://www.w3.org/2001/tag/doc/mime-respect

Well there's a similar thing here, should we know affirmatively that the inner instance being pointed to is, in fact, a schema (and not a random value)?

For example, should this be legal (even though the $ref isn't pointing to an item known to be a schema):

{
   $id: "#",
   properties: {
      "foo": { const: { type: "string" } },
      "bar": { $ref: "#/properties/foo/const" }
   }
}

At first glance I'd think sure why not, then I realized by using these tactics you can do some even nastier things. You'd probably break a lot of JSON schema implementations once you start putting $id and $ref in a const block and then ref into them like in your example.

It's probably off-topic for this issue, but it might be worth to consider to forbid these kind of references in one way or another.

@Relequestual
Copy link
Member

@handrews

@Relequestual

I would note, I would prefer to avoid any more "...behaviour is undefined" situations. If I understand correctly, that isn't the objective here, but just in case I haven't understood everything quite right.

What is the motivation for requiring implementations to error out? It's definitely nicer, but @awwright did raise a concern that that might be harder than just letting it work for some implementations.

My understanding is, we want to make this change because of two main reasons:

  • It is confusing for spec authors to understand
  • It is hard for implementations to implement

I realise I'm not the average JSON Schema author, but the reference examples you pasted from the current spec in your opening issue comment here, seem logical and OK to me. I can see those examples, look at each one, and can understand the resultion reasoning. In fact, this list of examples was a great help in getting me to understand.

I'm not an implementer. I don't maintain an implementation, nor am I likely to.

I can't speak to etiher of those use cases, but if the reason for the change is because it's confusing, making it "not officially supported" but not likely to "throw an error", feels to me like it would create even more confusion?

Maybe because it's so unliekly that anyone is doing this, it won't be a problem (just like how it's uncommon for someone to have duplicate keys in a JSON object).

Maybe I'd feel easier about it if we were to add something like.
"..behaviour is undefined, however an error MAY be thrown.", to suggest it would be beneficial to throw an error, but not an outright requirement. (Personally I'd want to make that a SHOULD, but then I also don't fully understand the cost to the implementer.)

@Relequestual
Copy link
Member

OK, I've had a discussion with @handrews about this off github to make sure we're on the same page.

I would propose that
"...the behaviour is undefined, however it's RECOMMENDED that an error is thrown to avoid silent unexpected behaviour."

It looks like @awwright was the only one to propose that this would cause a problem. Can you expand on the reasons for this please?

After discussion with @johandorland (and please correct me if I'm wrong), he doesn't feel it would cause an issue for his implementation (just to take an example from the other end).

@handrews
Copy link
Contributor Author

handrews commented Mar 5, 2019

I think @KayEss did not implement shadowing at all in his code, so might offer another useful counterpoint.

(and yes, I support what @Relequestual wrote above)

@KayEss
Copy link

KayEss commented Mar 6, 2019

It's kind of odd to say use JSON pointer for the fragment, but then amend it by saying it doesn't always work across $id boundaries.

I think it's odd that if I embed a document using $id, it suddenly becomes accessible by one or more sets of URIs that are not available if it is hosted separately.

I don't find this very convincing. If I mirror your schema somewhere else then it is available under two URLs. If you mirror the schema inside another schema why would you expect any difference?

@jdesrosiers
Copy link
Member

jdesrosiers commented Mar 6, 2019

@KayEss, I think you're missing the point. Assume we have the following two schemas.

http://example.com/foo

{
  "type": "object",
  "properties": {
    "aaa": { "$ref": "/bar" }
  }
}

http://example.com/bar

{ "type": "string" }

Now assume we want to embed the referenced /bar in /foo. We end up with the following.

http://example.com/foo

{
  "type": "object",
  "properties": {
    "aaa": {
      "$id": "/bar",
      "type": "string"
    }
  }
}

Now consider the URI http://example.com/foo#/properties/aaa/type with regard to both the first and the second version of /foo. In the first, the JSON Pointer points nowhere. In the second, it points to "string". That inconsistency is a smell that something is wrong. The two representations should be equivalent.

URI fragments indicate a fragment of the document. If we allow JSON Pointers to "see inside" $id, that URI fragment is indicating a fragment of a separate document, which doesn't make sense. It doesn't make sense for any part of /bar to be a fragment of /foo.

@KayEss
Copy link

KayEss commented Mar 6, 2019

Now consider the URI http://example.com/foo#/properties/aaa/type with regard to both the first and the second version of /foo. In the first, the JSON Pointer points nowhere. In the second, it points to "string". That inconsistency is a smell that something is wrong. The two representations should be equivalent.

I don't agree with that. I think that the documents are different (in one you have more stuff) and so a user visible change in behaviour is reasonable.

I don't think that we should try to specify some sort of "smart" or "multi" document format within JSON. If you want to mirror a schema inside another that opens up aliasing, just as it does if I want to mirror your foo schema at http://kayess.example.com/. Now the schema can be accessed through either URL, and I think that's fine. We shouldn't try to impose some sort of canonicalisation on the URLs that we allow to be used. I think it'll be hard to specify, much more complex and hard to test as it'll for sure open up a ton of new edge cases.

This shadowing thing should just remain undefined behaviour for now.

@handrews
Copy link
Contributor Author

handrews commented Mar 6, 2019

@KayEss

This shadowing thing should just remain undefined behaviour for now.

That's what we're trying to do- make it undefined (or possibly an error, but the momentum seems to be with undefined). Right now it is required that all of the aliases work.

@jdesrosiers I'm still more or less with you on the embedded document thing, but this issue doesn't decide for or against that model. I wanted to split things up and see if we could even get agreement on this part (which is apparently at least a bit challenging on its own).

@jdesrosiers
Copy link
Member

@handrews

I'm still more or less with you on the embedded document thing, but this issue doesn't decide for or against that model.

Fair enough, but I feel like the justification for this proposal hinges on the concept of $id as an embedded document. Otherwise @KayEss is right, if it's all considered the same document, it's hard to justify there should be a boundary within a single document.

@KayEss

We shouldn't try to impose some sort of canonicalization on the URLs that we allow to be used

To be clear, this is far from my goal. This (for me) is all about the principle that fragments are not allowed to reach from one document to another. If you don't accept the $id as an embedded document premise, then that principle is not violated.

I think it'll be hard to specify, much more complex and hard to test as it'll for sure open up a ton of new edge cases.

I strongly disagree with this. I think the embedded document model is far more simple in all respects than the base URI change model. The concepts you have to understand are few, simple, and consistent. Which leads to a very simple implementation with few edge cases.

@jdesrosiers
Copy link
Member

Leaving aside the embedded document perspective, there is another problem with the base URI change model. It allows one identifier to identify two different documents. Taking this example again.

http://example.com/bar

{ "type": "string" }

http://example.com/foo

{
  "type": "object",
  "properties": {
    "aaa": {
      "$id": "/bar",
      "type": "string"
    }
  }
}

http://example.com/foo#/properties/aaa/type, refers to both /foo and /bar. One URI, two documents. This conflict is fixed by this proposal whether you accept the embedded document model or not.

@KayEss
Copy link

KayEss commented Mar 7, 2019

http://example.com/foo#/properties/aaa/type, refers to both /foo and /bar. One URI, two documents. This conflict is fixed by this proposal whether you accept the embedded document model or not.

I guess we both agree that if you replace the embedded schema at aaa with a $ref to http://example.com/bar then the url http://example.com/foo#/properties/aaa/type is no longer valid. I guess I'm just less convinced that if you embed it instead that the aliasing matters.

I do like the embedded document idea is a metaphor to explain things, but I'm not convinced that the metaphor should suggest this sort of change to the aliasing rules.

Full disclosure though, my implementation handles this aliasing just fine, and banning it would be a significant amount of extra code.

@handrews
Copy link
Contributor Author

handrews commented Mar 7, 2019

@KayEss

Full disclosure though, my implementation handles this aliasing just fine, and banning it would be a significant amount of extra code.

Hence the likelihood of declaring that behavior to be undefined, which is standards-ese for "you really shouldn't use this and definitely can't rely on it but sometimes it might work"

Basically, I'd like to discourage its use by schema authors without requiring implementations to rip it out.

@handrews
Copy link
Contributor Author

handrews commented Apr 1, 2019

I'm going to write a PR for this, possibly in combination with other $id things as we have several that all affect each other fairly directly.

@handrews
Copy link
Contributor Author

I don't think I have it in me to get this through in draft-08

@handrews
Copy link
Contributor Author

OK, I read back through all of this and think we all did a really great job of sorting through the issues and settling on the "shadowed URIs have undefined behavior" change.

Let's get this in draft-08 after all: I'm writing the PR right now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clarification Items that need to be clarified in the specification core
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants