Aligning thoughts and outcomes #8
-
I think you're missing the point. It's not that I want to see JSON Schema be the intermediate DDL format, it's that I don't see the need for an intermediate DDL format at all. An implementation might have one, but that's an implementation detail. It's not necessary for it to be any specific format. The test suite can specify the characteristics of the expected output without coupling us to any kind of DDL.
I don't think we're going to have to specify results for specific languages. Type systems have a fairly fixed set of features, so we just need to define what to do given the availability of specific features. For example, if the type system supports union types, do "A"; otherwise, fall back to "B".
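To make the fallback idea concrete, here's a minimal sketch in Scala-ish pseudocode; the `TypeSystemCapabilities` flag and `renderUnionType` helper are hypothetical names invented for illustration, not taken from any existing tool:

```scala
// Hypothetical capability flags describing what a target type system supports.
case class TypeSystemCapabilities(supportsUnionTypes: Boolean)

// "If the type system supports union types, do A, otherwise fall back to B."
def renderUnionType(members: List[String], caps: TypeSystemCapabilities): String =
  if (caps.supportsUnionTypes)
    members.mkString(" | ")          // A: a native union, e.g. "String | Int"
  else
    "Union" + members.mkString("Or") // B: a named wrapper type, e.g. "UnionStringOrInt"
```

A real generator would also have to emit the fallback type's definition, of course; the point is just that the decision is driven by a declared capability rather than by hard-coding each target language.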
-
Here's my example. First, let's look at the example test suite.

```json
[
  {
    "description": "A simple example",
    "schemas": [
      {
        "$id": "https://example.com/foo",
        "$schema": "https://json-schema.org/draft/2020-12/idl-schema",
        "package": "com.example",
        "name": "Foo",
        "type": "object",
        "properties": {
          "aaa": { "type": "string" },
          "bbb": { "type": "boolean" }
        }
      }
    ],
    "tests": [
      {
        "assertion": "hasClass",
        "arguments": ["Foo"],
        "tests": [
          {
            "assertion": "hasPackage",
            "arguments": ["com.example"]
          },
          {
            "assertion": "hasScope",
            "arguments": ["package"]
          },
          {
            "assertion": "hasProperty",
            "arguments": ["aaa"],
            "tests": [
              {
                "assertion": "hasScope",
                "arguments": ["public"]
              },
              {
                "assertion": "hasType",
                "arguments": ["string"]
              }
            ]
          },
          {
            "assertion": "hasProperty",
            "arguments": ["bbb"],
            "tests": [
              {
                "assertion": "hasScope",
                "arguments": ["public"]
              },
              {
                "assertion": "hasType",
                "arguments": ["boolean"]
              }
            ]
          }
        ]
      }
    ]
  }
]
```

Now let's assume we have a generator, written in Node.js JavaScript, that emits Scala classes. Here's an example implementation of the above test suite in Scala-ish pseudocode.

```scala
class JsonSchemaIdlTests extends AnyFunSpec {
describe("JSON Schema IDL") {
// Load the Test Suite data from JSON
val testSuite = JSON.parseFull(testSuiteJson).asInstanceOf[TestSuite]
for (scenario <- testSuite) {
describe(scenario.description) {
val listOfClasses = generateClasses(scenario.schemas)
listOfClasses.foreach { ReflectionUtil.eval } // Load classes from code into the current scope
for (test <- scenario.tests) {
// Each type of assertion has an implementation below
test.assertion match {
case "hasClass" => hasClass(test, ReflectionUtil.global)
}
}
}
}
}
def generateClasses (schemas) {
// Call external JavaScript with schemas and return code. This will return a list of classes that
// looks something like the following:
return List("""package com.example
class Foo(val aaa: String, val bbb: Boolean)
""");
}
def hasClass(test, global) {
val className = test.arguments.head
it("should have a class with name" className)) {
assert(ReflectionUtil.hasClass(className, global))
}
if (test.tests) {
// Some types of assertions, like this one, can have sub-assertions
// In this case, we can make additional assertions about the class
describe("Class" className) {
val classMirror = ReflectionUtil.getClassMirrorFor(global, className)
for (test <- test.tests) {
test.assertion match {
case "hasPackage" => hasPackage(test, classMirror) // Implementation not included
case "hasScope" => hasScope(test, classMirror) // Implementation not included
case "hasProperty" => hasProperty(test, classMirror)
}
}
}
}
}
def hasProperty(test, classMirror) {
val propertyName = test.arguments.head
it("should have a property with name" propertyName) {
assert(ReflectionUtil.hasProperty(classMirror, propertyName))
}
if (test.tests) {
// Properties can also have sub-assertions
describe("Property" propertyName) {
val propertyMirror = ReflectionUtil.getPropertyMirror(propertyName)
for (test <- test.tests) {
test.assertion match {
case "hasType" => hasType(test, propertyMirror)
case "hasScope" => hasScope(test, propertyMirror) // Implementation not included
}
}
}
}
}
def hasType(test, propertyMirror) {
val expectedType = test.assertion.head
it("should have type" expectedType) {
assert(ReflectionUtil.getType(propertyMirror) == expectedType)
}
}
}
```

The JavaScript implementation can be a black box as far as the test implementation is concerned. Schemas go in and classes come out. The test implementation then evals the class code and uses reflection to ensure that the result passes the assertions in the test suite. To support another language, you'll need a test implementation in that language, but you can use the same test suite. No intermediate DDL is necessary. An implementation might have one, but that's an implementation choice and isn't necessary for us to specify. I'm glossing over several important details, but hopefully this gets the concept across.
-
@jdesrosiers I'm going to start a new thread here, as it is important that we align on this subject, and this one is about "regardless of the use of the IDL vocabulary". From the perspective of the other specifications, AsyncAPI, OpenAPI, or just in general, we cannot force users to care about a specific vocabulary. For our code generation tooling to work for any user and any schema, we first need to define how the process works regardless of whether the IDL vocabulary is used.

This is also why I do not really "care" that much about the vocabulary side of things. Don't get me wrong, vocabularies are great to have and we should have them, but only for future schema development. The vocabulary does not solve anything for users who either can't change their existing schemas (say they use draft 7 or below, or don't have access to change the schema files) or won't. For them, this solution should still work. I truly believe we can create a process that interprets ANY JSON Schema, regardless of the use of extra vocabularies. Yes, some cases require that we take a stand on what we define as the "standard" behavior (or define none and leave the approach up to implementors, depending on the situation), such as naming and the interpretation of specific keywords.
Not sure I understand or agree with this statement. If by refactoring you mean refactor the schema to still validate against the exact same data, then I absolutely agree. If you mean changing what data is valid, then no, the code generation output will definitely change, as it has to represent different data.
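To illustrate that first reading (a hypothetical example, reusing the `generateClasses` black box from the pseudocode earlier in this thread): these two schemas validate exactly the same data, so refactoring one into the other should not change the generated code.

```scala
// Two schemas that accept exactly the same instances: a string or null.
val original   = """{ "type": ["string", "null"] }"""
val refactored = """{ "anyOf": [{ "type": "string" }, { "type": "null" }] }"""

// Refactoring that preserves the set of valid instances should not change
// the output; both might map to Option[String] in Scala, say.
assert(generateClasses(List(original)) == generateClasses(List(refactored)))
```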
I get this, but regardless, we still need a standard expected behavior. Overriding such behavior (such as through the vocab, library settings, etc.) should of course be possible.
-
> IMHO the reason this SIG exists is the acknowledgment that generating code from the basic JSON Schema vocabulary is not enough, and the use of a new vocabulary is required.
This is exactly right. @jonaslagoni I understand you want to provide a solution people can use with their schemas today, but I don't feel this aligns with the purpose of the SIG. There are already several solutions out there which do something similar, but they are never quite how people want, or there are edge cases, or they won't work with current versions of JSON Schema.

What you're proposing is a convention. Conventions are great, until you can't do what you want because the convention doesn't allow it. This approach will involve arbitrary decisions and create an opinionated convention. In comparison, a defined vocabulary should allow you to do almost anything you want, with some things defined using a config (this can even be in the schema as part of the vocabulary). This requires some additional keywords, and it will require updated schemas.

New vocabularies require modifications to schemas, and there is some delay in publication and production. This is not uncommon, and I feel we should expect and be comfortable with this fact. Lasting impact over time is preferable over quick wins today. The last thing I want to see is a quick win today which then creates a convention that conflicts with the long-term solution. I feel we should be very cautious to avoid this sort of situation.
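As a purely hypothetical sketch of what in-schema configuration could look like (every `idl-*` keyword below is invented to illustrate the idea; no such vocabulary exists today):

```scala
// A schema annotated with hypothetical IDL-vocabulary keywords that override
// the convention-based defaults. None of these keywords exist today.
val annotatedSchema = """{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "idl-className": "Customer",
  "idl-package": "com.example.billing",
  "properties": {
    "created_at": { "type": "string", "idl-propertyName": "createdAt" }
  }
}"""
```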
-
Based on the discussion here, it seems there's more of an interest in developing a processing model for schemas that include all of the vocabularies associated with the JSON Schema 2020-12 dialect, optionally with an additional vocabulary for IDL purposes. My concern with this approach is that we already have sub-optimal code generators that depend on JSON Schema dialects. A JSON Schema document may define dynamic, conditional structures that depend on runtime conditions and may not be suitable for ahead-of-time type definitions in programming languages. This leads to the support tables published by generators, and unfortunately there's often a lot that's not supported (see, for example, the support table for OpenAPI Generator's Go generator).
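As a small, hypothetical illustration of the kind of schema that resists ahead-of-time typing:

```scala
// The shape of "payload" depends on the runtime value of "kind", which has
// no direct equivalent in most static type systems.
val conditionalSchema = """{
  "type": "object",
  "properties": { "kind": { "type": "string" } },
  "if":   { "properties": { "kind": { "const": "user" } } },
  "then": { "properties": { "payload": { "type": "object" } } },
  "else": { "properties": { "payload": { "type": "string" } } }
}"""
```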
Does a better processing model truly fix this? I am skeptical, but would be delighted to be surprised. When looking to the future, instead of retrofitting IDL support into existing utilities, I think it's worth exploring a dialect that has a more static definition, with the intent of using it as an IDL. There are plenty of IDL solutions, but all that I know of are tied directly to specific RPC implementations. I think there's opportunity in a JSON Schema dialect specifically tailored to IDL, particularly for formats that depend on JSON Schema such as OpenAPI and AsyncAPI. I may look at exploring that approach if anyone's interested! I realize that this is not the stated intent or approach of this SIG. Thanks, everyone, for sharing your thoughts here! It has been a great discussion to follow. 🙏
-
This discussion is an attempt to align all thoughts behind what this SIG tries to solve and what the outcome of the SIG and repository should be. I will update the repository documents once an agreement has been reached on how we want to proceed 🙂
The current process
To make sure we are all aligned on what we have now and how it relates to the IDL SIG and its outcomes, I want to quickly describe the current JSON Schema validation process and what it solves.
The JSON Schema specification and vocabularies make up the structure of a JSON Schema document. Together with instance data, that document can be validated through a well-defined validation process.
The validation process then allows external vendors to implement validators.
The test suite is provided to enable consistency and uniform behavior across all implementations of the validation process. The test cases contain instance data, a JSON Schema document, and an expected validation output.
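For reference, a validation test case in that style (modelled loosely on the layout of the official JSON-Schema-Test-Suite) looks something like this:

```scala
// Instance data + schema + expected validation outcome, in the style of
// the official JSON-Schema-Test-Suite.
val validationCase = """{
  "description": "integer type matches integers",
  "schema": { "type": "integer" },
  "tests": [
    { "description": "an integer is valid", "data": 1,     "valid": true  },
    { "description": "a string is invalid", "data": "foo", "valid": false }
  ]
}"""
```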
The IDL
The IDL, and its SIG, is trying to provide a very different process from the validation process JSON Schema currently has. I will try to outline my point of view on what is needed to solve this problem.
[Diagram: overview of the proposed process.]
- Green boxes: documentation that should be the primary focus of the SIG.
- Yellow/orange boxes: tooling, i.e. implementations of said documentation.
I want to split up the task into multiple sections, so they are more easily digested.
Again, the JSON Schema specification and vocabularies make up the structure for a JSON Schema document. For us to interpret the validation rules as data definitions, we need a well-defined interpretation process, the same way we have a well-defined validation process.
The interpretation process needs to interpret the JSON Schema (validation rules) into a common data definition format. The implementation of this comes in the form of an `Interpreter` (as we have a `Validator` for core JSON Schema). As I understand from multiple people, many want this common data definition format to be JSON Schema itself, which can be done. However, for now, I would recommend that we research the available options and not make a hasty decision.

The test suite is provided to enable consistency and uniform behavior across all implementations of the interpretation process. The test cases contain a JSON Schema document and its corresponding data definition format.
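An interpreter test case could then pair a schema with its expected data definition. A minimal sketch, assuming the common format is itself JSON; the `definition` structure and its field names are invented for illustration only:

```scala
// Hypothetical interpreter test case: schema in, data definition out.
val interpreterCase = """{
  "description": "a simple object becomes a single type definition",
  "schema": {
    "type": "object",
    "properties": { "aaa": { "type": "string" } }
  },
  "definition": {
    "types": [
      { "name": "Root", "fields": [ { "name": "aaa", "type": "string" } ] }
    ]
  }
}"""
```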
The `JSON Schema IDL vocabulary` should add not only "control" to the interpretation process (not quite sure if this will even be the case) but also metadata for specific output languages.

Last, and opaque as it is yet to be determined based on the initial work, is to help facilitate which sets of features are possible in specific type systems. For example, if the type system supports union types, do "A"; otherwise, fall back to "B". However, for now, this is going to be ignored.
The task list
This is the task list I suggest that the SIG primarily focus on: