Proposal: Bundled streams w/ self-identification #3875

Closed
123 changes: 123 additions & 0 deletions proposals/2024-06-01-Self-Identification-and-Bundling.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,123 @@
# Self-Identification and Bundling

## Metadata

|Tag |Value |
|---- | ---------------- |
|Proposal |[2024-06-01 Self-Identification-and-Bundling](https://github.com/OAI/OpenAPI-Specification/tree/main/proposals/2024-06-01-Self-Identification-and-Bundling.md)|
|Relevant Specification(s)|OAS, Arazzo|
|Authors|[Henry Andrews](https://github.com/handrews)|
|Review Manager | TBD |
|Status |Proposal|
|Implementations |n/a|
|Issues |[{issueid}](https://github.com/OAI/OpenAPI-Specification/issues/{IssueId})|
|Previous Revisions |[{revid}](https://github.com/OAI/OpenAPI-Specification/pull/{revid}) |

## Change Log

|Date |Responsible Party |Description |
|---- | ---------------- | ---------- |
|2024-06-01 | @handrews | Initial submission |

## Introduction

Poor support for external references has fractured the OAS tooling landscape, with many tools requiring multi-document OpenAPI Descriptions (OADs) to be combined into a single document. Arazzo requires resolving sources and runtime expressions from multiple OADs, each of which might consist of multiple documents. There is no way to combine all of the OAD and Arazzo documents involved into a single document, but an alternate solution would be a bundle similar to [what we recommend on our blog for bundling Schema Objects](https://www.openapis.org/blog/2021/08/23/json-schema-bundling-finally-formalised). This would require a similar mechanism to JSON Schema's `$id` for OAS and Arazzo documents and the components within them. It would also provide an alternative to current multi-to-single-document OAD tools, most (possibly all) of which do not fully support OAS 3.1, and allow for _lossless bundling of identifiable components_, which is increasingly needed by industry standards groups publishing API "building blocks" for use across many APIs by many different providers.
Contributor

> There is no way to combine all of the OAD and Arazzo documents involved into a single document

I'd like to understand why this is the case. What about just OAD documents?

Contributor

Objects in descriptions are scoped to the document in which they originate. We don't really have semantics for merging multiple documents. In some cases, it would involve renaming unique identifiers like map keys. Tools that have done this create an inconsistent experience, and to reverse the process requires a source map of some kind. Maintaining document identity in a multi-document scenario obviates the need for that.

Member Author

@mikekistler basically what @kevinswiber said, but I'll add two things:

I chatted with someone on some Slack a while ago who was having a lot of problems because the "merge to a single document" tool that he needed in order to get something (AWS gateway, maybe?) to accept the OAD was messing things up with what it thought were "safe" transformations. As Kevin pointed out, there are a lot of choices that tools doing this have to make, and those can be surprising and sometimes breaking. When you start with a single document, split it, and then use the same toolchain to re-combine the pieces, that works quite well because the split and combine tools are the same and make the same assumptions. But mixing toolchains doesn't work well here.

Another way to look at it is to look at how JSON Schema bundling works. It relies heavily on "$id" to preserve not just the referencing behavior but the literal reference values. This way there isn't any "rewriting" of the documents, and no need to merge them. They are simply placed in the standard location ("$defs") and used in a way that the name under "$defs" is irrelevant.

As briefly mentioned in the call this morning, if we want to support "$id" (or an equivalent named differently to reduce confusion with JSON Schema) in every Object type in the OAS and Arazzo, we could absolutely implement single-document bundling. But nested "$id"s make it much harder to determine the base URI, which is something that tooling vendors seem to struggle with (or are just indifferent to) as it is. Using a document stream minimizes the changes and implementation work involved.

I'm trying to get the maximum improvement for the minimum effort with this proposal.


## Motivation

This proposal is motivated by the shortcomings of the current ecosystem regarding referencing, particularly its gaps regarding OAS 3.1 and Arazzo support.

Conflicting requirements from different tools regarding referencing, and regarding how to work around their lack of support, are among the most fundamental interoperability problems facing the OpenAPI ecosystem. While manageable within a single API owned by a single provider, they become a much bigger problem when working across multiple APIs by multiple providers.

### Current tools lose necessary identifying information

The current lack of consistent referencing support has resulted in many tools requiring preprocessing by a tool that combines documents through some combination of inlining reference targets (which is not always possible) and/or moving reference targets and rewriting the references to point to the external location. This is a _lossy_ operation in terms of recognizing shared components: tools that work by [recognizing a specific shared component by URI](https://github.com/OAI/sig-moonwalk/discussions/72#ogc) cannot reliably recognize inlined components or rewritten references.

### Most tools do not fully support OAS 3.1

Many referencing preprocessors only work in OAS 3.0 or earlier, because they violate the [full-document, JSON Schema keyword-aware parsing requirement](https://github.com/OAI/OpenAPI-Specification/pull/3758). Supporting referencing as a preprocessor requires handling not only the [many different `$ref` variations](https://github.com/OAI/Arazzo-Specification/issues/181#issuecomment-2085586524), but also the `$id`, `$anchor`, `$dynamicAnchor`, and `$dynamicRef` keywords. While it is theoretically possible to preprocess `$dynamicRef` (aside from circular references), it can cause an [exponential growth in document size](https://arxiv.org/pdf/2307.10034). Dynamic referencing is an important technique for [modeling generic data types](https://github.com/OAI/OpenAPI-Specification/pull/3714).

The author of this proposal is not aware of any reference preprocessing tool that fully supports OAS 3.1, although the last comprehensive survey on this was done in 2022.

### Arazzo not supported

All current tools depend on it being possible to structure OADs as single JSON or YAML documents. This is not possible with Arazzo, as it coordinates multiple OADs without being part of any of them. We do not yet know how much of a challenge this will be for Arazzo, but history suggests that the ecosystem will be healthier if a clear solution is endorsed early on.
Contributor

@kevinswiber kevinswiber Jul 16, 2024

Could we make composite keys using a combination of the identifier for the entry OAD and identifiers for dependency documents?

Member Author

@kevinswiber Arazzo identifying targets in OADs is not a problem, if that's what you mean. I'm just kind of assuming that, since multi-document support has been a challenge (or just not viewed as cost-effective) for our community, publishing a spec that depends on reading and using multiple documents seems a touch risky.


### JSON Schema bundling is lossless and well-received

JSON Schema bundling, which we have officially endorsed on our blog as linked in the introduction above, is a _lossless_ operation that does _not_ require rewriting or inlining any references. Schema documents with an `$id` at the root are incorporated as-is, while documents referenced by location have an `$id` with that URL added. This makes it possible to reproduce the original multi-document form from the bundle, and continue to recognize components based on their reference URIs.
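
The bundling behavior described above can be illustrated with a minimal Python sketch. It is not taken from any existing bundler: the function name `bundle_schemas`, the example URLs, and the choice to use the URL as the `$defs` key are all assumptions made for illustration.

```python
import json


def bundle_schemas(entry, referenced):
    """Embed referenced schema documents under $defs without rewriting any $ref.

    `entry` is the root schema (a dict). `referenced` maps each retrieval URL
    to the schema document found at that URL. Both are hypothetical inputs.
    """
    bundled = dict(entry)
    defs = dict(bundled.get("$defs", {}))
    for url, schema in referenced.items():
        embedded = dict(schema)
        # A document that already declares a root $id is embedded as-is;
        # a document referenced only by location gets that URL as its $id.
        embedded.setdefault("$id", url)
        # The key under $defs is irrelevant to resolution (references resolve
        # through $id); the URL is used here only for readability.
        defs[url] = embedded
    bundled["$defs"] = defs
    return bundled


entry = {
    "$id": "https://example.com/schemas/order",
    "type": "object",
    "properties": {
        "customer": {"$ref": "https://example.com/schemas/customer"},
    },
}
referenced = {
    "https://example.com/schemas/customer": {"type": "object"},
}

bundle = bundle_schemas(entry, referenced)
print(json.dumps(bundle, indent=2))
```

Because the `$ref` value in `entry` is left untouched, the original documents can be recovered by splitting the bundle on the embedded `$id` values.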

Building on the JSON Schema bundling model will ensure that, as much as possible:

* New behavior will be identical or at least analogous to behavior that is already required, making it easier to support
* The mental model will parallel one that is already successful

## Proposed solution

These challenges can be solved by combining two already-existing concepts:

1. A simplified analog of JSON Schema's `$id` that appears in exactly one place: a new `self` field in root OpenAPI Objects / Arazzo Objects
2. Existing YAML and JSON streaming formats:
    * YAML native streams ([RFC 9512 §3.2](https://www.rfc-editor.org/rfc/rfc9512.html#name-yaml-streams)): `application/yaml`
    * JSON Text Sequences ([RFC 7464](https://www.rfc-editor.org/rfc/rfc7464)): `application/json-seq`
    * [JSON Lines](https://jsonlines.org/): _[proposed](https://github.com/wardi/jsonlines/issues/19)_ `application/jsonl` or `application/jsonlines`
    * [NDJSON](https://github.com/ndjson/ndjson-spec): `application/x-ndjson`

Placing the `self` field only in the OpenAPI Object or Arazzo Object makes it align with the existing bootstrapping process for parsing: Parsers MUST already check the `openapi` or `arazzo` field first, and in OAS 3.1+ MUST also check `jsonSchemaDialect` to know how to interpret Schema Objects. With `self` providing the base URI when present, it would also impact how relative `$id` values in Schema Objects are handled, just as `jsonSchemaDialect` impacts Schema Objects that do _not_ include `$schema`.

As [OAS 3.1.1 clarifies](https://github.com/OAI/OpenAPI-Specification/pull/3758), it is already mandatory to separate location and identity for Schema Object support. Currently, associating a URI other than the current URL with a document to meet this requirement has to be done externally. Many tools effectively support this in the form of allowing the retrieval URL to be set manually, without verifying that the document actually lives at the given URL.

The various streaming formats do not state how to resolve links among the parts, as noted in [RFC 9512 YAML Media Type §3.2](https://www.rfc-editor.org/rfc/rfc9512.html#name-yaml-streams), which makes an explicit analogy to `application/json-seq` for this behavior.

A `self` field that is a relative URI-reference would be resolved against the document location, just as all Reference Object and similar URIs are resolved in OAS 3.1. A relative reference within a stream would resolve against the URL of the entire stream; however, it is probably better to either RECOMMEND or require (MUST) resolving `self` to an absolute URI when bundling into a stream for maximum predictability.
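
As a rough illustration of this resolution rule, the following Python sketch uses the standard library's `urllib.parse.urljoin`, which implements RFC 3986 reference resolution. The helper name `resolve_self` and all URLs are hypothetical; the empty-string and fragment checks mirror the constraints proposed for the `self` field.

```python
from urllib.parse import urljoin


def resolve_self(self_value, retrieval_url):
    """Resolve a document's `self` value against where it was retrieved from.

    For a document inside a stream, `retrieval_url` would be the URL of the
    entire stream. This helper is a sketch, not part of any existing tool.
    """
    if not self_value:
        raise ValueError("self MUST NOT be the empty string")
    if "#" in self_value:
        raise ValueError("self MUST NOT contain a fragment")
    return urljoin(retrieval_url, self_value)


stream_url = "https://example.com/bundles/all.yaml"

# A relative `self` resolves against the stream URL.
relative = resolve_self("openapi/pets.yaml", stream_url)

# An absolute `self` is unaffected by where the stream was retrieved from.
absolute = resolve_self("https://api.example.org/pets", stream_url)

# Once resolved, `self` is the base URI for references inside the document.
target = urljoin(absolute, "schemas/common.yaml#/components/schemas/Pet")

print(relative)
print(absolute)
print(target)
```

The second case shows why an absolute `self` is more predictable: its value does not change if the stream is moved or renamed.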

## Detailed design

This is written for the structure of the OAS, but it should be clear how it would be adapted for Arazzo.

\# Specification

...

\#\# OpenAPI Description Structure

...

\#\#\# Bundling Documents as YAML or JSON Streams

Multiple OpenAPI Description documents MAY be bundled in a YAML stream ([RFC 9512 §3.2](https://www.rfc-editor.org/rfc/rfc9512.html#name-yaml-streams)) or a JSON streaming format such as JSON Text Sequences ([RFC 7464](https://www.rfc-editor.org/rfc/rfc7464)), [JSON Lines](https://jsonlines.org/), or [NDJSON](https://github.com/ndjson/ndjson-spec). Documents bundled in this way MUST set their own URI using the `self` field in the [OpenAPI Object](#oasObject). If a document in the stream has `self` set to a relative URI-reference, it MUST be resolved relative to the location of the entire stream. However, it is strongly RECOMMENDED to set `self` to an absolute URI for use within multi-document streams.

The first document in the stream MUST be treated as the entry document.
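
For illustration only (this is not part of the proposed specification text), here is a minimal Python sketch of how a tool might load such a bundle from a JSON Text Sequence, where RFC 7464 prefixes each record with the RS character (0x1E) and ends it with a line feed. The names `load_bundle` and `find_document`, the example URIs, and the `3.2.0` version string are assumptions. The sketch also assumes every `self` is already an absolute URI, as recommended above.

```python
import json
from urllib.parse import urldefrag

RS = "\x1e"  # record separator that starts each text in a JSON Text Sequence


def load_bundle(stream_text):
    """Parse a JSON Text Sequence into (entry_uri, {self URI: document})."""
    documents = {}
    entry_uri = None
    for chunk in stream_text.split(RS):
        if not chunk.strip():
            continue  # skip the empty piece before the first RS
        doc = json.loads(chunk)
        uri = doc.get("self")
        if uri is None:
            raise ValueError("bundled documents MUST set self")
        if entry_uri is None:
            entry_uri = uri  # the first document is the entry document
        documents[uri] = doc
    if entry_uri is None:
        raise ValueError("stream contains no documents")
    return entry_uri, documents


def find_document(documents, reference):
    """Look up the bundled document a reference points into, ignoring the fragment."""
    uri, _fragment = urldefrag(reference)
    return documents[uri]


pets = {"openapi": "3.2.0", "self": "https://example.com/apis/pets", "paths": {}}
shared = {"openapi": "3.2.0", "self": "https://example.com/shared/components", "components": {}}
stream = RS + json.dumps(pets) + "\n" + RS + json.dumps(shared) + "\n"

entry_uri, documents = load_bundle(stream)
ref_target = find_document(documents, "https://example.com/shared/components#/components/schemas/Pet")
print(entry_uri)
print(sorted(documents))
```

Because documents are indexed by `self` rather than by their position or location, reference values inside each document never need to be rewritten when the stream is assembled or split apart again.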

\# Schema

...

\#\# OpenAPI Object

\#\#\# Fixed Fields

Field Name | Type | Description
---|:---|:---
self | `URI-reference` (without a fragment) | Sets the URI of this document, which also serves as its base URI in accordance with [RFC 3986 §5.1.1](https://www.rfc-editor.org/rfc/rfc3986#section-5.1.1); the value MUST NOT be the empty string and MUST NOT contain a fragment
Contributor

For interleaving multiple OAD streams, we could also include an optional field to identify the bundle, perhaps using the entry document ID. Ideally, it would allow multiple bundles to prevent duplicating documents in certain scenarios, but I'm usually the odd one out for being cool with arrays.

Member Author

@kevinswiber I think this would be a good further direction, but I worry about making the initial thing too complicated. Really, the only thing you need to solve for having multiple OADs in a stream is how to identify each entry document. This proposal just says "the first document in the stream is the entry document", which means there can only be one clear OAD in the stream. But there's no reason you couldn't get a stream, and then select (by URI) each entry document you want to try to parse from the stream. As long as all needed documents are in the stream, it would work fine.


## Backwards compatibility

OAS 3.2 and Arazzo 1.1 documents that do not use the `self` field will behave exactly the same as OAS 3.1 and Arazzo 1.0 documents. The change in minor version is sufficient to manage the compatibility issues, as no software that only supports up to 3.1/1.0 should attempt to parse 3.2/1.1 documents.

## Alternatives considered

### `$id` and `$anchor` everywhere

We could adopt JSON Schema's `$id` and `$anchor` keywords for all OpenAPI Components. This would allow self-identification on the component level, but would introduce a great deal of complexity for tooling vendors. (Note that `$dynamicAnchor` and `$dynamicRef` would never be relevant because they depend on instance evaluation, which is Schema Object-specific.)

Having hierarchical `$id`s makes managing base URIs substantially more complex, without a clear benefit at this time.
Using `$anchor` for plain name fragments might preclude other approaches being discussed for Moonwalk, such as using names from the Components Object or its Moonwalk analogue.

### Bundling using `multipart/related` or similar

Bundling could be implemented with a `multipart` media type along the lines of [RFC 2557: MIME Encapsulation of Aggregate Documents](https://www.rfc-editor.org/rfc/rfc2557). This could extend support to prior OAS versions and external documents and resources, since the parts need not be limited to JSON or YAML, and the URI can be captured in the per-part `Content-Location` header, removing the need to add a `self` field.

While worth considering separately due to the substantial additional benefits it would bring, it would also be more costly to implement given the relatively obscure nature of `multipart` processing. There are also other benefits to the `self` field as noted in this proposal, specifically around recognizing shared components across multiple APIs from multiple providers.

Most importantly, a `multipart` solution and this solution could co-exist, so each can be evaluated on its own merits as separate proposals.