handling of datacontenttype is inconsistent #558

Open
kevinbader opened this issue Jan 22, 2020 · 52 comments

@kevinbader

CloudEvents 1.0

Consider this example, straight from the spec:

{
    ...
    "datacontenttype" : "text/xml",
    "data" : "<much wow=\"xml\"/>"
}

Clearly, data is some structure that has been encoded using the XML format and put into the event as a string (binary). Naturally, I'd assume the same behavior for JSON encoding:

{
    ...
    "datacontenttype" : "application/json",
    "data" : "{\"foo\": \"bar\"}"
}

However, that doesn't seem to be the case; as the example in the HTTP protocol binding spec shows, the JSON object is not sent in its encoded form but rather nested into the event directly:

{
    ...
    "datacontenttype" : "application/json",
    "data" : {
        "foo": "bar"
    }
}

Note that removing the optional datacontenttype attribute doesn't change this, as the spec clearly states:

A JSON-format event with no datacontenttype is exactly equivalent to one with datacontenttype="application/json".

To sum it up, it is not possible to put a JSON-encoded data blob into a CloudEvent, and a parser needs to treat application/json differently from any other datacontenttype.

HTTP Protocol Binding 1.0

For structured content mode, the spec says:

The chosen event format defines how all attributes, and data, are represented.

Does this mean that datacontenttype must be present and set to the event format? Or does structured mode implicitly change the default of datacontenttype from application/json to whatever event format is in use? What if datacontenttype is present and set to a different encoding - must a parser treat this event as malformed?

JSON Event Format 1.0

As a side note, the JSON Format spec makes this even more confusing:

If the implementation determines that the type of data is Binary, the value MUST be represented as a JSON string expression containing the Base64 encoded binary value, and use the member name data_base64 to store it inside the JSON object.

This basically says that you have to Base64-encode any simple JSON string (which is, of course, binary). Also, if a receiver does not implement the optional (!) JSON Format spec, it won't be able to parse the data_base64 value; consequently, implementing the JSON Format spec as a sender means not implementing the full CloudEvents spec.

@duglin
Collaborator

duglin commented Jan 23, 2020

@n3wscott @clemensv any comments on this one?

@duglin
Collaborator

duglin commented Jan 29, 2020

Trying to remember the history....

See https://github.com/cloudevents/spec/blob/master/json-format.md#31-handling-of-data

In your example: "data" : "{\"foo\": \"bar\"}" , what is data ? Is it a string that just happens to look like JSON or is it actually JSON? The content type being "app/json" doesn't help here, since a string is also a valid JSON value. However, in this case, per section 3.1 it is a string that just happens to look like JSON.

I agree that when comparing the xml and json examples in the spec there appears to be a bit of an inconsistency, but (if I'm remembering correctly) we treat JSON payloads differently because they're JSON. Clearly, we can't put raw XML into JSON w/o some kind of encoding, but JSON doesn't have that problem, so it makes more sense to put the JSON into data in its raw form.

@clemensv @cneijenhuis @n3wscott ?

@kevinbader
Author

However, in this case, per section 3.1 it is a string that just happens to look like JSON.

Exactly, that's my point - datacontenttype is not interpreted the way it is for other media types, and the spec doesn't allow transmitting a JSON-encoded data structure (at least, the receiver couldn't tell whether it's a JSON-encoded value or simply a JSON string without trying to decode it).

I agree that when comparing the xml and json examples in the spec there appears to be a bit of an inconsistency, but (if I'm remembering correctly) we treat JSON payloads differently because they're JSON.

The event format could be XML, right? So if content-type is application/cloudevent+xml and within that event datacontenttype is set to application/json, you'd clearly need to stringify (i.e., JSON-encode) your data (and XML-escape it). So you could say the spec actually means to never encode stuff that already uses the event format as its encoding, so JSON data wouldn't be encoded if the event format is also JSON. But then it would be consistent to do the same in case of XML: if data is already XML-encoded and the event format is XML, you'd expect data to be embedded not as a string but as XML "in its raw form" whenever datacontenttype is application/xml. However, the spec says that data needs to be XML-encoded no matter the event format.

@deissnerk
Contributor

@duglin I think, the important text in the section you referenced is

For any other type, the implementation MUST translate the data value into a JSON value, and use the member name data to store it inside the JSON object.

So, if data is already a JSON value, no translation is needed.

@n3wscott
Member

n3wscott commented Jan 29, 2020

Both cases, A):

"data" : "{\"foo\": \"bar\"}"

and, B):

"data" : {
  "foo": "bar"
}

Are equivalent to the spec.

The message needs to be inspected to understand how to parse. How it becomes unmarshaled is up to the consumer of the event; they need to know if it is a string, array or a struct. In the case of B, this shortens the string escaping for the consumer.

Arrays are also allowed:

"data" : [
  {"foo": "bar"}
]

Arrays are not allowed. my bad.

@duglin
Collaborator

duglin commented Jan 29, 2020

Are equivalent to the spec.

is that true? How would a JSON Object that's just a simple string but looks like JSON be serialized then if not like this: "data" : "{\"foo\": \"bar\"}"

@tsurdilo

tsurdilo commented Jan 29, 2020

@n3wscott In the JSON Schema data is defined as being one of two types, object or string:

"data": {
   "type": ["object", "string"]
}

but array type should not be allowed afaik.

You can however of course:

"data" : {
 "foobar": [ ... ] 
}

@n3wscott
Member

Are equivalent to the spec.

is that true? How would a JSON Object that's just a simple string but looks like JSON be serialized then if not like this: "data" : "{\"foo\": \"bar\"}"

The spec says you can promote that inner json object or use it as a string. The consumer needs to inspect the string escaping to understand if it should be a string or object based on what it is trying to un-marshal into.

@duglin
Collaborator

duglin commented Jan 30, 2020

If I see: "data" : "hello" in the event, I would hope that it would result in a single string with value of "hello" being passed on to the app. If so, I would not expect to see anything different if I replaced "hello" with something that looked like JSON, other than a new value for the string.

Where in the spec does it say you can alternate between a string or json? I didn't see it in: https://github.com/cloudevents/spec/blob/master/json-format.md#31-handling-of-data

@tsurdilo

tsurdilo commented Jan 30, 2020

@duglin It cannot alternate. It's either-or: https://github.com/cloudevents/spec/blob/master/spec.json#L12

@tsurdilo

IMO having both data and data_base64 is confusing, especially when you can define datacontenttype.

@n3wscott
Member

Are equivalent to the spec.

is that true? How would a JSON Object that's just a simple string but looks like JSON be serialized then if not like this: "data" : "{\"foo\": \"bar\"}"

The spec says you can promote that inner json object or use it as a string. The consumer needs to inspect the string escaping to understand if it should be a string or object based on what it is trying to un-marshal into.

soooooo.... funny story... the spec changed at the last minute between 0.3 and 1.0 and it now is a must.

A):

"data" : "{\"foo\": \"bar\"}"

and, B):

"data" : {
  "foo": "bar"
}

Are both valid JSON but are not the same. A has been string-escaped and is a JSON string. B is a JSON object.
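
The difference between A and B shows up as soon as both are run through a JSON parser. A minimal sketch using Python's standard json module (the names event_a and event_b are only illustrative):

```python
import json

# Case A: "data" is a JSON string whose content happens to look like JSON.
event_a = json.loads(r'{"data": "{\"foo\": \"bar\"}"}')
# Case B: "data" is a JSON object.
event_b = json.loads('{"data": {"foo": "bar"}}')

print(type(event_a["data"]).__name__)  # str
print(type(event_b["data"]).__name__)  # dict

# Only a second, explicit decoding step turns A into the same value as B.
assert event_a["data"] != event_b["data"]
assert json.loads(event_a["data"]) == event_b["data"]
```

A receiver that wants the object from case A has to know, out of band, that the string should be parsed again.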

@kevinbader
Author

As my original post above suggests, I also have some difficulties following the JSON Format spec and would appreciate guidance (I'm currently implementing an Elixir lib for handling CloudEvents).

The core spec says that Binary = "Sequence of bytes. String encoding: Base64 encoding per RFC4648.". The JSON Format spec says that "If the implementation determines that the type of data is Binary, the value MUST be represented as a JSON string expression containing the Base64 encoded binary value, and use the member name data_base64 to store it inside the JSON object.". So how should the implementation actually figure out whether something is considered Binary? If it was binary in the first place, data would already contain the base64-encoded data, and the JSON format spec then suggests that data - which is a string and thus also Binary - must be base64-encoded again and put into data_base64. I don't think this was the intention of the spec. Can you please state when base64 encoding must be used and how exactly an implementation should figure out whether something is Binary (as it's not trivial to differentiate between a UTF-8 string and non-readable binary data)?

@tsurdilo

@kevinbader +1, I also think that if you allow "data" to be of type string, then datacontenttype+data is enough. There is no need for data_binary imo. wdyt?

@deissnerk
Contributor

There was some discussion around this in #520. In 0.3 we still had an attribute called datacontentencoding, but we had to remove it. Instead the data_base64 field was introduced in formats where it was needed.

There is no simple way to determine from a content-type, if the data is binary. Therefore an SDK will check for some well-known text-based formats and do base64 encoding otherwise. For an example of such a check, please have a look at what @alanconway wrote here.

By using either data or data_base64, the receiving side always knows, if a base64 decode step is needed.
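
To make the data / data_base64 split concrete, here is a rough sketch in Python. It assumes the sender decides "binary or not" from the in-memory type of the payload; put_data and get_data are hypothetical helpers, not part of any SDK.

```python
import base64
import json


def put_data(event: dict, payload) -> dict:
    """Store a payload in a JSON-format event (hypothetical helper)."""
    if isinstance(payload, (bytes, bytearray)):
        # Binary payload: Base64-encode it and use the data_base64 member.
        event["data_base64"] = base64.b64encode(payload).decode("ascii")
    else:
        # Anything else is assumed to already be a JSON-compatible value.
        event["data"] = payload
    return event


def get_data(event: dict):
    """Read the payload back; the member name says whether to Base64-decode."""
    if "data_base64" in event:
        return base64.b64decode(event["data_base64"])
    return event.get("data")


png_header = b"\x89PNG\r\n\x1a\n"
wire = json.dumps(put_data({"datacontenttype": "image/png"}, png_header))
assert get_data(json.loads(wire)) == png_header
```

The receiver never has to guess from datacontenttype: the presence of data_base64 alone tells it that a decode step is required.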

@tsurdilo

tsurdilo commented Jan 31, 2020

@deissnerk thank you for the explanation! Got one more question:

CE also defines the "dataschema" property as "Identifies the schema that data adheres to."
Could we then not define our schema as for example:
{
  "type": "string",
  "contentEncoding": "base64",
  "contentMediaType": "image/png"
}
(https://json-schema.org/understanding-json-schema/reference/non_json_data.html)
which then would clearly identify contents of data (and data_base64 would be not needed at all)?

@deissnerk
Contributor

@tsurdilo The dataschema attribute is defined as a URI. So you can of course use it to identify a schema like the one you posted, but it won't have any effect on the way an SDK handles the event.
Do you want to send an event, that contains base64-encoded data end-to-end? That would be different from data_base64 that is only used to put binary data into the JSON format.

Btw, I think the "contentEncoding":"base64" attribute could be added to #/definitions/data_base64 in spec.json.

@tsurdilo

@deissnerk added contentEncoding as part of the overall json schema update - pr #563
Would you mind reviewing this pr? Thanks.

@duglin
Collaborator

duglin commented Mar 12, 2020

What's the status of this? I think we can close it but @kevinbader do you agree?

@kevinbader
Author

@duglin well, I still think that this part of the spec could be improved a lot - I'm still not sure how to implement this in the Elixir library for CloudEvents I was going to build. I mean, depending on whether I implement the optional JSON format spec, the parser behaviour for the data field is different in a non-compatible way. But I've laid out the details above already.

I'm aware that fixing this would mean a breaking change to the spec. But given that 1.0 is less than five months old and therefore likely hasn't seen that much adoption yet, perhaps it would make sense to consider a v2 spec. If this seems reasonable to you, I'd gladly write an RFC with some ideas on how this could be handled in a more consistent manner.

@duglin
Collaborator

duglin commented Mar 24, 2020

@kevinbader are you concerned about the sender or receiver of CEs? I think it’s the receiver, if so, where is the ambiguity? If it’s JSON and binary then it uses the data_base64 attribute, otherwise it uses the data attribute and the value is pure JSON, not a stringified version of JSON.

@tweing
Contributor

tweing commented Mar 26, 2020

We are currently looking into using CloudEvents 1.0 in our company. I agree with @duglin that the specification is clear.

Taking the example from @kevinbader:

    "datacontenttype" : "application/json",
    "data" : "{\"foo\": \"bar\"}"

This example above is IMHO invalid, since the value of the data attribute is a string "{\"foo\": \"bar\"}" and not a valid JSON expression.

We human beings can guess that the string happens to be an encoded JSON. If you look at a longer example, you can't be so sure anymore:

{ \"swagger\": \"2.0\", \"info\": { \"description\": \"Repository of Events used within Platform hosted applications\", \"version\": \"1.0\", \"title\": \"Platform Event Registry Microservice\", \"termsOfService\": \"Terms of Service\", \"contact\": { \"name\": \"Platform Core Team\", \"email\": \"reworg_platform-core-team@msxdl.abc.com\" } }, \"host\": \"localhost\", \"basePath\": \"\/\", \"tags\": [ { \"name\": \"event-registry-controller\", \"description\": \"Event Registry Controller\" } ], \"paths\": { \"\/api\/v1\/app\/{appId}\/events\": { \"get\": { \"tags\": [ \"event-registry-controller\" ], \"summary\": \"Get events for an application\", \"description\": \"Retrieve list of events which are produced and consumed by the given application.\\n* Called with: **Platform Admin** role\", \"operationId\": \"getAllEventsByAppIdUsingGET\", \"produces\": [ \"application\/json;charset=UTF-8\" ], \"parameters\": [ { \"name\": \"appId\", \"in\": \"path\", \"description\": \"appId\", \"required\": true, \"type\": \"string\", \"format\": \"uuid\" } ], \"responses\": { \"200\": { \"description\": \"OK\", \"schema\": { \"type\": \"object\", \"properties\": { \"consume\": { \"type\": \"array\", \"items\": { \"type\": \"object\", \"properties\": { \"consumers\": { \"type\": \"array\", \"items\": { \"type\": \"object\", \"required\": [ \"appID\" ], \"properties\": { \"appID\": { \"type\": \"string\", \"format\": \"uuid\" } }, \"title\": \"ApplicationDTO\" } }, \"eventAlias\": { \"type\": \"string\" }, \"eventDescription\": { \"type\": \"string\" }, \"eventId\": { \"type\": \"string\", \"format\": \"uuid\" }, \"eventSampledata\": { \"type\": \"string\" }, \"eventSchema\": { \"type\": \"string\" }, \"eventStatus\": { \"type\": \"string\" }, \"links\": { \"type\": \"array\", \"xml\": { \"name\": \"link\", \"namespace\": \"http:\/\/www.w3.org\/2005\/Atom\", \"attribute\": false, \"wrapped\": false }, \"items\": { \"type\": \"object\", \"properties\": { \"deprecation\": { \"type\": 
\"string\", \"xml\": { \"name\": \"deprecation\", \"attribute\": true, \"wrapped\": false } }, \"href\": { \"type\": \"string\", \"xml\": { \"name\": \"href\", \"attribute\": true, \"wrapped\": false } }, \"hreflang\": { 

Following the specification of CloudEvents 1.0, we interpret data indicated with application/json directly as JSON without unescaping. This also makes sense on the producer side, because most of our products nowadays use JSON anyhow and we can directly paste or serialize JSON into the data node.

@deissnerk
Contributor

@Thoemmeli I agree with your point, but "{\"foo\": \"bar\"}" is a string and therefore also a JSON value. In that sense the example is valid, but the datacontenttype in this case refers to the string and not to the escaped JSON object.

@duglin
Collaborator

duglin commented Apr 1, 2020

@kevinbader thoughts?

@kevinbader
Author

@duglin well, I've implemented this according to the spec and the replies here:

https://github.com/kevinbader/cloudevents-ex/blob/master/test/format/v_1_0/decoder/json_test.exs

Not particularly happy with it but I guess that's how it is.

@dazuma
Member

dazuma commented Jun 24, 2021

@duglin @deissnerk This is coming up in my work on the Ruby SDK, and I want to bring up a clarification question.

To summarize a conclusion from above:

In the following CE:

const ce1 = new CloudEvent({
  specversion: "1.0",
  id: "C234-1234-1234",
  source: "/mycontext",
  type: "com.example.someevent",
  datacontenttype: "application/json",
  data: "{\"foo\": \"bar\"}"
});

... it sounds like the data should be considered a JSON value of type string. The fact that the string's value happens to look like serialized JSON is irrelevant. It is simply a string. Therefore, if we were to serialize this CE in HTTP Binary mode, it might look like this:

CE-SpecVersion: 1.0
CE-Type: com.example.someevent
CE-Source: /mycontext
CE-ID: C234-1234-1234
Content-Type: application/json

"{\"foo\" : \"bar\"}"

The data must be "escaped" in this way, so that a receiver parsing this content with the application/json content type will end up with a JSON string and not an object.

As a corollary, when deserializing an HTTP Binary mode CE with Content-Type: application/json, the HTTP protocol handler must parse the JSON and set the data attribute in memory to the actual JSON value (rather than the string representation of the JSON document). Otherwise, the content's semantics will change when the CE gets re-serialized. And this, of course, all implies that an SDK's HTTP protocol handler (and perhaps other protocol handlers as well) must understand JSON, even if the JSON structured format is not in use.
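
The round trip described above can be sketched in a few lines of Python. This is only an illustration under the interpretation stated in this comment; encode_json_body and decode_json_body are hypothetical names, not SDK functions.

```python
import json


def encode_json_body(data) -> bytes:
    """Serialize an in-memory JSON value as an HTTP binary-mode body."""
    return json.dumps(data).encode("utf-8")


def decode_json_body(body: bytes):
    """Parse the body so the in-memory value is the JSON value itself."""
    return json.loads(body.decode("utf-8"))


ce1_data = '{"foo": "bar"}'        # a string that merely looks like JSON
body = encode_json_body(ce1_data)
print(body.decode("utf-8"))        # "{\"foo\": \"bar\"}"

# Decoding and re-encoding leaves the bytes on the wire unchanged.
assert decode_json_body(body) == ce1_data
assert encode_json_body(decode_json_body(body)) == body
```

If the handler skipped the parse step and kept the raw body text as data, the second serialization would escape it once more and the semantics would drift.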

Taking that as given, consider this implication:

Earlier a comparison was made with application/xml, noting a possible inconsistency. Consider this parallel example:

const ce2 = new CloudEvent({
  specversion: "1.0",
  id: "C234-1234-1234",
  source: "/mycontext",
  type: "com.example.someevent",
  datacontenttype: "application/xml",
  data: "<much wow=\"xml\"/>"
});

If we were to treat this XML data consistently with how we treated the earlier JSON data, we would consider this data as a string node in an XML document, whose contents just happen to look like XML. Hence, serializing this in HTTP Binary mode might yield something like:

CE-SpecVersion: 1.0
CE-Type: com.example.someevent
CE-Source: /mycontext
CE-ID: C234-1234-1234
Content-Type: application/xml

&lt;much wow="xml"/&gt;

However, my understanding of the spec, and my understanding of the current behavior of the SDKs, suggests we are not doing that. (And indeed I'm glad, because that would, in turn, imply that all protocol handlers would also need to understand XML.) Instead, we actually consider the above data as semantically an XML document and not a string. Hence, serializing this in HTTP Binary mode actually looks like:

CE-SpecVersion: 1.0
CE-Type: com.example.someevent
CE-Source: /mycontext
CE-ID: C234-1234-1234
Content-Type: application/xml

<much wow="xml"/>

In other words, our handling of the XML content-type appears to be inconsistent with our handling of the JSON content-type.

So my clarification question is:

  1. Am I correct in my interpretation that the spec intentionally treats data with content-type application/json specially, differently from string data with content-type application/xml (or indeed any other content-type), as illustrated above?

If so, follow-up questions:

  1. Is the reason for this that we (for some reason) consider JSON uniquely special among all content types in the universe, or is the reason simply that the spec currently happens to include a JSON format but not an XML format to define how data with that datacontenttype is rendered? Suppose a future spec version adds an XML format, YAML format, Protobuf format, etc. Would we at that time need to change the behavior of those formats to be like JSON (which would be a breaking change)?
  2. How do we precisely identify which content types are to be treated in this special way? For example, application/json is obvious, but what if the datacontenttype is itself application/cloudevents+json (i.e. a cloudevent whose payload is another cloudevent)? If we do consider JSON special, it seems it might be a good idea for the spec to state that explicitly, and define how it is identified, perhaps with reference to fields in RFC 2046 or similar.

@deissnerk
Contributor

I don't see any special handling of the JSON format here, @dazuma.
I think the nesting of quotes, combined with the different representations of the event in Javascript (not JSON) and HTTP binary, leads to some confusion. In your example ce1 you use Javascript:

const ce1 = new CloudEvent({
  specversion: "1.0",
  id: "C234-1234-1234",
  source: "/mycontext",
  type: "com.example.someevent",
  datacontenttype: "application/json",
  data: "{\"foo\": \"bar\"}"
});

This statement gives you an object ce1 with, among others, a string member data. It depends now on the SDK, what this means. In the go SDK the Event has an attribute DataEncoded that holds the raw byte sequence of the payload. Only if you call DataAs, the SDK attempts to parse DataEncoded according to the datacontenttype. What would your example look like in JSON?

Like this?

{
...
   "data" : "{\"foo\": \"bar\"}",
...
}

The go SDK would store the serialized JSON value of data in a byte sequence, where the first and the last byte would contain a ". This is a JSON specific implementation, but it is part of the unmarshaling of the JSON format.

As, in Javascript, you don't have byte sequences, you would perhaps instead store the data as string and base64 encode it depending on the datacontenttype. If you now express this string as a string literal in Javascript, it looks like this:

dataEncoded : "\"{\"foo\": \"bar\"}\""

or maybe a bit more readable:

dataEncoded : '"{\"foo\": \"bar\"}"'

If your data attribute is decoded, this means that, when serializing it again, it would have to be encoded again according to datacontenttype. This holds true for JSON, XML and every other encoding. In the go SDK there are encoders and decoders for XML, JSON and text.

@dazuma
Member

dazuma commented Jun 27, 2021

@deissnerk I think we are in agreement here on the interpretation of this example that I labeled ce1. The type of the data attribute is string, and so when datacontenttype is set to application/json, the value of dataEncoded must, as you point out, begin and end with " to indicate that a string is being encoded.

dataEncoded : '"{\"foo\": \"bar\"}"'

So far I think we agree. My question is actually about the ce2 example. In JSON format, this looks like:

{
...
   "datacontenttype" : "application/xml",
   "data" : "<much wow=\"xml\"/>",
...
}

I constructed ce2 to be a parallel to ce1. In ce1, the content type is application/json; in ce2, it is application/xml. In ce1, the data attribute is a string which looks like an encoded JSON document; in ce2, the data attribute is a string which looks like an encoded XML document. In all aspects, ce1 and ce2 are in parallel; the one case using JSON and the other using XML, but in both cases, the type of data is string.

For ce2, what is the value of dataEncoded?

Is it: dataEncoded : '<much wow="xml"/>' (which would decode to an XML document)

Or is it: dataEncoded : '&lt;much wow="xml"/&gt;' (which would decode to a string)

@athalhammer

Having the same question as @dazuma:

At IANA, there are about 114 application/.*+json media types registered (note: application/cloudevents+json is not even there). Are these JSON documents or should we treat them like the XML examples above?

@deissnerk
Contributor

@dazuma IMHO it should be:
dataEncoded : '<much wow="xml"/>'

@athalhammer
Currently the go SDK only takes application/json and text/json into account, but it's an interesting question.
@n3wscott @slinkydeveloper What do you think?

For the pending IANA registration we already have #557
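
For illustration, a check that covers both the two exact media types and the +json structured syntax suffix could look like the Python sketch below. Whether the suffix should count as JSON is exactly the open question here, so it is kept behind a hypothetical accept_suffix flag; is_json_media_type is an illustrative name, not an SDK function.

```python
def is_json_media_type(content_type: str, accept_suffix: bool = True) -> bool:
    """Decide whether a datacontenttype value should be handled as JSON."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in ("application/json", "text/json"):
        return True
    # Optional: treat any "+json" structured syntax suffix as JSON too.
    return accept_suffix and media_type.endswith("+json")


assert is_json_media_type("application/json; charset=utf-8")
assert is_json_media_type("application/cloudevents+json")
assert not is_json_media_type("application/cloudevents+json", accept_suffix=False)
assert not is_json_media_type("application/xml")
```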

@dazuma
Member

dazuma commented Jun 28, 2021

@deissnerk

IMHO it should be:
dataEncoded : '<much wow="xml"/>'

I also agree. And that was my point. Parallel cases (in both cases the data is a string that looks like a serialized document in that media type), but different semantics (for JSON, the data is interpreted, and thus encoded, as a JSON string value, but for XML, the data is interpreted, and thus encoded, as a document). JSON's semantics seem to be different from any other media type. Hence my clarification questions.
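
The asymmetry can be written down as a small Python sketch. It only mirrors the interpretation agreed on in this exchange; data_encoded is a hypothetical function named after the dataEncoded value discussed above, not the go SDK's implementation.

```python
import json


def data_encoded(data, datacontenttype: str) -> bytes:
    """Return the bytes that would go on the wire in binary mode."""
    media_type = datacontenttype.split(";", 1)[0].strip().lower()
    if media_type == "application/json":
        # JSON: a str is a JSON string value, so it gains quotes and escapes.
        return json.dumps(data).encode("utf-8")
    if isinstance(data, str):
        # Any other media type: a str is taken to be the document itself.
        return data.encode("utf-8")
    return bytes(data)


ce1 = data_encoded('{"foo": "bar"}', "application/json")
ce2 = data_encoded('<much wow="xml"/>', "application/xml")
print(ce1.decode())  # "{\"foo\": \"bar\"}"
print(ce2.decode())  # <much wow="xml"/>
```

Same in-memory type (a string) in both cases, but only the JSON branch re-encodes it.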

@deissnerk
Contributor

@dazuma This special handling only needs to happen in the unmarshalling implementation of the JSON format. Looking at the pending PR for XML, it seems to be similar there, as it distinguishes XML, text and binary.

@athalhammer

athalhammer commented Jun 28, 2021

Not sure if that option has already been discussed but I think a clean solution to this could be to interpret the datacontenttype also as a flag:

  1. If it is present, the value of the data field needs to be a string and the content of the string can be interpreted according to the mime type specified by the datacontenttype field.
  2. If it is absent, we would expect a JSON value (i.e., one of string, number, object, array, true, false, null).

There would be one edge case, because a plain string actually is a JSON value. In particular, the following alone would form a valid JSON document: "foo bar". However, this would be fine, because we also allow e.g. data: 42, as long as the value of the data field does not come in a common data representation format such as XML; i.e., having the value "<much wow=\"xml\"/>" and no datacontenttype field would be considered bad practice/incompatible.

The good news would be: this gives clear guidance without treating JSON differently.

The bad news would be: the following sentences would need to be adapted in the spec:

For example, if a JSON format event has no datacontenttype attribute, then it is implied that the data is a JSON value conforming to the "application/json" media type. In other words: a JSON-format event with no datacontenttype is exactly equivalent to one with datacontenttype="application/json".

Probably also a couple of things would break as the following would not be allowed any longer:

...
datacontenttype = "application/json", 
data = { "foo" : "bar" }
...

To mitigate this, one could specify: "if the value of the data field is not a string, the datacontenttype field is ignored" (would then not break that many things). This would leave us with the following edge case:

...
datacontenttype = "application/json", 
data = "foo bar"
...

This would actually break - on the other hand, this is exactly also the problem that should be addressed by this thread...
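
As a rough Python sketch of the proposal, including the mitigation for non-string values (this is not CloudEvents 1.0 behavior, and interpret_data is a hypothetical name):

```python
import json


def interpret_data(event: dict):
    """Return (media_type, payload) under the proposed rules."""
    data = event.get("data")
    content_type = event.get("datacontenttype")
    if content_type is not None and isinstance(data, str):
        # Rule 1: the string carries a document of the declared media type.
        return content_type, data
    # Rule 2 plus the mitigation: without datacontenttype, or with a
    # non-string value, data is a plain JSON value.
    return "application/json", data


print(interpret_data({"datacontenttype": "text/xml", "data": '<much wow="xml"/>'}))
print(interpret_data({"datacontenttype": "application/json", "data": {"foo": "bar"}}))

# The remaining edge case: the string is now expected to hold a JSON document.
media_type, payload = interpret_data(
    {"datacontenttype": "application/json", "data": "foo bar"}
)
try:
    json.loads(payload)
except json.JSONDecodeError:
    print("edge case: 'foo bar' is not a JSON document")
```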

@deissnerk
Contributor

@duglin I suppose we need to discuss this in our weekly call.
@athalhammer I suppose your proposal could work. Personally, I would prefer a more explicit solution. When I checked out the pending PR for XML, I realized that in this proposal there is an explicit distinction between binary, text and xml. Transferring that approach to JSON would mean having three dedicated attributes data_json, data_text and data_base64. Of course that is not possible for specversion 1.0, but your proposal would be a breaking change, too.

To mitigate the problem for specversion 1.0, we could clearly define, for which content types the JSON value as such would be the event data.

@jskeet
Contributor

jskeet commented Jul 2, 2021

This was thorny in the C# SDK as well. It feels like the separation of concerns isn't quite right - but I think at this point, it's too hard to fix, and I can see how the special handling of JSON is also pragmatic in terms of leading to easy-to-read events. The way we've "solved" it for C# is:

  • We have a CloudEventFormatter abstract class which is responsible for conversions, even for binary mode data. It doesn't quite map to a CloudEvent format
  • Each CloudEventFormatter must specify what conversions it applies - how it treats specific content types
  • Every protocol binding method accepts a CloudEventFormatter; the protocol binding is responsible for extracting the relevant context attributes etc, but delegates all the data handling to the CloudEventFormatter

It's not ideal by any means, but it looks like it's working well enough at the moment. See the docs for implementing a protocol binding and implementing a formatter for more detail. I'm definitely not suggesting that every SDK should implement it the same way as we've chosen to, although I'm happy to work with anyone who wants to know more if they'd like to do something similar in another SDK.
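
The division of labor described in the list above can be sketched in Python as follows. This is not the C# SDK's API: EventFormatter, JsonEventFormatter and from_http are hypothetical names, and the conversions chosen in JsonEventFormatter are only illustrative.

```python
import json
from abc import ABC, abstractmethod
from typing import Any, Optional


class EventFormatter(ABC):
    """Owns all data conversions, including those for binary mode."""

    @abstractmethod
    def decode_binary_mode_data(self, body: bytes, content_type: Optional[str]) -> Any:
        ...


class JsonEventFormatter(EventFormatter):
    def decode_binary_mode_data(self, body: bytes, content_type: Optional[str]) -> Any:
        media_type = (content_type or "application/json").split(";", 1)[0].strip().lower()
        if media_type == "application/json":
            # JSON payloads become parsed JSON values (None for an empty body).
            return json.loads(body) if body else None
        # Illustrative choice: everything else is handed on as raw bytes.
        return body


def from_http(headers: dict, body: bytes, formatter: EventFormatter) -> dict:
    """Protocol binding: extracts ce-* attributes, delegates data to the formatter."""
    event = {
        name[3:].lower(): value
        for name, value in headers.items()
        if name.lower().startswith("ce-")
    }
    content_type = next(
        (value for name, value in headers.items() if name.lower() == "content-type"),
        None,
    )
    if content_type is not None:
        event["datacontenttype"] = content_type
    event["data"] = formatter.decode_binary_mode_data(body, content_type)
    return event
```

The binding itself never looks at the payload, so swapping in a different formatter changes how data is interpreted without touching the HTTP code.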

@dazuma
Member

dazuma commented Jul 2, 2021

@jskeet Interestingly, it sounds like we ended up solving this very similarly for Ruby. We have a Format interface that handles both structured format and (optionally) binary mode data conversion—the only implementation so far is JsonFormat—and any number of Formats can be "installed" in a protocol binding to handle data conversion based on recognized content types.

@jskeet
Contributor

jskeet commented Jul 2, 2021

@dazuma: We're planning on doing something with content types in the future, but didn't want to block going GA on it :)

@deissnerk
Contributor

@jskeet I don't think that JSON is treated specially. The pending PR for XML, as well as the protobuf format, have corresponding concepts. The only difference I see is that protobuf and XML clearly distinguish data as text, binary or protobuf/XML. As I stated above, there should have been a data_json attribute instead of just relying on datacontenttype.

@athalhammer

@deissnerk I agree that the introduction of three fields that clearly separate json from text and base64 solves the problem more explicitly, however, it may come with additional pitfalls.

Is there only one data_* field allowed?

  1. If multiple are allowed: to which of the three fields does the datacontenttype field then relate?
  2. If only one is allowed: it becomes somehow implicit how datacontenttype is interpreted. It doesn't explicitly relate to one of the three data_* fields but rather the one that is currently given.

@deissnerk
Contributor

@athalhammer I would recommend handling it the same way it is done for protobuf. Only one of the data_* fields would be allowed. This way it is up to the producer to determine, if something is text, binary or JSON without any hidden contract or ambiguity for the consumer or intermediary.
Unfortunately I don't see, how this could be introduced as a non-breaking change. So we still need a solution for specversion: 1.0.
This issue is on the agenda for today's CloudEvents call. Let's see, if someone comes up with a clever proposal.
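
A consumer-side check for the "only one data_* field" rule might look like this Python sketch. The members data_json and data_text exist only in this proposal (not in specversion 1.0), and select_data_member is a hypothetical helper.

```python
# Member names from the proposal; only data_base64 exists in specversion 1.0.
DATA_MEMBERS = ("data_json", "data_text", "data_base64")


def select_data_member(event: dict):
    """Return (member_name, value), or (None, None) if the event has no data."""
    present = [name for name in DATA_MEMBERS if name in event]
    if len(present) > 1:
        raise ValueError(f"only one data_* member may be present, got {present}")
    if not present:
        return None, None
    return present[0], event[present[0]]


name, value = select_data_member(
    {"datacontenttype": "application/ld+json", "data_json": {"@type": "Person"}}
)
print(name)  # data_json
```

Because the producer picks the member, the consumer can dispatch on the member name and use datacontenttype only to refine how that one value is interpreted.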

@athalhammer

I like your suggestion - this would also solve the problem of the application/*+json content types, as the data could still come in the data_json field while the datacontenttype can be more specific.

Example (with JSON-LD from schema.org):

...
datacontenttype = "application/ld+json",
data_json = {
    "@context": "https://schema.org",
    "@type": "Person",
    "address": {
      "@type": "PostalAddress",
      "addressLocality": "Seattle",
      "addressRegion": "WA",
      "postalCode": "98052",
      "streetAddress": "20341 Whitworth Institute 405 N. Whitworth"
    },
    "colleague": [
      "http://www.xyz.edu/students/alicejones.html",
      "http://www.xyz.edu/students/bobsmith.html"
    ],
    "email": "mailto:jane-doe@xyz.edu",
    "image": "janedoe.jpg",
    "jobTitle": "Professor",
    "name": "Jane Doe",
    "telephone": "(425) 123-4567",
    "url": "http://www.janedoe.com"
},
...

@jskeet
Contributor

jskeet commented Jul 9, 2021

More detailed explanation of how the C# SDK formatters work. (We have two: one for Json.NET and one for System.Text.Json.)

Here's the detail for the Json.NET formatter - the System.Text.Json one is equivalent, just using different types.

Structured mode:

/// In a structured mode message, any data is either binary data within the "data_base64" property value,
/// or is a JSON token as the "data" property value. Binary data is represented as a byte array.
/// A JSON token is decoded as a string if is just a string value and the data content type is specified
/// and has a media type beginning with "text/". A JSON token representing the null value always
/// leads to a null data result. In any other situation, the JSON token is preserved as a <see cref="JToken"/>
/// that can be used for further deserialization (e.g. to a specific CLR type). This behavior can be modified
/// by overriding <see cref="DecodeStructuredModeDataBase64Property(JToken, CloudEvent)"/> and
/// <see cref="DecodeStructuredModeDataProperty(JToken, CloudEvent)"/>.
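
As a rough illustration of the structured-mode rules quoted above, here is a minimal Python sketch. It is not the C# SDK's code: decode_structured_data and the JsonValue wrapper (standing in for JToken) are names made up for this example, and the event is assumed to be an already-parsed JSON object.

```python
import base64
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class JsonValue:
    """Illustrative stand-in for JToken: a JSON value kept for later deserialization."""
    value: Any


def decode_structured_data(event: dict) -> Any:
    # Binary data travels base64-encoded in "data_base64" and becomes a byte array.
    if "data_base64" in event:
        return base64.b64decode(event["data_base64"])
    token = event.get("data")
    # A JSON null (or, in this sketch, a missing "data" property) yields no data.
    if token is None:
        return None
    content_type = event.get("datacontenttype")
    if isinstance(token, str) and content_type is not None:
        # Ignore parameters such as "; charset=utf-8" when checking the media type.
        media_type = content_type.split(";", 1)[0].strip().lower()
        if media_type.startswith("text/"):
            return token  # plain string
    # Everything else is preserved as a JSON value.
    return JsonValue(token)
```

Note that under these rules a string token with datacontenttype application/json stays a JSON value rather than being unwrapped to a plain string.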

Binary mode:

/// In a binary mode message, the data is parsed based on the content type of the message. When the content
/// type is absent or has a media type of "application/json", the data is parsed as JSON, with the result as
/// a <see cref="JToken"/> (or null if the data is empty). When the content type has a media type beginning
/// with "text/", the data is parsed as a string. In all other cases, the data is left as a byte array.
/// This behavior can be specialized by overriding <see cref="DecodeBinaryModeEventData(ReadOnlyMemory{byte}, CloudEvent)"/>.
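
The binary-mode rules read naturally as a three-way branch. The sketch below is an approximation in Python, not the SDK's implementation: decode_binary_data is a made-up name, and decoding text as UTF-8 regardless of any charset parameter is a simplifying assumption.

```python
import json
from typing import Any, Optional


def decode_binary_data(body: bytes, content_type: Optional[str] = None) -> Any:
    media_type = None
    if content_type is not None:
        # Drop parameters such as "; charset=utf-8".
        media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type is None or media_type == "application/json":
        # Absent or application/json: parse as JSON, empty body means no data.
        return json.loads(body) if body else None
    if media_type.startswith("text/"):
        # Assumption for this sketch: always UTF-8.
        return body.decode("utf-8")
    # Anything else stays a byte array.
    return body
```

As written in the quoted documentation, only the exact media type application/json triggers JSON parsing here; a type such as application/ld+json would fall through to the byte-array branch.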

@n3wscott
Member

For the golang sdk, an integrator can register or override the handler for a given media type: https://github.com/cloudevents/sdk-go/blob/main/v2/binding/format/format.go#L67 so any other kind of media type can be supported if the integrator adds the special conversion logic beyond the well-understood text, json, and xml types.

  1. If it is present, the value of the data field needs to be a string and the content of the string can be interpreted according to the mime type specified by the datacontenttype field.
  2. If it is absent, we would expect a JSON value (i.e., one of string, number, object, array, true, false, null).

I am assuming this is only in the context of structured mode? These are the rules today, with the extra case that application/json is equivalent to absent, and also that data_base64 could be present in combination with any datacontenttype.

@dazuma
Member

dazuma commented Jul 14, 2021

Ruby SDK info

Event class support for content-type encoding

Ruby's event class (i.e. in-memory representation) includes three fields:

  • data which is always present and may yield either the encoded (string) form of the data, or a decoded (object) form
  • data_decoded? which is boolean-valued and specifies whether the data field is decoded or encoded.
  • data_encoded which provides the encoded (string) form of the data if present, but may not be present.

These fields may be set in different ways depending on whether the SDK knows how to encode/decode the content for a particular event. For example, the SDK does not currently know how to parse XML, so when receiving an event with Content-Type: application/xml, both the data_encoded and data fields are set to the encoded string, and data_decoded? is set to false. The SDK does know how to parse JSON, so when receiving an event with Content-Type: application/json, the data_encoded field is set to the encoded string, and the data field is set to the parsed JSON value, and data_decoded? is true. The SDK provides an interface allowing integrators to register encoder/decoder handlers for media types. There are built-in handlers for JSON and text media types.
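
To make the three fields concrete, here is a small Python sketch of the receive path described above. It is not the Ruby SDK's API: Payload, DECODERS and receive are illustrative names, and the registry holds only the two built-in cases mentioned (JSON and text).

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict

# Illustrative handler registry; integrators could add more media types.
DECODERS: Dict[str, Callable[[str], Any]] = {
    "application/json": json.loads,
    "text/plain": lambda encoded: encoded,
}


@dataclass
class Payload:
    data: Any           # decoded object if a handler exists, else the encoded string
    data_encoded: str   # the encoded form as received
    data_decoded: bool  # whether `data` holds the decoded form


def receive(content_type: str, encoded: str) -> Payload:
    decoder = DECODERS.get(content_type)
    if decoder is None:
        # Unknown media type: keep the encoded string in both fields.
        return Payload(data=encoded, data_encoded=encoded, data_decoded=False)
    return Payload(data=decoder(encoded), data_encoded=encoded, data_decoded=True)
```

For example, receive("application/xml", "<a/>") leaves data as the string "<a/>" with data_decoded set to False, while receive("application/json", ...) yields a parsed value and data_decoded set to True.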

JSON structured mode

When decoding an event in JSON structured mode, if the datacontenttype is of the form TYPE/json or TYPE/SUBTYPE+json, the data is assumed to be a JSON value, and is passed through without further decoding. This is done even if the data is a string that looks like a JSON document; it is not parsed. If the datacontenttype is missing, it is treated as if it were application/json, and the data attribute is passed through without decoding. Any other datacontenttype is handled normally: the data is interpreted as an encoded string, and may be parsed based on available registered handlers.

Similarly, when encoding an event to JSON structured mode, if the datacontenttype is of the form TYPE/json or TYPE/SUBTYPE+json, the data is passed through without any further encoding, to the data field in the JSON structured event. If the event does not have datacontenttype set, it is treated as if it were application/json, and the data is passed through without encoding. Any other datacontenttype is handled normally: if the event has a data_encoded field, its value is used, otherwise the SDK will treat the event's data as a decoded Ruby object, and look for an appropriate handler to encode it.

Hence the JSON structured mode implementation treats the content-type patterns TYPE/json and TYPE/SUBTYPE+json specially.
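
The two patterns can be checked with a single regular expression. The following Python sketch is only an approximation of the rule as described, with is_json_content_type as a made-up name; treating a missing datacontenttype as JSON follows the structured-mode behaviour above.

```python
import re
from typing import Optional

# Matches TYPE/json and TYPE/SUBTYPE+json, case-insensitively.
_JSON_MEDIA_TYPE = re.compile(r"^[^/\s;]+/(?:json|[^/\s;]+\+json)$", re.IGNORECASE)


def is_json_content_type(content_type: Optional[str]) -> bool:
    if content_type is None:
        # In JSON structured mode a missing datacontenttype is treated as application/json.
        return True
    # Strip parameters such as "; charset=utf-8" before matching.
    media_type = content_type.split(";", 1)[0].strip()
    return _JSON_MEDIA_TYPE.match(media_type) is not None
```

Under this check application/ld+json and application/happy+json count as JSON, while text/xml and application/jsonx do not.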

Importantly, Ruby does not base any of this logic (i.e. whether a value needs to be encoded) on the runtime type of the event field value. For example, it does not assume that a dictionary implies a JSON object or that a string implies an encoded value. Rather, the encoded vs decoded status is always expressed in the data_decoded? flag, and the runtime type of the data field could potentially be any Ruby type that a handler knows how to handle.

data_base64 is treated as a transfer-encoding construct, and thus independent of content type encoding. Any data_base64 decoding is done first, transforming the data into a byte array (which Ruby represents as a "string" with ASCII_8BIT encoding). Then, possibly, further datacontenttype-mediated decoding may be done on the result. When encoding an event, the process is reversed: first, any datacontenttype-mediated encoding is performed to encode a Ruby object to a string or byte array. The result is interrogated to see whether the encoded data is binary (i.e. whether the Ruby string has ASCII_8BIT encoding) and if so, it is further base64 encoded and serialized into data_base64.
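
The two-stage ordering can be sketched in Python as below. This is an illustration under assumptions, not the Ruby SDK's code: encode_payload and decode_payload are invented names, the encoder and decoder callables stand in for registered content-type handlers, and Python's bytes versus str distinction stands in for Ruby's ASCII_8BIT check.

```python
import base64
from typing import Any, Callable, Dict, Union

Encoded = Union[str, bytes]


def encode_payload(obj: Any, encoder: Callable[[Any], Encoded]) -> Dict[str, str]:
    # Stage 1: content-type encoding turns the object into text or bytes.
    encoded = encoder(obj)
    # Stage 2: binary results are additionally base64-encoded into data_base64.
    if isinstance(encoded, bytes):
        return {"data_base64": base64.b64encode(encoded).decode("ascii")}
    return {"data": encoded}


def decode_payload(event: Dict[str, str], decoder: Callable[[Encoded], Any]) -> Any:
    # Stage 1: undo the base64 transfer encoding, if any.
    if "data_base64" in event:
        raw: Encoded = base64.b64decode(event["data_base64"])
    else:
        raw = event["data"]
    # Stage 2: content-type decoding on the result.
    return decoder(raw)
```

The point of the ordering is that base64 is chosen from the shape of the encoded bytes, not from the datacontenttype value.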

Binary mode

Binary mode treats all content as encoded, and will populate the event's data and data_encoded, and set the data_decoded? flag, based on whether there is a registered handler for the event's datacontenttype. The Ruby SDK provides handlers for JSON and Text media types, and there is an interface for integrators to provide additional handlers. If an event has no datacontenttype, the data is passed through as a string, as if the content-type were implicitly text/plain or application/octet-stream.

@deissnerk
Contributor

I was asked in the regular call to provide an example to illustrate the issue. Let me try this:

Assume there is a simple event forwarder that receives events in structured mode and forwards them in binary mode. What would happen for this event?

{
    "source":"mySource",
    "type":"a.clever.CloudEvent",
    "id":"123",
    "datacontenttype":"application/happy+json",
    "data":"I'm just a string"
}

What would the binary-mode message look like?

A

ce-source: mySource
ce-type: a.clever.CloudEvent
ce-id: 123
content-type: application/happy+json

I'm just a string

or

B

ce-source: mySource
ce-type: a.clever.CloudEvent
ce-id: 123
content-type: application/happy+json

"I'm just a string"

If the event forwarder recognizes the datacontenttype application/happy+json as JSON, it will interpret data as a JSON value of type string. The result should then be B. Otherwise the result will be A, but in that case the body does not represent a valid JSON value any more.

Conclusion

All SDKs should have the same idea of which content types are to be interpreted as JSON.
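
The difference between A and B comes down to one branch in the forwarder. Here is a minimal Python sketch, assuming a hypothetical binary_body helper and a simple suffix test for JSON media types; neither is taken from any SDK.

```python
import json


def looks_like_json(content_type: str) -> bool:
    # Simplified assumption: TYPE/json or TYPE/SUBTYPE+json, no parameters.
    return content_type.endswith("/json") or content_type.endswith("+json")


def binary_body(event: dict, recognizes_json: bool) -> bytes:
    """Produce the binary-mode HTTP body for a structured-mode event."""
    data = event["data"]
    if recognizes_json and looks_like_json(event["datacontenttype"]):
        # Outcome B: data is a JSON value, so serialize it as JSON.
        return json.dumps(data).encode("utf-8")
    # Outcome A: data is taken to be an already-encoded string.
    return data.encode("utf-8")


event = {
    "source": "mySource",
    "type": "a.clever.CloudEvent",
    "id": "123",
    "datacontenttype": "application/happy+json",
    "data": "I'm just a string",
}
body_a = binary_body(event, recognizes_json=False)
body_b = binary_body(event, recognizes_json=True)
```

Here body_b keeps the surrounding quotes and parses back with json.loads, whereas body_a does not parse as JSON at all, even though both are sent with the same content-type header.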

@lance
Member

lance commented Aug 2, 2021

I have been looking into this for cloudevents/sdk-javascript. At the moment, the SDK would produce A as the result of this transformation. In the example below, I've left out the transformation from a structured event, and am just creating the event from whole cloth, since this is how it would look after deserialization anyway.

> const e = new CloudEvent({
...     "source":"mySource",
...     "type":"a.clever.CloudEvent",
...     "id":"123",
...     "datacontenttype":"application/happy+json",
...     "data":"I'm just a string"
... })
undefined
> const s = HTTP.binary(e)
undefined
> s
{
  headers: {
    'content-type': 'application/happy+json',
    'ce-id': '123',
    'ce-time': '2021-08-02T19:14:34.229Z',
    'ce-type': 'a.clever.CloudEvent',
    'ce-source': 'mySource',
    'ce-specversion': '1.0'
  },
  body: "I'm just a string"
}

In @deissnerk's example, the event representation in A isn't actually a code representation. It's the representation on the wire. In my illustration above, the s object is the in-memory representation of the event as a Message object as defined in the SDK. For users of the SDK, it is their responsibility to push this data across the wire. The reasoning behind this is that the networking world in Node.js is rife with competing frameworks, and the underlying Node.js native APIs need a lot of scaffolding around them to be user friendly. So, we've just provided interfaces for developers to implement, and we send/receive through whatever framework they want as long as what they hand us conforms to our API.

When the user ultimately sends the event with something like

const resp = axios.post(url, s.body, { headers: s.headers });

That string is just a string, and nothing is wrapping it in quotes. So over the wire, there are no quotes.

Which got me wondering: what if a binary event arrives and it looks like A? Is it invalid? Should the SDK wrap it in quote marks? It's not very clear.

@krispenner

krispenner commented Nov 5, 2021

Regarding these two comments about data_json, data_text and data_base64:
#558 (comment)
#558 (comment)

I would recommend handling it the same way it is done for protobuf. Only one of the data_* fields would be allowed. This way it is up to the producer to determine whether something is text, binary or JSON, without any hidden contract or ambiguity for the consumer or intermediary. Unfortunately I don't see how this could be introduced as a non-breaking change. So we still need a solution for specversion: 1.0. This issue is on the agenda for today's CloudEvents call. Let's see if someone comes up with a clever proposal.

I realize this is about solving the issue in spec version 1.0 without a breaking change, but going beyond that, is there any discussion anywhere of a next version that would allow breaking changes like this?

I'd prefer to see a dataencoding attribute "re"-added with a value of either json, text or base64, and then only a single data attribute to hold the payload. I don't see the benefit of instead defining the individual attributes mentioned above (data_json, data_text or data_base64). It sounds like a dataencoding attribute was once part of the spec but was dropped; maybe it needs to be re-introduced. This would remove any "special" case of */json or */*+json for the datacontenttype attribute and clear up the whole confusion here. Or maybe I'm missing why it wouldn't.

I also question the attribute naming formats for consistency. The other attributes are all lowercase, neither camel case nor snake case, so why is data_base64 all of a sudden using snake case? For consistency it should be database64. But to avoid this inconsistency altogether, and to avoid adding any more data_xxx fields later, I propose using data only and adding dataencoding to specify the encoding format.

This issue is still open, so I thought I would add my suggestion. I'm a bit confused by the merges as to whether this is considered fixed for spec 1.0 or not now, but I'm suggesting how I think it could be simplified for a future version anyways.

Examples

JSON as JSON

If dataencoding is json, then only datacontenttype of */json or */*+json is allowed.

"dataencoding": "json",
"datacontenttype": "application/json",
"data": {
    "value": 1
}

To read this would be: var value = event.data.value;

JSON as text

"dataencoding": "text",
"datacontenttype": "application/json",
"data": "{ \"value\": 1 }"

To read this would be: var value = parseJson(event.data).value;

XML as text

"dataencoding": "text",
"datacontenttype": "application/xml",
"data": "<much wow=\"xml\"/>"

To read this would be: var wow = parseXml(event.data).attr("wow");

JSON as bytes

"dataencoding": "base64",
"datacontenttype": "application/json",
"data": "eyJ2YWx1ZSI6IDF9"

To read this would be: var value = parseJson(toUtf8String(fromBase64(event.data))).value;

XML as bytes

"dataencoding": "base64",
"datacontenttype": "application/xml",
"data": "PG11Y2ggd293PSJ4bWwiLz4="

To read this would be: var wow = parseXml(toUtf8String(fromBase64(event.data))).attr("wow");

Binary as bytes

"dataencoding": "base64",
"datacontenttype": "image/png",
"data": "c29tZWltYWdlZGF0YQ=="

To read this would be: var imageBytes = fromBase64(event.data);
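
The reading rules in these examples can be collapsed into one dispatch on dataencoding. The Python sketch below assumes the proposed (hypothetical) dataencoding attribute with the three values json, text and base64; read_data is an illustrative name, and parsing the returned value according to datacontenttype is left to the caller, as in the one-liners above.

```python
import base64
from typing import Any


def read_data(event: dict) -> Any:
    """Undo the dataencoding layer only; datacontenttype parsing is up to the caller."""
    encoding = event["dataencoding"]  # hypothetical attribute proposed in this comment
    data = event["data"]
    if encoding == "json":
        return data  # already a JSON value
    if encoding == "text":
        return data  # a string in the format named by datacontenttype
    if encoding == "base64":
        return base64.b64decode(data)  # raw bytes
    raise ValueError(f"unknown dataencoding: {encoding!r}")


xml_event = {
    "dataencoding": "base64",
    "datacontenttype": "application/xml",
    "data": "PG11Y2ggd293PSJ4bWwiLz4=",
}
xml_text = read_data(xml_event).decode("utf-8")
```

A consumer never needs to inspect datacontenttype to decide how to unwrap data, which is the simplification the proposal is after.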

Thank you.

@jskeet
Contributor

jskeet commented Nov 8, 2021

I'm a bit confused by the merges as to whether this is considered fixed for spec 1.0 or not now, but I'm suggesting how I think it could be simplified for a future version anyways.

To my mind, it's as "fixed for spec 1.0" as it can reasonably be (modulo clarifying tweaks etc). Yes, there's a lot that could change for 2.0, although the larger the change in 2.0 (not just here, but everywhere), the harder it will be for SDKs etc to support both 1.0 and 2.0. I haven't heard any detailed discussions of expectations around timelines for a 2.0 - I think more of the activity is around getting Discovery etc across the line first.

@adgerrits

adgerrits commented Apr 25, 2022

Just like @krispenner in his #558 comment, I've been reading past issues to understand why 'data_base64' was chosen in favor of a 'dataencoded' attribute. I fully agree with him that the chosen solution seems confusing and inconsistent. I am curious whether there is a chance his proposals will be adopted in a future release. The argument from @jskeet that big changes are not desirable is valid, but maybe both options ('data_base64' and 'dataencoded') could be supported for a while, with 'data_base64' phased out in the longer term? (Or are there good reasons that I missed why the 'data_base64' option is better?)

@jskeet
Contributor

jskeet commented Apr 25, 2022

To my mind, one problem is that "dataencoding" is a perfectly valid context attribute name, but its use here isn't really part of the CloudEvent itself. I would be happier with "data_encoding", to indicate that it's metadata about the "data" property rather than a separate context attribute.

In terms of supporting both: I'd prefer not to do that, personally. We can't make this change until 2.0 (it would be a breaking change) and I'd really like to aim for 2.0 to be very, very long-lived. Instead, I think it makes sense for a CloudEvent 1.0 to use the existing format, and a CloudEvent 2.0 to use "whatever we decide is best" - individual SDKs can decide which versions of CloudEvents they support. They may decide to support both 1.0 and 2.0 forever, or drop 1.0 support after 2.0 is widely adopted. Making that an SDK choice rather than having both options in the spec itself feels like a more flexible approach.

@duglin
Collaborator

duglin commented Apr 27, 2022

Adding the v2.0 label to this issue as we may want to consider tweaking the attributes at that time.

@duglin duglin added the v2.0 label Apr 27, 2022
@duglin
Collaborator

duglin commented Apr 27, 2022

Is the "datacontent/data_content/data_xxx" discussion the only open topic for this issue?
