
Better JSON binary handling #459

Closed
jroper opened this issue Jun 18, 2019 · 8 comments

@jroper (Contributor) commented Jun 18, 2019

Currently, binary values get mapped to base64ed strings in JSON. The problem is that there's no reliable way to know, when converting JSON back to a CloudEvent, whether a JSON string is a CloudEvents String type or a CloudEvents Binary type that has been base64ed. The result is an opportunity for incompatibilities between systems: since the behaviour is unspecified, different systems may handle it in different ways (e.g., some might always treat strings as strings, some might use heuristics to guess whether a value looks like base64, etc.).
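To make the ambiguity concrete, here's a minimal sketch (Go standard library only; the attribute name `myext` is made up):

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

func main() {
	// Producer A has a Binary-typed extension attribute and base64s it.
	binaryAttr := base64.StdEncoding.EncodeToString([]byte{0xDE, 0xAD, 0xBE, 0xEF})

	// Producer B has a String-typed attribute whose value happens to be
	// the same text.
	stringAttr := "3q2+7w=="

	a, _ := json.Marshal(map[string]string{"myext": binaryAttr})
	b, _ := json.Marshal(map[string]string{"myext": stringAttr})

	// Both serialize identically, so a consumer that doesn't know the
	// attribute's declared type cannot tell them apart.
	fmt.Println(string(a)) // {"myext":"3q2+7w=="}
	fmt.Println(string(b)) // {"myext":"3q2+7w=="}
}
```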

Now, for known attributes that are Binary, that type information can be used to handle them. But for unknown attributes (e.g., extension attributes that a particular implementation does not understand but may pass along as-is), there is no way to handle them correctly. The same applies to Any-typed attributes.

I propose, therefore, that we change the way binary values are encoded, such that a binary value is encoded as a JSON object with a single member with a special name, e.g.:

"someanytypedattribute": {
  "__ce-binary": "... <base64ed content here> ..."
}

Of course, this would prevent encoding any Maps or JSON values that have a key of __ce-binary, but the risk of that ever occurring in the real world is next to zero. If the above were adopted, it would always be unambiguous whether a particular field is binary or a string. The same approach could be used for the data: decoders would not have to rely on complex heuristics based on the content type to determine whether the value is meant to be binary or text. Producers could make whatever decision they want as to whether the data is binary, text, or JSON, and the consumer would always be able to consume it correctly, without the risk of accidentally treating base64ed bytes as a string, or vice versa.
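A decoder could then disambiguate without any out-of-band type information. A rough sketch of what that might look like (decodeAttr is an invented helper, not part of any SDK):

```go
package cejson

import (
	"encoding/base64"
	"encoding/json"
)

// decodeAttr returns either a string or raw bytes, using the proposed
// "__ce-binary" wrapper to disambiguate. A sketch of the proposal only.
func decodeAttr(raw json.RawMessage) (interface{}, error) {
	var wrapper map[string]string
	if err := json.Unmarshal(raw, &wrapper); err == nil && len(wrapper) == 1 {
		if b64, ok := wrapper["__ce-binary"]; ok {
			// Unambiguously Binary: undo the base64.
			b, err := base64.StdEncoding.DecodeString(b64)
			return b, err
		}
	}
	// Anything else that is a JSON string is unambiguously a String.
	var s string
	if err := json.Unmarshal(raw, &s); err != nil {
		return nil, err
	}
	return s, nil
}
```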

@clemensv (Contributor) commented Jun 18, 2019

Binary isn't different from Timestamp. We have a canonical representation for Binary (a Base64 string) and a canonical representation for Timestamp (an RFC3339 string), and an implementation or encoding is allowed to carry either in a different form and expose them as native runtime types. Interop is ensured by always being able to convert via the canonical representation. It doesn't matter what format the Binary is in during transfer, as long as a consuming client who expects a binary is able to convert it; and if the consumer sees a string, it can expect the encoding to be Base64. If a client has no opinion about the expected data type, it's probably also fine for it to treat a value as a string if it arrives as a string.

What you are proposing is to introduce a constructor concept for JSON, like the one that already exists in the AMQP and Protobuf wire encodings.

@jroper (Contributor, Author) commented Jun 18, 2019

But what about Any-typed attributes? Right now, if you have an Any-typed attribute and an instance of that attribute is Binary, it gets base64 encoded and put in a String, but then the consumer has no way of knowing whether it should be base64 decoded or not; that context is lost. The result is a corruption of the event: the attribute will end up having a different type, and different context data, in the consumer. That's a bug. A round trip to and from an encoding can't be allowed to produce a different representation; that's exactly the sort of edge case where you run into interoperability issues. And the whole point of CloudEvents is to achieve interoperability, to ensure that every producer and consumer pair has exactly the same understanding of an event. If we don't solve problems like this, what are we even doing this for?

@duglin (Collaborator) commented Jun 18, 2019

Obviously this is related to our previous discussions around #432. Based on the above, it seems like there's agreement that if both the producer and consumer know that an attribute is going to be base64ed as it is serialized, then there's no interop issue, right? It's really only an issue for unknown extensions. My assumption was that consumers could treat them as strings (or raw bytes). If the consumer has no idea what an attribute means, or how it should be used, the best it can do is pass it along in the hope that someone later in the pipeline will know what to do with it. At that point, it can be de-base64ed by that component. But I don't see the interop issue, since without knowing what the value means I don't think the consumer can do anything with it anyway, even if it did know it was base64ed. @jroper, perhaps you can elaborate on what you think a consumer should do with an unknown extension that would require it to know whether to de-base64 it?

@duglin (Collaborator) commented Jun 18, 2019

Having said that... if we did actually see an interop issue around not knowing each attribute's type, then my current preference would be to use something like what @JemDay suggested, and just encode the type in the name of the attribute as it is serialized, e.g. ce-foo-int. But I'm not seeing the need for that yet.
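For illustration, attributes serialized that way might look along these lines (a hypothetical rendering; the attribute names and values are invented):

```go
package example

// Hypothetical rendering of the type-in-the-name idea: the declared type
// rides along in the serialized attribute name, so even an extension the
// consumer doesn't understand is self-describing.
var attrs = map[string]string{
	"ce-foo-int":    "42",       // declared Integer
	"ce-bar-binary": "3q2+7w==", // declared Binary (base64ed)
	"ce-baz-string": "3q2+7w==", // same text, but declared String
}
```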

@clemensv (Contributor) commented Jul 2, 2019

@jroper

The consumer will typically have an expectation for the field even if it's Any typed. The concrete type can also be communicated through a second attribute, exactly as we do with datacontentencoding for the data attribute.

At the consumer side, the app will be coded against a particular contract for a context attribute. If the expectation is that the context attribute holds a binary token and the attribute's type is text, the app will try to do a Base64 conversion, and that's an informed choice, because that's CE's canonical encoding. With transports/encodings that can carry binary as binary, there's obviously no issue.

The model we're employing for all types is that the canonical format is a string encoding, and the consumer MUST have further knowledge about the expected concrete type to map into the native type they use with their runtime. If the consumer doesn't have that knowledge, it's quite safe to assume that they're not taking action on that value, and thus leaving the value as a string is just fine. When the consumer does have that knowledge, the rules we have ensure that the data is convertible if valid.
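In code, that informed choice might look something like this (a sketch only; the attribute name mytoken is invented):

```go
package main

import (
	"encoding/base64"
	"fmt"
)

func main() {
	// The consumer's contract says the "mytoken" attribute is Binary,
	// so it receives the canonical base64 string and converts deliberately.
	attrs := map[string]string{"mytoken": "3q2+7w=="}

	token, err := base64.StdEncoding.DecodeString(attrs["mytoken"])
	if err != nil {
		// A non-base64 value violates the contract; reject the event.
		panic(err)
	}
	fmt.Printf("%x\n", token) // deadbeef
}
```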

@jroper (Contributor, Author) commented Jul 2, 2019

> The consumer will typically have an expectation for the field even if it's Any typed.

Isn't that a contradiction? If the consumer has an expectation of the type, why is it using Any? What purpose does Any serve if it's not to allow attribute values to be sent with no expectation of what their type is? If the consumer expects Binary, then shouldn't the attribute type be Binary? What's an example use case for allowing Any if consumers are required to expect a certain type?

Here's an example case where the consumer does not have an expectation. Kafka event keys can be Strings, or bytes, or integers, or anything. So if I had an event key extension for Kafka that defined the key to be Any, the producer could treat it as a String or a Binary, but the consumer wouldn't know whether the producer set a String or a base64'd Binary. Does this matter? In some use cases, yes. Let's say the consumer doesn't understand the event key; it's just updating a database table, using a base64ed version of the event key bytes as the primary key. To do that, it needs to know whether the value is a Binary that is already base64ed, or a String that still needs to be base64ed; otherwise it's not going to update the right primary key in the database. The solution? The Kafka extension should not use the type Any. It should instead use the type Binary; then the consumer can expect it to be binary, know that it's always base64ed, and handle it correctly. By extension, I believe any case where you might use Any will have the same kinds of problems, so Any should never be used as an attribute type, and hence should not be allowed by the spec as an attribute type value.
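To spell out how the two readings diverge (a sketch; the received value is invented):

```go
package main

import (
	"encoding/base64"
	"fmt"
)

func main() {
	// The JSON string value the consumer received for the key attribute.
	key := "3q2+7w=="

	// Reading A: it's an already-base64ed Binary, so use it as-is.
	pkA := key

	// Reading B: it's a plain String, so base64 its bytes first.
	pkB := base64.StdEncoding.EncodeToString([]byte(key))

	// The two readings produce different primary keys for the same event.
	fmt.Println(pkA) // 3q2+7w==
	fmt.Println(pkB) // M3EyKzd3PT0=
}
```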

@duglin (Collaborator) commented Jul 11, 2019

@jroper will #470 address this one?

@duglin (Collaborator) commented Aug 15, 2019

I believe with #484 merged we can now close this. If I'm mistaken, please let me know and we can reopen it.

duglin closed this as completed Aug 15, 2019