feature: Expected later base64 encoding tag #449

Closed
benluddy opened this issue Dec 8, 2023 · 5 comments

Comments

@benluddy
Contributor

benluddy commented Dec 8, 2023

Is your feature request related to a problem? Please describe.

This request comes from a similar use case as #446. Essentially, Go struct objects are being serialized using encoding/json, transmitted to another program that does not have access to the definitions of the Go struct types, and deserialized (again using encoding/json) into an empty interface. I am working to support CBOR as a compatible alternative to the existing JSON encoding.

Currently, there's an incompatibility when dealing with Go fields of type []byte. The behavior of encoding/json is: marshaling []byte produces a JSON string containing the base64 encoding of the slice contents. Unmarshaling this back into a []byte does the reverse, transparently decoding the base64 string into the original bytes. Unmarshaling into an empty interface produces a Go string containing the base64 encoding.
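The encoding/json behavior described above can be reproduced with the standard library alone. This is a minimal sketch; the Payload type and the roundTripJSON helper are hypothetical names introduced only for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Payload is a hypothetical struct with a single []byte field.
type Payload struct {
	Bytes []byte
}

// roundTripJSON marshals a Payload, then unmarshals the result twice:
// once into a typed struct and once into an untyped map.
func roundTripJSON(in []byte) (string, []byte, interface{}, error) {
	data, err := json.Marshal(Payload{Bytes: in})
	if err != nil {
		return "", nil, nil, err
	}

	// Typed destination: the base64 string is decoded back to bytes.
	var typed Payload
	if err := json.Unmarshal(data, &typed); err != nil {
		return "", nil, nil, err
	}

	// Untyped destination: the base64 text is kept as a Go string.
	var untyped map[string]interface{}
	if err := json.Unmarshal(data, &untyped); err != nil {
		return "", nil, nil, err
	}

	return string(data), typed.Bytes, untyped["Bytes"], nil
}

func main() {
	encoded, typed, untyped, err := roundTripJSON([]byte("hello world"))
	if err != nil {
		panic(err)
	}
	fmt.Println(encoded)                    // {"Bytes":"aGVsbG8gd29ybGQ="}
	fmt.Printf("%q\n", typed)               // "hello world"
	fmt.Printf("%T %v\n", untyped, untyped) // string aGVsbG8gd29ybGQ=
}
```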

As expected, CBOR marshaling doesn't perform the base64 encoding or decoding, since CBOR provides distinct byte string and text string types. It also preserves that distinction when decoding into an empty interface value and produces a []byte.

https://go.dev/play/p/n8nnk-HnHGi

Describe the solution you'd like

RFC 8949 (in https://www.rfc-editor.org/rfc/rfc8949.html#section-3.4.5.2) specifies several tags for "expected later encoding" that an encoder may attach to byte strings to communicate how the string should be converted to JSON. I would like to be able to optionally configure the CBOR encoder to automatically apply tag 22 when it serializes a Go []byte to a CBOR byte string, and optionally configure the decoder to honor the tag when decoding into an empty interface value.
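To make the wire format concrete, here is a hand-rolled sketch of what a byte string wrapped in tag 22 looks like on the wire. It uses only the standard library and is not the cbor library's API; wrapTag22 is a hypothetical helper that handles only byte strings shorter than 24 bytes, where both the tag number and the length fit in the initial byte.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// wrapTag22 is a hypothetical helper (not part of any library). It emits
// tag 22 followed by a definite-length CBOR byte string.
func wrapTag22(b []byte) ([]byte, error) {
	if len(b) >= 24 {
		return nil, fmt.Errorf("sketch only handles byte strings shorter than 24 bytes, got %d", len(b))
	}
	out := make([]byte, 0, len(b)+2)
	out = append(out, 0xc0|22)           // major type 6 (tag), tag number 22
	out = append(out, 0x40|byte(len(b))) // major type 2 (byte string), short length
	return append(out, b...), nil
}

func main() {
	enc, err := wrapTag22([]byte("hello world"))
	if err != nil {
		panic(err)
	}
	fmt.Println(hex.EncodeToString(enc)) // d64b68656c6c6f20776f726c64
}
```

In diagnostic notation this is 22('hello world'): the leading 0xd6 is the tag, and 0x4b introduces an 11-byte byte string.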

Support for encoding expected later encoding tags could be controlled by an EncOption that sets a single (or no) expected later encoding for any encoded []byte. It might be interesting instead to infer tag 22 automatically when encoding struct fields of type []byte that have json field tags, but users would still reasonably expect an option to disable that, so I don't think there's a real upside there.

A new DecOption would control the behavior of decoding expected later encoding tags into empty interface values and into Go strings.

Describe alternatives you've considered

I'd like to be able to implement this using TagSet, but I don't think it's possible with the current interface.

Additional context

@fxamacker
Owner

Hi @benluddy, thanks for opening this issue!

Yes, I agree the current interface of TagSet doesn't support this.

I'm open to extending TagSet without breaking backward compatibility. Adding decoding and encoding options would also work.

I'd need to look into this in order to have a preference. Do you have a preference between extending TagSet or adding decoding/encoding options?

@benluddy
Contributor Author

I've looked at this problem more and no longer think tags are sufficient to reach drop-in compatibility with encoding/json when dealing with []byte.

Say that CBOR is configured to:

  • always encode []byte with tag 22
  • base64-encode the contents of tag 22 strings when decoding into Go strings and interface{}
  • ignore tag 22 when decoding into []byte

This gives JSON-compatible results for:

  • []byte-to-CBOR-to-[]byte (this path is compatible today)
  • []byte-to-CBOR-to-interface{}
  • []byte-to-CBOR-to-string

...but not for string-to-CBOR-to-[]byte. If a client uses JSON to serialize map[string]interface{}{"Bytes": "aGVsbG8gd29ybGQ="} and sends it to a server that decodes into a struct{Bytes []byte}, the server will see []byte("hello world") in the Bytes field. The output of a CBOR encoder dropped into the same client would be read by the server as []byte("aGVsbG8gd29ybGQ=").

So there would also need to be a decode option that assumes untagged CBOR strings contain base64-encoded data when decoding into a []byte. Strings with tag 22 would not need to be decoded, so the option would depend on tag 22 being one of the built-in tags.
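The JSON side of the string-to-[]byte case can be checked with the standard library. In this sketch, serverView and decodeOnServer are hypothetical names; the function stands in for a client that serializes an untyped map and a server that decodes into a struct with a []byte field.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// serverView is a hypothetical stand-in for the server's struct type.
type serverView struct {
	Bytes []byte
}

// decodeOnServer serializes the client's untyped value with encoding/json
// and decodes it the way the server would, into a struct with a []byte field.
func decodeOnServer(clientValue map[string]interface{}) ([]byte, error) {
	data, err := json.Marshal(clientValue)
	if err != nil {
		return nil, err
	}
	var s serverView
	if err := json.Unmarshal(data, &s); err != nil {
		return nil, err
	}
	return s.Bytes, nil
}

func main() {
	got, err := decodeOnServer(map[string]interface{}{"Bytes": "aGVsbG8gd29ybGQ="})
	if err != nil {
		panic(err)
	}
	// The Go string is treated as base64 text and decoded on the way in.
	fmt.Printf("%q\n", got) // "hello world"
}
```

A CBOR encoder would write the same Go string as a plain text string, so a decoder without the additional option would hand the server the 16 base64 characters rather than the 11 decoded bytes.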

@benluddy
Contributor Author

Hi @fxamacker, there were enough details to consider here that I went ahead and implemented a POC (#476). It ended up fairly close to what I described in my last comment. Please take a look when you're able. I'd like to arrive at an approach you're happy with before implementing full test coverage in my branch. This is the gist of the approach in the POC:

A CBOR encoder that is aware of the text format it will interoperate with can configure any (or none) of the expected later encoding tags to be automatically applied whenever a Go []byte is encoded to a byte string (e.g. []byte("hello world") might encode as 22('hello world')). This is controlled by a new encode option, ByteSliceMode.

The same CBOR encoder might also be asked to encode an interface{} that was itself the output of a text format decoder, like encoding/json. Any Go string might have originally been produced by applying a text encoding to a []byte (e.g. []byte("hello world") might encode to the JSON string "aGVsbG8gd29ybGQ=" and decode back to the Go string "aGVsbG8gd29ybGQ="). The CBOR encoder has no way to recognize that a Go string in its input represents, as in the example, the base64 encoding of []byte("hello world").

A corresponding decoder needs to be able to handle the CBOR produced in both of the above cases appropriately whether decoding into a []byte or a string. To interoperate with encoding/json across struct and interface{} values, the desired decoder behavior is:

CBOR                | destination type | destination value     | conversion
--------------------+------------------+-----------------------+-----------
22('hello world')   | string           | "aGVsbG8gd29ybGQ="    | encode
'hello world'       | string           | "hello world"         | none
22('hello world')   | []byte           | []byte("hello world") | none
'aGVsbG8gd29ybGQ='  | []byte           | []byte("hello world") | decode

This is made configurable with two proposed decode options. First, TextConversions func(reflect.Type) TextConversionMode, which selects a text conversion (encode, decode, or none) based on destination type. Second, DefaultTextEncoding TextEncoding, which specifies a particular text encoding (base64url, base64, base16, or none) to assume for untagged byte strings when the text conversion mode is decode.
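The four rows of the table can be sketched as a small decision function. This is not the POC's implementation or the library's API: conversion, chooseConversion, and applyConversion are hypothetical names, and the sketch assumes base64 (tag 22) is the only text encoding in play.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// conversion is a hypothetical stand-in for the proposed text conversion modes.
type conversion int

const (
	convNone conversion = iota
	convEncode
	convDecode
)

// chooseConversion mirrors the table: tagged content going to a string is
// encoded, untagged content going to a []byte is decoded, and the other
// two combinations pass through unchanged.
func chooseConversion(tagged22, dstIsString bool) conversion {
	switch {
	case tagged22 && dstIsString:
		return convEncode
	case !tagged22 && !dstIsString:
		return convDecode
	default:
		return convNone
	}
}

// applyConversion applies the chosen conversion using standard base64.
func applyConversion(c conversion, content []byte) ([]byte, error) {
	switch c {
	case convEncode:
		return []byte(base64.StdEncoding.EncodeToString(content)), nil
	case convDecode:
		return base64.StdEncoding.DecodeString(string(content))
	default:
		return content, nil
	}
}

func main() {
	rows := []struct {
		tagged22, dstIsString bool
		content               string
	}{
		{true, true, "hello world"},
		{false, true, "hello world"},
		{true, false, "hello world"},
		{false, false, "aGVsbG8gd29ybGQ="},
	}
	for _, r := range rows {
		out, err := applyConversion(chooseConversion(r.tagged22, r.dstIsString), []byte(r.content))
		if err != nil {
			panic(err)
		}
		fmt.Printf("tagged=%v string=%v -> %q\n", r.tagged22, r.dstIsString, out)
	}
}
```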

The test in the POC round-trips from []byte to interface{} and back using both CBOR (with the configuration described above) and encoding/json, verifying that the intermediate interface{} values in both cases are identical to each other and that the final values in both cases are identical to the original value.

@fxamacker
Owner

Thanks again for the detailed write up! I shared some thoughts in PR #476.

The draft PR and round-trip tests were really helpful! 👍

@fxamacker
Owner

Thanks Ben! Closed by #476.
