Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add decode option to allow rejecting inputs that contain certain simple values. #481

Merged
merged 1 commit into from
Apr 7, 2024

Conversation

benluddy
Copy link
Contributor

@benluddy benluddy commented Jan 30, 2024

Description

Proof of concept implementation of #477.

PR Was Proposed and Welcomed in Currently Open Issue

Checklist (for code PR only, ignore for docs PR)

  • Include unit tests that cover the new code
  • Pass all unit tests
  • Pass all 18 ci linters (golint, gosec, staticcheck, etc.)
  • Sign each commit with your real name and email.
    Last line of each commit message should be in this format:
    Signed-off-by: Firstname Lastname firstname.lastname@example.com
  • Certify the Developer's Certificate of Origin 1.1
    (see next section).

Certify the Developer's Certificate of Origin 1.1

  • By marking this item as completed, I certify
    the Developer Certificate of Origin 1.1.
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
660 York Street, Suite 102,
San Francisco, CA 94110 USA

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.

Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@fxamacker
Copy link
Owner

@benluddy Thanks for this POC and discussion in issue #477. I focused on big picture since this has draft status.

If my understanding is correct, use cases are:

  • no custom handling of simple values (default and matches current behavior)
  • custom handling of all simple values
  • custom handling of some simple values, with default handling for all other simple values

The first two use cases are handled nicely in this POC since decoder uses either defaultSimpleValues or received SimpleValues.

However, the third use case seems error-prone because user needs to figure out how to trigger default behavior and apply it for all simple values which don't require custom handling. Maybe it can be more user friendly. Thoughts?

@benluddy
Copy link
Contributor Author

However, the third use case seems error-prone because user needs to figure out how to trigger default behavior and apply it for all simple values which don't require custom handling. Maybe it can be more user friendly. Thoughts?

To be clear, this would support users who want to do something like reject the undefined simple value instead of treating it as nil, but otherwise preserve the existing treatment of all other simple values?

I will take a look at that. Maybe this sort of usage would be clearer if the option field had an opaque type, with some exported functionality to allow users to construct option values that effectively specify default "overrides" for only a few simple values.

@benluddy benluddy force-pushed the simplevalue-govalues branch 3 times, most recently from 2208a23 to 2f65e3b Compare March 11, 2024 21:34
@benluddy benluddy marked this pull request as ready for review March 11, 2024 21:46
@benluddy
Copy link
Contributor Author

@fxamacker based on your feedback, I changed the option from a sparse map to a pointer to an opaque, immutable SimpleValueRegistry type. Again, if the option is nil, the existing behavior is retained. Users can construct a custom SimpleValueRegistry with NewSimpleValueRegistry:

no custom handling of simple values (default and matches current behavior)

SimpleValues: nil,

or

SimpleValues: NewSimpleValueRegistry(WithDefaultSimpleValueAnalogs),

custom handling of all simple values

SimpleValues: NewSimpleValueRegistry(
  // any simple value not registered will be rejected
  // this example will only recognize false and true
  WithSimpleValueAnalog(20, false),
  WithSimpleValueAnalog(21, true),
),

custom handling of some simple values, with default handling for all other simple values

SimpleValues: NewSimpleValueRegistry(
  WithDefaultSimpleValueAnalogs,
  WithNoSimpleValueAnalog(23), // reject undefined
  WithSimpleValueAnalog(200, some.Thing),
),

@fxamacker
Copy link
Owner

fxamacker commented Mar 18, 2024

@benluddy Thanks for updating this PR! I'll take a closer look this week (hopefully tomorrow). 🙏

Copy link
Owner

@fxamacker fxamacker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benluddy Thanks for updating this PR! Looks great!

I left a few comments, e.g. need to forbid mutable SimpleValue analogs, provide an additional way to create SimpleValue registry (to reduce user mistakes), etc.

nit: Not sure, but would names be more obvious if "Analog" was replaced by "Substitute" or "Sub"? E.g. "SimpeValueAnalog" -> "SimpleValueSubstitute"

decode.go Outdated
// SimpleValueRegistry is an immutable mapping from CBOR simple value number (0...23 and 32...255)
// to Go analog value.
type SimpleValueRegistry struct {
analogs [256]*interface{}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

decode.go Outdated
if sv >= 24 && sv <= 31 {
return fmt.Errorf("cbor: cannot set analog for reserved simple value %d", sv)
}
r.analogs[sv] = &analog
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably deny mutable analogs. If analog is of type *int, the integer value can be modified after SimpleValueRegistry is created.

In addition to forbidding pointers, we could also forbid containers able to hold pointers. This allows us to avoid some code bloat (e.g. checking if any element/field or nested element/field is a pointer).

For simplicity, maybe we can require analogs to essentially be one of these:

  • nil
  • bool
  • string
  • integers
  • floats

Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not confident that the library should forbid this. As you implied, it's not trivial to pin down without making the limitations very strict.

If limited to only a few built-in types, it becomes hard for users to distinguish between a simple value and a different value with an equivalent Go representation. If the limitation is relaxed to be based on kind rather than type, then users could define a named type with one of the supported underlying types, and use an analog of that type to distinguish the simple value.

However, I think that would eventually bite users who register analogs that they don't entirely control. If a perfect analog value is provided by a library, users may be forced to perform a second pass over their decoded object to convert from a user-defined sentinel value to the analog that they want to appear in the final object. Alternatively, a library-provided value may change during a dependency update in such a way that it no longer meets our restrictions on acceptable analog values.

I don't have a use case for this presently, but I lean toward leaving users responsible for recognizing that their *int analog is the pointer itself and not the referenced value. Can you imagine this causing thorny issues down the line within this library? Map key equivalence is the main area that comes to my mind.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If limited to only a few built-in types, it becomes hard for users to distinguish between a simple value and a different value with an equivalent Go representation. If the limitation is relaxed to be based on kind rather than type, then users could define a named type with one of the supported underlying types, and use an analog of that type to distinguish the simple value.

I agree. We can restrict based on kind rather than type.

However, I think that would eventually bite users who register analogs that they don't entirely control. If a perfect analog value is provided by a library, users may be forced to perform a second pass over their decoded object to convert from a user-defined sentinel value to the analog that they want to appear in the final object. Alternatively, a library-provided value may change during a dependency update in such a way that it no longer meets our restrictions on acceptable analog values.

Maybe we can also allow users to register Unmarshaler to handle simple values, as supported for tags.

So user can:

  • perform decoding operation via UnmarshalCBOR for registered simple value if any, or
  • decode simple value to registered analog

This would give user the flexibility to handle more complex analog (pointers and containers) and also let user assume the responsibility for handling edge cases related to that.

Thoughts?

I don't have a use case for this presently, but I lean toward leaving users responsible for recognizing that their *int analog is the pointer itself and not the referenced value. Can you imagine this causing thorny issues down the line within this library? Map key equivalence is the main area that comes to my mind.

My main concern is potential concurrency bugs introduced by users if we allow pointers and containers that can hold pointers. Time spent troubleshooting bugs in 3rd party apps or libraries takes time away from improving this codec.

Maybe we can forbid pointers and containers in this PR and open a separate PR to add support for registering Unmarshaler for simple values, which would allow the user to implement support pointers, containers, etc.

decode.go Outdated Show resolved Hide resolved
decode.go Show resolved Hide resolved
decode.go Outdated
Comment on lines 1238 to 1241
return &UnmarshalTypeError{
CBORType: t.String(),
GoType: v.Type().String(),
errorMsg: "simple value " + strconv.FormatInt(int64(val), 10) + " is not recognized",
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe return a new type of error (e.g. UnregisteredSimpleValueError) instead of reusing UnmarshalTypeError because:

  • UnmarshalTypeError is returned when CBOR data can't be decoded to specified Go type
  • The simple value error only happens when given simple value is explicitly not registered by users.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, agreed. WDYT about a more general UnsupportedValueError that is returned when we recognize a valid input but are configured not to accept it? This sort of thing is mentioned in https://www.rfc-editor.org/rfc/rfc8949.html#section-5-3:

CBOR-based protocols MUST specify how their decoders handle invalid and other unexpected data. CBOR-based protocols MAY specify that they treat arbitrary valid data as unexpected. Encoders for CBOR-based protocols MUST produce only valid items, that is, the protocol cannot be designed to make use of invalid items. An encoder can be capable of encoding as many or as few types of values as is required by the protocol in which it is used; a decoder can be capable of understanding as many or as few types of values as is required by the protocols in which it is used.

I have a similar use case for rejecting inputs containing bignum tags, and I don't think applications would get much utility from having distinct error types for different inputs that the library supports but their application protocol does not.

decode.go Outdated Show resolved Hide resolved
decode.go Outdated Show resolved Hide resolved
@benluddy
Copy link
Contributor Author

Thanks for the thorough review @fxamacker, I think I've made changes to address all except one of the inline comments. I held off on changing #481 (comment) for the moment. If you feel strongly about it then I'll go ahead and implement the change you suggested (that decision is not critical to any use case I have).

nit: Not sure, but would names be more obvious if "Analog" was replaced by "Substitute" or "Sub"? E.g. "SimpleValueAnalog" -> "SimpleValueSubstitute"

For me "analog" is a closer fit because there is a 1:1 correspondence between a simple value and a unique Go value. "Substitute" might imply that there is a natural representation of any given simple value that the substitute is replacing, but only a few of the simple values (true, false, and maybe null) have obvious Go representations. I'll defer to your judgment on this as well, please let me know if you would prefer to see the terminology changed.

@benluddy benluddy requested a review from fxamacker March 26, 2024 17:04
Copy link
Owner

@fxamacker fxamacker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @benluddy for updating this PR! It looks great!

For me "analog" is a closer fit because there is a 1:1 correspondence between a simple value and a unique Go value. "Substitute" might imply that there is a natural representation of any given simple value that the substitute is replacing, but only a few of the simple values (true, false, and maybe null) have obvious Go representations. I'll defer to your judgment on this as well, please let me know if you would prefer to see the terminology changed.

Thanks for describing the reasons for using the term "analog", it makes sense, lets keep your preferred terminology! 👍

I held off on changing #481 (comment) for the moment. If you feel strongly about it then I'll go ahead and implement the change you suggested (that decision is not critical to any use case I have).

I prefer being more restrictive in built-in decoding functions (forbidding containers and pointers as analogs to SimpleValue) and allowing users to implement support for containers, pointers, etc. by registering Unmarshaler for simple values if they need it.

By forbidding containers and pointers as analogs for simple values:

  • We avoid making it easier for users to introduce concurrency bugs. This helps me limit time spent troubleshooting concurrency bugs in 3rd-party apps and libraries when users suggest the bug is in this codec.
  • We avoid premature code bloat from checking nested elements of containers, etc.
  • We can open separate PR to allow users to register Unmarshaler for simple values so users can implement support for all kinds of types including containers and pointers.

Thoughts?

decode.go Outdated
Comment on lines 525 to 527
// Creates a new SimpleValueRegistry. The registry state is initialized by executing the provided
// functions in order against an empty registry. Any simple value without a registered analog will
// produce an UnacceptableDataItemError if encountered in the input while unmarshaling.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe add function name to comment.

Suggested change
// Creates a new SimpleValueRegistry. The registry state is initialized by executing the provided
// functions in order against an empty registry. Any simple value without a registered analog will
// produce an UnacceptableDataItemError if encountered in the input while unmarshaling.
// NewSimpleValueRegistry creates a new SimpleValueRegistry. The registry state is initialized by executing the provided
// functions in order against an empty registry. Any simple value without a registered analog will
// produce an UnacceptableDataItemError if encountered in the input while unmarshaling.

decode.go Outdated
Comment on lines 538 to 539
// Creates a new SimpleValueRegistry. The registry state is initialized by executing the provided
// functions in order against a registry that is pre-populated with the library defaults.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe add function name to comment.

Suggested change
// Creates a new SimpleValueRegistry. The registry state is initialized by executing the provided
// functions in order against a registry that is pre-populated with the library defaults.
// NewSimpleValueRegistryFromDefaults creates a new SimpleValueRegistry. The registry state is initialized by executing the provided
// functions in order against a registry that is pre-populated with the library defaults.

decode.go Outdated Show resolved Hide resolved
decode.go Outdated
if sv >= 24 && sv <= 31 {
return fmt.Errorf("cbor: cannot set analog for reserved simple value %d", sv)
}
r.analogs[sv] = &analog
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If limited to only a few built-in types, it becomes hard for users to distinguish between a simple value and a different value with an equivalent Go representation. If the limitation is relaxed to be based on kind rather than type, then users could define a named type with one of the supported underlying types, and use an analog of that type to distinguish the simple value.

I agree. We can restrict based on kind rather than type.

However, I think that would eventually bite users who register analogs that they don't entirely control. If a perfect analog value is provided by a library, users may be forced to perform a second pass over their decoded object to convert from a user-defined sentinel value to the analog that they want to appear in the final object. Alternatively, a library-provided value may change during a dependency update in such a way that it no longer meets our restrictions on acceptable analog values.

Maybe we can also allow users to register Unmarshaler to handle simple values, as supported for tags.

So user can:

  • perform decoding operation via UnmarshalCBOR for registered simple value if any, or
  • decode simple value to registered analog

This would give user the flexibility to handle more complex analog (pointers and containers) and also let user assume the responsibility for handling edge cases related to that.

Thoughts?

I don't have a use case for this presently, but I lean toward leaving users responsible for recognizing that their *int analog is the pointer itself and not the referenced value. Can you imagine this causing thorny issues down the line within this library? Map key equivalence is the main area that comes to my mind.

My main concern is potential concurrency bugs introduced by users if we allow pointers and containers that can hold pointers. Time spent troubleshooting bugs in 3rd party apps or libraries takes time away from improving this codec.

Maybe we can forbid pointers and containers in this PR and open a separate PR to add support for registering Unmarshaler for simple values, which would allow the user to implement support pointers, containers, etc.

@benluddy
Copy link
Contributor Author

benluddy commented Apr 1, 2024

Maybe we can forbid pointers and containers in this PR and open a separate PR to add support for registering Unmarshaler for simple values, which would allow the user to implement support pointers, containers, etc.

I like that direction.

If support is eventually added for user-provided simple value unmarshalers, and we also implement limited support for certain simple value analog types today, there will ultimately be several ways to configure the same behaviors. My use cases don't require user-provided analogs at all, only the ability to reject simple values other than true, false, and null. Would you accept a more limited option for today that allows users to configure only which simple values their protocol rejects, without touching how accepted simple values are decoded? Given the discussion on this PR, I don't want to be too hasty to commit to a decision on how users provide their own analogs, but I think a simple value allowlist option and new error type would be uncontroversial.

@fxamacker
Copy link
Owner

Would you accept a more limited option for today that allows users to configure only which simple values their protocol rejects, without touching how accepted simple values are decoded? Given the discussion on this PR, I don't want to be too hasty to commit to a decision on how users provide their own analogs, but I think a simple value allowlist option and new error type would be uncontroversial.

@benluddy Yes, that sounds good to me! 👍

@benluddy benluddy changed the title Option to set arbitrary simple value to Go value mappings. Add decode option to allow rejecting inputs containing certain simple values. Apr 2, 2024
@benluddy benluddy changed the title Add decode option to allow rejecting inputs containing certain simple values. Add decode option to allow rejecting inputs that contain certain simple values. Apr 2, 2024
@benluddy benluddy requested a review from fxamacker April 2, 2024 16:37
@benluddy
Copy link
Contributor Author

benluddy commented Apr 2, 2024

Thanks for the great discussion on this feature @fxamacker!

I updated the PR so that users can reject arbitrary simple values but cannot change how any simple value unmarshals to a Go value. The option's API surface has roughly the same shape (only smaller) and should allow for the registration of custom simple value unmarshalers in the future. Please take another look when you get the chance.

Applications can use the decode option SimpleValues to override the default unmarshal behavior on a
per-simple-value basis. The option is open to future extension, and currently supports the choice
between the default backwards-compatible behavior (true as true, false as false, null as nil, and
undefined as nil, and cbor.SimpleValue(N) for each currently-unregistered but well-formed simple
value number N) and outright rejection with the new error type UnacceptableDataItemError.

Signed-off-by: Ben Luddy <bluddy@redhat.com>
Copy link
Owner

@fxamacker fxamacker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benluddy Nice! 👍 Thanks for updating this PR and great discussions!

@fxamacker fxamacker merged commit 9f099e8 into fxamacker:master Apr 7, 2024
17 checks passed
@fxamacker fxamacker added the enhancement New feature or request label Apr 22, 2024
@fxamacker fxamacker added this to the v2.7.0 milestone Apr 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants