Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Drake application yaml logic handles !!binary tag #22318

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

SeanCurtis-TRI
Copy link
Contributor

@SeanCurtis-TRI SeanCurtis-TRI commented Dec 16, 2024

Adds the ability for Drake to treat the yaml !!binary tag in a consistent and coherent manner.


This change is Reviewable

Copy link
Contributor Author

@SeanCurtis-TRI SeanCurtis-TRI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+@jwnimmer-tri Here's the promised read functionality. Ignore R1, it's still floating on top of the path/FileSource serialization. R2 has the python stuff and R3 has the C++ side.

Reviewable status: LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, commits need curation (https://drake.mit.edu/reviewable.html#curated-commits), missing label for release notes (waiting on @SeanCurtis-TRI)

@SeanCurtis-TRI SeanCurtis-TRI force-pushed the PR_yaml_byte_string_v2 branch 4 times, most recently from 2af36ef to 5eda807 Compare December 17, 2024 17:03
Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll post my review of the test cases now. Big picture, it seems like the C++ and Python tests lack parity -- details inline below.

For my review of the actual implementation changes, I think my feedback will make the most sense when accompanied by a patch, which I'm working on now and will share a bit later.

Reviewed 1 of 2 files at r2, 2 of 7 files at r3, 8 of 9 files at r4, 1 of 1 files at r5, all commit messages.
Reviewable status: 12 unresolved discussions, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, commits need curation (https://drake.mit.edu/reviewable.html#curated-commits), missing label for release notes (waiting on @SeanCurtis-TRI)


a discussion (no related file):
This PR only seems to have reading. Usually we add both reading and writing in the same PR, as a way to help cross-check that both sides of the story are mutually compatible.

I suppose if its an easier PR train we can add reading and then PR the writing half immediately after (while backfilling any glitches in reading that show up), but we don't want to delay that very long.


common/yaml/test/example_structs.h line 70 at r5 (raw file):

  }

  std::vector<std::byte> value{std::byte(0), std::byte(1), std::byte(2)};

nit This should patch the Python test value, i.e., deadbeef. (Or Python could change to match this.)


common/yaml/test/example_structs.h line 92 at r5 (raw file):

}

struct AllScalarsStruct {

nit Don't we need a std::vector<std::byte> here, both for completeness and to match the Python tests?


common/yaml/yaml_node.h line 174 at r4 (raw file):

  static constexpr std::string_view kTagStr{"tag:yaml.org,2002:str"};

  // https://yaml.org/spec/1.2.2/#generic-string

nit Wrong documentation link


bindings/pydrake/common/test/yaml_typed_test.py line 38 at r5 (raw file):

@dc.dataclass
class IntStruct:

See the comment about 10 lines up.

We need IntStruct on the C++ side now, too.


bindings/pydrake/common/test/yaml_typed_test.py line 321 at r5 (raw file):

        for value,  error_msg in cases:
            data = f"value: !!binary {value}"
            with self.assertRaisesRegex(yaml.constructor.ConstructorError,

nit This is overly coupling our test case to the implementation. We don't care about the type, we only care that the message is sensible.

(Therefore also remember to remove the import yaml.)

Suggestion:

Exception

bindings/pydrake/common/test/yaml_typed_test.py line 354 at r5 (raw file):

                self.assertEqual(x.value, b'')

        # Using !!binary and assigning it to non-bytes should throw.

Do these test cases exist on the C++ side? I didn't immediately find them.

Our goal is to keep the Python and C++ test case inputs as close alignment as possible. It's okay to diverge in the expected outcome, but the input panel should be roughly the same, declared in the same order, with the same phrasing, data, comments, etc. As naive of a cross-language porting as possible.


bindings/pydrake/common/test/yaml_typed_test.py line 307 at r4 (raw file):

            ("!!binary |\n  A3Rlc3Rfc3RyAw==", b"\x03test_str\x03"),
            ("!!binary |\n  A3Rlc3Rf\n  c3RyAw====", b"\x03test_str\x03"),
            ("!!binary", b''),

nit Double quotes for consistency?

In general throughout this method, probably the single vs double quotes could be made more uniform. I'll note a couple more instance inline, but not try to ding all of them.

Suggestion:

b""

bindings/pydrake/common/test/yaml_typed_test.py line 318 at r4 (raw file):

            ("A3Rfc3RyAw=", "Incorrect padding"),
            ("A3Rfc*RyAw==", "Invalid base64-encoded string")]
        for value,  error_msg in cases:

nit Disallowed styleguide abbreviation "msg". (I don't think we have any sodium salt of glutamic acid in our yaml parse.)

(Also nit too much whitespace in the middle.)

Suggestion:

for value, error_regex in cases:

bindings/pydrake/common/test/yaml_typed_test.py line 346 at r4 (raw file):

            ("true", "Expected.*bytes.*bool"),
        ]
        for value, error_message in cases:

BTW

Suggestion:

error_regex

bindings/pydrake/common/test/yaml_typed_test.py line 355 at r4 (raw file):

        # Using !!binary and assigning it to non-bytes should throw.
        cases = [
            (b'.inf', FloatStruct),

BTW Prior tests cases in this method used double quotes for b"stuff" input. Might as well stay consistent here too?


tools/workspace/yaml_cpp_internal/repository.bzl line 8 at r4 (raw file):

    github_archive(
        name = name,
        # local_repository_override = "/home/seancurtis/code/yaml-cpp",

nit Stray line

Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 13 unresolved discussions, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, commits need curation (https://drake.mit.edu/reviewable.html#curated-commits), missing label for release notes (waiting on @SeanCurtis-TRI)


common/yaml/yaml_read_archive.h line 1 at r5 (raw file):

#pragma once

See the head commit here for my suggestions:

https://github.com/jwnimmer-tri/drake/commits/yaml-binary-fixups/

Big picture, we want to keep the complexity in the cc file and we must use ReportError for errors (so that we get line numbers).

Copy link
Contributor Author

@SeanCurtis-TRI SeanCurtis-TRI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's one way in writing binary strings differ between C++ and python. It's apparent in the acceptance strings in the two corresponding tests. Python always formats it as:

value: !!binary |
  bacdasldkfj

Whereas C++ prefers:

value: !!binary bacdasldkfj

Reading on both sides have been shown to accept both formats, so I'm choosing not to worry about the fact that they get formatted slightly differently on writiing.

Reviewable status: 1 unresolved discussion, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, missing label for release notes (waiting on @SeanCurtis-TRI)


a discussion (no related file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

This PR only seems to have reading. Usually we add both reading and writing in the same PR, as a way to help cross-check that both sides of the story are mutually compatible.

I suppose if its an easier PR train we can add reading and then PR the writing half immediately after (while backfilling any glitches in reading that show up), but we don't want to delay that very long.

I thought I had been clear about its state. Apparently not. I'd intentionally deferred adding writing to the PR until we were in agreement on the reading semantics.

Writing has now been added.


bindings/pydrake/common/test/yaml_typed_test.py line 318 at r4 (raw file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

nit Disallowed styleguide abbreviation "msg". (I don't think we have any sodium salt of glutamic acid in our yaml parse.)

(Also nit too much whitespace in the middle.)

But you gotta love that dash of umami...


bindings/pydrake/common/test/yaml_typed_test.py line 38 at r5 (raw file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

See the comment about 10 lines up.

We need IntStruct on the C++ side now, too.

Sadly, reviewable isn't good at providing relevant context for "comment about 10 lines up" as there are no other comments in this file.

However, the request for the IntStruct is clear.

(FTR, I hadn't bothered because of the difference between the yaml parsing logic in python vs C++. Python does all of the type conversion before we get a hold of the result, whereas C++ we do it ourselves. So, I felt the DoubleStruct provided coverage. I still believe it does. However, in the name of keeping the two tests in parallel -- in case our C++ yaml parser changes its behavior, it makes sense to provide the redundant coverage now).


bindings/pydrake/common/test/yaml_typed_test.py line 354 at r5 (raw file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

Do these test cases exist on the C++ side? I didn't immediately find them.

Our goal is to keep the Python and C++ test case inputs as close alignment as possible. It's okay to diverge in the expected outcome, but the input panel should be roughly the same, declared in the same order, with the same phrasing, data, comments, etc. As naive of a cross-language porting as possible.

As alluded to above re: IntStruct, I felt what got tested here got tested there, albeit in a more compact form. However, for the reasons indicated above, I've expanded the C++ test to be a more literal translation of the python.

@SeanCurtis-TRI
Copy link
Contributor Author

Binary runs face first into writing json. I'm going to look into it.

@jwnimmer-tri
Copy link
Collaborator

I think it would be okay to leave that as future work, and have the json writer throw (maybe with a change-detector expect-throws test to guard us). The AllScalarsStruct could have a way to opt-out of the binary data (e.g., a bool template argument, or a non-serialized bool member field, which skipped it during Serialize).

Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, wrong thread.

Anyway -- checkpoint on final reviews of all test code.

Reviewed 7 of 11 files at r6, all commit messages.
Reviewable status: 8 unresolved discussions, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, missing label for release notes (waiting on @SeanCurtis-TRI)


a discussion (no related file):

Previously, SeanCurtis-TRI (Sean Curtis) wrote…

Binary runs face first into writing json. I'm going to look into it.

I think it would be okay to leave that as future work, and have the json writer throw (maybe with a change-detector expect-throws test to guard us). The AllScalarsStruct could have a way to opt-out of the binary data (e.g., a bool template argument, or a non-serialized bool member field, which skipped it during Serialize).


bindings/pydrake/common/test/yaml_typed_test.py line 38 at r5 (raw file):

Previously, SeanCurtis-TRI (Sean Curtis) wrote…

Sadly, reviewable isn't good at providing relevant context for "comment about 10 lines up" as there are no other comments in this file.

However, the request for the IntStruct is clear.

(FTR, I hadn't bothered because of the difference between the yaml parsing logic in python vs C++. Python does all of the type conversion before we get a hold of the result, whereas C++ we do it ourselves. So, I felt the DoubleStruct provided coverage. I still believe it does. However, in the name of keeping the two tests in parallel -- in case our C++ yaml parser changes its behavior, it makes sense to provide the redundant coverage now).

Oops. I meant 10 lines up in the original file:

# To provide test coverage for all of the special cases of YAML loading, we'll
# define some dataclasses. These classes mimic
#  drake/common/yaml/test/example_structs.h
# and should be roughly kept in sync with the definitions in that file.

Basically the same idea as other threads -- that the C++ and Python tests should very strongly imitate each other. To the extent they differ at all, it should be for meaningful reasons that highlight an actual difference in behavior or capabilities between the two.

Perhaps this comment paragraph (and its twin on the C++ side) need to be expanded?


common/yaml/test/example_structs.h line 52 at r6 (raw file):

// A value used in the test data below to include a default (placeholder) value
// when initializing struct data members.
constexpr double kNominalInt = -1;

typo

Suggestion:

int

common/yaml/test/yaml_write_archive_test.cc line 74 at r6 (raw file):

TEST_F(YamlWriteArchiveTest, Bytes) {
  const auto test = [](const std::string& value, const std::string& expected) {
    const auto* data = reinterpret_cast<const std::byte*>(value.c_str());

BTW Using c_str() implies that we care about the trailing NIL, but in this case it is not relevant. Saying data() would better align with our intentions.

Suggestion:

value.data()

bindings/pydrake/common/test/yaml_typed_test.py line 319 at r6 (raw file):

            ("A3Rfc3RyAw=", "Incorrect padding"),
            ("A3Rfc*RyAw==", "Invalid base64-encoded string")]
        for value,  error_regex in cases:

nit Whitespace

Suggestion:

value, error_regex

bindings/pydrake/common/test/yaml_typed_test.py line 330 at r6 (raw file):

            ("!!str 1234", "Expected.*bytes.*str"),
            # Int.
            ("12", "Expected.*bytes.*int"),

nit Match C++ input exactly.

FYI it would also be okay by me to test both "12" and "!!int 12" as separate cases (in both languages), if you think that's better.

Suggestion:

"!!int 12"

bindings/pydrake/common/test/yaml_typed_test.py line 334 at r6 (raw file):

            # Pyyaml defect: 0o3 should be an int.
            ("0o3", "Expected.*bytes.*str"),
            # Pyyaml defect: 00:03 should be an int (value of 3).

BTW I am okay with leaving this comment alone, but as background the full answer for whether this is a "defect" depends on which version of the YAML specification pyyaml is claiming to implement. The specified regexes for inferring data types are different across yaml versions.


bindings/pydrake/common/test/yaml_typed_test.py line 337 at r6 (raw file):

            ("00:03", "Expected.*bytes.*str"),
            # Float.
            ("1234.5", "Expected.*bytes.*float"),

nit Match C++ input exactly.

Suggestion:

            ("1234.5", "Expected.*bytes.*float"),
            ("!!float 1234.5", "Expected.*bytes.*float"),

bindings/pydrake/common/test/yaml_typed_test.py line 383 at r6 (raw file):

        x = yaml_load_typed(schema=AllScalarsStruct, data=data, **options)
        self.assertEqual(x.some_bool, True)
        self.assertEqual(x.some_bytes, b'test string')

This test case does not match the C++ test case. Both should use the same yaml document as input.

Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: feature but still should fix some open discussions prior to assigning platform.

Reviewed 1 of 2 files at r2, 1 of 7 files at r3, 4 of 11 files at r6.
Reviewable status: 11 unresolved discussions, needs at least two assigned reviewers, missing label for release notes (waiting on @SeanCurtis-TRI)


a discussion (no related file):
Working

At least in Python (haven't checked C++ yet), the handling of Optional[bytes] and Union[..., bytes, ...] does not work correctly. I'm trying to play around a bit to see if I have advice on whether to try to fix it or else declare it to be out of scope. Stay tuned.


tools/workspace/yaml_cpp_internal/patches/upstream/b64_decode_failure_is_empty.patch line 6 at r6 (raw file):

  - The input has an invalid character.

However, it doesn't have the proper number of encoding characters (a multiple of

typo

Suggestion:

if the input

common/yaml/yaml_write_archive.cc line 124 at r6 (raw file):

  } else {
    emitted_tag = node_tag;
  }

Take note of the comment quoted here. It talks about "json schema" and explains how the data type will be correctly implied by the plain string because we are careful to emit text which the "json schema" regexes will notice when processing the plain string -- and therefore we elect not to emit the !!foo marker so that we don't clutter up the document.

Those conditions do not hold for !!binary -- it is not part of the json schema, and it does not have any regexs that would match against a plain scalar.

Therefore, the right way to handle binary is as a distinct else-if branch. See sample code below.

Suggestion:

  if ((node_tag == internal::Node::kTagNull) ||
      (node_tag == internal::Node::kTagBool) ||
      (node_tag == internal::Node::kTagInt) ||
      (node_tag == internal::Node::kTagFloat) ||
      (node_tag == internal::Node::kTagStr)) {
    // In most cases we don't need to emit the "JSON Schema" tags for YAML data,
    // because they are implied by default. However, YamlWriteArchive on variant
    // types sometimes marks the tag as important.
    if (node.IsTagImportant()) {
      // The `internal::Node::kTagFoo` all look like "tag:yaml.org,2002:foo".
      // We only want the "foo" part (after the second colon).
      emitted_tag = std::string("!!");
      emitted_tag.append(node_tag.substr(18));
    }
  } else if (node_tag == internal::Node::kTagBinary) {
    // Use the more compact "secondary tag" spelling.
    emitted_tag = "!!binary";
  } else {
    emitted_tag = node_tag;
  }

Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 11 unresolved discussions, needs at least two assigned reviewers, missing label for release notes (waiting on @SeanCurtis-TRI)


a discussion (no related file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

Working

At least in Python (haven't checked C++ yet), the handling of Optional[bytes] and Union[..., bytes, ...] does not work correctly. I'm trying to play around a bit to see if I have advice on whether to try to fix it or else declare it to be out of scope. Stay tuned.

Here is a start on the fix (only for python). See the new head commit:

https://github.com/jwnimmer-tri/drake/commits/yaml-binary-fixups/

Do we want to try to finish this approach?

Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 11 unresolved discussions, needs at least two assigned reviewers, missing label for release notes (waiting on @SeanCurtis-TRI)


a discussion (no related file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

Here is a start on the fix (only for python). See the new head commit:

https://github.com/jwnimmer-tri/drake/commits/yaml-binary-fixups/

Do we want to try to finish this approach?

(The essential idea is that in yaml_load_typed, our control flow must be solely dictated by the schema, not the yaml document. We can't validate the yaml value right at the start of the function and expect to be able to decide anything, we need to flow through the conditions that probe against the schema to decide which yaml values are allowed.)

@SeanCurtis-TRI
Copy link
Contributor Author

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

(The essential idea is that in yaml_load_typed, our control flow must be solely dictated by the schema, not the yaml document. We can't validate the yaml value right at the start of the function and expect to be able to decide anything, we need to flow through the conditions that probe against the schema to decide which yaml values are allowed.)

I took your commit and extended it for C++. I was explicitly looking at the optional to make sure it was happy when I ran out of time. Hopefully, I'll get a chance to push something new this morning, but my main priority is to get other PRs through platform as efficiently as possible.

Note: It just worked, unlike variant. Are the tests redundant?
Copy link
Contributor Author

@SeanCurtis-TRI SeanCurtis-TRI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I've got things updated. I have a third commit that adds coverage for "optional[bytes]". With all previous changes, they all simply seem to work. It's not clear to me if those additional tests (and the related infrastructure) is worthwhile.

There are various other "typed tests" in yaml_typed_test.py that I didn't explore. Should I? I glanced and they seemed to be testing a higher level of abstraction so i think I'm safe.

Anyhoooo....things seem to currently "work".

Reviewable status: 2 unresolved discussions, needs at least two assigned reviewers, commits need curation (https://drake.mit.edu/reviewable.html#curated-commits), missing label for release notes (waiting on @SeanCurtis-TRI)


common/yaml/yaml_write_archive.cc line 124 at r6 (raw file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

Take note of the comment quoted here. It talks about "json schema" and explains how the data type will be correctly implied by the plain string because we are careful to emit text which the "json schema" regexes will notice when processing the plain string -- and therefore we elect not to emit the !!foo marker so that we don't clutter up the document.

Those conditions do not hold for !!binary -- it is not part of the json schema, and it does not have any regexs that would match against a plain scalar.

Therefore, the right way to handle binary is as a distinct else-if branch. See sample code below.

This whole thing felt a bit murky to me.

However, as I follow that vector, it seem obvious that shoving the kBinary into JsonSchemaTag is likewise misguided. kStr is acceptable because the json schema inherits from the failsafe schema. But the !!binary tag isn't part of that.

So, it seems to me that it would be better to use the Node::SetTag() overload that simply takes the string instead of extending the enumeration in a misleading direction. (This did necessitate a change to VisitScalar().)

Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me if those additional tests (and the related infrastructure) is worthwhile.

I hear you on this, but I think for parsers (and yaml type checking in particular) having a costly battery of tests pays off in the long term. We used to not have so many, and it was often painful. In this case since we've identified that JSON primitives vs YAML primitives require different logic, I think the tests will be helpful. If we ever add a second kind of YAML-only primitive, that would be a point where we might lean on !!binary to catch certain kinds of bugs instead of writing a full complement of new tests for the new YAML primitive.

In any case, I'll take a final look at the test suite from scratch, to offer any final thoughts.

Reviewed 12 of 12 files at r7, all commit messages.
Reviewable status: 3 unresolved discussions, needs at least two assigned reviewers, commits need curation (https://drake.mit.edu/reviewable.html#curated-commits), missing label for release notes (waiting on @SeanCurtis-TRI)


common/yaml/yaml_write_archive.cc line 124 at r6 (raw file):

... kBinary into JsonSchemaTag is ... misguided.

Ah, good observation. I agree.

For !!binary, I am fine with calling SetTag with a std::string. However, FYI another approach would be to rename the enum to cover the wider domain (e.g., all yaml primitives), or to introduce a second enum for yaml-only primitives and overload SetTag on that new enum.

In any case, this thread is resolved.


a discussion (no related file):
Working

One more pass over all of the tests, from scratch.


common/yaml/yaml_write_archive.h line 239 at r7 (raw file):

    if constexpr (std::is_same_v<T, std::string>) {
      text = value;
      tag = internal::Node::kTagStr;

Whatever we do for !!binary tag, we shouldn't use SetTag(std::string) here for all of the json tags. It's important to keep using the compact (enum) representation in the common case, for performance.

Without writing it on paper to check, my best guess is that the tactic in the snippet below might be the smoothest? Up to you though, I'll let you play around and decide.

// Exactly one of these will be set by the if-else chain.
std::optional<JsonSchemaTag> json_tag;
std::string full_tag;
...
if (json_tag.has_value()) {
  scalar.SetTag(*json_tag);
} else {
  scalar.SetTag(std::move(full_tag));
}

common/yaml/yaml_read_archive.h line 356 at r7 (raw file):

    } else if constexpr (std::is_same_v<T, std::string>) {
      return tag == internal::Node::kTagStr;
    } else if constexpr (std::is_same_v<T, std::vector<std::byte>>) {

BTW The comment atop this is-else chain says "JSON schema tags". Per the other thread, possibly we want to amend that slightly for precision, since !!binary isn't part of the JSON schema.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants