'0008' and '0009' strings are not single-quoted when dumping. #740

Closed
jeshow opened this issue Aug 9, 2023 · 3 comments

@jeshow

jeshow commented Aug 9, 2023

I'm using pyyaml version 6.0.1 and I've discovered that digit strings with at least one leading '0' that end in '8' or '9' are treated differently from those ending in any other digit. This is a problem because other parsers (namely yaml-cpp) then fail to interpret the data as a string.

After diving into the resolver.py code, it looks like this is because of this regex, which matches '0007' but does not match '0008' or '0009'.

Resolver.add_implicit_resolver(
        'tag:yaml.org,2002:int',
        re.compile(r'''^(?:[-+]?0b[0-1_]+
                    |[-+]?0[0-7_]+
                    |[-+]?(?:0|[1-9][0-9_]*)
                    |[-+]?0x[0-9a-fA-F_]+
                    |[-+]?[1-9][0-9_]*(?::[0-5]?[0-9])+)$''', re.X),
        list('-+0123456789'))
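
The claim about the regex can be checked in isolation with a small standalone sketch. The pattern is copied from the excerpt above; the names `YAML11_INT` and `looks_like_int` are illustrative only and are not part of PyYAML.

```python
import re

# Pattern copied from the resolver.py excerpt above (YAML 1.1 integer forms).
YAML11_INT = re.compile(r'''^(?:[-+]?0b[0-1_]+
            |[-+]?0[0-7_]+
            |[-+]?(?:0|[1-9][0-9_]*)
            |[-+]?0x[0-9a-fA-F_]+
            |[-+]?[1-9][0-9_]*(?::[0-5]?[0-9])+)$''', re.X)


def looks_like_int(text):
    """Hypothetical helper: True if text matches the YAML 1.1 int pattern."""
    return YAML11_INT.match(text) is not None


for key in ('0007', '0008', '0009', '0010'):
    print(key, looks_like_int(key))
```

Strings that match would be read back as integers if left unquoted, so the dumper quotes them; '0008' and '0009' match none of the alternatives, so they are emitted plain.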

Here is a minimal example:

import yaml
fields = {'0007': {'key': 'val'}, '0008': {'key': 'val'}, '0009': {'key': 'val'}, '0010': {'key': 'val'}}
with open('tmp.yaml', 'w') as stream:
  yaml.safe_dump(fields, stream)

The resulting file looks like:

'0007':
  key: val
0008:
  key: val
0009:
  key: val
'0010':
  key: val
@jeshow
Author

jeshow commented Aug 9, 2023

There is a similar issue when parsing this data from text:

import yaml
text = '''
0007: {key: val}
0008: {key: val}
0009: {key: val}
0010: {key: val}
'''
data = yaml.safe_load(text)
print(data.keys())

This produces dict_keys([7, '0008', '0009', 8]).

@nitzmahone
Member

You're running afoul of the YAML 1.1 base-8 integer representation. This is all "legit" through that lens, since PyYAML currently only supports YAML 1.1. It sucks, which is why octals got revamped in YAML 1.2 (and there are numerous ways to disable/bypass that behavior), but it's not a bug. Since it sounds like you're interoperating with another YAML implementation that's reading those as 0-padded base-10 ints (1.2 behavior), you'd probably do better to quote or !!str-tag them in your documents, and ensure that the PyYAML emitting side is emitting strings, not ints (which will be subject to the 1.1-octal-aware quoting behavior).

Closing as "not a bug, just an unfortunate reality until PyYAML grows proper 1.2 support".

@nitzmahone closed this as not planned Aug 9, 2023
@jeshow
Author

jeshow commented Aug 10, 2023

Thank you @nitzmahone, for the answer and suggestion!
For anyone finding this in the future, I realized the real problem was much simpler.

My process involved emitting YAML from pyyaml, ingesting that with yaml-cpp, emitting a new file from yaml-cpp, and then ingesting that with pyyaml. In short, pyyaml -> yaml-cpp -> pyyaml.

The real problem occurred in the yaml-cpp -> pyyaml step, because my version of yaml-cpp treated all 000x elements as 0-prefixed strings. It would then write them as such with the default format -- without quotes.

When pyyaml went to process the fields, however, it would read them all as octal integers, converting elements like 0007 to 7 and 0010 to 8, but leaving 0008 and 0009 as strings (because they won't be interpreted as base-8).
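
The conversions described above can be reproduced with plain Python. This is only a sketch of the base-8 arithmetic, not PyYAML's actual code path, and `parse_yaml11_octal` is a made-up name for illustration.

```python
def parse_yaml11_octal(text):
    """Illustrative helper: read a 0-prefixed digit string as base 8.

    Returns the string unchanged when it is not valid octal,
    which is the case for anything containing an 8 or a 9.
    """
    try:
        return int(text, 8)
    except ValueError:
        return text


keys = [parse_yaml11_octal(k) for k in ('0007', '0008', '0009', '0010')]
print(keys)  # [7, '0008', '0009', 8]
```

Note that 0010 becomes 8 rather than 10, so the octal reading silently changes the value as well as the type.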

With @nitzmahone's suggestion above, I modified my yaml-cpp encoder to explicitly set the default string tag for the spurious elements:

static Node encode(const ExampleStructure& rhs)
{
  Node node;
  node["data"] = rhs.data;
  node["name"] = rhs.name;
  node["name"].SetTag("tag:yaml.org,2002:str");
  return node;
}

That tagged the appropriate data elements in the YAML emitted by yaml-cpp, and then pyyaml could properly interpret it again.
