PoC: Metadata implementation #574

Closed

Conversation

dstufft commented Jul 16, 2022

This is still a work in progress and hasn't been tested at all yet, even to see whether it runs.

However, I think it correctly implements parsing of the METADATA and metadata.json files, with the most lenient parser I could come up with that doesn't silently let bad data go unnoticed.

Some general notes about decisions made here (so far, anyway):

  • The RawMetadata largely matches the format of metadata.json, but has some deviations to make using it better.
    • Several of the key names in metadata.json have a really awkward lack of pluralization (e.g. metadata.json has classifier: list[str], but RawMetadata has classifiers: list[str]).
    • All values in RawMetadata are optional regardless of what the core metadata spec says.
    • The Project-URL metadata is represented as a dict[str, str], not a list[str].
  • I believe it is unsafe to include unparsed data intermixed with parsed data (it makes it far too difficult to differentiate between them), so the parse_FORMAT functions return a tuple: a RawMetadata that represents everything that could be parsed, and a dict[Any, Any] that represents what could not be (see the sketch after this list).
  • The parse_FORMAT functions take either bytes or str, allowing callers to give us bytes and have us do the right thing, or, if they know their document is broken in some way with a wrong encoding, to decode it themselves before passing it in.
  • Round-tripping to a byte-for-byte result is not a supported use case, but round-tripping to a semantically equivalent result is.
  • Under no circumstances do we let malformed data (with what little correctness RawMetadata even enforces) pass through silently.
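
A rough sketch of the shapes involved (my illustration, not the PR's exact code; parse_email stands in for the METADATA variant of parse_FORMAT, and the fields shown are abbreviated):

from typing import Any, TypedDict, Union

class RawMetadata(TypedDict, total=False):
    # Every field is optional, regardless of what the core metadata spec says.
    metadata_version: str
    name: str
    version: str
    classifiers: list[str]        # pluralized, unlike metadata.json's "classifier"
    project_urls: dict[str, str]  # a map, unlike the repeated Project-URL strings

def parse_email(data: Union[bytes, str]) -> tuple[RawMetadata, dict[Any, Any]]:
    """Everything understood goes into RawMetadata; everything else is
    returned in the leftover dict rather than silently dropped."""
    ...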

I tried to comment through everything, but there are a lot of subtle situations that I believe this will handle about as well as possible:

  • Extraneous keys are not implicitly accepted, rather they are pushed into a second data structure to mark them as unparsed.
    • This holds true even for Project-URL, which is conceptually a map but, because RFC 822 does not support maps, is serialized as a list of strings.
  • Types are explicitly checked to ensure that our typing matches our runtime; since this data is external, we can't assume that its shape matches anything in particular.
  • When parsing METADATA, repeated use of a key that does not allow multiple uses makes that key unparseable and pushes it into the second structure.
    • This might be fixing a possible security bug that's a variant of the confused deputy: there's nothing right now preventing a METADATA file from having Name emitted twice with different contents, and if we just blindly pick one of them as "the" value, different systems may pick the other one, leaving two systems that parse the same file with different results.
  • Implements an RFC 822-aware, line-by-line decoding of METADATA, so that a file that is mostly utf8 but has one mojibaked field can still have the bulk of its contents parsed correctly 1 (a sketch follows this list).
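
A minimal sketch of what that per-line decoding could look like (my approximation; the PR's actual handling differs in its details):

def _decode_lines(data: bytes) -> str:
    # Decode each line independently so a single mojibaked field
    # doesn't poison the decoding of the entire file.
    decoded = []
    for line in data.split(b"\n"):
        try:
            decoded.append(line.decode("utf-8"))
        except UnicodeDecodeError:
            # Preserve the bad bytes losslessly; only the field that owns
            # this line ends up unparseable, not the whole document.
            decoded.append(line.decode("utf-8", "surrogateescape"))
    return "\n".join(decoded)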

Anyway, tomorrow I'm going to actually test this and try to get serialization back into METADATA and metadata.json done, which should be a lot easier and less finicky.

Footnotes

  1. The stdlib email parser plus RFC 822 together make this horrible to do, because all of the parsing methods that accept bytes just do a hardcoded decode("ascii", "surrogateescape"). That means anyone parsing METADATA with one of the bytes interfaces of the email library is incorrectly parsing valid METADATA files that contain any utf8 characters that aren't also ascii characters.
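
A quick demonstration of that stdlib behavior (my example, not from the PR):

import email

# "café" encoded as utf8: a perfectly valid METADATA value.
msg = email.message_from_bytes(b"Summary: caf\xc3\xa9\n\n")

# The bytes were decoded with ("ascii", "surrogateescape"), so the header
# now contains lone surrogates instead of the intended "é":
print(repr(msg["Summary"]))  # 'caf\udcc3\udca9'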

@dstufft mentioned this pull request Jul 16, 2022

dstufft commented Jul 16, 2022

This can now emit email and json metadata (still not extensively tested).

More decisions made:

  • The code to emit email always does the distutils-style RFC 822 escaping, which should be a no-op if the field contains no newlines 1 (see the sketch after this list).
  • The code to emit email always emits Description as the email body.
  • Emitting assumes that you've passed in a correct RawMetadata, but as an extra layer of protection it will not emit keys that are unknown to it.
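
For reference, the distutils-style escaping mentioned above is essentially the following (paraphrased from distutils.util.rfc822_escape):

def rfc822_escape(header: str) -> str:
    # RFC 822 continuation lines must begin with whitespace, so every
    # line after the first gets indented by eight spaces.
    lines = header.split("\n")
    sep = "\n" + 8 * " "
    return sep.join(lines)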

Footnotes

  1. This means that in the presence of newlines, we will emit mangled, but otherwise safe, data. Validation is left to other layers.

dstufft commented Jul 17, 2022

I'm currently downloading a corpus of data from PyPI to test this PR with. There's a lot to download so it won't be done anytime soon, but testing with what I have so far results in 282469 METADATA or PKG-INFO files parsed with no left-over keys 1 2 and 209 parsed with left-over keys.

I'm digging into why exactly those had left-over keys; so far, the most common reason is just bad data that can't be correctly parsed due to the newline problem I mentioned.

An example of a problem PKG-INFO is:

Metadata-Version: 1.1
Name: aisg-cli
Version: 0.1.0
Summary: AISG CLI Tool
Home-page: https://github.com/kensoh/aisg-cli
Author: AI Singapore
Author-email: engineering@aisingapore.org
License: UNKNOWN
Description-Content-Type: UNKNOWN
Description: # AISG CLI Tool

        Command line interface to simplify machine learning workflows - data acquisition, modeling, deployment

        |
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: System Administrators
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6

That bar character was added by me; the other lines in that description are just \n, but the line I added the bar to was padded with whitespace out to the bar. This malformed PKG-INFO ends up being parsed by email.parser with a Description header that is set to # AISG CLI Tool, and then a body payload that is set to:

        Command line interface to simplify machine learning workflows - data acquisition, modeling, deployment

        |
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: System Administrators
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6

I think this shows the strength of being very careful about how we deserialize data, and of the leftover data structure, because all of the other libraries I've tested silently ignore the fact that this metadata is malformed and throw away the # AISG CLI Tool data.

However, this PR returns a RawMetadata:

{'author': 'AI Singapore',
 'author_email': 'engineering@aisingapore.org',
 'description_content_type': 'UNKNOWN',
 'home_page': 'https://github.com/kensoh/aisg-cli',
 'license': 'UNKNOWN',
 'metadata_version': '1.1',
 'name': 'aisg-cli',
 'summary': 'AISG CLI Tool',
 'version': '0.1.0'}

and the "leftover" data structure looks like:

{'Description': ['# AISG CLI Tool',
                 '        Command line interface to simplify machine learning '
                 'workflows - data acquisition, modeling, deployment\r'
                 '\r\n'
                 '        \r\n'
                 'Platform: UNKNOWN\r\n'
                 'Classifier: Development Status :: 3 - Alpha\r\n'
                 'Classifier: Intended Audience :: Developers\r\n'
                 'Classifier: Intended Audience :: System Administrators\r\n'
                 'Classifier: Intended Audience :: Science/Research\r\n'
                 'Classifier: License :: OSI Approved :: Apache Software '
                 'License\r\n'
                 'Classifier: Programming Language :: Python :: 2\r\n'
                 'Classifier: Programming Language :: Python :: 2.7\r\n'
                 'Classifier: Programming Language :: Python :: 3\r\n'
                 'Classifier: Programming Language :: Python :: 3.4\r\n'
                 'Classifier: Programming Language :: Python :: 3.5\r\n'
                 'Classifier: Programming Language :: Python :: 3.6\r\n']}

which shows that there was an error parsing the Description key, and in this case, because it saw two values for that key, it included a list containing both values.

Footnotes

  1. I'm excluding the License-File data being left-over from these results, since that is the library behaving correctly.

  2. I haven't attempted to compare what can be parsed between libraries to see how this is faring yet; I'm just looking at what data it wasn't able to parse, to sort out any blatant errors first.

dstufft commented Jul 17, 2022

Here's another one; this one is subtle:

Metadata-Version: 2.1
Name: adblock
Version: 0.4.3
Classifiers: Programming Language :: Python
Classifiers: Programming Language :: Rust
Classifiers: License :: OSI Approved :: MIT License
Classifiers: License :: OSI Approved :: Apache Software License
Home-Page: https://github.com/ArniDagur/python-adblock
Author: Árni Dagur <arni@dagur.eu>
Author-Email: Árni Dagur <arni@dagur.eu>
License: MIT OR Apache-2.0
Requires-Python: >=3.6
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM

python-adblock
==========
Python wrapper for Brave's adblocking library, which is written in Rust.

### Building

\`\`\`
maturin build --release
\`\`\`

#### Build dependencies

| Build Dependency | Versions | Arch Linux | Url |
|------------------|----------|------------|-----|
| Python           | `>=3.6`  | `python3`  | -   |
| Rust             | `>=1.45` | `rust`     | -   |
| Maturin          | `*`      | `maturin`  | https://github.com/PyO3/maturin |

### Developing

I use Poetry for development. To create and enter a virtual environment, do
\`\`\`
poetry install
poetry shell
\`\`\`
then, to install the `adblock` module into the virtual environment, do
\`\`\`
maturin develop
\`\`\`

### Documentation

Rust documentation for the latest `master` branch can be found at https://arnidagur.github.io/python-adblock/docs/adblock/index.html.

### License

This project is licensed under either of

 * Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or
   http://www.apache.org/licenses/LICENSE-2.0)
 * MIT license ([LICENSE-MIT](LICENSE-MIT) or
   http://opensource.org/licenses/MIT)

at your option.

That produces a RawMetadata like:

{'author': 'Árni Dagur <arni@dagur.eu>',
 'author_email': 'Árni Dagur <arni@dagur.eu>',
 'description': 'python-adblock\r\n'
                '==========\r\n'
                "Python wrapper for Brave's adblocking library, which is "
                'written in Rust.\r\n'
                '\r\n'
                '### Building\r\n'
                '\r\n'
                '```\r\n'
                'maturin build --release\r\n'
                '```\r\n'
                '\r\n'
                '#### Build dependencies\r\n'
                '\r\n'
                '| Build Dependency | Versions | Arch Linux | Url |\r\n'
                '|------------------|----------|------------|-----|\r\n'
                '| Python           | `>=3.6`  | `python3`  | -   |\r\n'
                '| Rust             | `>=1.45` | `rust`     | -   |\r\n'
                '| Maturin          | `*`      | `maturin`  | '
                'https://github.com/PyO3/maturin |\r\n'
                '\r\n'
                '### Developing\r\n'
                '\r\n'
                'I use Poetry for development. To create and enter a virtual '
                'environment, do\r\n'
                '```\r\n'
                'poetry install\r\n'
                'poetry shell\r\n'
                '```\r\n'
                'then, to install the `adblock` module into the virtual '
                'environment, do\r\n'
                '```\r\n'
                'maturin develop\r\n'
                '```\r\n'
                '\r\n'
                '### Documentation\r\n'
                '\r\n'
                'Rust documentation for the latest `master` branch can be '
                'found at '
                'https://arnidagur.github.io/python-adblock/docs/adblock/index.html.\r\n'
                '\r\n'
                '### License\r\n'
                '\r\n'
                'This project is licensed under either of\r\n'
                '\r\n'
                ' * Apache License, Version 2.0, '
                '([LICENSE-APACHE](LICENSE-APACHE) or\r\n'
                '   http://www.apache.org/licenses/LICENSE-2.0)\r\n'
                ' * MIT license ([LICENSE-MIT](LICENSE-MIT) or\r\n'
                '   http://opensource.org/licenses/MIT)\r\n'
                '\r\n'
                'at your option.\r\n'
                '\n',
 'description_content_type': 'text/markdown; charset=UTF-8; variant=GFM',
 'home_page': 'https://github.com/ArniDagur/python-adblock',
 'license': 'MIT OR Apache-2.0',
 'metadata_version': '2.1',
 'name': 'adblock',
 'requires_python': '>=3.6',
 'version': '0.4.3'}

with a left overs of:

{'classifiers': ['Programming Language :: Python',
                 'Programming Language :: Rust',
                 'License :: OSI Approved :: MIT License',
                 'License :: OSI Approved :: Apache Software License']}

It looks like at some point maturin was emitting Classifiers instead of Classifier, which this immediately caught 1.

Footnotes

  1. See maturin commit where it was fixed: https://github.com/PyO3/maturin/commit/0cb3d79d5b3aa75a4cfc3a4ef8b353dfa7161279

dstufft commented Jul 17, 2022

More weird bad data that normally passes silently:

Metadata-Version: 2.1
Name: asciinema
Version: 2.2.0
Summary: Terminal session recorder
Home-page: https://asciinema.org
Download-URL: 
https: //github.com/asciinema/asciinema/archive/v2.2.0.tar.gz
Author: Marcin Kulik
Author-email: m@ku1ik.com
License: GNU GPLv3
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: System Administrators
Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Classifier: Natural Language :: English
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Topic :: System :: Shells
Classifier: Topic :: Terminals
Classifier: Topic :: Utilities
Description-Content-Type: text/markdown; charset=UTF-8
License-File: LICENSE

...

Body removed for brevity. This parses to:

{'author': 'Marcin Kulik',
 'author_email': 'm@ku1ik.com',
 'classifiers': ['Development Status :: 5 - Production/Stable',
                 'Environment :: Console',
                 'Intended Audience :: Developers',
                 'Intended Audience :: System Administrators',
                 'License :: OSI Approved :: GNU General Public License v3 or '
                 'later (GPLv3+)',
                 'Natural Language :: English',
                 'Programming Language :: Python',
                 'Programming Language :: Python :: 3.6',
                 'Programming Language :: Python :: 3.7',
                 'Programming Language :: Python :: 3.8',
                 'Programming Language :: Python :: 3.9',
                 'Programming Language :: Python :: 3.10',
                 'Topic :: System :: Shells',
                 'Topic :: Terminals',
                 'Topic :: Utilities'],
 'description_content_type': 'text/markdown; charset=UTF-8',
 'download_url': '',
 'home_page': 'https://asciinema.org',
 'license': 'GNU GPLv3',
 'metadata_version': '2.1',
 'name': 'asciinema',
 'platforms': ['UNKNOWN'],
 'summary': 'Terminal session recorder',
 'version': '2.2.0'}

with leftovers

{'https': ['//github.com/asciinema/asciinema/archive/v2.2.0.tar.gz'],
 'license-file': ['LICENSE']}

Looks like that file was emitted with a stray newline after Download-URL:, causing the URL to end up on the next line and get parsed as a header.

This is what pkg_metadata gets:

{'author': 'Marcin Kulik',
 'author_email': 'm@ku1ik.com',
 'classifier': ['Development Status :: 5 - Production/Stable',
                'Environment :: Console',
                'Intended Audience :: Developers',
                'Intended Audience :: System Administrators',
                'License :: OSI Approved :: GNU General Public License v3 or '
                'later (GPLv3+)',
                'Natural Language :: English',
                'Programming Language :: Python',
                'Programming Language :: Python :: 3.6',
                'Programming Language :: Python :: 3.7',
                'Programming Language :: Python :: 3.8',
                'Programming Language :: Python :: 3.9',
                'Programming Language :: Python :: 3.10',
                'Topic :: System :: Shells',
                'Topic :: Terminals',
                'Topic :: Utilities'],
 'description_content_type': 'text/markdown; charset=UTF-8',
 'download_url': '',
 'home_page': 'https://asciinema.org',
 'license': 'GNU GPLv3',
 'metadata_version': '2.1',
 'name': 'asciinema',
 'platform': ['UNKNOWN'],
 'summary': 'Terminal session recorder',
 'version': '2.2.0'}

dstufft commented Jul 17, 2022

So far, all of the metadata files with leftover data that I've investigated are due to the METADATA file itself being broken in some way. The bulk of them are due to a stray \n causing the rest of the file to get parsed as the body, like:

Metadata-Version: 1.1
Name: applicationinsights
Version: 0.11.10
Summary: This project extends the Application Insights API surface to support Python.
Home-page: https://github.com/Microsoft/ApplicationInsights-Python
Author: Microsoft
Author-email: appinsightssdk@microsoft.com
License: MIT
Download-URL: https://github.com/Microsoft/ApplicationInsights-Python
Description: This SDK is no longer maintained or supported by Microsoft. Check out the `Python OpenCensus SDK <https://docs.microsoft.com/azure/azure-monitor/app/opencensus-python>`_ for Azure Monitor's latest Python investments. Azure Monitor only provides support when using the `supported SDKs <https://docs.microsoft.com/en-us/azure/azure-monitor/app/platforms#unsupported-community-sdks>`_. We’re constantly assessing opportunities to expand our support for other languages, so follow our `GitHub Announcements <https://github.com/microsoft/ApplicationInsights-Announcements/issues>`_ page to receive the latest SDK news. 

        |
Keywords: analytics applicationinsights telemetry appinsights development
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Operating System :: OS Independent
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6

The | was added by me again to show whitespace.

dstufft commented Jul 17, 2022

My latest push starts rewriting the Metadata class to work in a much different way (though ultimately it will still be able to have mostly the same API 1).

The new design has the following properties:

  • Users manually creating Metadata objects cannot create a Metadata with invalid metadata.
  • Adds Metadata.from_FORMAT methods to go from RawMetadata or METADATA / metadata.json, which will validate the metadata by default.
  • The from_FORMAT methods can optionally disable validation, which will let invalid data possibly be stored internally in Metadata.
  • Even when validation is disabled, any access of a field ensures that that specific field is validated, and setting a field to a new value ensures that the new value is validated.
  • Will add Metadata.to_FORMAT methods to help go from a Metadata to a serialized form.
    • These will ensure that only fully valid metadata files are emitted.

The way this PR's Metadata class works (or will work when fully implemented) is:

If you create a Metadata class using the normal constructor, Metadata(...), then the type signature of the class will guide people towards making correct metadata (name/version with no default values, etc.), and once the data has all been copied over to its respective attributes, a "global validation" will run that ensures policy-level requirements (metadata-version is appropriate for the defined fields, etc.; basically anything that requires looking at multiple fields to validate).

Thus, when you create a Metadata object from its constructor, you're forced to pass only valid values, yielding a fully valid set of metadata.

In addition to that, it supports a number of alternate constructors: Metadata.from_raw(), Metadata.from_email(), and Metadata.from_json(). The email and json varieties of those constructors just call their respective parse_FORMAT method and hard-fail if there is any leftover unparsed data; otherwise they take the raw metadata and pass it into Metadata.from_raw().

The from_raw constructor has some light magic to avoid invoking __init__: we want __init__ to eagerly validate metadata and ensure it's all valid, but we also want to enable passing in data that may be invalid (we assume it's a valid RawMetadata, however) and lazily validating it as needed, as sketched below.
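
That "light magic" could be as simple as bypassing __init__ via __new__ (my guess at the mechanism; the attribute names here are illustrative):

@classmethod
def from_raw(cls, raw, *, validate=True):
    self = cls.__new__(cls)   # skip __init__ and its eager validation
    self._raw = raw           # assumed to be a well-shaped RawMetadata
    self._validated = {}      # cache of per-field validated values
    if validate:
        self._validate_all()  # hypothetical full-document validation
    return self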

We add a lazy_validator class property that will implement our per-field validation (using helper validators to make them easy to compose and test). This lazy_validator will pull data out of raw, parse and validate it, then store it in the validated dictionary to serve as a cache/store of validated data. Likewise, setting a value will do the same thing, and deletion will clear the field from both dictionaries.

Thus, we ensure that on any access of or write to a property, that property's data is always valid from the POV of the user, but since we're doing it lazily, partial validation is possible if needed.

The last part isn't written yet, but the plan is to also add a set of to_FORMAT methods to serialize a Metadata; as part of that, serialization will run the "global" validation again to ensure that we don't emit any invalid metadata (while each field is always individually consistent). A sketch of the lazy-validation shape follows.
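
A sketch of the per-field pattern being described, using a descriptor (illustrative only; the real validators and storage names may differ):

class _LazyValidator:
    def __init__(self, name, validator):
        self.name = name
        self.validator = validator

    def __get__(self, obj, objtype=None):
        # Pull from raw, validate, and cache on first access.
        if self.name not in obj._validated:
            value = obj._raw.get(self.name)
            self.validator(value)  # raises on invalid data
            obj._validated[self.name] = value
        return obj._validated[self.name]

    def __set__(self, obj, value):
        # Writes are validated eagerly and update both stores.
        self.validator(value)
        obj._raw[self.name] = value
        obj._validated[self.name] = value

    def __delete__(self, obj):
        # Deletion clears the field from both dictionaries.
        obj._raw.pop(self.name, None)
        obj._validated.pop(self.name, None)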

Footnotes

  1. Except there's a problem with the existing API: it can't actually represent everything that is an otherwise well-formed set of metadata, which I plan to address.

dstufft commented Jul 17, 2022

Overall, this design gives a lot of power while ensuring a lot of safety, and makes several pieces easier to implement:

  • The "Raw" layer will very leniently parse and represent metadata, but does so in a way that anything that isn't fully correct metadata is immediately obvious.
  • The Metadata layer will never let someone read or write invalid metadata on a per field basis, and in almost all cases on a per document basis.
    • You can technically get invalid Metadata out, but the only kinds of rules it can break are rules that require inspecting multiple fields at once, and even then, that's only if you manually read each field and serialize them together without running the full validation.
  • However, the Metadata layer is lazy, so if you only care about a single field, you can access just that field in a validated way, without having to validate the rest of the data.
  • Metadata.from_FORMAT validates the full document by default, requiring people to opt in to the lazy validation.

This lets us serve a lot of different use cases:

  1. If your goal is to read as much data as possible, and you don't really care whether it's well-formed or not, you can use the raw layer and ignore the fact that there is leftover data (or interpret it yourself).

  2. If your goal is to read specific pieces of metadata and you don't care whether the rest of the metadata is valid, you have two choices:

    • If you want to ignore malformed files that have leftover data, you'll have to use the parse_FORMAT functions, ignore the leftovers, and pass the RawMetadata into from_raw().
    • If you want to ignore invalid fields, but you want to only read from documents with valid formatting, you can use any of the from_FORMAT methods.

    In either case, you'll need to pass validate=False to the from_FORMAT method you're using to disable the eager validation (see the sketch after this list).

  3. If your goal is to read the metadata and you want to only work with valid metadata, any of the from_FORMAT methods will work for you with validate=True (the default).

  4. If your goal is to write invalid metadata, you must use the raw layer; the Metadata layer will never let you write out invalid metadata.
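
In code, those flows might look roughly like this (using the names sketched earlier; parse_email stands in for parse_FORMAT):

raw, leftover = parse_email(data)

# Use cases 1 and 2a: tolerate malformed documents, skip eager validation.
meta = Metadata.from_raw(raw, validate=False)

# Use case 3: only accept fully valid, fully well-formed metadata.
meta = Metadata.from_email(data)  # validate=True is the default

# Lazy per-field validation: only "name" is validated by this access.
print(meta.name)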

On the design side, we've got very clear separation of concerns:

  • The raw layer only cares about turning bytes or str into very lightly parsed documents, and it focuses entirely on doing that safely. Beyond trying to translate to/from the on-disk formats into the intermediate format, it doesn't care about anything else.
  • The Metadata layer only cares about valid metadata and reading/writing from the raw intermediate format. It doesn't know anything at all about the on-disk formats, nor does it need to.
  • The validations layer is set up to be easily composable, so each validation can be written with minimal knowledge or special casing.

@pradyunsg mentioned this pull request Oct 7, 2022
@brettcannon self-requested a review October 22, 2022

brettcannon left a comment

I'm up for going with this general approach. Should we try to get the raw parts in first to keep the PRs small?



@enum.unique
class DynamicField(enum.Enum):

I'm wondering if it is worth sticking with an enum or just with lowercase string literals for the metadata field names? Same goes for known/supported metadata versions.

#
# However, we want to support validation to happen both up front
# and on the fly as you access attributes, and when using the
# on the fly validation, we don't want to validate anything else

Suggested change
# on the fly validation, we don't want to validate anything else
# on-the-fly validation, we don't want to validate anything else

# purpose of RawMetadata.
_raw: RawMetadata

# Likewise, we need a place to store our honest to goodness actually

Suggested change
# Likewise, we need a place to store our honest to goodness actually
# Likewise, we need a place to store our honest-to-goodness, actually

Comment on lines +100 to +101
# validated metadata too, we could just store this in a dict, but
# this will give us better typing.

Suggested change
# validated metadata too, we could just store this in a dict, but
# this will give us better typing.
# validated metadata, too. We could just store this in a dict, but
# this will give us better typing.

v2_3 = "2.3"


class _ValidatedMetadata(TypedDict, total=False):

So would this class have a key for each piece of metadata that we are willing to perform conversions/validation on from raw metadata?

Comment on lines +108 to +113
def full_validate(self, value: V | None) -> None:
if value is not None:
self.validate(value)

@abc.abstractmethod
def validate(self, value: V) -> None:

Why the two functions? Is it just to avoid having to deal with the None case for typing purposes?

dynamic: List[str]

# Metadata 2.3 - PEP 685
# No new fields were added in PEP 685, just some edge case were

Suggested change
# No new fields were added in PEP 685, just some edge case were
# No new fields were added in PEP 685, just some edge cases were



_EMAIL_FIELD_ORDER = [
# Always put the metadata version first, incase it ever changes how

Suggested change
# Always put the metadata version first, incase it ever changes how
# Always put the metadata version first, in case it ever changes how

# class, some light touch ups can make a massive different in usability.


_EMAIL_FIELD_MAPPING = {

This scares me that there's a typo somewhere, but we would probably find out pretty quickly, so my brain wanting to do this as a dict comprehension just needs to calm down. 😅

# This might appear to be a mapping of the same key to itself, and in many cases
# it is. However, the algorithm in PEP 566 doesn't match 100% the keys chosen
# for RawMetadata, so we use this mapping just like with email to handle that.
_JSON_FIELD_MAPPING = {

brettcannon commented Nov 19, 2022

Don't need a dict comprehension.

brettcannon commented:

@dstufft what would you like to do to move this forward? Implement the raw stuff with tests first? Something else?

brettcannon commented:

To help @dstufft move this forward, I have started my own branch that takes Donald's email header parsing code and begins to add tests and docs (this is currently a WIP w/ appropriate attribution to Donald via Co-authored-by): https://github.com/brettcannon/packaging/tree/raw-metadata . My hope is to get RawMetadata parsing working and then transparent Metadata validation/transformation/reading working as I have a direct need for that now (https://github.com/brettcannon/mousebender/ and getting a pure wheel resolver). We can add email header emission and such later on in separate PRs.

@brettcannon mentioned this pull request Jan 24, 2023

dstufft commented Jun 30, 2023

I think this PoC has outlived its usefulness now with the work @brettcannon has been doing, so I'm going to close it.

@dstufft closed this Jun 30, 2023
@dstufft deleted the metadata-parsing branch September 29, 2023