Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bindings(python): better message type #586

Open
AzideCupric opened this issue Aug 5, 2024 · 12 comments
Open

bindings(python): better message type #586

AzideCupric opened this issue Aug 5, 2024 · 12 comments

Comments

@AzideCupric
Copy link

Congratulations on the release of the first version of the Python bindings! 🎉

As I mentioned in issue #559 and in the bindings/python documentation:

Access to data of message types (e.g., Greeting) is currently only available through dictionary representations.

To address this, I attempted to implement a library that facilitates the conversion between dict and Python classes using msgspec. You can check it out here.

I chose msgspec over pydantic because msgspec is lightweight and sufficient for the task. However, it can be easily replaced with pydantic if necessary.

Initially, I adjusted the original dict structure:

{
  "Ok": {
      "tag": "a001",
      "code": null,
      "text": "Message 17 is the first unseen message"
  }
}

Currently, enum variant names are used as keys. These can be used as a tag field in tagged unions, so I placed them as the value of codec_model:

{
  "codec_model": "Ok",
  "tag": "a001",
  "code": null,
  "text": "Message 17 is the first unseen message"
}

Specifically, if the value isn't an object or if it represents a Rust enum variant object, the value is placed in a codec_data field (as implemented in utils.py):

{
  "Unseen": 17
}
// transforms to
{
  "codec_model": "Unseen",
  "codec_data": 17
}	

Next, I created some msgspec structs in the models directory to build upon this structure:

Unlike Rust enums, Python enums do not support Algebraic Data Types (ADT), so I used Union to simulate them:

class TaggedBase(Struct, tag_field="codec_model"):
    pass

# Example where value isn't an object
class Unseen(TaggedBase):
    codec_data: int

# Example where value is an object
class AppendUid(TaggedBase):
    uid_validity: NoZeroUint
    uid: NoZeroUint

# Example where the value is a Rust enum variant object
class Untagged(TaggedBase):
    kind: StatusKind
    code: Code | None
    text: str

class Tagged(TaggedBase):
    tag: str
    body: StatusBody

class Bye(TaggedBase):
    code: Code | None
    text: str

class Status(TaggedBase):
    codec_data: Untagged | Tagged | Bye

And so on...

In my repository, I referred to imap-types to define all the structures that will be used in the Python bindings. I also defined some functions in validate.py for use:

_, command = type_codec_decode(CommandCodec, b"ABCD UID FETCH 1,2:* (BODY.PEEK[1.2.3.4.MIME]<42.1337>)\r\n")
>>> Command(
    tag="ABCD",
    body=Fetch(
        sequence_set=[Single(codec_data=Value(codec_data=1)), Range(codec_data=(Value(codec_data=2), "Asterisk"))],
        macro_or_item_names=MessageDataItemNames(
            codec_data=[NameBodyExt(section=Mime(codec_data=[1, 2, 3, 4]), partial=(42, 1337), peek=True)]
        ),
        uid=True,
    ),
)

type_codec_encode(command).dump()
>>> b"ABCD UID FETCH 1,2:* (BODY.PEEK[1.2.3.4.MIME]<42.1337>)\r\n"

model_dump(command)
>>> {
    "tag": "ABCD",
    "body": {
        "Fetch": {
            "sequence_set": [{"Single": {"Value": 1}}, {"Range": [{"Value": 2}, "Asterisk"]}],
            "macro_or_item_names": {
                "MessageDataItemNames": [
                    {"BodyExt": {"section": {"Mime": [1, 2, 3, 4]}, "partial": [42, 1337], "peek": True}}
                ]
            },
            "uid": True,
        }
    },
}

You can find some tests in the tests directory, which can be run using pytest.


I haven’t tested all the structures yet, so there might be some structural mistakes. The main goal of this issue is to propose a potential solution to improve the typing experience on the Python side as I understand it. Additionally, I’m interested in exploring any effective methods to test the consistency of the structure between imap-types and imap-codec-model (it does seem a bit overwhelming).

If it's feasible, perhaps this could be integrated into the Python bindings library. I’d be happy to contribute further!

@duesee
Copy link
Owner

duesee commented Aug 5, 2024

Hey @AzideCupric! Thank you so much for your work (and the congratulations)! I'll try to take a look as soon as possible! I'm very happy about your contribution, but need to find a slot when to give this the attention it deserves :-) Unfortunately, I'm not the most experienced Python developer... But: Very happy to learn from you! Maybe Henning can also chime in.

@HenningHolmDE
Copy link
Contributor

Thanks again for your work, but also for creating this issue, @AzideCupric, I still had moving your suggestions into a new issue on my list but unfortunately didn't get round to it yet. I'm really sorry for keeping you waiting this long.

I think a better type interface on the Python side is definitely a topic we should persue further and your approach looks reasonable. I'm currently not sure how this impacts the maintainability of imap-codec and imap-types (as the type structure is maintained in both languages), but I'm sure we will find a way. Maybe we will have to separate the repositories after all, so Rust developers do not have to change Python code all the time and the Python bindings become a downstream project.

I'm still catching up with other projects at the moment, but otherwise I'm happy to support on this. Hopefully, I'll find the time to take a good look at your suggestions over the weekend.

@AzideCupric
Copy link
Author

Thanks for the patient responses from both of you. I am here awaiting good news.
And if there is a need, I am also more than willing to offer my help here.

@HenningHolmDE
Copy link
Contributor

@AzideCupric Thanks again for you patience! I now finally found some time to take a deeper look at this. I really like the work you have done here!

As I could not find an example on how to construct one of the types, I played around with it myself and result looks pretty neat (maybe except for the type paths but I did not want to pollute the local namespace too much):

# imap_codec_model/models/__init__.py
from . import fetch, sequence

# tests/test_validate.py
from imap_codec_model import models


def test_construct_command():
    remaining, decoded = type_codec_decode(
        CommandCodec, b"ABCD UID FETCH 1,2:* (BODY.PEEK[1.2.3.4.MIME]<42.1337>)\r\n"
    )
    assert remaining == b""

    ref = models.Command(
        tag="ABCD",
        body=models.command.Fetch(
            sequence_set=[
                models.sequence.Single(models.sequence.Value(1)),
                models.sequence.Range((models.sequence.Value(2), "Asterisk")),
            ],
            macro_or_item_names=models.fetch.MessageDataItemNames(
                [
                    models.fetch.NameBodyExt(
                        section=models.fetch.Mime([1, 2, 3, 4]),
                        partial=(42, 1337),
                        peek=True,
                    )
                ]
            ),
            uid=True,
        ),
    )
    assert decoded == ref

There is actually not much for me to add, as I think working on your IMAP extension for NoneBot will probably guide you to implementing a better API than any theoretical stuff we would come up with will. I would probably prefer wrapper types instead of re-export the types from imap-codec so you don't end up with multiple codec and message types in the user facing API.
Is there anything we could do on the Rust side to make life easier for you?

However, I think the question on how to continue with this is more of organizational than of technical nature. As mentioned above, I'm a still bit hesitant about extending imap-codec-python in this repository with lots of Python code as it would limit working on imap-codec to "polyglot developers". For the bare minimum right now, it is probably okay, but going forward, we should move to a different structure.

Off hand, I see two paths for this:

  1. We could move the Python bindings into their own repository where you can then add the new API to it.
  2. You could also create your own Python package (which would be imap-codec-model right now) sitting on top of imap-codec and we would update the documentation to refer users over to your package for a more pythonic interface.

@AzideCupric, @duesee What are your thoughts on this?

@AzideCupric
Copy link
Author

Thanks so much for your kind words!

Regarding your question:

Is there anything we could do on the Rust side to make life easier for you?

As you might have noticed in utils.py, there is an additional step where we adjust the dictionary structure before conversion. I wonder if there's a way to handle this on the Rust side. Specifically, could we modify the output structure from something like:

{
    "Unseen": 12
}

to a format more like:

{
    "tag": "Unseen",
    "data": 12
}

This would help avoid unnecessary performance costs and potential errors when adjusting the structure after generation.

Additionally, I'm concerned that testing each IMAP lines one by one might be too extensive and time-consuming. I'm wondering if there's a more efficient way to validate the alignment between the Python models and imap-types, so that any discrepancies can be identified and resolved more quickly.

  1. We could move the Python bindings into their own repository where you can then add the new API to it.
  2. You could also create your own Python package (which would be imap-codec-model right now) sitting on top of imap-codec and we would update the documentation to refer users over to your package for a more pythonic interface.

Between the two options, I lean more towards the first one. As a user, it’s always more convenient to have just one dependency rather than two. Moreover, needing to install two different packages from separate authors for a single feature might feel a bit odd or even unsettling.

@duesee
Copy link
Owner

duesee commented Aug 11, 2024

As you might have noticed in utils.py, there is an additional step where we adjust the dictionary structure before conversion. I wonder if there's a way to handle this on the Rust side. Specifically, could we modify the output structure from something like:

{
    "Unseen": 12
}

to a format more like:

{
    "tag": "Unseen",
    "data": 12
}

This can be configured through serdes https://serde.rs/enum-representations.html. I find the internally tagged variant most easy to work with. So, I would be in favor of this change 👍

@HenningHolmDE
Copy link
Contributor

@AzideCupric To give you a heads-up on this: @duesee and I will move the Python bindings code into their own repository over the next days.

Afterwards, it would be awesome if you can provide a PR with the first set of changes, as we really would like to see the great work you did merged into the project, so others can benefit from it.

Meanwhile, I will look into the mapping thing and see, how I can change it to the internally tagged representation. It would indeed be nice, if we could get rid of the conversions.

@AzideCupric
Copy link
Author

Thank you for the update! It's an honor to contribute to the project, I'm excited for the changes ahead and to see how we can improve things further together.

@HenningHolmDE
Copy link
Contributor

@AzideCupric Thanks again for your patience! After we have now separated the repositories (see https://github.com/duesee/imap-codec-python), I took a shot at the enum representation.

The serde attribute referred to by @duesee seems to work as expected. However, as there are several enumerations using tuple variants, thus, the only useful alternative to the "externally tagged" default seems to be "adjacently tagged". (At least if we want to keep this consistent for all enum types.)

This means, that { "Unseen": 12 } now becomes { "type": "Unseen", "data": 12 } as discussed. However, this also results in { "BodyExt": { "section": None, ... } } becoming { "type": "BodyExt", "data": { "section": None, ... } } instead of { "type": "BodyExt", "section": None, ... }. The latter might have been more convenient for conversion?

You should be able to compile/install the package directly from my development branch: pip install --force-reinstall imap-codec@git+https://github.com/duesee/imap-codec-python@henning/enum_representations

The corresponding branches/changes can be found here:

I'm curious if this makes things work better in combination with msgspec and allows us to get rid of the dict transformations.

@AzideCupric
Copy link
Author

AzideCupric commented Aug 29, 2024

Sorry for the delay in my response, and thank you so much for the work you've done on the enum_representations branch!
After reviewing the relevant code, I believe there are still some points that need further discussion.

I'm curious if this makes things work better in combination with msgspec and allows us to get rid of the dict transformations.

If "adjacently tagged" is the only option, as mentioned in my initial post, it forces all structures into the second format. For example, in the case of:

#[derive(Serialize, Deserialize, Debug)]
enum Status {
    Untagged(StatusBody),
    // ...
}

#[derive(Serialize, Deserialize, Debug)]
struct StatusBody {
    kind: StatusKind,
    code: Option<Code>,
    text: String,
}

We would have to write it as:

# now
class Untagged(TaggedBase):
    codec_data: StatusBody
    
class StatusBody(TaggedBase):
    kind: StatusKind
    code: Code | None
    text: str

res = Untagged(StatusBody(kind="Ok", code=None, text="..."))

# before
class Untagged(TaggedBase):
    kind: StatusKind
    code: Code | None
    text: str

res = Untagged(kind="Ok", code=None, text="...")

Compared to the previous approach, this adds an additional import of StatusBody and introduces another layer of nesting, making it less straightforward.

If possible, perhaps a custom derive macro could be implemented to apply Custom serialization. This could further adjust/restore structures that meet certain criteria, allowing us to retain the simplicity of the previous approach while still benefiting from the "adjacently tagged" base. Since some structural transformations seem inevitable, keeping them on the Rust side might be more efficient.

The latter might have been more convenient for conversion?

Additionally, I think the tag name used in the current enum_representations branch is too generic. The use of type as a tag might conflict with certain names in imap-types, potentially leading to unnecessary clashes. This is why I opted for codec_model and codec_data on the Python side.

Finally, I noticed that the current Python bindings repository could benefit from adding static type checking and formatting tools like pyright and ruff, as well as dependency management tools (like pdm) to enhance the development experience. Especially with the introduction of msgspec structs, static type checking would help catch errors early. If agreeable, I can contribute by adding some configurations and related documentation.

@duesee
Copy link
Owner

duesee commented Oct 3, 2024

Just a small update: We havn't forgotten about you :-) It's just that we don't have funding currently and too busy with our day-to-day jobs... But: Your work will not get old, don't worry! We'll surely come back to this eventually!

@AzideCupric
Copy link
Author

I'm in the same boat, and I also except this new feature can be implemented soon!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: No status
Development

No branches or pull requests

3 participants