should our data models be hashable / immutable #24

Open
d-v-b opened this issue Nov 21, 2024 · 18 comments
Comments

@d-v-b
Contributor

d-v-b commented Nov 21, 2024

I have a test in pydantic-ome-ngff that checks if my multiscale model is hashable. I don't remember why I added this test, but evidently it can be useful for these metadata models to be hashable, e.g. so that we can store them in python sets.

Making metadata models hashable requires using frozen pydantic models / dataclasses. We previously decided against using frozen models, because we wanted to allow in-place mutation of those models. How do we weigh the value of in-place mutation vs the advantages of having hashable models? (Maybe we think in place mutation is much more useful...)
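A minimal sketch of what hashability buys us (assuming pydantic v2; this Axis is a simplified stand-in for the real model):

from pydantic import BaseModel

class Axis(BaseModel, frozen=True):
    name: str
    type: str | None = None
    unit: str | None = None

# frozen models are hashable, so they can go into sets or be used as dict keys
axes = {Axis(name='x', type='space'), Axis(name='x', type='space')}
assert len(axes) == 1

ax = Axis(name='x')
# ax.name = 'y'  # raises a ValidationError, because the instance is frozen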

@joshmoore
Member

I'm not sure if it's related, but if in place mutation is necessary for the needed extensibility, I'd value that over hashability.

@d-v-b
Contributor Author

d-v-b commented Nov 22, 2024

I think the recommendation for extensions would be to use subclassing, e.g. class MyAxis(Axis): .... I think the overlap with mutability is small, but if anything, immutable data models would encourage subclassing, because users would be structurally discouraged from adding dynamic, undocumented fields to the core data models.
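For illustration, a sketch of that subclassing pattern (MyAxis and its extra field are hypothetical; assuming pydantic v2):

from pydantic import BaseModel

class Axis(BaseModel, frozen=True):
    name: str
    type: str | None = None
    unit: str | None = None

class MyAxis(Axis):
    # an extension declares its extra field explicitly, rather than attaching
    # ad-hoc attributes to the core model (which a frozen model rejects anyway)
    interpolation: str | None = None

my_ax = MyAxis(name='x', type='space', interpolation='linear')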

@d-v-b
Contributor Author

d-v-b commented Nov 22, 2024

If we made our classes frozen, we would need to ensure that it's easy to make a copy with new, validated, attributes. We can do this with the following design

class Axis(Base):
    """
    Model for an element of `Multiscale.axes`.

    See https://ngff.openmicroscopy.org/0.4/#axes-md.
    """

    name: str
    type: str | None = None
    unit: str | None = None
    
    def with_name(self, name: str) -> "Axis":
        """
        Return a copy of the axis with a new name.
        """
        return type(self)(**self.model_dump(exclude={'name'}), name=name)


    def with_type(self, type: str | None) -> "Axis":
        """
        Return a copy of the axis with a new type.
        """
        # use __class__ here because the `type` builtin is shadowed by the parameter
        return self.__class__(**self.model_dump(exclude={'type'}), type=type)


ax = Axis(name='foo', type='space')
# this will make a new axis with name 'bar', and type time
new_ax = ax.with_name('bar').with_type('time')

obviously we would complete this with with_unit, if people like the design

@dstansby
Contributor

I'm 👎 on having to manually write those with_* methods (although they could probably look a bit nicer if they were classmethods?). Doesn't pydantic provide a nice way to do this?

@d-v-b
Contributor Author

d-v-b commented Nov 22, 2024

I'm happy to write them, and they have to be instance methods because they need the original values.

@tlambert03

tlambert03 commented Nov 22, 2024

pydantic has model_copy that provides options to create a copy with specific field(s) updated.

In [12]: Axis(name='x').model_copy(update={'name': 'Y'})
Out[12]: Axis(name='Y', type=None, unit=None)

however it doesn't perform validation:

In [15]: Axis(name='x').model_copy(update={'name': ()})
Out[15]: Axis(name=(), type=None, unit=None)

when I want immutability, but also want the convenience of easy updates, I often add a replace() method to my base class:

from typing import Any

from pydantic import BaseModel
from typing_extensions import Self


class FrozenModel(BaseModel):
    model_config = {'frozen': True}

    def replace(self, **kwargs: Any) -> Self:
        """Return a new instance replacing specified kwargs with new values."""
        return type(self).model_validate({**self.model_dump(), **kwargs})

and then if you want model-specific type hinting on each subclass, one option is to use kwargs: Unpack[SomeTypedDict]. Note, however, that model_dump() recurses all the way into nested models, so if you only intend to allow swapping of top-level fields, you can save some cycles with:

        d = {k: getattr(self, k) for k in self.model_fields}
        return type(self).model_validate({**d, **kwargs})
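For illustration, a self-contained sketch of that kwargs: Unpack[SomeTypedDict] idea (AxisKwargs and its fields are made up, and Self/Unpack come from typing_extensions here):

from typing import Any, TypedDict

from pydantic import BaseModel
from typing_extensions import Self, Unpack


class FrozenModel(BaseModel):
    model_config = {'frozen': True}

    def replace(self, **kwargs: Any) -> Self:
        """Return a new instance replacing specified kwargs with new values."""
        return type(self).model_validate({**self.model_dump(), **kwargs})


class AxisKwargs(TypedDict, total=False):
    name: str
    type: str | None
    unit: str | None


class Axis(FrozenModel):
    name: str
    type: str | None = None
    unit: str | None = None

    def replace(self, **kwargs: Unpack[AxisKwargs]) -> Self:
        # typed override: editors / type checkers now see the model-specific fields
        return super().replace(**kwargs)


new_ax = Axis(name='x').replace(name='y', unit='micrometer')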

I've generally been happy when I've made things immutable... particularly when they are objects that may be passed as arguments to various listeners/consumers. So I'd certainly try to make them immutable, but yeah, like @joshmoore said, if it really gets in the way then meh.

@d-v-b
Contributor Author

d-v-b commented Nov 22, 2024

thanks @tlambert03 those examples are super helpful! I love that top-level replace method, that's exactly the kind of thing we could wrap with with_{field} if we do go in that direction.

d-v-b changed the title from "should our data models be hashable" to "should our data models be hashable / immutable" on Nov 28, 2024
@d-v-b
Contributor Author

d-v-b commented Nov 28, 2024

I updated the title of this issue to include an explicit reference to immutability.

Another point in favor of immutability for these data models: we have logically coupled fields, namely:

  • fields sensitive to the number of axes of the data, like multiscale.axes, multiscale.coordinateTransformations, and dataset.coordinateTransformations
  • the number of multiscale.datasets is coupled to the number of zarr arrays in the zarr group

If we allow mutation, and we attempt to validate when attributes change, then it's not possible to mutate just one of these fields without causing a validation error. We would have to mutate all the logically coupled fields in one go, which is not possible with vanilla __setattr__.

Consider the case of adding new axes in multiscales metadata -- if you only mutate the axes attribute, then you have invalidated the coordinateTransformations, and vice versa. The only valid way to apply the "add axes" transformation is by creating a totally new model.

So if we want to support operations like "add axes", then we have to treat the models as immutable at least within the scope of those operations, in which case IMO we should just be consistent and treat the models as immutable for all transformations.

(this came up in a conversation on zulip)
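To make the coupling concrete, here is a simplified, hypothetical sketch (the real multiscale model has more fields): mutating axes alone can never pass validation, so "add an axis" has to build a new model in one step.

from pydantic import BaseModel, model_validator


class Dataset(BaseModel, frozen=True):
    path: str
    scale: tuple[float, ...]


class Multiscale(BaseModel, frozen=True):
    axes: tuple[str, ...]
    datasets: tuple[Dataset, ...]

    @model_validator(mode='after')
    def check_axis_count(self):
        # coupled fields: every dataset's scale must match the number of axes
        if any(len(d.scale) != len(self.axes) for d in self.datasets):
            raise ValueError('each scale must have one entry per axis')
        return self


old = Multiscale(axes=('y', 'x'), datasets=(Dataset(path='0', scale=(1.0, 1.0)),))

# "add an axis" builds a brand-new, internally consistent model
new = Multiscale(
    axes=('z', *old.axes),
    datasets=tuple(Dataset(path=d.path, scale=(1.0, *d.scale)) for d in old.datasets),
)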

@tlambert03

We would have to mutate all the logically coupled fields in one go, which is not possible with vanilla __setattr__.

yeah, but it is probably possible with @model_validator(mode='before'), where you could certainly make sure that the field isn't set at all if it's inconsistent with other fields on the model.
In other words, I think most (any?) interconnected models that you can conceive of are probably pretty feasible with pydantic... is your point here about the feasibility of implementing such a pattern, or something at a higher level?

this is not to argue in favor of mutability, but if the counter argument here is more of a "how would we do that?" argument, i think it's possible?

@d-v-b
Contributor Author

d-v-b commented Nov 28, 2024

@tlambert03 can you explain how this would work? Suppose i have a model like this:

from pydantic import BaseModel, model_validator


class Model(BaseModel):
    foo: tuple[int, ...]
    bar: tuple[int, ...]

    @model_validator(mode='after')
    @classmethod
    def ensure_foo_bar_same_length(cls, v):
        if len(v.foo) != len(v.bar):
            raise ValueError('foo and bar must have the same length')
        return v

how would you support something like this:

my_model = Model(foo=(1,2,3), bar=(4,5,6))
my_model.foo = (1,2,3,4) # this will raise if we validate on `__setattr__`
my_model.bar = (4,5,6,7)

@tlambert03

sure,

first off, if you want pydantic to validate field assignments at all after model creation, you need to use validate_assignment:

from pydantic import BaseModel, model_validator, ConfigDict

class Model(BaseModel):
    model_config = ConfigDict(validate_assignment=True)

With that alone, your code actually would raise an exception on the second line:

my_model = Model(foo=(1, 2, 3), bar=(4, 5, 6))
my_model.foo = (1, 2, 3, 4)
Traceback (most recent call last):
  File "/Users/talley/Desktop/x.py", line 19, in <module>
    my_model.foo = (1, 2, 3, 4)
    ^^^^^^^^^^^^
  File "/Users/talley/micromamba/envs/all/lib/python3.13/site-packages/pydantic/main.py", line 923, in __setattr__
    self.__pydantic_validator__.validate_assignment(self, name, value)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.ValidationError: 1 validation error for Model
  Value error, foo and bar must have the same length [type=value_error, input_value=Model(foo=(1, 2, 3, 4), bar=(4, 5, 6)), input_type=Model]
    For further information visit https://errors.pydantic.dev/2.10/v/value_error
Side-note on mode='after':

If you're using @model_validator(mode='after'), it now actually supports instance methods, which makes a lot of sense and provides a nicer syntax... so you could update your code to:

    @model_validator(mode='after')
    def ensure_foo_bar_same_length(self) -> Self:
        if len(self.foo) != len(self.bar):
            raise ValueError('foo and bar must have the same length')
        return self

however, there is still a problem here (it's an open issue at pydantic/pydantic#7105), which is that, while a validation error was raised, the model is left in an inconsistent state:

import contextlib

my_model = Model(foo=(1, 2, 3), bar=(4, 5, 6))
with contextlib.suppress(Exception):  # swallow the validation error
    my_model.foo = (1, 2, 3, 4)

print(my_model)
foo=(1, 2, 3, 4) bar=(4, 5, 6)  # 😢

So you have to use model_validator(mode='before'). This is more powerful, but it means you now need to deal with the raw input rather than the post-validated data. Here's a complete example:

from typing import Any
from pydantic import BaseModel, model_validator, ConfigDict


class Model(BaseModel):
    model_config = ConfigDict(validate_assignment=True)

    foo: tuple[int, ...]
    bar: tuple[int, ...]

    @model_validator(mode="before")
    @classmethod
    def ensure_foo_bar_same_length(cls, data: dict[str, Any]) -> dict[str, Any]:
        foo = data.get("foo", ())
        bar = data.get("bar", ())
        if len(foo) != len(bar):
            raise ValueError("foo and bar must be the same length")
        return data


my_model = Model(foo=(1, 2, 3), bar=(4, 5, 6))
try:
    my_model.foo = (1, 2, 3, 4)
except Exception as e:
    print(e)

print("after setting:", my_model)
1 validation error for Model
  Value error, foo and bar must be the same length [type=value_error, input_value={'foo': (1, 2, 3, 4), 'bar': (4, 5, 6)}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.10/v/value_error
after setting: foo=(1, 2, 3) bar=(4, 5, 6)

@d-v-b
Contributor Author

d-v-b commented Nov 28, 2024

Thanks for the examples, that's very helpful. The end of your example produces a pydantic class where validation runs on assignment, but because of the coupling between the fields, this assignment can never succeed. Suppose we filled out the model with more fields, and most of them were not coupled. Then assignment would work for some fields, but never for others, and this is exactly the outcome I want to avoid by making the models immutable! (that's what I was trying to express in this comment).

I think it's super valuable if the API for making changes to these models is uniform, which means we should avoid telling users "you can modify some attributes with direct assignment, unless they are special, in which case do it very differently".

I think this setup would be much better:

from typing import Any, Self

from pydantic import BaseModel, model_validator


class Model(BaseModel, frozen=True):
    foo: tuple[int, ...]
    bar: tuple[int, ...]
    baz: int

    @model_validator(mode="before")
    @classmethod
    def ensure_foo_bar_same_length(cls, data: dict[str, Any]) -> dict[str, Any]:
        foo = data.get("foo", ())
        bar = data.get("bar", ())
        if len(foo) != len(bar):
            raise ValueError("foo and bar must be the same length")
        return data

    def with_baz(self, new_baz: int) -> Self:
        """Create a new model with a new baz attribute"""
        ...

    def with_foo_and_bar(self, new_foo: tuple[int, ...], new_bar: tuple[int, ...]) -> Self:
        """Create a new model with new foo and bar attributes"""
        ...

Essentially for each set of coupled attributes (a set which could have 1 element), we expose a convenience method that takes new values for those attributes as parameters and returns a brand-new model. No need to worry about invalid models with this approach. We do lose direct assignment, but IMO it's too problematic to keep.
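For concreteness, here is a rough, self-contained sketch that pulls together the frozen model, the mode="before" validator, and a replace()-style helper to fill in the with_* bodies. The dump-and-revalidate helper and the names are assumptions, not settled API (typing.Self needs Python 3.11+).

from typing import Any, Self

from pydantic import BaseModel, model_validator


class Model(BaseModel, frozen=True):
    foo: tuple[int, ...]
    bar: tuple[int, ...]
    baz: int

    @model_validator(mode="before")
    @classmethod
    def ensure_foo_bar_same_length(cls, data: dict[str, Any]) -> dict[str, Any]:
        if len(data.get("foo", ())) != len(data.get("bar", ())):
            raise ValueError("foo and bar must be the same length")
        return data

    def _replace(self, **changes: Any) -> Self:
        # apply all changes at once and re-validate, so coupled fields are checked together
        return type(self).model_validate({**self.model_dump(), **changes})

    def with_baz(self, new_baz: int) -> Self:
        return self._replace(baz=new_baz)

    def with_foo_and_bar(self, new_foo: tuple[int, ...], new_bar: tuple[int, ...]) -> Self:
        return self._replace(foo=new_foo, bar=new_bar)


m = Model(foo=(1, 2), bar=(3, 4), baz=0)
m2 = m.with_foo_and_bar((1, 2, 3), (4, 5, 6))  # a new, validated model; m is unchanged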

@tlambert03

this assignment can never succeed.

Ahhh yes I see now (sorry I missed that point earlier!). Yeah, if you have fields that MUST change together... then one would need to use custom setter methods that take one or more arguments and validate them together (honestly, it kinda seems like you'd need that even for immutable objects with a "with_" method). And one should never modify the attributes directly.

I know you have less control here over the model itself (inasmuch as it's just mirroring the spec). But if you can avoid mutually constrained fields altogether on the runtime model side, that would make the data model a bit easier to work with.

@d-v-b
Contributor Author

d-v-b commented Nov 28, 2024

But if you can avoid mutually constrained fields altogether on the runtime model side, that would make the data model a bit easier to work with.

This is the job of libraries higher up in the stack :) For this library, our scope is narrowly constrained to modelling the exact metadata in OME-Zarr as it is defined there, which means we will have to deal with logically coupled fields.

@tlambert03

Well, not necessarily: you could always have computed fields that strictly model upstream, with input fields that are not mutually constrained. Point being, you can absolutely model something strictly mirroring upstream while providing an API that makes it easier to work with mutually constrained stuff.

(Again not endorsing either approach here, just exploring options :) )

@d-v-b
Contributor Author

d-v-b commented Nov 28, 2024

we are pretty specifically constrained with respect to our input fields (i.e., they are mutually constrained), because (at least at the moment) we want to construct these models from raw JSON documents, and in those documents the fields are mutually constrained. So if we are to introduce a "simplified" API on top of that, it can't manifest in the signature of __init__, and would have to be somewhere else (e.g., a build method, that's basically "make a model but without needing to model JSON directly")
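As an illustrative sketch (the names and fields here are hypothetical, not an actual API proposal), such a build method might derive the coupled fields from a simpler signature while leaving __init__ as a direct mirror of the JSON:

from pydantic import BaseModel


class Multiscale(BaseModel, frozen=True):
    # JSON-shaped fields, mutually constrained: one scale vector per level,
    # each with one entry per axis
    axes: tuple[str, ...]
    scales: tuple[tuple[float, ...], ...]

    @classmethod
    def build(cls, axes: tuple[str, ...], num_levels: int, factor: float = 2.0) -> "Multiscale":
        # derive the coupled field so it is consistent with `axes` by construction
        scales = tuple(tuple(factor ** level for _ in axes) for level in range(num_levels))
        return cls(axes=axes, scales=scales)


ms = Multiscale.build(axes=('z', 'y', 'x'), num_levels=3)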

@tlambert03

Yep definitely makes sense! I agree, it all brings it back to immutable models being probably a good idea

@dstansby
Contributor

Finally read this thread! Seems like we should make all our models immutable for now; if we want to make them mutable down the line, it's easier to go in that direction than from mutable to immutable.
