should our data models be hashable / immutable #24

Open
d-v-b opened this issue Nov 21, 2024 · 18 comments
Comments

@d-v-b
Contributor

d-v-b commented Nov 21, 2024

I have a test in pydantic-ome-ngff that checks if my multiscale model is hashable. I don't remember why I added this test, but evidently it can be useful for these metadata models to be hashable, e.g. so that we can store them in python sets.

Making metadata models hashable requires using frozen pydantic models / dataclasses. We previously decided against using frozen models, because we wanted to allow in-place mutation of those models. How do we weigh the value of in-place mutation vs the advantages of having hashable models? (Maybe we think in place mutation is much more useful...)
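A minimal sketch of what hashability buys us (assuming pydantic v2; this Axis is a simplified stand-in for the real model):

from pydantic import BaseModel

class Axis(BaseModel, frozen=True):
    name: str
    type: str | None = None
    unit: str | None = None

# frozen models are hashable, so they can go into sets or be used as dict keys
axes = {Axis(name='x', type='space'), Axis(name='x', type='space')}
assert len(axes) == 1

ax = Axis(name='x')
# ax.name = 'y'  # raises a ValidationError, because the instance is frozen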

@joshmoore
Member

I'm not sure if it's related, but if in place mutation is necessary for the needed extensibility, I'd value that over hashability.

@d-v-b
Contributor Author

d-v-b commented Nov 22, 2024

I think the recommendation for extensions would be to use subclassing, e.g. class MyAxis(Axis): .... I think the overlap with mutability is small, but if anything, immutable data models would encourage subclassing, because users would be structurally discouraged from adding dynamic, undocumented fields to the core data models.
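For illustration, a sketch of that subclassing pattern (MyAxis and its extra field are hypothetical; assuming pydantic v2):

from pydantic import BaseModel

class Axis(BaseModel, frozen=True):
    name: str
    type: str | None = None
    unit: str | None = None

class MyAxis(Axis):
    # an extension declares its extra field explicitly, rather than attaching
    # ad-hoc attributes to the core model (which a frozen model rejects anyway)
    interpolation: str | None = None

my_ax = MyAxis(name='x', type='space', interpolation='linear')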

@d-v-b
Contributor Author

d-v-b commented Nov 22, 2024

If we made our classes frozen, we would need to ensure that it's easy to make a copy with new, validated, attributes. We can do this with the following design

class Axis(Base):
    """
    Model for an element of `Multiscale.axes`.

    See https://ngff.openmicroscopy.org/0.4/#axes-md.
    """

    name: str
    type: str | None = None
    unit: str | None = None
    
    def with_name(self, name: str) -> "Axis":
        """
        Return a copy of the axis with a new name.
        """
        return type(self)(**self.model_dump(exclude={'name'}), name=name)


    def with_type(self, type: str | None) -> "Axis":
        """
        Return a copy of the axis with a new type.
        """
        # use __class__ here because the `type` builtin is shadowed by the parameter
        return self.__class__(**self.model_dump(exclude={'type'}), type=type)


ax = Axis(name='foo', type='space')
# this will make a new axis with name 'bar', and type time
new_ax = ax.with_name('bar').with_type('time')

obviously we would complete this with with_unit, if people like the design

@dstansby
Contributor

I'm 👎 on having to manually write those with_* methods (although they could probably look a bit nicer if they were classmethods?). Doesn't pydantic provide a nice way to do this?

@d-v-b
Contributor Author

d-v-b commented Nov 22, 2024

I'm happy to write them, and they have to be instance methods because they need the original values.

@tlambert03

tlambert03 commented Nov 22, 2024

pydantic has model_copy that provides options to create a copy with specific field(s) updated.

In [12]: Axis(name='x').model_copy(update={'name': 'Y'})
Out[12]: Axis(name='Y', type=None, unit=None)

however it doesn't perform validation:

In [15]: Axis(name='x').model_copy(update={'name': ()})
Out[15]: Axis(name=(), type=None, unit=None)

when I want immutability, but also want the convenience of easy updates, I often add a replace() method to my base class:

from typing import Any

from pydantic import BaseModel
from typing_extensions import Self


class FrozenModel(BaseModel):
    model_config = {'frozen': True}

    def replace(self, **kwargs: Any) -> Self:
        """Return a new instance replacing specified kwargs with new values."""
        return type(self).model_validate({**self.model_dump(), **kwargs})

and then if you want model-specific type hinting on each subclass, one option is to use kwargs: Unpack[SomeTypedDict]. Note, however, that model_dump() recurses all the way into nested models, so if you only intend to allow swapping of top-level fields, you can save some cycles with:

        d = {k: getattr(self, k) for k in self.model_fields}
        return type(self).model_validate({**d, **kwargs})
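For illustration, a self-contained sketch of that kwargs: Unpack[SomeTypedDict] idea (AxisKwargs and its fields are made up, and Self/Unpack come from typing_extensions here):

from typing import Any, TypedDict

from pydantic import BaseModel
from typing_extensions import Self, Unpack


class FrozenModel(BaseModel):
    model_config = {'frozen': True}

    def replace(self, **kwargs: Any) -> Self:
        """Return a new instance replacing specified kwargs with new values."""
        return type(self).model_validate({**self.model_dump(), **kwargs})


class AxisKwargs(TypedDict, total=False):
    name: str
    type: str | None
    unit: str | None


class Axis(FrozenModel):
    name: str
    type: str | None = None
    unit: str | None = None

    def replace(self, **kwargs: Unpack[AxisKwargs]) -> Self:
        # typed override: editors / type checkers now see the model-specific fields
        return super().replace(**kwargs)


new_ax = Axis(name='x').replace(name='y', unit='micrometer')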

I've generally been happy when I've made things immutable... particularly when they are objects that may be passed as arguments to various listeners/consumers. So I'd certainly try to make them immutable, but yeah, like @joshmoore said, if it really gets in the way then meh.

@d-v-b
Contributor Author

d-v-b commented Nov 22, 2024

thanks @tlambert03 those examples are super helpful! I love that top-level replace method, that's exactly the kind of thing we could wrap with with_{field} if we do go in that direction.

d-v-b changed the title from "should our data models be hashable" to "should our data models be hashable / immutable" on Nov 28, 2024
@d-v-b
Contributor Author

d-v-b commented Nov 28, 2024

I updated the title of this issue to include an explicit reference to immutability.

Another point in favor of immutability for these data models: we have logically coupled fields, namely:

  • fields sensitive to the number of axes of the data, like multiscale.axes, multiscale.coordinateTransformations, and dataset.coordinateTransformations
  • the number of multiscale.datasets is coupled to the number of zarr arrays in the zarr group

If we allow mutation, and we attempt to validate when attributes change, then it's not possible to mutate just one of these fields without causing a validation error. We would have to mutate all the logically coupled fields in one go, which is not possible with vanilla __setattr__.

Consider the case of adding new axes in multiscales metadata -- if you only mutate the axes attribute, then you have invalidated the coordinateTransformations, and vice versa. The only valid way to apply the "add axes" transformation is by creating a totally new model.

So if we want to support operations like "add axes", then we have to treat the models as immutable at least within the scope of those operations, in which case IMO we should just be consistent and treat the models as immutable for all transformations.

(this came up in a conversation on zulip)
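To make the coupling concrete, here is a simplified, hypothetical sketch (the real multiscale model has more fields): mutating axes alone can never pass validation, so "add an axis" has to build a new model in one step.

from pydantic import BaseModel, model_validator


class Dataset(BaseModel, frozen=True):
    path: str
    scale: tuple[float, ...]


class Multiscale(BaseModel, frozen=True):
    axes: tuple[str, ...]
    datasets: tuple[Dataset, ...]

    @model_validator(mode='after')
    def check_axis_count(self):
        # coupled fields: every dataset's scale must match the number of axes
        if any(len(d.scale) != len(self.axes) for d in self.datasets):
            raise ValueError('each scale must have one entry per axis')
        return self


old = Multiscale(axes=('y', 'x'), datasets=(Dataset(path='0', scale=(1.0, 1.0)),))

# "add an axis" builds a brand-new, internally consistent model
new = Multiscale(
    axes=('z', *old.axes),
    datasets=tuple(Dataset(path=d.path, scale=(1.0, *d.scale)) for d in old.datasets),
)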

@tlambert03

We would have to mutate all the logically coupled fields in one go, which is not possible with vanilla __setattr__.

yeah, but it is probably possible with @model_validator(mode='before'), where you could certainly make sure that the field isn't set at all if it's inconsistent with other fields on the model.
In other words, I think most (any?) interconnected models that you can conceive of are probably pretty feasible with pydantic... is your point here about the feasibility of implementing such a pattern, or something at a higher level?

this is not to argue in favor of mutability, but if the counter argument here is more of a "how would we do that?" argument, i think it's possible?

@d-v-b
Contributor Author

d-v-b commented Nov 28, 2024

@tlambert03 can you explain how this would work? Suppose i have a model like this:

from pydantic import BaseModel, model_validator


class Model(BaseModel):
    foo: tuple[int, ...]
    bar: tuple[int, ...]

    @model_validator(mode='after')
    @classmethod
    def ensure_foo_bar_same_length(cls, v):
        if len(v.foo) != len(v.bar):
            raise ValueError('foo and bar must have the same length')
        return v

how would you support something like this:

my_model = Model(foo=(1,2,3), bar=(4,5,6))
my_model.foo = (1,2,3,4) # this will raise if we validate on `__setattr__`
my_model.bar = (4,5,6,7)

@tlambert03

sure,

first off, if you want pydantic to validate field assignments at all after model creation, you need to use validate_assignment:

from pydantic import BaseModel, model_validator, ConfigDict

class Model(BaseModel):
    model_config = ConfigDict(validate_assignment=True)

With that alone, your code actually would raise an exception on the second line:

my_model = Model(foo=(1, 2, 3), bar=(4, 5, 6))
my_model.foo = (1, 2, 3, 4)
Traceback (most recent call last):
  File "/Users/talley/Desktop/x.py", line 19, in <module>
    my_model.foo = (1, 2, 3, 4)
    ^^^^^^^^^^^^
  File "/Users/talley/micromamba/envs/all/lib/python3.13/site-packages/pydantic/main.py", line 923, in __setattr__
    self.__pydantic_validator__.validate_assignment(self, name, value)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.ValidationError: 1 validation error for Model
  Value error, foo and bar must have the same length [type=value_error, input_value=Model(foo=(1, 2, 3, 4), bar=(4, 5, 6)), input_type=Model]
    For further information visit https://errors.pydantic.dev/2.10/v/value_error
Side-note on mode='after':

If you're using @model_validator(mode='after'), it now actually supports instance methods, which makes a lot of sense and provides a nicer syntax... so you could update your code to:

    @model_validator(mode='after')
    def ensure_foo_bar_same_length(self) -> Self:
        if len(self.foo) != len(self.bar):
            raise ValueError('foo and bar must have the same length')
        return self

however, there is still a problem here (it's an open issue at pydantic/pydantic#7105), which is that, while a validation error was raised, the model is left in an inconsistent state:

import contextlib

my_model = Model(foo=(1, 2, 3), bar=(4, 5, 6))
with contextlib.suppress(Exception):  # swallow the validation error
    my_model.foo = (1, 2, 3, 4)

print(my_model)
foo=(1, 2, 3, 4) bar=(4, 5, 6)  # 😢

So you have to use model_validator(mode='before'). This is more powerful, but it means you now need to deal with the raw input rather than the post-validated data. Here's a complete example:

from typing import Any
from pydantic import BaseModel, model_validator, ConfigDict


class Model(BaseModel):
    model_config = ConfigDict(validate_assignment=True)

    foo: tuple[int, ...]
    bar: tuple[int, ...]

    @model_validator(mode="before")
    @classmethod
    def ensure_foo_bar_same_length(cls, data: dict[str, Any]) -> dict[str, Any]:
        foo = data.get("foo", ())
        bar = data.get("bar", ())
        if len(foo) != len(bar):
            raise ValueError("foo and bar must be the same length")
        return data


my_model = Model(foo=(1, 2, 3), bar=(4, 5, 6))
try:
    my_model.foo = (1, 2, 3, 4)
except Exception as e:
    print(e)

print("after setting:", my_model)
1 validation error for Model
  Value error, foo and bar must be the same length [type=value_error, input_value={'foo': (1, 2, 3, 4), 'bar': (4, 5, 6)}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.10/v/value_error
after setting: foo=(1, 2, 3) bar=(4, 5, 6)

@d-v-b
Contributor Author

d-v-b commented Nov 28, 2024

Thanks for the examples, that's very helpful. The end of your example produces a pydantic class where validation runs on assignment, but because of the coupling between the fields, this assignment can never succeed. Suppose we filled out the model with more fields, and most of them were not coupled. Then assignment would work for some fields, but never for others, and this is exactly the outcome I want to avoid by making the models immutable! (that's what I was trying to express in this comment).

I think it's super valuable if the API for making changes to these models is uniform, which means we should avoid telling users "you can modify some attributes with direct assignment, unless they are special, in which case do it very differently".

I think this setup would be much better:

from typing import Any, Self

from pydantic import BaseModel, model_validator


class Model(BaseModel, frozen=True):
    foo: tuple[int, ...]
    bar: tuple[int, ...]
    baz: int

    @model_validator(mode="before")
    @classmethod
    def ensure_foo_bar_same_length(cls, data: dict[str, Any]) -> dict[str, Any]:
        foo = data.get("foo", ())
        bar = data.get("bar", ())
        if len(foo) != len(bar):
            raise ValueError("foo and bar must be the same length")
        return data

    def with_baz(self, new_baz: int) -> Self:
        """Create a new model with a new baz attribute"""
        ...

    def with_foo_and_bar(self, new_foo: tuple[int, ...], new_bar: tuple[int, ...]) -> Self:
        """Create a new model with new foo and bar attributes"""
        ...

Essentially for each set of coupled attributes (a set which could have 1 element), we expose a convenience method that takes new values for those attributes as parameters and returns a brand-new model. No need to worry about invalid models with this approach. We do lose direct assignment, but IMO it's too problematic to keep.
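For concreteness, here is a rough, self-contained sketch that pulls together the frozen model, the mode="before" validator, and a replace()-style helper to fill in the with_* bodies. The dump-and-revalidate helper and the names are assumptions, not settled API (typing.Self needs Python 3.11+).

from typing import Any, Self

from pydantic import BaseModel, model_validator


class Model(BaseModel, frozen=True):
    foo: tuple[int, ...]
    bar: tuple[int, ...]
    baz: int

    @model_validator(mode="before")
    @classmethod
    def ensure_foo_bar_same_length(cls, data: dict[str, Any]) -> dict[str, Any]:
        if len(data.get("foo", ())) != len(data.get("bar", ())):
            raise ValueError("foo and bar must be the same length")
        return data

    def _replace(self, **changes: Any) -> Self:
        # apply all changes at once and re-validate, so coupled fields are checked together
        return type(self).model_validate({**self.model_dump(), **changes})

    def with_baz(self, new_baz: int) -> Self:
        return self._replace(baz=new_baz)

    def with_foo_and_bar(self, new_foo: tuple[int, ...], new_bar: tuple[int, ...]) -> Self:
        return self._replace(foo=new_foo, bar=new_bar)


m = Model(foo=(1, 2), bar=(3, 4), baz=0)
m2 = m.with_foo_and_bar((1, 2, 3), (4, 5, 6))  # a new, validated model; m is unchanged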

@tlambert03

this assignment can never succeed.

Ahhh yes I see now (sorry I missed that point earlier!). Yeah, if you have fields that MUST change together... then one would need to use custom setter methods that take one or more arguments and validate them together (honestly, it kinda seems like you'd need that even for immutable objects with a "with_" method). And one should never modify the attributes directly.

I know you have less control here over the model itself (inasmuch as it's just mirroring the spec). But if you can avoid mutually constrained fields altogether on the runtime model side, that would make the data model a bit easier to work with.

@d-v-b
Contributor Author

d-v-b commented Nov 28, 2024

But if you can avoid mutually constrained fields altogether on the runtime model side, that would make the data model a bit easier to work with.

This is the job of libraries higher up in the stack :) For this library, our scope is narrowly constrained to modelling the exact metadata in OME-Zarr as it is defined there, which means we will have to deal with logically coupled fields.

@tlambert03

Well, not necessarily: you could always have computed fields that strictly model upstream, with input fields that are not mutually constrained. Point being, you can absolutely model something strictly mirroring upstream while providing an API that makes it easier to work with mutually constrained stuff.

(Again not endorsing either approach here, just exploring options :) )

@d-v-b
Contributor Author

d-v-b commented Nov 28, 2024

we are pretty specifically constrained with respect to our input fields (i.e., they are mutually constrained), because (at least at the moment) we want to construct these models from raw JSON documents, and in those documents the fields are mutually constrained. So if we are to introduce a "simplified" API on top of that, it can't manifest in the signature of __init__, and would have to be somewhere else (e.g., a build method, that's basically "make a model but without needing to model JSON directly")
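As an illustrative sketch (the names and fields here are hypothetical, not an actual API proposal), such a build method might derive the coupled fields from a simpler signature while leaving __init__ as a direct mirror of the JSON:

from pydantic import BaseModel


class Multiscale(BaseModel, frozen=True):
    # JSON-shaped fields, mutually constrained: one scale vector per level,
    # each with one entry per axis
    axes: tuple[str, ...]
    scales: tuple[tuple[float, ...], ...]

    @classmethod
    def build(cls, axes: tuple[str, ...], num_levels: int, factor: float = 2.0) -> "Multiscale":
        # derive the coupled field so it is consistent with `axes` by construction
        scales = tuple(tuple(factor ** level for _ in axes) for level in range(num_levels))
        return cls(axes=axes, scales=scales)


ms = Multiscale.build(axes=('z', 'y', 'x'), num_levels=3)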

@tlambert03

Yep definitely makes sense! I agree, it all brings it back to immutable models being probably a good idea

@dstansby
Contributor

Finally read this thread! Seems like we should make all our models immutable for now; if we want to make them mutable down the line, it's easier to go in that direction than from mutable to immutable.
