
Make KedroContext a frozen dataclass #1465

Conversation

noklam
Contributor

@noklam noklam commented Apr 21, 2022

Description

Fix #1459

Context

As KedroSession now controls the lifecycle of a Kedro run, KedroContext acts as a data container and stores important attributes.
Since we have now dropped Python 3.6 support, we can make use of Python's dataclasses.

The benefits are:

  1. It signifies that KedroContext will be just a data container.
  2. It simplifies class construction (less boilerplate code) using the new Python feature.

Development Changes

This PR is a bit long, so I'll keep the summary of changes at the top.

  • Make KedroContext a dataclass with attrs. Python's built-in dataclasses was considered, but it does not fit well; see more in the discussion below.
    • Any simple public @property that used to do self._x = x is now just a one-line definition at the top with attrs.setters.frozen.
  • config_loader is now a public property of KedroContext.
  • Add tests to ensure important KedroContext attributes remain immutable in case we refactor this in future.

The class __init__ signature is unchanged, as evidenced by this screenshot:
[image: unchanged class init signature]

More Development Notes

  • Make KedroContext a dataclass, signifying it will be simply a container as we gradually move the logic out of it.
    • dataclass doesn't support partially "read-only" properties, so there are a few options to achieve that:
      (a) Do it the old-fashioned way: use @property to achieve read-only attributes and just don't use dataclass, or use an alternative like attrs (the ancestor of dataclass, but not in the standard library).
      (b) Set frozen=True, but use __setattr__ in __post_init__. The benefit is that it locks the context, so an assignment like context.arbitrary_attr = 'xyz' is not possible, and it also reduces the boilerplate of @property definitions; the tradeoff is a more complicated __post_init__ method. (See https://stackoverflow.com/questions/59222092/how-to-use-the-post-init-method-in-dataclasses-in-python.)
      (c) Use dataclass without frozen, and mimic frozen behaviour with a custom implementation instead.

The __post_init__ looks like this if we follow (b):

        object.__setattr__(self, "_package_name", package_name)
        object.__setattr__(
            self, "project_path", Path(project_path).expanduser().resolve()
        )
        object.__setattr__(self, "_extra_params", deepcopy(extra_params))
        object.__setattr__(self, "_hook_manager", hook_manager)

Originally I favoured method (b), as I am okay with a cleaner class that keeps the complicated bit in __post_init__. However, it creates trouble for mypy and other linters, as they don't understand the attribute assignment.
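For reference, a complete runnable sketch of option (b) using only the standard library. The field set is simplified and hypothetical, not the real KedroContext:

```python
from dataclasses import InitVar, dataclass, field
from pathlib import Path

@dataclass(frozen=True)
class MiniContext:
    # resolved in __post_init__ from the raw InitVar below
    project_path: Path = field(init=False)
    raw_project_path: InitVar[str] = "."
    env: str = "local"

    def __post_init__(self, raw_project_path):
        # frozen=True forbids `self.project_path = ...`, so we deliberately
        # bypass the frozen __setattr__ with object.__setattr__
        object.__setattr__(
            self, "project_path", Path(raw_project_path).expanduser().resolve()
        )

ctx = MiniContext("~/my-project")
print(ctx.project_path)  # resolved absolute path ending in my-project
# ctx.env = "test" would raise dataclasses.FrozenInstanceError
```

This shows the tradeoff discussed above: the class is fully locked, but every assignment in __post_init__ has to go through the object.__setattr__ escape hatch.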


Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes

Signed-off-by: noklam <nok.lam.chan@quantumblack.com>
@noklam noklam marked this pull request as draft April 21, 2022 13:42
@noklam noklam marked this pull request as ready for review April 22, 2022 18:00
@noklam noklam self-assigned this Apr 22, 2022
@antonymilne
Contributor

antonymilne commented Apr 22, 2022

Great work getting to the bottom of this! I also didn't realise it would be so difficult.

A few quick comments:

  • remember to update the pyspark starters with the same change you make in the docs
  • I think your solution (b) is fine, but I'm also personally not worried if we just don't use frozen or any properties at all: leave the variables as mutable and assume that people who overwrite them know what they are doing. Then we could do a more normal self._x = x assignment in __post_init__.

Member

@merelcht merelcht left a comment


I tend to agree with @AntonyMilneQB that it would be okay not to use frozen. We'll just need to make it clear to users what the implications of overwriting are.

Comment on lines 196 to 202
def __post_init__(self, package_name, project_path, hook_manager, extra_params):
object.__setattr__(self, "_package_name", package_name)
object.__setattr__(
self, "project_path", Path(project_path).expanduser().resolve()
)
object.__setattr__(self, "_hook_manager", hook_manager)
object.__setattr__(self, "_extra_params", deepcopy(extra_params))
Member


What's the logic behind which attributes are added here and the ones that aren't? For example, why is package_name added but not env?

Contributor Author


Good questions!

Personally, I am not too worried about it, but it is still a change in API, since we now open up the possibility of overwriting it (though one could argue users could already overwrite self._xxxx if they are hacky).

Contributor Author

@noklam noklam Apr 25, 2022


For the second question, it was just about keeping the API unchanged.
env, project_path, config_loader and params (which adds logic on top of self._extra_params) were class properties; _hook_manager, _extra_params and _package_name were always internal.

frozen=True is basically an attempt to achieve a similar effect to @property, but there are some limitations, mainly because dataclass does not offer immutability at the field level, only at the class level.
With a normal class, we just use _attribute_name to signify an internal attribute and expose it as attribute_name with @property.

An example can be found here.
https://noklam.github.io/blog/python/2022/04/22/python-dataclass-partiala-immutable.html
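The read-only @property pattern described above, sketched with a hypothetical minimal class:

```python
class PlainContext:
    def __init__(self, package_name: str):
        self._package_name = package_name  # internal, underscore-prefixed

    @property
    def package_name(self) -> str:
        # read-only: no setter is defined, so assignment raises AttributeError
        return self._package_name

ctx = PlainContext("demo")
print(ctx.package_name)  # demo
```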

Member


Ahh right, now I see what you mean. Thinking about this a bit more, maybe we should do it like you implemented and be a bit more careful about making properties changeable. It will be harder to make them frozen once they're mutable and users actually make use of that.

Contributor Author

@noklam noklam Apr 25, 2022


The current implementation is approach (b), which works functionally, but it breaks IDE support and linters: they have trouble understanding that object.__setattr__(self, 'a', 1) has the same semantics as self.a = 1. Things like code navigation no longer work.

dataclass is essentially a code generator, and my hesitation is that it doesn't seem to benefit us much in this case.

Contributor Author

@noklam noklam Apr 26, 2022


Would love to hear some opinions from @idanov.

Update: had a brief chat with Ivan and we agree dataclasses does not seem to be a good fit. attrs is promising; pydantic does a similar job but focuses more on validation, and run-time validation is not necessary for Kedro; it's more suitable for APIs that receive external data.

@noklam
Contributor Author

noklam commented Apr 27, 2022

Another possible workaround, (c), implements the frozen mechanism ourselves. The advantage is that it doesn't break code completion/linting and there is no weird object.__setattr__, but it starts to feel like too much effort spent fighting dataclass. Both (b) and (c) require a deep understanding of how dataclass implements immutability, plus clever ways to get around it.

@dataclass
class KedroContext:
    """``KedroContext`` is the base class which holds the configuration and
    Kedro's main functionality.
    """

    package_name: InitVar[str]
    project_path: InitVar[Union[Path, str]]
    config_loader: ConfigLoader
    hook_manager: InitVar[PluginManager]
    env: Optional[str] = None
    extra_params: InitVar[Dict[str, Any]] = None

    _protected_attributes = ("package_name", "project_path", "config_loader")

    def __post_init__(self, package_name, project_path, hook_manager, extra_params):
        self.project_path = Path(project_path).expanduser().resolve()
        self._package_name = package_name
        self._hook_manager = hook_manager
        self._extra_params = deepcopy(extra_params)
        self._initialised = True  # protected attributes are locked from here on

    # Overriding `__setattr__`/`__delattr__` on the class is the dataclass way
    # to mimic read-only properties. (They must live on the class: Python looks
    # dunder methods up on the type, not the instance.)
    def __setattr__(self, name, value):
        if getattr(self, "_initialised", False) and name in self._protected_attributes:
            raise FrozenInstanceError(f"cannot assign to field {name!r}")
        super().__setattr__(name, value)

    def __delattr__(self, name):
        if name in self._protected_attributes:
            raise FrozenInstanceError(f"cannot delete field {name!r}")
        super().__delattr__(name)

@noklam
Contributor Author

noklam commented Apr 28, 2022

I am happier with the attrs implementation on the right: any immutable attribute still uses @property, so there is nothing magic here. Ideally I would just do field(frozen=True), but there doesn't seem to be an easy way to do that; the proper way may be a factory method that creates a frozen class.

[image: side-by-side diff of the attrs implementation]

The syntax may look unfamiliar at first, but basically attrs strips the leading underscore: an attribute declared as _something appears as something in the generated __init__ signature. Essentially it is syntactic sugar for this:

class A:
  def __init__(self, x):
      self._x = x

The attrs version would be:

@define
class A:
  _x: int

You can see the class __init__ signature here, which is unchanged:
[image: class init signature]

Contributor

@AhdraMeraliQB AhdraMeraliQB left a comment


Looks great! Don't forget to add these changes to the release file 👍

Member

@merelcht merelcht left a comment


This is great! ⭐ I fully agree that attrs is a better solution than the built-in dataclasses. I've done some reading that compares both solutions (and also pydantic) and I think attrs will serve us better, not only for KedroContext but also for other classes that we might want to convert to this format (e.g. `Node`).

(Some of the stuff I read in case others find it useful:

Don't forget to update the release notes!

@@ -1,11 +1,13 @@
# pylint: disable=no-member
Member


Do we need to disable this for the whole file or is it possible to add it to the specific line(s) it applies to?

Contributor Author


Yes, I've changed it back to the specific line(s). It's not clear why pylint complains about "no-member" for config_loader only and not the other attributes.

noklam added 2 commits May 4, 2022 17:49
…loader-as-a-property' of https://github.com/kedro-org/kedro into feat/1459-make-kedrocontext-a-dataclass-and-add-config_loader-as-a-property

@Galileo-Galilei
Member

@AntonyMilneQB wrote: I think your solution (b) is fine, but I'm also personally not worried if we just don't use frozen or any properties at all: leave the variables as mutable and assume that people who overwrite them know what they are doing. Then we could do a more normal self._x = x assignment in __post_init__

@MerelTheisenQB wrote: I tend to agree with @AntonyMilneQB that it would be okay not to use frozen. We'll just need to make it clear to users what the implications of overwriting are.

Just to make a point: as a user, I really want to be able to at least add some attributes to the context (especially to enable passing objects between hooks, but also to have a clean API to expose objects to users in an interactive workflow). If I were to override one of the existing attributes I would certainly do it on purpose, but I understand your concerns about possible (albeit unlikely) involuntary overriding by a user.

Maybe a compromise would be to make the existing attributes frozen, but still allow adding new ones? I have absolutely no idea whether that is easy to implement with attrs though.

@noklam What is the rationale for making the class frozen (while it is currently not)? Do you think my concerns make sense, or do I misunderstand something?

@noklam
Contributor Author

noklam commented May 4, 2022

@Galileo-Galilei Thank you for your comments.

Just to make a point: as a user, I really want to be able to at least add some attributes to the context (especially to enable passing objects between hooks, but also to have a clean API to expose objects to users in an interactive workflow). If I were to override one of the existing attributes I would certainly do it on purpose, but I understand your concerns about possible (albeit unlikely) involuntary overriding by a user.

How would you retrieve these attributes after adding new attributes to the context? after_context_created is the only hook exposing context currently.

Maybe a compromise would be to make the existing attributes frozen, but still allow adding new ones? I have absolutely no idea whether that is easy to implement with attrs though.

It would still be possible by doing object.__setattr__(context, 'ATTR_NAME', value) instead of context.ATTR_NAME = value.
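Illustrated here with a standard-library frozen dataclass as a hypothetical stand-in (attrs frozen classes can be bypassed the same way):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FrozenCtx:  # toy stand-in for a frozen KedroContext
    package_name: str

ctx = FrozenCtx("demo")
# `ctx.mlflow_config = ...` would raise FrozenInstanceError, but the
# escape hatch below writes straight into the instance dict
object.__setattr__(ctx, "mlflow_config", {"tracking_uri": "local"})
print(ctx.mlflow_config)  # {'tracking_uri': 'local'}
```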

@noklam What is the rationale for making the class frozen (while it is currently not)? Do you think my concerns make sense, or do I misunderstand something?

You are correct that KedroContext is more conservative now. I can share my thoughts.

  1. We are now exposing the context object directly, so there is more chance that users may override or abuse the object for everything (context is a higher-level object in Kedro; similar to the argument for not exposing session, it indirectly gives users access to everything). So I am being more conservative and trying to limit the surface exposed to users. It's much easier to relax the interface later if we find reasonable needs.
  2. It's still possible to override, with an arguably less elegant method.
  3. From the Kedro framework perspective, context should really be just a container created by session, so it feels right to make it "immutable".

These are my quick thoughts; I may be missing something about the interactive workflow and would love to hear more from you!

@antonymilne
Contributor

antonymilne commented May 5, 2022

Thanks very much for the comments @Galileo-Galilei. Before I comment on whether or not we should make this class frozen, let me come up with some concrete examples to make sure I understand what you're saying here and in your latest comment #506 (comment). This took quite a bit of careful thought, so please do let me know whether I've got these right!

Example 1: passing objects between hooks in same hook class

In this case you would like to use context as a container to share objects within the same class.

class HooksA:
    @hook_impl
    def after_context_created(self, context):
        context.new_attribute = "something"
        self.context = context

    @hook_impl
    def before_node_run(self):
        print(self.context.new_attribute)

The above requires being able to add a new attribute to context. If you don't care about using context then you can actually just do self.new_attribute = "something" instead, which wouldn't need to touch context at all.

Currently it's possible to do this as follows:

class HooksA:
    @hook_impl
    def before_pipeline_run(self):
        context = _active_session.load_context()
        context.new_attribute = "something"

    @hook_impl
    def before_node_run(self):
        context = _active_session.load_context()
        print(context.new_attribute)

Example 2: passing objects between hooks in different hook classes

In this case you would like to use context as a vehicle to share objects between different hook classes. This gets tricky... In addition to HooksA above we now have:

class HooksB:
    @hook_impl
    def after_context_created(self, context):
        print(context.new_attribute)
        self.context = context
    
    @hook_impl
    def before_node_run(self):
        print(self.context.new_attribute)

We know that HooksA.after_context_created always runs before HooksA.before_node_run, and similarly for HooksB, but it's not obvious what the ordering of HooksA vs. HooksB is. Hence HooksA.before_node_run will work as intended, but for HooksB.after_context_created and HooksB.before_node_run to work we rely on HooksA.after_context_created running before HooksB.after_context_created, which seems very fragile. (I don't even know how the ordering is defined if hooks are auto-registered through pip install rather than listed in HOOKS in settings.py - do you? Is it just alphabetical?)

In summary, we have:

  • HooksA.after_context_created - ok
  • HooksA.before_node_run - ok
  • HooksB.after_context_created - relies on ordering
  • HooksB.before_node_run - relies on ordering

The above is currently achieved as follows:

class HooksB:
    @hook_impl
    def before_pipeline_run(self):
        context = _active_session.load_context()
        print(context.new_attribute)
    
    @hook_impl
    def before_node_run(self):
        context = _active_session.load_context()
        print(context.new_attribute)

Here we have:

  • HooksA.before_pipeline_run - ok
  • HooksA.before_node_run - ok
  • HooksB.before_pipeline_run - relies on ordering
  • HooksB.before_node_run - ok

... which is better than above (since HooksB.before_node_run is now ok), but still seems fragile.
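As an aside, the ordering can be seen in pluggy itself (which Kedro uses for hooks): implementations are called in LIFO order of plugin registration. A minimal sketch with toy plugin classes, assuming pluggy is installed; none of this is Kedro code:

```python
import pluggy

hookspec = pluggy.HookspecMarker("demo")
hookimpl = pluggy.HookimplMarker("demo")

class Spec:
    @hookspec
    def greet(self):
        """Return a greeting per plugin."""

class HooksA:
    @hookimpl
    def greet(self):
        return "A"

class HooksB:
    @hookimpl
    def greet(self):
        return "B"

pm = pluggy.PluginManager("demo")
pm.add_hookspecs(Spec)
pm.register(HooksA())
pm.register(HooksB())
# last registered runs first (LIFO), so HooksB's result comes first
print(pm.hook.greet())  # ['B', 'A']
```

So the ordering is deterministic but depends on registration order, which is exactly why relying on it across hook classes feels fragile.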

Example 3: passing objects into the interactive workflow

Similar to Example 1 above, but now context is again a vehicle used to pass the object to an ipython session. When I am in an ipython session with the kedro.extras.extensions.ipython extension loaded, we do context = session.load_context() and make context available to the user. Hence the after_context_created hooks have run, and in the Jupyter notebook you would be able to do print(context.new_attribute) if you could add new_attribute to the class.

As with Example 1, in theory I think you could also achieve this without using the context by creating a whole new_attribute variable rather than using the context container? So long as your line magic then does get_ipython().push(new_attribute)?

Quick comments

@Galileo-Galilei Is the main use of context for you here simply that it seems like a nice kedriffic place to put these extra variables (rather than making new_attribute a whole separate variable)? Or is there some particular behaviour that's enabled by context.new_attribute that can't be achieved easily with a whole separate variable? There seems to be many possible options here but I'm not sure the extent to which each of these is true:

  • context is actually semantically the right place for the objects you’re trying to pass
  • context is just a convenient container but you can achieve everything you want just by making new variables instead
  • context is actually necessary to pass these objects, i.e. achieves something you can’t do just by making new variables. I don’t think this is true for hooks but maybe for ipython?

Examples 1 and 3 seem straightforward, but Example 2 seems fundamentally difficult due to the way that hooks work independently. Is this an important one at all? Overall it seems it would be useful to you to have some sort of shared state between different hook classes, but I'm not sure what the right way to do this is:

  • maybe there's some pluggy docs on how this should be achieved, but I suspect it's considered something of an anti-pattern (inter-dependencies between hooks)?
  • if there's no way to do this in pluggy directly, then you're relying on some sort of global kedro object (session/context/session_store). But I don't see an obvious way around the ordering fragility with any such method here?

@noklam noklam changed the title Make KedroContext a frozen dataclass and add config loader as a property Make KedroContext a frozen dataclass May 5, 2022
@Galileo-Galilei
Member

Galileo-Galilei commented May 8, 2022

TL; DR

  • Adding attributes to the context is a nice-to-have, not a must-have. There are always workarounds to achieve what I want; it's just a matter of slightly simplifying the developer experience.
  • Moreover, since there is a workaround to add attributes even if the class is frozen, I'm completely fine with the suggested implementation.

Thank you both for your answers, it helps a lot.

@noklam:

How would you retrieve these attributes after adding new attributes to the context? after_context_created is the only hook exposing context currently.

My idea was exactly to do what Antony suggests in his example 2. This is "tricky and fragile" as he explains, but it does currently work and I've seen a real-world example, albeit this is uncommon.

I am totally in line with your point 1 (context is now exposed to users and as a consequence must be more protected than it is now) and your point 2 (there is a workaround if a user really wants to add extra attributes; I did not know this was possible with attrs, that's good news!).

Regarding your point 3, I have more mixed feelings: I think it would feel very natural to Kedro users to be able to do context.mlflow or context.spark to access custom configuration files. After all, context is where configuration belongs in Kedro projects, even if these configuration files are not "official" ones.

@AntonyMilneQB: you're reading my mind 😄 I was not very precise, but you made a wonderful summary of the 3 use cases I envision.

  • for example 1, I don't really need to modify the context as you mention, since I can store the extra attribute in the Hook instance itself to reuse it later, so I don't really care here.
  • for example 2, this is exactly what I suggested and you've described it very precisely. I've seen this use case in a real project, even if I feel, like you, that this is very "tricky and fragile". I am 99% sure there is an ordering between hooks and I've checked it before, but I don't know exactly what the order is. I think it is "hooks from plugins in the order the packages are installed, then settings.py hooks in the order they are declared", but this should be checked.
  • for example 3, this is indeed what I implied. I actually do exactly what you suggest (push an extra variable with get_ipython().push(new_attribute)) in kedro-mlflow, but I face 2 challenges:
    • most users don't know these variables exist when I add them in my own plugins, despite documenting them. If they were inside the context, I hope autocompletion would help in discovering them.
    • many users in my company don't launch sessions with kedro jupyter notebook because they use a remote jupyterhub instance. As a consequence they create the session and the context manually at the beginning of their notebook, and they always forget to create and set up the mlflow configuration. The new after_context_created hook will make it possible to set up the configuration automatically, which is great, but if they need to access extra configuration they have to recreate it manually:
from kedro_mlflow.config import get_mlflow_config
mlflow_config=get_mlflow_config(session)
mlflow_config.setup() # this will be done automatically inside the hook, but I'd like to expose the mlflow_config variable too

From a dev-experience perspective, it makes much more sense if they can access it directly via context.mlflow_config with no action required on their side.

On your comments

  • context is actually semantically the right place for the objects you’re trying to pass
  • context is just a convenient container but you can achieve everything you want just by making new variables instead
  • context is actually necessary to pass these objects, i.e. achieves something you can’t do just by making new variables. I don’t think this is true for hooks but maybe for ipython?

It's more about points 1 & 2: I can put new configuration in custom variables (and I already do), but I think it is much more user-friendly to expose it through the context (it is where I would naturally look for extra configuration, plus autocompletion would help to discover it).

I agree examples 1 & 3 are straightforward. Regarding example 2, I don't think this is something Kedro should try to support specifically (it indeed sounds like an anti-pattern), but since it is currently doable it would be nice to preserve the behaviour if possible.

noklam and others added 3 commits May 16, 2022 12:47
@noklam
Contributor Author

noklam commented May 16, 2022

Thanks for the comments. Storing plugin-related objects in the context makes sense to me, and I think the workaround should be enough to support that.

I will proceed to merge this PR.

@noklam noklam changed the base branch from main to develop May 20, 2022 15:53
@noklam
Contributor Author

noklam commented May 20, 2022

Final decision: this is going into develop instead of main, so it should be released in 0.19.0. The argument for this is that removing the ability to add attributes to the context is a breaking change.

A separate PR will be made for main.

@antonymilne
Contributor

Thanks @Galileo-Galilei for all your comments, and sorry for the very slow response! What you say about adding attributes to the context being more user- and developer-friendly than independent variables definitely makes sense 👍

Just a comment on the jupyter notebook point: we have been working on the interactive workflow and are planning to work on it further. There's a category for such issues, and there will be more added to this milestone when I get round to it over the next few weeks. As ever we're very interested in hearing your feedback on it and any suggestions you have!

Part of this work which has already happened (released fully in 0.18, but actually the ipython extension has been around for a while before that) is to simplify the flow for exactly the case you describe of managed jupyter instances. Basically users should no longer have to manually set up a context etc. in their jupyter notebook; instead, just loading the kedro ipython extension should do everything for them. See https://kedro.readthedocs.io/en/stable/tools_integration/ipython.html#managed-jupyter-instances for details.

Development

Successfully merging this pull request may close these issues.

Revisit: Make KedroContext a dataclass and add config_loader as a property