Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use soft-merge for certain configuration files for OmegaConfLoader #2122

Closed
merelcht opened this issue Dec 14, 2022 · 8 comments · Fixed by #3229
Closed

Use soft-merge for certain configuration files for OmegaConfLoader #2122

merelcht opened this issue Dec 14, 2022 · 8 comments · Fixed by #3229
Assignees

Comments

@merelcht
Copy link
Member

merelcht commented Dec 14, 2022

Description

At the moment all configuration files are merged in a destructive way. This means all fields/keys are overwritten when two files are merged e.g.:

Base catalog file:

regressor:
  type: pickle.PickleDataSet
  filepath: data/06_models/regressor.pickle
  versioned: true

Overriding catalog file:

regressor:
  type: pandas.CSVDataSet

Will result in:

regressor:
  type: pandas.CSVDataSet

With catalog and logging files this behaviour is desirable because if you'd like to change a file type, you don't want any fields to remain that belonged to the other file type (e.g. plotly_args: for the PlotlyDataSet). However, with parameter or other config files (e.g. for mlflow) it might be better to merge them in a soft way so you can do something like:

Base mlflow.yml file:

tracking:
  disable_tracking:
    pipelines:
    - on_exit_notification

  experiment:
    name: name-of-local-experiment

  params:
    long_params_strategy: tag

Overriding mlflow.yml file:

tracking:
  experiment:
    name: name-of-prod-experiment

Results in:

tracking:
  disable_tracking:
    pipelines:
    - on_exit_notification
  experiment:
    name: name-of-prod-experiment
  params:
    long_params_strategy: tag

This can be achieved my using the OmegaConf merge: https://omegaconf.readthedocs.io/en/2.2_branch/usage.html#merging-configurations

Context

#1868 (comment)

Possible Implementation

Add another setting to OmegaConfLoader where the user can specify which files to merge in a destructive way and which one with the regular OmegaConf merge.

  • Check if it's possible to remove keys with the OmegaConf merge.

Possible Alternative

Merge catalog files in a destructive way and all other in a soft way. This would be a breaking change.

@merelcht merelcht added this to the Configuration overhaul milestone Dec 14, 2022
@merelcht merelcht changed the title Use soft-merge for non catalog configuration Use soft-merge for non catalog configuration for OmegaConfLoader Jan 4, 2023
@merelcht merelcht changed the title Use soft-merge for non catalog configuration for OmegaConfLoader Use soft-merge for certain configuration files for OmegaConfLoader Jan 4, 2023
@MatthiasRoels
Copy link

After I explored the first iteration of the OmegaConfLoader, I decided to give this one a take of my own. I was able to already implement a soft-merge functionality as well. But for now, this only applies to params (hard-coded). I mostly drew inspiration from the current implementation as well as the older ConfigLoader implementations. You can find it on my GitHub: https://github.com/MatthiasRoels/kedro-examples/blob/main/custom-configloader/src/prototype/lib/configloader.py
Note that there are tests included too (mostly copy-paste from the existing test-script).

To iterate further on this, we should come up with a way to expose to the user which patterns they would like to use in "merge mode" and which ones they would like to use in "overwrite mode" (.e.g merge params, but overwrite catalog entries).

As a final note (having used a TemplatedConfigLoader before combined with Yaml anchors and aliases to keep our catalog files leaner), we used to do something along the following lines to reduce complexity in our catalog files:

_s3: &s3
  fs_args:
    s3_additional_kwargs:
      ServerSideEncryption: AES256

_parquet: &parquet
  type: kedro.extras.datasets.pandas.ParquetDataSet
  <<: *s3

example_dataset:
  <<: *parquet
  filepath: s3-file-uri

With OmegaConf, this is no longer possible and while replacing Yaml anchors with interpolations is possible, this would also mean that the keys "s3" and "parquet" would end up in our catalog config files. So, ideally, we should come up with a way to filter these keys (or implement something similar to what we had in TemplatedConfigLoader with globals).

Finally, one other disadvantage of the current implementations of OmegaConfLoader is that we can only use interpolations on a per pattern-directory basis (e.g. parameters-base). Ideally, we also come up with a way to use this globally (perhaps with something like a globals pattern that we load when we init our OmegaConfLoader class?

Anyway, I found this a fun exercise to do and I hope this brings some additional ideas/inspiration to the table for future development. Happy to discuss this further as well (you can also find me on the Kedro Slack channel).

@merelcht
Copy link
Member Author

This is amazing @MatthiasRoels , thanks so much for having a look at the OmegaConfLoader and giving your feedback! When it's officially released we'll get more people to do the same thing 😄

After I explored the first iteration of the OmegaConfLoader, I decided to give this one a take of my own. I was able to already implement a soft-merge functionality as well. But for now, this only applies to params (hard-coded). I mostly drew inspiration from the current implementation as well as the older ConfigLoader implementations. You can find it on my GitHub: https://github.com/MatthiasRoels/kedro-examples/blob/main/custom-configloader/src/prototype/lib/configloader.py Note that there are tests included too (mostly copy-paste from the existing test-script).

Nice one! This will definitely be useful when we implement the functionality. We'll likely get this in for the 0.19.x series.

As a final note (having used a TemplatedConfigLoader before combined with Yaml anchors and aliases to keep our catalog files leaner), we used to do something along the following lines to reduce complexity in our catalog files:

_s3: &s3
  fs_args:
    s3_additional_kwargs:
      ServerSideEncryption: AES256

_parquet: &parquet
  type: kedro.extras.datasets.pandas.ParquetDataSet
  <<: *s3

example_dataset:
  <<: *parquet
  filepath: s3-file-uri

With OmegaConf, this is no longer possible and while replacing Yaml anchors with interpolations is possible, this would also mean that the keys "s3" and "parquet" would end up in our catalog config files. So, ideally, we should come up with a way to filter these keys (or implement something similar to what we had in TemplatedConfigLoader with globals).

This is a very good point, and definitely something we need to explore further. If OmegaConfLoader is to successfully replace TemplatedConfigLoader it needs to at least offer similar functionality. Large config files is a huge issue for Kedro users, so we need to make sure that this new config loader doesn't make it worse.

Finally, one other disadvantage of the current implementations of OmegaConfLoader is that we can only use interpolations on a per pattern-directory basis (e.g. parameters-base). Ideally, we also come up with a way to use this globally (perhaps with something like a globals pattern that we load when we init our OmegaConfLoader class?

This is somewhat related to: #2175 You're absolutely right that we should come up with a way to use interpolations across config.

Anyway, I found this a fun exercise to do and I hope this brings some additional ideas/inspiration to the table for future development. Happy to discuss this further as well (you can also find me on the Kedro Slack channel).

Awesome! 😄 This is really insightful and I appreciate you taking the time to investigate and leave the feedback ⭐ I'll definitely reach out to you when I need more input!

@MatthiasRoels
Copy link

MatthiasRoels commented Jan 23, 2023

Thanks! Related to #2175, I think I have a neat implementation for that one too, extending on what I already implemented (with just a tiny bit refactoring). Will code that up tomorrow and post a link to my code in that issue.

Happy to create a PR to contribute my changes!

@adri0
Copy link

adri0 commented Jun 22, 2023

As another temporary workaround, at least for the parameters.yml I noticed that you can overload a single nested value by creating an entry in the non-base parameters.yml using the path to the value, while keeping remaining values intact. Using @MatthiasRoels example, it works more or less like this:

Base parameters.yml file:

tracking:
  disable_tracking:
    pipelines:
    - on_exit_notification

  experiment:
    name: name-of-local-experiment

  params:
    long_params_strategy: tag

Overriding parameters.yml file:

tracking.experiment.name: name-of-prod-experiment

Results in:

tracking:
  disable_tracking:
    pipelines:
    - on_exit_notification
  experiment:
    name: name-of-prod-experiment
  params:
    long_params_strategy: tag

@IngerMathilde
Copy link
Contributor

Has anyone found a workaround for globals as well? The above mentioned does only work for parameters sadly.

@IngerMathilde
Copy link
Contributor

IngerMathilde commented Oct 8, 2023

For anyone interested, I solved it by adding this to our custom config loader.


SOFT_MERGE_KEY = "~~soft_merge~~"
class ExtendedOmegaConfigLoader(OmegaConfigLoader):
    
    def __getitem__(self, key: str) -> dict[str, Any]:
        """Extension to allow for soft merging on certain parameters.

        Soft merging has to be explicit, by placing it behind a ~~soft_merge~~
        key.

        For example in a yaml file it could look like this.
        ~~soft_merge~~:
            general:
                schema: value_to_be_soft_merged
        """
        item = super().__getitem__(key)
        if key in self:
            return item
        soft_merge_values = item.pop(SOFT_MERGE_KEY, {})
        return OmegaConf.to_container(
            OmegaConf.merge(item, soft_merge_values), resolve=True
        )

For us it is especially useful when writing test cases. As you then often want to overwrite only a limited number of parameters, to check if the base config is still valid.

@astrojuanlu
Copy link
Member

Hi @IngerMathilde , thanks for sharing your workaround! This was marked as a blocker for 0.19, so we will get to it very soon.

@merelcht merelcht moved this to To Do in Kedro Framework Oct 25, 2023
@merelcht merelcht moved this from To Do to In Progress in Kedro Framework Oct 25, 2023
@merelcht merelcht self-assigned this Oct 25, 2023
@merelcht merelcht moved this from In Progress to In Review in Kedro Framework Oct 25, 2023
@merelcht merelcht linked a pull request Oct 25, 2023 that will close this issue
7 tasks
@merelcht
Copy link
Member Author

merelcht commented Nov 1, 2023

Completed in #3229

@merelcht merelcht closed this as completed Nov 1, 2023
@github-project-automation github-project-automation bot moved this from In Review to Done in Kedro Framework Nov 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

5 participants