
Dataset factories doesn't resolve nested config properly #2992

Closed
ankatiyar opened this issue Aug 31, 2023 · 3 comments · Fixed by #2993
Labels
Issue: Bug Report 🐞 Bug that needs to be fixed

Comments

@ankatiyar
Contributor

Description

Reported by @m-gris
Dataset factories do not work properly for config that is nested.

Context

The problem is in DataCatalog._resolve_config(): it fills in the placeholders for top-level keys correctly but does not recurse into nested dicts.
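As a standalone illustration of the bug (this is a simplified sketch, not kedro's actual code; `named` stands in for the dict of values that parse() extracts from the dataset name), a shallow pass over the config only interpolates top-level string values:

```python
config = {
    "type": "kedro_mlflow.io.artifacts.MlflowArtifactDataSet",
    "data_set": {
        "type": "pickle.PickleDataSet",
        "filepath": "data/07_model_output/{model}_embeddings.pkl",
    },
}
named = {"model": "bert"}  # values extracted from the dataset name

# Shallow resolution: only top-level string values get their placeholders
# filled; nested dicts are passed through untouched.
shallow = {
    k: v.format_map(named) if isinstance(v, str) else v
    for k, v in config.items()
}

# The nested filepath still contains the literal placeholder:
print(shallow["data_set"]["filepath"])
# data/07_model_output/{model}_embeddings.pkl
```

This matches the reported symptom: the artifact is saved with the literal string {model} in its filename.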

Steps to Reproduce

Message quoted from Slack:

Hi everyone,
I’ve started to use kedro-mlflow and am under the impression that there is some conflict / incompatibility with the dataset factory.
Having moved from

"models.{model}.embeddings":
    type: pickle.PickleDataSet
    filepath: "data/07_model_output/{model}_embeddings.pkl"
    backend: joblib
    versioned: True
    layer: model_output

to

"models.{model}.embeddings":
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataSet
  data_set:
    type: pickle.PickleDataSet
    filepath: "data/07_model_output/{model}_embeddings.pkl"
    backend: joblib
    versioned: True
  layer: model_output

results in artifacts being saved without the model name properly interpolated, i.e. literally saved as {model}_embeddings.pkl.
Any comment ? Anything I’m missing ?

@Gabriel2409

Hi @ankatiyar,
I encountered the same problem and was about to post an issue when I found yours.
I think the config resolver also fails if an argument contains a list, not only a nested dict.
I was about to suggest something like this:

    def _resolve_config(
        self,
        data_set_name: str,
        matched_pattern: str,
    ) -> dict[str, Any]:
        """Get the resolved dataset config for a factory pattern."""
        result = parse(matched_pattern, data_set_name)
        config_copy = copy.deepcopy(self._dataset_patterns[matched_pattern])
        return self._recursively_resolve_nested_fields(
            config_copy, data_set_name, matched_pattern, result
        )

    def _recursively_resolve_nested_fields(
        self,
        config_field: Any,
        data_set_name: str,
        matched_pattern: str,
        result,
    ) -> Any:
        if isinstance(config_field, dict):
            for key, value in config_field.items():
                config_field[key] = self._recursively_resolve_nested_fields(
                    value, data_set_name, matched_pattern, result
                )
        elif isinstance(config_field, list):
            for i, value in enumerate(config_field):
                config_field[i] = self._recursively_resolve_nested_fields(
                    value, data_set_name, matched_pattern, result
                )
        elif isinstance(config_field, str) and "}" in config_field:
            # result.named gives access to all named groups in the match
            # result; format_map fills the {...} placeholders in the string
            # with the dict values of the same key name.
            try:
                config_field = config_field.format_map(result.named)
            except KeyError as exc:
                raise DatasetError(
                    f"Unable to resolve '{config_field}' for the pattern "
                    f"'{matched_pattern}'"
                ) from exc
        return config_field

Do you think the list case should be added to the PR?
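To make the recursive idea above concrete, here is a standalone sketch of the same approach (simplified names; `resolve` and `named` are illustrative stand-ins for the method and for parse(...).named in the real code):

```python
from typing import Any

def resolve(field: Any, named: dict[str, str]) -> Any:
    """Recursively fill {placeholder} strings anywhere in a config tree."""
    if isinstance(field, dict):
        return {k: resolve(v, named) for k, v in field.items()}
    if isinstance(field, list):
        return [resolve(v, named) for v in field]
    if isinstance(field, str) and "}" in field:
        return field.format_map(named)
    return field

config = {
    "type": "kedro_mlflow.io.artifacts.MlflowArtifactDataSet",
    "data_set": {"filepath": "data/07_model_output/{model}_embeddings.pkl"},
    "load_args": {"na_values": ["#NA", "{model}"]},
}
resolved = resolve(config, {"model": "bert"})

print(resolved["data_set"]["filepath"])
# data/07_model_output/bert_embeddings.pkl
print(resolved["load_args"]["na_values"])
# ['#NA', 'bert']
```

Unlike the shallow version, placeholders nested inside dicts and lists are both interpolated.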

@ankatiyar
Contributor Author

Hi @Gabriel2409,
Thanks for flagging this and suggesting a solution for this too! 😄
I'll try fixing this in the same open PR.
Would you mind providing an example of what a catalog entry with a list would look like?

@merelcht merelcht moved this to In Progress in Kedro Framework Sep 6, 2023
@Gabriel2409

@ankatiyar
For example, in pandas.CSVDataSet, load_args.na_values is a list

So we could have something like this, which gives you different ways to load the same file by specifying a different column value to treat as NA (a bit artificial, I admit, but it serves for the sake of the example):

"project_ignore_{param}":
    type: pandas.CSVDataSet
    filepath: data/01_raw/project.csv
    load_args:
      sep: ","
      na_values: ["#NA", {param}]
    save_args:
      index: False
      date_format: "%Y-%m-%d %H:%M"
      decimal: .

A more straightforward example would be a custom dataset that takes a list of filepaths.

myname:
    type: MyCustomDataSet
    filepaths: [path1, path2]

Note that in my suggested implementation, if a list contains a dict, only the values are parsed, not the keys, which is consistent with how dicts are currently handled:

"my{param}"
    type: MyCustomDataSet
    mylist: [path/to/my{param}, myawesome{param}] # correctly replaced
    mylistofdict: [{"key{param}": "value{param}"}] # {param} only replaced in the value, not the key
