
Dataset factories doesn't resolve nested config properly #2992

Closed
ankatiyar opened this issue Aug 31, 2023 · 3 comments · Fixed by #2993
Labels
Issue: Bug Report 🐞 Bug that needs to be fixed

Comments

@ankatiyar
Contributor

Description

Reported by @m-gris
Dataset factories do not work properly for config that is nested.

Context

The problem is in DataCatalog._resolve_config(): it fills in the placeholders for top-level keys correctly but does not recurse into nested dicts.
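As a standalone illustration of the bug (this is a simplified sketch, not kedro's actual code; `named` stands in for the dict of values that parse() extracts from the dataset name), a shallow pass over the config only interpolates top-level string values:

```python
config = {
    "type": "kedro_mlflow.io.artifacts.MlflowArtifactDataSet",
    "data_set": {
        "type": "pickle.PickleDataSet",
        "filepath": "data/07_model_output/{model}_embeddings.pkl",
    },
}
named = {"model": "bert"}  # values extracted from the dataset name

# Shallow resolution: only top-level string values get their placeholders
# filled; nested dicts are passed through untouched.
shallow = {
    k: v.format_map(named) if isinstance(v, str) else v
    for k, v in config.items()
}

# The nested filepath still contains the literal placeholder:
print(shallow["data_set"]["filepath"])
# data/07_model_output/{model}_embeddings.pkl
```

This matches the reported symptom: the artifact is saved with the literal string {model} in its filename.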

Steps to Reproduce

Message quoted from Slack:

Hi everyone,
I’ve started to use kedro-mlflow and am under the impression that there is some conflict / incompatibility with the dataset factory.
Having moved from

"models.{model}.embeddings":
    type: pickle.PickleDataSet
    filepath: "data/07_model_output/{model}_embeddings.pkl"
    backend: joblib
    versioned: True
    layer: model_output

to

"models.{model}.embeddings":
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataSet
  data_set:
    type: pickle.PickleDataSet
    filepath: "data/07_model_output/{model}_embeddings.pkl"
    backend: joblib
    versioned: True
  layer: model_output

results in artifacts being saved without the model name properly interpolated, i.e. literally saved as {model}_embeddings.pkl.
Any comment ? Anything I’m missing ?

@Gabriel2409

Hi @ankatiyar,
I encountered the same problem and was about to post an issue when I found yours.
I think the config resolver also fails if an argument contains a list, not only a nested dict.
I was about to suggest something like this:

    def _resolve_config(
        self,
        data_set_name: str,
        matched_pattern: str,
    ) -> dict[str, Any]:
        """Get the resolved dataset config for a factory pattern."""
        result = parse(matched_pattern, data_set_name)
        config_copy = copy.deepcopy(self._dataset_patterns[matched_pattern])
        return self._recursively_resolve_nested_fields(
            config_copy, data_set_name, matched_pattern, result
        )

    def _recursively_resolve_nested_fields(
        self,
        config_field: Any,
        data_set_name: str,
        matched_pattern: str,
        result,
    ) -> Any:
        if isinstance(config_field, dict):
            for key, value in config_field.items():
                config_field[key] = self._recursively_resolve_nested_fields(
                    value, data_set_name, matched_pattern, result
                )
        elif isinstance(config_field, list):
            for i, value in enumerate(config_field):
                config_field[i] = self._recursively_resolve_nested_fields(
                    value, data_set_name, matched_pattern, result
                )
        elif isinstance(config_field, str) and "}" in config_field:
            # result.named gives access to all named groups in the match
            # result; format_map fills the {...} placeholders in the string
            # with the dict values of the same key name.
            try:
                config_field = config_field.format_map(result.named)
            except KeyError as exc:
                raise DatasetError(
                    f"Unable to resolve '{config_field}' for the pattern "
                    f"'{matched_pattern}'"
                ) from exc
        return config_field

Do you think the list case should be added to the PR?
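To make the recursive idea above concrete, here is a standalone sketch of the same approach (simplified names; `resolve` and `named` are illustrative stand-ins for the method and for parse(...).named in the real code):

```python
from typing import Any

def resolve(field: Any, named: dict[str, str]) -> Any:
    """Recursively fill {placeholder} strings anywhere in a config tree."""
    if isinstance(field, dict):
        return {k: resolve(v, named) for k, v in field.items()}
    if isinstance(field, list):
        return [resolve(v, named) for v in field]
    if isinstance(field, str) and "}" in field:
        return field.format_map(named)
    return field

config = {
    "type": "kedro_mlflow.io.artifacts.MlflowArtifactDataSet",
    "data_set": {"filepath": "data/07_model_output/{model}_embeddings.pkl"},
    "load_args": {"na_values": ["#NA", "{model}"]},
}
resolved = resolve(config, {"model": "bert"})

print(resolved["data_set"]["filepath"])
# data/07_model_output/bert_embeddings.pkl
print(resolved["load_args"]["na_values"])
# ['#NA', 'bert']
```

Unlike the shallow version, placeholders nested inside dicts and lists are both interpolated.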

@ankatiyar
Contributor Author

Hi @Gabriel2409,
Thanks for flagging this and suggesting a solution for this too! 😄
I'll try fixing this in the same open PR.
Would you mind providing an example of what a catalog entry with a list would look like?

@merelcht merelcht moved this to In Progress in Kedro Framework Sep 6, 2023
@Gabriel2409

@ankatiyar
For example, in pandas.CSVDataSet, load_args.na_values is a list

So we could have something like this, which gives you different ways to load the same file by specifying a different column value to treat as NA (a bit artificial, I admit, but it serves for the sake of the example):

"project_ignore_{param}":
    type: pandas.CSVDataSet
    filepath: data/01_raw/project.csv
    load_args:
      sep: ","
      na_values: ["#NA", {param}]
    save_args:
      index: False
      date_format: "%Y-%m-%d %H:%M"
      decimal: .

A more straightforward example would be a custom dataset that takes a list of filepaths.

myname:
    type: MyCustomDataSet
    filepaths: [path1, path2]

Note that in my suggested implementation, if a list contains a dict, only the values are parsed, not the keys, which is consistent with how dicts are currently handled:

"my{param}"
    type: MyCustomDataSet
    mylist: [path/to/my{param}, myawesome{param}] # correctly replaced
    mylistofdict: [{"key{param}": "value{param}"}] # {param} only replaced in the value, not the key
