Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove the copying hack and add proper params querying capabilities in the DataCatalog #3732

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
32 changes: 2 additions & 30 deletions kedro/framework/context/context.py
Original file line number Diff line number Diff line change
Expand Up @@ -235,7 +235,8 @@ def _get_catalog(
save_version=save_version,
)

feed_dict = self._get_feed_dict()
params = self.params
feed_dict = {"parameters": params, "params": params}
catalog.add_feed_dict(feed_dict)
_validate_transcoded_datasets(catalog)
self._hook_manager.hook.after_catalog_created(
Expand All @@ -248,35 +249,6 @@ def _get_catalog(
)
return catalog

def _get_feed_dict(self) -> dict[str, Any]:
"""Get parameters and return the feed dictionary."""
params = self.params
feed_dict = {"parameters": params}

def _add_param_to_feed_dict(param_name: str, param_value: Any) -> None:
"""This recursively adds parameter paths to the `feed_dict`,
whenever `param_value` is a dictionary itself, so that users can
specify specific nested parameters in their node inputs.

Example:

>>> param_name = "a"
>>> param_value = {"b": 1}
>>> _add_param_to_feed_dict(param_name, param_value)
>>> assert feed_dict["params:a"] == {"b": 1}
>>> assert feed_dict["params:a.b"] == 1
"""
key = f"params:{param_name}"
feed_dict[key] = param_value
if isinstance(param_value, dict):
for key, val in param_value.items():
_add_param_to_feed_dict(f"{param_name}.{key}", val)

for param_name, param_value in params.items():
_add_param_to_feed_dict(param_name, param_value)

return feed_dict

def _get_config_credentials(self) -> dict[str, Any]:
"""Getter for credentials specified in credentials directory."""
try:
Expand Down
14 changes: 14 additions & 0 deletions kedro/io/data_catalog.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@
import re
from typing import Any, Dict

from omegaconf import OmegaConf
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how I feel about coupling the DataCatalog class with OmegaConf here, let alone special-casing parameters by using the : separator.

Shouldn't we redesign the DataCatalog API instead so that parameters are first class citizens, and not fake datasets?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to that 👆🏻

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you really call this coupling? The way I read it is that is uses omegaconf to parse the parameters config. We already have a dependency on omegaconf anyway, and I actually quite like that we can leverage it in more places than just the OmegaConfigLoader itself. I would have called it coupling if it uses the actual OmegaConfigLoader class, but this just imports the library.

Originally posted by @merelcht in #3893 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's been 3 months so I had to re-read the PR again to give a sensible answer to that.

Maybe "coupling" is not the right word but I still feel weird about using OmegaConf inside DataCatalog in this way. Yes, we do have a dependency on OmegaConf already in other parts of the framework, but the way DataCatalog.load introspects the result of dataset.load() and applies extra logic if it's a dictionary feels wrong. What if the user has a custom dataset that returns a dict that is not meant to be configuration?

In any case, I'm also okay with separating the discussion about this solution from the idea of making parameters first-class citizens rather than fake datasets (before or after tackling #2240).

from parse import parse

from kedro.io.core import (
Expand Down Expand Up @@ -416,6 +417,8 @@ def _get_dataset(

def __contains__(self, dataset_name: str) -> bool:
"""Check if an item is in the catalog as a materialised dataset or pattern"""
if ":" in dataset_name:
dataset_name, _ = dataset_name.split(":", 1)
matched_pattern = self._match_pattern(self._dataset_patterns, dataset_name)
if dataset_name in self._datasets or matched_pattern:
return True
Expand Down Expand Up @@ -477,6 +480,10 @@ def load(self, name: str, version: str | None = None) -> Any:
>>>
>>> df = io.load("cars")
"""
query = None
if ":" in name:
name, query = name.split(":", 1)

load_version = Version(version, None) if version else None
dataset = self._get_dataset(name, version=load_version)

Expand All @@ -488,6 +495,13 @@ def load(self, name: str, version: str | None = None) -> Any:
)

result = dataset.load()
if query and isinstance(result, dict):
result = OmegaConf.select(OmegaConf.create(result), query)
result = (
OmegaConf.to_container(result)
if OmegaConf.is_config(result)
else result
)

return result

Expand Down