Dataset factories doesn't resolve nested config properly #2992
Comments
Hi @ankatiyar,

    def _resolve_config(
        self,
        data_set_name: str,
        matched_pattern: str,
    ) -> dict[str, Any]:
        """Get resolved AbstractDataset from a factory config"""
        result = parse(matched_pattern, data_set_name)
        config_copy = copy.deepcopy(self._dataset_patterns[matched_pattern])
        config_copy = self._recursively_resolve_nested_fields(
            config_copy,
            data_set_name,
            matched_pattern,
            result)
        return config_copy

    def _recursively_resolve_nested_fields(
        self,
        config_field,
        data_set_name,
        matched_pattern,
        result) -> dict[str, Any]:
        if isinstance(config_field, dict):
            for key, value in config_field.items():
                config_field[key] = self._recursively_resolve_nested_fields(
                    value,
                    data_set_name,
                    matched_pattern,
                    result
                )
        elif isinstance(config_field, list):
            for i, value in enumerate(config_field):
                config_field[i] = self._recursively_resolve_nested_fields(
                    value,
                    data_set_name,
                    matched_pattern,
                    result
                )
        elif isinstance(config_field, Iterable) and "}" in config_field:
            # result.named: gives access to all dict items in the match result.
            # format_map fills in dict values into a string with {...} placeholders
            # of the same key name.
            try:
                config_field = str(config_field).format_map(result.named)
            except KeyError as exc:
                raise DatasetError(
                    f"Unable to resolve '{config_field}' for the pattern '{matched_pattern}'"
                ) from exc
        return config_field

Do you think the list case should be added to the PR?
Hi @Gabriel2409,
@ankatiyar So we could have something like this, which gives you different ways to load the same file by specifying a different column as NA (a bit artificial, I admit, but it is for the sake of the example):

    "project_ignore_{param}":
      type: pandas.CSVDataSet
      filepath: data/01_raw/project.csv
      load_args:
        sep: ","
        na_values: ["#NA", "{param}"]
      save_args:
        index: False
        date_format: "%Y-%m-%d %H:%M"
        decimal: .

A more straightforward example would be a custom dataset containing a list of filepaths:

    myname:
      type: MyCustomDataSet
      filepaths: [path1, path2]

Note that in my suggested implementation, if a list contains a dict, only the value is parsed, not the key, which is consistent with how it is currently done:

    "my{param}":
      type: MyCustomDataSet
      mylist: ["path/to/my{param}", "myawesome{param}"]  # correctly replaced
      mylistofdict: [{"key{param}": "value{param}"}]  # {param} only replaced in the value, not the key
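The value-only behaviour described in this comment can be checked in isolation. Below is a minimal standard-library sketch, not Kedro's code: `resolve_nested` is a hypothetical helper, and the plain dict `{"param": "data"}` stands in for the `result.named` mapping that the suggested implementation obtains from `parse`.

```python
from typing import Any


def resolve_nested(field: Any, named: dict[str, str]) -> Any:
    """Recursively fill {placeholders} in values; dict keys are left as-is.

    Hypothetical helper for illustration only, not part of Kedro.
    """
    if isinstance(field, dict):
        # Only the values are resolved; keys pass through unchanged.
        return {key: resolve_nested(value, named) for key, value in field.items()}
    if isinstance(field, list):
        return [resolve_nested(item, named) for item in field]
    if isinstance(field, str) and "}" in field:
        return field.format_map(named)
    return field


config = {
    "type": "MyCustomDataSet",
    "mylist": ["path/to/my{param}", "myawesome{param}"],
    "mylistofdict": [{"key{param}": "value{param}"}],
}

# Assumed match result for a dataset name such as "mydata".
resolved = resolve_nested(config, {"param": "data"})
print(resolved["mylist"])        # ['path/to/mydata', 'myawesomedata']
print(resolved["mylistofdict"])  # [{'key{param}': 'valuedata'}]
```

Unlike the in-place version suggested above, this sketch builds new containers, so the input config is not mutated and no deep copy is needed beforehand.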
Description
Reported by @m-gris
Dataset factories do not work properly for nested config.
Context
The problem is in DataCatalog._resolve_config(). It fills in placeholders in top-level keys correctly but does not descend into nested dicts.
Steps to Reproduce
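As an illustration of the behaviour described under Context (this is not the reproduction quoted from Slack, and not Kedro's actual code), here is a minimal sketch of a resolver that handles only top-level values. `resolve_top_level`, the `{name}` placeholder, and the `index_col` entry are all assumptions made for the example.

```python
from typing import Any


def resolve_top_level(config: dict[str, Any], named: dict[str, str]) -> dict[str, Any]:
    """Fill {placeholders} only in top-level string values (hypothetical helper)."""
    resolved = {}
    for key, value in config.items():
        if isinstance(value, str) and "}" in value:
            resolved[key] = value.format_map(named)
        else:
            # Nested dicts and lists are passed through untouched,
            # so any placeholders inside them stay unresolved.
            resolved[key] = value
    return resolved


pattern_config = {
    "type": "pandas.CSVDataSet",
    "filepath": "data/01_raw/{name}.csv",
    "load_args": {"index_col": "{name}_id"},
}

resolved = resolve_top_level(pattern_config, {"name": "project"})
print(resolved["filepath"])   # data/01_raw/project.csv
print(resolved["load_args"])  # {'index_col': '{name}_id'}
```

The top-level `filepath` is resolved, while the placeholder inside `load_args` survives verbatim, which is the symptom this issue reports.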
Message quoted from slack -