
[DRAFT] Dataset factories (new version) #2743

Closed
wants to merge 1 commit

Conversation

ankatiyar (Contributor)

Description

A slightly different implementation of dataset factories that works with versions and credentials.

UPDATE:

This now works with:

  • kedro run --load-versions="france_companies:2023-06-12T10.01.52.889Z" when france_companies might not exist as an explicit catalog entry but as {country}_companies
  • using credentials in a pattern catalog entry

Development notes

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
@noklam noklam self-requested a review June 29, 2023 14:47
noklam (Contributor) commented Jun 30, 2023

I tried to use your repository https://github.com/ankatiyar/space-patterns/tree/main and ran kedro run.

This doesn't work and I am not sure why.

Steps:

  1. git clone https://github.com/noklam/space-patterns
  2. git checkout feature/dataset-factories-new (in kedro)
  3. kedro run --pipeline pipe1 (fails)
  4. git checkout main (in kedro)
  5. kedro run --pipeline pipe1 (succeeds)

}
# Already add explicit entry datasets
for ds_name, ds_config in catalog.items():
    if "}" not in ds_name and cls._is_full_config(ds_config):

I think we need a small helper function here i.e. _is_pattern even if it's a one-liner, as the semantics make it easier to read.

@@ -311,6 +428,28 @@ def _get_dataset(

return data_set

def _resolve_config(self, data_set_name, data_set_pattern) -> dict[str, Any]:

Suggested change:

- def _resolve_config(self, data_set_name, data_set_pattern) -> dict[str, Any]:
+ def _resolve_config(self, data_set_name: str, data_set_pattern: str) -> dict[str, Any]:

# Merge config with entry created containing load and save versions
config_copy.update(self._raw_catalog[data_set_name])
for key, value in config_copy.items():
    if isinstance(value, Iterable) and "}" in value:

Why do we check for "{" in some places but "}" in others? Again, can we just use one function to check whether something is a pattern?


I am not sure why we are checking for Iterable here.

Comment on lines +441 to +446
string_value = str(value)
# result.named: gives access to all dict items in the match result.
# format_map fills in dict values into a string with {...} placeholders
# of the same key name.
try:
    config_copy[key] = string_value.format_map(result.named)

Suggested change:

- string_value = str(value)
- # result.named: gives access to all dict items in the match result.
- # format_map fills in dict values into a string with {...} placeholders
- # of the same key name.
- try:
-     config_copy[key] = string_value.format_map(result.named)
+ # result.named: gives access to all dict items in the match result.
+ # format_map fills in dict values into a string with {...} placeholders
+ # of the same key name.
+ try:
+     config_copy[key] = str(value).format_map(result.named)
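The substitution step being discussed can be sketched in isolation. `named` below stands in for `result.named` (the placeholder values captured when the dataset name matched the pattern); the filepath value is illustrative.

```python
# Placeholder values as they would come out of matching "france_companies"
# against the pattern "{country}_companies".
named = {"country": "france"}

# A config value containing a {...} placeholder of the same key name.
value = "data/01_raw/{country}_companies.csv"

# format_map fills the captured values into the string.
resolved = str(value).format_map(named)
print(resolved)  # data/01_raw/france_companies.csv
```

Unlike `str.format(**named)`, `format_map` takes the mapping directly, which is why it pairs naturally with `result.named`.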

if unsatisfied:
    raise ValueError(
        f"Pipeline input(s) {unsatisfied} not found in the DataCatalog"
    )

free_outputs = pipeline.outputs() - set(catalog.list())
unregistered_ds = pipeline.data_sets() - set(catalog.list())
free_outputs = pipeline.outputs() - set(registered_ds)

At this point, are all the pattern datasets materialised in the catalog already?


@classmethod
def _match_name_against_pattern(
    cls, raw_catalog: dict[str, Any], data_set_name: str


Suggested change:

- cls, raw_catalog: dict[str, Any], data_set_name: str
+ cls, raw_catalog: dict[str, dict[str, Any]], data_set_name: str

Is this the correct typing? When I read through it, I have a hard time mapping all the types. Maybe worth creating a TypeAlias.

https://docs.python.org/3/library/typing.html#type-aliases
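A sketch of what such aliases could look like. The names `DataSetConfig` and `RawCatalog` are hypothetical suggestions, not identifiers from the PR.

```python
from typing import Any

# Hypothetical aliases making the nested dict types easier to follow.
DataSetConfig = dict[str, Any]         # one catalog entry's configuration
RawCatalog = dict[str, DataSetConfig]  # entry name (or pattern) -> configuration

raw_catalog: RawCatalog = {
    "{country}_companies": {"type": "pandas.CSVDataSet"},
}
```

With aliases in place, the suggested signature reads as `cls, raw_catalog: RawCatalog, data_set_name: str`, which is easier to scan than the nested `dict[str, dict[str, Any]]`.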

Comment on lines +324 to +328
@staticmethod
def _is_full_config(config: dict[str, Any]) -> bool:
    """Check if the config is a full config"""
    remaining = set(config.keys()) - {"load_version", "save_version"}
    return bool(remaining)

Why does config.keys() minus load_version and save_version being non-empty make this a full config?

Comment on lines +330 to +336
def __contains__(self, item):
    """Check if an item is in the catalog as a materialised dataset or pattern"""
    if item in self._data_sets or self._match_name_against_pattern(
        self._raw_catalog, item
    ):
        return True
    return False

👍🏼 I like this
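The behaviour this `__contains__` enables can be demonstrated with a toy class. `TinyCatalog` is not Kedro's DataCatalog, and the regex-based pattern match stands in for `_match_name_against_pattern`; it only illustrates that membership now covers both materialised datasets and pattern entries.

```python
import re

class TinyCatalog:
    """Toy stand-in for the catalog, illustrating the new membership semantics."""

    def __init__(self, data_sets: set[str], patterns: list[str]):
        self._data_sets = data_sets
        self._patterns = patterns

    def __contains__(self, item: str) -> bool:
        # Materialised dataset?
        if item in self._data_sets:
            return True
        # Otherwise, does any pattern entry match the name?
        return any(
            re.fullmatch(re.sub(r"\{(\w+)\}", r"(?P<\1>.+)", pattern), item)
            for pattern in self._patterns
        )

catalog = TinyCatalog({"shuttles"}, ["{country}_companies"])
print("france_companies" in catalog)  # True (matches the pattern)
print("shuttles" in catalog)          # True (materialised dataset)
print("reviews" in catalog)           # False
```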

@merelcht merelcht mentioned this pull request Jul 4, 2023
@ankatiyar ankatiyar closed this Jul 5, 2023
@ankatiyar ankatiyar deleted the feature/dataset-factories-new branch July 5, 2023 11:34