[DRAFT] Dataset factories (new version) #2743

ankatiyar · 2023-06-28T16:13:08Z

Description

Slightly different implementation of dataset factories that works with versions and credentials

UPDATE:

This now works with:

- kedro run --load-versions="france_companies:2023-06-12T10.01.52.889Z", where france_companies might not exist as an explicit catalog entry but only matches the pattern {country}_companies
- credentials used in a pattern catalog entry

Development notes

Checklist

Read the contributing guidelines
Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Added a description of this change in the RELEASE.md file
Added tests to cover my changes

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

noklam · 2023-06-30T15:32:57Z

I tried to use your repository https://github.com/ankatiyar/space-patterns/tree/main and ran kedro run.

This doesn't work and I am not sure why.

Steps:

git clone https://github.com/noklam/space-patterns
git checkout feature/dataset-factories-new (kedro)
kedro run --pipeline pipe1 - Fail
git checkout main (kedro)
kedro run --pipeline pipe1 - Success

noklam · 2023-06-30T15:33:58Z

kedro/io/data_catalog.py

+        }
+        # Already add explicit entry datasets
+        for ds_name, ds_config in catalog.items():
+            if "}" not in ds_name and cls._is_full_config(ds_config):


I think we need a small helper function here, e.g. _is_pattern, even if it's a one-liner, because a named check is easier to read.

noklam · 2023-06-30T15:45:23Z

kedro/io/data_catalog.py

@@ -311,6 +428,28 @@ def _get_dataset(

        return data_set

+    def _resolve_config(self, data_set_name, data_set_pattern) -> dict[str, Any]:


Suggested change

-    def _resolve_config(self, data_set_name, data_set_pattern) -> dict[str, Any]:
+    def _resolve_config(self, data_set_name: str, data_set_pattern: str) -> dict[str, Any]:

noklam · 2023-06-30T15:46:40Z

kedro/io/data_catalog.py

+            # Merge config with entry created containing load and save versions
+            config_copy.update(self._raw_catalog[data_set_name])
+        for key, value in config_copy.items():
+            if isinstance(value, Iterable) and "}" in value:


Why do we check for "{" in some places but "}" in others? Again, can we just use one function to check if something is a pattern?

I am not sure why we are checking for Iterable here.
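
To make the question concrete, here is a small sketch of how the condition from the diff behaves for different value types. The sample values are made up for illustration, and `looks_like_placeholder` is a hypothetical name.

```python
from collections.abc import Iterable


def looks_like_placeholder(value) -> bool:
    # Same condition as in the diff above.
    return isinstance(value, Iterable) and "}" in value


samples = {
    "string with placeholder": "data/{country}.csv",
    "plain string": "data/france.csv",
    "list holding a placeholder string": ["data/{country}.csv"],
    "dict": {"sep": ","},
    "integer": 3,
}

for label, value in samples.items():
    print(f"{label}: {looks_like_placeholder(value)}")
```

Only the first sample is detected. For lists and dicts, `in` tests membership (elements or keys), not substrings, so nested placeholders are missed. The `Iterable` guard mainly prevents a `TypeError` from `"}" in 3`; if only strings are meant to be formatted, `isinstance(value, str)` would state that intent more directly.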

noklam · 2023-06-30T15:48:53Z

kedro/io/data_catalog.py

+                string_value = str(value)
+                # result.named: gives access to all dict items in the match result.
+                # format_map fills in dict values into a string with {...} placeholders
+                # of the same key name.
+                try:
+                    config_copy[key] = string_value.format_map(result.named)


Suggested change

-                string_value = str(value)
                 # result.named: gives access to all dict items in the match result.
                 # format_map fills in dict values into a string with {...} placeholders
                 # of the same key name.
                 try:
-                    config_copy[key] = string_value.format_map(result.named)
+                    config_copy[key] = str(value).format_map(result.named)
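
A short sketch of what the `format_map` call does, using a plain dict in place of `result.named`. That substitution is an assumption: the real object comes from the pattern match, which this snippet does not show. The file paths are hypothetical.

```python
# Stand-in for result.named: placeholder name -> matched text.
named = {"country": "france"}

value = "data/01_raw/{country}_companies.csv"
resolved = str(value).format_map(named)
print(resolved)  # data/01_raw/france_companies.csv

# A placeholder with no matching key raises KeyError, which is presumably
# why the original code wraps the call in a try block.
try:
    "data/{region}.csv".format_map(named)
except KeyError as exc:
    missing = exc.args[0]
    print(f"unresolved placeholder: {missing}")  # unresolved placeholder: region
```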

noklam · 2023-06-30T15:50:40Z

kedro/runner/runner.py

        if unsatisfied:
            raise ValueError(
                f"Pipeline input(s) {unsatisfied} not found in the DataCatalog"
            )

-        free_outputs = pipeline.outputs() - set(catalog.list())
-        unregistered_ds = pipeline.data_sets() - set(catalog.list())
+        free_outputs = pipeline.outputs() - set(registered_ds)


At this point, are all the pattern datasets already materialised in the catalog?

noklam · 2023-06-30T15:52:51Z

kedro/io/data_catalog.py

+
+    @classmethod
+    def _match_name_against_pattern(
+        cls, raw_catalog: dict[str, Any], data_set_name: str


Suggested change

-        cls, raw_catalog: dict[str, Any], data_set_name: str
+        cls, raw_catalog: dict[str, dict[str, Any]], data_set_name: str

Is this the correct typing? When I read through it, I have a hard time mapping all the types. It may be worth creating a TypeAlias.

https://docs.python.org/3/library/typing.html#type-aliases
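
A rough sketch of what such aliases could look like. The alias names and the toy function are hypothetical, not taken from the PR, and plain assignment aliases with `typing.Dict` are used here so the snippet also runs on older Python versions.

```python
from typing import Any, Dict

# Hypothetical alias names, chosen only for this illustration.
DatasetConfig = Dict[str, Any]            # config of a single catalog entry
CatalogConfig = Dict[str, DatasetConfig]  # entry name or pattern -> config


def count_pattern_entries(raw_catalog: CatalogConfig) -> int:
    """Toy function that only exists to show the alias in a signature."""
    return sum(1 for name in raw_catalog if "{" in name)


raw_catalog: CatalogConfig = {
    "{country}_companies": {"type": "pandas.CSVDataSet"},
    "reviews": {"type": "pandas.CSVDataSet"},
}
print(count_pattern_entries(raw_catalog))  # 1
```

With aliases like these, a signature reads `raw_catalog: CatalogConfig` instead of a nested `dict[str, dict[str, Any]]`.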

noklam · 2023-06-30T15:55:45Z

kedro/io/data_catalog.py

+    @staticmethod
+    def _is_full_config(config: dict[str, Any]) -> bool:
+        """Check if the config is a full config"""
+        remaining = set(config.keys()) - {"load_version", "save_version"}
+        return bool(remaining)


Why does subtracting load_version and save_version from config.keys() tell us that this is a full config?

noklam · 2023-06-30T15:56:51Z

kedro/io/data_catalog.py

+    def __contains__(self, item):
+        """Check if an item is in the catalog as a materialised dataset or pattern"""
+        if item in self._data_sets or self._match_name_against_pattern(
+            self._raw_catalog, item
+        ):
+            return True
+        return False


👍🏼 I like this
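
For readers without the full diff, a toy sketch of the behaviour this enables. `ToyCatalog` is a hypothetical stand-in for `DataCatalog`, and the regex-based matching is a swapped-in technique chosen to keep the example standard-library only; it is not the matching code from the PR.

```python
import re
from typing import Any, Dict, Optional


class ToyCatalog:
    """Hypothetical stand-in for DataCatalog, just enough to show `in`."""

    def __init__(
        self, data_sets: Dict[str, Any], raw_catalog: Dict[str, Dict[str, Any]]
    ) -> None:
        self._data_sets = data_sets
        self._raw_catalog = raw_catalog

    @staticmethod
    def _pattern_to_regex(pattern: str) -> "re.Pattern[str]":
        # Turn "{country}_companies" into "(?P<country>.+)_companies".
        parts = re.split(r"\{(\w+)\}", pattern)
        regex = "".join(
            f"(?P<{part}>.+)" if index % 2 else re.escape(part)
            for index, part in enumerate(parts)
        )
        return re.compile(regex)

    def _match_name_against_pattern(self, name: str) -> Optional[str]:
        # Return the first pattern entry that matches the whole name.
        for pattern in self._raw_catalog:
            if "{" in pattern and self._pattern_to_regex(pattern).fullmatch(name):
                return pattern
        return None

    def __contains__(self, item: str) -> bool:
        # True for materialised datasets and for names matching a pattern.
        return (
            item in self._data_sets
            or self._match_name_against_pattern(item) is not None
        )


catalog = ToyCatalog(
    data_sets={"reviews": object()},
    raw_catalog={"{country}_companies": {"type": "pandas.CSVDataSet"}},
)
print("reviews" in catalog)           # True
print("france_companies" in catalog)  # True
print("shuttles" in catalog)          # False
```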

Dataset factories new implementation

37cabc4

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

noklam self-requested a review June 29, 2023 14:47

noklam reviewed Jun 30, 2023

merelcht mentioned this pull request Jul 4, 2023

Dataset factories #2635

Merged

9 tasks

ankatiyar closed this Jul 5, 2023

ankatiyar deleted the feature/dataset-factories-new branch July 5, 2023 11:34
