Dataset factories #2635

ankatiyar · 2023-06-02T14:22:44Z

Description

Resolves #2423. This PR introduces the dataset factories feature. See #2670 for the documentation on how to use it.

Development notes

Kedro run workflow

--> kedro run in terminal
--> Session is created and run
--> Context is created, which loads all the config from the conf_source
--> DataCatalog.from_config(catalog: dict), which initialises the DataCatalog object

At this point the DataCatalog is created with the explicitly mentioned datasets in DataCatalog._data_sets
Sort the patterns and save them to DataCatalog._dataset_patterns
Also save load_versions and save_version to self._load_versions and self._save_version

--> runner.run(pipeline, catalog, .., ..)

loop through all datasets used by the pipeline and save all that exist (either as explicit datasets or patterns) to registered_ds
~~The check for existence will materialise and add the datasets to the catalog (in DataCatalog.__contains__())~~

--> Pipeline is run and catalog._get_dataset(dataset_name)-> AbstractDataSet : ~~No major change because datasets should already be materialised at the check for existence~~

Add datasets that exist as patterns as materialised datasets to the catalog

NOTE : The tests are not updated to reflect the latest changes (TODO)

Toy project to test with : https://github.com/ankatiyar/space-patterns

Follow up tasks to do

Update tests
Documentation Add documentation for the dataset factories feature #2666
Use dataset factories to register default datasets #2668
Update kedro catalog list + Add new catalog CLI commands for dataset factories #2603

Checklist

Read the contributing guidelines
Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Added a description of this change in the RELEASE.md file
Added tests to cover my changes

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

merelcht

I've done a first review of the code in data_catalog.py and left some questions. My main question is what the purpose of the _pattern_name_matches_cache is and the methods that update it.

kedro/io/data_catalog.py

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

ankatiyar · 2023-06-13T13:03:11Z

@merelcht The reason I added self._pattern_name_matches_cache is because the first thing we need to do in the runner is check if all the datasets used in the pipeline exist in the catalog using catalog.exists_in_catalog(). At this step if we find a pattern that matches, it's added to the cache dict which stores dataset name -> pattern match. (Just so when we do catalog._get_dataset(dataset_name) we don't have to iterate over all the patterns again).

This might be overkill and I can revert to what the prototype was doing but I thought for large catalogs, this would avoid doing the same iteration twice. What do you think?

The workflow for then fetching a dataset is ->

1. If the dataset is in self._data_sets, go to step 3.
2. If not, I reuse self.exists_in_catalog() to check whether it matches a pattern.
   i. If this is the first time we're checking for the dataset's existence (e.g. when someone is using DataCatalog standalone and the runner hasn't populated the cache), add the "name -> pattern" entry to self._pattern_name_matches_cache.
   ii. Get the factory pattern for the dataset name from self._pattern_name_matches_cache.
   iii. Get the resolved dataset using _get_resolved_dataset(), which resolves the config and then loads the instance of AbstractDataSet.
   iv. Add the resolved dataset to self._data_sets.
3. Load the dataset from self._data_sets and return it.
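
The lookup steps above can be sketched in a few lines. This is only an illustration under stated assumptions, not the code in this PR: SketchCatalog and match_placeholders are hypothetical, a small regex stands in for the real pattern matcher, and the "resolved dataset" is just a config dict with its placeholders filled in rather than an AbstractDataSet instance.

```python
from __future__ import annotations

import re
from typing import Any


def match_placeholders(pattern: str, name: str) -> dict[str, str] | None:
    """Stand-in matcher: return placeholder values if name fits pattern, else None."""
    # "{name}_csv" splits into ["", "name", "_csv"]; odd positions are placeholders.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        f"(?P<{part}>.+)" if index % 2 else re.escape(part)
        for index, part in enumerate(parts)
    )
    found = re.fullmatch(regex, name)
    return found.groupdict() if found else None


class SketchCatalog:
    """Hypothetical catalog that resolves pattern-based entries lazily."""

    def __init__(self, data_sets=None, dataset_patterns=None) -> None:
        self._data_sets: dict[str, Any] = dict(data_sets or {})
        self._dataset_patterns: dict[str, dict[str, Any]] = dict(dataset_patterns or {})
        # dataset name -> pattern that matched it
        self._pattern_name_matches_cache: dict[str, str] = {}

    def exists_in_catalog(self, name: str) -> bool:
        if name in self._data_sets or name in self._pattern_name_matches_cache:
            return True
        for pattern in self._dataset_patterns:
            if match_placeholders(pattern, name) is not None:
                # Remember the match so later lookups skip the iteration.
                self._pattern_name_matches_cache[name] = pattern
                return True
        return False

    def _get_resolved_dataset(self, name: str) -> dict[str, Any]:
        pattern = self._pattern_name_matches_cache[name]
        values = match_placeholders(pattern, name) or {}
        config = self._dataset_patterns[pattern]
        # Fill "{placeholder}" occurrences in string values of the config.
        return {
            key: value.format(**values) if isinstance(value, str) else value
            for key, value in config.items()
        }

    def _get_dataset(self, name: str) -> Any:
        if name not in self._data_sets:                      # step 1
            if not self.exists_in_catalog(name):             # step 2 (i)
                raise KeyError(f"Dataset '{name}' not found in the catalog")
            resolved = self._get_resolved_dataset(name)      # steps 2 (ii) and (iii)
            self._data_sets[name] = resolved                 # step 2 (iv)
        return self._data_sets[name]                         # step 3
```

Because the cache is filled inside exists_in_catalog(), the sketch behaves the same whether a runner checked existence first or the catalog is used standalone.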

merelcht

Thanks for clarifying the use of the cache @ankatiyar, it makes more sense now! I think we can keep it, but perhaps give it a slightly shorter name e.g. _pattern_matches_cache.
In general, I struggled to follow the code because of some of the method names. I tried giving suggestions on how to improve them, but it's quite personal so it would be good to get some other opinions as well 🙂

One thing every single user in the interviews mentioned is a warning about the catch-all pattern, so we must really implement that, including tests.

Another thing that was mentioned several times was a warning about when multiple patterns match a dataset, it would be good to have a test for that to see how that would happen and if we can also warn about that.

kedro/io/data_catalog.py

tests/io/test_data_catalog.py

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

ankatiyar · 2023-06-15T13:49:49Z

@merelcht I'll leave the warning about catch-all patterns and multiple matches out of this PR but do it as a part of #2668

noklam

Drop some comments for 1st round of review.

kedro/io/data_catalog.py

noklam · 2023-06-15T15:31:12Z

kedro/io/data_catalog.py

+        self._sorted_dataset_patterns = sorted(
+            self.dataset_patterns.keys(),
+            key=lambda pattern: (
+                -(_specificity(pattern)),
+                -pattern.count("{"),
+                pattern,
+            ),
+        )


This should be refactored into a method and have its own unit tests.
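
A minimal sketch of what such a standalone, unit-testable helper could look like. The sort key mirrors the diff above, but the body of specificity() is an assumption (the diff does not show _specificity): here it counts the characters that sit outside the curly-brace placeholders. The names specificity and sort_patterns are hypothetical.

```python
from __future__ import annotations

import re
from typing import Iterable


def specificity(pattern: str) -> int:
    """Assumed definition: number of characters outside {placeholders}."""
    return len(re.sub(r"\{.*?\}", "", pattern))


def sort_patterns(patterns: Iterable[str]) -> list[str]:
    """Order patterns from most to least specific.

    Ties are broken by the number of placeholders (more first),
    then alphabetically, matching the key used in the diff.
    """
    return sorted(
        patterns,
        key=lambda pattern: (
            -specificity(pattern),
            -pattern.count("{"),
            pattern,
        ),
    )
```

With this ordering a catch-all such as "{default}" has no characters outside its placeholder, so it always sorts last and is only tried after every more specific pattern.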

kedro/io/data_catalog.py

noklam · 2023-06-15T15:41:22Z

kedro/io/data_catalog.py

@@ -567,16 +642,47 @@ def list(self, regex_search: str | None = None) -> list[str]:
            ) from exc
        return [dset_name for dset_name in self._data_sets if pattern.search(dset_name)]

+    def exists_in_catalog_config(self, dataset_name: str) -> bool:


Is this the cache that adds matched patterns to the catalog? If so, why do we need to keep it separately instead of just creating the entry in the catalog?

This fn just adds the "dataset name -> pattern" entry to self._pattern_matches_cache; it does not create an entry in the catalog. This is because the function is also called in the runner to check whether the datasets from the pipeline exist in the catalog. The actual adding of a dataset to the catalog happens in self._get_dataset().
This avoids doing the pattern matching multiple times when it's called inside a Session run, while the actual resolving and adding still happens at _get_dataset time to allow for standalone DataCatalog usage.

This isn't a good name for a public method, it's way too descriptive. I would suggest that you change it to def __contains__(self, dataset_name):. This will allow us to do a test like this:
if dataset_name in catalog:..., which is way more intuitive and nicer than catalog.exists_in_catalog_config(dataset_name).
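
For illustration, a tiny sketch of the membership protocol this suggestion relies on. TinyCatalog is hypothetical and uses a plain suffix check as a stand-in for factory pattern matching; the only point is that defining __contains__ makes the in operator work on the object.

```python
from __future__ import annotations


class TinyCatalog:
    """Hypothetical container; only the membership protocol is of interest here."""

    def __init__(self, names: set[str], suffixes: tuple[str, ...] = ()) -> None:
        self._names = set(names)
        # Stand-in for factory patterns: any name ending in one of these matches.
        self._suffixes = tuple(suffixes)

    def __contains__(self, dataset_name: str) -> bool:
        # Python calls this method for `dataset_name in catalog`.
        return dataset_name in self._names or dataset_name.endswith(self._suffixes)


catalog = TinyCatalog({"cars"}, suffixes=("_csv",))
if "boats_csv" in catalog:
    print("boats_csv is covered by the catalog")
```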

kedro/io/data_catalog.py

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

merelcht

I reviewed both this PR and #2743 and I definitely prefer the approach here with adding load_versions and save_version arguments instead of raw_catalog. I left some more comments, but in general I think this is nearly finished.

kedro/io/data_catalog.py

merelcht · 2023-07-04T10:35:29Z

kedro/io/data_catalog.py

+        Args:
+            pattern: The factory pattern


For completeness this doc string should also include the return type/value.

It's a private method; the Args: part is left over from when it was a helper fn outside DataCatalog. I've removed it.

kedro/io/data_catalog.py

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

merelcht

I left one more minor comment, but apart from that I'm very happy with how this looks! It would be good to get another review from @noklam or @idanov, but from my side this is good to go 👍

kedro/io/data_catalog.py

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

noklam

I have some comments here, but I figured it is easier if I drop all my comments in one draft PR, it's not ready yet. I will ping you when it's ready.

Review PR since it is easier.

noklam · 2023-07-05T23:20:12Z

kedro/io/data_catalog.py

        self,
        data_sets: dict[str, AbstractDataSet] = None,
        feed_dict: dict[str, Any] = None,
        layers: dict[str, set[str]] = None,
+        dataset_patterns: dict[str, dict[str, Any]] = None,


It might be good to introduce keyword-only arguments here, similar to the proposal, to give us more freedom to rearrange the arguments later without breaking changes. The current argument list does not make much sense.
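
As a sketch of the idea (not the signature adopted in this PR), a bare * in the parameter list makes everything after it keyword-only. KeywordOnlyCatalog is a hypothetical class and the choice of which parameters sit behind the * is an assumption made for the example.

```python
from __future__ import annotations

from typing import Any


class KeywordOnlyCatalog:
    """Hypothetical constructor showing keyword-only parameters."""

    def __init__(
        self,
        data_sets: dict[str, Any] | None = None,
        *,  # everything below must be passed by name
        dataset_patterns: dict[str, dict[str, Any]] | None = None,
        load_versions: dict[str, str] | None = None,
        save_version: str | None = None,
    ) -> None:
        self.data_sets = dict(data_sets or {})
        self.dataset_patterns = dict(dataset_patterns or {})
        self.load_versions = dict(load_versions or {})
        self.save_version = save_version


# Passing the patterns by name works; passing them positionally raises TypeError.
catalog = KeywordOnlyCatalog({"cars": object()}, dataset_patterns={"{name}_csv": {}})
```

Since callers can no longer rely on position, the keyword-only parameters can be reordered or interleaved with new ones later without breaking existing calls.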

noklam · 2023-07-05T23:37:41Z

kedro/io/data_catalog.py

@@ -170,8 +184,12 @@ def __init__(
        self._data_sets = dict(data_sets or {})
        self.datasets = _FrozenDatasets(self._data_sets)
        self.layers = layers
+        # Keep a record of all patterns in the catalog.
+        # {dataset pattern name : dataset pattern body}
+        self._dataset_patterns = dict(dataset_patterns or {})


This still confuses me a lot. Is there any difference? Isn't dataset_patterns a dict already?

Suggested change

-        self._dataset_patterns = dict(dataset_patterns or {})
+        self._dataset_patterns = dataset_patterns or {}
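
On the question of whether the two forms differ: in plain Python they do, because dict(...) makes a shallow copy while `or {}` keeps a reference to the caller's object. The snippet below only demonstrates that language behaviour with hypothetical helper names; whether the copy matters for DataCatalog is a separate design question that this sketch does not settle.

```python
from __future__ import annotations

from typing import Any


def keep_reference(dataset_patterns: dict[str, Any] | None = None) -> dict[str, Any]:
    # Same object as the caller's dict (or a fresh dict if None/empty was passed).
    return dataset_patterns or {}


def make_copy(dataset_patterns: dict[str, Any] | None = None) -> dict[str, Any]:
    # New top-level dict; nested values are still shared (shallow copy).
    return dict(dataset_patterns or {})


caller_patterns = {"{name}_csv": {"type": "pandas.CSVDataSet"}}
aliased = keep_reference(caller_patterns)
copied = make_copy(caller_patterns)

# A later mutation by the caller is visible through the alias, not the copy.
caller_patterns["{name}_parquet"] = {"type": "pandas.ParquetDataSet"}
print("{name}_parquet" in aliased, "{name}_parquet" in copied)
```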

noklam · 2023-07-05T23:37:54Z

kedro/io/data_catalog.py

+        # Keep a record of all patterns in the catalog.
+        # {dataset pattern name : dataset pattern body}
+        self._dataset_patterns = dict(dataset_patterns or {})
+        self._load_versions = dict(load_versions or {})


Same as self._dataset_patterns

noklam · 2023-07-05T23:38:13Z

kedro/io/data_catalog.py

        catalog = copy.deepcopy(catalog) or {}
        credentials = copy.deepcopy(credentials) or {}
        save_version = save_version or generate_timestamp()
        load_versions = copy.deepcopy(load_versions) or {}
+        layers: dict[str, set[str]] = defaultdict(set)


Same argument: it would be good to have keyword-only arguments here, even though we are going to remove layers soon.

noklam · 2023-07-06T10:51:01Z

kedro/io/data_catalog.py

+        if "{" in pattern:
+            return True
+        return False


Suggested change

-        if "{" in pattern:
-            return True
-        return False
+        return "{" in pattern

noklam · 2023-07-06T13:45:34Z

As discussed, let's separate the Type Alias proposed in #2770.
I will close #2770 and approve this PR now. :)

noklam

LGTM now! Let's ship it and do the type alias separately.

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

* Add warning for catch-all patterns
  Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
* Update warning message
  Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
---------
Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

idanov · 2023-07-07T09:16:43Z

kedro/io/data_catalog.py

+        for pattern, _ in data_set_patterns.items():
+            result = parse(pattern, data_set_name)
+            if result:
+                return pattern
+        return None


This could be nicely rewritten to:

        matches = (parse(pattern, data_set_name) for pattern in data_set_patterns.keys())
        return next(filter(None, matches), None)
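
One detail worth checking in the rewrite above: parse(...) returns a match result rather than the pattern string, so next(filter(None, matches), None) would hand back the result object, whereas the original loop returns the pattern. A sketch that keeps the generator style but still yields the pattern could look like this. It is an assumption-laden illustration: is_match is a hypothetical regex-based stand-in for parse so the snippet runs with the standard library only.

```python
from __future__ import annotations

import re
from typing import Any


def is_match(pattern: str, name: str) -> bool:
    """Stand-in for `parse(pattern, name)`: does name fit the {placeholder} pattern?"""
    # Split on placeholders, escape the literal parts, and let ".+" fill each gap.
    literal_parts = re.split(r"\{\w+\}", pattern)
    regex = ".+".join(re.escape(part) for part in literal_parts)
    return re.fullmatch(regex, name) is not None


def match_pattern(data_set_patterns: dict[str, Any], data_set_name: str) -> str | None:
    """Return the first pattern (in dict order) that matches, or None."""
    matches = (
        pattern for pattern in data_set_patterns if is_match(pattern, data_set_name)
    )
    return next(matches, None)
```

The generator is lazy, so iteration stops at the first matching pattern just as the explicit loop does.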

idanov · 2023-07-07T09:18:37Z

kedro/io/data_catalog.py

+        missing_keys = [
+            key
+            for key in load_versions.keys()
+            if not (cls._match_pattern(sorted_patterns, key) or key in catalog)


This test should be reordered, since key in catalog is easier to test and probably will match more often. No need to go through _match_pattern for all the non-patterned datasets.
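
The reordering relies on the short-circuit behaviour of `or`: when the cheap dictionary lookup succeeds, the pattern matcher is never called. A small sketch with hypothetical names (find_missing_keys and the matches_pattern callback stand in for the classmethod logic in the diff):

```python
from __future__ import annotations

from typing import Any, Callable, Iterable


def find_missing_keys(
    load_versions: Iterable[str],
    catalog: dict[str, Any],
    matches_pattern: Callable[[str], bool],
) -> list[str]:
    """Return load-version keys that are neither explicit entries nor pattern matches."""
    return [
        key
        for key in load_versions
        # Cheap membership test first; the matcher only runs for non-explicit keys.
        if not (key in catalog or matches_pattern(key))
    ]
```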

idanov · 2023-07-07T09:20:31Z

kedro/io/data_catalog.py

+        sorted_patterns = {}
+        for key in sorted_keys:
+            sorted_patterns[key] = data_set_patterns[key]
+        return sorted_patterns


Shorter and neater.

return {key: data_set_patterns[key] for key in sorted_keys}

merelcht and others added 3 commits May 22, 2023 17:48

Cleaned up and up to date version of dataset factories code

ec66a12

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

Add some simple tests

5e6c15d

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

Add parsing rules

0fca72c

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

ankatiyar requested a review from merelcht June 5, 2023 10:10

ankatiyar added 2 commits June 8, 2023 15:06

Refactor

06ed1a4

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

Add some tests

b0e3fb9

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

ankatiyar mentioned this pull request Jun 8, 2023

[Draft] Dataset factories - Eager resolving approach #2632

Closed

8 tasks

ankatiyar changed the title ~~[Draft] Dataset factories - Lazy resolving approach~~ [Draft] Dataset factories Jun 8, 2023

ankatiyar and others added 3 commits June 12, 2023 10:57

Add unit tests

0833af2

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

Fix test + refactor runner

8fc80f9

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

Merge branch 'main' into feature/dataset-factories-for-catalog

091f794

ankatiyar marked this pull request as ready for review June 12, 2023 10:36

ankatiyar requested a review from idanov as a code owner June 12, 2023 10:36

ankatiyar changed the title ~~[Draft] Dataset factories~~ Dataset factories Jun 12, 2023

ankatiyar requested review from noklam and antonymilne June 12, 2023 14:06

ankatiyar mentioned this pull request Jun 12, 2023

Add documentation for dataset factories feature #2670

Merged

5 tasks

merelcht reviewed Jun 12, 2023

Add comments + update specificity fn

8c192ee

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

merelcht mentioned this pull request Jun 13, 2023

[PROTOTYPE NOT TO BE MERGED] Dataset factories prototype #2560

Closed

5 tasks

ankatiyar mentioned this pull request Jun 14, 2023

[DRAFT] Dataset factory parsing rules demo #2559

Closed

5 tasks

merelcht reviewed Jun 14, 2023

ankatiyar and others added 3 commits June 15, 2023 14:06

Update function names

3e2642c

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

Merge branch 'main' into feature/dataset-factories-for-catalog

c2635d0

Update test

d310486

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

noklam reviewed Jun 15, 2023

ankatiyar and others added 2 commits June 19, 2023 13:10

Merge branch 'main' into feature/dataset-factories-for-catalog

573c67f

Release notes + update resume scenario fix

9d80de4

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

ankatiyar requested review from merelcht and noklam July 3, 2023 15:02

ankatiyar and others added 3 commits July 3, 2023 16:18

linting + small fix _get_datasets

eee606a

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

Remove check for existence

635510a

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

Merge branch 'main' into feature/dataset-factories-for-catalog

394f37b

merelcht reviewed Jul 4, 2023

ankatiyar and others added 4 commits July 5, 2023 09:56

Add updated tests + Release notes

a1c602d

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

change classmethod to staticmethod for _match_patterns

978d0a5

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

Add test for layer

b4fe7a7

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

Merge branch 'main' into feature/dataset-factories-for-catalog

2782dca

merelcht approved these changes Jul 5, 2023

Minor change from code review

85d3df1

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

noklam reviewed Jul 5, 2023

noklam reviewed Jul 6, 2023

ankatiyar mentioned this pull request Jul 6, 2023

Add warning for catch-all patterns [dataset factories] #2774

Merged

5 tasks

noklam self-requested a review July 6, 2023 13:45

noklam approved these changes Jul 6, 2023

noklam mentioned this pull request Jul 6, 2023

Introduce Type Alias for Patterns and more #2776

Closed

ankatiyar and others added 3 commits July 6, 2023 15:13

Remove type conversion

8904ce3

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

Merge branch 'main' into feature/dataset-factories-for-catalog

fa6c256

ankatiyar merged commit 6da8bde into main Jul 6, 2023

ankatiyar deleted the feature/dataset-factories-for-catalog branch July 6, 2023 15:06

idanov reviewed Jul 7, 2023

ankatiyar mentioned this pull request Jul 7, 2023

Add type alias for dataset factory patterns #2779

Merged

5 tasks

AhdraMeraliQB mentioned this pull request Jul 14, 2023

Update kedro catalog list command to account for dataset factories #2793

Merged

5 tasks

noklam mentioned this pull request Sep 18, 2023

Lazy Loading of Catalog Items #2829

Closed
