Dataset factories #2635
Conversation
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
I've done a first review of the code in data_catalog.py and left some questions. My main question is what the purpose of `_pattern_name_matches_cache` is, and of the methods that update it.
@merelcht The reason I added it: this might be overkill and I can revert to what the prototype was doing, but I thought that for large catalogs this would avoid doing the same iteration twice. What do you think? The workflow for then fetching a dataset is ->
Thanks for clarifying the use of the cache @ankatiyar, it makes more sense now! I think we can keep it, but perhaps give it a slightly shorter name, e.g. `_pattern_matches_cache`.
In general, I struggled to follow the code because of some of the method names. I tried giving suggestions on how to improve them, but naming is quite personal, so it would be good to get some other opinions as well 🙂
One thing every single user in the interviews mentioned is a warning about the catch-all pattern, so we must really implement that, including tests.
Another thing that was mentioned several times was a warning for when multiple patterns match a dataset; it would be good to have a test for that to see how it would happen and whether we can also warn about it.
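As a sketch of the catch-all warning discussed above: a catch-all pattern is one with no literal characters outside its `{}` fields, so it matches any dataset name. The helper below is hypothetical (name and logic are assumptions, not the PR's implementation):

```python
import re

def is_catch_all(pattern: str) -> bool:
    # Strip all "{field}" placeholders; if nothing literal remains,
    # the pattern can match any dataset name (e.g. "{default}").
    literal = re.sub(r"\{[^}]*\}", "", pattern)
    return literal == ""

# A warning could then be raised at catalog construction time for any
# pattern where is_catch_all(pattern) is True.
```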
Dropping some comments for the first round of review.
kedro/io/data_catalog.py
Outdated
```python
        self._sorted_dataset_patterns = sorted(
            self.dataset_patterns.keys(),
            key=lambda pattern: (
                -(_specificity(pattern)),
                -pattern.count("{"),
                pattern,
            ),
        )
```
This should be refactored into a method and have its own unit tests.
kedro/io/data_catalog.py
Outdated
```diff
@@ -567,16 +642,47 @@ def list(self, regex_search: str | None = None) -> list[str]:
             ) from exc
         return [dset_name for dset_name in self._data_sets if pattern.search(dset_name)]

+    def exists_in_catalog_config(self, dataset_name: str) -> bool:
```
Is this the cache that adds a matched pattern into the catalog? If so, why do we need to keep it separately instead of just creating the entry in the catalog?
This fn just adds the dataset name -> pattern mapping to `self._pattern_matches_cache`; it does not create an entry in the catalog. This is because the function is also called in the runner to check for the existence of the pipeline's datasets in the catalog. The actual adding of a dataset to the catalog happens in `self._get_dataset()`. This avoids doing the pattern matching multiple times when it's called inside a Session run, while the actual resolving and adding still happens at `_get_dataset()` time, to allow for standalone DataCatalog usage.
This isn't a good name for a public method; it's way too descriptive. I would suggest that you change it to `def __contains__(self, dataset_name)`. This will allow us to do a test like `if dataset_name in catalog: ...`, which is way more intuitive and nicer than `catalog.exists_in_catalog_config(dataset_name)`.
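To illustrate the suggestion, here is a toy stand-in for `DataCatalog` whose `__contains__` checks explicit entries first and falls back to the factory patterns. The class, the regex-based matcher (a stand-in for the `parse` library), and all names are illustrative assumptions:

```python
import re

class MiniCatalog:
    """Toy catalog: explicit datasets plus factory patterns."""

    def __init__(self, data_sets=None, dataset_patterns=None):
        self._data_sets = dict(data_sets or {})
        self._dataset_patterns = dict(dataset_patterns or {})

    @staticmethod
    def _matches(pattern: str, name: str) -> bool:
        # Stand-in for parse(): turn each "{field}" into "(.+)".
        parts = re.split(r"\{[^}]*\}", pattern)
        regex = "(.+)".join(re.escape(p) for p in parts)
        return re.fullmatch(regex, name) is not None

    def __contains__(self, dataset_name: str) -> bool:
        # Explicit entries win; otherwise try the factory patterns.
        if dataset_name in self._data_sets:
            return True
        return any(self._matches(p, dataset_name) for p in self._dataset_patterns)

catalog = MiniCatalog(
    data_sets={"companies": "a_dataset"},
    dataset_patterns={"{name}_csv": {"type": "pandas.CSVDataSet"}},
)
```

With this in place, `"reviews_csv" in catalog` reads naturally at call sites such as the runner's existence check.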
I reviewed both this PR and #2743 and I definitely prefer the approach here of adding `load_versions` and `save_version` arguments instead of `raw_catalog`. I left some more comments, but in general I think this is nearly finished.
kedro/io/data_catalog.py
Outdated
```python
        Args:
            pattern: The factory pattern
```
For completeness this docstring should also include the return type/value.
It's a private method; the `Args:` part is leftover from when it was a helper fn outside `DataCatalog`. I've removed it.
I have some comments here, but I figured it is easier if I drop all my comments in one draft PR; it's not ready yet. I will ping you when it's ready.
Opened a review PR since it is easier.
```python
        self,
        data_sets: dict[str, AbstractDataSet] = None,
        feed_dict: dict[str, Any] = None,
        layers: dict[str, set[str]] = None,
        dataset_patterns: dict[str, dict[str, Any]] = None,
```
Maybe it is good to introduce keyword-only arguments here, similar to the proposal, to give us more freedom to re-arrange the arguments later without breaking changes. The current argument list does not make too much sense.
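For reference, keyword-only arguments are introduced with a bare `*` in the signature; everything after it must be passed by name, so the parameter order can be shuffled later without breaking callers. A minimal sketch (this is not Kedro's actual signature):

```python
class Catalog:
    """Illustrative signature with keyword-only arguments."""

    def __init__(
        self,
        data_sets=None,
        *,  # everything below must be passed by keyword
        feed_dict=None,
        layers=None,
        dataset_patterns=None,
    ):
        self.data_sets = dict(data_sets or {})
        self.dataset_patterns = dict(dataset_patterns or {})

# Fine: keyword arguments can be reordered freely at the call site.
catalog = Catalog({"a": 1}, dataset_patterns={"{n}_csv": {}})
# TypeError: Catalog({"a": 1}, {}, {}, {}) -- positional use is rejected.
```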
kedro/io/data_catalog.py
Outdated
```diff
@@ -170,8 +184,12 @@ def __init__(
         self._data_sets = dict(data_sets or {})
         self.datasets = _FrozenDatasets(self._data_sets)
         self.layers = layers
+        # Keep a record of all patterns in the catalog.
+        # {dataset pattern name : dataset pattern body}
+        self._dataset_patterns = dict(dataset_patterns or {})
```
This still confuses me a lot; is there any difference? Isn't `dataset_patterns` a dict already?

```suggestion
        self._dataset_patterns = dataset_patterns or {}
```
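There is one difference worth noting (an observation about the idiom, not necessarily the author's rationale): `dict(x or {})` takes a shallow copy, while `x or {}` aliases the caller's dict, so later mutations would leak back to the caller:

```python
patterns = {"{name}_csv": {"type": "pandas.CSVDataSet"}}

aliased = patterns or {}       # same object the caller passed in
copied = dict(patterns or {})  # new top-level dict (shallow copy)

aliased["{name}_pq"] = {}      # mutates the caller's dict through the alias
copied["{name}_xls"] = {}      # does not touch the caller's dict
```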
kedro/io/data_catalog.py
Outdated
```python
        # Keep a record of all patterns in the catalog.
        # {dataset pattern name : dataset pattern body}
        self._dataset_patterns = dict(dataset_patterns or {})
        self._load_versions = dict(load_versions or {})
```
Same as self._dataset_patterns
```python
        catalog = copy.deepcopy(catalog) or {}
        credentials = copy.deepcopy(credentials) or {}
        save_version = save_version or generate_timestamp()
        load_versions = copy.deepcopy(load_versions) or {}
        layers: dict[str, set[str]] = defaultdict(set)
```
Same argument: good to have keyword-only arguments here, even if we are going to remove `layers` soon.
kedro/io/data_catalog.py
Outdated
```python
        if "{" in pattern:
            return True
        return False
```
```suggestion
        return "{" in pattern
```
LGTM now! Let's ship it and do the type alias separately.
* Add warning for catch-all patterns
* Update warning message
```python
        for pattern, _ in data_set_patterns.items():
            result = parse(pattern, data_set_name)
            if result:
                return pattern
        return None
```
This could be nicely rewritten to:

```python
matches = (parse(pattern, data_set_name) for pattern in data_set_patterns.keys())
return next(filter(None, matches), None)
```
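The `next(filter(None, ...), default)` idiom returns the first truthy element of a lazy iterable, or the default when nothing matches. A self-contained demonstration (note that in the suggestion above the iterable yields parse results, whereas the original loop returns the matching pattern itself; keeping that behaviour would require iterating `(pattern, result)` pairs instead):

```python
def first_true(iterable, default=None):
    # filter(None, it) drops falsy items lazily; next() stops at the
    # first survivor, or returns `default` when the iterable is exhausted.
    return next(filter(None, iterable), default)
```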
```python
        missing_keys = [
            key
            for key in load_versions.keys()
            if not (cls._match_pattern(sorted_patterns, key) or key in catalog)
        ]
```
This test should be reordered, since `key in catalog` is easier to test and probably will match more often. No need to go through `_match_pattern` for all the non-patterned datasets.
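The point relies on `or` short-circuiting: with the cheap membership test first, the expensive matcher only runs for keys that are not plain catalog entries. A small demonstration with a stand-in matcher that records its calls (all names here are illustrative):

```python
calls = []

def match_pattern(patterns, key):
    # Stand-in for the expensive pattern matcher; records each invocation.
    calls.append(key)
    return any(p in key for p in patterns)

catalog = {"companies", "reviews"}
patterns = ["_csv"]
keys = ["companies", "reviews", "shuttles_csv"]

# Cheap check first: match_pattern only fires for "shuttles_csv".
missing = [k for k in keys if not (k in catalog or match_pattern(patterns, k))]
```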
```python
        sorted_patterns = {}
        for key in sorted_keys:
            sorted_patterns[key] = data_set_patterns[key]
        return sorted_patterns
```
Shorter and neater:

```python
return {key: data_set_patterns[key] for key in sorted_keys}
```
Description
Resolve #2423. This PR introduces the dataset factories feature. Look at #2670 for the documentation on how to use it.
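For context, a dataset factory entry in `catalog.yml` looks roughly like this (an illustrative sketch; see #2670 for the actual documentation):

```yaml
# One patterned entry replaces many near-identical concrete entries.
"{name}_csv":
  type: pandas.CSVDataSet
  filepath: data/01_raw/{name}.csv
```

A pipeline dataset named e.g. `reviews_csv` would then resolve to a `pandas.CSVDataSet` with filepath `data/01_raw/reviews.csv`.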
Development notes
Kedro run workflow:
- `kedro run` in terminal
- `Session` is created and run
- `Context` is created, which loads all the config from the `conf_source`
- `DataCatalog.from_config(catalog: dict)` initialises the `DataCatalog` object:
  - explicit entries go to `DataCatalog._data_sets`, factory patterns to `DataCatalog._dataset_patterns`
  - `load_version` and `save_version` are saved to `self._load_version` and `self._save_version`
- `runner.run(pipeline, catalog, .., ..)` checks for the existence of the pipeline's `registered_ds` in the catalog; the existence check will materialise and add the matched datasets to the catalog (in `DataCatalog.__contains__()`)
- Pipeline is run and `catalog._get_dataset(dataset_name) -> AbstractDataSet`: no major change, because datasets should already be materialised at the check for existence

NOTE: The tests are not updated to reflect the latest changes (TODO)
Toy project to test with : https://github.com/ankatiyar/space-patterns
Follow up tasks to do
- `kedro catalog list` + Add new catalog CLI commands for dataset factories #2603

Checklist
- Updated the `RELEASE.md` file