[DataCatalog2.0]: KedroDataCatalog #4151

Merged
merged 176 commits into from
Sep 24, 2024
Changes from 158 commits
Commits
a8f4fb3
Added a skeleton for AbstractDataCatalog and KedroDataCatalog
ElenaKhaustova Jul 31, 2024
7d56818
Removed from_config method
ElenaKhaustova Jul 31, 2024
787e121
Merge branch 'main' into refactor-pattern-logic
ElenaKhaustova Aug 2, 2024
0b80f23
Implemented _init_datasets method
ElenaKhaustova Aug 2, 2024
5c727df
Implemented get dataset
ElenaKhaustova Aug 2, 2024
05c9171
Started resolve_patterns implementation
ElenaKhaustova Aug 2, 2024
5c804d6
Implemented resolve_patterns
ElenaKhaustova Aug 5, 2024
e9ba5c4
Merge branch 'main' into refactor-pattern-logic
ElenaKhaustova Aug 5, 2024
530f7d6
Fixed credentials resolving
ElenaKhaustova Aug 5, 2024
64be83c
Updated match pattern
ElenaKhaustova Aug 6, 2024
c29828a
Implemented add from dict method
ElenaKhaustova Aug 6, 2024
957403a
Updated io __init__
ElenaKhaustova Aug 6, 2024
14908ff
Added list method
ElenaKhaustova Aug 6, 2024
c5e925b
Implemented _validate_missing_keys
ElenaKhaustova Aug 6, 2024
b9a92b0
Added datasets access logic
ElenaKhaustova Aug 7, 2024
2cb794f
Merge branch 'main' into refactor-pattern-logic
ElenaKhaustova Aug 7, 2024
2f32593
Added __contains__ and comments on lazy loading
ElenaKhaustova Aug 7, 2024
d1ea64e
Renamed dataset_name to ds_name
ElenaKhaustova Aug 8, 2024
fb89fca
Updated some docstrings
ElenaKhaustova Aug 8, 2024
4486939
Merge branch 'main' into refactor-pattern-logic
ElenaKhaustova Aug 12, 2024
c667645
Fixed _update_ds_configs
ElenaKhaustova Aug 12, 2024
be8e929
Fixed _init_datasets
ElenaKhaustova Aug 12, 2024
ec7ac39
Implemented add_runtime_patterns
ElenaKhaustova Aug 12, 2024
8e23450
Fixed runtime patterns usage
ElenaKhaustova Aug 13, 2024
529e61a
Merge branch 'main' into refactor-pattern-logic
ElenaKhaustova Aug 19, 2024
e4cb21c
Merge branch 'main' into refactor-pattern-logic
ElenaKhaustova Aug 21, 2024
50bc816
Moved pattern logic out of data catalog, implemented KedroDataCatalog
ElenaKhaustova Aug 21, 2024
6dfbcb0
Merge branch 'main' into 4110-move-pattern-resolution-logic
ElenaKhaustova Aug 22, 2024
9346f08
KedroDataCatalog updates
ElenaKhaustova Aug 22, 2024
9568e29
Added property to return config
ElenaKhaustova Aug 28, 2024
86efdfe
Merge branch 'main' into 4110-move-pattern-resolution-logic
ElenaKhaustova Aug 28, 2024
5e27660
Added list patterns method
ElenaKhaustova Aug 28, 2024
72b11d0
Renamed and moved ConfigResolver
ElenaKhaustova Aug 29, 2024
f0a4090
Renamed ConfigResolver
ElenaKhaustova Aug 29, 2024
a4da52a
Merge branch 'main' into 4110-move-pattern-resolution-logic
ElenaKhaustova Aug 29, 2024
7d6227f
Cleaned KedroDataCatalog
ElenaKhaustova Aug 29, 2024
4092291
Cleaned up DataCatalogConfigResolver
ElenaKhaustova Aug 29, 2024
63e47f9
Docs build fix attempt
ElenaKhaustova Aug 30, 2024
85bf720
KedroDataCatalog draft
ElenaKhaustova Sep 5, 2024
68f6527
Removed KedroDataCatalog
ElenaKhaustova Sep 5, 2024
2ac4a2f
Updated from_config method
ElenaKhaustova Sep 5, 2024
cb5879d
Updated constructor and add methods
ElenaKhaustova Sep 5, 2024
9038e96
Updated _get_dataset method
ElenaKhaustova Sep 5, 2024
cc89565
Updated __contains__
ElenaKhaustova Sep 5, 2024
59b6764
Updated __eq__ and shallow_copy
ElenaKhaustova Sep 5, 2024
4f5a3fb
Added __iter__ and __getitem__
ElenaKhaustova Sep 5, 2024
12ed6f2
Removed unused imports
ElenaKhaustova Sep 5, 2024
a106cec
Added TODO
ElenaKhaustova Sep 5, 2024
6df04f7
Updated runner.run()
ElenaKhaustova Sep 5, 2024
8566e27
Updated session
ElenaKhaustova Sep 5, 2024
2dcea33
Added confil_resolver property
ElenaKhaustova Sep 5, 2024
a46597f
Updated catalog list command
ElenaKhaustova Sep 5, 2024
3787545
Updated catalog create command
ElenaKhaustova Sep 5, 2024
68d612d
Updated catalog rank command
ElenaKhaustova Sep 5, 2024
af5bee9
Updated catalog resolve command
ElenaKhaustova Sep 5, 2024
acc4d6e
Merge branch 'main' into 4110-move-pattern-resolution-logic
ElenaKhaustova Sep 5, 2024
e67ff0f
Remove some methods
ElenaKhaustova Sep 5, 2024
7b3afa2
Removed ds configs from catalog
ElenaKhaustova Sep 6, 2024
658a759
Fixed lint
ElenaKhaustova Sep 6, 2024
7be2a8e
Fixed typo
ElenaKhaustova Sep 6, 2024
09f3f26
Merge branch 'main' into 4110-move-pattern-resolution-logic
ElenaKhaustova Sep 6, 2024
9e43a9a
Added module docstring
ElenaKhaustova Sep 6, 2024
b28a9bf
Merge branch 'main' into 3995-data-catalog-2.0
ElenaKhaustova Sep 6, 2024
c9f3469
Merge branch '4110-move-pattern-resolution-logic' into 3995-data-cata…
ElenaKhaustova Sep 6, 2024
49a3b27
Renaming methods
ElenaKhaustova Sep 6, 2024
25b6501
Removed None from Pattern type
ElenaKhaustova Sep 6, 2024
3a646de
Fixed docs failing to find class reference
ElenaKhaustova Sep 6, 2024
5e5df4a
Fixed docs failing to find class reference
ElenaKhaustova Sep 6, 2024
aa59a35
Updated Patterns type
ElenaKhaustova Sep 6, 2024
c7efa3e
Fix tests (#4149)
ankatiyar Sep 6, 2024
023ffc6
Returned constants to avoid breaking changes
ElenaKhaustova Sep 6, 2024
6971779
Merge branch '4110-move-pattern-resolution-logic' into 3995-data-cata…
ElenaKhaustova Sep 6, 2024
d57a567
Udapted KedroDataCatalog for recent changes
ElenaKhaustova Sep 6, 2024
585b44f
Minor fix
ElenaKhaustova Sep 6, 2024
2769def
Merge branch '4110-move-pattern-resolution-logic' into 3995-data-cata…
ElenaKhaustova Sep 6, 2024
e447078
Updated test_sorting_order_with_other_dataset_through_extra_pattern
ElenaKhaustova Sep 9, 2024
beb0165
Merge branch 'main' into 4110-move-pattern-resolution-logic
ElenaKhaustova Sep 9, 2024
975e968
Removed odd properties
ElenaKhaustova Sep 9, 2024
11d782c
Updated tests
ElenaKhaustova Sep 9, 2024
e4abd23
Removed None from _fetch_credentials input
ElenaKhaustova Sep 9, 2024
5f105de
Merge branch '4110-move-pattern-resolution-logic' into 3995-data-cata…
ElenaKhaustova Sep 9, 2024
f9cb9c6
Updated specs and context
ElenaKhaustova Sep 9, 2024
31a9484
Updated runners
ElenaKhaustova Sep 9, 2024
ced1b7a
Updated default catalog validation
ElenaKhaustova Sep 9, 2024
7f9b576
Updated default catalog validation
ElenaKhaustova Sep 9, 2024
a3828d9
Updated contains and added exists methods for KedroDataCatalog
ElenaKhaustova Sep 9, 2024
16610c4
Fixed docs
ElenaKhaustova Sep 9, 2024
321affe
Fixing docs and lint
ElenaKhaustova Sep 9, 2024
ff25405
Fixed docs
ElenaKhaustova Sep 9, 2024
d0000c0
Fixed docs
ElenaKhaustova Sep 9, 2024
7f5ddec
Fixed unit tests
ElenaKhaustova Sep 10, 2024
e030bb6
Added __eq__
ElenaKhaustova Sep 10, 2024
6433dd8
Renamed DataCatalogConfigResolver to CatalogConfigResolver
ElenaKhaustova Sep 10, 2024
355576f
Renamed _init_configs to _resolve_config_credentials
ElenaKhaustova Sep 10, 2024
39d9ff6
Moved functions to the class
ElenaKhaustova Sep 10, 2024
659c9da
Refactored resolve_dataset_pattern
ElenaKhaustova Sep 10, 2024
840b32a
Fixed refactored part
ElenaKhaustova Sep 10, 2024
77f551c
Changed the order of arguments for DataCatalog constructor
ElenaKhaustova Sep 10, 2024
6e079a1
Replaced __getitem__ with .get()
ElenaKhaustova Sep 10, 2024
1f7e5f8
Updated catalog commands
ElenaKhaustova Sep 10, 2024
80f0e3d
Moved warm up block outside of the try block
ElenaKhaustova Sep 10, 2024
017cda3
Fixed linter
ElenaKhaustova Sep 10, 2024
cab6f06
Removed odd copying
ElenaKhaustova Sep 10, 2024
ac1ecc0
Merge branch '4110-move-pattern-resolution-logic' into 3995-data-cata…
ElenaKhaustova Sep 10, 2024
e955930
Renamed DataCatalogConfigResolver to CatalogConfigResolver
ElenaKhaustova Sep 10, 2024
a07f3d4
Renamed AbstractDataCatalog to BaseDataCatalog
ElenaKhaustova Sep 10, 2024
4ecb826
Moved validate_dataset_config inside catalog
ElenaKhaustova Sep 10, 2024
2b9be66
Renamed _init_dataset to _add_from_config
ElenaKhaustova Sep 10, 2024
fb3831b
Fix lint
ElenaKhaustova Sep 10, 2024
8f604d1
Updated release notes
ElenaKhaustova Sep 11, 2024
9a4db18
Returned DatasetError
ElenaKhaustova Sep 11, 2024
0a6946a
Added _dataset_patterns and _default_pattern to _config_resolver to a…
ElenaKhaustova Sep 11, 2024
fee7bd6
Made resolve_dataset_pattern return just dict
ElenaKhaustova Sep 11, 2024
f5a7992
Fixed linter
ElenaKhaustova Sep 11, 2024
1c981f3
Added Catalogprotocol draft
ElenaKhaustova Sep 11, 2024
6128be7
Implemented CatalogProtocol
ElenaKhaustova Sep 12, 2024
8c91d0e
Updated types
ElenaKhaustova Sep 12, 2024
18d2ba0
Fixed linter
ElenaKhaustova Sep 12, 2024
d48c6d3
Added _ImplementsCatalogProtocolValidator
ElenaKhaustova Sep 12, 2024
45ce6bc
Updated docstrings
ElenaKhaustova Sep 12, 2024
6ca972f
Fixed tests
ElenaKhaustova Sep 12, 2024
fdce5ea
Fixed docs
ElenaKhaustova Sep 12, 2024
3029963
Excluded Potocol from coverage
ElenaKhaustova Sep 12, 2024
0150a21
Merge branch 'main' into 4138-catalog-protocol
ElenaKhaustova Sep 12, 2024
0833a84
Fixed docs
ElenaKhaustova Sep 12, 2024
95ccb3c
Merge branch 'main' into 3995-data-catalog-2.0
ElenaKhaustova Sep 13, 2024
07908a8
Renamed catalog source to kedro_data_catalog
ElenaKhaustova Sep 13, 2024
25a6fcf
Renamed data set to dataset in docstrings
ElenaKhaustova Sep 13, 2024
07f8c12
Updated add_from_dict
ElenaKhaustova Sep 13, 2024
3a1a0f2
Revised comments and TODOs
ElenaKhaustova Sep 13, 2024
cf663a0
Updated error message to point to specific catalog type
ElenaKhaustova Sep 13, 2024
caa7316
Fixed tests
ElenaKhaustova Sep 13, 2024
9540a32
Merge branch '4138-catalog-protocol' into 3995-data-catalog-2.0
ElenaKhaustova Sep 13, 2024
0ac154d
Merged with protocol
ElenaKhaustova Sep 13, 2024
0ec1f23
Removed reference to DataCatalog in docstrings
ElenaKhaustova Sep 13, 2024
96d4576
Merge branch '4138-catalog-protocol' into 3995-data-catalog-2.0
ElenaKhaustova Sep 13, 2024
4ecd8fd
Fixed docs
ElenaKhaustova Sep 13, 2024
11b3426
Reordered methods
ElenaKhaustova Sep 13, 2024
741b682
Removed add_all from protocol
ElenaKhaustova Sep 13, 2024
88ba38b
Merge branch '4138-catalog-protocol' into 3995-data-catalog-2.0
ElenaKhaustova Sep 13, 2024
0020095
Changed the order of arguments
ElenaKhaustova Sep 13, 2024
78feb51
Updated docstrings
ElenaKhaustova Sep 13, 2024
6bf912c
Updated docstrings
ElenaKhaustova Sep 13, 2024
c7699ec
Merge branch '4138-catalog-protocol' into 3995-data-catalog-2.0
ElenaKhaustova Sep 13, 2024
bcd2d37
Added __repr__
ElenaKhaustova Sep 16, 2024
eb7e8f5
Made __getitem__ return deepcopy
ElenaKhaustova Sep 16, 2024
7348c12
Fixed bug in get_dataset()
ElenaKhaustova Sep 16, 2024
5aee9e9
Fixed __eq__
ElenaKhaustova Sep 16, 2024
c9c7c9a
Fixed docstrings
ElenaKhaustova Sep 16, 2024
c66df33
Merge branch 'main' into 4138-catalog-protocol
ElenaKhaustova Sep 16, 2024
2f1dcbd
Merge branch '4138-catalog-protocol' into 3995-data-catalog-2.0
ElenaKhaustova Sep 16, 2024
4b8d90c
Added __setitem__
ElenaKhaustova Sep 17, 2024
8f870a8
Unit tests for `KedroDataCatalog` (#4171)
ElenaKhaustova Sep 17, 2024
70dc177
Merge branch 'main' into 3995-data-catalog-2.0
ElenaKhaustova Sep 17, 2024
ae7a271
Updated RELEASE.md
ElenaKhaustova Sep 17, 2024
135cb0e
Removed deep copies
ElenaKhaustova Sep 18, 2024
ca4867c
Removed some interface that will be changed in the next version
ElenaKhaustova Sep 18, 2024
4745f71
Removed key completions
ElenaKhaustova Sep 18, 2024
033a0b7
Fixinf typos
ElenaKhaustova Sep 18, 2024
e74ffda
Removed key completions test
ElenaKhaustova Sep 18, 2024
00af3ec
Replaced data set with dataset
ElenaKhaustova Sep 18, 2024
2de7ccb
Added docstring for get_dataset() method
ElenaKhaustova Sep 18, 2024
8affed6
Renamed pytest fixture
ElenaKhaustova Sep 18, 2024
a52672e
Addressed review comments
ElenaKhaustova Sep 19, 2024
84f249c
Updated _assert_requirements_ok starters test
ElenaKhaustova Sep 20, 2024
2548119
Revert "Updated _assert_requirements_ok starters test"
ElenaKhaustova Sep 20, 2024
ac124e3
Updated error message
ElenaKhaustova Sep 20, 2024
f62ed03
Replaced typo
ElenaKhaustova Sep 20, 2024
b65609f
Replaced data set with dataset in docstrings
ElenaKhaustova Sep 20, 2024
17199ad
Updated tests
ElenaKhaustova Sep 20, 2024
44c576e
Merge branch 'main' into 3995-data-catalog-2.0
ElenaKhaustova Sep 20, 2024
6d5f094
Made KedroDataCatalog subclass from CatalogProtocol
ElenaKhaustova Sep 23, 2024
e24b2a6
Updated release notes
ElenaKhaustova Sep 23, 2024
c8ef90f
Merge branch 'main' into 3995-data-catalog-2.0
ElenaKhaustova Sep 23, 2024
d19941f
Renamed resolve_dataset_pattern to resolve_pattern
ElenaKhaustova Sep 24, 2024
572594e
Merge branch 'main' into 3995-data-catalog-2.0
ElenaKhaustova Sep 24, 2024
6 changes: 6 additions & 0 deletions RELEASE.md
@@ -1,6 +1,12 @@
# Upcoming Release

## Major features and improvements
* Implemented `KedroDataCatalog` repeating `DataCatalog` functionality with a few API enhancements:

[vale] RELEASE.md#L4 (WARNING): [Kedro.weaselwords] 'few' is a weasel word!
  * Removed `_FrozenDatasets` and access datasets as properties;
  * Added get dataset by name feature: dedicated function and access by key;
  * Added iterate over the datasets feature;
  * `add_feed_dict()` was simplified and renamed to `add_raw_data()`;
  * Datasets' initialisation was moved out from `from_config()` method to the constructor.
* Implemented `Protocol` abstraction for the current `DataCatalog` and adding new catalog implementations.
* Refactored `kedro run` and `kedro catalog` commands.
* Moved pattern resolution logic from `DataCatalog` to a separate component - `CatalogConfigResolver`. Updated `DataCatalog` to use `CatalogConfigResolver` internally.
2 changes: 2 additions & 0 deletions docs/source/conf.py
@@ -134,6 +134,7 @@
"kedro.io.core.DatasetError",
"kedro.io.core.Version",
"kedro.io.data_catalog.DataCatalog",
"kedro.io.kedro_data_catalog.KedroDataCatalog",
"kedro.io.memory_dataset.MemoryDataset",
"kedro.io.partitioned_dataset.PartitionedDataset",
"kedro.pipeline.pipeline.Pipeline",
@@ -172,6 +173,7 @@
"Patterns",
"CatalogConfigResolver",
"CatalogProtocol",
"KedroDataCatalog",
),
"py:data": (
"typing.Any",
2 changes: 2 additions & 0 deletions kedro/io/__init__.py
@@ -16,6 +16,7 @@
Version,
)
from .data_catalog import DataCatalog
from .kedro_data_catalog import KedroDataCatalog
from .lambda_dataset import LambdaDataset
from .memory_dataset import MemoryDataset
from .shared_memory_dataset import SharedMemoryDataset
@@ -30,6 +31,7 @@
"DatasetAlreadyExistsError",
"DatasetError",
"DatasetNotFoundError",
"KedroDataCatalog",
"LambdaDataset",
"MemoryDataset",
"SharedMemoryDataset",
326 changes: 326 additions & 0 deletions kedro/io/kedro_data_catalog.py
@@ -0,0 +1,326 @@
"""``KedroDataCatalog`` stores instances of ``AbstractDataset`` implementations to
provide ``load`` and ``save`` capabilities from anywhere in the program. To
use a ``KedroDataCatalog``, you need to instantiate it with a dictionary of datasets.
Then it will act as a single point of reference for your calls, relaying load and
save functions to the underlying datasets.
"""

from __future__ import annotations

import copy
import difflib
import logging
import re
from typing import Any

from kedro.io.catalog_config_resolver import CatalogConfigResolver, Patterns
from kedro.io.core import (
    AbstractDataset,
    AbstractVersionedDataset,
    DatasetAlreadyExistsError,
    DatasetError,
    DatasetNotFoundError,
    Version,
    generate_timestamp,
)
from kedro.io.memory_dataset import MemoryDataset
from kedro.utils import _format_rich, _has_rich_handler


class KedroDataCatalog:
Member

Should this be: class KedroDataCatalog(CatalogProtocol) ?

Contributor Author

We only need it in case we share some logic between the implementations which we intentionally don't want to do to keep all the implementations independent from the protocol class.

while it’s possible to subclass a protocol explicitly, it’s not necessary to do so for the sake of type-checking

https://peps.python.org/pep-0544/#explicitly-declaring-implementation

Member

I think we should do it here, since it will make sure we adhere to it. While it's optional, it's beneficial at no cost for us. Other implementers do not need to extend it though.

Contributor Author

I don't mind, but the point was to clearly show that we do not have a shared logic and addition of new catalog does not require an explicit declaration.
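
The trade-off discussed in this thread can be illustrated with a small, self-contained sketch. It is not Kedro code: `CatalogLike`, `ImplicitCatalog`, `ExplicitCatalog` and `roundtrip` are hypothetical names, and the protocol is trimmed down to two methods. It shows that a class satisfies a protocol structurally without mentioning it, while explicit subclassing is an optional way to state the intent.

```python
from typing import Any, Dict, Protocol, runtime_checkable


@runtime_checkable
class CatalogLike(Protocol):
    """Hypothetical, trimmed-down stand-in for a catalog protocol."""

    def load(self, name: str) -> Any: ...

    def save(self, name: str, data: Any) -> None: ...


class ImplicitCatalog:
    """Conforms structurally: it never references CatalogLike."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def load(self, name: str) -> Any:
        return self._data[name]

    def save(self, name: str, data: Any) -> None:
        self._data[name] = data


class ExplicitCatalog(CatalogLike):
    """Declares the protocol explicitly; a type checker can then verify
    the method signatures against the protocol at the class definition."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def load(self, name: str) -> Any:
        return self._data[name]

    def save(self, name: str, data: Any) -> None:
        self._data[name] = data


def roundtrip(catalog: CatalogLike, name: str, data: Any) -> Any:
    """Accepts any object that matches the protocol, however it was declared."""
    catalog.save(name, data)
    return catalog.load(name)
```

Both classes pass `isinstance(obj, CatalogLike)` and can be handed to `roundtrip`. Note that the runtime check only looks for the presence of the methods, not their signatures, so the stricter guarantee comes from a static type checker.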

    def __init__(
        self,
        datasets: dict[str, AbstractDataset] | None = None,
        raw_data: dict[str, Any] | None = None,
        config_resolver: CatalogConfigResolver | None = None,
        load_versions: dict[str, str] | None = None,
        save_version: str | None = None,
    ) -> None:
        """``KedroDataCatalog`` stores instances of ``AbstractDataset``
        implementations to provide ``load`` and ``save`` capabilities from
        anywhere in the program. To use a ``KedroDataCatalog``, you need to
        instantiate it with a dictionary of datasets. Then it will act as a
        single point of reference for your calls, relaying load and save
        functions to the underlying datasets.

        Args:
            datasets: A dictionary of dataset names and dataset instances.
            raw_data: A dictionary with data to be added in memory as ``MemoryDataset`` instances.
                Keys represent dataset names and the values are raw data.
            config_resolver: An instance of ``CatalogConfigResolver`` to resolve dataset patterns and configurations.
            load_versions: A mapping between dataset names and versions
                to load. Has no effect on datasets without enabled versioning.
            save_version: Version string to be used for ``save`` operations
                by all datasets with enabled versioning. It must: a) be a
                case-insensitive string that conforms with operating system
                filename limitations, b) always return the latest version when
                sorted in lexicographical order.
        """
        self._config_resolver = config_resolver or CatalogConfigResolver()
        self._datasets = datasets or {}
        self._load_versions = load_versions or {}
        self._save_version = save_version

        self._use_rich_markup = _has_rich_handler()

        for ds_name, ds_config in self._config_resolver.config.items():
            self._add_from_config(ds_name, ds_config)

        if raw_data:
            self.add_raw_data(raw_data)

    @property
    def datasets(self) -> dict[str, Any]:
        return copy.copy(self._datasets)

    @datasets.setter
    def datasets(self, value: Any) -> None:
        raise AttributeError(
            "Operation not allowed! Please use KedroDataCatalog.add() instead."
        )

    @property
    def config_resolver(self) -> CatalogConfigResolver:
        return self._config_resolver

    def __repr__(self) -> str:
        return self._datasets.__repr__()

    def __contains__(self, dataset_name: str) -> bool:
        """Check if an item is in the catalog as a materialised dataset or pattern"""
        return (
            dataset_name in self._datasets
            or self._config_resolver.match_pattern(dataset_name) is not None
        )

    def __eq__(self, other) -> bool:  # type: ignore[no-untyped-def]
        return (self._datasets, self._config_resolver.list_patterns()) == (
            other._datasets,
            other.config_resolver.list_patterns(),
        )

    @property
    def _logger(self) -> logging.Logger:
        return logging.getLogger(__name__)

    @classmethod
    def from_config(
        cls,
        catalog: dict[str, dict[str, Any]] | None,
        credentials: dict[str, dict[str, Any]] | None = None,
        load_versions: dict[str, str] | None = None,
        save_version: str | None = None,
    ) -> KedroDataCatalog:
        """Create a ``KedroDataCatalog`` instance from configuration. This is a
        factory method used to provide developers with a way to instantiate
        ``KedroDataCatalog`` with configuration parsed from configuration files.
        """
        catalog = catalog or {}
        config_resolver = CatalogConfigResolver(catalog, credentials)
        save_version = save_version or generate_timestamp()
        load_versions = load_versions or {}

        missing_keys = [
            ds_name
            for ds_name in load_versions
            if not (
                ds_name in config_resolver.config
                or config_resolver.match_pattern(ds_name)
            )
        ]
        if missing_keys:
            raise DatasetNotFoundError(
                f"'load_versions' keys [{', '.join(sorted(missing_keys))}] "
                f"are not found in the catalog."
            )

        return cls(
            load_versions=load_versions,
            save_version=save_version,
            config_resolver=config_resolver,
        )

    @staticmethod
    def _validate_dataset_config(ds_name: str, ds_config: Any) -> None:
        if not isinstance(ds_config, dict):
            raise DatasetError(
                f"Catalog entry '{ds_name}' is not a valid dataset configuration. "
                "\nHint: If this catalog entry is intended for variable interpolation, "
                "make sure that the key is preceded by an underscore."
            )

    def _add_from_config(self, ds_name: str, ds_config: dict[str, Any]) -> None:
        # TODO: Add lazy loading feature to store the configuration but not to init actual dataset
        # TODO: Initialise actual dataset when load or save
        self._validate_dataset_config(ds_name, ds_config)
        ds = AbstractDataset.from_config(
            ds_name,
            ds_config,
            self._load_versions.get(ds_name),
            self._save_version,
        )

        self.add(ds_name, ds)

    def get_dataset(
        self, ds_name: str, version: Version | None = None, suggest: bool = True
    ) -> AbstractDataset:
        ds_config = self._config_resolver.resolve_dataset_pattern(ds_name)
Member

Shouldn't we resolve only if ds_name not in self._datasets? Or it's just to make the code a bit simpler?

Contributor Author

Yeah, the point was to prevent you from complaining about nested ifs 😅 I moved it inside the condition now.

Member
@idanov idanov Sep 24, 2024

No, it was just a question, either is fine :)


        if ds_name not in self._datasets and ds_config:
            self._add_from_config(ds_name, ds_config)

        dataset = self._datasets.get(ds_name, None)

        if dataset is None:
Member

Could we rearrange this in such a way that we fail first and then continue with the successful path? Currently the flow is as follows:

  • resolve the dataset pattern
  • if not part of the materialised datasets, add from config
  • get the dataset
  • if the dataset does not exist (basically if it cannot be resolved nor existing), go with error scenario
  • otherwise continue with non-error scenario

I think we can make the flow a bit less zig-zagy.

Contributor Author

Well, we can only fail after we try to resolve. Otherwise, you get one more layer of if as the logic needs to go inside the if fail [] else [] scenario.

Now the logic is like this:

  • if not part of the materialised datasets, resolve the dataset pattern
  • if resolved, add from config
  • get the dataset
  • if the dataset does not exist (basically if it cannot be resolved nor exists), go with the error scenario
  • otherwise, continue with a non-error scenario

Member

What I meant is that we should have as a first line something like:

if ds_name not in self._datasets and self._config_resolver.match_pattern():
    ...

and then continue with the error scenario, and then go on with everything else. It'll be much easier to follow this way. Btw while checking if this is possible, I saw a problem in the resolver - it can fail even if it matches, but that should not happen.

elif isinstance(config, str) and "}" in config:
    try:
        config = config.format_map(resolved_vars.named)
    except KeyError as exc:
        raise DatasetError(
            f"Unable to resolve '{config}' from the pattern '{pattern}'. Keys used in the configuration "
            f"should be present in the dataset factory pattern."
        ) from exc

This ☝️ should have been checked at the config_resolver init time, basically we should not allow to create a config_resolver with unresolvable configs or add invalid configs that cannot be resolved.

Also there's other changes like e.g. resolve_dataset_pattern should be just resolve_pattern similar to all other public methods there, which never include the word dataset (rightfully). Hopefully the resolver API is not released yet, so we can change it now.
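
One way the "validate at init time" idea could look is sketched below. This is only an illustration under assumptions, not the resolver's actual code: the helper names (`placeholders`, `find_unresolvable_keys`, `validate_pattern_config`) are hypothetical, and it assumes patterns and config values use plain `{name}` style placeholders that the standard library's `string.Formatter` can parse.

```python
from string import Formatter
from typing import Any, Set


def placeholders(template: str) -> Set[str]:
    """Return the named ``{placeholder}`` fields used in a template string."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}


def find_unresolvable_keys(pattern: str, config: Any) -> Set[str]:
    """Collect placeholders used anywhere in ``config`` that ``pattern`` lacks."""
    available = placeholders(pattern)

    def walk(node: Any) -> Set[str]:
        if isinstance(node, str):
            return placeholders(node) - available
        if isinstance(node, dict):
            return set().union(*(walk(value) for value in node.values()))
        if isinstance(node, (list, tuple)):
            return set().union(*(walk(value) for value in node))
        return set()  # numbers, booleans, None: nothing to resolve

    return walk(config)


def validate_pattern_config(pattern: str, config: Any) -> None:
    """Fail when the pattern is registered instead of when it is resolved."""
    missing = find_unresolvable_keys(pattern, config)
    if missing:
        raise ValueError(
            f"Keys {sorted(missing)} used in the configuration are not "
            f"present in the dataset factory pattern '{pattern}'."
        )
```

With a check like this run once per pattern when the resolver is constructed, the later `format_map` call would no longer need its `KeyError` branch for patterns that already matched.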

Contributor Author
@ElenaKhaustova ElenaKhaustova Sep 24, 2024

  1. Checking for a pattern match is not enough, as the config could already be resolved at the init time. resolve_pattern method encapsulates this, not to expose this logic outside the config_resolver. So we ask the config_resolver to provide a config for a pattern without bothering about how it's happening inside. I don't think we need to move any resolution logic (including any specific checks) to the catalog level.
  2. The suggestion about configs resolution makes sense to me. We can move this validation to the init time and simplify the resolution method. But would do that in a separate PR as it doesn't touch the catalog and will be done at the level of the config resolver.
  3. Method was renamed.

Member
@idanov idanov Sep 24, 2024

  1. Why wouldn't it be enough? We are checking only for failing scenarios, what other failing scenarios would there be apart from not matching a concrete dataset or a pattern? Could we expect other failures of resolution?

In any case, this is a minor thing, let's merge it in as it is and then we can always come back and simplify it.
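
The flow debated in this thread (resolve lazily, materialise, handle the error path, then continue) can be sketched with toy classes. Everything here is an assumption made for illustration: `ToyResolver`, `ToyCatalog` and `DatasetNotFound` are hypothetical stand-ins, patterns are plain regular expressions rather than dataset factory patterns, and "datasets" are just config dictionaries.

```python
import difflib
import re
from typing import Any, Dict, Optional


class DatasetNotFound(Exception):
    """Hypothetical error type for the sketch."""


class ToyResolver:
    """Stand-in resolver: maps regex patterns to dataset configs."""

    def __init__(self, patterns: Dict[str, Dict[str, Any]]) -> None:
        self._patterns = patterns

    def resolve_pattern(self, ds_name: str) -> Optional[Dict[str, Any]]:
        for pattern, config in self._patterns.items():
            if re.fullmatch(pattern, ds_name):
                return dict(config, name=ds_name)
        return None


class ToyCatalog:
    def __init__(self, datasets: Dict[str, Any], resolver: ToyResolver) -> None:
        self._datasets = dict(datasets)
        self._resolver = resolver

    def get_dataset(self, ds_name: str, suggest: bool = True) -> Any:
        # 1. Resolve only names that are not materialised yet.
        if ds_name not in self._datasets:
            ds_config = self._resolver.resolve_pattern(ds_name)
            if ds_config:
                # 2. Materialise the dataset from the resolved config.
                self._datasets[ds_name] = ds_config

        dataset = self._datasets.get(ds_name)

        # 3. Error path: neither registered nor resolvable.
        if dataset is None:
            error_msg = f"Dataset '{ds_name}' not found in the catalog"
            if suggest:  # fuzzy matching can be slow, so it is optional
                matches = difflib.get_close_matches(ds_name, self._datasets.keys())
                if matches:
                    error_msg += (
                        f" - did you mean one of these instead: {', '.join(matches)}"
                    )
            raise DatasetNotFound(error_msg)

        # 4. Success path.
        return dataset
```

The catalog asks the resolver for a config without knowing how resolution works, which is the encapsulation argued for above. The cost is the extra nesting in step 1.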

            error_msg = f"Dataset '{ds_name}' not found in the catalog"
            # Flag to turn on/off fuzzy-matching which can be time consuming and
            # slow down plugins like `kedro-viz`
            if suggest:
                matches = difflib.get_close_matches(ds_name, self._datasets.keys())
                if matches:
                    suggestions = ", ".join(matches)
                    error_msg += f" - did you mean one of these instead: {suggestions}"
            raise DatasetNotFoundError(error_msg)

        if version and isinstance(dataset, AbstractVersionedDataset):
            # we only want to return a similar-looking dataset,
            # not modify the one stored in the current catalog
            dataset = dataset._copy(_version=version)

        return dataset

    def _get_dataset(
        self, dataset_name: str, version: Version | None = None, suggest: bool = True
    ) -> AbstractDataset:
        # TODO: remove when removing old catalog
        return self.get_dataset(dataset_name, version, suggest)

    def add(
        self, ds_name: str, dataset: AbstractDataset, replace: bool = False
    ) -> None:
        """Adds a new ``AbstractDataset`` object to the ``KedroDataCatalog``."""
        if ds_name in self._datasets:
            if replace:
                self._logger.warning("Replacing dataset '%s'", ds_name)
            else:
                raise DatasetAlreadyExistsError(
                    f"Dataset '{ds_name}' has already been registered"
                )
        self._datasets[ds_name] = dataset

    def list(self, regex_search: str | None = None) -> list[str]:
        """
        List of all dataset names registered in the catalog.
        This can be filtered by providing an optional regular expression
        which will only return matching keys.
        """

        if regex_search is None:
            return list(self._datasets.keys())

        if not regex_search.strip():
            self._logger.warning("The empty string will not match any datasets")
            return []

        try:
            pattern = re.compile(regex_search, flags=re.IGNORECASE)
        except re.error as exc:
            raise SyntaxError(
                f"Invalid regular expression provided: '{regex_search}'"
            ) from exc
        return [ds_name for ds_name in self._datasets if pattern.search(ds_name)]

    def save(self, name: str, data: Any) -> None:
        """Save data to a registered dataset."""
        dataset = self.get_dataset(name)
Contributor

Suggested change
- dataset = self.get_dataset(name)
+ dataset = self[name]

Maybe?

Contributor Author

I think we should keep it as it is in case someone is trying to save an unregistered dataset. get_dataset() will resolve it in case it's a pattern or validate that the dataset is not in the catalog.


        self._logger.info(
            "Saving data to %s (%s)...",
            _format_rich(name, "dark_orange") if self._use_rich_markup else name,
            type(dataset).__name__,
            extra={"markup": True},
        )

        dataset.save(data)

    def load(self, name: str, version: str | None = None) -> Any:
        """Loads a registered dataset."""
        load_version = Version(version, None) if version else None
        dataset = self.get_dataset(name, version=load_version)

        self._logger.info(
            "Loading data from %s (%s)...",
            _format_rich(name, "dark_orange") if self._use_rich_markup else name,
            type(dataset).__name__,
            extra={"markup": True},
        )

        return dataset.load()

    def release(self, name: str) -> None:
        """Release any cached data associated with a dataset
        Args:
            name: A dataset to be checked.
        Raises:
            DatasetNotFoundError: When a dataset with the given name
                has not yet been registered.
        """
        dataset = self.get_dataset(name)
        dataset.release()

    def confirm(self, name: str) -> None:
        """Confirm a dataset by its name.
        Args:
            name: Name of the dataset.
        Raises:
            DatasetError: When the dataset does not have `confirm` method.
        """
        self._logger.info("Confirming dataset '%s'", name)
        dataset = self.get_dataset(name)

        if hasattr(dataset, "confirm"):
            dataset.confirm()
        else:
            raise DatasetError(f"Dataset '{name}' does not have 'confirm' method")

    def add_raw_data(self, data: dict[str, Any], replace: bool = False) -> None:
        # This method was simplified to add memory datasets only, since
        # adding AbstractDataset can be done via add() method
        for ds_name, ds_data in data.items():
            self.add(ds_name, MemoryDataset(data=ds_data), replace)  # type: ignore[abstract]

    def add_feed_dict(self, feed_dict: dict[str, Any], replace: bool = False) -> None:
        # TODO: remove when removing old catalog
        return self.add_raw_data(feed_dict, replace)

    def shallow_copy(
        self, extra_dataset_patterns: Patterns | None = None
    ) -> KedroDataCatalog:
        # TODO: remove when removing old catalog
        """Returns a shallow copy of the current object.

        Returns:
            Copy of the current object.
        """
        if extra_dataset_patterns:
            self._config_resolver.add_runtime_patterns(extra_dataset_patterns)
        return self

    def exists(self, name: str) -> bool:
        """Checks whether registered dataset exists by calling its `exists()`
        method. Raises a warning and returns False if `exists()` is not
        implemented.

        Args:
            name: A dataset to be checked.

        Returns:
            Whether the dataset output exists.

        """
        try:
            dataset = self._get_dataset(name)
        except DatasetNotFoundError:
            return False
        return dataset.exists()