Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DataCatalog2.0]: Protocol abstraction for DataCatalog #4160

Merged
merged 107 commits into from
Sep 17, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
107 commits
Select commit Hold shift + click to select a range
a8f4fb3
Added a skeleton for AbstractDataCatalog and KedroDataCatalog
ElenaKhaustova Jul 31, 2024
7d56818
Removed from_config method
ElenaKhaustova Jul 31, 2024
787e121
Merge branch 'main' into refactor-pattern-logic
ElenaKhaustova Aug 2, 2024
0b80f23
Implemented _init_datasets method
ElenaKhaustova Aug 2, 2024
5c727df
Implemented get dataset
ElenaKhaustova Aug 2, 2024
05c9171
Started resolve_patterns implementation
ElenaKhaustova Aug 2, 2024
5c804d6
Implemented resolve_patterns
ElenaKhaustova Aug 5, 2024
e9ba5c4
Merge branch 'main' into refactor-pattern-logic
ElenaKhaustova Aug 5, 2024
530f7d6
Fixed credentials resolving
ElenaKhaustova Aug 5, 2024
64be83c
Updated match pattern
ElenaKhaustova Aug 6, 2024
c29828a
Implemented add from dict method
ElenaKhaustova Aug 6, 2024
957403a
Updated io __init__
ElenaKhaustova Aug 6, 2024
14908ff
Added list method
ElenaKhaustova Aug 6, 2024
c5e925b
Implemented _validate_missing_keys
ElenaKhaustova Aug 6, 2024
b9a92b0
Added datasets access logic
ElenaKhaustova Aug 7, 2024
2cb794f
Merge branch 'main' into refactor-pattern-logic
ElenaKhaustova Aug 7, 2024
2f32593
Added __contains__ and comments on lazy loading
ElenaKhaustova Aug 7, 2024
d1ea64e
Renamed dataset_name to ds_name
ElenaKhaustova Aug 8, 2024
fb89fca
Updated some docstrings
ElenaKhaustova Aug 8, 2024
4486939
Merge branch 'main' into refactor-pattern-logic
ElenaKhaustova Aug 12, 2024
c667645
Fixed _update_ds_configs
ElenaKhaustova Aug 12, 2024
be8e929
Fixed _init_datasets
ElenaKhaustova Aug 12, 2024
ec7ac39
Implemented add_runtime_patterns
ElenaKhaustova Aug 12, 2024
8e23450
Fixed runtime patterns usage
ElenaKhaustova Aug 13, 2024
529e61a
Merge branch 'main' into refactor-pattern-logic
ElenaKhaustova Aug 19, 2024
e4cb21c
Merge branch 'main' into refactor-pattern-logic
ElenaKhaustova Aug 21, 2024
50bc816
Moved pattern logic out of data catalog, implemented KedroDataCatalog
ElenaKhaustova Aug 21, 2024
6dfbcb0
Merge branch 'main' into 4110-move-pattern-resolution-logic
ElenaKhaustova Aug 22, 2024
9346f08
KedroDataCatalog updates
ElenaKhaustova Aug 22, 2024
9568e29
Added property to return config
ElenaKhaustova Aug 28, 2024
86efdfe
Merge branch 'main' into 4110-move-pattern-resolution-logic
ElenaKhaustova Aug 28, 2024
5e27660
Added list patterns method
ElenaKhaustova Aug 28, 2024
72b11d0
Renamed and moved ConfigResolver
ElenaKhaustova Aug 29, 2024
f0a4090
Renamed ConfigResolver
ElenaKhaustova Aug 29, 2024
a4da52a
Merge branch 'main' into 4110-move-pattern-resolution-logic
ElenaKhaustova Aug 29, 2024
7d6227f
Cleaned KedroDataCatalog
ElenaKhaustova Aug 29, 2024
4092291
Cleaned up DataCatalogConfigResolver
ElenaKhaustova Aug 29, 2024
63e47f9
Docs build fix attempt
ElenaKhaustova Aug 30, 2024
68f6527
Removed KedroDataCatalog
ElenaKhaustova Sep 5, 2024
2ac4a2f
Updated from_config method
ElenaKhaustova Sep 5, 2024
cb5879d
Updated constructor and add methods
ElenaKhaustova Sep 5, 2024
9038e96
Updated _get_dataset method
ElenaKhaustova Sep 5, 2024
cc89565
Updated __contains__
ElenaKhaustova Sep 5, 2024
59b6764
Updated __eq__ and shallow_copy
ElenaKhaustova Sep 5, 2024
4f5a3fb
Added __iter__ and __getitem__
ElenaKhaustova Sep 5, 2024
12ed6f2
Removed unused imports
ElenaKhaustova Sep 5, 2024
a106cec
Added TODO
ElenaKhaustova Sep 5, 2024
6df04f7
Updated runner.run()
ElenaKhaustova Sep 5, 2024
8566e27
Updated session
ElenaKhaustova Sep 5, 2024
2dcea33
Added confil_resolver property
ElenaKhaustova Sep 5, 2024
a46597f
Updated catalog list command
ElenaKhaustova Sep 5, 2024
3787545
Updated catalog create command
ElenaKhaustova Sep 5, 2024
68d612d
Updated catalog rank command
ElenaKhaustova Sep 5, 2024
af5bee9
Updated catalog resolve command
ElenaKhaustova Sep 5, 2024
acc4d6e
Merge branch 'main' into 4110-move-pattern-resolution-logic
ElenaKhaustova Sep 5, 2024
e67ff0f
Remove some methods
ElenaKhaustova Sep 5, 2024
7b3afa2
Removed ds configs from catalog
ElenaKhaustova Sep 6, 2024
658a759
Fixed lint
ElenaKhaustova Sep 6, 2024
7be2a8e
Fixed typo
ElenaKhaustova Sep 6, 2024
09f3f26
Merge branch 'main' into 4110-move-pattern-resolution-logic
ElenaKhaustova Sep 6, 2024
9e43a9a
Added module docstring
ElenaKhaustova Sep 6, 2024
25b6501
Removed None from Pattern type
ElenaKhaustova Sep 6, 2024
3a646de
Fixed docs failing to find class reference
ElenaKhaustova Sep 6, 2024
5e5df4a
Fixed docs failing to find class reference
ElenaKhaustova Sep 6, 2024
aa59a35
Updated Patterns type
ElenaKhaustova Sep 6, 2024
c7efa3e
Fix tests (#4149)
ankatiyar Sep 6, 2024
023ffc6
Returned constants to avoid breaking changes
ElenaKhaustova Sep 6, 2024
585b44f
Minor fix
ElenaKhaustova Sep 6, 2024
e447078
Updated test_sorting_order_with_other_dataset_through_extra_pattern
ElenaKhaustova Sep 9, 2024
beb0165
Merge branch 'main' into 4110-move-pattern-resolution-logic
ElenaKhaustova Sep 9, 2024
975e968
Removed odd properties
ElenaKhaustova Sep 9, 2024
11d782c
Updated tests
ElenaKhaustova Sep 9, 2024
e4abd23
Removed None from _fetch_credentials input
ElenaKhaustova Sep 9, 2024
6433dd8
Renamed DataCatalogConfigResolver to CatalogConfigResolver
ElenaKhaustova Sep 10, 2024
355576f
Renamed _init_configs to _resolve_config_credentials
ElenaKhaustova Sep 10, 2024
39d9ff6
Moved functions to the class
ElenaKhaustova Sep 10, 2024
659c9da
Refactored resolve_dataset_pattern
ElenaKhaustova Sep 10, 2024
840b32a
Fixed refactored part
ElenaKhaustova Sep 10, 2024
77f551c
Changed the order of arguments for DataCatalog constructor
ElenaKhaustova Sep 10, 2024
6e079a1
Replaced __getitem__ with .get()
ElenaKhaustova Sep 10, 2024
1f7e5f8
Updated catalog commands
ElenaKhaustova Sep 10, 2024
80f0e3d
Moved warm up block outside of the try block
ElenaKhaustova Sep 10, 2024
017cda3
Fixed linter
ElenaKhaustova Sep 10, 2024
cab6f06
Removed odd copying
ElenaKhaustova Sep 10, 2024
8f604d1
Updated release notes
ElenaKhaustova Sep 11, 2024
9a4db18
Returned DatasetError
ElenaKhaustova Sep 11, 2024
0a6946a
Added _dataset_patterns and _default_pattern to _config_resolver to a…
ElenaKhaustova Sep 11, 2024
fee7bd6
Made resolve_dataset_pattern return just dict
ElenaKhaustova Sep 11, 2024
f5a7992
Fixed linter
ElenaKhaustova Sep 11, 2024
1c981f3
Added Catalogprotocol draft
ElenaKhaustova Sep 11, 2024
6128be7
Implemented CatalogProtocol
ElenaKhaustova Sep 12, 2024
8c91d0e
Updated types
ElenaKhaustova Sep 12, 2024
18d2ba0
Fixed linter
ElenaKhaustova Sep 12, 2024
d48c6d3
Added _ImplementsCatalogProtocolValidator
ElenaKhaustova Sep 12, 2024
45ce6bc
Updated docstrings
ElenaKhaustova Sep 12, 2024
6ca972f
Fixed tests
ElenaKhaustova Sep 12, 2024
fdce5ea
Fixed docs
ElenaKhaustova Sep 12, 2024
3029963
Excluded Potocol from coverage
ElenaKhaustova Sep 12, 2024
0150a21
Merge branch 'main' into 4138-catalog-protocol
ElenaKhaustova Sep 12, 2024
0833a84
Fixed docs
ElenaKhaustova Sep 12, 2024
0ec1f23
Removed reference to DataCatalog in docstrings
ElenaKhaustova Sep 13, 2024
741b682
Removed add_all from protocol
ElenaKhaustova Sep 13, 2024
78feb51
Updated docstrings
ElenaKhaustova Sep 13, 2024
6bf912c
Updated docstrings
ElenaKhaustova Sep 13, 2024
c9c7c9a
Fixed docstrings
ElenaKhaustova Sep 16, 2024
c66df33
Merge branch 'main' into 4138-catalog-protocol
ElenaKhaustova Sep 16, 2024
4d92c33
Updated RELEASE.md
ElenaKhaustova Sep 17, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions RELEASE.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
# Upcoming Release

## Major features and improvements
* Implemented `Protocol` abstraction for the current `DataCatalog` and adding new catalog implementations.
* Refactored `kedro run` and `kedro catalog` commands.
* Moved pattern resolution logic from `DataCatalog` to a separate component - `CatalogConfigResolver`. Updated `DataCatalog` to use `CatalogConfigResolver` internally.
* Made packaged Kedro projects return `session.run()` output to be used when running it in the interactive environment.
Expand Down
2 changes: 2 additions & 0 deletions docs/source/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -130,6 +130,7 @@
"kedro.io.catalog_config_resolver.CatalogConfigResolver",
"kedro.io.core.AbstractDataset",
"kedro.io.core.AbstractVersionedDataset",
"kedro.io.core.CatalogProtocol",
"kedro.io.core.DatasetError",
"kedro.io.core.Version",
"kedro.io.data_catalog.DataCatalog",
Expand Down Expand Up @@ -170,6 +171,7 @@
"None. Update D from mapping/iterable E and F.",
"Patterns",
"CatalogConfigResolver",
"CatalogProtocol",
),
"py:data": (
"typing.Any",
Expand Down
20 changes: 10 additions & 10 deletions kedro/framework/context/context.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@

from kedro.config import AbstractConfigLoader, MissingConfigException
from kedro.framework.project import settings
from kedro.io import DataCatalog # noqa: TCH001
from kedro.io import CatalogProtocol, DataCatalog # noqa: TCH001
from kedro.pipeline.transcoding import _transcode_split

if TYPE_CHECKING:
Expand Down Expand Up @@ -123,7 +123,7 @@ def _convert_paths_to_absolute_posix(
return conf_dictionary


def _validate_transcoded_datasets(catalog: DataCatalog) -> None:
def _validate_transcoded_datasets(catalog: CatalogProtocol) -> None:
"""Validates transcoded datasets are correctly named
Args:
Expand Down Expand Up @@ -178,13 +178,13 @@ class KedroContext:
)

@property
def catalog(self) -> DataCatalog:
"""Read-only property referring to Kedro's ``DataCatalog`` for this context.
def catalog(self) -> CatalogProtocol:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Is this how type hint supposed to be done with Protocol? I roughly understand how Protocol works like traits/interface but I haven't seen much in a real codebase.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, cause this is type in the end, see examples here: https://peps.python.org/pep-0544/#protocol-members

"""Read-only property referring to Kedro's catalog` for this context.
Returns:
DataCatalog defined in `catalog.yml`.
catalog defined in `catalog.yml`.
Raises:
KedroContextError: Incorrect ``DataCatalog`` registered for the project.
KedroContextError: Incorrect catalog registered for the project.
"""
return self._get_catalog()
Expand Down Expand Up @@ -213,13 +213,13 @@ def _get_catalog(
self,
save_version: str | None = None,
load_versions: dict[str, str] | None = None,
) -> DataCatalog:
"""A hook for changing the creation of a DataCatalog instance.
) -> CatalogProtocol:
"""A hook for changing the creation of a catalog instance.
Returns:
DataCatalog defined in `catalog.yml`.
catalog defined in `catalog.yml`.
Raises:
KedroContextError: Incorrect ``DataCatalog`` registered for the project.
KedroContextError: Incorrect catalog registered for the project.
"""
# '**/catalog*' reads modular pipeline configs
Expand Down
28 changes: 14 additions & 14 deletions kedro/framework/hooks/specs.py
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@

if TYPE_CHECKING:
from kedro.framework.context import KedroContext
from kedro.io import DataCatalog
from kedro.io import CatalogProtocol
from kedro.pipeline import Pipeline
from kedro.pipeline.node import Node

Expand All @@ -22,7 +22,7 @@ class DataCatalogSpecs:
@hook_spec
def after_catalog_created( # noqa: PLR0913
self,
catalog: DataCatalog,
catalog: CatalogProtocol,
conf_catalog: dict[str, Any],
conf_creds: dict[str, Any],
feed_dict: dict[str, Any],
Expand Down Expand Up @@ -53,7 +53,7 @@ class NodeSpecs:
def before_node_run(
self,
node: Node,
catalog: DataCatalog,
catalog: CatalogProtocol,
inputs: dict[str, Any],
is_async: bool,
session_id: str,
Expand All @@ -63,7 +63,7 @@ def before_node_run(

Args:
node: The ``Node`` to run.
catalog: A ``DataCatalog`` containing the node's inputs and outputs.
catalog: An implemented instance of ``CatalogProtocol`` containing the node's inputs and outputs.
inputs: The dictionary of inputs dataset.
The keys are dataset names and the values are the actual loaded input data,
not the dataset instance.
Expand All @@ -81,7 +81,7 @@ def before_node_run(
def after_node_run( # noqa: PLR0913
self,
node: Node,
catalog: DataCatalog,
catalog: CatalogProtocol,
inputs: dict[str, Any],
outputs: dict[str, Any],
is_async: bool,
Expand All @@ -93,7 +93,7 @@ def after_node_run( # noqa: PLR0913

Args:
node: The ``Node`` that ran.
catalog: A ``DataCatalog`` containing the node's inputs and outputs.
catalog: An implemented instance of ``CatalogProtocol`` containing the node's inputs and outputs.
inputs: The dictionary of inputs dataset.
The keys are dataset names and the values are the actual loaded input data,
not the dataset instance.
Expand All @@ -110,7 +110,7 @@ def on_node_error( # noqa: PLR0913
self,
error: Exception,
node: Node,
catalog: DataCatalog,
catalog: CatalogProtocol,
inputs: dict[str, Any],
is_async: bool,
session_id: str,
Expand All @@ -122,7 +122,7 @@ def on_node_error( # noqa: PLR0913
Args:
error: The uncaught exception thrown during the node run.
node: The ``Node`` to run.
catalog: A ``DataCatalog`` containing the node's inputs and outputs.
catalog: An implemented instance of ``CatalogProtocol`` containing the node's inputs and outputs.
inputs: The dictionary of inputs dataset.
The keys are dataset names and the values are the actual loaded input data,
not the dataset instance.
Expand All @@ -137,7 +137,7 @@ class PipelineSpecs:

@hook_spec
def before_pipeline_run(
self, run_params: dict[str, Any], pipeline: Pipeline, catalog: DataCatalog
self, run_params: dict[str, Any], pipeline: Pipeline, catalog: CatalogProtocol
) -> None:
"""Hook to be invoked before a pipeline runs.

Expand All @@ -164,7 +164,7 @@ def before_pipeline_run(
}

pipeline: The ``Pipeline`` that will be run.
catalog: The ``DataCatalog`` to be used during the run.
catalog: An implemented instance of ``CatalogProtocol`` to be used during the run.
"""
pass

Expand All @@ -174,7 +174,7 @@ def after_pipeline_run(
run_params: dict[str, Any],
run_result: dict[str, Any],
pipeline: Pipeline,
catalog: DataCatalog,
catalog: CatalogProtocol,
) -> None:
"""Hook to be invoked after a pipeline runs.

Expand Down Expand Up @@ -202,7 +202,7 @@ def after_pipeline_run(

run_result: The output of ``Pipeline`` run.
pipeline: The ``Pipeline`` that was run.
catalog: The ``DataCatalog`` used during the run.
catalog: An implemented instance of ``CatalogProtocol`` used during the run.
"""
pass

Expand All @@ -212,7 +212,7 @@ def on_pipeline_error(
error: Exception,
run_params: dict[str, Any],
pipeline: Pipeline,
catalog: DataCatalog,
catalog: CatalogProtocol,
) -> None:
"""Hook to be invoked if a pipeline run throws an uncaught Exception.
The signature of this error hook should match the signature of ``before_pipeline_run``
Expand Down Expand Up @@ -242,7 +242,7 @@ def on_pipeline_error(
}

pipeline: The ``Pipeline`` that will was run.
catalog: The ``DataCatalog`` used during the run.
catalog: An implemented instance of ``CatalogProtocol`` used during the run.
"""
pass

Expand Down
25 changes: 23 additions & 2 deletions kedro/framework/project/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@
from dynaconf import LazySettings
from dynaconf.validator import ValidationError, Validator

from kedro.io import CatalogProtocol
from kedro.pipeline import Pipeline, pipeline

if TYPE_CHECKING:
Expand Down Expand Up @@ -59,6 +60,25 @@ def validate(
)


class _ImplementsCatalogProtocolValidator(Validator):
"""A validator to check if the supplied setting value is a subclass of the default class"""

def validate(
self, settings: dynaconf.base.Settings, *args: Any, **kwargs: Any
) -> None:
super().validate(settings, *args, **kwargs)

protocol = CatalogProtocol
for name in self.names:
setting_value = getattr(settings, name)
if not isinstance(setting_value(), protocol):
raise ValidationError(
f"Invalid value '{setting_value.__module__}.{setting_value.__qualname__}' "
f"received for setting '{name}'. It must implement "
f"'{protocol.__module__}.{protocol.__qualname__}'."
)


class _HasSharedParentClassValidator(Validator):
"""A validator to check that the parent of the default class is an ancestor of
the settings value."""
Expand Down Expand Up @@ -115,8 +135,9 @@ class _ProjectSettings(LazySettings):
_CONFIG_LOADER_ARGS = Validator(
"CONFIG_LOADER_ARGS", default={"base_env": "base", "default_run_env": "local"}
)
_DATA_CATALOG_CLASS = _IsSubclassValidator(
"DATA_CATALOG_CLASS", default=_get_default_class("kedro.io.DataCatalog")
_DATA_CATALOG_CLASS = _ImplementsCatalogProtocolValidator(
"DATA_CATALOG_CLASS",
default=_get_default_class("kedro.io.DataCatalog"),
)

def __init__(self, *args: Any, **kwargs: Any):
Expand Down
2 changes: 2 additions & 0 deletions kedro/io/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@
from .core import (
AbstractDataset,
AbstractVersionedDataset,
CatalogProtocol,
DatasetAlreadyExistsError,
DatasetError,
DatasetNotFoundError,
Expand All @@ -23,6 +24,7 @@
"AbstractDataset",
"AbstractVersionedDataset",
"CachedDataset",
"CatalogProtocol",
"DataCatalog",
"CatalogConfigResolver",
"DatasetAlreadyExistsError",
Expand Down
79 changes: 78 additions & 1 deletion kedro/io/core.py
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,15 @@
from glob import iglob
from operator import attrgetter
from pathlib import Path, PurePath, PurePosixPath
from typing import TYPE_CHECKING, Any, Callable, Generic, TypeVar
from typing import (
TYPE_CHECKING,
Any,
Callable,
Generic,
Protocol,
TypeVar,
runtime_checkable,
)
from urllib.parse import urlsplit

from cachetools import Cache, cachedmethod
Expand All @@ -29,6 +37,8 @@
if TYPE_CHECKING:
import os

from kedro.io.catalog_config_resolver import CatalogConfigResolver, Patterns

VERSION_FORMAT = "%Y-%m-%dT%H.%M.%S.%fZ"
VERSIONED_FLAG_KEY = "versioned"
VERSION_KEY = "version"
Expand Down Expand Up @@ -871,3 +881,70 @@ def validate_on_forbidden_chars(**kwargs: Any) -> None:
raise DatasetError(
f"Neither white-space nor semicolon are allowed in '{key}'."
)


_C = TypeVar("_C")
ElenaKhaustova marked this conversation as resolved.
Show resolved Hide resolved


@runtime_checkable
class CatalogProtocol(Protocol[_C]):
_datasets: dict[str, AbstractDataset]

def __contains__(self, ds_name: str) -> bool:
"""Check if a dataset is in the catalog."""
...

@property
def config_resolver(self) -> CatalogConfigResolver:
"""Return a copy of the datasets dictionary."""
...

@classmethod
def from_config(cls, catalog: dict[str, dict[str, Any]] | None) -> _C:
"""Create a catalog instance from configuration."""
...

def _get_dataset(
ElenaKhaustova marked this conversation as resolved.
Show resolved Hide resolved
self,
dataset_name: str,
version: Any = None,
suggest: bool = True,
) -> AbstractDataset:
"""Retrieve a dataset by its name."""
...

def list(self, regex_search: str | None = None) -> list[str]:
"""List all dataset names registered in the catalog."""
...

def save(self, name: str, data: Any) -> None:
"""Save data to a registered dataset."""
...

def load(self, name: str, version: str | None = None) -> Any:
"""Load data from a registered dataset."""
...

def add(self, ds_name: str, dataset: Any, replace: bool = False) -> None:
"""Add a new dataset to the catalog."""
...

def add_feed_dict(self, datasets: dict[str, Any], replace: bool = False) -> None:
"""Add datasets to the catalog using the data provided through the `feed_dict`."""
...

def exists(self, name: str) -> bool:
"""Checks whether registered data set exists by calling its `exists()` method."""
...

def release(self, name: str) -> None:
"""Release any cached data associated with a dataset."""
...

def confirm(self, name: str) -> None:
"""Confirm a dataset by its name."""
...

def shallow_copy(self, extra_dataset_patterns: Patterns | None = None) -> _C:
"""Returns a shallow copy of the current object."""
...
Loading