Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DataCatalog2.0]: KedroDataCatalog #4151

Merged
merged 176 commits into from
Sep 24, 2024
Merged

[DataCatalog2.0]: KedroDataCatalog #4151

merged 176 commits into from
Sep 24, 2024

Conversation

ElenaKhaustova
Copy link
Contributor

@ElenaKhaustova ElenaKhaustova commented Sep 9, 2024

Description

In this PR we add a new catalog KedroDataCatalog, which uses the DataCatalogConfigResolver and addresses:

Please see the suggested order of work in #3995 (comment) and comment below: #4151 (comment)

This PR is done on top of #4160 and relies on CatalogProtocol.

For the reviewers: this PR does not include unit-tests for KedroDataCatalog, they'll be added after the initial feedback.

Development notes

  • We kept some old DataCatalog API to avoid multiple if/else branches depending on catalog type used in context, runner and session
  • Removed _FrozenDatasets and access datasets as properties
  • Added get dataset by name feature: dedicated function and access by key
  • Added iterate over the datasets feature
  • add_feed_dict() simplified and renamed to add_raw_data()
  • Datasets' init was moved out from from_config() method

To test KedroDataCatalog modify your settings.py and run commands as usual:

# settings.py

from kedro.io import KedroDataCatalog
DATA_CATALOG_CLASS = KedroDataCatalog

kedro run
kedro catalog list/rank/resolve/create

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
kedro/io/kedro_data_catalog.py Outdated Show resolved Hide resolved
kedro/io/kedro_data_catalog.py Show resolved Hide resolved
kedro/io/kedro_data_catalog.py Outdated Show resolved Hide resolved
@ElenaKhaustova
Copy link
Contributor Author

Update for the reviewers

Since there is a proposal on the new interface (#4175), we removed some methods (__getitem__, __setitem__, __iter__) from the current implementation since they will most probably change during the implementation of the proposal.

If the proposal won't go through, they'll be added in a separate PR, so we do not block the current one.

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done a first review and added (mostly nit) comments. I'll do another review and have a proper look at the tests tomorrow.

tests/io/conftest.py Outdated Show resolved Hide resolved
kedro/io/kedro_data_catalog.py Outdated Show resolved Hide resolved
kedro/io/kedro_data_catalog.py Outdated Show resolved Hide resolved
kedro/io/kedro_data_catalog.py Outdated Show resolved Hide resolved
kedro/io/kedro_data_catalog.py Outdated Show resolved Hide resolved
kedro/io/kedro_data_catalog.py Outdated Show resolved Hide resolved
kedro/io/kedro_data_catalog.py Outdated Show resolved Hide resolved
kedro/io/kedro_data_catalog.py Show resolved Hide resolved
kedro/io/kedro_data_catalog.py Show resolved Hide resolved
kedro/io/kedro_data_catalog.py Show resolved Hide resolved
from kedro.utils import _format_rich, _has_rich_handler


class KedroDataCatalog:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it we should do it in here, since it will make sure we adhere to it. While it's optional, it's beneficial at no cost for us. Other implementers do not need to extend it though.

docs/source/conf.py Show resolved Hide resolved
kedro/io/kedro_data_catalog.py Outdated Show resolved Hide resolved
kedro/io/kedro_data_catalog.py Show resolved Hide resolved
DatasetNotFoundError: When a dataset with the given name
is not in the collection and do not match patterns.
"""
ds_config = self._config_resolver.resolve_dataset_pattern(ds_name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we resolve only if ds_name not in self._datasets? Or it's just to make the code a bit simpler?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the point was to prevent you from complaining about nested ifs 😅 I moved it inside the condition now.

Copy link
Member

@idanov idanov Sep 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it was just a question, either is fine :)


dataset = self._datasets.get(ds_name, None)

if dataset is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we rearrange this in such a way that we fail first and then continue with the successful path? Currently the flow is as follows:

  • resolve the dataset pattern
  • if not part of the materialised datasets, add from config
  • get the dataset
  • if the dataset does not exist (basically if it cannot be resolved nor existing), go with error scenario
  • otherwise continue with non-error scenario

I think we can make the flow a bit less zig-zagy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, we can only fail after we try to resolve. Otherwise, you get one more layer of if as the logic needs to go inside the if fail [] else [] scenario.

Now the logic is like this:

  • if not part of the materialised datasets, resolve the dataset pattern
  • if resolved, add from config
  • get the dataset
  • if the dataset does not exist (basically if it cannot be resolved nor exists), go with the error scenario
  • otherwise, continue with a non-error scenario

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant is that we should have as a first line something like:

if ds_name not in self._datasets and self._config_resolver.match_pattern():
    ...

and then continue with the error scenario, and then go on with everything else. It'll be much easier to follow this way. Btw while checking if this is possible, I saw a problem in the resolver - it can fail even if it matches, but that should not happen.

elif isinstance(config, str) and "}" in config:
try:
config = config.format_map(resolved_vars.named)
except KeyError as exc:
raise DatasetError(
f"Unable to resolve '{config}' from the pattern '{pattern}'. Keys used in the configuration "
f"should be present in the dataset factory pattern."
) from exc

This ☝️ should have been checked at the config_resolver init time, basically we should not allow to create a config_resolver with unresolvable configs or add invalid configs that cannot be resolved.

Also there's other changes like e.g. resolve_dataset_pattern should be just resolve_pattern similar to all other public methods there, which never include the word dataset (rightfully). Hopefully the resolver API is not released yet, so we can change it now.

Copy link
Contributor Author

@ElenaKhaustova ElenaKhaustova Sep 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Checking for a pattern match is not enough, as the config could already be resolved at the init time. resolve_pattern method encapsulates this, not to expose this logic outside the config_resolver. So we ask the config_resolver to provide a config for a pattern without bothering about how it's happening inside. I don't think we need to move any resolution logic (including any specific checks) to the catalog level.
  2. The suggestion about configs resolution makes sense to me. We can move this validation to the init time and simplify the resolution method. But would do that in a separate PR as it doesn't touch the catalog and will be done at the level of the config resolver.
  3. Method was renamed.

Copy link
Member

@idanov idanov Sep 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Why wouldn't it be enough? We are checking only for failing scenarios, what other failing scenarios would there be apart from not matching a concrete dataset or a pattern? Could we expect other failures of resolution?

In any case, this is a minor thing, let's merge it in as it is and then we can always come back and simplify it.

kedro/io/kedro_data_catalog.py Outdated Show resolved Hide resolved
Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some final small comments, but otherwise I'm very happy with how this looks! Great work @ElenaKhaustova ⭐ 🌟

kedro/io/kedro_data_catalog.py Outdated Show resolved Hide resolved
tests/io/test_kedro_data_catalog.py Outdated Show resolved Hide resolved
tests/io/test_kedro_data_catalog.py Outdated Show resolved Hide resolved
tests/io/test_kedro_data_catalog.py Outdated Show resolved Hide resolved
tests/io/test_kedro_data_catalog.py Outdated Show resolved Hide resolved
tests/io/test_kedro_data_catalog.py Outdated Show resolved Hide resolved
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
This reverts commit 5208321.

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

dataset = self._datasets.get(ds_name, None)

if dataset is None:
Copy link
Member

@idanov idanov Sep 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Why wouldn't it be enough? We are checking only for failing scenarios, what other failing scenarios would there be apart from not matching a concrete dataset or a pattern? Could we expect other failures of resolution?

In any case, this is a minor thing, let's merge it in as it is and then we can always come back and simplify it.

@ElenaKhaustova ElenaKhaustova merged commit 53280bd into main Sep 24, 2024
41 checks passed
@ElenaKhaustova ElenaKhaustova deleted the 3995-data-catalog-2.0 branch September 24, 2024 14:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants