Improve `_FrozenDatasets` class #3610

merelcht · 2024-02-09T15:40:54Z

Description

catalog.datasets returns an instance of the _FrozenDatasets class, and it is supposed to be the official (and public) way for users to access datasets from the catalog through the API. However, several team members have flagged that this class isn't very easy to work with:

It has no simple interface for the datasets it contains. Currently, the only linter-friendly ways are to use vars(catalog.datasets)[dataset_name] or catalog.datasets.__dict__[dataset_name].
The class is poorly documented; it would be good to have docstrings as the purpose of this class is not easy to grok.
There is a lot going on inside __init__; delegating most of this to a few new, well-documented methods would also make this class much easier to understand.

On top of that, we have evidence that users frequently resort to private methods and attributes to access datasets, e.g. catalog._datasets and _get_dataset(). So it also seems that the class doesn't sufficiently allow users to access datasets through the API. See e.g. https://github.com/Galileo-Galilei/kedro-mlflow/blob/e88679938b1d4c7633c3f631f6b402ff11ab61fe/kedro_mlflow/framework/hooks/mlflow_hook.py#L148

Observations about `_FrozenDatasets`

It is not iterable, so a dataset can only be accessed when you already know its name.
Dataset names are modified so that all non-letter characters are converted to __: https://github.com/kedro-org/kedro/blob/main/kedro/io/data_catalog.py#L86-L95
The main problem with _FrozenDatasets seems to be access, because of the name conversions.
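
To make the access problem concrete, here is a minimal stand-in, not the Kedro class itself. `FrozenDatasetsSketch` and the dataset names are hypothetical, and the `\W+` pattern is an assumption about how the linked lines perform the conversion.

```python
import re
from typing import Mapping

# Assumption: each run of non-word characters becomes a double underscore.
_NON_WORD = re.compile(r"\W+")


class FrozenDatasetsSketch:
    """Hypothetical stand-in for _FrozenDatasets, for illustration only."""

    def __init__(self, datasets: Mapping[str, object]) -> None:
        # Store every dataset as an instance attribute under its converted name.
        self.__dict__.update(
            {_NON_WORD.sub("__", name): ds for name, ds in datasets.items()}
        )


datasets = FrozenDatasetsSketch({"cars": "csv", "params:model.lr": 0.1})

# Attribute access needs the converted name, not the catalog name.
converted = datasets.params__model__lr

# The linter-friendly string lookup also needs the converted name.
by_vars = vars(datasets)["params__model__lr"]

# The original catalog name is not a valid key.
has_original = "params:model.lr" in vars(datasets)

# The class defines neither __iter__ nor __getitem__, so it cannot be iterated.
try:
    iter(datasets)
    iterable = True
except TypeError:
    iterable = False
```

A caller who only knows the catalog name has to reproduce the conversion before looking the dataset up, which is the friction described above.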

Context

The reason why it's not straightforward to fetch datasets from the catalog directly is that the catalog was designed to hide the dataset details and implementation. It's meant for loading and saving data, not for modifying it in any way. The _FrozenDatasets class was added to make tab completion possible for catalog datasets in IPython or Jupyter sessions. The PR that added this functionality is on private-kedro: https://github.com/quantumblacklabs/private-kedro/pull/84/files. It's important to note that _FrozenDatasets needs to be immutable; if users want to inject data, they should use hooks.
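
The two requirements in this section (tab completion and immutability) can be sketched together as follows. This is an illustration under assumptions, not Kedro's code: `ImmutableNamespace` and the error message are made up for the example.

```python
from typing import Any


class ImmutableNamespace:
    """Hypothetical illustration of a read-only attribute namespace."""

    def __init__(self, **datasets: Any) -> None:
        # Write through __dict__ so construction bypasses __setattr__.
        self.__dict__.update(datasets)

    def __setattr__(self, key: str, value: Any) -> None:
        # Reject both new attributes and reassignment of existing ones.
        raise AttributeError("Operation not allowed: datasets are read-only.")


ns = ImmutableNamespace(cars="cars dataset")

# Instance attributes are listed by dir(), which interactive completers
# typically consult when offering tab completion.
completions = [name for name in dir(ns) if not name.startswith("_")]

try:
    ns.cars = "something else"
    blocked = False
except AttributeError:
    blocked = True
```

Note that this only guards plain attribute assignment. Writing to `ns.__dict__` directly still works, so the immutability is a convention enforced at the attribute level rather than a hard guarantee.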

Improvement suggestions

Make _FrozenDatasets inherit from UserDict
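
A rough sketch of what this suggestion could look like, assuming we want a read-only mapping keyed by the original dataset names. `FrozenDatasetsDict` and its error messages are hypothetical and not an agreed design.

```python
from collections import UserDict
from typing import Any, Mapping


class FrozenDatasetsDict(UserDict):
    """Hypothetical read-only mapping of dataset names to datasets."""

    def __init__(self, datasets: Mapping[str, Any]) -> None:
        super().__init__()
        # Fill the backing dict directly because __setitem__ is blocked below.
        self.data.update(datasets)

    def __setitem__(self, key: str, value: Any) -> None:
        raise TypeError("Datasets are read-only; use hooks to inject data.")

    def __delitem__(self, key: str) -> None:
        raise TypeError("Datasets are read-only.")


datasets = FrozenDatasetsDict(
    {"cars": "cars dataset", "namespace.boats": "boats dataset"}
)

# Iteration and lookup work with the original, unconverted names.
names = sorted(datasets)
boats = datasets["namespace.boats"]

try:
    datasets["planes"] = "planes dataset"
    blocked = False
except TypeError:
    blocked = True
```

This would address the iteration and name-conversion observations above, but attribute-style tab completion would not come for free, and the backing `.data` dict is still reachable, so a real implementation would need to decide how to handle both.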

Important

The above suggestions are based on ideas from several Kedro engineers, see e.g. #1778. They are mostly solutions to improve developer experience, though, and we need a clear view of what users need as well. Any implementation should be preceded by user research: #1978


noklam · 2024-02-09T16:42:23Z

It's already mentioned that _FrozenDatasets is immutable because it is a public interface. I also want to mention that catalog.datasets.xxx is the easiest way to get auto-completion working in an IDE, and it is a lot easier to type than catalog.load("namespace.x.y.z.a.b.c"), especially in larger pipelines. Because . can be used in dataset names in the catalog, it conflicts with what catalog.datasets.namespace.dataset would mean, so we replace . with __.

See previous PRs:

Gracefully handle non-ASCII chars in dataset names 🌶️ #487

merelcht added this to Kedro Framework Feb 9, 2024

merelcht converted this from a draft issue Feb 9, 2024

merelcht added this to the Redesign the API for IO (catalog) milestone Feb 9, 2024

github-actions bot mentioned this issue Mar 1, 2024

Monthly issue metrics report #3671

Open

kedro-org locked and limited conversation to collaborators Mar 28, 2024

merelcht converted this issue into discussion #3752 Mar 28, 2024

github-project-automation bot moved this to Done in Kedro Framework Mar 28, 2024


This issue was moved to a discussion.
