Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve add_feed_dict docstrings #4009

Merged
merged 8 commits into from
Jul 17, 2024
Merged
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
39 changes: 27 additions & 12 deletions kedro/io/data_catalog.py
Original file line number Diff line number Diff line change
Expand Up @@ -681,29 +681,41 @@ def add_all(
self.add(name, dataset, replace)

def add_feed_dict(self, feed_dict: dict[str, Any], replace: bool = False) -> None:
"""Adds instances of ``MemoryDataset``, containing the data provided
through feed_dict.
"""This function adds datasets to the ``DataCatalog`` using the data provided through the `feed_dict`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we have a style guide for docstrings yet? In any case, following Pydocstyle D404 and D401,

Suggested change
"""This function adds datasets to the ``DataCatalog`` using the data provided through the `feed_dict`.
"""Add datasets to the ``DataCatalog`` using the data provided through the `feed_dict`.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add additional ?

Suggested change
"""This function adds datasets to the ``DataCatalog`` using the data provided through the `feed_dict`.
"""Add additional datasets to the ``DataCatalog`` using the data provided through the `feed_dict`.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't fully understand why we need additional here, as we can add them to the empty catalog as well. So, it looks like we do not need to specify any adding conditions.

ElenaKhaustova marked this conversation as resolved.
Show resolved Hide resolved

`feed_dict` dictionary key is used as a dataset name, and a value is used to create an instance of
``MemoryDataset`` before adding to the ``DataCatalog`` for all the value types except of ``AbstractDataset``.
In the last case, the ``AbstractDataset`` is added as it is.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about

Suggested change
`feed_dict` dictionary key is used as a dataset name, and a value is used to create an instance of
``MemoryDataset`` before adding to the ``DataCatalog`` for all the value types except of ``AbstractDataset``.
In the last case, the ``AbstractDataset`` is added as it is.
`feed_dict` keys are used as dataset names, and values can either be raw data or instances of ``AbstractDataset``.
In the former case, instances of ``MemoryDataset`` are automatically created before adding to the ``DataCatalog``.

?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually don't think mentioning AbstractDataset is useful here, we shouldn't really have instances of an interface right? Does the user gain anything for understanding this implementation detail?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. I rephrased it, but I still think it's useful to distinguish between the two cases when adding raw data and a dataset.


Has `from_dict` alias, so it can be called using `from_dict` name instead.
Will be fully renamed to `from_dict` in the `0.20.0` version.

Args:
feed_dict: A feed dict with data to be added in memory.
replace: Specifies whether to replace an existing dataset
with the same name is allowed.
feed_dict: A feed dict with data to be added to the ``DataCatalog``.
replace: Specifies whether to replace an existing dataset with the same name in the ``DataCatalog``.

Example:
::

>>> from kedro_datasets.pandas import CSVDataset
>>> import pandas as pd
>>>
>>> df = pd.DataFrame({'col1': [1, 2],
>>> 'col2': [4, 5],
>>> 'col3': [5, 6]})
>>> df = pd.DataFrame({"col1": [1, 2],
>>> "col2": [4, 5],
>>> "col3": [5, 6]})
>>>
>>> io = DataCatalog()
>>> io.add_feed_dict({
>>> 'data': df
>>> catalog = DataCatalog()
>>> catalog.add_feed_dict({
>>> "data_df": df
>>> }, replace=True)
>>>
>>> assert io.load("data").equals(df)
>>> assert catalog.load("data_df").equals(df)
>>>
>>> csv_dataset = CSVDataset(filepath="test.csv")
>>> csv_dataset.save(df)
>>> catalog.add_feed_dict({"data_csv_dataset": csv_dataset})
>>>
>>> assert catalog.load("data_csv_dataset").equals(df)
"""
for dataset_name in feed_dict:
if isinstance(feed_dict[dataset_name], AbstractDataset):
Expand All @@ -713,6 +725,9 @@ def add_feed_dict(self, feed_dict: dict[str, Any], replace: bool = False) -> Non

self.add(dataset_name, dataset, replace)

# Alias for add_feed_dict method
from_dict = add_feed_dict
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pedantic nitpick, but when I read from_dict I assume it's a classmethod, like

class DataCatalog:
  @classmethod
  def from_dict(cls, data_dict: dict[str, AbstractDataset]):
    ...
    return cls

So not sure if I'd add this alias.

(I know this comes from #3612 (comment) and that I asserted, but my response there is incomplete)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually quite a significant change. I think we should discuss first what a good alternative is, before just adding an alias.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has always been an undiscovered part of the API, I agree with @merelcht that we should think really hard / research what a better name should be.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed an alias until we are sure about adding it and agree on its name.

To @astrojuanlu's point, if we decide to keep an alias, we can use the add_from_dict name instead.

Or, as an alternative, we can add a new implementation (add_from_dict / add_datasets) and use it inside add_feed_dict with a deprecation warning.


def list(self, regex_search: str | None = None) -> list[str]:
"""
List of all dataset names registered in the catalog.
Expand Down