Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add node as an argument to all datasets' hook #2296

Merged
merged 8 commits into from
Feb 13, 2023

Conversation

noklam
Copy link
Contributor

@noklam noklam commented Feb 7, 2023

Signed-off-by: Nok Chan nok.lam.chan@quantumblack.com

NOTE: Kedro datasets are moving from kedro.extras.datasets to a separate kedro-datasets package in
kedro-plugins repository. Any changes to the dataset implementations
should be done by opening a pull request in that repository.

Description

See #2271 for more details. To support wider use cases with dataset hook.

  • all four dataset hook specifications should be updated to include node: Node as an argument
  • all hook calls (in both _run_node_sequential and _run_node_async) need to be modified to include node
  • update tests

Development notes

  • Update all 4 hooks' spec to include node argument.
  • Update the hook call in runner.py to include node argument
  • Update the tests, and move some fixtures around to make sure it's serialisable.

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
@noklam noklam requested a review from idanov as a code owner February 7, 2023 06:52
noklam and others added 2 commits February 7, 2023 17:54
Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
Signed-off-by: Nok <nok_lam_chan@mckinsey.com>
@noklam noklam linked an issue Feb 8, 2023 that may be closed by this pull request
Copy link
Contributor

@jmholzer jmholzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a clean implementation.

Would you like me to manually test these changes?

@@ -523,21 +523,20 @@ def test_broken_input_update_parallel(
mock_session_with_broken_before_node_run_hooks.run(runner=ParallelRunner())


def wait_and_identity(*args: Any):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nicer implementation of this fixture ✅.

Out of curiosity, could you explain what you meant in the commit message 'move the inner function out to make the node serializable'?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmholzer Sure. The original fixture defined an inner function, such object is not picklable and does not work well with multiprocessing. This wasn't an issue until node is added as an argument for hooks. You can simply move the function inside and run the test.

IIRC it's the TestAsyncDataSet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are more than welcome to test it manually 😁

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmholzer Sure. The original fixture defined an inner function, such object is not picklable and does not work well with multiprocessing. This wasn't an issue until node is added as an argument for hooks. You can simply move the function inside and run the test.

IIRC it's the TestAsyncDataSet.

TIL, thanks 😃

You are more than welcome to test it manually 😁

I tried it out, looks good ✅.

Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 👍

Only left some small comments on doc strings. This should also be included in the release notes, and perhaps also update this page for consistency: https://kedro.readthedocs.io/en/stable/hooks/common_use_cases.html#use-hooks-to-customise-the-dataset-load-and-save-methods

"""Hook to be invoked before a dataset is loaded from the catalog.

Args:
dataset_name: name of the dataset to be loaded from the catalog.
node: node: The ``Node`` to ran.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
node: node: The ``Node`` to ran.
node: The ``Node`` to run.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made some more changes to the documentation, the previous one is buggy and won't even print the log. I also considered updating the example for MemoryProfilingHooks, but decided not as that example is purely about dataset profiling thus the node argument isn't useful.

Couple changes

  • logging.getLogger(self._class.__name__) -> logging.getLogger(__name__) - this makes sure the log will be printed properly with the default logging.yml
  • Instead of printing the absolute timestamp, the hook print how many time it takes to load a dataset instead.

kedro/framework/hooks/specs.py Outdated Show resolved Hide resolved
noklam and others added 2 commits February 13, 2023 14:04
Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
Signed-off-by: Nok <nok_lam_chan@mckinsey.com>
Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 👍 😄

@noklam noklam added this pull request to the merge queue Feb 13, 2023
Merged via the queue into main with commit fa8c56f Feb 13, 2023
@noklam noklam deleted the feat/add-new-argument-dataset-hook branch February 13, 2023 09:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add node to dataset hooks
3 participants