Add node as an argument to all datasets' hook #2296
Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
Signed-off-by: Nok <nok_lam_chan@mckinsey.com>
Signed-off-by: Nok <nok_lam_chan@mckinsey.com>
Looks like a clean implementation.
Would you like me to manually test these changes?
@@ -523,21 +523,20 @@ def test_broken_input_update_parallel(
mock_session_with_broken_before_node_run_hooks.run(runner=ParallelRunner())


def wait_and_identity(*args: Any):
This is a nicer implementation of this fixture ✅.
Out of curiosity, could you explain what you meant in the commit message 'move the inner function out to make the node serializable'?
@jmholzer Sure. The original fixture defined an inner function; such an object is not picklable and does not work well with multiprocessing. This wasn't an issue until node was added as an argument for hooks. You can simply move the function back inside the fixture and run the test to see it.
IIRC it's the TestAsyncDataSet.
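For illustration, here is a minimal sketch of the pickling point above. It is not the fixture from this PR: `module_level_identity`, `make_inner_identity` and `is_picklable` are hypothetical names made up for the example. It only shows that the standard `pickle` module rejects a function defined inside another function, which is what process-based runners run into when they have to serialize a node that wraps one.

```python
import pickle
from typing import Any


def module_level_identity(*args: Any):
    """Defined at module level, so pickle can refer to it by its qualified name."""
    return args


def make_inner_identity():
    """Return a function defined inside another function (a 'local' object)."""

    def inner_identity(*args: Any):
        return args

    return inner_identity


def is_picklable(obj: Any) -> bool:
    """Report whether pickle.dumps accepts the object."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError):
        # Local functions are rejected with one of these, depending on the Python version.
        return False
    return True


# The nested function cannot be pickled, so it cannot be shipped to a worker process.
print(is_picklable(make_inner_identity()))
```

Moving the function to module level, as the updated fixture does with `wait_and_identity`, lets pickle store it by reference instead.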
You are more than welcome to test it manually 😁
TIL, thanks 😃
I tried it out, looks good ✅.
LGTM! 👍
Only left some small comments on doc strings. This should also be included in the release notes, and perhaps also update this page for consistency: https://kedro.readthedocs.io/en/stable/hooks/common_use_cases.html#use-hooks-to-customise-the-dataset-load-and-save-methods
kedro/framework/hooks/specs.py
"""Hook to be invoked before a dataset is loaded from the catalog.

Args:
    dataset_name: name of the dataset to be loaded from the catalog.
    node: node: The ``Node`` to ran.
Suggested change:
- node: node: The ``Node`` to ran.
+ node: The ``Node`` to run.
I made some more changes to the documentation; the previous example was buggy and wouldn't even print the log. I also considered updating the example for MemoryProfilingHooks, but decided against it, as that example is purely about dataset profiling, so the node argument isn't useful there.
A couple of changes:
- logging.getLogger(self.__class__.__name__) -> logging.getLogger(__name__): this makes sure the log is printed properly with the default logging.yml.
- Instead of printing the absolute timestamp, the hook now prints how long it takes to load a dataset.
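As a rough sketch of the kind of hook these documentation changes describe (not the exact text of the docs page): it takes its logger from `logging.getLogger(__name__)`, uses the new `node` argument, and logs the elapsed load time rather than an absolute timestamp. The class name `DatasetLoadTimeHooks`, the log message and the stand-in node object are assumptions for illustration; the `hook_impl` import is guarded only so the sketch runs without Kedro installed.

```python
import logging
import time
from types import SimpleNamespace
from typing import Any, Dict

try:
    from kedro.framework.hooks import hook_impl
except ImportError:
    # Assumption for this sketch only: a no-op stand-in when Kedro is absent.
    def hook_impl(func):
        return func


logger = logging.getLogger(__name__)


class DatasetLoadTimeHooks:  # hypothetical class name
    def __init__(self) -> None:
        # Start time of each in-flight load, keyed by dataset name.
        self._timers: Dict[str, float] = {}

    @hook_impl
    def before_dataset_loaded(self, dataset_name: str, node: Any) -> None:
        self._timers[dataset_name] = time.perf_counter()

    @hook_impl
    def after_dataset_loaded(self, dataset_name: str, data: Any, node: Any) -> None:
        elapsed = time.perf_counter() - self._timers.pop(dataset_name)
        logger.info(
            "Loading dataset %s before node '%s' took %.2f seconds",
            dataset_name,
            node.name,
            elapsed,
        )


# Usage with a stand-in object that only mimics a Node's `name` attribute.
demo_node = SimpleNamespace(name="demo_node")
hooks = DatasetLoadTimeHooks()
hooks.before_dataset_loaded(dataset_name="example_dataset", node=demo_node)
hooks.after_dataset_loaded(dataset_name="example_dataset", data=None, node=demo_node)
```

In a real project the class would be registered through the project's hook settings, and Kedro would pass in the actual `Node` being run.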
Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
Signed-off-by: Nok <nok_lam_chan@mckinsey.com>
LGTM! 👍 😄
Signed-off-by: Nok Chan nok.lam.chan@quantumblack.com
Description
See #2271 for more details. To support wider use cases with dataset hook.
- [...] (_run_node_sequential and _run_node_async) need to be modified to include node
Development notes
- [...] node argument.
- [...] runner.py to include node argument
Checklist
- [...] RELEASE.md file