[AIR/Train] Add `Trainer.restore` API for train experiment-level fault tolerance
#31920
Merged
94 commits
36b607c justinvyu Add API skeleton for trainer restore
864f2b1 justinvyu Fix some errors (imports, wrong return type)
2af61de justinvyu Save trainer.pkl
fd93dae justinvyu Add object ref check utility
bbbf5fa justinvyu Add `test_trainer_restore`
0e10972 justinvyu Add preprocessor loading in restore+no new preprocessor case (and als…
ded1d27 justinvyu Fix typo
d12e155 justinvyu Change test to check exception type
0a4bec3 justinvyu Fix training failed error capture
a9f10bb justinvyu Fix should fit preprocessor logic for new train run
149863b justinvyu Improve check object refs (search for actor handles too)
4ace2c9 justinvyu Add more unit tests (gbdt, obj ref in train loop/config, obj ref in p…
73d1fde justinvyu Remove unused imports
847978e justinvyu Fix HF test function if no eval dataset is passed
e481610 justinvyu Add trainer w/ init tests + fix gbdt tests
cfb3859 justinvyu Fix lightgbm test
4f3caf1 justinvyu Add to bazel build file
d180135 justinvyu Disable mosaic trainer restore functionality
d5ab478 justinvyu Add config validation
41b2e90 justinvyu Fix case where datasets = None
5de0b9a justinvyu Add test restoring from a different trainer class
56c80e4 justinvyu Fix gbdt test assertion
8c59952 justinvyu Add new args to dummy trainer used in ingest tests
202ea18 justinvyu Merge branch 'master' of https://github.com/ray-project/ray into trai…
1cb225b justinvyu Change to specifying optional restore fields
8638d75 justinvyu Fix for HF special case
484a873 justinvyu Clean up mosaic restore args
b98ae79 justinvyu Add error message for invalid restore kwargs
61af0ad justinvyu Fix 'should fit preprocessor' logic
396fa43 justinvyu Fix missing preprocessor import
4bbd79c justinvyu Revert "Add to bazel build file"
7721ac8 justinvyu Add back to build file without formatting
a486c4e justinvyu Revert "Fix for HF special case"
30393d2 justinvyu Revert "Change to specifying optional restore fields"
2ce1c31 justinvyu Remove validation logic
ecbcd1a justinvyu Fix unit tests
5c03050 justinvyu Merge branch 'master' of https://github.com/ray-project/ray into trai…
74d99cc justinvyu Simplify skipping preprocessor fit logic
3ac12e2 justinvyu Merge branch 'master' of https://github.com/ray-project/ray into trai…
964ebe5 justinvyu Add BaseTrainer.can_restore
f55f5cd justinvyu Always save the trainer pkl, even on restore
d885f15 justinvyu Fill in restore error messages
b070b01 justinvyu Add tests for can restore utility and restoring from invalid dir
e2ff5a9 justinvyu Add docstrings for tests
9adcd38 justinvyu New way of doing restore where param_dict actually gets updated
1f9db34 justinvyu Update tests to actually catch param dict not being updated
b3d8d0e justinvyu Fix loading logic for restored + re-specified preprocessor
0433128 justinvyu Merge branch 'master' of https://github.com/ray-project/ray into trai…
ea122ad justinvyu Add restore docstrings
8abe4a7 justinvyu Revert "Fix training failed error capture"
60cff3f justinvyu Fix tests after reverting error handling change
62aed3e justinvyu Remove duplicate api ref
3945c70 justinvyu Fix preprocessor not found error
f7c341a justinvyu Fix test failures
72cf533 justinvyu Merge branch 'master' of https://github.com/ray-project/ray into trai…
28913f3 justinvyu Update to raise value errors instead of asserts
5080cb3 justinvyu Update test to expect value errors
543b2a8 justinvyu Fix lint
8fabbb7 justinvyu Expand user path in can restore utility
c5eeb53 justinvyu Explicit ray cluster shutdown in tests
1d1e9ba justinvyu Fix typo (fit_status -> fit_status())
ca2a476 justinvyu Add a comment about BaseTrainer._save
c2e24b0 justinvyu Merge branch 'master' of https://github.com/ray-project/ray into trai…
5a75eaa justinvyu Apply suggestions from code review
63a7dc3 justinvyu Add rl trainer restore test
1ed3fc2 justinvyu Merge branch 'master' of https://github.com/ray-project/ray into trai…
9c0d8a8 justinvyu Update trainable param name in tuner restore
43153db justinvyu Merge branch 'train/restore' of https://github.com/justinvyu/ray into…
3f17166 justinvyu Make can restore consistent with the other pr
1f8132e justinvyu Merge branch 'master' of https://github.com/ray-project/ray into trai…
334a69a justinvyu Add api stability for session
9f6e04c justinvyu Merge branch 'master' of https://github.com/ray-project/ray into trai…
0529739 justinvyu Merge branch 'master' of https://github.com/ray-project/ray into trai…
410a34f justinvyu Re-specify param space in trainer restore + reenable test
d1c2e11 justinvyu Improve comments
4196ce3 justinvyu Add trainer restore to API ref
4e1f532 justinvyu Add FAQ post
1a415bc justinvyu Convert BaseTrainer example to be framework agnostic
d865435 justinvyu Merge branch 'master' of https://github.com/ray-project/ray into trai…
e1b41e8 justinvyu Improve faq + make example work
859ecf3 justinvyu Merge branch 'master' of https://github.com/ray-project/ray into trai…
da94da8 justinvyu Add typing imports
5c5c009 justinvyu Remove accidentally included files
a5fde62 justinvyu Remove ipdb (oops)
4b53d64 justinvyu Fix duplicate HFTrainer.restore ref
726cf81 justinvyu Merge branch 'master' of https://github.com/ray-project/ray into trai…
ae7568e justinvyu Shouldn't check for subclass
8b4a876 justinvyu Merge branch 'master' of https://github.com/ray-project/ray into trai…
a7e4336 justinvyu Convert error to warning instead + try/catch new trainer instantiation
61bb672 justinvyu Add public api alpha decorators
5a62a5e justinvyu Update unit test to check for warning
f47970f justinvyu Remove trailing header chars
94058a0 justinvyu Merge branch 'master' of https://github.com/ray-project/ray into trai…
4589b52 justinvyu Update docstring example to define trainer subclass inline
Is this a controversial change? A loaded preprocessor that has already been fitted shouldn't be fit again. This way, I don't need to pass in a `fit_preprocessor=False` flag.

Seems reasonable. Can you make this an actual code comment for this logic?
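
For readers following the thread, the "skip fitting when the preprocessor is already fitted" idea can be sketched roughly as below. This is a minimal, self-contained illustration and not the code from this PR: `FitStatus`, `MeanCenterer`, and `maybe_fit` are hypothetical stand-ins, and only the `fit_status()` method name is borrowed from the commit messages above.

```python
from enum import Enum


class FitStatus(Enum):
    """Hypothetical stand-in for a preprocessor's fit state."""

    NOT_FITTED = "NOT_FITTED"
    FITTED = "FITTED"


class MeanCenterer:
    """Toy preprocessor used only for illustration (not a Ray class)."""

    def __init__(self):
        self.mean = None
        self.fit_calls = 0  # counts how often fit() actually ran

    def fit_status(self) -> FitStatus:
        return FitStatus.FITTED if self.mean is not None else FitStatus.NOT_FITTED

    def fit(self, values):
        self.fit_calls += 1
        self.mean = sum(values) / len(values)
        return self

    def transform(self, values):
        return [v - self.mean for v in values]


def maybe_fit(preprocessor, values):
    """Fit the preprocessor only if it has not been fitted yet."""
    # A preprocessor loaded from a previous run is already fitted, so it is
    # reused as-is. Checking the fit status here is what removes the need
    # for a separate `fit_preprocessor=False` flag.
    if preprocessor.fit_status() == FitStatus.FITTED:
        return preprocessor
    return preprocessor.fit(values)


# A fresh preprocessor gets fitted; passing it through again is a no-op.
fresh = maybe_fit(MeanCenterer(), [1.0, 2.0, 3.0])
restored = maybe_fit(fresh, [10.0, 20.0])
```

The design choice under discussion is that the fitted state lives on the preprocessor itself, so the restore path needs no extra argument to say "do not refit".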