-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Train] Update run status and add stack trace to TrainRunInfo
#46875
[Train] Update run status and add stack trace to TrainRunInfo
#46875
Conversation
TrainRunInfo
Signed-off-by: woshiyyya <1085966850@qq.com>
Signed-off-by: woshiyyya <1085966850@qq.com>
Signed-off-by: woshiyyya <1085966850@qq.com>
python/ray/air/_internal/util.py
Outdated
def __init__(self, *args: object, tags: Optional[Dict] = None) -> None: | ||
super().__init__(*args) | ||
self.tags = tags if tags else {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems a bit weird to add these tags arbitrarily here...
IMO it might be better to have a subclass that's specific to worker failures that requires the worker_rank to be set. This way it's more explicit how these are propagated.
cc @justinvyu does this make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean subclass the StartTraceback
?
or we can also do
class StartTraceback(Exception):
def __init__(self, *args: object, worker_rank = None) -> None:
super().__init__(*args)
self.worker_rank = worker_rank
or
class StartTraceback(Exception):
def __init__(self, *args: object, **kwargs) -> None:
super().__init__(*args)
self.worker_rank = kwargs.get("worker_rank", None)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I mean to subclass StartTraceback
. The reason is that the StartTraceback
is a "common" class used generically across a few of places, so adding worker_rank
directly to it doesn't make sense.
Also we want to make it clear that the newly added code that reads the worker_rank
from the exception is reading specifically from the exception that has this value populated, and not another instance of StartTraceback
that doesn't have this value populated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, i can make it a subclass...
but this StartTraceback
stuff will not be used in Train V2 right? This subclass will still be a temporary solution.
|
||
if errored: | ||
run_status = RunStatusEnum.ERRORED | ||
status_detail = "Terminated due to an error in the training function." | ||
status_detail = f"Rank {failed_rank} worker raised an error: \n" | ||
status_detail += stack_trace[-MAX_ERROR_STACK_TRACE_LENGTH:] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment here explaining this logic? (Just make it obvious that it's only showing the end)
# (Deprecated) The train run has started | ||
STARTED = "STARTED" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: For deprecated things it's recommended to document the replacement (RUNNING in this case)
python/ray/train/trainer.py
Outdated
# If this is a StartTraceback, then this is a user error. | ||
# We raise it directly | ||
self._backend_executor.report_final_run_status(errored=True) | ||
stack_trace = traceback.format_exc() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should call skip_exceptions to get rid of the StartTraceback
parts.
Signed-off-by: woshiyyya <1085966850@qq.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good
Signed-off-by: woshiyyya <1085966850@qq.com>
Addressed the comments |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
side note: we should sanitize the error message more in the future, to avoid a lot of the internal stacktrace :)
…project#46875) 3 main changes for TrainRunInfo: - Rename `STARTED` status to `RUNNING` status. - Updated the status detail of `ABORTED` status. - Add the stack trace of the failed worker, together with the failed worker rank. - truncate the stack trace to less than 50,000 chars. - Added a new field for the error: `TrainRunInfo.run_error` Signed-off-by: yunxuanx <yunxuanx@anyscale.com> Signed-off-by: woshiyyya <1085966850@qq.com> Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
Why are these changes needed?
3 main changes for TrainRunInfo:
STARTED
status toRUNNING
status.ABORTED
status.TrainRunInfo.run_error
Example
The
TrainRunInfo.status_detail
will be populated asRelated issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.