[Train] Update run status and add stack trace to TrainRunInfo #46875

Merged: 13 commits into ray-project:master on Sep 11, 2024

@woshiyyya woshiyyya commented Jul 30, 2024

Why are these changes needed?

Three main changes to TrainRunInfo:

  • Rename the STARTED status to RUNNING.
  • Update the status detail of the ABORTED status.
  • Add the stack trace of the failed worker, together with the failed worker's rank.
    • Truncate the stack trace to less than 50,000 characters.
    • Add a new field for the error: TrainRunInfo.run_error.

Example

from ray.train.torch import TorchTrainer

def train_func():
    ...
    raise RuntimeError("User Application Error")

trainer = TorchTrainer(train_func, ...)
trainer.fit()

TrainRunInfo.status_detail will then be populated with:

Rank 0 worker raised an error.
Traceback (most recent call last):
  File "/home/ubuntu/ray/python/ray/train/_internal/utils.py", line 57, in check_for_failure
    ray.get(object_ref)
  File "/home/ubuntu/ray/python/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/ubuntu/ray/python/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/ray/python/ray/_private/worker.py", line 2661, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/home/ubuntu/ray/python/ray/_private/worker.py", line 871, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::_RayTrainWorker__execute.get_next() (pid=1437865, ip=172.31.7.221, actor_id=fa0962e20ddee2bdbf81c5e801000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f4aa2204070>)
  File "/home/ubuntu/ray/python/ray/train/_internal/worker_group.py", line 33, in __execute
    raise skipped from exception_cause(skipped)
  File "/home/ubuntu/ray/python/ray/train/_internal/utils.py", line 176, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/home/ubuntu/ray/python/ray/train/tests/test_state.py", line 316, in train_func
    raise RuntimeError(error_message)
RuntimeError: User Application Error

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: yunxuanx <yunxuanx@anyscale.com>
Signed-off-by: yunxuanx <yunxuanx@anyscale.com>
Signed-off-by: yunxuanx <yunxuanx@anyscale.com>
@woshiyyya changed the title from "[Train] Update Train Run Status" to "[Train] Update run status and add stack trace to TrainRunInfo" on Jul 30, 2024
Signed-off-by: yunxuanx <yunxuanx@anyscale.com>
Signed-off-by: yunxuanx <yunxuanx@anyscale.com>
woshiyyya and others added 3 commits August 26, 2024 07:57
Signed-off-by: woshiyyya <1085966850@qq.com>
Signed-off-by: woshiyyya <1085966850@qq.com>
@woshiyyya added the "go" label (add ONLY when ready to merge, run all tests) on Aug 28, 2024
Comment on lines 35 to 37
    def __init__(self, *args: object, tags: Optional[Dict] = None) -> None:
        super().__init__(*args)
        self.tags = tags if tags else {}

Contributor

It seems a bit weird to add these tags arbitrarily here...

IMO it might be better to have a subclass that's specific to worker failures that requires the worker_rank to be set. This way it's more explicit how these are propagated.

cc @justinvyu does this make sense?

@woshiyyya (Member Author), Sep 6, 2024

Do you mean subclassing StartTraceback?

Or we could also do

class StartTraceback(Exception):
    def __init__(self, *args: object, worker_rank=None) -> None:
        super().__init__(*args)
        self.worker_rank = worker_rank

or

class StartTraceback(Exception):
    def __init__(self, *args: object, **kwargs) -> None:
        super().__init__(*args)
        self.worker_rank = kwargs.get("worker_rank", None)

Contributor

Yeah, I mean to subclass StartTraceback. The reason is that StartTraceback is a "common" class used generically in a few places, so adding worker_rank directly to it doesn't make sense.

Also we want to make it clear that the newly added code that reads the worker_rank from the exception is reading specifically from the exception that has this value populated, and not another instance of StartTraceback that doesn't have this value populated.
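
A minimal sketch of the subclass approach suggested here, for illustration only. The `StartTraceback` class below is a local stand-in, and the names `WorkerRankTraceback` and `failed_rank_of` are hypothetical, not taken from the PR. The point is that code reading `worker_rank` checks for the specific subclass, so a plain `StartTraceback` raised elsewhere is never mistaken for a worker failure.

```python
from typing import Optional


class StartTraceback(Exception):
    """Local stand-in for the generic marker exception discussed above."""


class WorkerRankTraceback(StartTraceback):
    """Hypothetical subclass that requires the failed worker's rank."""

    def __init__(self, worker_rank: int) -> None:
        super().__init__()
        self.worker_rank = worker_rank


def failed_rank_of(exc: BaseException) -> Optional[int]:
    """Return the failed worker rank only if the exception carries one."""
    if isinstance(exc, WorkerRankTraceback):
        return exc.worker_rank
    # A generic StartTraceback (or any other error) has no rank attached.
    return None
```

Because `worker_rank` is a required constructor argument, the subclass cannot be created without it, which makes the propagation explicit.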

@woshiyyya (Member Author), Sep 7, 2024

OK, I can make it a subclass...

But this StartTraceback stuff will not be used in Train V2, right? This subclass will still be a temporary solution.


     if errored:
         run_status = RunStatusEnum.ERRORED
-        status_detail = "Terminated due to an error in the training function."
+        status_detail = f"Rank {failed_rank} worker raised an error: \n"
+        status_detail += stack_trace[-MAX_ERROR_STACK_TRACE_LENGTH:]

Contributor

Add a comment here explaining this logic? (Just make it obvious that it's only showing the end)
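
A small sketch of the truncation logic with the kind of comment requested here. The helper name `build_status_detail` is hypothetical, and the value of `MAX_ERROR_STACK_TRACE_LENGTH` is assumed from the 50,000-character limit in the PR description; only the slicing expression is taken from the diff.

```python
# Assumed value, based on the limit mentioned in the PR description.
MAX_ERROR_STACK_TRACE_LENGTH = 50_000


def build_status_detail(failed_rank: int, stack_trace: str) -> str:
    """Hypothetical helper mirroring the status_detail logic in the diff."""
    detail = f"Rank {failed_rank} worker raised an error: \n"
    # Keep only the END of the stack trace: a negative slice start drops the
    # oldest frames and keeps the final ones, where the user's exception
    # type and message appear.
    detail += stack_trace[-MAX_ERROR_STACK_TRACE_LENGTH:]
    return detail
```

A trace shorter than the limit passes through unchanged, because slicing with a start before the beginning of the string returns the whole string.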

Comment on lines 15 to 16
# (Deprecated) The train run has started
STARTED = "STARTED"

Contributor

nit: For deprecated things it's recommended to document the replacement (RUNNING in this case)
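
One way the nit could be addressed, sketched on a trimmed-down stand-in enum. The member list below includes only the statuses named in this PR and is not the full definition; the comment wording is an assumption.

```python
from enum import Enum


class RunStatusEnum(str, Enum):
    """Trimmed-down stand-in listing only the statuses named in this PR."""

    # (Deprecated) Replaced by RUNNING. Kept so existing readers still parse.
    STARTED = "STARTED"
    # The train run is running.
    RUNNING = "RUNNING"
    # The train run was terminated due to an error in the training function.
    ERRORED = "ERRORED"
    # The train run was terminated before finishing.
    ABORTED = "ABORTED"
```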

# If this is a StartTraceback, then this is a user error.
# We raise it directly
self._backend_executor.report_final_run_status(errored=True)
stack_trace = traceback.format_exc()

Contributor

I think we should call skip_exceptions to get rid of the StartTraceback parts.
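
To illustrate the idea behind this suggestion, here is a simplified sketch that unwraps marker exceptions before formatting the trace. It is not Ray's `skip_exceptions` implementation: `StartTraceback` is a local stand-in, and `unwrap_markers` and `format_user_error` are hypothetical helpers that only follow `__cause__`.

```python
import traceback


class StartTraceback(Exception):
    """Local stand-in for the marker exception that wraps user errors."""


def unwrap_markers(exc: BaseException) -> BaseException:
    """Follow __cause__ past marker exceptions to reach the user's error."""
    while isinstance(exc, StartTraceback) and exc.__cause__ is not None:
        exc = exc.__cause__
    return exc


def format_user_error(exc: BaseException) -> str:
    """Format the stack trace of the underlying user error only."""
    user_exc = unwrap_markers(exc)
    return "".join(
        traceback.format_exception(type(user_exc), user_exc, user_exc.__traceback__)
    )
```

Formatting the unwrapped exception, rather than calling `traceback.format_exc()` on the wrapper, keeps the marker frames out of the reported status detail.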

Signed-off-by: woshiyyya <1085966850@qq.com>
Signed-off-by: woshiyyya <1085966850@qq.com>

@matthewdeng (Contributor) left a comment

looks good

Signed-off-by: woshiyyya <1085966850@qq.com>

@woshiyyya (Member Author)

Addressed the comments

@matthewdeng (Contributor) left a comment

side note: we should sanitize the error message more in the future, to avoid a lot of the internal stacktrace :)

@matthewdeng merged commit 29a2a91 into ray-project:master on Sep 11, 2024
5 checks passed
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024