
[Train Log]Ray Train Structured Logging #47806

Open · wants to merge 10 commits into base: master

Conversation

@hongpeng-guo (Contributor) commented Sep 24, 2024

Why are these changes needed?

This PR adds structured logging for Ray Train. The overall structure follows the implementation of Ray Data's structured logging PR. The main components are:

  • python/ray/train/_internal/logging.py: defines the logging utility functions;
  • python/ray/train/_internal/logging.yaml: defines the YAML configuration for how logs are formatted and handled (see the sketch after this list);
  • python/ray/train/tests/test_logging.py: provides the corresponding unit tests for the logging utilities.
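The PR does not reproduce logging.yaml here, but Ray Data's structured logging configures the standard library logger through a dictConfig-style YAML file, so the Ray Train file likely has a similar shape. Below is a minimal sketch of that shape, written as the equivalent Python dictConfig dictionary: the "text" format string is derived from the TEXT example output below, while the handler name, log filename, and logger level are illustrative assumptions, not the PR's actual contents.

import logging.config

# Hypothetical sketch of the structure that logging.yaml likely encodes,
# expressed as the equivalent logging.config.dictConfig dictionary.
LOGGING_CONFIG = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        # Plain-text formatter matching the TEXT example output below.
        "text": {
            "format": "%(asctime)s\t%(levelname)s %(filename)s:%(lineno)d -- %(message)s",
        },
        # The real config presumably also wires a Ray-provided JSON formatter
        # that emits the asctime/levelname/message/filename/lineno fields
        # shown in the JSON example below.
    },
    "handlers": {
        "file": {
            "class": "logging.FileHandler",
            "formatter": "text",
            "filename": "ray-train.log",  # illustrative name, not from the PR
        },
    },
    "loggers": {
        "ray.train": {
            "level": "DEBUG",
            "handlers": ["file"],
            "propagate": False,
        },
    },
}

logging.config.dictConfig(LOGGING_CONFIG)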

Example

Code snippet:

import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.train._internal.logging import get_log_directory

ray.init()

def train_func_per_worker():
    # No-op training loop: this example only exercises the logging setup.
    pass

def train_dummy(num_workers=2):
    scaling_config = ScalingConfig(num_workers=num_workers, use_gpu=False)

    trainer = TorchTrainer(
        train_loop_per_worker=train_func_per_worker,
        scaling_config=scaling_config,
    )

    result = trainer.fit()
    print(f"Training result: {result}")

if __name__ == "__main__":
    print(f"Log directory: {get_log_directory()}")
    train_dummy()

JSON Logging

RAY_TRAIN_LOG_ENCODING="JSON" python log.py
{"asctime": "2024-09-24 15:04:26,201", "levelname": "DEBUG", "message": "StorageContext on SESSION (rank=None):\nStorageContext<\n  storage_filesystem='local',\n  storage_fs_path='/Users/hpguo/ray_results',\n  experiment_dir_name='TorchTrainer_2024-09-24_15-04-24',\n  trial_dir_name='TorchTrainer_f59af_00000_0_2024-09-24_15-04-24',\n  current_checkpoint_index=-1,\n>", "filename": "session.py", "lineno": 154}
{"asctime": "2024-09-24 15:04:26,202", "levelname": "DEBUG", "message": "Changing the working directory to: /tmp/ray/session_2024-09-24_15-04-23_088947_30991/artifacts/2024-09-24_15-04-24/TorchTrainer_2024-09-24_15-04-24/working_dirs/TorchTrainer_f59af_00000_0_2024-09-24_15-04-24", "filename": "session.py", "lineno": 231}
{"asctime": "2024-09-24 15:04:26,214", "levelname": "DEBUG", "message": "Starting 2 workers.", "filename": "worker_group.py", "lineno": 202}
{"asctime": "2024-09-24 15:04:26,897", "levelname": "DEBUG", "message": "2 workers have successfully started.", "filename": "worker_group.py", "lineno": 204}
{"asctime": "2024-09-24 15:04:27,730", "levelname": "DEBUG", "message": "Setting up process group for: env:// [rank=1, world_size=2]", "filename": "config.py", "lineno": 88}
{"asctime": "2024-09-24 15:04:27,730", "levelname": "INFO", "message": "Setting up process group for: env:// [rank=0, world_size=2]", "filename": "config.py", "lineno": 83}
{"asctime": "2024-09-24 15:04:27,730", "levelname": "DEBUG", "message": "using gloo", "filename": "config.py", "lineno": 92}
{"asctime": "2024-09-24 15:04:27,730", "levelname": "DEBUG", "message": "using gloo", "filename": "config.py", "lineno": 92}
{"asctime": "2024-09-24 15:04:27,742", "levelname": "INFO", "message": "Started distributed worker processes: \n- (node_id=30b789ba1aa103d12cbcdde7af53a84a1d78b6077b0e22fc0dd8e214, ip=127.0.0.1, pid=31052) world_rank=0, local_rank=0, node_rank=0\n- (node_id=30b789ba1aa103d12cbcdde7af53a84a1d78b6077b0e22fc0dd8e214, ip=127.0.0.1, pid=31053) world_rank=1, local_rank=1, node_rank=0", "filename": "backend_executor.py", "lineno": 447}
{"asctime": "2024-09-24 15:04:27,772", "levelname": "DEBUG", "message": "StorageContext on SESSION (rank=0):\nStorageContext<\n  storage_filesystem='local',\n  storage_fs_path='/Users/hpguo/ray_results',\n  experiment_dir_name='TorchTrainer_2024-09-24_15-04-24',\n  trial_dir_name='TorchTrainer_f59af_00000_0_2024-09-24_15-04-24',\n  current_checkpoint_index=-1,\n>", "filename": "session.py", "lineno": 154}
{"asctime": "2024-09-24 15:04:27,772", "levelname": "DEBUG", "message": "StorageContext on SESSION (rank=1):\nStorageContext<\n  storage_filesystem='local',\n  storage_fs_path='/Users/hpguo/ray_results',\n  experiment_dir_name='TorchTrainer_2024-09-24_15-04-24',\n  trial_dir_name='TorchTrainer_f59af_00000_0_2024-09-24_15-04-24',\n  current_checkpoint_index=-1,\n>", "filename": "session.py", "lineno": 154}
{"asctime": "2024-09-24 15:04:27,772", "levelname": "DEBUG", "message": "Changing the working directory to: /tmp/ray/session_2024-09-24_15-04-23_088947_30991/artifacts/2024-09-24_15-04-24/TorchTrainer_2024-09-24_15-04-24/working_dirs/TorchTrainer_f59af_00000_0_2024-09-24_15-04-24", "filename": "session.py", "lineno": 231}
{"asctime": "2024-09-24 15:04:27,772", "levelname": "DEBUG", "message": "Changing the working directory to: /tmp/ray/session_2024-09-24_15-04-23_088947_30991/artifacts/2024-09-24_15-04-24/TorchTrainer_2024-09-24_15-04-24/working_dirs/TorchTrainer_f59af_00000_0_2024-09-24_15-04-24", "filename": "session.py", "lineno": 231}
{"asctime": "2024-09-24 15:04:28,787", "levelname": "DEBUG", "message": "Shutting down 2 workers.", "filename": "worker_group.py", "lineno": 216}
{"asctime": "2024-09-24 15:04:28,796", "levelname": "DEBUG", "message": "Graceful termination failed. Falling back to force kill.", "filename": "worker_group.py", "lineno": 225}
{"asctime": "2024-09-24 15:04:28,797", "levelname": "DEBUG", "message": "Shutdown successful.", "filename": "worker_group.py", "lineno": 230}

Text Logging

RAY_TRAIN_LOG_ENCODING="TEXT" python log.py
2024-09-24 15:06:02,274	DEBUG session.py:154 -- StorageContext on SESSION (rank=None):
StorageContext<
  storage_filesystem='local',
  storage_fs_path='/Users/hpguo/ray_results',
  experiment_dir_name='TorchTrainer_2024-09-24_15-06-00',
  trial_dir_name='TorchTrainer_2edb7_00000_0_2024-09-24_15-06-00',
  current_checkpoint_index=-1,
>
2024-09-24 15:06:02,274	DEBUG session.py:231 -- Changing the working directory to: /tmp/ray/session_2024-09-24_15-05-58_999739_31907/artifacts/2024-09-24_15-06-00/TorchTrainer_2024-09-24_15-06-00/working_dirs/TorchTrainer_2edb7_00000_0_2024-09-24_15-06-00
2024-09-24 15:06:02,286	DEBUG worker_group.py:202 -- Starting 2 workers.
2024-09-24 15:06:02,973	DEBUG worker_group.py:204 -- 2 workers have successfully started.
2024-09-24 15:06:03,812	INFO config.py:83 -- Setting up process group for: env:// [rank=0, world_size=2]
2024-09-24 15:06:03,812	DEBUG config.py:88 -- Setting up process group for: env:// [rank=1, world_size=2]
2024-09-24 15:06:03,812	DEBUG config.py:92 -- using gloo
2024-09-24 15:06:03,812	DEBUG config.py:92 -- using gloo
2024-09-24 15:06:03,863	INFO backend_executor.py:447 -- Started distributed worker processes: 
- (node_id=f4b1ea9c06ed3425b929fb70ede36ada34e3e3131b0c00318a7dee8a, ip=127.0.0.1, pid=31968) world_rank=0, local_rank=0, node_rank=0
- (node_id=f4b1ea9c06ed3425b929fb70ede36ada34e3e3131b0c00318a7dee8a, ip=127.0.0.1, pid=31969) world_rank=1, local_rank=1, node_rank=0
2024-09-24 15:06:03,893	DEBUG session.py:154 -- StorageContext on SESSION (rank=0):
StorageContext<
  storage_filesystem='local',
  storage_fs_path='/Users/hpguo/ray_results',
  experiment_dir_name='TorchTrainer_2024-09-24_15-06-00',
  trial_dir_name='TorchTrainer_2edb7_00000_0_2024-09-24_15-06-00',
  current_checkpoint_index=-1,
>
2024-09-24 15:06:03,893	DEBUG session.py:154 -- StorageContext on SESSION (rank=1):
StorageContext<
  storage_filesystem='local',
  storage_fs_path='/Users/hpguo/ray_results',
  experiment_dir_name='TorchTrainer_2024-09-24_15-06-00',
  trial_dir_name='TorchTrainer_2edb7_00000_0_2024-09-24_15-06-00',
  current_checkpoint_index=-1,
>
2024-09-24 15:06:03,893	DEBUG session.py:231 -- Changing the working directory to: /tmp/ray/session_2024-09-24_15-05-58_999739_31907/artifacts/2024-09-24_15-06-00/TorchTrainer_2024-09-24_15-06-00/working_dirs/TorchTrainer_2edb7_00000_0_2024-09-24_15-06-00
2024-09-24 15:06:03,893	DEBUG session.py:231 -- Changing the working directory to: /tmp/ray/session_2024-09-24_15-05-58_999739_31907/artifacts/2024-09-24_15-06-00/TorchTrainer_2024-09-24_15-06-00/working_dirs/TorchTrainer_2edb7_00000_0_2024-09-24_15-06-00
2024-09-24 15:06:04,906	DEBUG worker_group.py:216 -- Shutting down 2 workers.
2024-09-24 15:06:04,912	DEBUG worker_group.py:225 -- Graceful termination failed. Falling back to force kill.
2024-09-24 15:06:04,912	DEBUG worker_group.py:230 -- Shutdown successful.
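The two runs above differ only in the RAY_TRAIN_LOG_ENCODING environment variable, which selects the formatter at configuration time. A minimal sketch of how such a toggle might be read; the helper name and the "TEXT" default are assumptions, not the PR's actual code:

import os

def _resolve_log_encoding() -> str:
    # Hypothetical helper: read the same variable the two example commands
    # above set; defaulting to "TEXT" is an assumption.
    encoding = os.environ.get("RAY_TRAIN_LOG_ENCODING", "TEXT").upper()
    if encoding not in ("TEXT", "JSON"):
        raise ValueError(f"Unsupported RAY_TRAIN_LOG_ENCODING: {encoding}")
    return encoding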

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Hongpeng Guo <hpguo@anyscale.com>
@hongpeng-guo changed the title from [Train Log][WIP] Ray Train Structured Logging to [Train Log]Ray Train Structured Logging on Sep 24, 2024