
[Train Log]Ray Train Structured Logging #47806

Open · wants to merge 35 commits into base: master
Conversation

@hongpeng-guo (Contributor) commented Sep 24, 2024

Why are these changes needed?

This PR adds structured logging for Ray Train. The overall structure follows the implementation in Ray Data's structured logging PR. The main components include:

  • python/ray/train/_internal/logging.py: this file defines the logging utility functions;
  • python/ray/train/tests/test_logging.py: this file provides the corresponding unit tests for the logging utilities.

Example

Code snippet:

import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.train._internal.logging import get_log_directory

import logging
logger = logging.getLogger("ray.train")

ray.init()

def train_func_per_worker():
    logger.info("Training function per worker")

def train_dummy(num_workers=2):
    scaling_config = ScalingConfig(num_workers=num_workers, use_gpu=False)

    trainer = TorchTrainer(
        train_loop_per_worker=train_func_per_worker,
        scaling_config=scaling_config,
    )

    result = trainer.fit()
    print(f"Training result: {result}")

if __name__ == "__main__":
    print(f"Log directory: {get_log_directory()}")
    train_dummy()

JSON Logging

RAY_TRAIN_LOG_ENCODING="JSON" python log.py
{"asctime": "2024-11-08 11:19:58,063", "levelname": "DEBUG", "message": "StorageContext on SESSION (rank=None):\nStorageContext<\n  storage_filesystem='local',\n  storage_fs_path='/Users/hpguo/ray_results',\n  experiment_dir_name='TorchTrainer_2024-11-08_11-19-56',\n  trial_dir_name='TorchTrainer_70293_00000_0_2024-11-08_11-19-56',\n  current_checkpoint_index=-1,\n>", "filename": "session.py", "lineno": 154, "job_id": "01000000", "worker_id": "cc277b45c69ca143be22e9895867bc7d259eb13fa29b7e275297cb66", "node_id": "eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389", "actor_id": "83fcb6f07a3a7a63545ea94601000000", "task_id": "ffffffffffffffff83fcb6f07a3a7a63545ea94601000000"}
{"asctime": "2024-11-08 11:19:58,063", "levelname": "DEBUG", "message": "Changing the working directory to: /tmp/ray/session_2024-11-08_11-19-54_509643_87981/artifacts/2024-11-08_11-19-56/TorchTrainer_2024-11-08_11-19-56/working_dirs/TorchTrainer_70293_00000_0_2024-11-08_11-19-56", "filename": "session.py", "lineno": 231, "job_id": "01000000", "worker_id": "cc277b45c69ca143be22e9895867bc7d259eb13fa29b7e275297cb66", "node_id": "eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389", "actor_id": "83fcb6f07a3a7a63545ea94601000000", "task_id": "ffffffffffffffff83fcb6f07a3a7a63545ea94601000000"}
{"asctime": "2024-11-08 11:19:58,077", "levelname": "DEBUG", "message": "Starting 2 workers.", "filename": "worker_group.py", "lineno": 202, "job_id": "01000000", "worker_id": "cc277b45c69ca143be22e9895867bc7d259eb13fa29b7e275297cb66", "node_id": "eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389", "actor_id": "83fcb6f07a3a7a63545ea94601000000", "task_id": "000685f8e792b9e7d5e21a98a52e99685b9417e901000000", "run_id": "78ae3ffdd2a34b6ba4e748993b9254f1"}
{"asctime": "2024-11-08 11:19:58,710", "levelname": "DEBUG", "message": "2 workers have successfully started.", "filename": "worker_group.py", "lineno": 204, "job_id": "01000000", "worker_id": "cc277b45c69ca143be22e9895867bc7d259eb13fa29b7e275297cb66", "node_id": "eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389", "actor_id": "83fcb6f07a3a7a63545ea94601000000", "task_id": "000685f8e792b9e7d5e21a98a52e99685b9417e901000000", "run_id": "78ae3ffdd2a34b6ba4e748993b9254f1"}
{"asctime": "2024-11-08 11:19:59,499", "levelname": "DEBUG", "message": "Setting up process group for: env:// [rank=1, world_size=2]", "filename": "config.py", "lineno": 88, "job_id": "01000000", "worker_id": "eed87fdf3a2fb18ebba0ba4a367323070e52eb86add3f43c9a9a3c18", "node_id": "eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389", "actor_id": "cea44824e6909a3c3093cc1601000000", "task_id": "b47a1a2bfd406adacea44824e6909a3c3093cc1601000000"}
{"asctime": "2024-11-08 11:19:59,499", "levelname": "INFO", "message": "Setting up process group for: env:// [rank=0, world_size=2]", "filename": "config.py", "lineno": 83, "job_id": "01000000", "worker_id": "6fb73cd9160368fb01de57987c130e85154b568d7e216483dab06ccc", "node_id": "eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389", "actor_id": "5d53407d44cd0a285004087301000000", "task_id": "260f31e5317176835d53407d44cd0a285004087301000000"}
{"asctime": "2024-11-08 11:19:59,499", "levelname": "DEBUG", "message": "using gloo", "filename": "config.py", "lineno": 92, "job_id": "01000000", "worker_id": "eed87fdf3a2fb18ebba0ba4a367323070e52eb86add3f43c9a9a3c18", "node_id": "eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389", "actor_id": "cea44824e6909a3c3093cc1601000000", "task_id": "b47a1a2bfd406adacea44824e6909a3c3093cc1601000000"}
{"asctime": "2024-11-08 11:19:59,499", "levelname": "DEBUG", "message": "using gloo", "filename": "config.py", "lineno": 92, "job_id": "01000000", "worker_id": "6fb73cd9160368fb01de57987c130e85154b568d7e216483dab06ccc", "node_id": "eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389", "actor_id": "5d53407d44cd0a285004087301000000", "task_id": "260f31e5317176835d53407d44cd0a285004087301000000"}
{"asctime": "2024-11-08 11:19:59,509", "levelname": "INFO", "message": "Started distributed worker processes: \n- (node_id=eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389, ip=127.0.0.1, pid=88059) world_rank=0, local_rank=0, node_rank=0\n- (node_id=eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389, ip=127.0.0.1, pid=88060) world_rank=1, local_rank=1, node_rank=0", "filename": "backend_executor.py", "lineno": 447, "job_id": "01000000", "worker_id": "cc277b45c69ca143be22e9895867bc7d259eb13fa29b7e275297cb66", "node_id": "eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389", "actor_id": "83fcb6f07a3a7a63545ea94601000000", "task_id": "000685f8e792b9e7d5e21a98a52e99685b9417e901000000", "run_id": "78ae3ffdd2a34b6ba4e748993b9254f1"}
{"asctime": "2024-11-08 11:19:59,535", "levelname": "DEBUG", "message": "StorageContext on SESSION (rank=1):\nStorageContext<\n  storage_filesystem='local',\n  storage_fs_path='/Users/hpguo/ray_results',\n  experiment_dir_name='TorchTrainer_2024-11-08_11-19-56',\n  trial_dir_name='TorchTrainer_70293_00000_0_2024-11-08_11-19-56',\n  current_checkpoint_index=-1,\n>", "filename": "session.py", "lineno": 154, "job_id": "01000000", "worker_id": "eed87fdf3a2fb18ebba0ba4a367323070e52eb86add3f43c9a9a3c18", "node_id": "eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389", "actor_id": "cea44824e6909a3c3093cc1601000000", "task_id": "c325fd635f23af07cea44824e6909a3c3093cc1601000000"}
{"asctime": "2024-11-08 11:19:59,535", "levelname": "DEBUG", "message": "StorageContext on SESSION (rank=0):\nStorageContext<\n  storage_filesystem='local',\n  storage_fs_path='/Users/hpguo/ray_results',\n  experiment_dir_name='TorchTrainer_2024-11-08_11-19-56',\n  trial_dir_name='TorchTrainer_70293_00000_0_2024-11-08_11-19-56',\n  current_checkpoint_index=-1,\n>", "filename": "session.py", "lineno": 154, "job_id": "01000000", "worker_id": "6fb73cd9160368fb01de57987c130e85154b568d7e216483dab06ccc", "node_id": "eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389", "actor_id": "5d53407d44cd0a285004087301000000", "task_id": "4a288fd0ddf2520b5d53407d44cd0a285004087301000000"}
{"asctime": "2024-11-08 11:19:59,536", "levelname": "DEBUG", "message": "Changing the working directory to: /tmp/ray/session_2024-11-08_11-19-54_509643_87981/artifacts/2024-11-08_11-19-56/TorchTrainer_2024-11-08_11-19-56/working_dirs/TorchTrainer_70293_00000_0_2024-11-08_11-19-56", "filename": "session.py", "lineno": 231, "job_id": "01000000", "worker_id": "6fb73cd9160368fb01de57987c130e85154b568d7e216483dab06ccc", "node_id": "eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389", "actor_id": "5d53407d44cd0a285004087301000000", "task_id": "4a288fd0ddf2520b5d53407d44cd0a285004087301000000"}
{"asctime": "2024-11-08 11:19:59,536", "levelname": "DEBUG", "message": "Changing the working directory to: /tmp/ray/session_2024-11-08_11-19-54_509643_87981/artifacts/2024-11-08_11-19-56/TorchTrainer_2024-11-08_11-19-56/working_dirs/TorchTrainer_70293_00000_0_2024-11-08_11-19-56", "filename": "session.py", "lineno": 231, "job_id": "01000000", "worker_id": "eed87fdf3a2fb18ebba0ba4a367323070e52eb86add3f43c9a9a3c18", "node_id": "eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389", "actor_id": "cea44824e6909a3c3093cc1601000000", "task_id": "c325fd635f23af07cea44824e6909a3c3093cc1601000000"}
{"asctime": "2024-11-08 11:19:59,537", "levelname": "INFO", "message": "Training function per worker", "filename": "_log.py", "lineno": 12, "job_id": "01000000", "worker_id": "6fb73cd9160368fb01de57987c130e85154b568d7e216483dab06ccc", "node_id": "eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389", "actor_id": "5d53407d44cd0a285004087301000000", "task_id": "0cd5197cf7154f3c8587d3ed0f454189e08782b401000000", "run_id": "78ae3ffdd2a34b6ba4e748993b9254f1", "world_size": 2, "world_rank": 0, "local_world_size": 0, "local_rank": 2, "node_rank": 0}
{"asctime": "2024-11-08 11:19:59,537", "levelname": "INFO", "message": "Training function per worker", "filename": "_log.py", "lineno": 12, "job_id": "01000000", "worker_id": "eed87fdf3a2fb18ebba0ba4a367323070e52eb86add3f43c9a9a3c18", "node_id": "eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389", "actor_id": "cea44824e6909a3c3093cc1601000000", "task_id": "655b1b0fd6924ef51ba24695fb8f2e52545b4feb01000000", "run_id": "78ae3ffdd2a34b6ba4e748993b9254f1", "world_size": 2, "world_rank": 1, "local_world_size": 1, "local_rank": 2, "node_rank": 0, "hide": true}
{"asctime": "2024-11-08 11:20:00,550", "levelname": "DEBUG", "message": "Shutting down 2 workers.", "filename": "worker_group.py", "lineno": 216, "job_id": "01000000", "worker_id": "cc277b45c69ca143be22e9895867bc7d259eb13fa29b7e275297cb66", "node_id": "eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389", "actor_id": "83fcb6f07a3a7a63545ea94601000000", "task_id": "000685f8e792b9e7d5e21a98a52e99685b9417e901000000", "run_id": "78ae3ffdd2a34b6ba4e748993b9254f1"}
{"asctime": "2024-11-08 11:20:00,565", "levelname": "DEBUG", "message": "Graceful termination failed. Falling back to force kill.", "filename": "worker_group.py", "lineno": 225, "job_id": "01000000", "worker_id": "cc277b45c69ca143be22e9895867bc7d259eb13fa29b7e275297cb66", "node_id": "eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389", "actor_id": "83fcb6f07a3a7a63545ea94601000000", "task_id": "000685f8e792b9e7d5e21a98a52e99685b9417e901000000", "run_id": "78ae3ffdd2a34b6ba4e748993b9254f1"}
{"asctime": "2024-11-08 11:20:00,565", "levelname": "DEBUG", "message": "Shutdown successful.", "filename": "worker_group.py", "lineno": 230, "job_id": "01000000", "worker_id": "cc277b45c69ca143be22e9895867bc7d259eb13fa29b7e275297cb66", "node_id": "eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389", "actor_id": "83fcb6f07a3a7a63545ea94601000000", "task_id": "000685f8e792b9e7d5e21a98a52e99685b9417e901000000", "run_id": "78ae3ffdd2a34b6ba4e748993b9254f1"}

Text Logging

RAY_TRAIN_LOG_ENCODING="TEXT" python log.py
2024-09-24 15:06:02,274	DEBUG session.py:154 -- StorageContext on SESSION (rank=None):
StorageContext<
  storage_filesystem='local',
  storage_fs_path='/Users/hpguo/ray_results',
  experiment_dir_name='TorchTrainer_2024-09-24_15-06-00',
  trial_dir_name='TorchTrainer_2edb7_00000_0_2024-09-24_15-06-00',
  current_checkpoint_index=-1,
>
2024-09-24 15:06:02,274	DEBUG session.py:231 -- Changing the working directory to: /tmp/ray/session_2024-09-24_15-05-58_999739_31907/artifacts/2024-09-24_15-06-00/TorchTrainer_2024-09-24_15-06-00/working_dirs/TorchTrainer_2edb7_00000_0_2024-09-24_15-06-00
2024-09-24 15:06:02,286	DEBUG worker_group.py:202 -- Starting 2 workers.
2024-09-24 15:06:02,973	DEBUG worker_group.py:204 -- 2 workers have successfully started.
2024-09-24 15:06:03,812	INFO config.py:83 -- Setting up process group for: env:// [rank=0, world_size=2]
2024-09-24 15:06:03,812	DEBUG config.py:88 -- Setting up process group for: env:// [rank=1, world_size=2]
2024-09-24 15:06:03,812	DEBUG config.py:92 -- using gloo
2024-09-24 15:06:03,812	DEBUG config.py:92 -- using gloo
2024-09-24 15:06:03,863	INFO backend_executor.py:447 -- Started distributed worker processes: 
- (node_id=f4b1ea9c06ed3425b929fb70ede36ada34e3e3131b0c00318a7dee8a, ip=127.0.0.1, pid=31968) world_rank=0, local_rank=0, node_rank=0
- (node_id=f4b1ea9c06ed3425b929fb70ede36ada34e3e3131b0c00318a7dee8a, ip=127.0.0.1, pid=31969) world_rank=1, local_rank=1, node_rank=0
2024-09-24 15:06:03,893	DEBUG session.py:154 -- StorageContext on SESSION (rank=0):
StorageContext<
  storage_filesystem='local',
  storage_fs_path='/Users/hpguo/ray_results',
  experiment_dir_name='TorchTrainer_2024-09-24_15-06-00',
  trial_dir_name='TorchTrainer_2edb7_00000_0_2024-09-24_15-06-00',
  current_checkpoint_index=-1,
>
2024-09-24 15:06:03,893	DEBUG session.py:154 -- StorageContext on SESSION (rank=1):
StorageContext<
  storage_filesystem='local',
  storage_fs_path='/Users/hpguo/ray_results',
  experiment_dir_name='TorchTrainer_2024-09-24_15-06-00',
  trial_dir_name='TorchTrainer_2edb7_00000_0_2024-09-24_15-06-00',
  current_checkpoint_index=-1,
>
2024-09-24 15:06:03,893	DEBUG session.py:231 -- Changing the working directory to: /tmp/ray/session_2024-09-24_15-05-58_999739_31907/artifacts/2024-09-24_15-06-00/TorchTrainer_2024-09-24_15-06-00/working_dirs/TorchTrainer_2edb7_00000_0_2024-09-24_15-06-00
2024-09-24 15:06:03,893	DEBUG session.py:231 -- Changing the working directory to: /tmp/ray/session_2024-09-24_15-05-58_999739_31907/artifacts/2024-09-24_15-06-00/TorchTrainer_2024-09-24_15-06-00/working_dirs/TorchTrainer_2edb7_00000_0_2024-09-24_15-06-00
2024-09-24 15:06:04,906	DEBUG worker_group.py:216 -- Shutting down 2 workers.
2024-09-24 15:06:04,912	DEBUG worker_group.py:225 -- Graceful termination failed. Falling back to force kill.
2024-09-24 15:06:04,912	DEBUG worker_group.py:230 -- Shutdown successful.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@hongpeng-guo changed the title [Train Log][WIP] Ray Train Structured Logging → [Train Log]Ray Train Structured Logging on Sep 24, 2024

@justinvyu (Contributor) left a comment

This is nice! Here's a first pass with some questions.

(Resolved review threads on python/ray/train/constants.py and python/ray/train/_internal/logging.py)
(): ray._private.ray_logging.filters.CoreContextFilter

handlers:
file:

Contributor

Nit: file_text is more descriptive of the encoding mode

(Resolved review threads on python/ray/train/_internal/logging.yaml, python/ray/train/_internal/logging.py, and python/ray/__init__.py)

@omatthew98 (Contributor) left a comment

Added some comments on pieces that I changed during the Ray Data structured logging review; overall LGTM. Please ping me when this is ready for a second pass.

(Resolved review threads on python/ray/train/_internal/logging.py and python/ray/train/tests/test_logging.py)

@justinvyu (Contributor) left a comment

@omatthew98 @hongpeng-guo One other question I had: Why is the configuration via an environment variable rather than a config somewhere in the API? Is adding a config in the API (ex: DataContext / ray.train.RunConfig) a future plan?

@omatthew98 (Contributor) replied:

That is a fair question; I was mostly going off what we already had in place, which already used environment variables. I think there might be an argument for using environment variables to ensure logging is configured as early as possible (e.g., on module initialization, before a DataContext or train.RunConfig might exist), but I'm not sure if that's the case.

@omatthew98 (Contributor):

Just a heads up: based on this thread, we are going to move our YAML configurations to Python dictionary configurations.

@hongpeng-guo (Contributor, Author):

Comments handled, good to take a look for another round.

@aslonnie removed their request for review October 28, 2024 17:26
Comment on lines 48 to 50
DEFAULT_LOG_CONFIG_YAML_STRING = """
version: 1
disable_existing_loggers: False

Contributor

Can we just have this be a dict similar to Data?

https://github.com/ray-project/ray/pull/48093/files#diff-3a2ffc1cbd2991bc0acf5093b76ff0fbe78e3c441c7c04cd8eb62c8ccf6dbbddR10

A YAML string may have syntax errors and is not native to Python.

Contributor Author

Good point! I was thinking YAML files have better readability compared to JSON strings. However, once it's embedded as a string, I don't think it's still that readable. I will make changes to update it to the JSON string format.
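
For context, the dict-based config suggested here might look roughly like the sketch below; the formatter, handler, and path details are illustrative placeholders rather than the exact values in this PR.

# Hypothetical dict-based default logging config, loaded with
# logging.config.dictConfig. The real config wires in Ray's own formatter
# and filter classes; a plain format string keeps this sketch runnable.
import logging.config

DEFAULT_LOG_CONFIG_DICT = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "ray_json": {"format": "%(asctime)s %(levelname)s %(message)s"},
    },
    "handlers": {
        "file": {
            "class": "logging.FileHandler",
            "formatter": "ray_json",
            "filename": "ray-train.log",  # placeholder path
        },
    },
    "loggers": {
        "ray.train": {"level": "DEBUG", "handlers": ["file"], "propagate": False},
    },
}

logging.config.dictConfig(DEFAULT_LOG_CONFIG_DICT)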

Comment on lines 133 to 138
# Env. variable to specify the encoding of the file logs when using the default config.
LOG_ENCODING_ENV = "RAY_TRAIN_LOG_ENCODING"

# Env. variable to specify the logging config path use defaults if not set
LOG_CONFIG_PATH_ENV = "RAY_TRAIN_LOG_CONFIG_PATH"

Contributor

Can you move all these constants into the logging.py file? I'd rather keep this module isolated and have fewer changes in train/* code.

Contributor Author

Why do we want to make this module isolated from the other parts?

Contributor Author

I moved the default configuration JSON string and the default encoding format variables into logging.py, but still kept the two env var name variables in constants.py. I think it makes more sense to keep all the user-controlled env var names in constants.py.

import yaml

import ray
from ray.tests.conftest import * # noqa

Contributor

nit: don't think this is needed actually. conftest fixtures should automatically be visible

Contributor Author

Good point! I can remove this unnecessary import.

Comment on lines 161 to 164
console_log_output = capsys.readouterr().err
for log_line in console_log_output.splitlines():
with pytest.raises(json.JSONDecodeError):
json.loads(log_line)

Contributor

So all logger.warning(...) gets logged in JSON format to the stdout as well? Does it make more sense for console outputs to be in the normal text format, and then only the ray-train.log contains the JSON format?

Also, is it printed to stderr or stdout?

Contributor Author

  1. For logger.warning(...), the record is logged in JSON format to the ray-train.log file and in normal text format to the console. The pytest.raises(json.JSONDecodeError) here passes because the console output is plain text, not JSON (a simplified sketch of this check follows below).

  2. By default, all console output from the Python logging module goes to stderr. References: https://docs.python.org/3/howto/logging.html#advanced-logging-tutorial and https://github.com/hongpeng-guo/ray/blob/6a5fc2d39b0265e7b578f069a84ae772c123801b/python/ray/_private/log.py#L30
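
For reference, a simplified, self-contained version of that check might look like the sketch below; the hand-rolled handler setup stands in for Ray Train's actual logging configuration, and the JSON-shaped file formatter is only an approximation.

import json
import logging
import sys

import pytest


def _configure(tmp_path):
    # Hand-rolled stand-in for the Train logging config: plain text to stderr,
    # JSON-shaped records to a file.
    logger = logging.getLogger("sketch.train")
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    logger.handlers.clear()

    console = logging.StreamHandler(sys.stderr)
    console.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

    log_file = tmp_path / "ray-train.log"
    file_handler = logging.FileHandler(log_file)
    file_handler.setFormatter(
        logging.Formatter('{"levelname": "%(levelname)s", "message": "%(message)s"}')
    )

    logger.addHandler(console)
    logger.addHandler(file_handler)
    return logger, log_file


def test_console_is_text_and_file_is_json(tmp_path, capsys):
    logger, log_file = _configure(tmp_path)
    logger.warning("hello from a worker")

    # Console lines are plain text, so parsing them as JSON fails.
    for line in capsys.readouterr().err.splitlines():
        with pytest.raises(json.JSONDecodeError):
            json.loads(line)

    # File lines parse as JSON.
    for line in log_file.read_text().splitlines():
        assert json.loads(line)["message"] == "hello from a worker"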

@alanwguo (Contributor):

Can we include local, world, and node rank as part of the structure?

@hongpeng-guo (Contributor, Author) commented Oct 31, 2024

Sure, I am thinking about adding more Train-only context in a follow-up PR. We also need to differentiate between driver, controller, and worker processes; all the rank-related concepts are only defined on worker processes.

@hongpeng-guo (Contributor, Author) commented Nov 1, 2024

Update: I added a TrainContextFilter that appends rank information (world_rank, local_rank, world_size, local_world_size, and node_rank) to the structured logging records if the log is emitted from a train worker.
cc @alanwguo
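
In spirit, the filter behaves roughly like the sketch below; the rank lookup is a hypothetical stand-in for Ray Train's internal worker session, not the actual API.

import logging


def _get_worker_rank_info():
    """Hypothetical helper: return rank info when called from a train worker,
    or None on the driver/controller."""
    return None


class TrainContextFilter(logging.Filter):
    """Add training worker rank information to the log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        info = _get_worker_rank_info()
        if info is not None:
            record.world_rank = info["world_rank"]
            record.local_rank = info["local_rank"]
            record.node_rank = info["node_rank"]
            record.world_size = info["world_size"]
            record.local_world_size = info["local_world_size"]
        # The filter only annotates records; it never drops them.
        return True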

@hongpeng-guo (Contributor, Author) left a comment

Made some changes in this PR:

  1. There are still two handlers, one writing to the console and one to the file ray-train.log.
  2. The logs emitted to ray-train.log will always be in JSON. Users don't need to be aware of this; this file could be used only for ray.train internal purposes.
  3. The logs emitted to the console can be either TEXT or JSON based on the env variable RAY_TRAIN_LOG_ENCODING, similar to Ray Core's structured logging setup (sketched below).

All Ray Train processes have an extra field run_id that is unique to each train job. All Ray Train worker processes have the extra fields world_rank, world_size, local_rank, and local_world_size.
A worker process that is not world_rank=0 also gets an extra field hide=true. By default, console logs in TEXT mode only show rank 0 worker logs.

Please take a look cc @matthewdeng @justinvyu @alanwguo
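
A rough sketch of the handler selection described in point 3; the env var name and handler names come from this PR, while the wiring itself is illustrative.

import os

LOG_ENCODING_ENV = "RAY_TRAIN_LOG_ENCODING"


def _resolve_handlers():
    # Console encoding follows the env var; the "file" handler always writes
    # JSON to ray-train.log regardless.
    encoding = os.environ.get(LOG_ENCODING_ENV, "TEXT").upper()
    console_handler = "console_json" if encoding == "JSON" else "console_text"
    return ["file", console_handler]


print(_resolve_handlers())  # e.g. ['file', 'console_text']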

@alanwguo (Contributor) left a comment

Didn't look too closely at the code, but the high-level description sounds good.

Can you update the description example with the latest output that includes run_id and local rank?

Comment on lines 71 to 74
# This key is used to hide the log record if the value is True.
# By default, train workers that are not ranked zero will hide
# the log record.
HIDE = "hide"

Contributor

I don't think hide=True is necessary. At least, the product won't utilize this field.

Contributor Author

Got it! A follow-up question: what format should we follow so that the product team can filter which logs are shown or hidden by default in the log viewer?

Contributor

We can just filter out non-"rank 0" records, or implement whatever behavior we want based on the other fields.
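
For example, a log consumer could reproduce the old hide behavior purely from the rank fields; the sketch below uses the keys shown in the JSON output above, and the path and the choice to keep driver records are assumptions.

import json


def rank_zero_records(log_path="ray-train.log"):
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)
            # Driver/controller records carry no world_rank; keep them too.
            if record.get("world_rank", 0) == 0:
                yield record


for record in rank_zero_records():
    print(record["asctime"], record["levelname"], record["message"])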

@@ -444,7 +443,7 @@ def training_loop(self) -> None:
driver_ip=ray.util.get_node_ip_address(),
driver_node_id=ray.get_runtime_context().get_node_id(),
experiment_name=session.get_experiment_name(),
run_id=uuid.uuid4().hex,
run_id=session.get_run_id(),

Contributor Author

Just for my own understanding: this training_loop function is part of the tuner.fit() call, which runs in a Tune process. Therefore, session.get_run_id() will actually get the run_id from a Tune process, although this session is imported from ray.train._internal. This session is actually initialized inside function_trainable.py, which is defined under tune/trainable/function_trainable.py. cc @justinvyu @matthewdeng

Contributor

That's right, the training_loop here is the Train driver logic that is running inside the tune.FunctionTrainable. So the "session" refers to the Ray Tune session, not the Ray Train Worker session.

@hongpeng-guo (Contributor, Author):

Updated the output in the PR description.

Note: In the current implementation, we cannot attach the world_rank / run_id information to every single train worker log. The worker processes are created first and assigned rank information afterwards, so logs emitted before an actor gets its train_session cannot obtain the rank info or the run_id. Most logs can be correctly annotated with the train context info.

In Train V2, the implementation will be a bit different. We will fix the above issue in V2, as there is less coupling with Tune, which makes it easier to solve.

Comment on lines +70 to +74
WORLD_SIZE = "world_size"
WORLD_RANK = "world_rank"
LOCAL_WORLD_SIZE = "local_world_size"
LOCAL_RANK = "local_rank"
NODE_RANK = "node_rank"

Contributor

I think we should only tag world_rank, local_rank, and node_rank. world_size / local_world_size is confusing to filter by.

Contributor Author

Hmm, good point. The size information is not useful for log searching / filtering.

Comment on lines +14 to +15
# Env. variable to specify the encoding of the file logs when using the default config.
LOG_ENCODING_ENV = "RAY_TRAIN_LOG_ENCODING"

Contributor

RAY_TRAIN_LOG_ENCODING now controls the console output encoding format between TEXT and JSON, but I think it should always be TEXT.

Our product should probably never set console output to JSON mode automatically, and users should not know about this environment variable.

What about this:

  • Remove this RAY_TRAIN_LOG_ENCODING environment variable, so that console is always TEXT, and file (ray-train.log) is always JSON.

Contributor Author

I see the point, cc @justinvyu. There is actually a longer story behind why I changed to allow JSON mode in the console after discussing with @matthewdeng:

  1. I think logging in general is moving towards deprecating the log_to_driver feature, so the logs of all workers will not be shown in the driver console but persisted in their local files, i.e., *.err, *.std, *.log. These logs will eventually be ingested and displayed with more modern tools like the log viewer. In the long run, logging to the driver console will not be that useful.
  2. The Ray Core logger has only one stream handler that writes everything to the console, i.e., the *.err file of each node by default. These *.err console files are expected to contain JSON logs anyway if Ray Core structured logging is enabled. We are enabling JSON mode for console logs following Ray Core's pattern.

There could be another design pattern as you suggested: (1) everything sent to the console must be TEXT; (2) all the JSON logs go to a separate file (ray-core.log, ray-train.log, etc.) if JSON mode is enabled; (3) console log files (*.err, *.std) will never be ingested because they are never JSON. I think this also works but may need a wider revamp across many Ray libraries. We can chat about this tomorrow.

Comment on lines +40 to +50
"console_json": {
"class": "logging.StreamHandler",
"formatter": "ray_json",
"filters": ["core_context_filter", "train_context_filter"],
},
"console_text": {
"class": "ray._private.log.PlainRayHandler",
"formatter": "ray",
"level": "INFO",
"filters": ["train_context_filter", "console_filter"],
},

Contributor

Why do we switch between the core_context_filter vs. the console_filter?

Let's just use the core_context_filter and remove the console_filter (HiddenRecordFilter), since Alan mentions this HiddenRecordFilter is not needed.

Contributor Author

Sure, I don't think this HiddenRecordFilter is very useful either. I will remove it.

Comment on lines +99 to +102

This filter is a subclass of CoreContextFilter, which adds the job_id, worker_id,
and node_id to the log record. This filter adds the rank and size information of
the train context to the log record.

Contributor

Suggested change
This filter is a subclass of CoreContextFilter, which adds the job_id, worker_id,
and node_id to the log record. This filter adds the rank and size information of
the train context to the log record.



class TrainContextFilter(logging.Filter):
"""Add rank and size information to the log record if the log is from a train worker.

Contributor

Suggested change
"""Add rank and size information to the log record if the log is from a train worker.
"""Add training worker rank information to the log record.

Comment on lines +60 to +61
"level": "DEBUG",
"handlers": ["file", "console_text"],

Contributor

Can we limit console output to INFO and above? DEBUG is ok for the log-viewer since people can filter it out, but it will spam the console.

Contributor Author

We actually specified level INFO on the console_text handler.

  1. In text mode, the logger passes every message at DEBUG and above; when a record reaches the console_text handler, its INFO level filters out the DEBUG messages, making the console less spammy.
  2. In JSON mode, the file handler doesn't specify an extra level, so DEBUG records do appear in the JSON output and can be filtered out by the user.

However, if we set the default level of the ray.train logger to INFO, DEBUG records would be dropped at the logger level, so the file handler could not capture that extra information in JSON mode. (A minimal demonstration of this level layering follows below.)
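
A minimal, self-contained demonstration of that level layering; the handler wiring and file path here are illustrative, not Ray Train's actual configuration.

import logging
import sys

logger = logging.getLogger("level_demo")
logger.setLevel(logging.DEBUG)   # the logger forwards DEBUG and above
logger.propagate = False

console = logging.StreamHandler(sys.stderr)
console.setLevel(logging.INFO)   # the console handler drops DEBUG
logger.addHandler(console)

file_handler = logging.FileHandler("demo.log")  # no handler level: keeps DEBUG
logger.addHandler(file_handler)

logger.debug("only written to demo.log")
logger.info("written to both the console and demo.log")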


(Resolved review threads on python/ray/tune/trainable/function_trainable.py)
hongpeng-guo and others added 2 commits November 13, 2024 15:55