
[Train Log]Ray Train Structured Logging #47806

Open · wants to merge 35 commits into base: master
Conversation

@hongpeng-guo (Contributor) commented Sep 24, 2024

Why are these changes needed?

This PR adds structured logging for Ray Train. The overall structure follows the implementation in Ray Data's structured logging PR. The main components include:

  • python/ray/train/_internal/logging.py: this file defines the logging utility functions;
  • python/ray/train/tests/test_logging.py: this file provides the corresponding unit tests for the logging utilities.

Example

Code snippet:

import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.train._internal.logging import get_log_directory

import logging
logger = logging.getLogger("ray.train")

ray.init()

def train_func_per_worker():
    logger.info("Training function per worker")

def train_dummy(num_workers=2):
    scaling_config = ScalingConfig(num_workers=num_workers, use_gpu=False)

    trainer = TorchTrainer(
        train_loop_per_worker=train_func_per_worker,
        scaling_config=scaling_config,
    )

    result = trainer.fit()
    print(f"Training result: {result}")

if __name__ == "__main__":
    print(f"Log directory: {get_log_directory()}")
    train_dummy()

JSON Logging

RAY_TRAIN_LOG_ENCODING="JSON" python log.py
{"asctime": "2024-11-08 11:19:58,063", "levelname": "DEBUG", "message": "StorageContext on SESSION (rank=None):\nStorageContext<\n  storage_filesystem='local',\n  storage_fs_path='/Users/hpguo/ray_results',\n  experiment_dir_name='TorchTrainer_2024-11-08_11-19-56',\n  trial_dir_name='TorchTrainer_70293_00000_0_2024-11-08_11-19-56',\n  current_checkpoint_index=-1,\n>", "filename": "session.py", "lineno": 154, "job_id": "01000000", "worker_id": "cc277b45c69ca143be22e9895867bc7d259eb13fa29b7e275297cb66", "node_id": "eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389", "actor_id": "83fcb6f07a3a7a63545ea94601000000", "task_id": "ffffffffffffffff83fcb6f07a3a7a63545ea94601000000"}
{"asctime": "2024-11-08 11:19:58,063", "levelname": "DEBUG", "message": "Changing the working directory to: /tmp/ray/session_2024-11-08_11-19-54_509643_87981/artifacts/2024-11-08_11-19-56/TorchTrainer_2024-11-08_11-19-56/working_dirs/TorchTrainer_70293_00000_0_2024-11-08_11-19-56", "filename": "session.py", "lineno": 231, "job_id": "01000000", "worker_id": "cc277b45c69ca143be22e9895867bc7d259eb13fa29b7e275297cb66", "node_id": "eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389", "actor_id": "83fcb6f07a3a7a63545ea94601000000", "task_id": "ffffffffffffffff83fcb6f07a3a7a63545ea94601000000"}
{"asctime": "2024-11-08 11:19:58,077", "levelname": "DEBUG", "message": "Starting 2 workers.", "filename": "worker_group.py", "lineno": 202, "job_id": "01000000", "worker_id": "cc277b45c69ca143be22e9895867bc7d259eb13fa29b7e275297cb66", "node_id": "eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389", "actor_id": "83fcb6f07a3a7a63545ea94601000000", "task_id": "000685f8e792b9e7d5e21a98a52e99685b9417e901000000", "run_id": "78ae3ffdd2a34b6ba4e748993b9254f1"}
{"asctime": "2024-11-08 11:19:58,710", "levelname": "DEBUG", "message": "2 workers have successfully started.", "filename": "worker_group.py", "lineno": 204, "job_id": "01000000", "worker_id": "cc277b45c69ca143be22e9895867bc7d259eb13fa29b7e275297cb66", "node_id": "eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389", "actor_id": "83fcb6f07a3a7a63545ea94601000000", "task_id": "000685f8e792b9e7d5e21a98a52e99685b9417e901000000", "run_id": "78ae3ffdd2a34b6ba4e748993b9254f1"}
{"asctime": "2024-11-08 11:19:59,499", "levelname": "DEBUG", "message": "Setting up process group for: env:// [rank=1, world_size=2]", "filename": "config.py", "lineno": 88, "job_id": "01000000", "worker_id": "eed87fdf3a2fb18ebba0ba4a367323070e52eb86add3f43c9a9a3c18", "node_id": "eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389", "actor_id": "cea44824e6909a3c3093cc1601000000", "task_id": "b47a1a2bfd406adacea44824e6909a3c3093cc1601000000"}
{"asctime": "2024-11-08 11:19:59,499", "levelname": "INFO", "message": "Setting up process group for: env:// [rank=0, world_size=2]", "filename": "config.py", "lineno": 83, "job_id": "01000000", "worker_id": "6fb73cd9160368fb01de57987c130e85154b568d7e216483dab06ccc", "node_id": "eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389", "actor_id": "5d53407d44cd0a285004087301000000", "task_id": "260f31e5317176835d53407d44cd0a285004087301000000"}
{"asctime": "2024-11-08 11:19:59,499", "levelname": "DEBUG", "message": "using gloo", "filename": "config.py", "lineno": 92, "job_id": "01000000", "worker_id": "eed87fdf3a2fb18ebba0ba4a367323070e52eb86add3f43c9a9a3c18", "node_id": "eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389", "actor_id": "cea44824e6909a3c3093cc1601000000", "task_id": "b47a1a2bfd406adacea44824e6909a3c3093cc1601000000"}
{"asctime": "2024-11-08 11:19:59,499", "levelname": "DEBUG", "message": "using gloo", "filename": "config.py", "lineno": 92, "job_id": "01000000", "worker_id": "6fb73cd9160368fb01de57987c130e85154b568d7e216483dab06ccc", "node_id": "eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389", "actor_id": "5d53407d44cd0a285004087301000000", "task_id": "260f31e5317176835d53407d44cd0a285004087301000000"}
{"asctime": "2024-11-08 11:19:59,509", "levelname": "INFO", "message": "Started distributed worker processes: \n- (node_id=eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389, ip=127.0.0.1, pid=88059) world_rank=0, local_rank=0, node_rank=0\n- (node_id=eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389, ip=127.0.0.1, pid=88060) world_rank=1, local_rank=1, node_rank=0", "filename": "backend_executor.py", "lineno": 447, "job_id": "01000000", "worker_id": "cc277b45c69ca143be22e9895867bc7d259eb13fa29b7e275297cb66", "node_id": "eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389", "actor_id": "83fcb6f07a3a7a63545ea94601000000", "task_id": "000685f8e792b9e7d5e21a98a52e99685b9417e901000000", "run_id": "78ae3ffdd2a34b6ba4e748993b9254f1"}
{"asctime": "2024-11-08 11:19:59,535", "levelname": "DEBUG", "message": "StorageContext on SESSION (rank=1):\nStorageContext<\n  storage_filesystem='local',\n  storage_fs_path='/Users/hpguo/ray_results',\n  experiment_dir_name='TorchTrainer_2024-11-08_11-19-56',\n  trial_dir_name='TorchTrainer_70293_00000_0_2024-11-08_11-19-56',\n  current_checkpoint_index=-1,\n>", "filename": "session.py", "lineno": 154, "job_id": "01000000", "worker_id": "eed87fdf3a2fb18ebba0ba4a367323070e52eb86add3f43c9a9a3c18", "node_id": "eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389", "actor_id": "cea44824e6909a3c3093cc1601000000", "task_id": "c325fd635f23af07cea44824e6909a3c3093cc1601000000"}
{"asctime": "2024-11-08 11:19:59,535", "levelname": "DEBUG", "message": "StorageContext on SESSION (rank=0):\nStorageContext<\n  storage_filesystem='local',\n  storage_fs_path='/Users/hpguo/ray_results',\n  experiment_dir_name='TorchTrainer_2024-11-08_11-19-56',\n  trial_dir_name='TorchTrainer_70293_00000_0_2024-11-08_11-19-56',\n  current_checkpoint_index=-1,\n>", "filename": "session.py", "lineno": 154, "job_id": "01000000", "worker_id": "6fb73cd9160368fb01de57987c130e85154b568d7e216483dab06ccc", "node_id": "eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389", "actor_id": "5d53407d44cd0a285004087301000000", "task_id": "4a288fd0ddf2520b5d53407d44cd0a285004087301000000"}
{"asctime": "2024-11-08 11:19:59,536", "levelname": "DEBUG", "message": "Changing the working directory to: /tmp/ray/session_2024-11-08_11-19-54_509643_87981/artifacts/2024-11-08_11-19-56/TorchTrainer_2024-11-08_11-19-56/working_dirs/TorchTrainer_70293_00000_0_2024-11-08_11-19-56", "filename": "session.py", "lineno": 231, "job_id": "01000000", "worker_id": "6fb73cd9160368fb01de57987c130e85154b568d7e216483dab06ccc", "node_id": "eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389", "actor_id": "5d53407d44cd0a285004087301000000", "task_id": "4a288fd0ddf2520b5d53407d44cd0a285004087301000000"}
{"asctime": "2024-11-08 11:19:59,536", "levelname": "DEBUG", "message": "Changing the working directory to: /tmp/ray/session_2024-11-08_11-19-54_509643_87981/artifacts/2024-11-08_11-19-56/TorchTrainer_2024-11-08_11-19-56/working_dirs/TorchTrainer_70293_00000_0_2024-11-08_11-19-56", "filename": "session.py", "lineno": 231, "job_id": "01000000", "worker_id": "eed87fdf3a2fb18ebba0ba4a367323070e52eb86add3f43c9a9a3c18", "node_id": "eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389", "actor_id": "cea44824e6909a3c3093cc1601000000", "task_id": "c325fd635f23af07cea44824e6909a3c3093cc1601000000"}
{"asctime": "2024-11-08 11:19:59,537", "levelname": "INFO", "message": "Training function per worker", "filename": "_log.py", "lineno": 12, "job_id": "01000000", "worker_id": "6fb73cd9160368fb01de57987c130e85154b568d7e216483dab06ccc", "node_id": "eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389", "actor_id": "5d53407d44cd0a285004087301000000", "task_id": "0cd5197cf7154f3c8587d3ed0f454189e08782b401000000", "run_id": "78ae3ffdd2a34b6ba4e748993b9254f1", "world_size": 2, "world_rank": 0, "local_world_size": 0, "local_rank": 2, "node_rank": 0}
{"asctime": "2024-11-08 11:19:59,537", "levelname": "INFO", "message": "Training function per worker", "filename": "_log.py", "lineno": 12, "job_id": "01000000", "worker_id": "eed87fdf3a2fb18ebba0ba4a367323070e52eb86add3f43c9a9a3c18", "node_id": "eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389", "actor_id": "cea44824e6909a3c3093cc1601000000", "task_id": "655b1b0fd6924ef51ba24695fb8f2e52545b4feb01000000", "run_id": "78ae3ffdd2a34b6ba4e748993b9254f1", "world_size": 2, "world_rank": 1, "local_world_size": 1, "local_rank": 2, "node_rank": 0, "hide": true}
{"asctime": "2024-11-08 11:20:00,550", "levelname": "DEBUG", "message": "Shutting down 2 workers.", "filename": "worker_group.py", "lineno": 216, "job_id": "01000000", "worker_id": "cc277b45c69ca143be22e9895867bc7d259eb13fa29b7e275297cb66", "node_id": "eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389", "actor_id": "83fcb6f07a3a7a63545ea94601000000", "task_id": "000685f8e792b9e7d5e21a98a52e99685b9417e901000000", "run_id": "78ae3ffdd2a34b6ba4e748993b9254f1"}
{"asctime": "2024-11-08 11:20:00,565", "levelname": "DEBUG", "message": "Graceful termination failed. Falling back to force kill.", "filename": "worker_group.py", "lineno": 225, "job_id": "01000000", "worker_id": "cc277b45c69ca143be22e9895867bc7d259eb13fa29b7e275297cb66", "node_id": "eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389", "actor_id": "83fcb6f07a3a7a63545ea94601000000", "task_id": "000685f8e792b9e7d5e21a98a52e99685b9417e901000000", "run_id": "78ae3ffdd2a34b6ba4e748993b9254f1"}
{"asctime": "2024-11-08 11:20:00,565", "levelname": "DEBUG", "message": "Shutdown successful.", "filename": "worker_group.py", "lineno": 230, "job_id": "01000000", "worker_id": "cc277b45c69ca143be22e9895867bc7d259eb13fa29b7e275297cb66", "node_id": "eb3c53c7f408a227b40380fc80e2d50ccc98aaa45bda0a17e9036389", "actor_id": "83fcb6f07a3a7a63545ea94601000000", "task_id": "000685f8e792b9e7d5e21a98a52e99685b9417e901000000", "run_id": "78ae3ffdd2a34b6ba4e748993b9254f1"}

Text Logging

RAY_TRAIN_LOG_ENCODING="TEXT" python log.py
2024-09-24 15:06:02,274	DEBUG session.py:154 -- StorageContext on SESSION (rank=None):
StorageContext<
  storage_filesystem='local',
  storage_fs_path='/Users/hpguo/ray_results',
  experiment_dir_name='TorchTrainer_2024-09-24_15-06-00',
  trial_dir_name='TorchTrainer_2edb7_00000_0_2024-09-24_15-06-00',
  current_checkpoint_index=-1,
>
2024-09-24 15:06:02,274	DEBUG session.py:231 -- Changing the working directory to: /tmp/ray/session_2024-09-24_15-05-58_999739_31907/artifacts/2024-09-24_15-06-00/TorchTrainer_2024-09-24_15-06-00/working_dirs/TorchTrainer_2edb7_00000_0_2024-09-24_15-06-00
2024-09-24 15:06:02,286	DEBUG worker_group.py:202 -- Starting 2 workers.
2024-09-24 15:06:02,973	DEBUG worker_group.py:204 -- 2 workers have successfully started.
2024-09-24 15:06:03,812	INFO config.py:83 -- Setting up process group for: env:// [rank=0, world_size=2]
2024-09-24 15:06:03,812	DEBUG config.py:88 -- Setting up process group for: env:// [rank=1, world_size=2]
2024-09-24 15:06:03,812	DEBUG config.py:92 -- using gloo
2024-09-24 15:06:03,812	DEBUG config.py:92 -- using gloo
2024-09-24 15:06:03,863	INFO backend_executor.py:447 -- Started distributed worker processes: 
- (node_id=f4b1ea9c06ed3425b929fb70ede36ada34e3e3131b0c00318a7dee8a, ip=127.0.0.1, pid=31968) world_rank=0, local_rank=0, node_rank=0
- (node_id=f4b1ea9c06ed3425b929fb70ede36ada34e3e3131b0c00318a7dee8a, ip=127.0.0.1, pid=31969) world_rank=1, local_rank=1, node_rank=0
2024-09-24 15:06:03,893	DEBUG session.py:154 -- StorageContext on SESSION (rank=0):
StorageContext<
  storage_filesystem='local',
  storage_fs_path='/Users/hpguo/ray_results',
  experiment_dir_name='TorchTrainer_2024-09-24_15-06-00',
  trial_dir_name='TorchTrainer_2edb7_00000_0_2024-09-24_15-06-00',
  current_checkpoint_index=-1,
>
2024-09-24 15:06:03,893	DEBUG session.py:154 -- StorageContext on SESSION (rank=1):
StorageContext<
  storage_filesystem='local',
  storage_fs_path='/Users/hpguo/ray_results',
  experiment_dir_name='TorchTrainer_2024-09-24_15-06-00',
  trial_dir_name='TorchTrainer_2edb7_00000_0_2024-09-24_15-06-00',
  current_checkpoint_index=-1,
>
2024-09-24 15:06:03,893	DEBUG session.py:231 -- Changing the working directory to: /tmp/ray/session_2024-09-24_15-05-58_999739_31907/artifacts/2024-09-24_15-06-00/TorchTrainer_2024-09-24_15-06-00/working_dirs/TorchTrainer_2edb7_00000_0_2024-09-24_15-06-00
2024-09-24 15:06:03,893	DEBUG session.py:231 -- Changing the working directory to: /tmp/ray/session_2024-09-24_15-05-58_999739_31907/artifacts/2024-09-24_15-06-00/TorchTrainer_2024-09-24_15-06-00/working_dirs/TorchTrainer_2edb7_00000_0_2024-09-24_15-06-00
2024-09-24 15:06:04,906	DEBUG worker_group.py:216 -- Shutting down 2 workers.
2024-09-24 15:06:04,912	DEBUG worker_group.py:225 -- Graceful termination failed. Falling back to force kill.
2024-09-24 15:06:04,912	DEBUG worker_group.py:230 -- Shutdown successful.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@hongpeng-guo changed the title [Train Log][WIP] Ray Train Structured Logging → [Train Log]Ray Train Structured Logging on Sep 24, 2024

@justinvyu (Contributor) left a comment

This is nice! Here's a first pass with some questions.

(Resolved review threads on python/ray/train/constants.py and python/ray/train/_internal/logging.py)
(): ray._private.ray_logging.filters.CoreContextFilter

handlers:
file:

Contributor

Nit: file_text is more descriptive of the encoding mode

(Resolved review threads on python/ray/train/_internal/logging.yaml, python/ray/train/_internal/logging.py, and python/ray/__init__.py)

@omatthew98 (Contributor) left a comment

Added some comments on pieces that I changed during the Ray Data structured logging review; overall LGTM. Please ping me when this is ready for a second pass.

(Resolved review threads on python/ray/train/_internal/logging.py and python/ray/train/tests/test_logging.py)

@justinvyu (Contributor) left a comment

@omatthew98 @hongpeng-guo One other question I had: Why is the configuration via an environment variable rather than a config somewhere in the API? Is adding a config in the API (ex: DataContext / ray.train.RunConfig) a future plan?

@omatthew98 (Contributor) replied:

That is a fair question; I was mostly going off what we already had in place, which already used environment variables. I think there might be an argument for using environment variables to ensure logging is configured as early as possible (e.g., on module initialization, before a DataContext or train.RunConfig might exist), but I'm not sure if that's the case.

@omatthew98 (Contributor):

Just a heads up: based on this thread, we are going to move our YAML configurations to Python dictionary configurations.

@hongpeng-guo (Contributor, Author):

Comments handled, good to take a look for another round.

@aslonnie removed their request for review October 28, 2024 17:26
Comment on lines 48 to 50
DEFAULT_LOG_CONFIG_YAML_STRING = """
version: 1
disable_existing_loggers: False

Contributor

Can we just have this be a dict similar to Data?

https://github.com/ray-project/ray/pull/48093/files#diff-3a2ffc1cbd2991bc0acf5093b76ff0fbe78e3c441c7c04cd8eb62c8ccf6dbbddR10

A YAML string may have syntax errors and is not native to Python.

Contributor Author

Good point! I was thinking YAML files have better readability compared to JSON strings. However, once it's embedded as a string, I don't think it's still that readable. I will make changes to update it to the JSON string format.
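
For context, the dict-based config suggested here might look roughly like the sketch below; the formatter, handler, and path details are illustrative placeholders rather than the exact values in this PR.

# Hypothetical dict-based default logging config, loaded with
# logging.config.dictConfig. The real config wires in Ray's own formatter
# and filter classes; a plain format string keeps this sketch runnable.
import logging.config

DEFAULT_LOG_CONFIG_DICT = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "ray_json": {"format": "%(asctime)s %(levelname)s %(message)s"},
    },
    "handlers": {
        "file": {
            "class": "logging.FileHandler",
            "formatter": "ray_json",
            "filename": "ray-train.log",  # placeholder path
        },
    },
    "loggers": {
        "ray.train": {"level": "DEBUG", "handlers": ["file"], "propagate": False},
    },
}

logging.config.dictConfig(DEFAULT_LOG_CONFIG_DICT)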

Comment on lines 133 to 138
# Env. variable to specify the encoding of the file logs when using the default config.
LOG_ENCODING_ENV = "RAY_TRAIN_LOG_ENCODING"

# Env. variable to specify the logging config path use defaults if not set
LOG_CONFIG_PATH_ENV = "RAY_TRAIN_LOG_CONFIG_PATH"

Contributor

Can you move all these constants into the logging.py file? I'd rather keep this module isolated and have fewer changes in train/* code.

Contributor Author

Why do we want to make this module isolated from the other parts?

Contributor Author

I moved the default configuration JSON string and the default encoding format variables into logging.py, but still kept the two env var name variables in constants.py. I think it makes more sense to keep all the user-controlled env var names in constants.py.

import yaml

import ray
from ray.tests.conftest import * # noqa

Contributor

nit: don't think this is needed actually. conftest fixtures should automatically be visible

Contributor Author

Good point! I can remove this unnecessary import.

Comment on lines 161 to 164
console_log_output = capsys.readouterr().err
for log_line in console_log_output.splitlines():
with pytest.raises(json.JSONDecodeError):
json.loads(log_line)

Contributor

So all logger.warning(...) gets logged in JSON format to the stdout as well? Does it make more sense for console outputs to be in the normal text format, and then only the ray-train.log contains the JSON format?

Also, is it printed to stderr or stdout?

Contributor Author

  1. For logger.warning(...), the record is logged in JSON format to the ray-train.log file and in normal text format to the console. The pytest.raises(json.JSONDecodeError) here passes because the console output is plain text, not JSON (a simplified sketch of this check follows below).

  2. By default, all console output from the Python logging module goes to stderr. References: https://docs.python.org/3/howto/logging.html#advanced-logging-tutorial and https://github.com/hongpeng-guo/ray/blob/6a5fc2d39b0265e7b578f069a84ae772c123801b/python/ray/_private/log.py#L30
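
For reference, a simplified, self-contained version of that check might look like the sketch below; the hand-rolled handler setup stands in for Ray Train's actual logging configuration, and the JSON-shaped file formatter is only an approximation.

import json
import logging
import sys

import pytest


def _configure(tmp_path):
    # Hand-rolled stand-in for the Train logging config: plain text to stderr,
    # JSON-shaped records to a file.
    logger = logging.getLogger("sketch.train")
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    logger.handlers.clear()

    console = logging.StreamHandler(sys.stderr)
    console.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

    log_file = tmp_path / "ray-train.log"
    file_handler = logging.FileHandler(log_file)
    file_handler.setFormatter(
        logging.Formatter('{"levelname": "%(levelname)s", "message": "%(message)s"}')
    )

    logger.addHandler(console)
    logger.addHandler(file_handler)
    return logger, log_file


def test_console_is_text_and_file_is_json(tmp_path, capsys):
    logger, log_file = _configure(tmp_path)
    logger.warning("hello from a worker")

    # Console lines are plain text, so parsing them as JSON fails.
    for line in capsys.readouterr().err.splitlines():
        with pytest.raises(json.JSONDecodeError):
            json.loads(line)

    # File lines parse as JSON.
    for line in log_file.read_text().splitlines():
        assert json.loads(line)["message"] == "hello from a worker"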

@alanwguo (Contributor):

Can we include local, world, and node rank as part of the structure?

@hongpeng-guo (Contributor, Author) commented Oct 31, 2024

Sure, I am thinking about adding more Train-only context in a follow-up PR. We also need to differentiate between driver, controller, and worker processes; all the rank-related concepts are only defined on worker processes.

@hongpeng-guo (Contributor, Author) commented Nov 1, 2024

Update: I added a TrainContextFilter that appends rank information (world_rank, local_rank, world_size, local_world_size, and node_rank) to the structured logging records if the log is emitted from a train worker.
cc @alanwguo
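
In spirit, the filter behaves roughly like the sketch below; the rank lookup is a hypothetical stand-in for Ray Train's internal worker session, not the actual API.

import logging


def _get_worker_rank_info():
    """Hypothetical helper: return rank info when called from a train worker,
    or None on the driver/controller."""
    return None


class TrainContextFilter(logging.Filter):
    """Add training worker rank information to the log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        info = _get_worker_rank_info()
        if info is not None:
            record.world_rank = info["world_rank"]
            record.local_rank = info["local_rank"]
            record.node_rank = info["node_rank"]
            record.world_size = info["world_size"]
            record.local_world_size = info["local_world_size"]
        # The filter only annotates records; it never drops them.
        return True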

@hongpeng-guo (Contributor, Author) left a comment

Made some changes in this PR:

  1. There are still two handlers, one writing to the console and one to the file ray-train.log.
  2. The logs emitted to ray-train.log will always be in JSON. Users don't need to be aware of this; this file could be used only for ray.train internal purposes.
  3. The logs emitted to the console can be either TEXT or JSON based on the env variable RAY_TRAIN_LOG_ENCODING, similar to Ray Core's structured logging setup (sketched below).

All Ray Train processes have an extra field run_id that is unique to each train job. All Ray Train worker processes have the extra fields world_rank, world_size, local_rank, and local_world_size.
A worker process that is not world_rank=0 also gets an extra field hide=true. By default, console logs in TEXT mode only show rank 0 worker logs.

Please take a look cc @matthewdeng @justinvyu @alanwguo
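
A rough sketch of the handler selection described in point 3; the env var name and handler names come from this PR, while the wiring itself is illustrative.

import os

LOG_ENCODING_ENV = "RAY_TRAIN_LOG_ENCODING"


def _resolve_handlers():
    # Console encoding follows the env var; the "file" handler always writes
    # JSON to ray-train.log regardless.
    encoding = os.environ.get(LOG_ENCODING_ENV, "TEXT").upper()
    console_handler = "console_json" if encoding == "JSON" else "console_text"
    return ["file", console_handler]


print(_resolve_handlers())  # e.g. ['file', 'console_text']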

@alanwguo (Contributor) left a comment

Didn't look too closely at the code, but the high-level description sounds good.

Can you update the description example with the latest output that includes run_id and local rank?

Comment on lines 71 to 74
# This key is used to hide the log record if the value is True.
# By default, train workers that are not ranked zero will hide
# the log record.
HIDE = "hide"

Contributor

I don't think hide=True is necessary. At least, the product won't utilize this field.

Contributor Author

Got it! A follow-up question: what format should we follow so that the product team can filter which logs are shown or hidden by default in the log viewer?

Contributor

We can just filter out non-"rank 0" records, or implement whatever behavior we want based on the other fields.
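
For example, a log consumer could reproduce the old hide behavior purely from the rank fields; the sketch below uses the keys shown in the JSON output above, and the path and the choice to keep driver records are assumptions.

import json


def rank_zero_records(log_path="ray-train.log"):
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)
            # Driver/controller records carry no world_rank; keep them too.
            if record.get("world_rank", 0) == 0:
                yield record


for record in rank_zero_records():
    print(record["asctime"], record["levelname"], record["message"])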

@@ -444,7 +443,7 @@ def training_loop(self) -> None:
driver_ip=ray.util.get_node_ip_address(),
driver_node_id=ray.get_runtime_context().get_node_id(),
experiment_name=session.get_experiment_name(),
run_id=uuid.uuid4().hex,
run_id=session.get_run_id(),

Contributor Author

Just for my own understanding: this training_loop function is part of the tuner.fit() call, which runs in a Tune process. Therefore, session.get_run_id() will actually get the run_id from a Tune process, although this session is imported from ray.train._internal. This session is actually initialized inside function_trainable.py, which is defined under tune/trainable/function_trainable.py. cc @justinvyu @matthewdeng

Contributor

That's right, the training_loop here is the Train driver logic that is running inside the tune.FunctionTrainable. So the "session" refers to the Ray Tune session, not the Ray Train Worker session.

@hongpeng-guo (Contributor, Author):

Updated the output in the PR description.

Note: In the current implementation, we cannot attach the world_rank / run_id information to every single train worker log. The worker processes are created first and assigned rank information afterwards, so logs emitted before an actor gets its train_session cannot obtain the rank info or the run_id. Most logs can be correctly annotated with the train context info.

In Train V2, the implementation will be a bit different. We will fix the above issue in V2, as there is less coupling with Tune, which makes it easier to solve.

Comment on lines +70 to +74
WORLD_SIZE = "world_size"
WORLD_RANK = "world_rank"
LOCAL_WORLD_SIZE = "local_world_size"
LOCAL_RANK = "local_rank"
NODE_RANK = "node_rank"

Contributor

I think we should only tag world_rank, local_rank, and node_rank. world_size / local_world_size is confusing to filter by.

Contributor Author

Hmm, good point. The size information is not useful for log searching / filtering.

Comment on lines +14 to +15
# Env. variable to specify the encoding of the file logs when using the default config.
LOG_ENCODING_ENV = "RAY_TRAIN_LOG_ENCODING"

Contributor

RAY_TRAIN_LOG_ENCODING now controls the console output encoding format between TEXT and JSON, but I think it should always be TEXT.

Our product should probably never set console output to JSON mode automatically, and users should not know about this environment variable.

What about this:

  • Remove this RAY_TRAIN_LOG_ENCODING environment variable, so that console is always TEXT, and file (ray-train.log) is always JSON.

Contributor Author

I see the point, cc @justinvyu. There is actually a longer story behind why I changed to allow JSON mode in the console after discussing with @matthewdeng:

  1. I think logging in general is moving towards deprecating the log_to_driver feature, so the logs of all workers will not be shown in the driver console but persisted in their local files, i.e., *.err, *.std, *.log. These logs will eventually be ingested and displayed with more modern tools like the log viewer. In the long run, logging to the driver console will not be that useful.
  2. The Ray Core logger has only one stream handler that writes everything to the console, i.e., the *.err file of each node by default. These *.err console files are expected to contain JSON logs anyway if Ray Core structured logging is enabled. We are enabling JSON mode for console logs following Ray Core's pattern.

There could be another design pattern as you suggested: (1) everything sent to the console must be TEXT; (2) all the JSON logs go to a separate file (ray-core.log, ray-train.log, etc.) if JSON mode is enabled; (3) console log files (*.err, *.std) will never be ingested because they are never JSON. I think this also works but may need a wider revamp across many Ray libraries. We can chat about this tomorrow.

Comment on lines +40 to +50
"console_json": {
"class": "logging.StreamHandler",
"formatter": "ray_json",
"filters": ["core_context_filter", "train_context_filter"],
},
"console_text": {
"class": "ray._private.log.PlainRayHandler",
"formatter": "ray",
"level": "INFO",
"filters": ["train_context_filter", "console_filter"],
},

Contributor

Why do we switch between the core_context_filter vs. the console_filter?

Let's just use the core_context_filter and remove the console_filter (HiddenRecordFilter), since Alan mentions this HiddenRecordFilter is not needed.

Contributor Author

Sure, I don't think this HiddenRecordFilter is very useful either. I will remove it.

Comment on lines +99 to +102

This filter is a subclass of CoreContextFilter, which adds the job_id, worker_id,
and node_id to the log record. This filter adds the rank and size information of
the train context to the log record.

Contributor

Suggested change
This filter is a subclass of CoreContextFilter, which adds the job_id, worker_id,
and node_id to the log record. This filter adds the rank and size information of
the train context to the log record.



class TrainContextFilter(logging.Filter):
"""Add rank and size information to the log record if the log is from a train worker.

Contributor

Suggested change
"""Add rank and size information to the log record if the log is from a train worker.
"""Add training worker rank information to the log record.

Comment on lines +60 to +61
"level": "DEBUG",
"handlers": ["file", "console_text"],

Contributor

Can we limit console output to INFO and above? DEBUG is ok for the log-viewer since people can filter it out, but it will spam the console.

Contributor Author

We actually specified level INFO on the console_text handler.

  1. In text mode, the logger passes every message at DEBUG and above; when a record reaches the console_text handler, its INFO level filters out the DEBUG messages, making the console less spammy.
  2. In JSON mode, the file handler doesn't specify an extra level, so DEBUG records do appear in the JSON output and can be filtered out by the user.

However, if we set the default level of the ray.train logger to INFO, DEBUG records would be dropped at the logger level, so the file handler could not capture that extra information in JSON mode. (A minimal demonstration of this level layering follows below.)
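
A minimal, self-contained demonstration of that level layering; the handler wiring and file path here are illustrative, not Ray Train's actual configuration.

import logging
import sys

logger = logging.getLogger("level_demo")
logger.setLevel(logging.DEBUG)   # the logger forwards DEBUG and above
logger.propagate = False

console = logging.StreamHandler(sys.stderr)
console.setLevel(logging.INFO)   # the console handler drops DEBUG
logger.addHandler(console)

file_handler = logging.FileHandler("demo.log")  # no handler level: keeps DEBUG
logger.addHandler(file_handler)

logger.debug("only written to demo.log")
logger.info("written to both the console and demo.log")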


(Resolved review threads on python/ray/tune/trainable/function_trainable.py)
hongpeng-guo and others added 2 commits November 13, 2024 15:55