
[Train Log]Ray Train Structured Logging #47806

Open
wants to merge 25 commits into master

Conversation

Contributor

@hongpeng-guo hongpeng-guo commented Sep 24, 2024

Why are these changes needed?

This PR creates structured logging for Ray Train. The structure follows the implementation of Ray Data's structured logging PR. The main components are:

  • python/ray/train/_internal/logging.py: this file defines the logging utility functions;
  • python/ray/train/tests/test_logging.py: this file provides the corresponding unit tests for the logging utilities.

Example

Code snippet:

import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.train._internal.logging import get_log_directory

import logging
logger = logging.getLogger("ray.train")

ray.init()

def train_func_per_worker():
    logger.info("Training function per worker")

def train_dummy(num_workers=2):
    scaling_config = ScalingConfig(num_workers=num_workers, use_gpu=False)

    trainer = TorchTrainer(
        train_loop_per_worker=train_func_per_worker,
        scaling_config=scaling_config,
    )

    result = trainer.fit()
    print(f"Training result: {result}")

if __name__ == "__main__":
    print(f"Log directory: {get_log_directory()}")
    train_dummy()

JSON Logging

RAY_TRAIN_LOG_ENCODING="JSON" python log.py
{"asctime": "2024-11-01 14:42:01,772", "levelname": "DEBUG", "message": "StorageContext on SESSION (rank=None):\nStorageContext<\n  storage_filesystem='local',\n  storage_fs_path='/Users/hpguo/ray_results',\n  experiment_dir_name='TorchTrainer_2024-11-01_14-42-00',\n  trial_dir_name='TorchTrainer_1ff8a_00000_0_2024-11-01_14-42-00',\n  current_checkpoint_index=-1,\n>", "filename": "session.py", "lineno": 154, "job_id": "01000000", "worker_id": "fbb4d4ba1a20870da058447259990e6b9feb17045e59aef3db06bb41", "node_id": "8bb78ec9d3c95a59e53764d8510d5b669918f128ac8da7d1c4b4d440", "actor_id": "fa30a41e3d1adbcbe21f0b7c01000000", "task_id": "fffffffffffffffffa30a41e3d1adbcbe21f0b7c01000000"}
{"asctime": "2024-11-01 14:42:01,772", "levelname": "DEBUG", "message": "Changing the working directory to: /tmp/ray/session_2024-11-01_14-41-58_674237_12567/artifacts/2024-11-01_14-42-00/TorchTrainer_2024-11-01_14-42-00/working_dirs/TorchTrainer_1ff8a_00000_0_2024-11-01_14-42-00", "filename": "session.py", "lineno": 231, "job_id": "01000000", "worker_id": "fbb4d4ba1a20870da058447259990e6b9feb17045e59aef3db06bb41", "node_id": "8bb78ec9d3c95a59e53764d8510d5b669918f128ac8da7d1c4b4d440", "actor_id": "fa30a41e3d1adbcbe21f0b7c01000000", "task_id": "fffffffffffffffffa30a41e3d1adbcbe21f0b7c01000000"}
{"asctime": "2024-11-01 14:42:01,784", "levelname": "DEBUG", "message": "Starting 2 workers.", "filename": "worker_group.py", "lineno": 202, "job_id": "01000000", "worker_id": "fbb4d4ba1a20870da058447259990e6b9feb17045e59aef3db06bb41", "node_id": "8bb78ec9d3c95a59e53764d8510d5b669918f128ac8da7d1c4b4d440", "actor_id": "fa30a41e3d1adbcbe21f0b7c01000000", "task_id": "fa78bc53cbcd3906b97af0a2c723c9306db701aa01000000"}
{"asctime": "2024-11-01 14:42:02,464", "levelname": "DEBUG", "message": "2 workers have successfully started.", "filename": "worker_group.py", "lineno": 204, "job_id": "01000000", "worker_id": "fbb4d4ba1a20870da058447259990e6b9feb17045e59aef3db06bb41", "node_id": "8bb78ec9d3c95a59e53764d8510d5b669918f128ac8da7d1c4b4d440", "actor_id": "fa30a41e3d1adbcbe21f0b7c01000000", "task_id": "fa78bc53cbcd3906b97af0a2c723c9306db701aa01000000"}
{"asctime": "2024-11-01 14:42:03,296", "levelname": "DEBUG", "message": "Setting up process group for: env:// [rank=1, world_size=2]", "filename": "config.py", "lineno": 88, "job_id": "01000000", "worker_id": "d6f9fed0728edd8b2e6ccfef40e940342871b55f65108a1b3d1935ce", "node_id": "8bb78ec9d3c95a59e53764d8510d5b669918f128ac8da7d1c4b4d440", "actor_id": "e29110a539fe0ce526aae31501000000", "task_id": "b99edc226fd7de6ae29110a539fe0ce526aae31501000000"}
{"asctime": "2024-11-01 14:42:03,297", "levelname": "DEBUG", "message": "using gloo", "filename": "config.py", "lineno": 92, "job_id": "01000000", "worker_id": "d6f9fed0728edd8b2e6ccfef40e940342871b55f65108a1b3d1935ce", "node_id": "8bb78ec9d3c95a59e53764d8510d5b669918f128ac8da7d1c4b4d440", "actor_id": "e29110a539fe0ce526aae31501000000", "task_id": "b99edc226fd7de6ae29110a539fe0ce526aae31501000000"}
{"asctime": "2024-11-01 14:42:03,297", "levelname": "INFO", "message": "Setting up process group for: env:// [rank=0, world_size=2]", "filename": "config.py", "lineno": 83, "job_id": "01000000", "worker_id": "143d46e13a39a6e868d52278b020e1ec26e49458dc6ca2dfe050bfb3", "node_id": "8bb78ec9d3c95a59e53764d8510d5b669918f128ac8da7d1c4b4d440", "actor_id": "9934b6c1b68ac88adfffafca01000000", "task_id": "c8a4749a72b733e29934b6c1b68ac88adfffafca01000000"}
{"asctime": "2024-11-01 14:42:03,297", "levelname": "DEBUG", "message": "using gloo", "filename": "config.py", "lineno": 92, "job_id": "01000000", "worker_id": "143d46e13a39a6e868d52278b020e1ec26e49458dc6ca2dfe050bfb3", "node_id": "8bb78ec9d3c95a59e53764d8510d5b669918f128ac8da7d1c4b4d440", "actor_id": "9934b6c1b68ac88adfffafca01000000", "task_id": "c8a4749a72b733e29934b6c1b68ac88adfffafca01000000"}
{"asctime": "2024-11-01 14:42:03,318", "levelname": "INFO", "message": "Started distributed worker processes: \n- (node_id=8bb78ec9d3c95a59e53764d8510d5b669918f128ac8da7d1c4b4d440, ip=127.0.0.1, pid=12614) world_rank=0, local_rank=0, node_rank=0\n- (node_id=8bb78ec9d3c95a59e53764d8510d5b669918f128ac8da7d1c4b4d440, ip=127.0.0.1, pid=12615) world_rank=1, local_rank=1, node_rank=0", "filename": "backend_executor.py", "lineno": 447, "job_id": "01000000", "worker_id": "fbb4d4ba1a20870da058447259990e6b9feb17045e59aef3db06bb41", "node_id": "8bb78ec9d3c95a59e53764d8510d5b669918f128ac8da7d1c4b4d440", "actor_id": "fa30a41e3d1adbcbe21f0b7c01000000", "task_id": "fa78bc53cbcd3906b97af0a2c723c9306db701aa01000000"}
{"asctime": "2024-11-01 14:42:03,347", "levelname": "DEBUG", "message": "StorageContext on SESSION (rank=0):\nStorageContext<\n  storage_filesystem='local',\n  storage_fs_path='/Users/hpguo/ray_results',\n  experiment_dir_name='TorchTrainer_2024-11-01_14-42-00',\n  trial_dir_name='TorchTrainer_1ff8a_00000_0_2024-11-01_14-42-00',\n  current_checkpoint_index=-1,\n>", "filename": "session.py", "lineno": 154, "job_id": "01000000", "worker_id": "143d46e13a39a6e868d52278b020e1ec26e49458dc6ca2dfe050bfb3", "node_id": "8bb78ec9d3c95a59e53764d8510d5b669918f128ac8da7d1c4b4d440", "actor_id": "9934b6c1b68ac88adfffafca01000000", "task_id": "b195c8cb65d475ad9934b6c1b68ac88adfffafca01000000"}
{"asctime": "2024-11-01 14:42:03,347", "levelname": "DEBUG", "message": "StorageContext on SESSION (rank=1):\nStorageContext<\n  storage_filesystem='local',\n  storage_fs_path='/Users/hpguo/ray_results',\n  experiment_dir_name='TorchTrainer_2024-11-01_14-42-00',\n  trial_dir_name='TorchTrainer_1ff8a_00000_0_2024-11-01_14-42-00',\n  current_checkpoint_index=-1,\n>", "filename": "session.py", "lineno": 154, "job_id": "01000000", "worker_id": "d6f9fed0728edd8b2e6ccfef40e940342871b55f65108a1b3d1935ce", "node_id": "8bb78ec9d3c95a59e53764d8510d5b669918f128ac8da7d1c4b4d440", "actor_id": "e29110a539fe0ce526aae31501000000", "task_id": "51fe9f4d40583850e29110a539fe0ce526aae31501000000"}
{"asctime": "2024-11-01 14:42:03,347", "levelname": "DEBUG", "message": "Changing the working directory to: /tmp/ray/session_2024-11-01_14-41-58_674237_12567/artifacts/2024-11-01_14-42-00/TorchTrainer_2024-11-01_14-42-00/working_dirs/TorchTrainer_1ff8a_00000_0_2024-11-01_14-42-00", "filename": "session.py", "lineno": 231, "job_id": "01000000", "worker_id": "143d46e13a39a6e868d52278b020e1ec26e49458dc6ca2dfe050bfb3", "node_id": "8bb78ec9d3c95a59e53764d8510d5b669918f128ac8da7d1c4b4d440", "actor_id": "9934b6c1b68ac88adfffafca01000000", "task_id": "b195c8cb65d475ad9934b6c1b68ac88adfffafca01000000"}
{"asctime": "2024-11-01 14:42:03,347", "levelname": "DEBUG", "message": "Changing the working directory to: /tmp/ray/session_2024-11-01_14-41-58_674237_12567/artifacts/2024-11-01_14-42-00/TorchTrainer_2024-11-01_14-42-00/working_dirs/TorchTrainer_1ff8a_00000_0_2024-11-01_14-42-00", "filename": "session.py", "lineno": 231, "job_id": "01000000", "worker_id": "d6f9fed0728edd8b2e6ccfef40e940342871b55f65108a1b3d1935ce", "node_id": "8bb78ec9d3c95a59e53764d8510d5b669918f128ac8da7d1c4b4d440", "actor_id": "e29110a539fe0ce526aae31501000000", "task_id": "51fe9f4d40583850e29110a539fe0ce526aae31501000000"}
{"asctime": "2024-11-01 14:42:03,348", "levelname": "INFO", "message": "Training function per worker", "filename": "_log.py", "lineno": 12, "world_size": 2, "world_rank": 0, "local_world_size": 0, "local_rank": 2, "node_rank": 0, "job_id": "01000000", "worker_id": "143d46e13a39a6e868d52278b020e1ec26e49458dc6ca2dfe050bfb3", "node_id": "8bb78ec9d3c95a59e53764d8510d5b669918f128ac8da7d1c4b4d440", "actor_id": "9934b6c1b68ac88adfffafca01000000", "task_id": "b0afaa137412fbc0f3140d3262f7958a2ab66f6101000000"}
{"asctime": "2024-11-01 14:42:03,348", "levelname": "INFO", "message": "Training function per worker", "filename": "_log.py", "lineno": 12, "world_size": 2, "world_rank": 1, "local_world_size": 1, "local_rank": 2, "node_rank": 0, "job_id": "01000000", "worker_id": "d6f9fed0728edd8b2e6ccfef40e940342871b55f65108a1b3d1935ce", "node_id": "8bb78ec9d3c95a59e53764d8510d5b669918f128ac8da7d1c4b4d440", "actor_id": "e29110a539fe0ce526aae31501000000", "task_id": "76ae67616ee4fe95a95e77da9b57b3b753efbf5301000000"}
{"asctime": "2024-11-01 14:42:04,563", "levelname": "DEBUG", "message": "Shutting down 2 workers.", "filename": "worker_group.py", "lineno": 216, "job_id": "01000000", "worker_id": "fbb4d4ba1a20870da058447259990e6b9feb17045e59aef3db06bb41", "node_id": "8bb78ec9d3c95a59e53764d8510d5b669918f128ac8da7d1c4b4d440", "actor_id": "fa30a41e3d1adbcbe21f0b7c01000000", "task_id": "fa78bc53cbcd3906b97af0a2c723c9306db701aa01000000"}
{"asctime": "2024-11-01 14:42:04,571", "levelname": "DEBUG", "message": "Graceful termination failed. Falling back to force kill.", "filename": "worker_group.py", "lineno": 225, "job_id": "01000000", "worker_id": "fbb4d4ba1a20870da058447259990e6b9feb17045e59aef3db06bb41", "node_id": "8bb78ec9d3c95a59e53764d8510d5b669918f128ac8da7d1c4b4d440", "actor_id": "fa30a41e3d1adbcbe21f0b7c01000000", "task_id": "fa78bc53cbcd3906b97af0a2c723c9306db701aa01000000"}
{"asctime": "2024-11-01 14:42:04,572", "levelname": "DEBUG", "message": "Shutdown successful.", "filename": "worker_group.py", "lineno": 230, "job_id": "01000000", "worker_id": "fbb4d4ba1a20870da058447259990e6b9feb17045e59aef3db06bb41", "node_id": "8bb78ec9d3c95a59e53764d8510d5b669918f128ac8da7d1c4b4d440", "actor_id": "fa30a41e3d1adbcbe21f0b7c01000000", "task_id": "fa78bc53cbcd3906b97af0a2c723c9306db701aa01000000"}

Text Logging

RAY_TRAIN_LOG_ENCODING="TEXT" python log.py
2024-09-24 15:06:02,274	DEBUG session.py:154 -- StorageContext on SESSION (rank=None):
StorageContext<
  storage_filesystem='local',
  storage_fs_path='/Users/hpguo/ray_results',
  experiment_dir_name='TorchTrainer_2024-09-24_15-06-00',
  trial_dir_name='TorchTrainer_2edb7_00000_0_2024-09-24_15-06-00',
  current_checkpoint_index=-1,
>
2024-09-24 15:06:02,274	DEBUG session.py:231 -- Changing the working directory to: /tmp/ray/session_2024-09-24_15-05-58_999739_31907/artifacts/2024-09-24_15-06-00/TorchTrainer_2024-09-24_15-06-00/working_dirs/TorchTrainer_2edb7_00000_0_2024-09-24_15-06-00
2024-09-24 15:06:02,286	DEBUG worker_group.py:202 -- Starting 2 workers.
2024-09-24 15:06:02,973	DEBUG worker_group.py:204 -- 2 workers have successfully started.
2024-09-24 15:06:03,812	INFO config.py:83 -- Setting up process group for: env:// [rank=0, world_size=2]
2024-09-24 15:06:03,812	DEBUG config.py:88 -- Setting up process group for: env:// [rank=1, world_size=2]
2024-09-24 15:06:03,812	DEBUG config.py:92 -- using gloo
2024-09-24 15:06:03,812	DEBUG config.py:92 -- using gloo
2024-09-24 15:06:03,863	INFO backend_executor.py:447 -- Started distributed worker processes: 
- (node_id=f4b1ea9c06ed3425b929fb70ede36ada34e3e3131b0c00318a7dee8a, ip=127.0.0.1, pid=31968) world_rank=0, local_rank=0, node_rank=0
- (node_id=f4b1ea9c06ed3425b929fb70ede36ada34e3e3131b0c00318a7dee8a, ip=127.0.0.1, pid=31969) world_rank=1, local_rank=1, node_rank=0
2024-09-24 15:06:03,893	DEBUG session.py:154 -- StorageContext on SESSION (rank=0):
StorageContext<
  storage_filesystem='local',
  storage_fs_path='/Users/hpguo/ray_results',
  experiment_dir_name='TorchTrainer_2024-09-24_15-06-00',
  trial_dir_name='TorchTrainer_2edb7_00000_0_2024-09-24_15-06-00',
  current_checkpoint_index=-1,
>
2024-09-24 15:06:03,893	DEBUG session.py:154 -- StorageContext on SESSION (rank=1):
StorageContext<
  storage_filesystem='local',
  storage_fs_path='/Users/hpguo/ray_results',
  experiment_dir_name='TorchTrainer_2024-09-24_15-06-00',
  trial_dir_name='TorchTrainer_2edb7_00000_0_2024-09-24_15-06-00',
  current_checkpoint_index=-1,
>
2024-09-24 15:06:03,893	DEBUG session.py:231 -- Changing the working directory to: /tmp/ray/session_2024-09-24_15-05-58_999739_31907/artifacts/2024-09-24_15-06-00/TorchTrainer_2024-09-24_15-06-00/working_dirs/TorchTrainer_2edb7_00000_0_2024-09-24_15-06-00
2024-09-24 15:06:03,893	DEBUG session.py:231 -- Changing the working directory to: /tmp/ray/session_2024-09-24_15-05-58_999739_31907/artifacts/2024-09-24_15-06-00/TorchTrainer_2024-09-24_15-06-00/working_dirs/TorchTrainer_2edb7_00000_0_2024-09-24_15-06-00
2024-09-24 15:06:04,906	DEBUG worker_group.py:216 -- Shutting down 2 workers.
2024-09-24 15:06:04,912	DEBUG worker_group.py:225 -- Graceful termination failed. Falling back to force kill.
2024-09-24 15:06:04,912	DEBUG worker_group.py:230 -- Shutdown successful.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@hongpeng-guo changed the title from "[Train Log][WIP] Ray Train Structured Logging" to "[Train Log]Ray Train Structured Logging" on Sep 24, 2024
Contributor

@justinvyu justinvyu left a comment


This is nice! Here's a first pass with some questions.

Review threads (resolved): python/ray/train/constants.py, python/ray/train/_internal/logging.py
(): ray._private.ray_logging.filters.CoreContextFilter

handlers:
file:
Contributor


Nit: file_text is more descriptive of the encoding mode

Review threads (resolved): python/ray/train/_internal/logging.yaml, python/ray/train/_internal/logging.py, python/ray/__init__.py
Contributor

@omatthew98 omatthew98 left a comment


Added some comments on the pieces that I changed during the review of the Ray Data structured logging. Overall LGTM; please ping me when this is ready for a second pass.

Review threads (resolved): python/ray/train/_internal/logging.py
else:
    config = _load_logging_config(DEFAULT_RAY_TRAIN_LOG_CONFIG_PATH)
    if RAY_TRAIN_LOG_ENCODING == RAY_TRAIN_JSON_LOG_ENCODING_FORMAT:
        for logger in config["loggers"].values():
Contributor


For Ray Data I updated this to use a dictionary to replace the loggers (e.g. here and here).
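As a minimal sketch of that dictionary-replacement pattern (the logger and handler names below are illustrative, not this PR's actual config):

# Replace the "loggers" section wholesale with a JSON-mode dict instead of
# mutating each logger inside a loop. All names here are hypothetical.
TEXT_MODE_LOGGERS = {"ray.train": {"level": "DEBUG", "handlers": ["file_text"]}}
JSON_MODE_LOGGERS = {"ray.train": {"level": "DEBUG", "handlers": ["file_json"]}}

def build_config(use_json: bool) -> dict:
    return {
        "version": 1,
        "disable_existing_loggers": False,
        "loggers": JSON_MODE_LOGGERS if use_json else TEXT_MODE_LOGGERS,
    }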

Review thread (resolved): python/ray/train/tests/test_logging.py
Contributor

@justinvyu justinvyu left a comment


@omatthew98 @hongpeng-guo One other question I had: Why is the configuration via an environment variable rather than a config somewhere in the API? Is adding a config in the API (ex: DataContext / ray.train.RunConfig) a future plan?

@omatthew98
Contributor

> @omatthew98 @hongpeng-guo One other question I had: Why is the configuration via an environment variable rather than a config somewhere in the API? Is adding a config in the API (ex: DataContext / ray.train.RunConfig) a future plan?

That is a fair question. I was mostly going off what we already had in place, which used environment variables. There might be an argument for using environment variables to ensure logging is configured as early as possible (e.g., at module initialization, before a DataContext or train.RunConfig exists), but I'm not sure that is the case.
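For illustration, a self-contained sketch of the import-time, environment-variable-driven selection being discussed (the env var name matches this PR; the configs and the helper are placeholders):

import os

# Hypothetical placeholder configs; the real ones live in Ray Train internals.
DEFAULT_TEXT_CONFIG = {"version": 1, "loggers": {"ray.train": {"level": "INFO"}}}
DEFAULT_JSON_CONFIG = {"version": 1, "loggers": {"ray.train": {"level": "INFO"}}}

def _choose_config() -> dict:
    # Read at import time, before any RunConfig / DataContext object exists,
    # which is the main argument for configuring via an environment variable.
    encoding = os.environ.get("RAY_TRAIN_LOG_ENCODING", "TEXT").upper()
    return DEFAULT_JSON_CONFIG if encoding == "JSON" else DEFAULT_TEXT_CONFIG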

@omatthew98
Contributor

Just a heads up: based on this thread, we are going to move our YAML configurations to Python dictionary configurations.
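For reference, a minimal, runnable sketch of a Python-dict logging config in that direction; the "ray.train" logger name matches this PR, but the formatter, handler names, and file path are assumptions rather than the PR's actual configuration:

import json
import logging
import logging.config

class SimpleJsonFormatter(logging.Formatter):
    # Illustrative JSON formatter; Ray's real one lives in ray._private.
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {
                "asctime": self.formatTime(record),
                "levelname": record.levelname,
                "message": record.getMessage(),
                "filename": record.filename,
                "lineno": record.lineno,
            }
        )

# Hypothetical dict config replacing a YAML string; names are illustrative.
DEFAULT_LOG_CONFIG = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {"json": {"()": SimpleJsonFormatter}},
    "handlers": {
        "file": {
            "class": "logging.FileHandler",
            "formatter": "json",
            "filename": "ray-train.log",
        }
    },
    "loggers": {
        "ray.train": {"level": "DEBUG", "handlers": ["file"], "propagate": True}
    },
}

logging.config.dictConfig(DEFAULT_LOG_CONFIG)
logging.getLogger("ray.train").info("structured logging configured")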

@hongpeng-guo
Contributor Author

Comments handled, good to take a look for another round.

@aslonnie aslonnie removed their request for review October 28, 2024 17:26
Comment on lines 48 to 50
DEFAULT_LOG_CONFIG_YAML_STRING = """
version: 1
disable_existing_loggers: False
Contributor


Can we just have this be a dict similar to Data?

https://github.com/ray-project/ray/pull/48093/files#diff-3a2ffc1cbd2991bc0acf5093b76ff0fbe78e3c441c7c04cd8eb62c8ccf6dbbddR10

A YAML string may have syntax errors and is not native to Python.

Contributor Author


Good point! I was thinking YAML files are more readable than JSON strings. However, once transformed into a string, it is not that readable anymore. I will make changes to update it to the JSON string format.

Comment on lines +133 to +138
# Env. variable to specify the encoding of the file logs when using the default config.
LOG_ENCODING_ENV = "RAY_TRAIN_LOG_ENCODING"

# Env. variable to specify the logging config path; use defaults if not set.
LOG_CONFIG_PATH_ENV = "RAY_TRAIN_LOG_CONFIG_PATH"

Contributor


Can you move all these constants into the logging.py file? I'd rather keep this module isolated and have fewer changes in train/* code.

Contributor Author


Why do we want to make this module isolated from the other parts?

Contributor Author


I moved the default configuration JSON string and the default encoding format variables to logging.py, but still kept the two env var name variables within constants.py. I think it makes more sense to keep all the user-controlled env var names within constants.py.

import yaml

import ray
from ray.tests.conftest import * # noqa
Contributor


Nit: I don't think this is needed, actually. conftest fixtures should automatically be visible.

Contributor Author


Good point! I can remove this unnecessary import.

Comment on lines +161 to +164
console_log_output = capsys.readouterr().err
for log_line in console_log_output.splitlines():
    with pytest.raises(json.JSONDecodeError):
        json.loads(log_line)
Contributor


So all logger.warning(...) calls get logged in JSON format to the console as well? Does it make more sense for console output to be in the normal text format, and for only ray-train.log to contain the JSON format?

Also, is it printed to stderr or stdout?

Contributor Author


  1. For logger.warning(...), the record is logged in JSON format to the ray-train.log file and in normal text format to the console. The code here expects pytest.raises(json.JSONDecodeError) because the console output is normal text, not JSON.

  2. By default, all console output from the Python logger goes to stderr. References: https://docs.python.org/3/howto/logging.html#advanced-logging-tutorial and https://github.com/hongpeng-guo/ray/blob/6a5fc2d39b0265e7b578f069a84ae772c123801b/python/ray/_private/log.py#L30
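A tiny self-contained illustration of point 2, showing that Python's StreamHandler writes to stderr when no stream is given:

import logging
import sys

handler = logging.StreamHandler()  # no stream argument given
assert handler.stream is sys.stderr  # StreamHandler defaults to sys.stderr

logger = logging.getLogger("stderr_demo")
logger.addHandler(handler)
logger.warning("this goes to stderr, not stdout")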

@alanwguo
Contributor

Can we include local, world, and node rank as part of the structure?

@hongpeng-guo
Contributor Author

hongpeng-guo commented Oct 31, 2024

> Can we include local, world, and node rank as part of the structure?

Sure, I am thinking about adding more Train-only context in a follow-up PR. We also need to differentiate between driver, controller, and worker processes; all the rank-related concepts are only defined on worker processes.

@hongpeng-guo
Contributor Author

hongpeng-guo commented Nov 1, 2024

Update: I added a TrainContextFilter that appends rank information (world_rank, local_rank, world_size, local_world_size, and node_rank) to the structured logging records if the log is emitted from a Train worker.
cc @alanwguo
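A minimal sketch of what such a filter can look like; the class name matches the comment above, but the get_rank_info() helper and the way the session is queried are assumptions, not the PR's actual implementation:

import logging

def get_rank_info():
    # Hypothetical helper; the real filter reads this from the Train session
    # and returns None outside of worker processes.
    return {
        "world_size": 2,
        "world_rank": 0,
        "local_world_size": 2,
        "local_rank": 0,
        "node_rank": 0,
    }

class TrainContextFilter(logging.Filter):
    # Attach Train worker rank fields to every record, when available.
    def filter(self, record: logging.LogRecord) -> bool:
        info = get_rank_info()
        if info is not None:  # rank info only exists on worker processes
            for key, value in info.items():
                setattr(record, key, value)
        return True  # never drop records; only annotate them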
