[Train Log] Ray Train Structured Logging #47806
Conversation
This is nice! Here's a first pass with some questions.
(): ray._private.ray_logging.filters.CoreContextFilter

handlers:
  file:
Nit: file_text is more descriptive of the encoding mode.
Added some comments on the pieces that I changed during the review for the Ray Data structured logging; overall LGTM, please ping me when ready for a second pass.
@omatthew98 @hongpeng-guo One other question I had: why is the configuration via an environment variable rather than a config somewhere in the API? Is adding a config in the API (ex: DataContext / ray.train.RunConfig) a future plan?
That is a fair question. I was mostly going off what we already had in place, which already used environment variables. I think there might be some argument for using environment variables to ensure logging is configured as early as possible (e.g. on module initialization, before a DataContext or train.RunConfig might exist), but I am not sure if that is the case.
Just a heads up: based on this thread, we are going to move our YAML configurations to Python dictionary configurations.
Comments handled, good to take a look for another round.
python/ray/train/constants.py (outdated)
DEFAULT_LOG_CONFIG_YAML_STRING = """
version: 1
disable_existing_loggers: False
Can we just have this be a dict, similar to Data? A YAML string may have syntax errors and is not native to Python.
Good point! I was thinking YAML files have better readability compared to JSON strings. However, once transformed into a string, I don't think it's still that readable. I will make changes to update it to the JSON string format.
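For illustration, here is a minimal standalone sketch of what a dict-based config could look like (simplified, hypothetical formatter/handler names, not the actual Ray Train defaults). A plain dict is validated by the interpreter and can be passed straight to `logging.config.dictConfig`, avoiding the YAML/JSON string-parsing step discussed above:

```python
import logging
import logging.config

# Simplified, hypothetical defaults for illustration only -- not the real
# Ray Train config. The point is that a plain dict needs no string parsing.
EXAMPLE_LOG_CONFIG_DICT = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "plain": {"format": "%(asctime)s %(levelname)s %(name)s -- %(message)s"},
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "formatter": "plain",
            "level": "INFO",
        },
    },
    "loggers": {
        "ray.train": {"level": "DEBUG", "handlers": ["console"], "propagate": False},
    },
}

logging.config.dictConfig(EXAMPLE_LOG_CONFIG_DICT)
logging.getLogger("ray.train").info("structured logging configured")
```

A dict like this also sidesteps the syntax-error concern, since any mistake shows up as a Python error at import time rather than at YAML-parse time.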
python/ray/train/constants.py (outdated)
# Env. variable to specify the encoding of the file logs when using the default config.
LOG_ENCODING_ENV = "RAY_TRAIN_LOG_ENCODING"

# Env. variable to specify the logging config path use defaults if not set
LOG_CONFIG_PATH_ENV = "RAY_TRAIN_LOG_CONFIG_PATH"
Can you move all these constants into the logging.py file? I'd rather keep this module isolated and have fewer changes in train/* code.
Why do we want to make this module isolated from the other parts?
I moved the default configuration JSON string and the default encoding format variables to logging.py, but still kept the two env var name variables within constants.py. I think it makes more sense to keep all the user-controlled env var names within constants.py.
import yaml

import ray
from ray.tests.conftest import *  # noqa
Nit: I don't think this is needed actually; conftest fixtures should automatically be visible.
Good point! I can remove this unnecessary import.
console_log_output = capsys.readouterr().err
for log_line in console_log_output.splitlines():
    with pytest.raises(json.JSONDecodeError):
        json.loads(log_line)
So all logger.warning(...) calls get logged in JSON format to stdout as well? Does it make more sense for console output to be in the normal text format, and then only ray-train.log contains the JSON format? Also, is it printed to stderr or stdout?
- For logger.warning(...), it is logged in JSON format to the ray-train.log file, and logged to the console in normal text format. The code here expects pytest.raises(json.JSONDecodeError) because the console output is normal text, not JSON.
- By default, all the console output from the Python logger goes to stderr. References: https://docs.python.org/3/howto/logging.html#advanced-logging-tutorial and https://github.com/hongpeng-guo/ray/blob/6a5fc2d39b0265e7b578f069a84ae772c123801b/python/ray/_private/log.py#L30
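As a quick sanity check on the stderr point, Python's logging.StreamHandler falls back to sys.stderr when no stream is given, which is why the test reads capsys.readouterr().err rather than .out. A tiny standalone sketch:

```python
import logging
import sys

handler = logging.StreamHandler()  # no stream argument supplied
# StreamHandler defaults to sys.stderr, so console log lines end up in the
# captured stderr stream, not stdout.
assert handler.stream is sys.stderr
```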
Can we include local, world, and node rank as part of the structure?
Sure, I am thinking about adding more train-only context in a follow-up PR. We also need to differentiate between driver, controller, and worker processes; all the rank-related concepts are only defined on worker processes.
Update: I added a
Made some changes on this PR:
- There are still two handlers, writing to the console and to the file ray-train.log.
- The logs emitted to ray-train.log will always be in JSON. Users don't need to be aware of this; this file could be used only for ray.train internal usage.
- The logs emitted to the console can be either TEXT or JSON based on the env variable RAY_TRAIN_LOG_ENCODING, similar to Ray Core's structured logging setup.
All the Ray Train processes will have an extra field run_id that is unique for each train job. All the Ray Train worker processes have the extra fields world_rank, world_size, local_rank, and local_size.

A worker process that is not world_rank=0 also has an extra field hide=true. By default, console logs in TEXT mode will only show rank 0 worker logs.
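As an illustration of how such context fields can be attached to records (a hypothetical sketch with made-up names, not the actual TrainContextFilter implementation), a logging.Filter can set the extra attributes on every record so a JSON formatter emits them as top-level keys:

```python
import logging


class ExampleRankContextFilter(logging.Filter):
    """Hypothetical sketch: annotate log records with Train run/rank context."""

    def __init__(self, run_id: str, world_rank: int, world_size: int, local_rank: int):
        super().__init__()
        self._fields = {
            "run_id": run_id,
            "world_rank": world_rank,
            "world_size": world_size,
            "local_rank": local_rank,
        }

    def filter(self, record: logging.LogRecord) -> bool:
        # Attach the context as record attributes; a JSON formatter can then
        # serialize them as top-level keys in each emitted log line.
        for key, value in self._fields.items():
            setattr(record, key, value)
        return True  # never drop the record, only annotate it
```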
Please take a look cc @matthewdeng @justinvyu @alanwguo
Didn't look too closely at the code, but the high-level description sounds good. Can you update the description example with the latest version that includes run_id and local rank?
# This key is used to hide the log record if the value is True.
# By default, train workers that are not ranked zero will hide
# the log record.
HIDE = "hide"
I don't think hide=True is necessary. At least, the product won't utilize this field.
Got it! A follow-up question: what format should we follow so that the product team can filter which logs are shown/hidden by default in the log viewer?
We can just filter out non-"rank 0" logs, or implement whatever behavior is needed based on the other fields.
@@ -444,7 +443,7 @@ def training_loop(self) -> None:
     driver_ip=ray.util.get_node_ip_address(),
     driver_node_id=ray.get_runtime_context().get_node_id(),
     experiment_name=session.get_experiment_name(),
-    run_id=uuid.uuid4().hex,
+    run_id=session.get_run_id(),
Just for my own understanding: this training_loop function is part of the tuner.fit() call that runs in a Tune process. Therefore, session.get_run_id() will actually get the run_id from a Tune process, although this session is imported from ray.train._internal. This session is actually initialized inside function_trainable.py, which is defined under tune/trainable/function_trainable.py. cc @justinvyu @matthewdeng
That's right, the training_loop here is the Train driver logic that is running inside the tune.FunctionTrainable. So the "session" refers to the Ray Tune session, not the Ray Train Worker session.
Updated the output in the PR description. Note: in the current implementation, we cannot assign the … In Train V2, the implementation will be a bit different. We will fix the above issue in V2, as there is less coupling with Tune, making it easier to solve this issue.
WORLD_SIZE = "world_size"
WORLD_RANK = "world_rank"
LOCAL_WORLD_SIZE = "local_world_size"
LOCAL_RANK = "local_rank"
NODE_RANK = "node_rank"
I think we should only tag world_rank, local_rank, and node_rank. world_size / local_world_size is confusing to filter by.
Emm, good point. The size information is not useful for log searching/filtering.
# Env. variable to specify the encoding of the file logs when using the default config.
LOG_ENCODING_ENV = "RAY_TRAIN_LOG_ENCODING"
RAY_TRAIN_LOG_ENCODING now controls the console output encoding format between TEXT and JSON, but I think it should always be TEXT. Our product should probably never set console output to JSON mode automatically, and users should not know about this environment variable.
What about this:
- Remove this RAY_TRAIN_LOG_ENCODING environment variable, so that the console is always TEXT and the file (ray-train.log) is always JSON.
I see the point, cc @justinvyu. There is actually a longer story for why I changed to allow JSON mode in the console, after discussing with @matthewdeng:
- I think logging in general is moving towards deprecating the log_to_driver feature, so that the logs of all workers will not be shown in the driver console but persisted in their local files, i.e. *.err, *.std, *.log. These logs will eventually be ingested and shown using more modern tools like the log viewer. In the long run, logging to the driver console will not be that useful.
- The Ray Core logger has only one stream handler that writes everything to the console, i.e. the *.err file of each node by default. These *.err console files are expected to contain JSON logs anyway if Ray Core structured logging is enabled. We are also enabling JSON mode for console logs following Ray Core's pattern.

There could be another design pattern as you suggested: (1) everything to the console must be TEXT; (2) all the JSON logs go to a separate file (ray-core.log, ray-train.log, etc.) if JSON mode is enabled; (3) console log files (*.err, *.std) will never be ingested because they are never JSON. I think this also works but may need a wider revamp of many Ray libraries. We can have a chat on this tomorrow.
"console_json": { | ||
"class": "logging.StreamHandler", | ||
"formatter": "ray_json", | ||
"filters": ["core_context_filter", "train_context_filter"], | ||
}, | ||
"console_text": { | ||
"class": "ray._private.log.PlainRayHandler", | ||
"formatter": "ray", | ||
"level": "INFO", | ||
"filters": ["train_context_filter", "console_filter"], | ||
}, |
Why do we switch between the core_context_filter vs. the console_filter? Let's just use the core_context_filter and remove the console_filter (HiddenRecordFilter), since Alan mentions this HiddenRecordFilter is not needed.
Sure, I think this HiddenRecordFilter is not very useful either. I will remove it.
This filter is a subclass of CoreContextFilter, which adds the job_id, worker_id,
and node_id to the log record. This filter adds the rank and size information of
the train context to the log record.
class TrainContextFilter(logging.Filter):
    """Add rank and size information to the log record if the log is from a train worker.
"""Add rank and size information to the log record if the log is from a train worker. | |
"""Add training worker rank information to the log record. |
"level": "DEBUG", | ||
"handlers": ["file", "console_text"], |
Can we limit console output to INFO and above? DEBUG is ok for the log-viewer since people can filter it out, but it will spam the console.
We actually specified level INFO in the console_text handler.
- In text mode, the logger passes every message >= DEBUG; when it reaches the console_text handler, the INFO level will filter out the DEBUG messages, making it less spammy.
- In JSON mode, the file handler doesn't specify an extra level, so DEBUG-level info will show up in JSON mode and can be filtered out by the user.

However, if we set the default level of the ray.train logger to INFO, DEBUG info will be removed at the logger level, so the file handler cannot ingest that extra information in JSON mode.
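To make the level interplay concrete, here is a minimal standalone sketch (plain Python logging with a hypothetical file name, not Ray Train's actual config) where the logger stays at DEBUG so the file handler captures everything while the console handler filters at INFO:

```python
import logging
import sys

logger = logging.getLogger("level_demo")
logger.setLevel(logging.DEBUG)  # logger passes everything >= DEBUG to its handlers

file_handler = logging.FileHandler("demo.log")  # hypothetical file name
file_handler.setLevel(logging.DEBUG)            # file keeps DEBUG and above

console_handler = logging.StreamHandler(sys.stderr)
console_handler.setLevel(logging.INFO)          # console drops DEBUG messages

logger.addHandler(file_handler)
logger.addHandler(console_handler)

logger.debug("written to demo.log only")
logger.info("written to both demo.log and the console")
```

If the logger itself were set to INFO instead, the DEBUG record would be discarded before any handler sees it, which is the concern described above.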
Why are these changes needed?
This PR creates structured logging for Ray Train. The main structure follows the implementation of Ray Data's structured logging PR. Main components include:
- python/ray/train/_internal/logging.py: this file defines the logging utility functions;
- python/ray/train/tests/test_logging.py: this file provides the corresponding unit tests for the logging utilities.

Example
Code snippet:
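The original snippet from the PR description is not reproduced here; a minimal illustrative sketch of the kind of usage being exercised (assuming ray[train] and torch are installed, and assuming the "ray.train" logger name) might look like:

```python
import logging

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

logger = logging.getLogger("ray.train")


def train_func():
    # Logs emitted inside the training function are tagged with the worker's
    # run/rank context by the structured-logging setup.
    logger.info("hello from a train worker")


trainer = TorchTrainer(train_func, scaling_config=ScalingConfig(num_workers=2))
trainer.fit()
```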
JSON Logging
Text Logging
Related issue number
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.