[Data] Refactor OpRuntimeMetrics to support properties #47800

Merged
15 commits merged into master from refactor-metrics on Sep 25, 2024

Conversation

@bveeramani (Member) commented on Sep 23, 2024

Why are these changes needed?

Metrics that are implemented as properties (e.g., average_bytes_inputs_per_task) aren't shown on the Ray Data dashboard. This PR refactors the implementation of OpRuntimeMetrics to fix the issue.

Additional details: _StatsActor uses dataclasses.fields (a function that returns all of the fields of a dataclass) to generate the list of metrics to show on the dashboard.

def _create_prometheus_metrics_for_execution_metrics(
    self, metrics_group: str, tag_keys: Tuple[str, ...]
) -> Dict[str, Gauge]:
    metrics = {}
    for field in fields(OpRuntimeMetrics):
        if not field.metadata.get("metrics_group") == metrics_group:
            continue
        metric_name = f"data_{field.name}"
        metric_description = field.metadata.get("description")
        metrics[field.name] = Gauge(
            metric_name,
            description=metric_description,
            tag_keys=tag_keys,
        )
    return metrics
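
To see why property-based metrics slip through, note that dataclasses.fields only reports declared fields; a property never appears in its output. A minimal illustration (the Demo class is hypothetical, with names mirroring the metrics above):

from dataclasses import dataclass, fields

@dataclass
class Demo:
    num_inputs_received: int = 0  # a declared field: visible to fields()

    @property
    def average_bytes_inputs_per_task(self) -> float:
        # a property: never reported by fields(), so it never reaches the dashboard
        return 0.0

print([f.name for f in fields(Demo)])  # ['num_inputs_received']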

This is an issue because _StatsActor assumes that 1) all metrics are implemented as fields, and 2) all fields represent metrics. This refactor makes the interface more explicit by introducing an OpRuntimeMetrics.get_metrics method.
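
A rough sketch of what the more explicit interface could look like (the Metric container, the _METRICS registry, and the get_metrics signature here are illustrative, not necessarily the exact shape merged in this PR):

from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Metric:
    name: str
    description: str
    metrics_group: str

_METRICS: List[Metric] = []  # populated once, when the class is defined

class OpRuntimeMetrics:
    @classmethod
    def get_metrics(cls) -> List[Metric]:
        """Return every metric definition, whether backed by a field or a property."""
        return list(_METRICS)

With something like this, _StatsActor can iterate OpRuntimeMetrics.get_metrics() and filter on metric.metrics_group instead of introspecting dataclass fields.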

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Comment on lines -316 to -320
@classmethod
def get_metric_keys(cls):
    """Return a list of metric keys."""
    return [f.name for f in fields(cls)] + ["cpu_usage", "gpu_usage"]

@bveeramani (Member, Author):
Dead code.

Comment on lines -398 to -409
@property
def average_bytes_change_per_task(self) -> Optional[float]:
    """Average size difference in bytes of input ref bundles and output ref
    bundles per task."""
    if (
        self.average_bytes_inputs_per_task is None
        or self.average_bytes_outputs_per_task is None
    ):
        return None

    return self.average_bytes_outputs_per_task - self.average_bytes_inputs_per_task

@bveeramani (Member, Author):
Dead code.

@dataclass
class RunningTaskInfo:
    inputs: RefBundle
    num_outputs: int
    bytes_outputs: int


class OpRuntimesMetricsMeta(type):
Contributor:
Is it possible to just add the metric to the list in metricfield, so we don't need this metaclass?

Contributor:
Alternatively, I think we can add that logic in OpRuntimeMetrics.__post_init__.

@bveeramani (Member, Author):
How would we get the name of the attribute in metricfield? Field.name is None inside metricfield. I'm not sure exactly how name gets set, but I think it's some dataclass magic that occurs after the class is defined.

One alternative is to explicitly specify the name. Advantage is that we don't have to use metaclasses; disadvantage is that you need to duplicate the name:

num_inputs_received: int = metricfield(
    name="num_inputs_received",
    default=0,
    description="Number of input blocks received by operator.",
    metrics_group="inputs",
)
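
For context, the metaclass avoids this duplication because the class namespace already maps attribute names to the Field objects returned by metricfield, even though Field.name is still None at that point. A runnable sketch (the Metric, _METRICS, and metricfield shapes mirror snippets in this thread but are illustrative):

from dataclasses import Field, dataclass, field
from typing import Any, List, NamedTuple

class Metric(NamedTuple):
    name: str
    description: str
    metrics_group: str

_METRICS: List[Metric] = []

def metricfield(*, default: Any = 0, description: str = "", metrics_group: str = "") -> Any:
    # Field.name is still None on the returned object; only the
    # @dataclass decorator fills it in later.
    return field(default=default, metadata={"description": description, "metrics_group": metrics_group})

class OpRuntimesMetricsMeta(type):
    def __new__(mcs, name, bases, namespace):
        # The class namespace maps attribute names to values, so the
        # metric name comes for free, without repeating it in metricfield().
        for attr_name, value in namespace.items():
            if isinstance(value, Field) and value.metadata.get("metrics_group"):
                _METRICS.append(
                    Metric(attr_name, value.metadata["description"], value.metadata["metrics_group"])
                )
        return super().__new__(mcs, name, bases, namespace)

@dataclass
class OpRuntimeMetrics(metaclass=OpRuntimesMetricsMeta):
    num_inputs_received: int = metricfield(
        description="Number of input blocks received by operator.",
        metrics_group="inputs",
    )

print(_METRICS)  # [Metric(name='num_inputs_received', ...)]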

Contributor:
Personally, I prefer to specify the name only when needed. If you put the logic in __post_init__, that should have a valid name attribute, I think?
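
For context, Field.name is indeed populated by the time __post_init__ runs, as a small sketch shows (hypothetical Demo class; the dedup and __init__ caveats come up below):

from dataclasses import dataclass, field, fields

@dataclass
class Demo:
    num_inputs_received: int = field(default=0, metadata={"metrics_group": "inputs"})

    def __post_init__(self):
        # By the time __post_init__ runs, @dataclass has assigned
        # Field.name, so every field has a valid name here.
        for f in fields(self):
            print(f.name, f.metadata.get("metrics_group"))

Demo()  # prints: num_inputs_received inputs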

@bveeramani (Member, Author):
> alternatively, i think we can add that logic in OpRuntimeMetrics.__post_init__

We could do this. We'd just need to make sure we don't add duplicate items, since __post_init__ is called once per instance, not once when the class is defined.

@bveeramani (Member, Author):
Don't have a strong preference for __post_init__ vs. a metaclass. Down for either.

Contributor:
Discussed offline: we propose putting the logic in __post_init__ and making _METRICS a set so that we don't keep duplicates.

@bveeramani (Member, Author):
After testing it out, I think __post_init__ might not work. __post_init__ only works if you don't override __init__, and we currently override __init__ to pass the Operator to the OpRuntimeMetrics instance.
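
This matches standard dataclass behavior: __post_init__ is invoked by the generated __init__, and @dataclass won't replace a user-defined __init__, so __post_init__ silently never runs. A quick demonstration (hypothetical Demo class, mirroring how OpRuntimeMetrics takes an operator):

from dataclasses import dataclass

@dataclass
class Demo:
    x: int = 0

    def __init__(self, op):
        # A hand-written __init__ takes precedence over the generated one,
        # so __post_init__ below is never invoked automatically.
        self.x = 0
        self._op = op

    def __post_init__(self):
        print("never reached")

Demo(op=None)  # constructs fine; nothing is printed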

bveeramani and others added 4 commits September 23, 2024 15:53
Co-authored-by: Scott Lee <scottjlee@users.noreply.github.com>

    metrics_group=value.metadata[_METRIC_FIELD_METRICS_GROUP_KEY],
    map_only=value.metadata[_METRIC_FIELD_IS_MAP_ONLY_KEY],
)
_METRICS.append(metric)
Contributor:
We're adding to _METRICS twice.

-class OpRuntimeMetrics:
-    """Runtime metrics for a PhysicalOperator.
+class OpRuntimeMetrics(metaclass=OpRuntimesMetricsMeta):
+    """Runtime metrics for a 'PhysicalOperator'.
alexeykudinkin (Contributor):
These are metric definitions, not metrics themselves. Let's make that clear from the description.

@bveeramani (Member, Author):
@alexeykudinkin OpRuntimeMetrics contains the metric definitions as well as the metric values. What's the difference between what's in OpRuntimeMetrics and what you mean by metric?

@bveeramani enabled auto-merge (squash) on September 24, 2024 22:26
@github-actions bot added the "go" label (add ONLY when ready to merge, run all tests) on Sep 24, 2024
@bveeramani enabled auto-merge (squash) on September 25, 2024 01:25
@bveeramani merged commit 7591f91 into master on Sep 25, 2024
6 checks passed
@bveeramani deleted the refactor-metrics branch on September 25, 2024 02:40