[Data] Refactor OpRuntimeMetrics to support properties #47800

Merged
15 commits merged into master from refactor-metrics on Sep 25, 2024

Conversation

@bveeramani (Member) commented on Sep 23, 2024

Why are these changes needed?

Metrics that are implemented as properties (e.g., average_bytes_inputs_per_task) aren't shown on the Ray Data dashboard. This PR refactors the implementation of OpRuntimeMetrics to fix the issue.

Additional details: _StatsActor uses dataclasses.fields (a function that returns all of the fields of a dataclass) to generate the list of metrics to show on the dashboard.

def _create_prometheus_metrics_for_execution_metrics(
    self, metrics_group: str, tag_keys: Tuple[str, ...]
) -> Dict[str, Gauge]:
    metrics = {}
    for field in fields(OpRuntimeMetrics):
        if not field.metadata.get("metrics_group") == metrics_group:
            continue
        metric_name = f"data_{field.name}"
        metric_description = field.metadata.get("description")
        metrics[field.name] = Gauge(
            metric_name,
            description=metric_description,
            tag_keys=tag_keys,
        )
    return metrics
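
To see why property-based metrics slip through, note that dataclasses.fields only reports declared fields; a property never appears in its output. A minimal illustration (the Demo class is hypothetical, with names mirroring the metrics above):

from dataclasses import dataclass, fields

@dataclass
class Demo:
    num_inputs_received: int = 0  # a declared field: visible to fields()

    @property
    def average_bytes_inputs_per_task(self) -> float:
        # a property: never reported by fields(), so it never reaches the dashboard
        return 0.0

print([f.name for f in fields(Demo)])  # ['num_inputs_received']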

This is an issue because _StatsActor assumes that 1) all metrics are implemented as fields, and 2) all fields represent metrics. This refactor makes the interface more explicit by introducing an OpRuntimeMetrics.get_metrics method.
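
A rough sketch of what the more explicit interface could look like (the Metric container, the _METRICS registry, and the get_metrics signature here are illustrative, not necessarily the exact shape merged in this PR):

from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Metric:
    name: str
    description: str
    metrics_group: str

_METRICS: List[Metric] = []  # populated once, when the class is defined

class OpRuntimeMetrics:
    @classmethod
    def get_metrics(cls) -> List[Metric]:
        """Return every metric definition, whether backed by a field or a property."""
        return list(_METRICS)

With something like this, _StatsActor can iterate OpRuntimeMetrics.get_metrics() and filter on metric.metrics_group instead of introspecting dataclass fields.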

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Comment on lines -316 to -320
@classmethod
def get_metric_keys(cls):
    """Return a list of metric keys."""
    return [f.name for f in fields(cls)] + ["cpu_usage", "gpu_usage"]

@bveeramani (Member, Author):
Dead code.

Comment on lines -398 to -409
@property
def average_bytes_change_per_task(self) -> Optional[float]:
    """Average size difference in bytes of input ref bundles and output ref
    bundles per task."""
    if (
        self.average_bytes_inputs_per_task is None
        or self.average_bytes_outputs_per_task is None
    ):
        return None

    return self.average_bytes_outputs_per_task - self.average_bytes_inputs_per_task

@bveeramani (Member, Author):
Dead code.

@dataclass
class RunningTaskInfo:
    inputs: RefBundle
    num_outputs: int
    bytes_outputs: int


class OpRuntimesMetricsMeta(type):
Contributor:
Is it possible to just add the metric to the list in metricfield, so we don't need this metaclass?

Contributor:
Alternatively, I think we can add that logic in OpRuntimeMetrics.__post_init__.

@bveeramani (Member, Author):
How would we get the name of the attribute in metricfield? Field.name is None inside metricfield. I'm not sure exactly how name gets set, but I think it's some dataclass magic that occurs after the class is defined.

One alternative is to explicitly specify the name. Advantage is that we don't have to use metaclasses; disadvantage is that you need to duplicate the name:

num_inputs_received: int = metricfield(
    name="num_inputs_received",
    default=0,
    description="Number of input blocks received by operator.",
    metrics_group="inputs",
)
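
For context, the metaclass avoids this duplication because the class namespace already maps attribute names to the Field objects returned by metricfield, even though Field.name is still None at that point. A runnable sketch (the Metric, _METRICS, and metricfield shapes mirror snippets in this thread but are illustrative):

from dataclasses import Field, dataclass, field
from typing import Any, List, NamedTuple

class Metric(NamedTuple):
    name: str
    description: str
    metrics_group: str

_METRICS: List[Metric] = []

def metricfield(*, default: Any = 0, description: str = "", metrics_group: str = "") -> Any:
    # Field.name is still None on the returned object; only the
    # @dataclass decorator fills it in later.
    return field(default=default, metadata={"description": description, "metrics_group": metrics_group})

class OpRuntimesMetricsMeta(type):
    def __new__(mcs, name, bases, namespace):
        # The class namespace maps attribute names to values, so the
        # metric name comes for free, without repeating it in metricfield().
        for attr_name, value in namespace.items():
            if isinstance(value, Field) and value.metadata.get("metrics_group"):
                _METRICS.append(
                    Metric(attr_name, value.metadata["description"], value.metadata["metrics_group"])
                )
        return super().__new__(mcs, name, bases, namespace)

@dataclass
class OpRuntimeMetrics(metaclass=OpRuntimesMetricsMeta):
    num_inputs_received: int = metricfield(
        description="Number of input blocks received by operator.",
        metrics_group="inputs",
    )

print(_METRICS)  # [Metric(name='num_inputs_received', ...)]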

Contributor:
Personally, I prefer to specify the name only when needed. If you put the logic in __post_init__, that should have a valid name attribute, I think?
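
For context, Field.name is indeed populated by the time __post_init__ runs, as a small sketch shows (hypothetical Demo class; the dedup and __init__ caveats come up below):

from dataclasses import dataclass, field, fields

@dataclass
class Demo:
    num_inputs_received: int = field(default=0, metadata={"metrics_group": "inputs"})

    def __post_init__(self):
        # By the time __post_init__ runs, @dataclass has assigned
        # Field.name, so every field has a valid name here.
        for f in fields(self):
            print(f.name, f.metadata.get("metrics_group"))

Demo()  # prints: num_inputs_received inputs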

@bveeramani (Member, Author):
> alternatively, i think we can add that logic in OpRuntimeMetrics.__post_init__

We could do this. We'd just need to make sure we don't add duplicate items, since __post_init__ is called once per instance, not once when the class is defined.

@bveeramani (Member, Author):
Don't have a strong preference for __post_init__ vs. a metaclass. Down for either.

Contributor:
Discussed offline: we propose putting the logic in __post_init__ and making _METRICS a set so that we don't keep duplicates.

@bveeramani (Member, Author):
After testing it out, I think __post_init__ might not work. __post_init__ only works if you don't override __init__, and we currently override __init__ to pass the Operator to the OpRuntimeMetrics instance.
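
This matches standard dataclass behavior: __post_init__ is invoked by the generated __init__, and @dataclass won't replace a user-defined __init__, so __post_init__ silently never runs. A quick demonstration (hypothetical Demo class, mirroring how OpRuntimeMetrics takes an operator):

from dataclasses import dataclass

@dataclass
class Demo:
    x: int = 0

    def __init__(self, op):
        # A hand-written __init__ takes precedence over the generated one,
        # so __post_init__ below is never invoked automatically.
        self.x = 0
        self._op = op

    def __post_init__(self):
        print("never reached")

Demo(op=None)  # constructs fine; nothing is printed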

bveeramani and others added 4 commits September 23, 2024 15:53
Co-authored-by: Scott Lee <scottjlee@users.noreply.github.com>

    metrics_group=value.metadata[_METRIC_FIELD_METRICS_GROUP_KEY],
    map_only=value.metadata[_METRIC_FIELD_IS_MAP_ONLY_KEY],
)
_METRICS.append(metric)
Contributor:
We're adding to _METRICS twice.

-class OpRuntimeMetrics:
-    """Runtime metrics for a PhysicalOperator.
+class OpRuntimeMetrics(metaclass=OpRuntimesMetricsMeta):
+    """Runtime metrics for a 'PhysicalOperator'.
alexeykudinkin (Contributor):
These are metric definitions, not metrics themselves. Let's make that clear from the description.

@bveeramani (Member, Author):
@alexeykudinkin OpRuntimeMetrics contains the metric definitions as well as the metric values. What's the difference between what's in OpRuntimeMetrics and what you mean by metric?

@bveeramani enabled auto-merge (squash) on September 24, 2024 22:26
@github-actions bot added the "go" label (add ONLY when ready to merge, run all tests) on Sep 24, 2024
@bveeramani enabled auto-merge (squash) on September 25, 2024 01:25
@bveeramani merged commit 7591f91 into master on Sep 25, 2024
6 checks passed
@bveeramani deleted the refactor-metrics branch on September 25, 2024 02:40