[Datasets] Improved naming of Ray Data `map` tasks #32585

scottjlee · 2023-02-15T22:18:51Z

Why are these changes needed?

On the Ray Dashboard, Data-related tasks/actors are summarized by name, but the current names are generic and do not help distinguish between multiple Ray Data usages. When scheduling a task, we can instead give it a more detailed name that contains the underlying function name, so that tasks are easy to tell apart in the dashboard summary.
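As a rough illustration of the idea (not the code in this PR), a descriptive name can be derived from the user-supplied callable and combined with an operator label. The helpers `describe_fn` and `task_name` below are hypothetical names introduced only for this sketch, and the `MapBatches(...)` label is assumed for illustration:

```python
import functools
import inspect


def describe_fn(fn) -> str:
    """Return a short, human-readable name for a user-supplied callable.

    Hypothetical helper for illustration; not part of Ray's API.
    """
    # Unwrap functools.partial so the wrapped function's name is reported.
    if isinstance(fn, functools.partial):
        return describe_fn(fn.func)
    # A callable class passed directly: use the class name.
    if inspect.isclass(fn):
        return fn.__name__
    # Plain functions and lambdas carry __name__; callable instances do not,
    # so fall back to the name of their type.
    return getattr(fn, "__name__", type(fn).__name__)


def task_name(op_name: str, fn) -> str:
    """Combine an operator label with the user function name."""
    return f"{op_name}({describe_fn(fn)})"


def fn_tasks(x):
    return x


class Doubler:
    def __call__(self, x):
        return x * 2


print(task_name("MapBatches", fn_tasks))
print(task_name("MapBatches", Doubler))
print(task_name("MapBatches", Doubler()))
```

A string built this way could then be passed as the task name at scheduling time, which is what lets the dashboard group tasks by the user function instead of a generic label.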

For example, we can run the following script:

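import time

import ray

ray.init()
ds = ray.data.read_csv("input.csv")

# Two identical functions with different names, one per compute strategy.
def fn_tasks(x):
    time.sleep(30)
    return x

def fn_actors(x):
    time.sleep(30)
    return x

# Run the same transformation with task-based and actor-based compute.
ds_tasks = ds.map_batches(fn_tasks, compute="tasks")
ds_actors = ds.map_batches(fn_actors, compute="actors")
assert ds_tasks.take_all() == ds_actors.take_all()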

which will yield these task names on the dashboard:

Related issue number

Closes #32753

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Scott Lee <sjl@anyscale.com>

c21 · 2023-02-17T20:37:23Z

Thanks @scottjlee! cc @clarkzinzow to review as well.

python/ray/data/dataset.py

c21 · 2023-02-17T20:40:43Z

python/ray/data/_internal/execution/operators/actor_pool_map_operator.py

@@ -117,7 +117,7 @@ def _dispatch_tasks(self):
            bundle = self._bundle_queue.popleft()
            input_blocks = [block for block, _ in bundle.blocks]
            ctx = TaskContext(task_idx=self._next_task_idx)
-            ref = actor.submit.options(num_returns="dynamic").remote(
+            ref = actor.submit.options(num_returns="dynamic", name=self.name).remote(


c21 · 2023-02-17

@rkooo567 - I thought you mentioned there's a problem here, and that the actor should not be given a name, right?

scottjlee · 2023-02-17

If we look at the dashboard screenshot above, I believe this isn't actually setting the name of the actor: we can see _MapWorker still being used in the Active Actors by Name chart. I think this part sets the name of the submitted task instead (e.g. fn_actors in the above example).

rkooo567 · 2023-02-20

Also, the names of actors should be unique across the cluster, so let's avoid setting the actor name for now.

clarkzinzow · 2023-02-24

Yep, this is just setting the name of the actor task, so this should be fine, right?

c21 · 2023-02-17T20:41:31Z

python/ray/data/_internal/execution/operators/actor_pool_map_operator.py

@@ -290,6 +293,9 @@ def submit(
    ) -> Iterator[Union[Block, List[BlockMetadata]]]:
        yield from _map_task(fn, ctx, *blocks)

+    def __repr__(self):
+        return f"MapWorker({self.src_fn_name})"


c21 · 2023-02-17

I actually feel the MapWorker naming might be confusing for users. Shall we rename it to something more specific, like MapBatchesActor? @clarkzinzow

clarkzinzow · 2023-02-24

Agreed, but I think that the current MapWorker(MapBatches(fn_name)) is pretty good!

scottjlee · 2023-02-28

Once #32922 is implemented, this should allow us to send this actor name to the dashboard.
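To make the naming discussed in this thread concrete, here is a minimal standalone sketch of how a `__repr__` like the one in the diff produces the `MapWorker(MapBatches(fn_name))` form. `MapWorkerSketch` is a hypothetical stand-in, not the actual actor class, and it assumes `src_fn_name` already holds a string such as `MapBatches(fn_actors)`:

```python
class MapWorkerSketch:
    """Stand-in for the actor class; holds only the source function name."""

    def __init__(self, src_fn_name: str):
        # Assumed to be an operator-style label built from the user function.
        self.src_fn_name = src_fn_name

    def __repr__(self) -> str:
        # Same format string as the __repr__ added in the diff above.
        return f"MapWorker({self.src_fn_name})"


worker = MapWorkerSketch("MapBatches(fn_actors)")
print(repr(worker))
```

Because the label is computed in `__repr__` and not passed as the actor's name, it does not run into the uniqueness concern raised for actor names in the earlier thread.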


rkooo567 · 2023-02-20T07:06:28Z

I don't have specific comments on the PR itself, but I'd like to do some dogfooding together with a couple of data team members after merging this PR!


c21 · 2023-02-22T01:55:52Z

Assuming the CI tests get fixed, the change itself looks good to me. WDYT, @clarkzinzow?


clarkzinzow · 2023-02-27T21:38:25Z

@scottjlee Ah, it looks like some AIR tests that check the stats string content need to be updated as well: https://buildkite.com/ray-project/oss-ci-build-pr/builds/13201#018694ad-6e8b-4256-aba0-5a70395be959


python/ray/data/datasource/datasource.py


c21 · 2023-03-03T18:31:54Z

The test failure looks unrelated to this change.

rkooo567 · 2023-03-11T02:12:57Z

awesome!

rkooo567 · 2023-03-11T02:13:28Z

@scottjlee let's try some dogfooding? I can schedule a meeting


Scott Lee added 11 commits February 7, 2023 11:33

- initial jam (3bb31a5)
- Merge branch 'master' into improved-data-dash-names (b82b028)
- setup path for improved task names (5756e47)
- clean up (9467992)
- add read fn name to LazyBlockList instead of read_datasource (9f07bc8)
- format (9b49e12)
- update class detection logic (b977c4c)
- Merge branch 'master' into improved-data-dash-names (16295f8)
- update tests (48a67d9)
- clean up (4dc1ea2)
- format (003c521)

scottjlee marked this pull request as ready for review February 17, 2023 20:29

scottjlee requested review from ericl, scv119, clarkzinzow, jjyao, jianoaix and c21 as code owners February 17, 2023 20:29

scottjlee assigned c21 and rkooo567 Feb 17, 2023

c21 assigned clarkzinzow Feb 17, 2023

c21 reviewed Feb 17, 2023


- update union naming scheme (7770385)

Scott Lee added 2 commits February 21, 2023 14:39

- remove unnecessary options (92aa93c)
- Merge branch 'master' into improved-data-dash-names (501372d)

scottjlee changed the title ~~[Datasets] Improved naming of Ray Data map tasks/actors~~ [Datasets] Improved naming of Ray Data map tasks Feb 22, 2023

- Merge branch 'master' into improved-data-dash-names (05adb73)

Scott Lee added 4 commits February 24, 2023 15:53

- rewrite Datasource get_name method (20200a8)
- Merge branch 'master' into improved-data-dash-names (66f3668)
- Merge branch 'master' into improved-data-dash-names (58bb391)
- update tests (cf2de1f)

c21 approved these changes Feb 27, 2023


clarkzinzow approved these changes Feb 27, 2023


clarkzinzow added the @author-action-required label Feb 27, 2023

- straggler tests (2f5845c)

scottjlee removed the @author-action-required label Feb 28, 2023

Scott Lee added 2 commits February 28, 2023 10:55

- Merge branch 'master' into improved-data-dash-names (b1b3fbe)
- Merge branch 'master' into improved-data-dash-names (01eac28)

c21 reviewed Feb 28, 2023

python/ray/data/datasource/datasource.py (outdated, resolved)

Scott Lee added 3 commits February 28, 2023 15:18

- update datasource naming (69c6d1c)
- update manual datasource names (40fd9ea)
- Merge branch 'master' into improved-data-dash-names (6f70ef8)

c21 mentioned this pull request Mar 1, 2023

[Datasets] Add telemetry for Ray Data #32896

Merged

7 tasks

- Merge branch 'master' into improved-data-dash-names (d2df035)

clarkzinzow merged commit 39ae7b6 into ray-project:master Mar 3, 2023
