
[Datasets] shuffle performance regression from Ray 1.11 to Ray 2.0 #29408

Closed
jianoaix opened this issue Oct 17, 2022 · 2 comments

Labels: P1 Issue that should be fixed within a few weeks
@jianoaix (Contributor)

There is a performance regression in Dataset.random_shuffle() from Ray 1.11 to Ray 2.0.

Observations:

  • Spread scheduling worked in both cases
  • The reduce phase took significantly longer than before: mean remote wall time per reduce task went from ~5.8ms in 1.11 to ~1.12s in 2.0 (see the stats below)

Before: Ray 1.11 perf test result

time python test_shuffle.py 
Shuffle Map: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:02<00:00,  8.18it/s]
Shuffle Reduce: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:01<00:00, 11.99it/s]
Stage 0 read: 21/21 blocks executed in 0.57s
* Remote wall time: 3.37ms min, 260.57ms max, 230.15ms mean, 4.83s total
* Remote cpu time: 3.81ms min, 261.21ms max, 231.16ms mean, 4.85s total
* Output num rows: 16 min, 52428 max, 49932 mean, 1048576 total
* Output size bytes: 163908 min, 537072436 max, 511505363 mean, 10741612628 total
* Tasks per node: 4 min, 5 max, 4 mean; 5 nodes used

Stage 1 random_shuffle_map: 21/21 blocks executed in 4.36s
* Remote wall time: 8.25ms min, 1.15s max, 1.06s mean, 22.24s total
* Remote cpu time: 8.27ms min, 1.15s max, 1.06s mean, 22.3s total
* Output num rows: 16 min, 52428 max, 49932 mean, 1048576 total
* Output size bytes: 166470 min, 545467470 max, 519500755 mean, 10909515870 total
* Tasks per node: 4 min, 5 max, 4 mean; 5 nodes used

Stage 1 random_shuffle_reduce: 21/21 blocks executed in 4.36s
* Remote wall time: 4.56ms min, 9.21ms max, 5.84ms mean, 122.74ms total
* Remote cpu time: 4.56ms min, 9.2ms max, 5.87ms mean, 123.26ms total
* Output num rows: 49913 min, 49941 max, 49932 mean, 1048576 total
* Output size bytes: 511308856 min, 511595688 max, 511505443 mean, 10741614308 total
* Tasks per node: 4 min, 5 max, 4 mean; 5 nodes used

real    0m7.714s
user    0m1.030s
sys     0m0.235s

After #1: Ray 2.0 perf test result, without fusion

time python test_shuffle.py 
2022-10-17 12:56:03,020 INFO worker.py:1224 -- Using address localhost:9031 set in the environment variable RAY_ADDRESS
2022-10-17 12:56:03,338 INFO worker.py:1333 -- Connecting to existing Ray cluster at address: 172.31.109.143:9031...
2022-10-17 12:56:03,345 INFO worker.py:1515 -- Connected to Ray cluster. View the dashboard at https://console.anyscale-staging.com/api/v2/sessions/ses_1QLzwmUP8hLTTqBu2xCe2YiK/services?redirect_to=dashboard 
2022-10-17 12:56:03,347 INFO packaging.py:342 -- Pushing file package 'gcs://_ray_pkg_081214977a98448ba1a00790cc66edcb.zip' (0.01MiB) to Ray cluster...
2022-10-17 12:56:03,348 INFO packaging.py:351 -- Successfully pushed file package 'gcs://_ray_pkg_081214977a98448ba1a00790cc66edcb.zip'.
Read progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:03<00:00,  6.65it/s]
Shuffle Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:02<00:00,  7.06it/s]
Shuffle Reduce: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:04<00:00,  4.61it/s]
Stage 0 read: 21/21 blocks executed in 1.92s
* Remote wall time: 47.43ms min, 278.87ms max, 259.49ms mean, 5.45s total
* Remote cpu time: 47.62ms min, 279.45ms max, 260.06ms mean, 5.46s total
* Peak heap memory usage (MiB): 1551732000.0 min, 2068036000.0 max, 1793434666 mean
* Output num rows: 16 min, 52428 max, 49932 mean, 1048576 total
* Output size bytes: 163908 min, 537072436 max, 511505363 mean, 10741612628 total
* Tasks per node: 4 min, 5 max, 4 mean; 5 nodes used

Stage 1 random_shuffle: executed in 10.72s

        Substage 0 random_shuffle_map: 21/21 blocks executed
        * Remote wall time: 10.16ms min, 1.35s max, 1.27s mean, 26.65s total
        * Remote cpu time: 10.68ms min, 1.35s max, 1.27s mean, 26.75s total
        * Peak heap memory usage (MiB): 1551732000.0 min, 3170420000.0 max, 2568943809 mean
        * Output num rows: 16 min, 52428 max, 49932 mean, 1048576 total
        * Output size bytes: 166470 min, 545467470 max, 519500755 mean, 10909515870 total
        * Tasks per node: 4 min, 5 max, 4 mean; 5 nodes used

        Substage 1 random_shuffle_reduce: 21/21 blocks executed
        * Remote wall time: 1.09s min, 1.2s max, 1.12s mean, 23.5s total
        * Remote cpu time: 1.09s min, 1.2s max, 1.12s mean, 23.57s total
        * Peak heap memory usage (MiB): 2029100000.0 min, 4026192000.0 max, 3361593523 mean
        * Output num rows: 49905 min, 49941 max, 49932 mean, 1048576 total
        * Output size bytes: 519217863 min, 519592411 max, 519500755 mean, 10909515871 total
        * Tasks per node: 4 min, 5 max, 4 mean; 5 nodes used

real    0m15.498s
user    0m1.668s
sys     0m0.296s

After #2: Ray 2.0 perf test result, with fusion

time python test_shuffle.py 
2022-10-17 13:07:47,557 INFO worker.py:1224 -- Using address localhost:9031 set in the environment variable RAY_ADDRESS
2022-10-17 13:07:47,858 INFO worker.py:1333 -- Connecting to existing Ray cluster at address: 172.31.109.143:9031...
2022-10-17 13:07:47,865 INFO worker.py:1515 -- Connected to Ray cluster. View the dashboard at https://console.anyscale-staging.com/api/v2/sessions/ses_1QLzwmUP8hLTTqBu2xCe2YiK/services?redirect_to=dashboard 
2022-10-17 13:07:47,867 INFO packaging.py:342 -- Pushing file package 'gcs://_ray_pkg_6193deabd41a06b8ef2bfb8348f42df9.zip' (0.00MiB) to Ray cluster...
2022-10-17 13:07:47,868 INFO packaging.py:351 -- Successfully pushed file package 'gcs://_ray_pkg_6193deabd41a06b8ef2bfb8348f42df9.zip'.
Shuffle Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:03<00:00,  6.29it/s]
Shuffle Reduce: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:04<00:00,  4.64it/s]
Stage 1 read->random_shuffle: executed in 7.87s

        Substage 0 read->random_shuffle_map: 21/21 blocks executed
        * Remote wall time: 329.4ms min, 1.87s max, 1.76s mean, 36.95s total
        * Remote cpu time: 310.82ms min, 1.86s max, 1.76s mean, 36.91s total
        * Peak heap memory usage (MiB): 1997432000.0 min, 2122136000.0 max, 2115597714 mean
        * Output num rows: 16 min, 52428 max, 49932 mean, 1048576 total
        * Output size bytes: 166470 min, 545467470 max, 519500755 mean, 10909515870 total
        * Tasks per node: 4 min, 5 max, 4 mean; 5 nodes used

        Substage 1 random_shuffle_reduce: 21/21 blocks executed
        * Remote wall time: 1.09s min, 1.43s max, 1.14s mean, 23.88s total
        * Remote cpu time: 1.09s min, 1.51s max, 1.14s mean, 24.03s total
        * Peak heap memory usage (MiB): 2029064000.0 min, 2535736000.0 max, 2449107809 mean
        * Output num rows: 49905 min, 49941 max, 49932 mean, 1048576 total
        * Output size bytes: 519217863 min, 519592411 max, 519500755 mean, 10909515871 total
        * Tasks per node: 4 min, 5 max, 4 mean; 5 nodes used

real    0m12.369s
user    0m1.578s
sys     0m0.337s

Test cluster: [screenshot; the stats above show 5 nodes used]

Script for 1.11 (slightly different from the 2.0 scripts, because the way spread scheduling is specified changed between releases):

import ray

# Dataset of 10KiB tensor records.
size_gb = 10
total_size = 1024 * 1024 * 1024 * size_gb
record_dim = 1280
record_size = record_dim * 8
num_records = int(total_size / record_size)
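# Sanity check: 1280 float64 values * 8 bytes = 10,240 bytes (~10 KiB) per
# record, so 10 GiB / 10 KiB = 1,048,576 records -- matching the
# "1048576 total" row counts in the stats above.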

ds = ray.data.read_datasource(
        ray.data.datasource.RangeDatasource(),
        parallelism=20,
        n=num_records,
        block_format="tensor",
        _spread_resource_prefix="DATA:",
        tensor_shape=tuple((record_dim,)))
ds = ds.random_shuffle(_spread_resource_prefix="DATA:")
print(ds.stats())

Script for 2.0, without stage fusion:

import ray
from ray.data.context import DatasetContext

context = DatasetContext.get_current()
context.optimize_fuse_stages = False

# Dataset of 10KiB tensor records.
size_gb = 10
total_size = 1024 * 1024 * 1024 * size_gb
record_dim = 1280
record_size = record_dim * 8
num_records = int(total_size / record_size)

ds = ray.data.range_tensor(num_records, shape=(record_dim,), parallelism=20)
ds = ds.random_shuffle()
print(ds.stats())

Script for 2.0, with stage fusion (identical to the previous script minus the DatasetContext override; stage fusion is on by default in 2.0, as the fused "read->random_shuffle" stage in the output above confirms):

import ray

# Dataset of 10KiB tensor records.
size_gb = 10
total_size = 1024 * 1024 * 1024 * size_gb
record_dim = 1280
record_size = record_dim * 8
num_records = int(total_size / record_size)

ds = ray.data.range_tensor(num_records, shape=(record_dim,), parallelism=20)
ds = ds.random_shuffle()
print(ds.stats())
jianoaix added the P1 (Issue that should be fixed within a few weeks) label on Oct 17, 2022
@clarkzinzow (Contributor)

The level of randomness in the shuffle changed between Ray 1.11 and Ray 2.0, so I don't think this is a true performance regression warranting investigation or a perf fix.

The Datasets shuffle in Ray 1.11 wasn't performing a fully global/random shuffle, since it wasn't mixing samples across mapper chunks in the reducers. This was fixed in Ray 1.13 as a drive-by, by ensuring that each reducer concatenates the mapper chunks and then does a full shuffle of the concatenated block before sending it downstream. So the reducers in Ray 1.13+ are doing far, far more work than the reducers in Ray < 1.13.
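For illustration, a minimal numpy sketch of the difference (not Ray's actual implementation; the function names and the use of numpy arrays for blocks are assumptions made for the example):

import numpy as np

def pre_1_13_reduce(mapper_chunks):
    # Pre-1.13 reducer: just concatenate the incoming mapper chunks.
    # Cheap, but rows from each mapper chunk stay contiguous in the
    # output block, so samples are never mixed across chunks.
    return np.concatenate(mapper_chunks)

def full_shuffle_reduce(mapper_chunks, seed=None):
    # 1.13+ reducer: concatenate, then apply a full random permutation
    # to the combined block before sending it downstream. This per-row
    # shuffle is the extra work visible in the 2.0 reduce stats above
    # (~1.12s mean per task vs ~5.8ms in 1.11).
    block = np.concatenate(mapper_chunks)
    rng = np.random.default_rng(seed)
    return block[rng.permutation(len(block))]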

We can definitely explore making this shuffle more performant, but we shouldn't use the shuffle in 1.11 as a baseline, since it wasn't a functionally correct shuffle.

@jianoaix (Contributor, Author)

Thanks, Clark, for shedding light on this. Indeed, the 1.11 shuffle-reduce looks too cheap to have actually done much work. Glad this isn't a real regression!
