The level of randomness in the shuffle changed between Ray 1.11 and Ray 2.0, so I don't think this is a true performance regression that warrants investigation or a perf fix.
The Datasets shuffle in Ray 1.11 wasn't performing a fully global/random shuffle, since it wasn't mixing samples across mapper chunks in the reducers. This was fixed in Ray 1.13 as a drive-by, via ensuring that each reducer concatenates the mapper chunks and then does a full shuffle of the concatenated block before sending it downstream. So, the reducers in Ray 1.13+ are doing far, far more work than the reducers in Ray < 1.13.
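The behavioral difference can be sketched in plain Python (this is an illustration of the semantics described above, not Ray's actual reducer code; the function names are hypothetical):

```python
import random

def reducer_pre_1_13(mapper_chunks):
    # Sketch of Ray < 1.13 behavior: the reducer concatenated its
    # mapper chunks as-is, so rows from one mapper chunk stayed
    # contiguous and samples were never mixed across chunks.
    out = []
    for chunk in mapper_chunks:
        out.extend(chunk)
    return out

def reducer_1_13_plus(mapper_chunks, seed=None):
    # Sketch of Ray 1.13+ behavior: concatenate the mapper chunks,
    # then fully shuffle the concatenated block before sending it
    # downstream. This per-reducer shuffle is the extra work that
    # makes the overall shuffle globally random.
    out = []
    for chunk in mapper_chunks:
        out.extend(chunk)
    random.Random(seed).shuffle(out)
    return out

chunks = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
print(reducer_pre_1_13(chunks))   # chunk order preserved: [0, 1, ..., 8]
print(reducer_1_13_plus(chunks, seed=42))  # same rows, mixed across chunks
```

The pre-1.13 output is not a uniform permutation of the input rows, which is why 1.11's faster numbers don't represent a functionally equivalent shuffle.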
We can definitely explore making this shuffle more performant, but we shouldn't use the shuffle in 1.11 as a baseline, since it wasn't a functionally correct shuffle.
There is a performance regression in Dataset.random_shuffle() from Ray 1.11 to Ray 2.0.
Observations:
Before: Ray 1.11 perf test result
After #1: Ray 2.0 perf test result, without fusion
After #2: Ray 2.0 perf test result, with fusion
Test cluster:
Both versions were tested on the same setup: 1 head node + 5 worker nodes, with identical machine types.
Script for 1.11 (slightly different from the 2.0 script because the way spread scheduling is specified changed between the versions):
Script for 2.0, without stage fusion
Script for 2.0, with stage fusion
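The original scripts aren't reproduced above, but the core of the 2.0 benchmark can be sketched as follows. This is a minimal sketch, assuming the dataset size and parallelism as placeholders (the original scripts' parameters are not shown), and assuming `DatasetContext.optimize_fuse_stages` as the Ray 2.0 toggle for stage fusion:

```python
import time

import ray
from ray.data.context import DatasetContext

ray.init()

# Toggle stage fusion to reproduce the "without fusion" vs. "with
# fusion" runs. Assumption: optimize_fuse_stages controls stage
# fusion on the Ray 2.0 DatasetContext.
ctx = DatasetContext.get_current()
ctx.optimize_fuse_stages = False  # set True for the "with fusion" run

# Placeholder size/parallelism; the real benchmark's values differ.
ds = ray.data.range(100_000_000, parallelism=200)

start = time.time()
ds = ds.random_shuffle()  # executes eagerly in Ray 2.0
print(f"random_shuffle took {time.time() - start:.1f}s")
```

Running this on a cluster matching the setup above (1 head + 5 workers) with fusion off and then on would correspond to the "After #1" and "After #2" measurements.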