enable shuffle of slice data source at the end of epoch #1160

Merged


TomonobuTsujikawa
Contributor

This PR fixes a bug in which the simple data source does not reshuffle at the end of an epoch.

The change is explained below.
The following behavior must be kept:

For example, in shuffle mode (e.g. a data iterator using slice data source --> simple data source):

                   Slice 1    Slice 2    Slice 3
epoch_number 0     d_0_1      d_0_2      d_0_3
epoch_number 1     d_1_1      d_1_2      d_1_3
epoch_number 2     d_2_1      d_2_2      d_2_3

Here, d_?_? (that is, d_<epoch>_<slice>) represents the dataset retrieved from the given slice iterator during the given epoch.

REQUIREMENT:
Requirement 1: d_0_1 & d_0_2 == empty, d_0_1 & d_0_3 == empty, d_0_2 & d_0_3 == empty
Requirement 2: d_0_1 != d_1_1 != d_2_1, d_0_2 != d_1_2 != d_2_2, d_0_3 != d_1_3 != d_2_3

The following behavior is dropped:

NO REQUIREMENT:
Requirement 3: d_0_1 & d_1_2 == empty, or d_0_1 & d_2_2 == empty; other combinations are similar. In other words, slices taken from different epochs no longer have to be disjoint.
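
The requirements above can be checked on a toy dataset. The following is a minimal sketch, not the code changed in this PR: `epoch_order` and `slice_data` are hypothetical helpers, and it assumes that each slice reads a contiguous part of one shuffled order per epoch.

```python
import random

NUM_SAMPLES, NUM_SLICES, NUM_EPOCHS = 12, 3, 3

def epoch_order(epoch, seed=0):
    # Hypothetical helper: one shuffled index order per epoch, shared by all slices.
    rng = random.Random(seed + epoch)
    order = list(range(NUM_SAMPLES))
    rng.shuffle(order)
    return order

def slice_data(epoch, slice_index):
    # d_<epoch>_<slice>: the contiguous part of that epoch's order read by the slice
    # (an assumption made for this sketch).
    size = NUM_SAMPLES // NUM_SLICES
    order = epoch_order(epoch)
    return set(order[slice_index * size:(slice_index + 1) * size])

d = {(e, s): slice_data(e, s) for e in range(NUM_EPOCHS) for s in range(NUM_SLICES)}

# Requirement 1: within one epoch, the slices are pairwise disjoint.
req1 = all(
    d[e, a].isdisjoint(d[e, b])
    for e in range(NUM_EPOCHS)
    for a in range(NUM_SLICES)
    for b in range(a + 1, NUM_SLICES)
)

# Requirement 2: a slice sees different data in consecutive epochs.
# Shuffling makes this very likely, but it does not strictly guarantee it.
req2 = all(
    d[e, s] != d[e + 1, s]
    for e in range(NUM_EPOCHS - 1)
    for s in range(NUM_SLICES)
)

# Dropped requirement 3: slices from different epochs are allowed to overlap.
cross_epoch_overlap = any(d[0, 0] & d[e, 1] for e in range(1, NUM_EPOCHS))

print(req1, req2, cross_epoch_overlap)
```

Requirement 1 holds by construction here, because every slice of an epoch cuts a different part of the same permutation. Requirement 2 and the cross-epoch overlap depend on the random orders that happen to be drawn.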

The change addresses the requirement that all slices share the same random order within an epoch, while each epoch uses a newly generated random order.
That is:

                   Slice 1        Slice 2        Slice 3
epoch_number 0     << Random Order Generation 0 >>
epoch_number 1     << Random Order Generation 1 >>
epoch_number 2     << Random Order Generation 2 >>

When Slice 1, Slice 2, and Slice 3 are exactly synchronized, there is no problem. But if they are not, for example when Slice 1 is running epoch 1 while Slice 2 is still running epoch 0, random order 0 MUST be retained so that Slice 2 can still access it. When Slice 1 advances from epoch 0 to epoch 1, a new random order is generated; the old random order must not be overwritten and has to be kept. Thus, at that point, two random orders coexist.
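
The coexistence of two random orders described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation in this PR: the class `SharedEpochOrders` and its methods are hypothetical names, and slices are indexed from 0 here (index 0 is Slice 1).

```python
import random

class SharedEpochOrders:
    """Hypothetical store: one random order per epoch, kept while a slice still needs it."""

    def __init__(self, num_samples, num_slices, seed=0):
        self._num_samples = num_samples
        self._rng = random.Random(seed)
        self._orders = {}                     # epoch -> shuffled index list
        self._slice_epoch = [0] * num_slices  # epoch each slice is currently reading

    def order_for(self, slice_index):
        # Return the order of the epoch this slice is in, generating it on first use.
        epoch = self._slice_epoch[slice_index]
        if epoch not in self._orders:
            order = list(range(self._num_samples))
            self._rng.shuffle(order)
            self._orders[epoch] = order
        return self._orders[epoch]

    def advance(self, slice_index):
        # Move one slice to its next epoch; older orders are kept, not overwritten.
        self._slice_epoch[slice_index] += 1
        oldest_needed = min(self._slice_epoch)
        for epoch in [e for e in self._orders if e < oldest_needed]:
            del self._orders[epoch]  # no slice can read this epoch any more

    def live_epochs(self):
        return sorted(self._orders)


orders = SharedEpochOrders(num_samples=12, num_slices=3)
order0_for_slice1 = list(orders.order_for(0))  # Slice 1 reads epoch 0
orders.advance(0)                              # Slice 1 moves on to epoch 1
order1_for_slice1 = orders.order_for(0)        # a new order is generated for epoch 1
order0_for_slice2 = orders.order_for(1)        # Slice 2 is still in epoch 0
print(orders.live_epochs())                    # epochs whose orders are currently stored
```

In this sketch an order is discarded only once every slice has moved past its epoch, so a slice that lags behind always finds the order of the epoch it is still reading.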

In fact, in multi-node distributed computation this condition never occurs, because the processes are kept exactly synchronized by MPI primitives and data is exchanged under strict synchronization. The code change is still necessary, because a slice iterator cannot assume how developers will use it: it is not intended only for distributed computation.

@TomonobuTsujikawa TomonobuTsujikawa added the release-note-utility Auto-release; Utilities label Jan 24, 2023
@YukioOobuchi YukioOobuchi merged commit 76268f2 into master Jan 24, 2023
@YukioOobuchi YukioOobuchi deleted the feature/20221222-fix-data-source-simple-shuffle branch January 24, 2023 00:39