
Update training_rules.adoc #448

Open

wants to merge 3 commits into master

Conversation

mrinal-gc

packing rule update

@github-actions

github-actions bot commented Apr 29, 2021

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@guschmue
Contributor

recheck

1 similar comment
@guschmue
Contributor

recheck

@mrinal-gc
Author

@johntran-nv @petermattson Could you review this?

@petermattson
Contributor

Closer! :-)

IMO, the only change needed is to add this paragraph:

(Un)padding or (un)packing are both allowed as offline or online preprocessing steps, including removal or addition of zero tokens. When packing, it is permitted to reorder and compress the dataset. However, the overall data traversal order, taking any packing into account, must still be as random as the reference implementation. For instance, it is allowed to (a) pack items into groups offline and then randomly reorder the groups each run, or (b) randomly order the items and then pack them into groups as they are traversed online, provided that in both cases the groups are much smaller than the overall dataset. It is not allowed to sort the data for packing and then use the same sorted order for every run.

I'd revert the changes and stick this on the end of the first CLOSED: para in the section. WDYT?
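A minimal sketch of the permitted pattern (a), assuming a fixed group size; the helper names are hypothetical and this is an illustration, not reference code:

```python
import random

def pack_offline(items, group_size):
    """Pack the dataset into small fixed groups once, offline."""
    return [items[i:i + group_size] for i in range(0, len(items), group_size)]

def traversal_order(groups, seed):
    """Reshuffle the groups at the start of each run, so the overall
    traversal stays random even though group contents are fixed."""
    order = list(groups)
    random.Random(seed).shuffle(order)
    return [item for group in order for item in group]

items = list(range(100_000))
groups = pack_offline(items, group_size=8)  # groups much smaller than dataset
run_a = traversal_order(groups, seed=0)     # a different order every run
run_b = traversal_order(groups, seed=1)
```

Sorting the items for packing and then reusing run_a's order for every run would be exactly the disallowed case in the last sentence of the paragraph.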

@mrinal-gc
Author

mrinal-gc commented Apr 30, 2021 via email

@mrinal-gc
Author

@petermattson done! Thanks

@@ -215,6 +215,7 @@ OPEN: If applicable, the test dataset must be extracted in the same manner as th
CLOSED: the training and test data must be traversed in the same conceptual order as the reference implementation. For instance, the data might be traversed sequentially or randomly with uniform distribution. Batch size, shard size, and the random number generator will affect order.

Where data pipelines randomly order data, arbitrary sharding, batching, and packing are allowed provided that (1) the data is still overall randomly ordered and not ordered to improve convergence and (2) each datum still appears exactly once.
(Un)padding or (un)packing are both allowed as offline or online preprocessing steps, including removal or addition of zero tokens. When packing, it is permitted to reorder and compress the dataset. However, the overall data traversal order, taking any packing into account, must still be as random as the reference implementation. For instance, it is allowed to (a) pack items into groups offline and then randomly reorder the groups each run, or (b) randomly order the items and then pack them into groups as they are traversed online, provided that in both cases the groups are much smaller than the overall dataset. It is not allowed to sort the data for packing and then use the same sorted order for every run.
@ShriyaPalsamudram
Contributor

ShriyaPalsamudram commented Sep 19, 2024

@parmitam can we state explicitly that this only applies to BERT, because this rule does not apply to any other benchmark?

Contributor

I agree. This section should say only that padding/un-padding is allowed but that packing should be done if and only if it is done by the reference. And the packing algorithm should be the one the reference uses.

This is an exception that was added for the BERT benchmark since GraphCore needed it at the last minute, and unfortunately the packing code was never put into the reference. This paragraph should be moved to Section 14, "Appendix: Benchmark Specific Rules".

Contributor

Can we change the wording from "as random as the reference" to "at least as random as the reference"? There are bugs in the BERT reference where it does not fully randomize when run on a small number of accelerators (I think the crossover point is 32 accelerators).

Contributor

When using packing, the number of samples per batch becomes variable, and the batch size affects (a) which RCP is used, (b) the LR schedule, and (c) the eval schedule. With the packing algorithm proposed by GraphCore (and used by NVIDIA and NVIDIA's partners since 2021), it was empirically measured that ~2.0x as many samples are processed per batch, so the committee agreed that for GraphCore's packing algorithm the code would report the batch size as 2x larger when using the packed dataset.
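A sketch of the batch-size accounting described above, with hypothetical names; the ~2.0x figure is the empirical value from the comment, not a constant of the algorithm:

```python
def packing_ratio(packs):
    """Average number of original sequences per packed sample."""
    total_sequences = sum(len(pack) for pack in packs)
    return total_sequences / len(packs)

def reported_batch_size(packs_per_batch, packs):
    """Batch size to report when training on packed data: the number of
    packs per batch scaled by the measured packing ratio (~2.0 for the
    GraphCore algorithm on the BERT dataset)."""
    return round(packs_per_batch * packing_ratio(packs))
```

For example, 256 packs per batch at a 2.0 packing ratio is reported as batch size 512, and that reported size is what RCP selection, the LR schedule, and the eval schedule key off.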

Contributor

GraphCore's algorithm uses "Non-negative Least Squares Histogram-Packing", which is described in a PowerPoint slide that was shared with the committee in 2021. I don't think that slide ever got uploaded to the Google Drive, so I've forwarded a copy of it to Shriya. There may also have been a simpler greedy algorithm evaluated at the same time that achieved similar packing ratios, but I can't find any documentation about that.
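For reference, a sketch of the kind of simpler greedy alternative mentioned above: a first-fit-decreasing pack by sequence length. This is an assumption for illustration, not the Non-negative Least Squares Histogram-Packing algorithm from the slide:

```python
def greedy_pack(lengths, max_len=512):
    """First-fit-decreasing: visit sequences longest-first and place each
    into the first pack that still has room; open a new pack otherwise."""
    packs, remaining = [], []
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        for pack_id, space in enumerate(remaining):
            if lengths[idx] <= space:
                packs[pack_id].append(idx)
                remaining[pack_id] -= lengths[idx]
                break
        else:
            packs.append([idx])
            remaining.append(max_len - lengths[idx])
    return packs

# greedy_pack([500, 120, 380, 90, 200]) -> [[0], [2, 1], [4, 3]]:
# each pack's sequence lengths sum to at most 512.
```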
