
Update training_rules.adoc #448

Open

wants to merge 3 commits into master

Conversation

mrinal-gc

packing rule update

@github-actions

github-actions bot commented Apr 29, 2021

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@guschmue
Contributor

recheck

1 similar comment
@guschmue
Contributor

recheck

@mrinal-gc
Author

@johntran-nv @petermattson Could you review this?

@petermattson
Contributor

Closer! :-)

IMO, the only change needed is to add this paragraph:

(Un)padding or (un)packing are both allowed as offline or online preprocessing steps, including removal or addition of zero tokens. When packing, it is permitted to reorder and compress the dataset. However, the overall data traversal order, taking any packing into account, must still be as random as the reference implementation. For instance, it is allowed to (a) pack items into groups offline and then randomly reorder the groups each run, or (b) randomly order the items and then pack them into groups as they are traversed online, provided that in both cases the groups are much smaller than the overall dataset. It is not allowed to sort the data for packing and then use the same sorted order for every run.

I'd revert the changes and stick this on the end of the first CLOSED: para in the section. WDYT?
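A minimal sketch of the permitted pattern (a), assuming a fixed group size; the helper names are hypothetical and this is an illustration, not reference code:

```python
import random

def pack_offline(items, group_size):
    """Pack the dataset into small fixed groups once, offline."""
    return [items[i:i + group_size] for i in range(0, len(items), group_size)]

def traversal_order(groups, seed):
    """Reshuffle the groups at the start of each run, so the overall
    traversal stays random even though group contents are fixed."""
    order = list(groups)
    random.Random(seed).shuffle(order)
    return [item for group in order for item in group]

items = list(range(100_000))
groups = pack_offline(items, group_size=8)  # groups much smaller than dataset
run_a = traversal_order(groups, seed=0)     # a different order every run
run_b = traversal_order(groups, seed=1)
```

Sorting the items for packing and then reusing run_a's order for every run would be exactly the disallowed case in the last sentence of the paragraph.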

@mrinal-gc
Author

mrinal-gc commented Apr 30, 2021 via email

@mrinal-gc
Author

@petermattson done! Thanks

@@ -215,6 +215,7 @@ OPEN: If applicable, the test dataset must be extracted in the same manner as th
CLOSED: the training and test data must be traversed in the same conceptual order as the reference implementation. For instance, the data might be traversed sequentially or randomly with uniform distribution. Batch size, shard size, and the random number generator will affect order.

Where data pipelines randomly order data, arbitrary sharding, batching, and packing are allowed provided that (1) the data is still overall randomly ordered and not ordered to improve convergence and (2) each datum still appears exactly once.
(Un)padding or (un)packing are both allowed as offline or online preprocessing steps, including removal or addition of zero tokens. When packing, it is permitted to reorder and compress the dataset. However, the overall data traversal order, taking any packing into account, must still be as random as the reference implementation. For instance, it is allowed to (a) pack items into groups offline and then randomly reorder the groups each run, or (b) randomly order the items and then pack them into groups as they are traversed online, provided that in both cases the groups are much smaller than the overall dataset. It is not allowed to sort the data for packing and then use the same sorted order for every run.
@ShriyaPalsamudram
Contributor

ShriyaPalsamudram commented Sep 19, 2024

@parmitam can we state explicitly that this only applies to BERT, because this rule does not apply to any other benchmark?

Contributor

I agree. This section should say only that padding/un-padding is allowed but that packing should be done if and only if it is done by the reference. And the packing algorithm should be the one the reference uses.

This is an exception that was added for the BERT benchmark since GraphCore needed it at the last minute, and unfortunately the packing code was never put into the reference. This paragraph should be moved to Section 14, "Appendix: Benchmark Specific Rules".

Contributor

Can we change the wording from "as random as the reference" to "at least as random as the reference"? There are bugs in the BERT reference where it does not fully randomize when run on a small number of accelerators (I think the crossover point is 32 accelerators).

Contributor

When using packing, the number of samples per batch becomes variable, and the batch size affects (a) which RCP is used, (b) the LR schedule, and (c) the eval schedule. With the packing algorithm proposed by GraphCore (and used by NVIDIA and NVIDIA's partners since 2021), it was empirically measured that ~2.0x as many samples are processed per batch, so the committee agreed that for GraphCore's packing algorithm the code would report the batch size as 2x larger when using the packed dataset.
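A sketch of the batch-size accounting described above, with hypothetical names; the ~2.0x figure is the empirical value from the comment, not a constant of the algorithm:

```python
def packing_ratio(packs):
    """Average number of original sequences per packed sample."""
    total_sequences = sum(len(pack) for pack in packs)
    return total_sequences / len(packs)

def reported_batch_size(packs_per_batch, packs):
    """Batch size to report when training on packed data: the number of
    packs per batch scaled by the measured packing ratio (~2.0 for the
    GraphCore algorithm on the BERT dataset)."""
    return round(packs_per_batch * packing_ratio(packs))
```

For example, 256 packs per batch at a 2.0 packing ratio is reported as batch size 512, and that reported size is what RCP selection, the LR schedule, and the eval schedule key off.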

Contributor

GraphCore's algorithm uses "Non-negative Least Squares Histogram-Packing", which is described in a PowerPoint slide that was shared with the committee in 2021. I don't think that slide ever got uploaded to the Google Drive, so I've forwarded a copy of it to Shriya. There may also have been a simpler greedy algorithm evaluated at the same time that achieved similar packing ratios, but I can't find any documentation about that.
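For reference, a sketch of the kind of simpler greedy alternative mentioned above: a first-fit-decreasing pack by sequence length. This is an assumption for illustration, not the Non-negative Least Squares Histogram-Packing algorithm from the slide:

```python
def greedy_pack(lengths, max_len=512):
    """First-fit-decreasing: visit sequences longest-first and place each
    into the first pack that still has room; open a new pack otherwise."""
    packs, remaining = [], []
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        for pack_id, space in enumerate(remaining):
            if lengths[idx] <= space:
                packs[pack_id].append(idx)
                remaining[pack_id] -= lengths[idx]
                break
        else:
            packs.append([idx])
            remaining.append(max_len - lengths[idx])
    return packs

# greedy_pack([500, 120, 380, 90, 200]) -> [[0], [2, 1], [4, 3]]:
# each pack's sequence lengths sum to at most 512.
```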
