[SPMD] input sharding for both train & test data loading in imagenet example #6515

yeounoh · 2024-02-10T00:36:21Z

Use MpDeviceLoader and input sharding for train & eval.
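For readers unfamiliar with the pattern, here is a minimal sketch of what wrapping both loaders could look like. It is not the diff from this PR. The helper names `batch_partition_spec` and `make_sharded_loaders`, the mesh axis name `"data"`, and the 4D `(N, C, H, W)` image batch layout are assumptions made for illustration, and the module path `torch_xla.distributed.spmd` is assumed to match the installed torch_xla version.

```python
from typing import Optional, Tuple


def batch_partition_spec(ndim: int,
                         axis_name: str = "data") -> Tuple[Optional[str], ...]:
    """Build a partition spec that shards dim 0 (the batch dimension)
    over the mesh axis `axis_name` and replicates all other dims.

    `axis_name` is a hypothetical mesh axis name chosen for this sketch.
    """
    if ndim < 1:
        raise ValueError("ndim must be >= 1")
    return (axis_name,) + (None,) * (ndim - 1)


def make_sharded_loaders(train_loader, test_loader, device, mesh):
    """Wrap both the train and test loaders with the same input sharding.

    Hypothetical helper: `mesh` is assumed to be an SPMD mesh that has an
    axis named "data". Imports are done lazily so that
    `batch_partition_spec` can be used without torch_xla installed.
    """
    import torch_xla.distributed.parallel_loader as pl
    import torch_xla.distributed.spmd as xs

    # Assumed layout: image batches are 4D tensors (N, C, H, W).
    spec = xs.ShardingSpec(mesh, batch_partition_spec(4))

    # Apply the same spec to both loaders, so eval batches are sharded
    # the same way as training batches.
    train_device_loader = pl.MpDeviceLoader(
        train_loader, device, input_sharding=spec)
    test_device_loader = pl.MpDeviceLoader(
        test_loader, device, input_sharding=spec)
    return train_device_loader, test_device_loader
```

Using one spec for both loaders keeps the train and eval input paths consistent, which is the point of this change.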

yeounoh · 2024-02-10T00:39:48Z

This only touches the example script. I tested it locally:

| Training Device=xla:0/0 Epoch=1 Step=9300 Loss=0.00135 Rate=1624.32 GlobalRate=1511.17 Time=00:34:55
| Training Device=xla:0/0 Epoch=1 Step=9320 Loss=0.00135 Rate=1628.15 GlobalRate=1511.41 Time=00:34:57
| Training Device=xla:0/0 Epoch=1 Step=9340 Loss=0.00135 Rate=1626.28 GlobalRate=1511.64 Time=00:34:58
| Training Device=xla:0/0 Epoch=1 Step=9360 Loss=0.00135 Rate=1618.30 GlobalRate=1511.84 Time=00:35:00
Epoch 1 train end 00:35:01
| Test Device=xla:0/0 Step=0 Epoch=1 Time=00:35:05
| Test Device=xla:0/0 Step=20 Epoch=1 Time=00:35:10
| Test Device=xla:0/0 Step=40 Epoch=1 Time=00:35:10
| Test Device=xla:0/0 Step=60 Epoch=1 Time=00:35:11
| Test Device=xla:0/0 Step=80 Epoch=1 Time=00:35:11
| Test Device=xla:0/0 Step=100 Epoch=1 Time=00:35:12
| Test Device=xla:0/0 Step=120 Epoch=1 Time=00:35:12
| Test Device=xla:0/0 Step=140 Epoch=1 Time=00:35:12
| Test Device=xla:0/0 Step=160 Epoch=1 Time=00:35:13
| Test Device=xla:0/0 Step=180 Epoch=1 Time=00:35:13
| Test Device=xla:0/0 Step=200 Epoch=1 Time=00:35:13
| Test Device=xla:0/0 Step=220 Epoch=1 Time=00:35:14
| Test Device=xla:0/0 Step=240 Epoch=1 Time=00:35:14
| Test Device=xla:0/0 Step=260 Epoch=1 Time=00:35:15
| Test Device=xla:0/0 Step=280 Epoch=1 Time=00:35:15
| Test Device=xla:0/0 Step=300 Epoch=1 Time=00:35:15
| Test Device=xla:0/0 Step=320 Epoch=1 Time=00:35:16
| Test Device=xla:0/0 Step=340 Epoch=1 Time=00:35:16
| Test Device=xla:0/0 Step=360 Epoch=1 Time=00:35:16
| Test Device=xla:0/0 Step=380 Epoch=1 Time=00:35:17
Epoch 1 test end 00:35:17, Accuracy=100.00
Epoch 2 train begin 00:35:17
| Training Device=xla:0/0 Epoch=2 Step=0 Loss=0.00135 Rate=538.75 GlobalRate=538.72 Time=00:35:17
| Training Device=xla:0/0 Epoch=2 Step=20 Loss=0.00135 Rate=1175.06 GlobalRate=1462.19 Time=00:35:19
| Training Device=xla:0/0 Epoch=2 Step=40 Loss=0.00135 Rate=1427.16 GlobalRate=1524.20 Time=00:35:20
| Training Device=xla:0/0 Epoch=2 Step=60 Loss=0.00135 Rate=1526.29 GlobalRate=1545.90 Time=00:35:22
| Training Device=xla:0/0 Epoch=2 Step=80 Loss=0.00135 Rate=1560.74 GlobalRate=1555.07 Time=00:35:24
| Training Device=xla:0/0 Epoch=2 Step=100 Loss=0.00135 Rate=1578.82 GlobalRate=1562.03 Time=00:35:25

cc @vanbasten23

vanbasten23

LGTM. Thanks

Input sharding should be applied to both train and test data

4782b06

yeounoh added the SPMD / Distributed label Feb 10, 2024

yeounoh self-assigned this Feb 10, 2024

yeounoh requested a review from vanbasten23 February 10, 2024 00:36

vanbasten23 approved these changes Feb 10, 2024

yeounoh merged commit a5692c2 into master Feb 10, 2024
2 of 3 checks passed

amithrm pushed a commit to amithrm/xla that referenced this pull request Mar 1, 2024

[SPMD] input sharding for both train & test data loading in imagenet …

e4923e4

…example (pytorch#6515)

bhavya01 pushed a commit that referenced this pull request Apr 22, 2024

[SPMD] input sharding for both train & test data loading in imagenet …

0fea081

…example (#6515)
