[SPMD] input sharding for both train & test data loading in imagenet example #6515

Merged on Feb 10, 2024 (1 commit)

@yeounoh (Contributor) commented Feb 10, 2024

Use MpDeviceLoader with input sharding for both the train and eval (test) data loaders in the imagenet example.
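For readers unfamiliar with the pattern, here is a minimal sketch of wrapping a data loader in `MpDeviceLoader` with an input sharding spec. It is not the code from this PR. The helper names `batch_partition_spec` and `wrap_with_input_sharding`, the one-dimensional mesh with a single axis named `data`, and the default batch rank of 4 (N, C, H, W images) are all assumptions made for illustration, and the example script may build its mesh and partition spec differently. The sketch also assumes SPMD mode has already been enabled (for example with `xr.use_spmd()`) and that the `torch_xla.distributed.spmd` module path is available in the installed release.

```python
def batch_partition_spec(rank, axis_name="data"):
    """Hypothetical helper: shard dim 0 (batch) over `axis_name`,
    replicate every other dimension."""
    if rank < 1:
        raise ValueError("rank must be >= 1")
    return (axis_name,) + (None,) * (rank - 1)


def wrap_with_input_sharding(loader, batch_rank=4):
    """Hypothetical helper: wrap `loader` so each batch is sharded along
    its batch dimension when it is transferred to the XLA device.

    Assumes SPMD mode was enabled by the caller before any XLA tensor
    was created.
    """
    # Imported lazily so batch_partition_spec works without torch_xla.
    import torch_xla.core.xla_model as xm
    import torch_xla.runtime as xr
    import torch_xla.distributed.spmd as xs
    import torch_xla.distributed.parallel_loader as pl

    num_devices = xr.global_runtime_device_count()
    # Assumption: a 1-D mesh over all devices with one axis named "data".
    mesh = xs.Mesh(list(range(num_devices)), (num_devices,), ("data",))
    spec = xs.ShardingSpec(mesh, batch_partition_spec(batch_rank))
    return pl.MpDeviceLoader(loader, xm.xla_device(), input_sharding=spec)
```

Under these assumptions, the same wrapper would be applied to both loaders, e.g. `train_device_loader = wrap_with_input_sharding(train_loader)` and `test_device_loader = wrap_with_input_sharding(test_loader)`, so that evaluation batches are sharded the same way as training batches.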

@yeounoh (Contributor, Author) commented Feb 10, 2024

This only touches the example script, and I tested it locally:

| Training Device=xla:0/0 Epoch=1 Step=9300 Loss=0.00135 Rate=1624.32 GlobalRate=1511.17 Time=00:34:55
| Training Device=xla:0/0 Epoch=1 Step=9320 Loss=0.00135 Rate=1628.15 GlobalRate=1511.41 Time=00:34:57
| Training Device=xla:0/0 Epoch=1 Step=9340 Loss=0.00135 Rate=1626.28 GlobalRate=1511.64 Time=00:34:58
| Training Device=xla:0/0 Epoch=1 Step=9360 Loss=0.00135 Rate=1618.30 GlobalRate=1511.84 Time=00:35:00
Epoch 1 train end 00:35:01
| Test Device=xla:0/0 Step=0 Epoch=1 Time=00:35:05
| Test Device=xla:0/0 Step=20 Epoch=1 Time=00:35:10
| Test Device=xla:0/0 Step=40 Epoch=1 Time=00:35:10
| Test Device=xla:0/0 Step=60 Epoch=1 Time=00:35:11
| Test Device=xla:0/0 Step=80 Epoch=1 Time=00:35:11
| Test Device=xla:0/0 Step=100 Epoch=1 Time=00:35:12
| Test Device=xla:0/0 Step=120 Epoch=1 Time=00:35:12
| Test Device=xla:0/0 Step=140 Epoch=1 Time=00:35:12
| Test Device=xla:0/0 Step=160 Epoch=1 Time=00:35:13
| Test Device=xla:0/0 Step=180 Epoch=1 Time=00:35:13
| Test Device=xla:0/0 Step=200 Epoch=1 Time=00:35:13
| Test Device=xla:0/0 Step=220 Epoch=1 Time=00:35:14
| Test Device=xla:0/0 Step=240 Epoch=1 Time=00:35:14
| Test Device=xla:0/0 Step=260 Epoch=1 Time=00:35:15
| Test Device=xla:0/0 Step=280 Epoch=1 Time=00:35:15
| Test Device=xla:0/0 Step=300 Epoch=1 Time=00:35:15
| Test Device=xla:0/0 Step=320 Epoch=1 Time=00:35:16
| Test Device=xla:0/0 Step=340 Epoch=1 Time=00:35:16
| Test Device=xla:0/0 Step=360 Epoch=1 Time=00:35:16
| Test Device=xla:0/0 Step=380 Epoch=1 Time=00:35:17
Epoch 1 test end 00:35:17, Accuracy=100.00
Epoch 2 train begin 00:35:17
| Training Device=xla:0/0 Epoch=2 Step=0 Loss=0.00135 Rate=538.75 GlobalRate=538.72 Time=00:35:17
| Training Device=xla:0/0 Epoch=2 Step=20 Loss=0.00135 Rate=1175.06 GlobalRate=1462.19 Time=00:35:19
| Training Device=xla:0/0 Epoch=2 Step=40 Loss=0.00135 Rate=1427.16 GlobalRate=1524.20 Time=00:35:20
| Training Device=xla:0/0 Epoch=2 Step=60 Loss=0.00135 Rate=1526.29 GlobalRate=1545.90 Time=00:35:22
| Training Device=xla:0/0 Epoch=2 Step=80 Loss=0.00135 Rate=1560.74 GlobalRate=1555.07 Time=00:35:24
| Training Device=xla:0/0 Epoch=2 Step=100 Loss=0.00135 Rate=1578.82 GlobalRate=1562.03 Time=00:35:25

cc @vanbasten23

@vanbasten23 (Collaborator) left a comment

LGTM. Thanks

@yeounoh yeounoh merged commit a5692c2 into master Feb 10, 2024
2 of 3 checks passed
amithrm pushed a commit to amithrm/xla that referenced this pull request Mar 1, 2024