The pytorchJob training is slow #1532
PyTorch version: 1.7.1+cu102. Training code and YAML are attached.

How about the network in the cluster?

The time difference mainly shows up in the optimizer step.

I ran a qperf test.

Hello, have you solved the training speed issue?

After testing, I think it is a bandwidth problem.
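To sanity-check the bandwidth hypothesis without qperf, raw TCP throughput can be measured with a small socket benchmark. The sketch below runs against loopback so it is self-contained; in the cluster you would run the server side in one pod and connect from another (the host and transfer sizes here are illustrative, not from the original report):

```python
import socket
import threading
import time

CHUNK = 1 << 20          # 1 MiB per send
TOTAL = 64 * CHUNK       # 64 MiB transferred in total

def serve(sock):
    # Accept one connection and drain everything the client sends.
    conn, _ = sock.accept()
    with conn:
        while conn.recv(1 << 16):
            pass

def measure_throughput(host="127.0.0.1"):
    # NOTE: loopback only measures the local path; to test the cluster
    # network, bind the server in one pod and connect from another pod.
    srv = socket.socket()
    srv.bind((host, 0))
    srv.listen(1)
    port = srv.getsockname()[1]
    t = threading.Thread(target=serve, args=(srv,), daemon=True)
    t.start()

    buf = b"\x00" * CHUNK
    with socket.create_connection((host, port)) as cli:
        start = time.perf_counter()
        for _ in range(TOTAL // CHUNK):
            cli.sendall(buf)
        cli.shutdown(socket.SHUT_WR)
        t.join()  # wait until the server has drained everything
        elapsed = time.perf_counter() - start
    srv.close()
    return TOTAL / elapsed / 1e6  # MB/s

if __name__ == "__main__":
    print(f"{measure_throughput():.1f} MB/s")
```

If the pod-to-pod number is far below the NIC line rate, the DDP all-reduce in the backward pass will be network-bound.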
Setting the bandwidth problem aside, I have a question about `torch.distributed.launch` usage with DDP. I tried a YAML with 1 master + 2 workers; the Worker spec is as follows:

```yaml
Worker:
  replicas: 2
  restartPolicy: OnFailure
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      containers:
        - name: pytorch
          image: repository/kubeflow/arrikto-playground/dimpo/ranzcr-dist:latest
          command: ["sh", "-c",
            "python -m torch.distributed.launch
            --nnodes=3
            --nproc_per_node=8
            --node_rank=${RANK}
            --master_addr=${MASTER_ADDR}
            --master_port=${MASTER_PORT}
            /home/jovyan/ddp/ddp-mul-gpu.py"
          ]
          imagePullPolicy: "Always"
          volumeMounts:
            - mountPath: /dev/shm
              name: dshm
            - mountPath: /home/jovyan
              name: workspace-kaggle
          resources:
            limits:
              memory: "10Gi"
              cpu: "8"
              nvidia.com/gpu: 8
```

The log suggests the shell cannot read those environment variables from the pod.
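One way to sidestep shell expansion issues is to read the rendezvous settings inside the Python script instead of interpolating `${RANK}` in the `sh -c` command line. The sketch below is a hypothetical helper, assuming the operator injects `RANK`, `MASTER_ADDR`, and `MASTER_PORT` into the container environment; the fallback defaults are only so the script still starts outside the cluster:

```python
import os

def rendezvous_env():
    # Hypothetical helper: collect the rendezvous settings that the
    # PyTorchJob operator is expected to inject, with single-node
    # defaults so the script also runs standalone for debugging.
    return {
        "rank": int(os.environ.get("RANK", "0")),
        "master_addr": os.environ.get("MASTER_ADDR", "127.0.0.1"),
        "master_port": int(os.environ.get("MASTER_PORT", "29500")),
    }

if __name__ == "__main__":
    cfg = rendezvous_env()
    print(cfg)  # pass these to torch.distributed.init_process_group(...)
```

Printing the dictionary at startup also makes it obvious in the pod logs whether the expected variables were injected at all.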
I used this repository https://github.com/Shuai-Xie/mnist-pytorchjob-example for speed testing. I have more than two k8s nodes, and compared two workers with four GPUs each scheduled on the same node against the same setup spread across different nodes. Each node has 4 Tesla V100 (32 GB) cards. The comparison results are below.
@Shuai-Xie
Different nodes:

```
accuracy=0.0000
time >>: 148.1217818260193
----forwardtime: 4.269739866256714 ----backwardtime 99.48042893409729
testtime: 5.997300148010254
accuracy=0.0000
time >>: 148.90791296958923
----forwardtime: 7.287083387374878 ----backwardtime 98.69852066040039
testtime: 6.474621534347534
```
Same node:

```
Train Epoch: 1 [0/15018 (0%)] loss=7.1878
Train Epoch: 1 [0/15018 (0%)] loss=7.0561
Train Epoch: 1 [1280/15018 (33%)] loss=3.3408
Train Epoch: 1 [1280/15018 (33%)] loss=3.3714
Train Epoch: 1 [2560/15018 (67%)] loss=3.2577
Train Epoch: 1 [2560/15018 (67%)] loss=3.3267
accuracy=0.0000
time >>: 73.05248022079468
----forwardtime: 4.658957004547119 ----backwardtime 26.686575412750244
testtime: 7.037155389785767
accuracy=0.0000
time >>: 73.65188813209534
----forwardtime: 5.2202184200286865 ----backwardtime 26.243772506713867
testtime: 6.42363977432251
```
Please ignore the accuracy values (I used a special dataset).
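For reference, the slowdown implied by these numbers can be computed directly from the times reported above (first worker of each run). The backward pass, which is where DDP performs the gradient all-reduce, accounts for most of the gap, which is consistent with a bandwidth bottleneck:

```python
# Slowdown of the cross-node run relative to the same-node run,
# using the times from the logs above (first worker of each run).
total_same, total_diff = 73.05248022079468, 148.1217818260193
bwd_same, bwd_diff = 26.686575412750244, 99.48042893409729

total_ratio = total_diff / total_same    # ~2x slower overall
backward_ratio = bwd_diff / bwd_same     # ~3.7x slower backward pass

print(f"total: {total_ratio:.2f}x, backward: {backward_ratio:.2f}x")
```

The forward and test times are roughly unchanged between the two runs, so the extra wall-clock time is almost entirely the inter-node gradient synchronization.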
System Information
Linux version 3.10.0-1062.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-36) (GCC) ) #1 SMP Wed Aug 7 18:08:02 UTC 2019
K8s Version
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.16", GitCommit:"e37e4ab4cc8dcda84f1344dda47a97bb1927d074", GitTreeState:"clean", BuildDate:"2021-10-27T16:25:59Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.16", GitCommit:"e37e4ab4cc8dcda84f1344dda47a97bb1927d074", GitTreeState:"clean", BuildDate:"2021-10-27T16:20:18Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
Kubeflow Version
v1.3