PyTorchJob replicas have different node affinity behavior compared with Deployment #344
Comments
Can you please show us the pod YAML?
Sure, the pod YAML is below. Thanks a lot.

apiVersion: v1
kind: Pod
metadata:
annotations:
cni.projectcalico.org/podIP: 10.100.103.167/32
cni.projectcalico.org/podIPs: 10.100.103.167/32
sidecar.istio.io/inject: "false"
creationTimestamp: "2021-07-22T02:16:05Z"
labels:
controller-name: pytorch-operator
group-name: kubeflow.org
job-name: pytorch-dist-mnist-gloo
pytorch-job-name: pytorch-dist-mnist-gloo
pytorch-replica-index: "0"
pytorch-replica-type: worker
managedFields:
- apiVersion: v1
fieldsType: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.: {}
f:sidecar.istio.io/inject: {}
f:labels:
.: {}
f:controller-name: {}
f:group-name: {}
f:job-name: {}
f:pytorch-job-name: {}
f:pytorch-replica-index: {}
f:pytorch-replica-type: {}
f:ownerReferences:
.: {}
k:{"uid":"773627d5-b463-45c9-9a17-134aec4c2b80"}:
.: {}
f:apiVersion: {}
f:blockOwnerDeletion: {}
f:controller: {}
f:kind: {}
f:name: {}
f:uid: {}
f:spec:
f:affinity:
.: {}
f:nodeAffinity:
.: {}
f:preferredDuringSchedulingIgnoredDuringExecution: {}
f:requiredDuringSchedulingIgnoredDuringExecution:
.: {}
f:nodeSelectorTerms: {}
f:containers:
k:{"name":"pytorch"}:
.: {}
f:args: {}
f:env:
.: {}
k:{"name":"MASTER_ADDR"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"MASTER_PORT"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"PYTHONUNBUFFERED"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"RANK"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"WORLD_SIZE"}:
.: {}
f:name: {}
f:value: {}
f:image: {}
f:imagePullPolicy: {}
f:name: {}
f:resources:
.: {}
f:limits:
.: {}
f:nvidia.com/gpu: {}
f:requests:
.: {}
f:nvidia.com/gpu: {}
f:terminationMessagePath: {}
f:terminationMessagePolicy: {}
f:dnsPolicy: {}
f:enableServiceLinks: {}
f:initContainers:
.: {}
k:{"name":"init-pytorch"}:
.: {}
f:command: {}
f:image: {}
f:imagePullPolicy: {}
f:name: {}
f:resources:
.: {}
f:limits:
.: {}
f:cpu: {}
f:memory: {}
f:requests:
.: {}
f:cpu: {}
f:memory: {}
f:terminationMessagePath: {}
f:terminationMessagePolicy: {}
f:restartPolicy: {}
f:schedulerName: {}
f:securityContext: {}
f:terminationGracePeriodSeconds: {}
manager: pytorch-operator.v1
operation: Update
time: "2021-07-22T02:16:05Z"
- apiVersion: v1
fieldsType: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
f:cni.projectcalico.org/podIP: {}
f:cni.projectcalico.org/podIPs: {}
manager: calico
operation: Update
time: "2021-07-22T02:16:06Z"
- apiVersion: v1
fieldsType: FieldsV1
fieldsV1:
f:status:
f:conditions:
k:{"type":"ContainersReady"}:
.: {}
f:lastProbeTime: {}
f:lastTransitionTime: {}
f:status: {}
f:type: {}
k:{"type":"Initialized"}:
.: {}
f:lastProbeTime: {}
f:lastTransitionTime: {}
f:status: {}
f:type: {}
k:{"type":"Ready"}:
.: {}
f:lastProbeTime: {}
f:lastTransitionTime: {}
f:status: {}
f:type: {}
f:containerStatuses: {}
f:hostIP: {}
f:initContainerStatuses: {}
f:phase: {}
f:podIP: {}
f:podIPs:
.: {}
k:{"ip":"10.100.103.167"}:
.: {}
f:ip: {}
f:startTime: {}
manager: kubelet
operation: Update
time: "2021-07-22T02:16:38Z"
name: pytorch-dist-mnist-gloo-worker-0
namespace: default
ownerReferences:
- apiVersion: kubeflow.org/v1
blockOwnerDeletion: true
controller: true
kind: PyTorchJob
name: pytorch-dist-mnist-gloo
uid: 773627d5-b463-45c9-9a17-134aec4c2b80
resourceVersion: "21438615"
selfLink: /api/v1/namespaces/default/pods/pytorch-dist-mnist-gloo-worker-0
uid: c7c73a69-a9c2-415d-aa14-56c93a24bc1b
spec:
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- preference:
matchExpressions:
- key: machine
operator: In
values:
- A1
weight: 1
- preference:
matchExpressions:
- key: machine
operator: In
values:
- A2
weight: 2
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: machine
operator: In
values:
- A1
- A2
containers:
- args:
- --backend
- gloo
- --epochs
- "2"
env:
- name: MASTER_PORT
value: "23456"
- name: MASTER_ADDR
value: pytorch-dist-mnist-gloo-master-0
- name: WORLD_SIZE
value: "4"
- name: RANK
value: "1"
- name: PYTHONUNBUFFERED
value: "0"
image: shuaix/pytorch-dist-mnist:1.0
imagePullPolicy: IfNotPresent
name: pytorch
resources:
limits:
nvidia.com/gpu: "1"
requests:
nvidia.com/gpu: "1"
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-p2txv
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
initContainers:
- command:
- sh
- -c
- until nslookup pytorch-dist-mnist-gloo-master-0; do echo waiting for master;
sleep 2; done;
image: alpine:3.10
imagePullPolicy: IfNotPresent
name: init-pytorch
resources:
limits:
cpu: 100m
memory: 20Mi
requests:
cpu: 50m
memory: 10Mi
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-p2txv
readOnly: true
nodeName: A2
priority: 0
restartPolicy: OnFailure
schedulerName: default-scheduler
securityContext: {}
serviceAccount: default
serviceAccountName: default
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- name: default-token-p2txv
secret:
defaultMode: 420
secretName: default-token-p2txv
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2021-07-22T02:16:59Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2021-07-22T02:17:00Z"
status: "True"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2021-07-22T02:17:00Z"
status: "True"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2021-07-22T02:16:05Z"
status: "True"
type: PodScheduled
containerStatuses:
- containerID: docker://8126bb20e0426584ca420352cc9684b25a555700dde4cba8cb242f6d3bb875c5
image: shuaix/pytorch-dist-mnist:1.0
imageID: docker-pullable://shuaix/pytorch-dist-mnist@sha256:e2b5a55c6a2c372620f951584e888e0f933b5a6c14f918f38ede10bd6de3f47c
lastState: {}
name: pytorch
ready: true
restartCount: 0
started: true
state:
running:
startedAt: "2021-07-22T02:16:59Z"
hostIP: 10.252.192.43
initContainerStatuses:
- containerID: docker://d51564e9ee09fa847e245ead062b40db9764b5df776b5d819a1f4542744dfa89
image: alpine:3.10
imageID: docker-pullable://alpine@sha256:451eee8bedcb2f029756dc3e9d73bab0e7943c1ac55cff3a4861c52a0fdd3e98
lastState: {}
name: init-pytorch
ready: true
restartCount: 0
state:
terminated:
containerID: docker://d51564e9ee09fa847e245ead062b40db9764b5df776b5d819a1f4542744dfa89
exitCode: 0
finishedAt: "2021-07-22T02:16:59Z"
reason: Completed
startedAt: "2021-07-22T02:16:28Z"
phase: Running
podIP: 10.100.103.167
podIPs:
- ip: 10.100.103.167
qosClass: Burstable
startTime: "2021-07-22T02:16:26Z"
As shown in the pod spec, the nodeAffinity is set, so it should be honored by the scheduler. Are there enough resources on A1?
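A quick, targeted way to check the unallocated GPUs (a sketch; the node names A1 and A2 and the nvidia.com/gpu resource name are taken from the specs above):

# prints the GPU lines from each node's Capacity, Allocatable, and Allocated resources sections
$ kubectl describe node A1 | grep "nvidia.com/gpu"
$ kubectl describe node A2 | grep "nvidia.com/gpu"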
Yes. Both A1 and A2 have 4 unallocated GPUs.

$ k describe nodes
Name: A1
...
Capacity:
cpu: 48
ephemeral-storage: 22888456Mi
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 263722000Ki
nvidia.com/gpu: 4
pods: 110
Allocatable:
cpu: 48
ephemeral-storage: 21600257039028
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 263619600Ki
nvidia.com/gpu: 4
pods: 110
Non-terminated Pods: (11 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
default dcgm-exporter-1625661096-p8qwm 0 (0%) 0 (0%) 0 (0%) 0 (0%) 14d
ingress-nginx nginx1 0 (0%) 0 (0%) 0 (0%) 0 (0%) 43h
ingress-nginx nginx2 0 (0%) 0 (0%) 0 (0%) 0 (0%) 43h
kube-system calico-node-7x6t4 250m (0%) 0 (0%) 0 (0%) 0 (0%) 14d
kube-system kube-proxy-7ldzj 0 (0%) 0 (0%) 0 (0%) 0 (0%) 14d
kube-system nvidia-device-plugin-daemonset-2f4tq 0 (0%) 0 (0%) 0 (0%) 0 (0%) 14d
logging elasticsearch-logging-1 100m (0%) 1 (2%) 3Gi (1%) 3Gi (1%) 44h
logging fluentd-v2.8.0-9pg47 100m (0%) 0 (0%) 200Mi (0%) 500Mi (0%) 147m
prometheus alertmanager-kube-prometheus-stack-1625-alertmanager-0 100m (0%) 100m (0%) 250Mi (0%) 50Mi (0%) 2d1h
prometheus kube-prometheus-stack-1625-operator-764ddc77-2hk4p 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d1h
prometheus kube-prometheus-stack-1625714272-prometheus-node-exporter-xps7v 0 (0%) 0 (0%) 0 (0%) 0 (0%) 14d
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 550m (1%) 1100m (2%)
memory 3522Mi (1%) 3622Mi (1%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
nvidia.com/gpu 0 0
Events: <none>
Name: A2
...
Capacity:
cpu: 48
ephemeral-storage: 22888456Mi
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 263722000Ki
nvidia.com/gpu: 4
pods: 110
Allocatable:
cpu: 48
ephemeral-storage: 21600257039028
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 263619600Ki
nvidia.com/gpu: 4
pods: 110
Non-terminated Pods: (13 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
default dcgm-exporter-1625661096-bxggr 0 (0%) 0 (0%) 0 (0%) 0 (0%) 14d
elastic-job etcd 0 (0%) 0 (0%) 0 (0%) 0 (0%) 9d
istio-system cluster-local-gateway-6b6cb58745-fzqbr 100m (0%) 2 (4%) 128Mi (0%) 1Gi (0%) 7d16h
knative-serving autoscaler-5888bf7697-gj989 30m (0%) 300m (0%) 40Mi (0%) 400Mi (0%) 3d1h
knative-serving istio-webhook-7db84bf7bf-d5jc5 20m (0%) 200m (0%) 20Mi (0%) 200Mi (0%) 7d16h
knative-serving networking-istio-55d86868c6-wzh6h 30m (0%) 300m (0%) 40Mi (0%) 400Mi (0%) 7d16h
kube-system calico-node-rgmfx 250m (0%) 0 (0%) 0 (0%) 0 (0%) 14d
kube-system kube-proxy-8rv4g 0 (0%) 0 (0%) 0 (0%) 0 (0%) 14d
kube-system nvidia-device-plugin-daemonset-wdkbt 0 (0%) 0 (0%) 0 (0%) 0 (0%) 14d
kubeflow kfserving-controller-manager-0 100m (0%) 100m (0%) 200Mi (0%) 300Mi (0%) 7d16h
logging fluentd-v2.8.0-z65nz 100m (0%) 0 (0%) 200Mi (0%) 500Mi (0%) 147m
logging kibana-7d5cc86845-ntz9t 100m (0%) 1 (2%) 0 (0%) 0 (0%) 44h
prometheus kube-prometheus-stack-1625714272-prometheus-node-exporter-wsqv4 0 (0%) 0 (0%) 0 (0%) 0 (0%) 14d
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 730m (1%) 3900m (8%)
memory 628Mi (0%) 2824Mi (1%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
nvidia.com/gpu 0 0
Events: <none>
Hello.
Dear developers, I found a problem when using PyTorchJob.
Problem
I notice that PyTorchJob replica pods don't obey the scheduling rules set in the node affinity: all the pods of a PyTorchJob tend to be scheduled onto the same node, and the preferred weights set in the node affinity seem to have no effect.
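The placement can be verified from the pods' NODE column; a sketch, assuming the job-name label shown in the pod metadata above:

# -o wide adds a NODE column showing where each replica was scheduled
$ kubectl get pods -l job-name=pytorch-dist-mnist-gloo -o wide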
Example: PyTorchJob vs. Deployment
For example, the PyTorchJob and Deployment replica pods are expected to be scheduled to the A1 and A2 nodes, with 1 pod on A1 and 2 pods on A2. The YAML files are below.
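For reference, a minimal sketch of the PyTorchJob side, reconstructed from the worker pod spec shown in the comments above (the replica count is assumed for illustration, and the Deployment presumably carries the same affinity block under its own pod template):

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-dist-mnist-gloo
spec:
  pytorchReplicaSpecs:
    # Master replica spec omitted for brevity
    Worker:
      replicas: 3               # assumed for illustration
      restartPolicy: OnFailure
      template:
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: machine
                    operator: In
                    values: ["A1", "A2"]
              preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 1
                preference:
                  matchExpressions:
                  - key: machine
                    operator: In
                    values: ["A1"]
              - weight: 2
                preference:
                  matchExpressions:
                  - key: machine
                    operator: In
                    values: ["A2"]
          containers:
          - name: pytorch
            image: shuaix/pytorch-dist-mnist:1.0
            resources:
              limits:
                nvidia.com/gpu: 1

This affinity block is identical to the one visible in the scheduled worker pod above, which suggests the operator passes it through to the pods unchanged.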
The result shows that the preferred weights in the node affinity work as expected for the Deployment but have no effect for the PyTorchJob.
I guess the default scheduling strategy for PyTorchJob replica pods tends to place pods on as few nodes as possible, so that distributed training can benefit from faster GPU communication.
Thanks a lot.