Creating a PyTorchJob whose PyTorchJob.metadata.name does not comply with RFC 1035 (that is, it is not a valid DNS-1035 label) but which is valid in every other way results in the following:
pytorchjob creation
missing pytorchjob state / nil status
successful pod creation
failed service creation (service names must be RFC-1035 compliant)
This behavior was noticed in a production environment, where it caused distributed PyTorch jobs to stop making progress: the workers failed to rendezvous during initialization because the service was never created, and the job never changed state because its status was missing.
I have provided a greatly simplified reproduction case below.
Questions
Is the community aware of this problem?
Has a resolution been proposed?
What is the best way to follow or participate in a resolution?
Version
training-operator Release v1.5
Reproducing
Note: kubectl can be used in place of oc in the commands below.
test-env# Baseline, create a pytorchjob with a valid name (all should go well)
bash-3.2$ oc get pytorchjob
No resources found
bash-3.2$ oc create -f test.yaml
pytorchjob.kubeflow.org/test created
bash-3.2$ oc get pytorchjob
NAME STATE AGE
test Running 8s
bash-3.2$ oc describe pytorchjob test
Name: test
Namespace: test-env
Labels: <none>
Annotations: <none>
API Version: kubeflow.org/v1
Kind: PyTorchJob
Metadata:
Creation Timestamp: 2023-01-27T21:02:39Z
Generation: 1
Managed Fields:
API Version: kubeflow.org/v1
Fields Type: FieldsV1
fieldsV1:
f:spec:
.:
f:pytorchReplicaSpecs:
.:
f:Master:
.:
f:replicas:
f:restartPolicy:
f:template:
.:
f:spec:
.:
f:containers:
f:imagePullSecrets:
f:volumes:
Manager: kubectl-create
Operation: Update
Time: 2023-01-27T21:02:39Z
API Version: kubeflow.org/v1
Fields Type: FieldsV1
fieldsV1:
f:status:
.:
f:completionTime:
f:conditions:
f:replicaStatuses:
.:
f:Master:
.:
f:labelSelector:
.:
f:matchLabels:
.:
f:group-name:
f:job-name:
f:training.kubeflow.org/job-name:
f:training.kubeflow.org/operator-name:
f:training.kubeflow.org/replica-type:
f:succeeded:
f:startTime:
Manager: manager
Operation: Update
Subresource: status
Time: 2023-01-27T21:03:15Z
Resource Version: 232228338
UID: ff185e74-3cff-417f-92e8-d8adb578fd1a
Spec:
Pytorch Replica Specs:
Master:
Replicas: 1
Restart Policy: Never
Template:
Spec:
Containers:
Command:
bash
-c
echo "Environment variables set by the kubeflow training operator:"
echo ${MASTER_ADDR}:${MASTER_PORT}
echo "PYTHONUNBUFFERED:"${PYTHONUNBUFFERED}
echo My global rank is ${RANK} / ${WORLD_SIZE}
#
# User commands
#
echo "Container started!" && sleep 30 && echo "Bye now"
Env:
Image: bash
Image Pull Policy: IfNotPresent
Name: pytorch
Resources:
Limits:
Cpu: 1
Memory: 1Gi
nvidia.com/gpu: 0
Requests:
Cpu: 1
Memory: 1Gi
nvidia.com/gpu: 0
Volume Mounts:
Image Pull Secrets:
Volumes:
Status:
Completion Time: 2023-01-27T21:03:15Z
Conditions:
Last Transition Time: 2023-01-27T21:02:39Z
Last Update Time: 2023-01-27T21:02:39Z
Message: PyTorchJob test is created.
Reason: PyTorchJobCreated
Status: True
Type: Created
Last Transition Time: 2023-01-27T21:02:43Z
Last Update Time: 2023-01-27T21:02:43Z
Message: PyTorchJob test is running.
Reason: JobRunning
Status: False
Type: Running
Last Transition Time: 2023-01-27T21:03:15Z
Last Update Time: 2023-01-27T21:03:15Z
Message: PyTorchJob test is successfully completed.
Reason: JobSucceeded
Status: True
Type: Succeeded
Replica Statuses:
Master:
Label Selector:
Match Labels:
Group - Name: kubeflow.org
Job - Name: test
training.kubeflow.org/job-name: test
training.kubeflow.org/operator-name: pytorchjob-controller
training.kubeflow.org/replica-type: Master
Succeeded: 1
Start Time: 2023-01-27T21:02:39Z
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreatePod 41s pytorchjob-controller Created pod: test-master-0
Normal SuccessfulCreateService 41s pytorchjob-controller Created service: test-master-0
Normal ExitedWithCode 5s (x2 over 7s) pytorchjob-controller Pod: test-env.test-master-0 exited with code 0
Normal JobSucceeded 5s pytorchjob-controller PyTorchJob test is successfully completed.
bash-3.2$ # All went well as expected
bash-3.2$ # Now prefix a number to the name to trigger the reported issue
bash-3.2$ vi test.yaml
bash-3.2$ oc create -f test.yaml
pytorchjob.kubeflow.org/1test created
bash-3.2$ oc get pytorchjob
NAME STATE AGE
1test 7s
test Succeeded 2m19s
bash-3.2$ # No state on "1test" !
bash-3.2$ oc describe pytorchjob 1test
Name: 1test
Namespace: test-env
Labels: <none>
Annotations: <none>
API Version: kubeflow.org/v1
Kind: PyTorchJob
Metadata:
Creation Timestamp: 2023-01-27T21:04:51Z
Generation: 1
Managed Fields:
API Version: kubeflow.org/v1
Fields Type: FieldsV1
fieldsV1:
f:spec:
.:
f:pytorchReplicaSpecs:
.:
f:Master:
.:
f:replicas:
f:restartPolicy:
f:template:
.:
f:spec:
.:
f:containers:
f:imagePullSecrets:
f:volumes:
Manager: kubectl-create
Operation: Update
Time: 2023-01-27T21:04:51Z
Resource Version: 232230735
UID: 33d2d935-3da5-42d2-ba0a-3931f1a7928b
Spec:
Pytorch Replica Specs:
Master:
Replicas: 1
Restart Policy: Never
Template:
Spec:
Containers:
Command:
bash
-c
echo "Environment variables set by the kubeflow training operator:"
echo ${MASTER_ADDR}:${MASTER_PORT}
echo "PYTHONUNBUFFERED:"${PYTHONUNBUFFERED}
echo My global rank is ${RANK} / ${WORLD_SIZE}
#
# User commands
#
echo "Container started!" && sleep 30 && echo "Bye now"
Env:
Image: bash
Image Pull Policy: IfNotPresent
Name: pytorch
Resources:
Limits:
Cpu: 1
Memory: 1Gi
nvidia.com/gpu: 0
Requests:
Cpu: 1
Memory: 1Gi
nvidia.com/gpu: 0
Volume Mounts:
Image Pull Secrets:
Volumes:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreatePod 18s pytorchjob-controller Created pod: 1test-master-0
Warning FailedCreateService 13s (x13 over 18s) pytorchjob-controller Error creating: Service "1test-master-0" is invalid: metadata.name: Invalid value: "1test-master-0": a DNS-1035 label must consist of lower case alphanumeric characters or '-', start with an alphabetic character, and end with an alphanumeric character (e.g. 'my-name', or 'abc-123', regex used for validation is '[a-z]([-a-z0-9]*[a-z0-9])?')
bash-3.2$ # service creation failure!
bash-3.2$ # And nil status!
Note: in this greatly simplified test case the pod can actually complete, because it only runs echo and sleep and does not rely on the service, unlike typical PyTorchJob use cases. The behaviors noted above (failed service creation and missing status) are still observable.
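The failure above can in principle be caught before the job is submitted. The following is a minimal sketch, not part of the training operator: it assumes the service name is built as `<job name>-<replica type>-<index>` (inferred from the `test-master-0` and `1test-master-0` events above) and checks that name against the regex quoted in the FailedCreateService message, plus the 63-character limit Kubernetes applies to DNS labels. The function names `dns1035_errors` and `check_job_name` are hypothetical.

```python
import re
from typing import List

# Regex quoted in the FailedCreateService event above.
DNS1035_LABEL = re.compile(r"[a-z]([-a-z0-9]*[a-z0-9])?")
# Kubernetes limits a DNS label to 63 characters.
DNS1035_MAX_LEN = 63


def dns1035_errors(name: str) -> List[str]:
    """Return a list of reasons why `name` is not a valid DNS-1035 label."""
    errors = []
    if len(name) > DNS1035_MAX_LEN:
        errors.append(f"{name!r} is longer than {DNS1035_MAX_LEN} characters")
    if not DNS1035_LABEL.fullmatch(name):
        errors.append(f"{name!r} does not match the DNS-1035 label pattern")
    return errors


def check_job_name(job_name: str, replica_type: str = "master", index: int = 0) -> List[str]:
    """Check the service name that would be derived from a job name.

    Assumption: the naming pattern "<job>-<replica type>-<index>" is
    inferred from the events in this report, not from the operator source.
    """
    service_name = f"{job_name}-{replica_type}-{index}"
    return dns1035_errors(service_name)


if __name__ == "__main__":
    for candidate in ("test", "1test"):
        problems = check_job_name(candidate)
        print(candidate, "OK" if not problems else problems)
```

Run against the two names used in the reproduction, the check accepts `test` and reports one problem for `1test`, which matches the behavior seen in the events.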
The training-operator doesn't verify whether the CustomJob (e.g., PyTorchJob) name is a valid DNS-1035 label.
We may want to validate the CustomJob name so that it meets DNS-1035. Alternatively, it might be better to convert the CustomJob name to a DNS-1035-compliant form only when we perform CRUD operations on the Service.
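To make the second option concrete, here is one possible conversion, written as a hedged sketch rather than a proposal for the operator's actual behavior. The function name `to_dns1035_label` and the `job-` prefix are assumptions chosen for illustration; the regex is the one quoted in the error message in the report.

```python
import re

# Regex quoted in the FailedCreateService event in the report.
DNS1035_LABEL = re.compile(r"[a-z]([-a-z0-9]*[a-z0-9])?")
# Kubernetes limits a DNS label to 63 characters.
DNS1035_MAX_LEN = 63


def to_dns1035_label(name: str, prefix: str = "job-") -> str:
    """Convert `name` into a DNS-1035 label (hypothetical scheme).

    Steps: lower-case, replace disallowed characters with '-', add an
    alphabetic prefix if the first character is not a letter, truncate
    to 63 characters, and strip trailing '-'.
    """
    label = re.sub(r"[^-a-z0-9]", "-", name.lower())
    if not label or not label[0].isalpha():
        label = prefix + label
    label = label[:DNS1035_MAX_LEN].rstrip("-")
    if not DNS1035_LABEL.fullmatch(label):
        # Only reachable if the caller passes an invalid prefix.
        raise ValueError(f"could not convert {name!r} to a DNS-1035 label")
    return label


if __name__ == "__main__":
    for service_name in ("test-master-0", "1test-master-0"):
        print(service_name, "->", to_dns1035_label(service_name))
```

A conversion like this is lossy: two different job names can map to the same label (for example `1test` and `job-1test`, or long names that differ only after the 63rd character), so a real implementation would need a collision strategy. Anything that refers to the service by name, such as the MASTER_ADDR value echoed in the reproduction, would presumably also have to use the converted name.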