Issues: kubeflow/training-operator
Docs: reference architecture for fault tolerance capabilities
area/docs
good first issue
help wanted
kind/feature
#2157
opened Jul 4, 2024 by
StefanoFioravanzo
[GSOC] Project 7 Tracking Issue: Automate docs generation for Training-operator Python SDK
area/gsoc
#2156
opened Jun 26, 2024 by
shivas1516
7 tasks
Improve Training Operator release process
good first issue
help wanted
#2155
opened Jun 25, 2024 by
andreyvelich
Tracking Issue: Integrate JAX in Kubeflow Training Operator
area/gsoc
#2145
opened Jun 13, 2024 by
sandipanpanda
1 of 18 tasks
spatial dataset training functions
kind/feature
lifecycle/needs-triage
#2141
opened Jun 7, 2024 by
Jo316
The actual default RestartPolicy of PyTorch is inconsistent with its description in the CRD
#2127
opened May 27, 2024 by
Eslody
mpijob will stuck if LastReconcileTime is updated in 1 second
#2118
opened May 17, 2024 by
shadowdsp
Export Fine-Tuned LLM after Trainer is Complete
kind/discussion
#2101
opened May 6, 2024 by
andreyvelich
fix(compatability): match-case syntax only compatible with Python3.10
release/1.8
#2096
opened May 2, 2024 by
PantherHawk
chore(style): provide type for STORAGE_INITIALIZER_VOLUME constant
#2093
opened May 2, 2024 by
PantherHawk
Add DeepSpeed Example with MPI Operator
area/example
good first issue
help wanted
#2091
opened Apr 29, 2024 by
andreyvelich
Flaky Test: [It] should create desired Pods and Services: Distributed TFJob (4 workers, 2 PS) is succeeded
#2086
opened Apr 27, 2024 by
tenzen-y
Not getting Kubeflow Training SDK v1.7 when installing kubeflow-training
#2082
opened Apr 24, 2024 by
JamesKunstle
Update pytorch launcher component in Kubeflow Pipelines repository
good first issue
help wanted
kind/feature
#2068
opened Apr 17, 2024 by
anishasthana
Support CertManager for the Webhook cert generation
kind/feature
#2049
opened Apr 10, 2024 by
tenzen-y
PytorchJob restartPolicy: ExitCode does not honor backoffLimit for retryable errors
kind/feature
#2045
opened Apr 5, 2024 by
kellyaa