Hello,

I am using Kubeflow in an EKS environment and am encountering an issue where the retry settings are not functioning as expected.

Environment

How did you deploy Kubeflow Pipelines (KFP)? Using a custom deployment on AWS EKS.
KFP version:
Kubeflow version: 1.9.0
KFP SDK version:

Steps to reproduce

1. I am using a custom MinIO setup instead of the default one (the ConfigMap for the MinIO setup is listed under Materials and Reference below).
2. Defined the following pipeline with retry settings (a quick compiled-spec check follows the logs below):
from kfp import dsl, compiler


@dsl.component
def random_failure_op(exit_codes: str):
    """A component that fails randomly."""
    import random
    import sys

    exit_code = int(random.choice(exit_codes.split(",")))
    print(exit_code)
    sys.exit(exit_code)


@dsl.pipeline(
    name='retry-random-failures',
    description='The pipeline includes two steps which fail randomly. It shows how to use ContainerOp(...).set_retry(...).')
def retry_sample_pipeline():
    # Each task fails randomly; set_retry() configures the retry budget.
    op1 = random_failure_op(exit_codes='0,1,2,3').set_retry(10)
    op2 = random_failure_op(exit_codes='0,1').set_retry(5)


if __name__ == '__main__':
    compiler.Compiler().compile(retry_sample_pipeline, __file__ + '.yaml')
3. When a task fails, the retries are not executed as expected.
4. Logs from the dag-driver pod, where the MinIO service is referenced incorrectly:
// log of retry-random-failures-tw2sp-system-dag-driver-3164055661
I1016 10:45:56.065592 17 main.go:121] input ContainerSpec:{}
I1016 10:45:56.065684 17 main.go:128] input RuntimeConfig:{}
I1016 10:45:56.065742 17 main.go:136] input kubernetesConfig:{}
I1016 10:45:56.066311 17 cache.go:139] Cannot detect ml-pipeline in the same namespace, default to ml-pipeline.kubeflow:8887 as KFP endpoint.
I1016 10:45:56.066337 17 cache.go:116] Connecting to cache endpoint ml-pipeline.kubeflow:8887
I1016 10:45:56.084156 17 env.go:65] cannot find launcher configmap: name="kfp-launcher" namespace="kubeflow-user-example-com", will use default config
I1016 10:45:56.084194 17 driver.go:153] PipelineRoot="minio://mlpipeline/v2/artifacts" from default config
I1016 10:45:56.084219 17 config.go:166] Cannot detect minio-service in the same namespace, default to minio-service.kubeflow:9000 as MinIO endpoint.
I1016 10:45:56.226159 17 client.go:302] Pipeline Context: id:806 name:"retry-random-failures" type_id:20 type:"system.Pipeline" create_time_since_epoch:1729047175840 last_update_time_since_epoch:1729047175840
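As a sanity check, here is a minimal sketch I use to confirm the retry settings survive compilation; the file name is a placeholder for the YAML written by the compiler above:

import yaml  # third-party: pyyaml

# Placeholder path: wherever compiler.Compiler().compile() wrote the spec.
with open('retry_pipeline.py.yaml') as f:
    spec = yaml.safe_load(f)

# In the KFP v2 IR, each DAG task should carry a retryPolicy once
# set_retry() has been applied, e.g. {'maxRetryCount': 10}.
for name, task in spec['root']['dag']['tasks'].items():
    print(name, task.get('retryPolicy'))

If the retryPolicy entries are present in the compiled YAML but the tasks still fail without retrying at runtime, the problem would be on the backend side rather than in the SDK.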
Expected result
Retry logic should be triggered when tasks fail, and retries should be reflected in the flow diagram on the dashboard. The custom MinIO endpoint should be referenced correctly.
Materials and Reference
- Logs indicating the incorrect reference to the MinIO service.
- Custom pipeline code with retry logic.
- ConfigMap for the custom MinIO setup.
In addition, I would like to know the specific steps or configurations required to make Kubeflow Pipelines (KFP) reference the custom MinIO instance we have prepared.
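From the dag-driver log above ("cannot find launcher configmap ... will use default config"), my understanding is that the v2 driver looks for a ConfigMap named kfp-launcher in the profile namespace and falls back to the built-in MinIO defaults when it is missing. Below is a hypothetical sketch of such a ConfigMap; the bucket and path are placeholders, not our actual values:

apiVersion: v1
kind: ConfigMap
metadata:
  name: kfp-launcher
  # The profile namespace seen in the driver log above.
  namespace: kubeflow-user-example-com
data:
  # Placeholder pipeline root; would need to point at the custom MinIO bucket.
  defaultPipelineRoot: "minio://custom-bucket/v2/artifacts"

If I understand correctly, newer KFP releases also accept a providers entry in this ConfigMap for pointing the launcher at a non-default object-store endpoint with credentials; confirmation of whether the KFP bundled with Kubeflow 1.9.0 supports this would be appreciated.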
Let me know if any further information is needed. Thank you for your help!
Thank you for the responses and insights provided. After further investigation, we have confirmed that the retry functionality does not work as expected in this Kubeflow Pipelines setup, irrespective of the MinIO configuration. It appears that similar issues have been reported, so I will comment on those instead.