
[backend] Retry setting for components not functioning as expected in custom Minio setup #11301

Open
kunihito opened this issue Oct 16, 2024 · 2 comments


kunihito commented Oct 16, 2024

Hello,
I am using Kubeflow in an EKS environment and am encountering an issue where the retry settings are not functioning as expected.

Environment

  • How did you deploy Kubeflow Pipelines (KFP)?
    Using a custom deployment on AWS EKS.

  • KFP version:
    Kubeflow version: 1.9.0

  • KFP SDK version:

$ pip list | grep kfp
kfp                         2.9.0

Steps to reproduce

  1. I am using a custom Minio setup instead of the default one.

  2. ConfigMap for Minio setup:

    # kubectl describe configmap workflow-controller-configmap -n kubeflow
    
    Name:         workflow-controller-configmap
    Namespace:    kubeflow
    
    Data
    ====
    artifactRepository:
    ----
    archiveLogs: true
    s3:
      endpoint: "minio.kf-storage.svc.cluster.local:80"
      bucket: "mlpipeline"
  3. Defined the following pipeline with retry settings:

    from kfp import dsl, compiler
    
    @dsl.component
    def random_failure_op(exit_codes: str):
        """A component that fails randomly."""
        import random
        import sys
    
        exit_code = int(random.choice(exit_codes.split(",")))
        print(exit_code)
        sys.exit(exit_code)
    
    @dsl.pipeline(
        name='retry-random-failures',
        description='The pipeline includes two steps that fail randomly. It shows how to use the task-level .set_retry(...) setting.'
    )
    def retry_sample_pipeline():
        op1 = random_failure_op(exit_codes='0,1,2,3').set_retry(10)
        op2 = random_failure_op(exit_codes='0,1').set_retry(5)
    
    if __name__ == '__main__':
        compiler.Compiler().compile(retry_sample_pipeline, __file__ + '.yaml')
  4. When a task fails, it is not retried.

  5. Logs from the dag-driver pod, showing that the default Minio endpoint is used instead of the custom one:

    // log of retry-random-failures-tw2sp-system-dag-driver-3164055661
    I1016 10:45:56.065592      17 main.go:121] input ContainerSpec:{}
    I1016 10:45:56.065684      17 main.go:128] input RuntimeConfig:{}
    I1016 10:45:56.065742      17 main.go:136] input kubernetesConfig:{}
    I1016 10:45:56.066311      17 cache.go:139] Cannot detect ml-pipeline in the same namespace, default to ml-pipeline.kubeflow:8887 as KFP endpoint.
    I1016 10:45:56.066337      17 cache.go:116] Connecting to cache endpoint ml-pipeline.kubeflow:8887
    I1016 10:45:56.084156      17 env.go:65] cannot find launcher configmap: name="kfp-launcher" namespace="kubeflow-user-example-com", will use default config
    I1016 10:45:56.084194      17 driver.go:153] PipelineRoot="minio://mlpipeline/v2/artifacts" from default config
    I1016 10:45:56.084219      17 config.go:166] Cannot detect minio-service in the same namespace, default to minio-service.kubeflow:9000 as MinIO endpoint. 
    I1016 10:45:56.226159      17 client.go:302] Pipeline Context: id:806  name:"retry-random-failures"  type_id:20 type:"system.Pipeline"  create_time_since_epoch:1729047175840  last_update_time_since_epoch:1729047175840
    

Expected result

Retry logic should be triggered when tasks fail, and the retries should be reflected in the flow diagram on the dashboard. The custom Minio endpoint (minio.kf-storage.svc.cluster.local:80) should be used instead of the default minio-service.kubeflow:9000.

Materials and Reference

  • Logs indicating the incorrect reference to the Minio service.
  • Custom pipeline code with retry logic.
  • ConfigMap for the custom Minio setup.

In addition, I would like to know the specific steps or configurations required to make Kubeflow Pipelines (KFP) refer to the custom Minio we have prepared.

Let me know if any further information is needed. Thank you for your help!

@pschoen-itsc

Have you checked out https://www.kubeflow.org/docs/components/pipelines/operator-guides/configure-object-store/ already? The configmap you changed is only used by Argo Workflows, not by KFP itself. You have to create or modify the kfp-launcher ConfigMap in the namespace the run executes in (your driver log falls back to the default config because no kfp-launcher ConfigMap was found in kubeflow-user-example-com) to point KFP at a custom S3 / Minio backend.
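
For reference, a minimal sketch of what such a kfp-launcher ConfigMap could look like for the custom Minio above. The provider key layout and the credentials secret name (mlpipeline-minio-artifact) are assumptions on my side, so please verify the exact field names against the docs page linked above for your KFP version:

    # Hypothetical kfp-launcher ConfigMap; the driver looks for it in the
    # namespace the run executes in (kubeflow-user-example-com in your log).
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: kfp-launcher
      namespace: kubeflow-user-example-com
    data:
      defaultPipelineRoot: "minio://mlpipeline/v2/artifacts"
      providers: |
        minio:
          default:
            endpoint: minio.kf-storage.svc.cluster.local:80
            disableSSL: true
            region: minio
            credentials:
              fromEnv: false
              secretRef:
                secretName: mlpipeline-minio-artifact  # assumed secret holding accesskey/secretkey
                accessKeyKey: accesskey
                secretKeyKey: secretkey

After creating or updating it, new runs in that namespace should pick up the custom endpoint instead of the fallback minio-service.kubeflow:9000 shown in the driver log.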

@ishisakok-nttd

@pschoen-itsc

Thank you for the response and the insights provided. After further investigation, we have confirmed that the retry functionality does not work as expected in this Kubeflow Pipelines setup, irrespective of the Minio configuration. Similar issues have already been reported, so I will follow up on those instead.
