e2e-wine test failed with kfp run in error state #38

Closed
NohaIhab opened this issue Oct 2, 2023 · 8 comments
Labels
bug Something isn't working

Comments

@NohaIhab
Contributor

NohaIhab commented Oct 2, 2023

channel: 1.7/edge

e2e-wine notebook test fails with AssertionError: KFP run in Error state.
The Preprocess step in the pipeline fails with the logs:

This step is in Error state with this message: Error (exit code 1):
    tar (child): gzip: Cannot exec: No such file or directory
    tar (child): Error is not recoverable: exiting now
    tar: Child returned status 2
    tar: Error is not recoverable: exiting now

NohaIhab added the bug label Oct 2, 2023
@DnPlas
Contributor

DnPlas commented Oct 4, 2023

The error message suggests that either tar or gzip (probably the latter) does not exist in the container where you are running this. Do you know exactly which part of the notebook is actually returning this?
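
One way to confirm which binary is missing is to check the PATH from inside the affected container. The following is a minimal sketch, assuming a Python interpreter is available in that image (which may not be the case for a slim executor image); the helper name `missing_tools` is hypothetical and not part of any tool mentioned in this thread.

```python
import shutil


def missing_tools(tools):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    # tar delegates compression to an external gzip binary when asked to
    # handle .tar.gz archives, so both need to be present.
    missing = missing_tools(["tar", "gzip"])
    if missing:
        print("missing from PATH:", ", ".join(missing))
    else:
        print("tar and gzip are both available")
```

If the interpreter is not available in the image, the same check can be done with whatever shell the container provides.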

@NohaIhab
Contributor Author

@DnPlas as mentioned in the issue, it's the preprocess step, probably when passing the data from the step before (download)

@gustavosr98

gustavosr98 commented Oct 11, 2023

@DnPlas this happens on any step that is passing the output of one step as the input for the next.

Versions

In my case I have the following versions

Microk8s: 1.25/stable
Charmed Kubeflow: 1.7/stable
Charmed MLflow: 2.1/stable
resource-dispatcher: 1.0/edge

Juju applications -> https://pastebin.ubuntu.com/p/MWRPjpkP3P/

Scenarios

Test A - Fails on "preprocess_task" step

rendered yaml file -> https://pastebin.ubuntu.com/p/XkWrDvbgGX/

@dsl.pipeline(
    name="e2e_wine_pipeline",
    description="WINE pipeline",
)
def wine_pipeline(url):
    web_downloader_task = web_downloader_op(url=url)
    preprocess_task = preprocess_op(file=web_downloader_task.outputs['data'])
    
    train_task = (training_op(file=preprocess_task.outputs['output'])
                 .add_env_variable(V1EnvVar(name='MLFLOW_TRACKING_URI', value='http://mlflow-server.kubeflow.svc.cluster.local:5000'))
                 .add_env_variable(V1EnvVar(name='MLFLOW_S3_ENDPOINT_URL', value='http://minio.kubeflow.svc.cluster.local:9000'))
                 .add_env_variable(V1EnvVar(name='PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION', value='python'))
                 # https://kubeflow-pipelines.readthedocs.io/en/stable/source/kfp.extensions.html#kfp.onprem.use_k8s_secret
                 .apply(use_k8s_secret(secret_name='mlpipeline-minio-artifact', k8s_secret_key_to_env={
                     'accesskey': 'AWS_ACCESS_KEY_ID',
                     'secretkey': 'AWS_SECRET_ACCESS_KEY',
                 })))
    deploy_task = deploy_op(model_uri=train_task.output)

Test B - Fails on "train_task" step

@dsl.pipeline(
    name="e2e_wine_pipeline",
    description="WINE pipeline",
)
def wine_pipeline(url):
    #web_downloader_task = web_downloader_op(url=url)
    preprocess_task = preprocess_op(file=url)
    
    train_task = (training_op(file=preprocess_task.outputs['output'])
                 .add_env_variable(V1EnvVar(name='MLFLOW_TRACKING_URI', value='http://mlflow-server.kubeflow.svc.cluster.local:5000'))
                 .add_env_variable(V1EnvVar(name='MLFLOW_S3_ENDPOINT_URL', value='http://minio.kubeflow.svc.cluster.local:9000'))
                 .add_env_variable(V1EnvVar(name='PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION', value='python'))
                 # https://kubeflow-pipelines.readthedocs.io/en/stable/source/kfp.extensions.html#kfp.onprem.use_k8s_secret
                 .apply(use_k8s_secret(secret_name='mlpipeline-minio-artifact', k8s_secret_key_to_env={
                     'accesskey': 'AWS_ACCESS_KEY_ID',
                     'secretkey': 'AWS_SECRET_ACCESS_KEY',
                 })))
    deploy_task = deploy_op(model_uri=train_task.output)

In my case minio seems okay, so I am guessing the bug is around Argo.
From what I can see in the logs, the failure happens even before the steps actually run.

Logs

k get all -n admin -> https://pastebin.ubuntu.com/p/WRXzxPCF3j/
pod logs -> https://pastebin.ubuntu.com/p/8YwWJq7vk5/

@NohaIhab
Contributor Author

I see that the error is in the initContainer of the run pod:

  initContainerStatuses:
    - name: init
      state:
        terminated:
          exitCode: 1
          reason: Error
          message: |-
            tar (child): gzip: Cannot exec: No such file or directory
            tar (child): Error is not recoverable: exiting now
            tar: Child returned status 2
            tar: Error is not recoverable: exiting now
          startedAt: '2023-10-11T12:40:55Z'
          finishedAt: '2023-10-11T12:40:55Z'
          containerID: >-
            containerd://f9499abc54b8bed7d51a5ea6b70dbf70d2da0fdfd5d19deec814aa6e445a596b
      lastState: {}
      ready: false
      restartCount: 0
      image: 'docker.io/charmedkubeflow/argoexec:v3.3.9_22.04_1'
      imageID: >-
        docker.io/charmedkubeflow/argoexec@sha256:3a869b98ca71e0927ee293ed78b266b4227bfa2153a45b7e9803ae6b0e39a0d8
      containerID: >-
        containerd://f9499abc54b8bed7d51a5ea6b70dbf70d2da0fdfd5d19deec814aa6e445a596b
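
For anyone triaging similar runs, the status above can be scanned programmatically. This is a minimal sketch, assuming the pod status has already been loaded into a plain Python dict (for example from the JSON or YAML output of the pod description); the function name `failed_init_containers` and the returned keys are hypothetical, chosen only for illustration.

```python
def failed_init_containers(pod_status):
    """Return details of init containers that terminated with a non-zero exit code.

    `pod_status` is the `status` section of a pod, as a plain dict.
    """
    failures = []
    for status in pod_status.get("initContainerStatuses") or []:
        terminated = (status.get("state") or {}).get("terminated")
        if terminated and terminated.get("exitCode", 0) != 0:
            failures.append({
                "name": status.get("name"),
                "image": status.get("image"),
                "exit_code": terminated["exitCode"],
                "message": terminated.get("message", ""),
            })
    return failures


# Trimmed copy of the status shown above, used as sample input.
example_status = {
    "initContainerStatuses": [{
        "name": "init",
        "image": "docker.io/charmedkubeflow/argoexec:v3.3.9_22.04_1",
        "state": {"terminated": {
            "exitCode": 1,
            "reason": "Error",
            "message": "tar (child): gzip: Cannot exec: No such file or directory",
        }},
    }]
}

for failure in failed_init_containers(example_status):
    print(failure["name"], failure["image"], failure["exit_code"])
    print(failure["message"])
```

Reporting the image alongside the message makes it easier to see that the failure comes from the executor image rather than from the step's own container.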

The argo-controller charm sets the image of this initContainer, i.e. the executor image. I tried replacing the argoexec image with the upstream one and the run succeeded. I replaced the image using the config:

juju config argo-controller executor-image=argoproj/argoexec:v3.3.9

@gustavosr98 you can use this ^ as a temporary workaround.
My initial thought is that the gzip package is missing from the rock. If that is the case, we need to update the rock and republish it.

@gustavosr98

gustavosr98 commented Oct 11, 2023

Thanks @NohaIhab!

Please keep me updated on any other bug report that would track this or the final patch on the OCI image we would need to provide for the customer

@NohaIhab
Contributor Author

hi @gustavosr98

We've patched the rock and re-published the charm with the new rock to 3.3/stable, so all you need to do is refresh the charm.
The new published rock is charmedkubeflow/argoexec:v3.3.9_22.04_2.

@gustavosr98

Awesome, thanks @NohaIhab !


Btw, we should definitely add a big note to the README of the canonical/kubeflow-examples repo saying that it is no longer maintained, with a pointer to this repo.

I spent quite some time trying to make the e2e-wine sample work on the older repo, while here it worked perfectly on the first run.


Big thanks for this repo 🚀 !
It is useful for testing that Kubeflow is doing what it is supposed to do, as well as for demos with potential customers.

DnPlas added a commit to canonical/argo-operators that referenced this issue Oct 18, 2023
This commit defaults the executor image to argoproj/argoexec:v3.3.10 to bump
the version and avoid canonical/charmed-kubeflow-uats#38.
DnPlas added a commit to canonical/argo-operators that referenced this issue Oct 19, 2023
This commit defaults the executor image to argoproj/argoexec:v3.3.10 to bump
the version and avoid canonical/charmed-kubeflow-uats#38.
@kimwnasptd
Contributor

@gustavosr98 good point! We can follow up on that. Yes, the goal of this repo is to have a place where we, the CKF team, define our tests in a way that you can also use them, since they are notebooks.

I'll go ahead and close this issue for now, since updating the Argo Exec image solved the problem. Afterwards we'll work on converting that image back to a ROCK.
