Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KatibClient().list_experiments/list_trials fails with RuntimeError #2110

Closed
votti opened this issue Feb 10, 2023 · 13 comments · Fixed by #2148
Closed

KatibClient().list_experiments/list_trials fails with RuntimeError #2110

votti opened this issue Feb 10, 2023 · 13 comments · Fixed by #2148
Labels

Comments

@votti
Copy link

votti commented Feb 10, 2023

/kind bug

What steps did you take and what happened:
I tried to list the experiments/trials using the Katib Client:

from kubeflow.katib import KatibClient

client=KatibClient()

my_namespace="my-namespace"
client.list_experiments(namespace=my_namespace)
client.list_trials(namespace=my_namespace)

raises the error

~/miniconda3/envs/katib-exp/lib/python3.11/site-packages/kubeflow/katib/api_client.py in __deserialize_model(self, data, klass)
    657                     value = data[klass.attribute_map[attr]]
--> 658                     kwargs[attr] = self.__deserialize(value, attr_type)
...
--> 285             raise RuntimeError(
    286                 "There was a problem to get experiments in namespace {0}. Exception: \
    287           {1} ".format(namespace, e))

RuntimeError: There was a problem to get experiments in namespace my-namespace. Exception:           module 'kubeflow.katib.models' has no attribute 'V1ResourceRequirements' 

What did you expect to happen:
No error.

Anything else you would like to add:
I was able to temporarily fix this by patching the kubeflow.katib.models namespace:

import kubeflow.katib.models as models
from kubernetes import client as kclient
models.V1ResourceRequirements = kclient.V1ResourceRequirements

Environment:

  • Katib version (check the Katib controller image version): 
     python sdk: katib==0.14.0
     docker.io/kubeflowkatib/katib-controller:v0.14.0"
    
  • Kubernetes version: (kubectl version):
    
  • Client Version: v1.26.0
    
  • Kustomize Version: v4.5.7
    
  • Server Version: v1.23.14-gke.1800
    
  • OS (uname -a): Linux LAPTOP-SS9UALAT 4.19.128-microsoft-standard SMP Tue Jun 23 12:58:10 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
    

Impacted by this bug? Give it a 👍 We prioritize the issues with the most 👍

@votti votti changed the title KatibClient().list_experiments/list_trials fails with KatibClient().list_experiments/list_trials fails with RuntimeError Feb 10, 2023
@johnugeorge
Copy link
Member

/cc @andreyvelich

votti added a commit to d-one/katib that referenced this issue Feb 10, 2023
It seems that if there are kubernetes.client classes used
in any experiment/trial the deserialization fails.

This rather quick fix tries to get classes from `kubernetes.client`
as well for deserialization.

Addresses: kubeflow#2110
@votti
Copy link
Author

votti commented Feb 10, 2023

I guess the issue is that I have a V1ResourceRequirements in one of my experiments/trials.
This causes the deserialization from str to class to fail.

Not sure if this is common enough to need to be supported.
This fix would work: 815b34d

@andreyvelich
Copy link
Member

@votti Please can you show your Experiment ? How it can contain V1ResourceRequirements ?

@andreyvelich
Copy link
Member

We can try to import all Kubernetes modules to Katib SDK, similar to Training Operator: https://github.com/kubeflow/training-operator/blob/master/sdk/python/kubeflow/training/models/__init__.py#L16-L17.
But I thought it is not necessary, since Katib object doesn't have such Kubernetes structure.

@tenzen-y
Copy link
Member

I'm not sure if we must import all K8s modules to Katib SDK since our CI works well.

@votti
Copy link
Author

votti commented Feb 10, 2023

Maybe what would be an improvement is if there would be a warning if some experiments/trials cannot be loaded (for whatever reason) and still returning the experiments that are deserializable?

Having an invalid/strange experiment/trial in a namespace breaking the list functionality completely is a bit unfortunate.

@votti
Copy link
Author

votti commented Feb 10, 2023

I think this is an offending Experiment:
experiment_offending.txt

I suspect this is because I am using an Argo workflow as Trial spec.

@tenzen-y
Copy link
Member

I think this is an offending Experiment: experiment_offending.txt

I suspect this is because I am using an Argo workflow as Trial spec.

I see. It might be better to add E2E for Argo workflow integration.

@andreyvelich
Copy link
Member

I am using an Argo workflow as Trial spec.

No, Trial spec is just an object: https://github.com/kubeflow/katib/blob/master/sdk/python/v1beta1/kubeflow/katib/models/v1beta1_trial_template.py#L43

It fails because of Custom Collector has Kubernetes Container APIs.

@andreyvelich
Copy link
Member

Maybe what would be an improvement is if there would be a warning if some experiments/trials cannot be loaded (for whatever reason) and still returning the experiments that are deserializable?

I think, once we import all Kubernetes modules we can always deserialise Experiment. WDYT @tenzen-y @votti ?

@tenzen-y
Copy link
Member

I think, once we import all Kubernetes modules we can always deserialise Experiment. WDYT @tenzen-y @votti ?

sgtm.

@votti
Copy link
Author

votti commented Feb 17, 2023

Maybe what would be an improvement is if there would be a warning if some experiments/trials cannot be loaded (for whatever reason) and still returning the experiments that are deserializable?

I think, once we import all Kubernetes modules we can always deserialise Experiment. WDYT @tenzen-y @votti ?

Also a good option.
I find it maybe a bit clearer if the modules would be kept in a different namespace and during deseriation the katib as well as the k8s namespace would be checked (like in 815b34d). Then it is clearer what the katib specific vs k8s modules are.

But in the end both seems fine.

@andreyvelich
Copy link
Member

@votti I guess, the problem is that api_client.py file is created by OpenAPI generator automatically. .
I think, as a temporary solution we should modify Katib Post SDK Gen script to import all Kubernetes modules, similar as Training Operator.

We can track better solution in this issue: kubeflow/training-operator#1723.

WDYT ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants