[SDK] Use Katib SDK for E2E Tests #2075
Conversation
I've changed the tune, train, and CMA-ES SDK examples.
@andreyvelich Thanks for improving our E2E. Also, it would be good to modify the E2E test to verify Python SDK operations across various Python versions, as discussed in #2057 (comment). We can follow up on that in another PR, though.
max_trial_count = experiment.spec.max_trial_count + 1
parallel_trial_count = experiment.spec.parallel_trial_count + 1
print(
    f"Restarting Experiment {exp_namespace}/{exp_name} "
Why is this "Restarting Experiment" for random?
@johnugeorge We also test the random search Experiment to test a LongRunning Experiment.
verify_experiment_results(katib_client, experiment, exp_name, exp_namespace)

# Describe the Experiment and Suggestion.
print(os.popen(f"kubectl describe experiment {exp_name} -n {exp_namespace}").read())
Can we add verbose as an optional parameter that can be enabled/disabled to show verbose logs (enabled by default)? This will help in tests like conformance tests, where we are only interested in the final experiment success/failure.
Sure, let me enable it.
@johnugeorge I enabled logging for the Katib E2Es, what do you think?
    )
)
try:
    response = utils.FakeResponse(thread.get(constants.APISERVER_TIMEOUT))
I think it might help to make TIMEOUT configurable. So, can we add the duration until timeout to the function args? WDYT?

def get_experiment(
    self,
    name: str,
    namespace: str = utils.get_default_target_namespace(),
    timeout: int = constants.APISERVER_TIMEOUT,
):
Sure, let me update it.
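A configurable timeout along these lines could be sketched as below. This is a minimal stand-in, not the real KatibClient: the threaded API call is faked with a lambda, and `DEFAULT_TIMEOUT` is a hypothetical default standing in for `constants.APISERVER_TIMEOUT`.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical default; the real SDK would use constants.APISERVER_TIMEOUT.
DEFAULT_TIMEOUT = 120

def get_experiment(name, namespace="default", timeout=DEFAULT_TIMEOUT):
    """Fetch an object, waiting at most `timeout` seconds for the response."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Stand-in for the threaded custom-object call in the real client;
        # future.result(timeout=...) raises TimeoutError if the call is slow.
        future = pool.submit(
            lambda: {"metadata": {"name": name, "namespace": namespace}}
        )
        return future.result(timeout=timeout)
```

Callers can then shorten or lengthen the wait per call, e.g. `get_experiment("random", "kubeflow", timeout=5)`.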
"""
try:
    self.custom_api.delete_namespaced_custom_object(
        constants.KUBEFLOW_GROUP,
        constants.KATIB_VERSION,
        namespace,
        constants.EXPERIMENT_PLURAL,
        name,
        body=client.V1DeleteOptions(),
Can we make client.V1DeleteOptions() configurable? It helps users to set grace_period_seconds and more. WDYT?
Yes, I think we can allow it.
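One way to expose the delete options could look like the sketch below. It is a simplified stand-in: a plain dict takes the place of `kubernetes.client.V1DeleteOptions`, and the actual API call is replaced by a returned summary so the shape of the interface is visible.

```python
def delete_experiment(name, namespace="default", delete_options=None):
    """Delete an Experiment, letting callers override the delete options
    (e.g. grace_period_seconds) instead of always using the defaults."""
    # A dict stands in here for kubernetes.client.V1DeleteOptions().
    body = delete_options if delete_options is not None else {}
    # The real client would call:
    # self.custom_api.delete_namespaced_custom_object(..., body=body)
    return {"name": name, "namespace": namespace, "body": body}
```

With this signature, `delete_experiment("tune", delete_options={"grace_period_seconds": 30})` passes a custom grace period while existing callers keep the default behavior.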
args = parser.parse_args()

if args.verbose == "False":
    logging.getLogger().setLevel(logging.WARNING)
Instead, set the default to DEBUG and change to INFO when verbose is False? Also, set the start/end logs to INFO and all remaining logs to DEBUG. Then, if verbose is not set, the user will just see the start logs and the final experiment status.
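That level scheme could be wired up roughly as follows; `configure_logging` is a hypothetical helper name, not a function from this PR.

```python
import logging

def configure_logging(verbose: bool) -> int:
    """With verbose on, show every step (DEBUG); with verbose off, show only
    INFO-level start/end messages and the final Experiment status."""
    level = logging.DEBUG if verbose else logging.INFO
    logging.getLogger().setLevel(level)
    return level
```

Step-by-step logs would then be emitted with `logging.debug(...)` and the start/end summary with `logging.info(...)`, so the two modes differ only in the root logger level.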
try:
    run_e2e_experiment(katib_client, experiment, exp_name, exp_namespace)
    logging.info("---------------------------------------------------------------")
    logging.info(f"E2E is completed for Experiment: {exp_namespace}/{exp_name}")
For more clarity, keep "Succeeded" instead of "completed".
# Describe the Experiment and Suggestion.
logging.info(
    os.popen(f"kubectl describe experiment {exp_name} -n {exp_namespace}").read()
)
logging.info("---------------------------------------------------------------")
logging.info("---------------------------------------------------------------")
logging.info(
    os.popen(f"kubectl describe suggestion {exp_name} -n {exp_namespace}").read()
)
Is there an intention to use the kubectl command rather than the Python SDK? It would be good to use the Python SDK.
@tenzen-y I guess we used kubectl describe for better visibility. Let me just print the Experiment and Suggestion:

logging.debug(katib_client.get_experiment(exp_name, exp_namespace))
logging.debug(katib_client.get_suggestion(exp_name, exp_namespace))
try:
    run_e2e_experiment(katib_client, experiment, exp_name, exp_namespace)
    logging.info("---------------------------------------------------------------")
    logging.info(f"E2E is completed for Experiment: {exp_namespace}/{exp_name}")
    logging.info("---------------------------------------------------------------")
    logging.info("---------------------------------------------------------------")
    # Delete the Experiment.
    katib_client.delete_experiment(exp_name, exp_namespace)
except Exception as e:
    logging.info("---------------------------------------------------------------")
    logging.info(f"E2E is failed for Experiment: {exp_namespace}/{exp_name}")
    logging.info("---------------------------------------------------------------")
    logging.info("---------------------------------------------------------------")
    # Delete the Experiment and raise an Exception.
    katib_client.delete_experiment(exp_name, exp_namespace)
    raise e
Since we need to delete the experiment regardless of pass/fail, I think using finally would be nice.
Suggested change:

logging.info("---------------------------------------------------------------")
try:
    run_e2e_experiment(katib_client, experiment, exp_name, exp_namespace)
    logging.info(f"E2E is completed for Experiment: {exp_namespace}/{exp_name}")
except Exception as e:
    logging.info(f"E2E is failed for Experiment: {exp_namespace}/{exp_name}")
    raise e
finally:
    logging.info("---------------------------------------------------------------")
    logging.info("---------------------------------------------------------------")
    # Delete the Experiment.
    katib_client.delete_experiment(exp_name, exp_namespace)
Sure! Good suggestion.
parser.add_argument(
    "--verbose",
    type=str,
    default=True,
    choices=("True", "False"),
How about using action="store_true" rather than type=str and choices=("True", "False") for the boolean type? Then we could change line 238 from if args.verbose == "False": to if not args.verbose:
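With `action="store_true"`, argparse sets the attribute to `False` when the flag is absent and `True` when it is present, so no string comparison is needed. A minimal sketch:

```python
import argparse

parser = argparse.ArgumentParser()
# The flag defaults to False; passing --verbose flips it to True.
parser.add_argument("--verbose", action="store_true")

# No flag on the command line: verbose is False.
off = parser.parse_args([])
# Flag present: verbose is True.
on = parser.parse_args(["--verbose"])
```

Note that this inverts the script's current default (verbose enabled by default); keeping that default would need something like a `--quiet` flag or `argparse.BooleanOptionalAction` instead.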
PVCs = client.CoreV1Api().list_namespaced_persistent_volume_claim(
    exp_namespace
)
is_deleted = 1
for i in PVCs.items:
    if i.metadata.name == resource_name:
        is_deleted = 0
if is_deleted == 1:
    raise Exception(
        "PVC is deleted for FromVolume resume policy. "
        f"Alive PVCs: {[i.metadata.name for i in PVCs.items]}."
    )
How about changing these lines to use read_namespaced_persistent_volume_claim(resource_name, exp_namespace) and checking whether this method raises a 404 Not Found exception, and then raising our Exception? Otherwise, I think I'd have to read this source code more carefully to understand what it wants to test.
Sure, I thought we only had the list API 😄
I am not sure why the Kubernetes Python client named the API get for Custom Resources and read for core Kubernetes resources.
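The read-and-check-404 pattern could look like the sketch below. It is self-contained for illustration: the local `ApiException` class mimics `kubernetes.client.rest.ApiException` (which carries a `.status` attribute), and the PVC reader is passed in as a callable so the check is testable without a cluster.

```python
# Stand-in for kubernetes.client.rest.ApiException, which exposes .status.
class ApiException(Exception):
    def __init__(self, status):
        super().__init__(f"API error: {status}")
        self.status = status

def verify_pvc_exists(read_pvc, resource_name, exp_namespace):
    """Raise if the PVC backing a FromVolume resume policy no longer exists.

    `read_pvc` plays the role of
    CoreV1Api().read_namespaced_persistent_volume_claim.
    """
    try:
        read_pvc(resource_name, exp_namespace)
    except ApiException as e:
        if e.status == 404:
            raise Exception(
                "PVC is deleted for FromVolume resume policy: "
                f"{exp_namespace}/{resource_name}."
            ) from e
        # Any other API error is unexpected; surface it unchanged.
        raise
```

Compared with listing every PVC in the namespace, this reads only the one object the test cares about and makes the intent (the PVC must survive) explicit.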
Thanks for your review @tenzen-y @anencore94 @johnugeorge @terrytangyuan.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: andreyvelich, terrytangyuan.
@andreyvelich Looks great! Thanks for your tremendous effort!
/lgtm
LGTM Thanks @andreyvelich
Thanks everyone for the review!
Related: #2024, #2044
I used the Katib SDK to run the E2E test.
Also, I made the following changes in the SDK:
I used a unified style for our Python APIs and Exceptions, so users can easily understand them. We need to create a script to automatically generate docs from our Katib Client. Currently, this doc is outdated: https://github.com/kubeflow/katib/blob/master/sdk/python/v1beta1/docs/KatibClient.md.
Are there any existing tools that we can use?
I introduced the following changes in APIs:
- get_experiment_status -> get_experiment_conditions to return Experiment conditions.
- is_experiment_created: new API to check Experiment status.
- is_experiment_running: new API to check Experiment status.
- is_experiment_restarting: new API to check Experiment status.
- is_experiment_succeeded: new API to check Experiment status.
- is_experiment_failed: new API to check Experiment status.
- wait_for_experiment_condition: new API to wait until an Experiment reaches a condition (similar to the Training Operator).
- edit_experiment_budget: new API to modify the Experiment budget in-place.
- get_suggestion: split between get_suggestion and list_suggestions, similar to list_experiments.
- get_trial: gets the Trial CR object.
- get_optimal_hyperparameters: returns the V1beta1OptimalTrial object.

Please let me know what you think about the API style; can we improve it further?
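The wait-for-condition API can be pictured as a simple polling loop. The sketch below is a hypothetical stand-in, not the PR's implementation: the Katib client call is abstracted behind a `get_conditions` callable, and the parameter names are assumptions.

```python
import time

def wait_for_experiment_condition(get_conditions, expected="Succeeded",
                                  timeout=600, polling_interval=1):
    """Poll the Experiment's conditions until `expected` has status True,
    or raise TimeoutError once `timeout` seconds have elapsed.

    `get_conditions` stands in for a call like
    katib_client.get_experiment_conditions(name, namespace) and is expected
    to return a list of {"type": ..., "status": ...} dicts.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        conditions = get_conditions()
        if any(c["type"] == expected and c["status"] == "True"
               for c in conditions):
            return conditions
        time.sleep(polling_interval)
    raise TimeoutError(f"Experiment did not reach condition: {expected}")
```

An E2E test would then reduce to creating the Experiment, calling this wait helper, and asserting on the returned conditions instead of hand-rolling the polling loop in each test.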
It would be great if you could start reviewing this.
Also, if you think that change is too big for upcoming release, we can postpone it.
TODO: I need to update examples that use Katib SDK. Will do it during the week.
/hold
cc @gaocegege @johnugeorge @anencore94 @tenzen-y @kubeflow/wg-training-leads