CPU example for NAS RL cifar10 training container #999

Merged
merged 4 commits into from
Jan 8, 2020

andreyvelich
Member

@andreyvelich andreyvelich commented Jan 6, 2020

I made a few improvements in this PR:

  1. I added a Dockerfile with CPU support for the NAS RL training container and fixed a problem with parsing the current YAML example for the NAS job. In the future I will add the build of this training container to our CI.
  2. Also, I noticed that we currently don't send information about Trials to Suggestion (the ConvertTrials function returns an empty Trial set).
    Because of that, I added the ConvertTrials function to the Suggestion client.
  3. I fixed a few bugs in NAS Suggestion. The problem was that the class constructor ran on every getSuggestions call.

Please, take a look.
/cc @gaocegege @johnugeorge @hougangliu
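
The third point (a constructor running on every getSuggestions call) can be illustrated with a minimal sketch. It assumes the fix amounts to building the stateful object once and reusing it; `ControllerStub`, `NasSuggestionServiceSketch`, and `constructor_calls` are hypothetical names for illustration and are not taken from the Katib code.

```python
class ControllerStub:
    """Hypothetical stand-in for an expensive, stateful NAS controller."""

    def __init__(self, num_layers):
        self.num_layers = num_layers
        self.step = 0  # state that must survive between calls

    def next_architecture(self):
        self.step += 1
        return [[self.step + layer] for layer in range(self.num_layers)]


class NasSuggestionServiceSketch:
    """Hypothetical service that builds its controller only once."""

    def __init__(self):
        self._controller = None
        self.constructor_calls = 0  # counter kept only to make the effect visible

    def get_suggestions(self, num_layers):
        # Build the controller lazily on the first call and reuse it afterwards,
        # so its internal state is not reset by every request.
        if self._controller is None:
            self._controller = ControllerStub(num_layers)
            self.constructor_calls += 1
        return self._controller.next_architecture()
```

If the controller were rebuilt inside `get_suggestions` on each call, `step` would restart from zero every time and every request would return the same first architecture.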



@@ -67,7 +67,7 @@ spec:
   - "RunTrial.py"
   {{- with .HyperParameters}}
   {{- range .}}
-  - "--{{.Name}}={{.Value}}"
+  - "--{{.Name}}=\"{{.Value}}\""
Member

why is this required?

Member Author

Here https://github.com/andreyvelich/katib/blob/cpu-example-rl-cifar10/examples/v1alpha3/NAS-training-containers/RL-cifar10/RunTrial.py#L13 I parse architecture and nn_config as strings. The architecture has a format something like this: [[15], [86, 1], [7, 1, 0], [57, 1, 0, 0], [3, 0, 1, 1, 0], [42, 1, 1, 1, 1, 0], [17, 0, 0, 0, 1, 0, 0], [49, 0, 1, 1, 0, 0, 0, 1]], and we need to send it as a string. If we pass it without double quotes, parsing fails.
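
As a rough illustration of the parsing side, here is a minimal sketch of a command-line parser that accepts the architecture as a string, with or without literal surrounding quotes, and turns it into nested lists. It is not the code in RunTrial.py; `parse_architecture` and `build_parser` are hypothetical helpers, and using `ast.literal_eval` is an assumption made for this example.

```python
import argparse
import ast


def parse_architecture(raw):
    """Turn a string such as '"[[15], [86, 1]]"' into nested Python lists."""
    text = raw.strip()
    # Drop one pair of literal surrounding quotes, if the caller sent them.
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "\"'":
        text = text[1:-1]
    value = ast.literal_eval(text)  # safe evaluation of Python literals only
    if not isinstance(value, list) or not all(isinstance(row, list) for row in value):
        raise ValueError("architecture must be a list of lists")
    return value


def build_parser():
    parser = argparse.ArgumentParser()
    # The value is received as a plain string and decoded afterwards.
    parser.add_argument("--architecture", type=str, default="[]")
    return parser


if __name__ == "__main__":
    argv = ['--architecture="[[15], [86, 1], [7, 1, 0]]"']
    args = build_parser().parse_args(argv)
    print(parse_architecture(args.architecture))
```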

t []trialsv1alpha3.Trial) []*suggestionapi.Trial {
res := make([]*suggestionapi.Trial, 0)
return res
func (g *General) ConvertTrials(ts []trialsv1alpha3.Trial) []*suggestionapi.Trial {
Member

Strange. Without trial information being sent to suggestion algorithm services, how did it work for other algorithms?
@gaocegege

Member

We have a unit test with trial info. But the algorithm also works if the trial info is missing. I think that is why we did not find the problem.

Member

trials = Trial.convert(request.trials)

We are using trials from the request. How can an algorithm like bayesianoptimization work without the previous trial information?

Member

bayesianoptimization can still work because the prior knowledge is always nil, and skopt can handle that.

Member Author

It is strange that we are trying to convert it from request.trials. I ran into the problem that request.trials is nil, so there is nothing to convert.
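
The thread above explains why the missing trial data went unnoticed: an algorithm that tolerates an empty history still returns suggestions. A minimal sketch of that behavior follows. It is not skopt or the Katib service; `Observation`, `convert_trials`, and `suggest_lr` are hypothetical, and the "sample near the best point" rule is only a stand-in for a real optimizer.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Observation:
    params: Dict[str, float]
    objective: float


def convert_trials(raw_trials: Optional[List[dict]]) -> List[Observation]:
    """Convert request trials; a missing or empty list becomes an empty history."""
    if not raw_trials:
        return []
    return [
        Observation(params=dict(t["params"]), objective=float(t["objective"]))
        for t in raw_trials
    ]


def suggest_lr(history, low=0.001, high=0.1, rng=None):
    """Suggest a learning rate; without history this is plain random sampling."""
    rng = rng or random.Random(0)
    if not history:
        return rng.uniform(low, high)
    # With history, sample close to the best observed point (toy exploitation).
    best = max(history, key=lambda obs: obs.objective)
    width = 0.1 * (high - low)
    center = best.params["lr"]
    candidate = rng.uniform(center - width, center + width)
    return min(high, max(low, candidate))
```

With an empty history the function silently degrades to random search, so the experiment still finishes and nothing signals that the previous trials never arrived.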

@johnugeorge
Member

/retest

@johnugeorge
Member

Grid search has consistently failed. This has to be looked at.

@gaocegege
Member

2020/01/07 08:16:41 Waiting for Experiment grid-example to finish.
2020/01/07 08:16:41 Experiment grid-example's trials: 2 trials, 0 pending trials, 1 running trials, 0 killed trials, 1 succeeded trials, 0 failed trials.
2020/01/07 08:16:41 Optimal Trial for Experiment grid-example: {grid-example-4wkdd6ww [{--lr 0.002} {--num-layers 2} {--optimizer sgd}] {[{Validation-accuracy 0.966461}]}}
2020/01/07 08:16:41 Experiment grid-example's conditions: [{Created True ExperimentCreated Experiment is created 2020-01-07 08:11:01 +0000 UTC 2020-01-07 08:11:01 +0000 UTC} {Running True ExperimentRunning Experiment is running 2020-01-07 08:12:20 +0000 UTC 2020-01-07 08:12:20 +0000 UTC}]
2020/01/07 08:16:41 Suggestion grid-example's conditions: [{Created True SuggestionCreated Suggestion is created 2020-01-07 08:11:01 +0000 UTC 2020-01-07 08:11:01 +0000 UTC} {DeploymentReady True DeploymentReady Deployment is ready 2020-01-07 08:12:12 +0000 UTC 2020-01-07 08:12:12 +0000 UTC} {Running True SuggestionRunning Suggestion is running 2020-01-07 08:12:12 +0000 UTC 2020-01-07 08:12:12 +0000 UTC}]
2020/01/07 08:16:41 Suggestion grid-example's suggestions: [{[{--lr 0.001} {--num-layers 2} {--optimizer sgd}] grid-example-fm66xcss} {[{--lr 0.002} {--num-layers 2} {--optimizer sgd}] grid-example-4wkdd6ww}]
2020/01/07 08:17:01 Waiting for Experiment grid-example to finish.
2020/01/07 08:17:01 Experiment grid-example's trials: 2 trials, 0 pending trials, 1 running trials, 0 killed trials, 1 succeeded trials, 0 failed trials.
2020/01/07 08:17:01 Optimal Trial for Experiment grid-example: {grid-example-4wkdd6ww [{--lr 0.002} {--num-layers 2} {--optimizer sgd}] {[{Validation-accuracy 0.966461}]}}
2020/01/07 08:17:01 Experiment grid-example's conditions: [{Created True ExperimentCreated Experiment is created 2020-01-07 08:11:01 +0000 UTC 2020-01-07 08:11:01 +0000 UTC} {Running True ExperimentRunning Experiment is running 2020-01-07 08:12:20 +0000 UTC 2020-01-07 08:12:20 +0000 UTC}]
2020/01/07 08:17:01 Suggestion grid-example's conditions: [{Created True SuggestionCreated Suggestion is created 2020-01-07 08:11:01 +0000 UTC 2020-01-07 08:11:01 +0000 UTC} {DeploymentReady True DeploymentReady Deployment is ready 2020-01-07 08:12:12 +0000 UTC 2020-01-07 08:12:12 +0000 UTC} {Running True SuggestionRunning Suggestion is running 2020-01-07 08:12:12 +0000 UTC 2020-01-07 08:12:12 +0000 UTC}]
2020/01/07 08:17:01 Suggestion grid-example's suggestions: [{[{--lr 0.001} {--num-layers 2} {--optimizer sgd}] grid-example-fm66xcss} {[{--lr 0.002} {--num-layers 2} {--optimizer sgd}] grid-example-4wkdd6ww}]

The hyperparameters are created, but the trials are not finished.

@johnugeorge
Member

We should try with an increased timeout.
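
For context, the log above shows the test polling the experiment every 20 seconds. A minimal sketch of such a wait loop with a configurable timeout is below. It is not the Katib e2e test code; `wait_until` and its parameters are hypothetical, and the clock and sleep functions are injectable only so the loop can be exercised without real waiting.

```python
import time


def wait_until(is_done, timeout_s, interval_s, clock=time.monotonic, sleep=time.sleep):
    """Poll is_done() every interval_s seconds until it is true or the timeout passes.

    Returns True if the condition became true, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if is_done():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Raising `timeout_s` only helps when the experiment is slow; if a trial can never finish, the loop simply fails later.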

@andreyvelich
Member Author

/retest

@andreyvelich
Member Author

I think I found the problem. The docs at https://chocolate.readthedocs.io/api/sample.html#chocolate.Grid say that the search space to explore can contain only discrete dimensions, but our example https://github.com/kubeflow/katib/blob/master/examples/v1alpha3/grid-example.yaml#L33 also has a categorical type.
Here https://github.com/kubeflow/katib/blob/master/pkg/suggestion/v1alpha3/chocolate/base_chocolate_service.py#L46 we create the search space for the Grid suggestion with all parameters.
If we remove this categorical parameter from the search space for grid, will it be fine?
/cc @gaocegege
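
The proposal in the comment above, leaving categorical parameters out of the grid search space, could look roughly like the sketch below. This is an assumption about the shape of such a filter, not the change made in this PR and not the chocolate API; `split_search_space` and the dictionary layout of a parameter are hypothetical.

```python
def split_search_space(parameters):
    """Separate parameters a discrete-only grid can handle from categorical ones.

    Each parameter is assumed to be a dict with "name" and "type" keys.
    Returns (grid_space, skipped_names).
    """
    grid_space = {}
    skipped = []
    for param in parameters:
        if param["type"] == "categorical":
            skipped.append(param["name"])  # not passed to the grid sampler
        else:
            grid_space[param["name"]] = param
    return grid_space, skipped


if __name__ == "__main__":
    # Parameter names follow the log above; the types are assumptions.
    example = [
        {"name": "--lr", "type": "double"},
        {"name": "--num-layers", "type": "int"},
        {"name": "--optimizer", "type": "categorical"},
    ]
    space, skipped = split_search_space(example)
    print(sorted(space), skipped)
```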

@andreyvelich
Member Author

Hold for now
/hold

@andreyvelich
Member Author

Tests passed.
/cc @johnugeorge

/hold cancel

@gaocegege
Member

/lgtm

@gaocegege
Member

@andreyvelich Thanks for the fix. It is my fault.

I ran into this problem before, but forgot that we do not run the unit test for grid search. https://github.com/kubeflow/katib/blob/master/test/suggestion/v1alpha3/test_chocolate_service.py.failed

@johnugeorge
Member

@gaocegege But is it an acceptable restriction to not allow categorical parameters for the grid algorithm?

@gaocegege
Member

@johnugeorge It is not. I think we should investigate if we could solve it.

@johnugeorge
Member

Yeah, I think so. This restriction looks unacceptable to me, as this is one of the common algorithms.

Should we merge it anyway or wait?
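
One possible direction for lifting the restriction discussed above is to treat each categorical parameter as a discrete axis of indices and map the indices back to values when a point is emitted. The sketch below shows the idea with the standard library only. It is not how Katib or chocolate handles this; `grid_points` is hypothetical, and values such as `3` and `adam` are made up for the example.

```python
import itertools


def grid_points(discrete, categorical):
    """Enumerate the full grid over discrete and categorical parameters.

    discrete: dict of name -> list of numeric values
    categorical: dict of name -> list of choices
    Categorical choices are enumerated by index, so the grid itself only
    ever sees discrete axes.
    """
    names = list(discrete) + list(categorical)
    axes = [discrete[name] for name in discrete]
    axes += [range(len(categorical[name])) for name in categorical]
    for combo in itertools.product(*axes):
        point = {}
        for name, value in zip(names, combo):
            # Decode the index back into the original categorical choice.
            point[name] = categorical[name][value] if name in categorical else value
        yield point


if __name__ == "__main__":
    for point in grid_points(
        {"--lr": [0.001, 0.002], "--num-layers": [2, 3]},
        {"--optimizer": ["sgd", "adam"]},
    ):
        print(point)
```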

@johnugeorge
Member

/approve

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge


@k8s-ci-robot k8s-ci-robot merged commit 9c54601 into kubeflow:master Jan 8, 2020
@andreyvelich andreyvelich deleted the cpu-example-rl-cifar10 branch October 6, 2021 00:29