CPU example for NAS RL cifar10 training container #999

andreyvelich · 2020-01-06T20:11:45Z

I made a few improvements in this PR:

I added a Dockerfile with CPU support for the NAS RL training container and fixed a problem with parsing the current YAML example for the NAS job. In the future I will add the build of this training container to our CI.
Also, I noticed that we currently don't send information about Trials to Suggestion (the ConvertTrials function returns an empty Trial set). Because of that, I added the ConvertTrials function to the Suggestion client.
I fixed a few bugs in NAS Suggestion. The problem was that the class constructor ran on every getSuggestions call.
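To illustrate the constructor problem described above, here is a minimal sketch, not the actual Katib service code. The class names `StatefulSuggester`, `BuggyService`, and `FixedService` are hypothetical; the point is only that an algorithm object holding state must be created once and reused across calls.

```python
class StatefulSuggester:
    """Hypothetical stand-in for a suggestion algorithm that keeps state."""

    def __init__(self):
        self.step = 0  # state that must survive between calls

    def suggest(self):
        self.step += 1
        return self.step


class BuggyService:
    def get_suggestions(self):
        # Bug pattern: a fresh object is built on every call,
        # so the accumulated state is thrown away each time.
        return StatefulSuggester().suggest()


class FixedService:
    def __init__(self):
        self.suggester = None

    def get_suggestions(self):
        # Build the algorithm object once, then reuse it.
        if self.suggester is None:
            self.suggester = StatefulSuggester()
        return self.suggester.suggest()


buggy = BuggyService()
fixed = FixedService()
buggy_steps = [buggy.get_suggestions() for _ in range(3)]
fixed_steps = [fixed.get_suggestions() for _ in range(3)]
print(buggy_steps)  # [1, 1, 1]
print(fixed_steps)  # [1, 2, 3]
```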

Please, take a look.
/cc @gaocegege @johnugeorge @hougangliu


Send trial info from CRD to GRPC

johnugeorge · 2020-01-07T06:05:07Z

examples/v1alpha3/nasjob-example-RL.yaml

@@ -67,7 +67,7 @@ spec:
                  - "RunTrial.py"
                  {{- with .HyperParameters}}
                  {{- range .}}
-                  - "--{{.Name}}={{.Value}}"
+                  - "--{{.Name}}=\"{{.Value}}\""


why is this required?

Here https://github.com/andreyvelich/katib/blob/cpu-example-rl-cifar10/examples/v1alpha3/NAS-training-containers/RL-cifar10/RunTrial.py#L13 I parse architecture and nn_config as strings. The format of architecture is something like this: [[15], [86, 1], [7, 1, 0], [57, 1, 0, 0], [3, 0, 1, 1, 0], [42, 1, 1, 1, 1, 0], [17, 0, 0, 0, 1, 0, 0], [49, 0, 1, 1, 0, 0, 0, 1]], and we need to send it as a string. If we pass it without double quotes, parsing fails.
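As a rough sketch of what parsing such a quoted argument can look like (this is not the code in RunTrial.py; the helper `parse_list_arg` is hypothetical, and using `json.loads` is an assumption), the value can be read as a string, stripped of one pair of surrounding double quotes, and then decoded into nested lists:

```python
import argparse
import json


def parse_list_arg(raw):
    """Decode a nested-list argument that may arrive wrapped in double quotes."""
    text = raw.strip()
    # The template wraps the value in literal double quotes; drop one pair.
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        text = text[1:-1]
    return json.loads(text)


parser = argparse.ArgumentParser()
parser.add_argument("--architecture", type=str, default="[]")

# Simulates the single container argument produced by the quoted template.
args = parser.parse_args(['--architecture="[[15], [86, 1], [7, 1, 0]]"'])
architecture = parse_list_arg(args.architecture)
print(architecture)  # [[15], [86, 1], [7, 1, 0]]
```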

johnugeorge · 2020-01-07T06:12:39Z

pkg/controller.v1alpha3/suggestion/suggestionclient/suggestionclient.go

-	t []trialsv1alpha3.Trial) []*suggestionapi.Trial {
-	res := make([]*suggestionapi.Trial, 0)
-	return res
+func (g *General) ConvertTrials(ts []trialsv1alpha3.Trial) []*suggestionapi.Trial {


Strange. Without trial information being sent to suggestion algorithm services, how did it work for other algorithms?
@gaocegege

We have a unit test with trial info. But the algorithm also works if the trial info is missing. I think that is why we did not find the problem.

katib/pkg/suggestion/v1alpha3/skopt_service.py

Line 30 in ebb48f8

trials = Trial.convert(request.trials)

We are using trials from the request. How can bayesianoptimization like algo work without the previous trial information?

bayesianoptimization can work since the prior knowledge is always nil. skopt can handle it.

It is strange that we try to convert from request.trials. I ran into the problem that request.trials is nil, so there is nothing to convert.
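The thread above explains why the missing trial data went unnoticed: an algorithm that tolerates an empty history still returns suggestions. The sketch below is a toy illustration of that behaviour, not Bayesian optimization and not skopt. The `Observation` type and `suggest_lr` function are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Observation:
    lr: float
    accuracy: float


def suggest_lr(history, low=0.001, high=0.1, rng=None):
    """Toy suggester: uses past observations when present, else samples blindly."""
    rng = rng or random.Random(0)
    if not history:
        # No prior trials were sent: still produces a (uninformed) suggestion.
        return rng.uniform(low, high)
    # With history: perturb the best learning rate seen so far, within bounds.
    best = max(history, key=lambda o: o.accuracy)
    candidate = best.lr * rng.uniform(0.8, 1.2)
    return min(high, max(low, candidate))


cold = suggest_lr([])
warm = suggest_lr([Observation(0.01, 0.90), Observation(0.05, 0.97)])
print(cold, warm)
```

Both calls succeed, so a test that only checks that suggestions come back cannot tell whether the history was ever delivered.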

johnugeorge · 2020-01-07T06:13:05Z

/retest

johnugeorge · 2020-01-07T07:52:58Z

/retest

johnugeorge · 2020-01-07T09:07:45Z

Grid search has failed consistently. This has to be looked at.

gaocegege · 2020-01-07T09:24:49Z

2020/01/07 08:16:41 Waiting for Experiment grid-example to finish.
2020/01/07 08:16:41 Experiment grid-example's trials: 2 trials, 0 pending trials, 1 running trials, 0 killed trials, 1 succeeded trials, 0 failed trials.
2020/01/07 08:16:41 Optimal Trial for Experiment grid-example: {grid-example-4wkdd6ww [{--lr 0.002} {--num-layers 2} {--optimizer sgd}] {[{Validation-accuracy 0.966461}]}}
2020/01/07 08:16:41 Experiment grid-example's conditions: [{Created True ExperimentCreated Experiment is created 2020-01-07 08:11:01 +0000 UTC 2020-01-07 08:11:01 +0000 UTC} {Running True ExperimentRunning Experiment is running 2020-01-07 08:12:20 +0000 UTC 2020-01-07 08:12:20 +0000 UTC}]
2020/01/07 08:16:41 Suggestion grid-example's conditions: [{Created True SuggestionCreated Suggestion is created 2020-01-07 08:11:01 +0000 UTC 2020-01-07 08:11:01 +0000 UTC} {DeploymentReady True DeploymentReady Deployment is ready 2020-01-07 08:12:12 +0000 UTC 2020-01-07 08:12:12 +0000 UTC} {Running True SuggestionRunning Suggestion is running 2020-01-07 08:12:12 +0000 UTC 2020-01-07 08:12:12 +0000 UTC}]
2020/01/07 08:16:41 Suggestion grid-example's suggestions: [{[{--lr 0.001} {--num-layers 2} {--optimizer sgd}] grid-example-fm66xcss} {[{--lr 0.002} {--num-layers 2} {--optimizer sgd}] grid-example-4wkdd6ww}]
2020/01/07 08:17:01 Waiting for Experiment grid-example to finish.
2020/01/07 08:17:01 Experiment grid-example's trials: 2 trials, 0 pending trials, 1 running trials, 0 killed trials, 1 succeeded trials, 0 failed trials.
2020/01/07 08:17:01 Optimal Trial for Experiment grid-example: {grid-example-4wkdd6ww [{--lr 0.002} {--num-layers 2} {--optimizer sgd}] {[{Validation-accuracy 0.966461}]}}
2020/01/07 08:17:01 Experiment grid-example's conditions: [{Created True ExperimentCreated Experiment is created 2020-01-07 08:11:01 +0000 UTC 2020-01-07 08:11:01 +0000 UTC} {Running True ExperimentRunning Experiment is running 2020-01-07 08:12:20 +0000 UTC 2020-01-07 08:12:20 +0000 UTC}]
2020/01/07 08:17:01 Suggestion grid-example's conditions: [{Created True SuggestionCreated Suggestion is created 2020-01-07 08:11:01 +0000 UTC 2020-01-07 08:11:01 +0000 UTC} {DeploymentReady True DeploymentReady Deployment is ready 2020-01-07 08:12:12 +0000 UTC 2020-01-07 08:12:12 +0000 UTC} {Running True SuggestionRunning Suggestion is running 2020-01-07 08:12:12 +0000 UTC 2020-01-07 08:12:12 +0000 UTC}]
2020/01/07 08:17:01 Suggestion grid-example's suggestions: [{[{--lr 0.001} {--num-layers 2} {--optimizer sgd}] grid-example-fm66xcss} {[{--lr 0.002} {--num-layers 2} {--optimizer sgd}] grid-example-4wkdd6ww}]

The hyperparameters are created, but the trials do not finish.

johnugeorge · 2020-01-07T09:54:47Z

We should try with an increased timeout.

andreyvelich · 2020-01-07T10:42:04Z

/retest

andreyvelich · 2020-01-07T13:02:26Z

I think I found the problem. Here https://chocolate.readthedocs.io/api/sample.html#chocolate.Grid it says that the search space to explore can contain only discrete dimensions, but our example https://github.com/kubeflow/katib/blob/master/examples/v1alpha3/grid-example.yaml#L33 also has a categorical type.
Here https://github.com/kubeflow/katib/blob/master/pkg/suggestion/v1alpha3/chocolate/base_chocolate_service.py#L46 we create the search space for the Grid suggestion with all parameters.
If we remove this categorical parameter from the search space for grid, will that be fine?
/cc @gaocegege
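For context on the categorical question, a grid is just the Cartesian product of finite value lists, so a categorical parameter can in principle be enumerated like any other discrete choice. The sketch below uses only `itertools` and makes no claim about what Chocolate accepts. `build_grid` is a hypothetical helper, and the values `3` and `"adam"` are illustrative rather than taken from the example YAML.

```python
import itertools


def build_grid(space):
    """Enumerate every combination of the finite value lists in `space`."""
    names = list(space)
    combos = itertools.product(*(space[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


space = {
    "--lr": [0.001, 0.002],          # discrete numeric values
    "--num-layers": [2, 3],          # discrete integer values
    "--optimizer": ["sgd", "adam"],  # categorical values, treated as choices
}

grid = build_grid(space)
print(len(grid))  # 8
print(grid[0])    # {'--lr': 0.001, '--num-layers': 2, '--optimizer': 'sgd'}
```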

andreyvelich · 2020-01-07T13:02:38Z

Hold for now
/hold

Check if trial is SUCCEEDED in grid

andreyvelich · 2020-01-07T19:53:33Z

Tests passed.
/cc @johnugeorge

/hold cancel

gaocegege · 2020-01-08T01:28:28Z

/lgtm

gaocegege · 2020-01-08T01:30:35Z

@andreyvelich Thanks for the fix. It is my fault.

I met this problem before, but forgot that we do not run the unit test for grid search. https://github.com/kubeflow/katib/blob/master/test/suggestion/v1alpha3/test_chocolate_service.py.failed

johnugeorge · 2020-01-08T03:24:23Z

@gaocegege But is it an acceptable restriction to not allow categorical parameters for the grid algorithm?

gaocegege · 2020-01-08T03:59:15Z

@johnugeorge It is not. I think we should investigate if we could solve it.

johnugeorge · 2020-01-08T04:06:57Z

Yeah. I think so. This restriction looks unacceptable to me as this is one of the common algorithms.

Should we merge it anyway or wait?

johnugeorge · 2020-01-08T06:34:04Z

/approve

k8s-ci-robot · 2020-01-08T06:34:17Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge

Needs approval from an approver in each of these files:

~~OWNERS~~ [johnugeorge]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

andreyvelich added 2 commits January 6, 2020 19:22

Add CPU Dockerfile example for NASRL

5456eea

Send trial info from CRD to GRPC

Parse Trial Objective

b662627

k8s-ci-robot requested review from gaocegege, hougangliu and johnugeorge January 6, 2020 20:11

k8s-ci-robot added the size/L label Jan 6, 2020

johnugeorge reviewed Jan 7, 2020

k8s-ci-robot added the do-not-merge/hold label Jan 7, 2020

andreyvelich added 2 commits January 7, 2020 13:06

Change variable in Metric.convert loop in grid

41a4392

Check if trial is SUCCEEDED in grid

Increase Search Space in Grid example

76bbac3

k8s-ci-robot requested a review from johnugeorge January 7, 2020 19:53

k8s-ci-robot removed the do-not-merge/hold label Jan 7, 2020

k8s-ci-robot assigned gaocegege Jan 8, 2020

k8s-ci-robot added the lgtm label Jan 8, 2020

k8s-ci-robot added the approved label Jan 8, 2020

k8s-ci-robot merged commit 9c54601 into kubeflow:master Jan 8, 2020

sakaia mentioned this pull request May 29, 2020

Katib Experiment Graph is not shown #1196

Closed

andreyvelich deleted the cpu-example-rl-cifar10 branch October 6, 2021 00:29
