This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

HPO Benchmark #3644

Merged
merged 37 commits into microsoft:master on May 26, 2021

Conversation

xiaowu0162 (Contributor)

Add a benchmarking tool for HPO tuners based on the automlbenchmark tool (https://github.com/openml/automlbenchmark).
Currently this tool supports:

  • Running ML benchmarks using NNI built-in tuners as well as custom tuners. All tuners search over the same search space, defined by the hypothesis space. Currently, random forest is the only supported hypothesis space (see the sketch after this list).
  • Automatically generating reports comparing the performances of different tuners.
  • Running either predefined benchmarks or customized ones. Currently, our predefined benchmark includes three types of problems (binary classification, multi-class classification, and regression), each of which includes 8 tasks.
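To make the "hypothesis space" idea concrete, here is a minimal, hypothetical sketch of what a random-forest search space could look like in NNI's search space format. The parameter names and ranges below are illustrative assumptions, not the ones actually defined in this PR.

```python
# Hypothetical random-forest search space in NNI's search space format.
# Parameter names and ranges are illustrative, not taken from this PR.
search_space = {
    "n_estimators": {"_type": "randint", "_value": [8, 512]},
    "max_depth": {"_type": "choice", "_value": [4, 8, 16, 32, 64]},
    "min_samples_leaf": {"_type": "randint", "_value": [1, 16]},
    "max_features": {"_type": "uniform", "_value": [0.1, 1.0]},
}
```

Every tuner in a benchmark run is given the same space, so the comparison isolates the tuner's search behaviour rather than differences in the model.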

@ghost commented May 14, 2021

CLA assistant check: All CLA requirements met.

@ultmaster ultmaster requested review from J-shang and ultmaster May 14, 2021 08:37
@ultmaster ultmaster added the HPO label May 14, 2021
@xiaowu0162 xiaowu0162 marked this pull request as draft May 18, 2021 06:14
@xiaowu0162 xiaowu0162 marked this pull request as ready for review May 19, 2021 03:54
.gitignore (outdated review thread)
examples/trials/benchmarking/automlbenchmark/README.md (outdated review thread)
@@ -0,0 +1,88 @@
---

NNI:
Contributor

It might be tricky to maintain the NNI version at 2.2 for each release. Do you plan to freeze it forever?

Contributor Author

I have changed the flag to stable, which indicates that the latest stable release should be used. This flag does not affect any functionality on our side; automlbenchmark provides it so that users can choose different framework versions through command-line options.
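For context, a minimal sketch of how that flag might appear in the framework definition file. Only the NNI key and the stable flag come from the diff and the discussion above; the remaining fields are illustrative assumptions following automlbenchmark's frameworks.yaml convention.

```yaml
---
# Hypothetical framework definition entry; only the NNI key and the stable flag
# are taken from this PR's discussion, the remaining fields are illustrative.
NNI:
  version: 'stable'   # use the latest stable NNI release instead of pinning 2.2
  project: https://github.com/microsoft/nni
```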

@xiaowu0162 xiaowu0162 requested a review from ultmaster May 21, 2021 02:17
@xiaowu0162 xiaowu0162 requested review from ultmaster and J-shang May 25, 2021 08:59

As an example, we ran the "nnismall" benchmark on the following 8 tuners: "TPE", "Random", "Anneal", "Evolution", "SMAC", "GPTuner", "MetisTuner", "DngoTuner". (DngoTuner was not available as a built-in tuner at the time of writing.) As some of the tasks contain a considerable amount of training data, it took about 2 days to run the whole benchmark for one tuner using a single CPU core. For a more detailed description of the tasks, please check ``/examples/trials/benchmarking/automlbenchmark/nni/benchmarks/nnismall_description.txt``.

After the script finishes, the final scores of each tuner are summarized in the file ``results[time]/reports/performances.txt``. Since the file is large, we only show the following screenshot and summarize other important statistics instead.
Contributor

It would be better to highlight what kinds of scores these are.

(quoted report excerpt: example scores 5.33 and 3.50 from the summary tables)

Besides these reports, our script also generates two graphs for each fold of each task. The first graph presents the best score seen by each tuner up to trial x, and the second graph shows the score of each tuner at trial x. These two graphs give some information on how the tuners are "converging". We found that for "nnismall", tuners on the random forest model with the search space defined in ``/examples/trials/benchmarking/automlbenchmark/nni/extensions/NNI/architectures/run_random_forest.py`` generally converge to the final solution after 40 to 60 trials. As there are too many graphs to include in a single report (96 graphs in total), we only present 10 graphs here.
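For illustration, here is a minimal, self-contained sketch of the two per-fold graphs described above: the running best score up to trial x, and the score at each trial x. The tuner names are taken from this PR, but the score values are random placeholders, and this is not the PR's actual plotting code.

```python
# Sketch of the two per-fold graphs: running best score vs. trial, and per-trial score.
# Scores are random placeholders; this is not the benchmark's actual plotting code.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
trials = np.arange(1, 101)
scores = {tuner: rng.uniform(0.6, 0.9, size=trials.size) for tuner in ("TPE", "Random")}

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
for tuner, s in scores.items():
    ax1.plot(trials, np.maximum.accumulate(s), label=tuner)  # best score seen so far
    ax2.plot(trials, s, label=tuner)                          # score of trial x
ax1.set(title="Best score until trial x", xlabel="trial", ylabel="score")
ax2.set(title="Score at trial x", xlabel="trial", ylabel="score")
ax1.legend()
ax2.legend()
fig.tight_layout()
fig.savefig("example_fold_graphs.png")
```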
Contributor

What does the score mean? Is it specified by us, or can it be defined by the user? Maybe we need to explain this in the doc.

Contributor Author

The scores in the tables are average rankings. To get this score, the user has to run benchmarks against multiple tuners (either specify multiple tuners in the command, or manually aggregate the results afterwards). I will modify the doc to further clarify this.
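To make the "average ranking" idea concrete, here is a small hypothetical sketch of the aggregation, not the PR's actual result-parsing script: for each task, rank the tuners by their final score, then average each tuner's rank across tasks. The data values are made up.

```python
# Hypothetical average-ranking aggregation; data values are made up and this is
# not the PR's result-parsing script.
import pandas as pd

# rows: tasks, columns: tuners, values: final score per tuner (higher is better here)
scores = pd.DataFrame(
    {
        "TPE": [0.91, 0.74, 0.88],
        "Random": [0.89, 0.70, 0.85],
        "Anneal": [0.90, 0.76, 0.84],
    },
    index=["task1", "task2", "task3"],
)

# Rank tuners within each task (1 = best), then average the ranks over tasks.
average_ranking = scores.rank(axis=1, ascending=False).mean(axis=0)
print(average_ranking.sort_values())
```

A lower average ranking means the tuner tends to place near the top across tasks, which matches how the scores in the report tables are meant to be read.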

A Benchmark Example
^^^^^^^^^^^^^^^^^^^

As an example, we ran the "nnismall" benchmark on the following 8 tuners: "TPE", "Random", "Anneal", "Evolution", "SMAC", "GPTuner", "MetisTuner", "DngoTuner". (DngoTuner was not available as a built-in tuner at the time of writing.) As some of the tasks contain a considerable amount of training data, it took about 2 days to run the whole benchmark for one tuner using a single CPU core. For a more detailed description of the tasks, please check ``/examples/trials/benchmarking/automlbenchmark/nni/benchmarks/nnismall_description.txt``.
Contributor

Does testing one tuner take 2 days?

xiaowu0162 (Contributor Author) commented May 26, 2021

Yes, if we use our 24-task benchmark and require the tuner to run 100 trials per fold per task. On average, each trial takes a bit under 1 minute; with 24 tasks, 2 folds per task, and 100 trials per fold (4800 trials in total), that adds up to roughly 2 days.

Contributor

Have you run multiple tuners at the same time?

xiaowu0162 (Contributor Author) commented May 26, 2021

Yes. In my experiments I ran all tuners at the same time and manually aggregated the results afterwards.

Contributor

So in this case, should we recommend that users test multiple tuners at the same time by running multiple scripts, instead of script.sh tuner1 tuner2 ...? Or should our script run each tuner's test in parallel? What is the main time cost of a trial?

xiaowu0162 (Contributor Author) commented May 26, 2021

I think in our case the main time cost is training. For relatively large benchmarks, running the tasks serially is indeed not optimal. However, asking users to run multiple scripts is not ideal either, as they would have to manually combine the results before running the result-parsing script.
I suggest adding a flag to the script that lets users choose to run the tasks non-blocking in the background. This could cause file contention, so I will add extra logic to handle it.
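As a rough illustration of the non-blocking idea, here is a hypothetical sketch that launches one benchmark run per tuner in the background and waits for all of them before aggregating results. The entry-script name and its arguments are assumptions, not the actual interface added in this PR.

```python
# Hypothetical sketch of running one benchmark per tuner in the background and
# waiting for all of them; the script name and arguments are assumptions.
import subprocess

tuners = ["TPE", "Random", "Anneal", "Evolution"]
processes = [
    subprocess.Popen(["bash", "runbenchmark_nni.sh", tuner])  # hypothetical entry script
    for tuner in tuners
]
for process in processes:
    process.wait()  # block until every tuner's run has finished

# After all runs finish, a result-parsing step could aggregate the per-tuner outputs.
```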

Contributor Author

So in this case, should we recommend that users test multiple tuners at the same time by running multiple scripts, instead of script.sh tuner1 tuner2 ...? Or should our script run each tuner's test in parallel? What is the main time cost of a trial?

Added in the following commit

@ultmaster ultmaster merged commit 4c49db1 into microsoft:master May 26, 2021