This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

HPO Benchmark #3644

Merged
merged 37 commits on May 26, 2021
Changes from 36 commits
Commits
37 commits
8c3f03a
Copy nni package from the original repository
xiaowu0162 May 10, 2021
3b7f452
Merge branch 'microsoft:master' into dev-hpo
xiaowu0162 May 10, 2021
bc317cc
Add initialization code; change hpo benchmark autorun code
xiaowu0162 May 10, 2021
f309530
Merge branch 'dev-hpo' of https://github.com/xiaowu0162/nni into dev-hpo
xiaowu0162 May 10, 2021
9e463c2
HPO Benchmark README v1
xiaowu0162 May 10, 2021
9b08605
HPO Benchmark README V1.1
xiaowu0162 May 10, 2021
13705fa
HPO Benchmark README V1.2
xiaowu0162 May 10, 2021
aa4f223
Move HPO Benchmark to examples directory
xiaowu0162 May 11, 2021
34016ed
Change HPO Benchmark tuner import logic
xiaowu0162 May 11, 2021
c6ebe4a
Change HPO Benchmark running script logic; Modified README file
xiaowu0162 May 11, 2021
cf129d6
README fix
xiaowu0162 May 11, 2021
b9f4e38
README fix
xiaowu0162 May 11, 2021
50caa4f
Summarize dependencies in a requirement.txt file; update the document…
xiaowu0162 May 12, 2021
f1b0651
Update README.md
xiaowu0162 May 12, 2021
6d621d0
Updated requirement for HPO benchmark
xiaowu0162 May 12, 2021
f3d9fed
Merge branch 'dev-hpo' of https://github.com/xiaowu0162/nni into dev-hpo
xiaowu0162 May 12, 2021
40c8a42
Update README.md
xiaowu0162 May 12, 2021
bcd8758
20210513 HPO Benchmark feature updates:
xiaowu0162 May 13, 2021
8c82b03
Merge branch 'microsoft:master' into dev-hpo
xiaowu0162 May 14, 2021
7136f31
debug
xiaowu0162 May 14, 2021
103ef73
Merge branch 'dev-hpo' of https://github.com/xiaowu0162/nni into dev-hpo
xiaowu0162 May 14, 2021
7055991
Support either "time" or "ntrials" as benchmark constraints
xiaowu0162 May 18, 2021
4bb2bd7
Add graphical presentation of benchmark results
xiaowu0162 May 18, 2021
4695962
Finalize HPO Benchmark graphical reports
xiaowu0162 May 19, 2021
c5cc749
Add HPO Benchmark code comments and documentation
xiaowu0162 May 19, 2021
b66168e
Refactor HPO benchmark documentation and configs
xiaowu0162 May 21, 2021
07771dd
Update hpo_benchmark.rst
xiaowu0162 May 21, 2021
b8b6a4e
Update hpo_benchmark.rst
xiaowu0162 May 21, 2021
fb09709
documentation debug
xiaowu0162 May 21, 2021
b810804
Update hpo_benchmark.rst
xiaowu0162 May 21, 2021
60d391c
benchmark nan temporary fix
xiaowu0162 May 21, 2021
b79300e
Merge branch 'dev-hpo' of https://github.com/xiaowu0162/nni into dev-hpo
xiaowu0162 May 21, 2021
0619d9c
hpo benchmark script fix
xiaowu0162 May 21, 2021
e0e3348
Change result parsing code to generate two graphs per task per fold
xiaowu0162 May 24, 2021
ba75e9f
HPO benchmark result article v1
xiaowu0162 May 25, 2021
43f1efe
doc img bug fix
xiaowu0162 May 25, 2021
a9f3037
hpo benchmark - add the option of running all experiments in parallel…
xiaowu0162 May 26, 2021
235 changes: 235 additions & 0 deletions docs/en_US/hpo_benchmark.rst
@@ -0,0 +1,235 @@

Benchmark for Tuners
====================

We provide a benchmarking tool to compare the performance of the tuners provided by NNI (and users' custom tuners) on different tasks. The implementation of this tool is based on the automlbenchmark repository (https://github.com/openml/automlbenchmark), which supports running different *frameworks* against different *benchmarks*, each consisting of multiple *tasks*. The tool is located in ``examples/trials/benchmarking/automlbenchmark``. This document provides a brief introduction to the tool and its usage.

Terminology
^^^^^^^^^^^


* **task**\ : a task can be thought of as a (dataset, evaluator) pair. It provides a dataset split into (train, valid, test) sets, and the evaluator computes a given metric (e.g., mse for regression, f1 for classification) on the predictions it receives.
* **benchmark**\ : a benchmark is a set of tasks, along with other external constraints such as time and resource.
* **framework**\ : given a task, a framework solves the proposed regression or classification problem and produces predictions. Note that the automlbenchmark framework does not pose any restrictions on the hypothesis space of a framework. In our implementation in this folder, each framework is a tuple (tuner, architecture), where the architecture provides the hypothesis space (and thus the search space for the tuner), and the tuner determines the strategy of hyperparameter optimization.
* **tuner**\ : a tuner or advisor defined in the hpo folder, or a custom tuner provided by the user.
* **architecture**\ : an architecture is a specific method for solving the tasks, along with a set of hyperparameters to optimize (i.e., the search space). In our implementation, the architecture calls the tuner multiple times to obtain possible hyperparameter configurations and produces the final prediction for a task (a minimal sketch of this loop is given below). See ``./nni/extensions/NNI/architectures`` for examples.
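The following is a highly simplified sketch of how an architecture might drive a tuner, assuming NNI's built-in TPE tuner (the exact import path can vary between NNI versions) and a hypothetical ``train_and_evaluate`` helper standing in for the real training code; it is not the actual benchmark implementation:

.. code-block:: python

   # Simplified sketch, not the actual benchmark code: the tuner proposes
   # hyperparameter configurations, the architecture trains a model with each
   # configuration and reports the resulting score back to the tuner.
   from nni.algorithms.hpo.hyperopt_tuner import HyperoptTuner  # TPE; path may differ by NNI version

   search_space = {
       'n_estimators': {'_type': 'randint', '_value': [8, 512]},
       'max_depth': {'_type': 'randint', '_value': [4, 32]},
   }

   def train_and_evaluate(params):
       # Hypothetical helper: train a model with `params` and return the validation
       # metric. A dummy score is returned here only to keep the sketch runnable.
       return float(params['n_estimators'])

   tuner = HyperoptTuner('tpe', optimize_mode='maximize')
   tuner.update_search_space(search_space)

   best_score, best_params = float('-inf'), None
   for trial_id in range(100):                       # e.g., 100 trials per fold
       params = tuner.generate_parameters(trial_id)  # tuner proposes a configuration
       score = train_and_evaluate(params)            # train and score the model
       tuner.receive_trial_result(trial_id, params, score)
       if score > best_score:
           best_score, best_params = score, params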

Setup
^^^^^

Due to some incompatibilities between automlbenchmark and Python 3.8, Python 3.7 is recommended for running the experiments contained in this folder. First, run the following shell script to clone the automlbenchmark repository. Note: it is recommended to perform the following steps in a separate virtual environment, as the setup code may install several packages.

.. code-block:: bash

./setup.sh
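For reference, one possible way to follow these recommendations, assuming ``python3.7`` is available and the commands are run from ``examples/trials/benchmarking/automlbenchmark``, is:

.. code-block:: bash

   # create and activate a dedicated virtual environment, then run the setup script
   python3.7 -m venv benchmark-env
   source benchmark-env/bin/activate
   ./setup.sh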

Run predefined benchmarks on existing tuners
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

./runbenchmark_nni.sh [tuner-names]

This script runs the benchmark 'nnivalid', which consists of a regression task, a binary classification task, and a multi-class classification task. After the script finishes, you can find a summary of the results in the folder ``results_[time]/reports/``. To run on other predefined benchmarks, change the ``benchmark`` variable in ``runbenchmark_nni.sh``. Some benchmarks are defined in ``/examples/trials/benchmarking/automlbenchmark/nni/benchmarks``\ , and others are defined in ``/examples/trials/benchmarking/automlbenchmark/automlbenchmark/resources/benchmarks/``. One example of a larger benchmark is "nnismall", which consists of 8 regression tasks, 8 binary classification tasks, and 8 multi-class classification tasks.

By default, the script runs the benchmark on all built-in tuners in NNI. If a list of tuners is given in [tuner-names], only those tuners are run. Currently, the following tuner names are supported: "TPE", "Random", "Anneal", "Evolution", "SMAC", "GPTuner", "MetisTuner", "Hyperband", "BOHB". It is also possible to evaluate custom tuners. See the next sections for details.
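For example, the following invocation benchmarks only the TPE and Random tuners:

.. code-block:: bash

   ./runbenchmark_nni.sh TPE Random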

Note: the SMAC tuner and the BOHB advisor have to be manually installed before any experiments can be run with them. Please refer to `this page <https://nni.readthedocs.io/en/stable/Tuner/BuiltinTuner.html?highlight=nni>`_ for more details on installing SMAC and BOHB.

Run customized benchmarks on existing tuners
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To run customized benchmarks, add a ``benchmark_name.yaml`` file to the folder ``./nni/benchmarks``\ , and change the ``benchmark`` variable in ``runbenchmark_nni.sh``. See ``./automlbenchmark/resources/benchmarks/`` for some examples of defining a custom benchmark.
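As a rough illustration, a custom benchmark file could look like the following sketch, which reuses two OpenML tasks from the predefined benchmarks; the constraints under ``__defaults__`` (folds, cores, runtime) should be adjusted to your needs:

.. code-block:: yaml

   ---
   - name: __defaults__
     folds: 2
     cores: 2
     max_runtime_seconds: 300

   - name: credit-g        # OpenML classification task
     openml_task_id: 31

   - name: kin8nm          # OpenML regression task
     openml_task_id: 2280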

Run benchmarks on custom tuners
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To use custom tuners, first make sure that the tuner inherits from ``nni.tuner.Tuner`` and correctly implements the required APIs. For more information on implementing a custom tuner, please refer to `here <https://nni.readthedocs.io/en/stable/Tuner/CustomizeTuner.html>`_. Next, perform the following steps:


#. Install the custom tuner with the command ``nnictl algo register``. Check `this document <https://nni.readthedocs.io/en/stable/Tutorial/Nnictl.html>`_ for details.
#. In ``./nni/frameworks.yaml``\ , add a new framework extending the base framework NNI. Make sure that the parameter ``tuner_type`` corresponds to the ``builtinName`` of the tuner installed in step 1 (a hypothetical entry is sketched after the command below).
#. Run the following command

.. code-block:: bash

./runbenchmark_nni.sh new-tuner-builtinName
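For step 2, a purely hypothetical ``./nni/frameworks.yaml`` entry might look like the sketch below; the field names and nesting are assumptions, so mirror the existing entries in that file for the exact schema:

.. code-block:: yaml

   # hypothetical entry; copy the structure of the existing entries in ./nni/frameworks.yaml
   MyCustomTuner:
     extends: NNI
     params:
       tuner_type: 'MyCustomTuner'   # must match the builtinName registered in step 1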

A Benchmark Example
^^^^^^^^^^^^^^^^^^^

As an example, we ran the "nnismall" benchmark on the following 8 tuners: "TPE", "Random", "Anneal", "Evolution", "SMAC", "GPTuner", "MetisTuner", "DNGOTuner". (The DNGOTuner is not available as a built-in tuner at the time of writing.) As some of the tasks contain a considerable amount of training data, it took about 2 days to run the whole benchmark on one tuner using a single CPU core. For a more detailed description of the tasks, please check ``/examples/trials/benchmarking/automlbenchmark/nni/benchmarks/nnismall_description.txt``.
Contributor
Does testing one tuner need 2 days?

xiaowu0162 (Author), May 26, 2021
Yes, if we use our 24-task benchmark and enforce the tuner to run 100 trials per fold per task. On average, the time cost is less than 1 minute per trial.

Contributor
Have you run multiple tuners at the same time?

xiaowu0162 (Author), May 26, 2021
Yes. In my experiments I ran all tuners at the same time and manually aggregated the results afterwards.

Contributor
So in this case, should we recommend that users test multiple tuners at the same time by running multiple scripts instead of "script.sh tuner1 tuner2 ..."? Or should our script execute each tuner test in parallel? What is the main time cost of a trial?

xiaowu0162 (Author), May 26, 2021
I think in our case the main time cost is training. For relatively large benchmarks, serializing the tasks is indeed not optimal. However, letting the user run multiple scripts is not ideal either, as the user has to manually put the results together before running the result-parsing script.
I suggest adding a flag to the script so that users can choose to run the tasks in a non-blocking way in the background. This could potentially cause file contention, so I will add extra logic to handle it.

xiaowu0162 (Author)
> So in this case, should we recommend that users test multiple tuners at the same time by running multiple scripts instead of "script.sh tuner1 tuner2 ..."? Or should our script execute each tuner test in parallel? What is the main time cost of a trial?

Added in the following commit


After the script finishes, the final scores of each tuner are summarized in the file ``results[time]/reports/performances.txt``. Since the file is large, we only show the following screenshot and summarize other important statistics instead.
Contributor
It would be better to highlight what kinds of scores these are.


.. image:: ../img/hpo_benchmark/performances.png
:target: ../img/hpo_benchmark/performances.png
:alt:

When the results are parsed, the tuners are ranked based on their final performances. ``results[time]/reports/rankings.txt`` presents the average ranking of the tuners for each metric (rmse, auc, logloss), as well as the rankings grouped by tuner (another view of the same data).
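As a clarification of what these numbers mean, the sketch below (an assumption about the computation, not the actual parsing code) shows how an average ranking can be derived from per-task final scores: for each task the tuners are ranked by their final score, and each tuner's ranks are then averaged over all tasks.

.. code-block:: python

   # Sketch of an "average ranking" computation (ties are ignored for simplicity).
   from collections import defaultdict

   def average_rankings(final_scores, higher_is_better=True):
       # final_scores[task][tuner] = final metric value of that tuner on that task
       rank_sums, counts = defaultdict(float), defaultdict(int)
       for task, scores in final_scores.items():
           ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
           for rank, tuner in enumerate(ordered, start=1):
               rank_sums[tuner] += rank
               counts[tuner] += 1
       return {tuner: rank_sums[tuner] / counts[tuner] for tuner in rank_sums}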

Average rankings for metric rmse:

.. list-table::
:header-rows: 1

* - Tuner Name
- Average Ranking
* - Anneal
- 3.75
* - Random
- 4.00
* - Evolution
- 4.44
* - DNGOTuner
- 4.44
* - SMAC
- 4.56
* - TPE
- 4.94
* - GPTuner
- 4.94
* - MetisTuner
- 4.94

Average rankings for metric auc:

.. list-table::
:header-rows: 1

* - Tuner Name
- Average Ranking
* - SMAC
- 3.67
* - GPTuner
- 4.00
* - Evolution
- 4.22
* - Anneal
- 4.39
* - MetisTuner
- 4.39
* - TPE
- 4.67
* - Random
- 5.33
* - DNGOTuner
- 5.33

Average rankings for metric logloss:

.. list-table::
:header-rows: 1

* - Tuner Name
- Average Ranking
* - Random
- 3.36
* - DNGOTuner
- 3.50
* - SMAC
- 3.93
* - GPTuner
- 4.64
* - TPE
- 4.71
* - Anneal
- 4.93
* - Evolution
- 5.00
* - MetisTuner
- 5.93

Average rankings for tuners:

.. list-table::
:header-rows: 1

* - Tuner Name
- rmse
- auc
- logloss
* - TPE
- 4.94
- 4.67
- 4.71
* - Random
- 4.00
- 5.33
- 3.36
* - Anneal
- 3.75
- 4.39
- 4.93
* - Evolution
- 4.44
- 4.22
- 5.00
* - GPTuner
- 4.94
- 4.00
- 4.64
* - MetisTuner
- 4.94
- 4.39
- 5.93
* - SMAC
- 4.56
- 3.67
- 3.93
* - DNGOTuner
- 4.44
- 5.33
- 3.50

Besides these reports, our script also generates two graphs for each fold of each task: the first presents the best score seen by each tuner up to trial x, and the second shows the score obtained by each tuner in trial x. These two graphs give some information about how the tuners "converge". We found that for "nnismall", tuners on the random forest model with the search space defined in ``/examples/trials/benchmarking/automlbenchmark/nni/extensions/NNI/architectures/run_random_forest.py`` generally converge to the final solution after 40 to 60 trials. As there are too many graphs to include in a single report (96 in total), we only present 10 of them here.
Contributor
What does the score mean? Is it specified by us, or can it be defined by the user? Maybe we need to explain this in the doc.

xiaowu0162 (Author)
The scores in the tables are average rankings. To get this score, the user has to run benchmarks against multiple tuners (either specify multiple tuners in the command, or manually aggregate the results afterwards). I will modify the doc to further clarify this.


.. image:: ../img/hpo_benchmark/car_fold1_1.jpg
:target: ../img/hpo_benchmark/car_fold1_1.jpg
:alt:


.. image:: ../img/hpo_benchmark/car_fold1_2.jpg
:target: ../img/hpo_benchmark/car_fold1_2.jpg
:alt:


.. image:: ../img/hpo_benchmark/christine_fold0_1.jpg
:target: ../img/hpo_benchmark/christine_fold0_1.jpg
:alt:


.. image:: ../img/hpo_benchmark/christine_fold0_2.jpg
:target: ../img/hpo_benchmark/christine_fold0_2.jpg
:alt:


.. image:: ../img/hpo_benchmark/cnae-9_fold0_1.jpg
:target: ../img/hpo_benchmark/cnae-9_fold0_1.jpg
:alt:


.. image:: ../img/hpo_benchmark/cnae-9_fold0_2.jpg
:target: ../img/hpo_benchmark/cnae-9_fold0_2.jpg
:alt:


.. image:: ../img/hpo_benchmark/credit-g_fold1_1.jpg
:target: ../img/hpo_benchmark/credit-g_fold1_1.jpg
:alt:


.. image:: ../img/hpo_benchmark/credit-g_fold1_2.jpg
:target: ../img/hpo_benchmark/credit-g_fold1_2.jpg
:alt:


.. image:: ../img/hpo_benchmark/titanic_2_fold1_1.jpg
:target: ../img/hpo_benchmark/titanic_2_fold1_1.jpg
:alt:


.. image:: ../img/hpo_benchmark/titanic_2_fold1_2.jpg
:target: ../img/hpo_benchmark/titanic_2_fold1_2.jpg
:alt:

3 changes: 2 additions & 1 deletion docs/en_US/hyperparameter_tune.rst
@@ -24,4 +24,5 @@ according to their needs.
Examples <examples>
WebUI <Tutorial/WebUI>
How to Debug <Tutorial/HowToDebug>
Advanced <hpo_advanced>
Advanced <hpo_advanced>
Benchmark for Tuners <hpo_benchmark>
Binary file added docs/img/hpo_benchmark/car_fold1_1.jpg
Binary file added docs/img/hpo_benchmark/car_fold1_2.jpg
Binary file added docs/img/hpo_benchmark/christine_fold0_1.jpg
Binary file added docs/img/hpo_benchmark/christine_fold0_2.jpg
Binary file added docs/img/hpo_benchmark/cnae-9_fold0_1.jpg
Binary file added docs/img/hpo_benchmark/cnae-9_fold0_2.jpg
Binary file added docs/img/hpo_benchmark/credit-g_fold1_1.jpg
Binary file added docs/img/hpo_benchmark/credit-g_fold1_2.jpg
Binary file added docs/img/hpo_benchmark/performances.png
Binary file added docs/img/hpo_benchmark/titanic_2_fold1_1.jpg
Binary file added docs/img/hpo_benchmark/titanic_2_fold1_2.jpg
13 changes: 13 additions & 0 deletions examples/trials/benchmarking/automlbenchmark/.gitignore
@@ -0,0 +1,13 @@
# data files
nni/data/

# benchmark repository
automlbenchmark/

# all experiment results
results*

# intermediate outputs of tuners
smac3-output*
param_config_space.pcs
scenario.txt
@@ -0,0 +1,77 @@
---
- name: __defaults__
folds: 2
cores: 2
max_runtime_seconds: 300

- name: cholesterol
openml_task_id: 2295

- name: liver-disorders
openml_task_id: 52948

- name: kin8nm
openml_task_id: 2280

- name: cpu_small
openml_task_id: 4883

- name: titanic_2
openml_task_id: 211993

- name: boston
openml_task_id: 4857

- name: stock
openml_task_id: 2311

- name: space_ga
openml_task_id: 4835

- name: Australian
openml_task_id: 146818

- name: blood-transfusion
openml_task_id: 10101

- name: car
openml_task_id: 146821

- name: christine
openml_task_id: 168908

- name: cnae-9
openml_task_id: 9981

- name: credit-g
openml_task_id: 31

- name: dilbert
openml_task_id: 168909

- name: fabert
openml_task_id: 168910

- name: jasmine
openml_task_id: 168911

- name: kc1
openml_task_id: 3917

- name: kr-vs-kp
openml_task_id: 3

- name: mfeat-factors
openml_task_id: 12

- name: phoneme
openml_task_id: 9952

- name: segment
openml_task_id: 146822

- name: sylvine
openml_task_id: 168912

- name: vehicle
openml_task_id: 53