This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

HPO Benchmark #3644

Merged
merged 37 commits into microsoft:master on May 26, 2021

Conversation

xiaowu0162 (Contributor)

Add a benchmarking tool for HPO tuners based on the automlbenchmark tool (https://github.com/openml/automlbenchmark).
Currently this tool supports:

  • Running ML benchmarks using NNI built-in tuners as well as custom tuners. All tuners search over the same search space, defined by the hypothesis space. Currently, random forest is the only supported hypothesis space (see the sketch after this list).
  • Automatically generating reports comparing the performances of different tuners.
  • Running either predefined benchmarks or customized ones. Currently, our predefined benchmark includes three types of problems (binary classification, multi-class classification, and regression), each of which includes 8 tasks.
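To make the "hypothesis space" idea concrete, here is a minimal, hypothetical sketch of what a random-forest search space could look like in NNI's search space format. The parameter names and ranges below are illustrative assumptions, not the ones actually defined in this PR.

```python
# Hypothetical random-forest search space in NNI's search space format.
# Parameter names and ranges are illustrative, not taken from this PR.
search_space = {
    "n_estimators": {"_type": "randint", "_value": [8, 512]},
    "max_depth": {"_type": "choice", "_value": [4, 8, 16, 32, 64]},
    "min_samples_leaf": {"_type": "randint", "_value": [1, 16]},
    "max_features": {"_type": "uniform", "_value": [0.1, 1.0]},
}
```

Every tuner in a benchmark run is given the same space, so the comparison isolates the tuner's search behaviour rather than differences in the model.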

@ghost commented May 14, 2021

CLA assistant check: All CLA requirements met.

@ultmaster ultmaster requested review from J-shang and ultmaster May 14, 2021 08:37
@ultmaster ultmaster added the HPO label May 14, 2021
@xiaowu0162 xiaowu0162 marked this pull request as draft May 18, 2021 06:14
@xiaowu0162 xiaowu0162 marked this pull request as ready for review May 19, 2021 03:54
.gitignore (outdated review thread)
examples/trials/benchmarking/automlbenchmark/README.md (outdated review thread)
@@ -0,0 +1,88 @@
---

NNI:
Contributor

It might be tricky to maintain the NNI version at 2.2 for each release. Do you plan to freeze it forever?

Contributor Author

I have changed the flag to stable, which indicates that the latest stable release should be used. This flag does not affect any functionality on our side; automlbenchmark provides it so that users can choose different framework versions through command-line options.
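For context, a minimal sketch of how that flag might appear in the framework definition file. Only the NNI key and the stable flag come from the diff and the discussion above; the remaining fields are illustrative assumptions following automlbenchmark's frameworks.yaml convention.

```yaml
---
# Hypothetical framework definition entry; only the NNI key and the stable flag
# are taken from this PR's discussion, the remaining fields are illustrative.
NNI:
  version: 'stable'   # use the latest stable NNI release instead of pinning 2.2
  project: https://github.com/microsoft/nni
```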

@xiaowu0162 xiaowu0162 requested a review from ultmaster May 21, 2021 02:17
@xiaowu0162 xiaowu0162 requested review from ultmaster and J-shang May 25, 2021 08:59

As an example, we ran the "nnismall" benchmark on the following 8 tuners: "TPE", "Random", "Anneal", "Evolution", "SMAC", "GPTuner", "MetisTuner", "DngoTuner". (DngoTuner was not available as a built-in tuner at the time of writing.) As some of the tasks contain a considerable amount of training data, it took about 2 days to run the whole benchmark for one tuner using a single CPU core. For a more detailed description of the tasks, please check ``/examples/trials/benchmarking/automlbenchmark/nni/benchmarks/nnismall_description.txt``.

After the script finishes, the final scores of each tuner are summarized in the file ``results[time]/reports/performances.txt``. Since the file is large, we only show the following screenshot and summarize other important statistics instead.
Contributor

It would be better to highlight what kinds of scores these are.

(quoted report excerpt: example scores 5.33 and 3.50 from the summary tables)

Besides these reports, our script also generates two graphs for each fold of each task. The first graph presents the best score seen by each tuner up to trial x, and the second graph shows the score of each tuner at trial x. These two graphs give some information on how the tuners are "converging". We found that for "nnismall", tuners on the random forest model with the search space defined in ``/examples/trials/benchmarking/automlbenchmark/nni/extensions/NNI/architectures/run_random_forest.py`` generally converge to the final solution after 40 to 60 trials. As there are too many graphs to include in a single report (96 graphs in total), we only present 10 graphs here.
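For illustration, here is a minimal, self-contained sketch of the two per-fold graphs described above: the running best score up to trial x, and the score at each trial x. The tuner names are taken from this PR, but the score values are random placeholders, and this is not the PR's actual plotting code.

```python
# Sketch of the two per-fold graphs: running best score vs. trial, and per-trial score.
# Scores are random placeholders; this is not the benchmark's actual plotting code.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
trials = np.arange(1, 101)
scores = {tuner: rng.uniform(0.6, 0.9, size=trials.size) for tuner in ("TPE", "Random")}

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
for tuner, s in scores.items():
    ax1.plot(trials, np.maximum.accumulate(s), label=tuner)  # best score seen so far
    ax2.plot(trials, s, label=tuner)                          # score of trial x
ax1.set(title="Best score until trial x", xlabel="trial", ylabel="score")
ax2.set(title="Score at trial x", xlabel="trial", ylabel="score")
ax1.legend()
ax2.legend()
fig.tight_layout()
fig.savefig("example_fold_graphs.png")
```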
Contributor

What does the score mean? Is it specified by us, or can it be defined by the user? Maybe we need to explain this in the doc.

Contributor Author

The scores in the tables are average rankings. To get this score, the user has to run benchmarks against multiple tuners (either specify multiple tuners in the command, or manually aggregate the results afterwards). I will modify the doc to further clarify this.
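To make the "average ranking" idea concrete, here is a small hypothetical sketch of the aggregation, not the PR's actual result-parsing script: for each task, rank the tuners by their final score, then average each tuner's rank across tasks. The data values are made up.

```python
# Hypothetical average-ranking aggregation; data values are made up and this is
# not the PR's result-parsing script.
import pandas as pd

# rows: tasks, columns: tuners, values: final score per tuner (higher is better here)
scores = pd.DataFrame(
    {
        "TPE": [0.91, 0.74, 0.88],
        "Random": [0.89, 0.70, 0.85],
        "Anneal": [0.90, 0.76, 0.84],
    },
    index=["task1", "task2", "task3"],
)

# Rank tuners within each task (1 = best), then average the ranks over tasks.
average_ranking = scores.rank(axis=1, ascending=False).mean(axis=0)
print(average_ranking.sort_values())
```

A lower average ranking means the tuner tends to place near the top across tasks, which matches how the scores in the report tables are meant to be read.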

A Benchmark Example
^^^^^^^^^^^^^^^^^^^

As an example, we ran the "nnismall" benchmark on the following 8 tuners: "TPE", "Random", "Anneal", "Evolution", "SMAC", "GPTuner", "MetisTuner", "DngoTuner". (DngoTuner was not available as a built-in tuner at the time of writing.) As some of the tasks contain a considerable amount of training data, it took about 2 days to run the whole benchmark for one tuner using a single CPU core. For a more detailed description of the tasks, please check ``/examples/trials/benchmarking/automlbenchmark/nni/benchmarks/nnismall_description.txt``.
Contributor

Does testing one tuner take 2 days?

xiaowu0162 (Contributor Author) commented May 26, 2021

Yes, if we use our 24-task benchmark and require the tuner to run 100 trials per fold per task. On average, each trial takes a bit under 1 minute; with 24 tasks, 2 folds per task, and 100 trials per fold (4800 trials in total), that adds up to roughly 2 days.

Contributor

Have you run multiple tuners at the same time?

xiaowu0162 (Contributor Author) commented May 26, 2021

Yes. In my experiments I ran all tuners at the same time and manually aggregated the results afterwards.

Contributor

So in this case, should we recommend that users test multiple tuners at the same time by running multiple scripts, instead of script.sh tuner1 tuner2 ...? Or should our script run each tuner's test in parallel? What is the main time cost of a trial?

xiaowu0162 (Contributor Author) commented May 26, 2021

I think in our case the main time cost is training. For relatively large benchmarks, running the tasks serially is indeed not optimal. However, asking users to run multiple scripts is not ideal either, as they would have to manually combine the results before running the result-parsing script.
I suggest adding a flag to the script that lets users choose to run the tasks non-blocking in the background. This could cause file contention, so I will add extra logic to handle it.
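As a rough illustration of the non-blocking idea, here is a hypothetical sketch that launches one benchmark run per tuner in the background and waits for all of them before aggregating results. The entry-script name and its arguments are assumptions, not the actual interface added in this PR.

```python
# Hypothetical sketch of running one benchmark per tuner in the background and
# waiting for all of them; the script name and arguments are assumptions.
import subprocess

tuners = ["TPE", "Random", "Anneal", "Evolution"]
processes = [
    subprocess.Popen(["bash", "runbenchmark_nni.sh", tuner])  # hypothetical entry script
    for tuner in tuners
]
for process in processes:
    process.wait()  # block until every tuner's run has finished

# After all runs finish, a result-parsing step could aggregate the per-tuner outputs.
```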

Contributor Author

So in this case, should we recommend that users test multiple tuners at the same time by running multiple scripts, instead of script.sh tuner1 tuner2 ...? Or should our script run each tuner's test in parallel? What is the main time cost of a trial?

Added in the following commit

@ultmaster ultmaster merged commit 4c49db1 into microsoft:master May 26, 2021