Conversation
* add a new benchmark "nnismall" with binary classification, multi-class classification, and regression tasks
* re-implement a correct data preprocessing pipeline for random forest
* re-organize dependencies and update setup.sh
@@ -0,0 +1,88 @@
---
NNI:
It might be tricky to maintain NNI version 2.2 for each release. Do you plan to freeze it forever?
I have changed the flag to stable, which indicates that the latest stable release should be used. This flag does not affect any functionality on our side; automlbenchmark provides it so that users can choose different framework versions via command-line options.
As an example, we ran the "nnismall" benchmark on the following 8 tuners: "TPE", "Random", "Anneal", "Evolution", "SMAC", "GPTuner", "MetisTuner", "DngoTuner". (The DngoTuner was not available as a built-in tuner at the time of writing.) As some of the tasks contain a considerable amount of training data, it took about 2 days to run the whole benchmark for one tuner on a single CPU core. For a more detailed description of the tasks, please check ``/examples/trials/benchmarking/automlbenchmark/nni/benchmarks/nnismall_description.txt``.
After the script finishes, the final scores of each tuner are summarized in the file ``results[time]/reports/performances.txt``. Since the file is large, we only show the following screenshot and summarize other important statistics instead.
It's better to highlight what kinds of scores these are.
- 5.33
- 3.50
Besides these reports, our script also generates two graphs for each fold of each task. The first graph presents the best score seen by each tuner up to trial x, and the second graph shows the score of each tuner in trial x. These two graphs give some information on how quickly the tuners "converge". We found that for "nnismall", tuners on the random forest model with the search space defined in ``/examples/trials/benchmarking/automlbenchmark/nni/extensions/NNI/architectures/run_random_forest.py`` generally converge to the final solution after 40 to 60 trials. As there are too many graphs to include in a single report (96 graphs in total), we only present 10 graphs here.
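For illustration, here is a minimal sketch of how the first kind of graph (best score seen up to trial x) could be produced from per-trial scores. This is not the actual plotting code shipped with the benchmark scripts, and the per-trial scores below are made-up placeholder values:

```python
import matplotlib.pyplot as plt

# Placeholder per-trial scores for two tuners on one fold of one task
# (higher is better, e.g. AUC). Real runs would have ~100 trials.
trial_scores = {
    "TPE":    [0.71, 0.74, 0.73, 0.78, 0.77, 0.80],
    "Random": [0.70, 0.69, 0.75, 0.72, 0.76, 0.74],
}

for tuner, scores in trial_scores.items():
    # Best score observed up to and including each trial.
    best_so_far = []
    best = float("-inf")
    for score in scores:
        best = max(best, score)
        best_so_far.append(best)
    plt.plot(range(1, len(scores) + 1), best_so_far, label=tuner)

plt.xlabel("Trial x")
plt.ylabel("Best score seen until trial x")
plt.legend()
plt.savefig("best_score_so_far.png")
```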
What does the score mean? Is it specified by us, or could it be defined by the user? Maybe we need to explain this in the doc.
The scores in the tables are average rankings. To get this score, the user has to run benchmarks against multiple tuners (either specify multiple tuners in the command, or manually aggregate the results afterwards). I will modify the doc to further clarify this.
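To make "average ranking" concrete, here is a minimal sketch of how such a score could be computed from per-task results. The DataFrame below uses made-up scores and placeholder task names rather than the actual benchmark output format:

```python
import pandas as pd

# Placeholder per-task scores: one row per task, one column per tuner
# (higher is better here, e.g. AUC or accuracy).
scores = pd.DataFrame(
    {
        "TPE":    [0.91, 0.74, 0.88],
        "Random": [0.89, 0.70, 0.85],
        "Anneal": [0.90, 0.76, 0.86],
    },
    index=["task_a", "task_b", "task_c"],
)

# Rank the tuners within each task (1 = best), then average the ranks
# over all tasks to obtain each tuner's average ranking.
ranks = scores.rank(axis=1, ascending=False)
average_ranking = ranks.mean(axis=0).sort_values()
print(average_ranking)
```

A lower average ranking means the tuner performed better across the benchmark tasks.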
A Benchmark Example
^^^^^^^^^^^^^^^^^^^
As an example, we ran the "nnismall" benchmark on the following 8 tuners: "TPE", "Random", "Anneal", "Evolution", "SMAC", "GPTuner", "MetisTuner", "DngoTuner". (The DngoTuner was not available as a built-in tuner at the time of writing.) As some of the tasks contain a considerable amount of training data, it took about 2 days to run the whole benchmark for one tuner on a single CPU core. For a more detailed description of the tasks, please check ``/examples/trials/benchmarking/automlbenchmark/nni/benchmarks/nnismall_description.txt``.
Does testing one tuner take 2 days?
Yes, if we use our 24-task benchmark and require the tuner to run 100 trials per fold per task (24 tasks × 2 folds × 100 trials ≈ 4,800 trials per tuner). On average, the time cost is less than 1 minute per trial.
Have you run multiple tuners at the same time?
Yes. In my experiments I ran all tuners at the same time and manually aggregated the results afterwards.
So in this case, shall we recommend that users test multiple tuners at the same time by running multiple scripts instead of `script.sh tuner1 tuner2 ...`? Or should we execute each tuner's test in parallel within our script? What is the main time cost of a trial?
I think in our case the main time cost is training. For relatively large benchmarks, serializing the tasks is indeed not optimal. However, letting the user run multiple scripts is also suboptimal, as the user has to manually put the results together before running the result-parsing script.
I suggest adding a flag to the script that lets users choose to run the tasks non-blocking in the background. This could potentially cause file contention, so I will add extra logic to handle it.
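As a sketch of what "run the tasks non-blocking in the background" could look like, the snippet below launches one benchmark run per tuner as a separate process and waits for all of them to finish. The wrapper-script name and tuner list are assumptions for illustration, not the actual interface:

```python
import subprocess

# Assumed wrapper script that runs the benchmark for a single tuner;
# the real script name and arguments may differ.
SCRIPT = "./runbenchmark_nni.sh"
TUNERS = ["TPE", "Random", "Anneal", "Evolution"]

# Start one non-blocking run per tuner, writing each run's output to its
# own log file to avoid contention on a shared file.
processes = []
for tuner in TUNERS:
    log = open(f"{tuner}.log", "w")
    proc = subprocess.Popen([SCRIPT, tuner], stdout=log, stderr=subprocess.STDOUT)
    processes.append((tuner, proc, log))

# Wait for every run to finish before aggregating and parsing the results.
for tuner, proc, log in processes:
    proc.wait()
    log.close()
    print(f"{tuner} finished with exit code {proc.returncode}")
```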
So in this case, shall we recommend that users test multiple tuners at the same time by running multiple scripts instead of `script.sh tuner1 tuner2 ...`? Or should we execute each tuner's test in parallel within our script? What is the main time cost of a trial?
Added in the following commit
… in the background
Add a benchmarking tool for HPO tuners based on the automlbenchmark tool (https://github.com/openml/automlbenchmark).
Currently this tool supports: