This repository has been archived by the owner on Sep 18, 2024. It is now read-only.
HPO Benchmark #3644
Does testing one tuner take 2 days?
Yes, if we use our 24-task benchmark and require the tuner to run 100 trials per fold per task. On average, the time cost is less than 1 minute per trial.
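To make the estimate concrete, here is a quick back-of-the-envelope sketch (assuming a single fold per task, which the thread does not specify):

```python
# Rough check of the "~2 days" figure, assuming one fold per task
# (the actual fold count is not stated in this thread).
tasks = 24
trials_per_task = 100
minutes_per_trial = 1  # "on average ... less than 1 minute per trial"

total_minutes = tasks * trials_per_task * minutes_per_trial
total_days = total_minutes / 60 / 24
print(f"{total_minutes} minutes ~= {total_days:.1f} days")
```

With more than one fold, or trials closer to the 1-minute bound, the total stretches toward the quoted 2 days.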
Have you run multiple tuners at the same time?
Yes. In my experiments I ran all tuners at the same time and manually aggregated the results afterwards.
So in this case, shall we recommend that users test multiple tuners at the same time by running multiple scripts, instead of
script.sh tuner1 tuner2 ...
? Or should our script execute each tuner's test in parallel? What is the main time cost of a trial?
I think in our case the main time cost is training. For relatively large benchmarks, serializing the tasks is indeed not optimal. However, letting the user run multiple scripts is also suboptimal, as the user then has to manually put the results together before running the result-parsing script.
I suggest adding a flag to the script that lets users choose to run the tasks nonblocking in the background. This could potentially cause file contention, so I will add extra logic to handle it.
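A minimal sketch of what the nonblocking mode could look like: each tuner's benchmark is launched as a background subprocess with its own output directory, so the runs cannot contend for the same files. The tuner names are placeholders and `echo` stands in for the real benchmark command.

```python
import pathlib
import subprocess

# Launch each tuner's benchmark nonblocking in the background, writing into
# a separate per-tuner directory to avoid file contention. "echo" is a
# stand-in for the real benchmark command; tuner names are illustrative.
tuners = ["TPE", "Random", "Anneal"]
procs = []
for tuner in tuners:
    outdir = pathlib.Path("results") / tuner  # one directory per tuner
    outdir.mkdir(parents=True, exist_ok=True)
    log = open(outdir / "log.txt", "w")
    procs.append((subprocess.Popen(["echo", f"benchmark for {tuner}"],
                                   stdout=log), log))

for proc, log in procs:
    proc.wait()  # block until every background run has finished
    log.close()

print(sorted(p.name for p in pathlib.Path("results").iterdir()))
```

Keeping outputs in separate directories is what makes the later aggregation step safe, since no two runs ever write to the same file.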
Added in the following commit.
It would be better to highlight what kinds of scores these are.
What does the score mean? Is it specified by us, or can it be defined by the user? Maybe we should explain this in the doc.
The scores in the tables are average rankings. To get this score, the user has to run benchmarks against multiple tuners (either by specifying multiple tuners in the command, or by manually aggregating the results afterwards). I will modify the doc to further clarify this.
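As an illustration of how such an average-ranking score can be computed (the tuner names and metric values below are invented, and the actual benchmark scripts may aggregate differently):

```python
# Sketch of "average ranking" aggregation: for each task, rank the tuners
# by their metric (1 = best), then average each tuner's rank across tasks.
# Ties are not handled in this simplified version.
def average_rankings(results):
    """results: {task: {tuner: metric}}, where a higher metric is better."""
    ranks = {tuner: [] for tuner in next(iter(results.values()))}
    for scores in results.values():
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, tuner in enumerate(ordered, start=1):
            ranks[tuner].append(rank)
    return {tuner: sum(r) / len(r) for tuner, r in ranks.items()}

results = {
    "task1": {"TPE": 0.92, "Random": 0.88, "Anneal": 0.90},
    "task2": {"TPE": 0.85, "Random": 0.89, "Anneal": 0.80},
}
print(average_rankings(results))  # lower average rank = better overall
```

This also shows why the score only makes sense when multiple tuners are benchmarked together: a single tuner's rank is always 1.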