[MetaSchedule][M3c] XGB-based Cost Model #9859
Conversation
Thanks for clarification and that makes lots of sense. LGTM.
One potential issue Ansor faced before is that as the training data grows, the time to train the XGBoost cost model becomes tediously long even though the accuracy no longer improves. What Ansor does is simply reduce the re-training frequency (e.g., re-train every 2 rounds) once the training data size exceeds a threshold. Beyond that, we could also compare the predicted costs against newly measured latencies to decide whether to re-train the model in the next round. These are just my two cents and we could probably revisit this issue in the future.
@comaniac Thanks for the extremely valuable feedback!
That's exactly what I'm observing too! In this particular case, the XGB hyper-parameters might no longer be suitable, which limits the model capacity, and we might have to tweak them to find the best settings.
This is how Ansor deals with it right now. We might consider better heuristics in the future, including switching models, tuning model capacity with AutoML tooling, etc.
Using our current interface, this is pretty simple to do. Anyway, I think we are pretty aligned on the methodology and the path to improvement. Let's work together to improve it in the future!
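For concreteness, here is a minimal sketch of the re-training heuristics discussed above (throttle re-training once the dataset is large, and optionally skip it while the model still predicts well). It is illustrative only; the function and parameter names are hypothetical and not part of this PR or the meta schedule API.

```python
# Hypothetical sketch of the adaptive re-training heuristic discussed above.
import numpy as np

def should_retrain(
    round_idx: int,
    num_samples: int,
    predicted: np.ndarray,   # costs predicted for the latest batch
    measured: np.ndarray,    # latencies actually measured for that batch
    size_threshold: int = 50_000,
    retrain_every: int = 2,
    max_rel_error: float = 0.15,
) -> bool:
    """Decide whether to re-train the cost model this round."""
    # Small dataset: re-training is cheap, always do it.
    if num_samples < size_threshold:
        return True
    # Large dataset: only re-train every `retrain_every` rounds ...
    if round_idx % retrain_every != 0:
        return False
    # ... and even then, skip if the model is still accurate enough.
    rel_error = np.mean(np.abs(predicted - measured) / np.maximum(measured, 1e-9))
    return rel_error > max_rel_error
```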
LGTM.
* [MetaSchedule] XGB-based Cost Model
* Fix lint
* fix doc
* fix mypy
This PR is part of stage M3c of the meta schedule project (#8473).
The architecture was re-designed by Junru and Xiyou. In this PR we introduce an XGB-based cost model built on top of meta schedule's cost model interface. Unit tests are included.
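For readers unfamiliar with XGBoost-based cost models, the sketch below shows the general idea: train a booster on per-candidate feature vectors against normalized throughput scores, then rank new candidates by predicted score. The class and method names are illustrative assumptions, not the actual interface added in this PR.

```python
# A minimal, illustrative sketch of an XGBoost-based cost model; not the
# actual meta schedule implementation or API.
import numpy as np
import xgboost as xgb

class XGBCostModelSketch:
    def __init__(self):
        self.booster = None
        self.params = {
            "max_depth": 6,
            "eta": 0.2,
            "objective": "reg:squarederror",
            "verbosity": 0,
        }

    def update(self, features: np.ndarray, run_secs: np.ndarray) -> None:
        # Train on normalized throughput scores (higher is better),
        # a common choice for tuning cost models.
        scores = np.min(run_secs) / run_secs
        dtrain = xgb.DMatrix(features, label=scores)
        self.booster = xgb.train(self.params, dtrain, num_boost_round=100)

    def predict(self, features: np.ndarray) -> np.ndarray:
        if self.booster is None:
            # No training data yet: return a neutral score for every candidate.
            return np.zeros(features.shape[0])
        return self.booster.predict(xgb.DMatrix(features))
```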
Thanks to all co-authors for contributing!
Co-authored-by: Xiyou Zhou <xiyou@octoml.ai>
Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com>
Co-authored-by: Ruihang Lai <lairuihangdongdong@qq.com>
Co-authored-by: Hongyi Jin <3231950289@qq.com>
Co-authored-by: Wuwei Lin <wuwei@apache.org>
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>