[MetaSchedule] Performance Alignment - NRM and SFM (CUDA) #559

MasterJH5574 · 2021-12-26T12:54:03Z

The performance of batch normalization and softmax on CUDA is now almost aligned with Ansor:

batch normalization is fully aligned, and
softmax still shows a minor gap (around 1%).

To be clear, the gap is definitely not caused by the cross-thread reduction rule. The exact cause is still unknown, partly because I cannot reproduce Ansor's best historical GFLOPS number except when tuning with Ansor itself. Let's dig into this issue later.
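For readers unfamiliar with the term, cross-thread reduction means that several threads of one block cooperate on a single reduction, such as the row sum in softmax. The following is only a minimal pure-Python simulation of that idea, not the schedule rule from this PR; the names `cross_thread_sum` and `softmax_row` and the thread count are hypothetical and chosen for illustration.

```python
import math


def cross_thread_sum(values, num_threads):
    """Simulate a block-level sum in which `num_threads` threads cooperate.

    Hypothetical helper for illustration; `num_threads` must be a power of two.
    """
    assert num_threads > 0 and num_threads & (num_threads - 1) == 0

    # Step 1: each simulated thread accumulates a strided slice of the input.
    partial = [0.0] * num_threads
    for tid in range(num_threads):
        for i in range(tid, len(values), num_threads):
            partial[tid] += values[i]

    # Step 2: tree reduction over the per-thread partial sums, halving the
    # number of active threads at every step (as with shared memory on a GPU).
    stride = num_threads // 2
    while stride > 0:
        for tid in range(stride):
            partial[tid] += partial[tid + stride]
        stride //= 2
    return partial[0]


def softmax_row(row, num_threads=4):
    """Numerically stable softmax of one row, using the simulated reduction."""
    row_max = max(row)
    exps = [math.exp(x - row_max) for x in row]
    denom = cross_thread_sum(exps, num_threads)
    return [e / denom for e in exps]
```

On real hardware, the inner loops of both steps run in parallel across threads, with a synchronization barrier between tree-reduction steps; the simulation only reproduces the order of the additions.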

MasterJH5574 · 2021-12-26T12:57:09Z

@zxybazh Hi Xiyou, could you help go through the changed file list and check whether any of the changed files have already been upstreamed? If so, we need to send PRs to mainline to update those files in the near future. Thanks a lot!
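The check described above amounts to intersecting two file lists. A minimal sketch, assuming both lists are already available as plain path strings (for example from `git diff --name-only` on the PR and `git ls-tree -r --name-only` on mainline); the function name and the paths in the usage example are hypothetical.

```python
def split_by_upstream(changed_files, upstream_files):
    """Split the files changed in a PR by whether they already exist upstream.

    Returns (already_upstreamed, fork_only), both sorted.
    Hypothetical helper for illustration only.
    """
    upstream = set(upstream_files)
    already_upstreamed = sorted(f for f in changed_files if f in upstream)
    fork_only = sorted(f for f in changed_files if f not in upstream)
    return already_upstreamed, fork_only


# Usage with made-up paths: files in the first list need a follow-up PR to mainline.
needs_sync, fork_only = split_by_upstream(
    changed_files=["src/a.cc", "src/b.cc", "python/c.py"],
    upstream_files=["src/a.cc", "python/c.py", "README.md"],
)
```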

zxybazh · 2021-12-26T18:25:30Z

Of course. Most of the changes were introduced in #9761, #9780, #9789 (not merged yet) and #9799 (not merged yet). In these PRs I followed the file list here and upstreamed all files related to my part: Mutators, Schedule Rules and Postprocs are included, but only the header files and the main implementations, with no concrete classes. You may refer to this commit as the latest one on my side. Let me know if I missed anything, thanks!

MasterJH5574 · 2021-12-27T03:40:46Z

@zxybazh Got it, thanks! I went through your PRs and found no conflicts 😉. Since your parts haven't introduced any concrete classes so far, there should be no problem 😆.

[Meta Schedule][M3c] Schedule Rules, Mutator & Postprocs (#485)
[Meta Schedule][M3c] PostOrderApply (#486)
Fix Post Order Apply (#490)
[MetaSchedule] Relay Integration (#489)
[M3c][Meta Schedule] Add Trace Correctness Test for PostOrderApply (#492)
Fix replay trace. (#493)
[M3c][Meta Schedule] Implement the Replay Func class. (#495)
[PR] Test script for meta-schedule task extraction. Interface to load… (#494)
[Meta Schedule Refactor] Get child blocks (#500)
Read-at && Write-at (#497)
[M3c][Meta Schedule] Measure Callbacks (#498)
[Bug] Fix Infinite Loop Caused When Calling Methods Not Overrided In PyClass (#496)
[MetaSchedule] Sample-Perfect-Tile (#501)
[MetaSchedule] TE Workloads (#502)
[TensorIR] GetProducer, GetConsumer (#506)
[MetaScheduleRefactor] Annotate&Unannotate (#505)
[MetaSchedule] Multi-Level-Tiling & Auto-Inline (#503)
[Tests] Add unittests for auto-inline and multi-level-tiling (#508)
[Meta Schedule] Minor Fixes (#507)
[MetaSchedule] Rewrite Cooperative-Fetching / Unbound-Block / Reduction-Block (#509)
[MetaSchedule] Rewrite Parallel-Vectorize-Unroll / Verify-GPU / Disallow-Dynamic-Loops (#499)
[Meta Schedule] Add Helper Function & Minor Modification (#512)
[MetaSchedule] Test for Rewrite Parallel-Vectorize-Unroll (#513)
[Meta Schedule] Feature Extractor & Cost Model (#510)
Blockize & Tensorize (#514)
Layout Rewriting: Suggest-Index-Map (#520)
[MetaSchedule] Parallel-Vectorize-Unroll & Random-Compute-Location (#516)
[Meta Schedule] Per-Store-Feature (#521)
Add traced schedule for blockize & tensorize (#526)
[Meta Schedule] Add XGBoost Model & Random Model (#519)
User-Interface: Tune-TIR (#525)
User-Interface: Tune-TE (#527)
[Minor] More logging on python (#528)
Get CUDA tuning working (#529)
[MetaSchedule] TensorRT BYOC (#518)
[BugFix] LocalBuilder API (#531)
[Meta Schedule] Add Cost Model Update Measure Callback (#530)
[Bugfix] BuilderInput with default params (#532)
[MetaSchedule] Mutator-Tile-Size, Mutate-Parallel, Mutate-Unroll (#534)
[Meta Schedule] Evolutionary Search (#522)
[BugFix] Remove duplicated definition of MakeMultinomialSampler (#535)
[Meta Schedule] Fix some bugs (#537)
Initiate Experiments for CPU Performance Alignment with Ansor (#538)
[Meta Schedule] Tweak experiment scripts (#539)
[Meta Schedule] Initiate experiments on CUDA (#540)
[TIR][Schedule] Buffer transform (#523)
Auto Tensor Core (#524)
Working on Evo Search (#542)
[Meta Schedule] Add Replay Tuning Interface (#543)
Evolutionary Search on CPU (#544)
Misc improvement over the error message (#545)
[TIR][Schedule] Software pipelining (#533)
[Meta Schedule Refactor] fixing unit tests (#547)
[MetaSchedule] Mutator-Compute-Location (#548)
Misc Improvement of Evolutionary Search (#549)
Hotfix for software pipeline (#552)
Misc Improvement (#550)
[Cherry-Pick][TensorIR] Primitive "SetScope" (#9738) (#555)
Rule RFactor (#551)
[MemHammer] Rewrite Rules (#554)
[MetaSchedule] Schedule Rule: Cross-Thread Reduction (#556)
[MetaSchedule] Performance Alignment - NRM and SFM (CUDA) (#559)
[MetaSchedule] Perf Alignment - NRM on CUDA (#560)
[TIR] Reorder the block iters of the blocks generated by RFactor (#561)

Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com>
Co-authored-by: Hongyi Jin <3231950289@qq.com>
Co-authored-by: Ruihang Lai <lairuihangdongdong@qq.com>
Co-authored-by: Junru Shao <junrushao1994@gmail.com>
Co-authored-by: Wuwei Lin <wuwei@apache.org>
Co-authored-by: Sunghyun Park <49998730+sunggg@users.noreply.github.com>
Co-authored-by: Xiyou Zhou <xiyou@octoml.ai>

MasterJH5574 added 11 commits December 25, 2021 08:45

Add rule cross-thread reduction to tune.py

30193e4

Skip undefined objects during simplification

d08425f

Fix postproc RewriteUnboundBlock

cb54c16

Use deep copy in PerStoreFeature

4b57247

Update RewriteUnboundBlock

e86f6bf

Use sampling in rule CrossThreadReduction

061869f

Add a not-fusible case

5408541

Support follow-split in rule cross-thread reduction

b20b181

Add unittest for trace simplification

dc32490

Fix AutoInline

660ecff

Add workload SFM

ca658ab

MasterJH5574 merged commit dea5038 into tlc-pack:meta-schedule-refactor Dec 27, 2021

MasterJH5574 mentioned this pull request Dec 27, 2021

[MetaSchedule] Perf Alignment - NRM on CUDA #560

Merged
