Hermes is a speculative mechanism that accelerates long-latency off-chip load requests by removing on-chip cache access latency from their critical path.
The key idea behind Hermes is to: (1) accurately predict which load requests might go off-chip, and (2) speculatively start fetching the data required by the predicted off-chip loads directly from main memory, in parallel with the cache accesses. Hermes proposes a lightweight, perceptron-based off-chip predictor that identifies off-chip load requests using multiple disparate program features. The predictor is implemented using only tables and simple arithmetic operations like increment and decrement.
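To make the mechanism concrete, here is a minimal, self-contained sketch (not the actual Hermes implementation) of a table-based perceptron predictor in the spirit described above: each program feature indexes a table of signed weights, the selected weights are summed and compared against an activation threshold, and training only increments or decrements the weights. The feature choices, table sizes, and thresholds below are illustrative assumptions.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Illustrative sketch of a perceptron-based off-chip predictor: one small
// table of signed, saturating weights per program feature. All sizes and
// thresholds here are made-up example values, not Hermes' configuration.
constexpr int NUM_FEATURES   = 3;    // e.g., load PC, cacheline offset, page offset
constexpr int TABLE_SIZE     = 1024; // weights per feature table
constexpr int WEIGHT_MAX     = 15;   // saturating weight bounds
constexpr int WEIGHT_MIN     = -16;
constexpr int ACT_THRESHOLD  = 6;    // predict off-chip if weight sum >= threshold

struct PerceptronPredictor {
    std::array<std::array<int8_t, TABLE_SIZE>, NUM_FEATURES> weights{};

    static size_t index(uint64_t feature) { return feature % TABLE_SIZE; }

    // Sum the weight selected by each feature value.
    int sum(const std::array<uint64_t, NUM_FEATURES>& features) const {
        int s = 0;
        for (int f = 0; f < NUM_FEATURES; ++f)
            s += weights[f][index(features[f])];
        return s;
    }

    bool predict(const std::array<uint64_t, NUM_FEATURES>& features) const {
        return sum(features) >= ACT_THRESHOLD;
    }

    // Training uses only increments/decrements: push each selected weight
    // toward the observed outcome (did the load actually go off-chip?).
    void train(const std::array<uint64_t, NUM_FEATURES>& features, bool went_offchip) {
        for (int f = 0; f < NUM_FEATURES; ++f) {
            int8_t& w = weights[f][index(features[f])];
            if (went_offchip && w < WEIGHT_MAX) ++w;
            else if (!went_offchip && w > WEIGHT_MIN) --w;
        }
    }
};
```

Because the tables hold small saturating counters, prediction and training cost only a few table lookups and additions per load.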
Hermes was presented at MICRO 2022.
Rahul Bera, Konstantinos Kanellopoulos, Shankar Balachandran, David Novo, Ataberk Olgun, Mohammad Sadrosadati, Onur Mutlu, "Hermes: Accelerating Long-Latency Load Requests via Perceptron-Based Off-Chip Load Prediction", In Proceedings of the 55th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2022
Hermes is modeled in the ChampSim simulator. This modified simulator version is largely similar to the one used by Pythia [Bera+, MICRO'21], and is fully compatible with all publicly available ChampSim traces.
The infrastructure has been tested with the following system configuration:
- cmake 3.20.2
- gcc v6.3.0
- perl v5.24.1
- xz v5.2.5
- gzip v1.6
- md5sum v8.26
- wget v1.18
- megatools v1.11.0 (Note that v1.9.98 does NOT work)
- Install the necessary prerequisites:

  ```bash
  sudo apt install perl xz-utils gzip
  ```
- Clone the GitHub repo:

  ```bash
  git clone https://github.com/CMU-SAFARI/Hermes.git
  ```
- Set the environment variables:

  ```bash
  cd Hermes/
  source setvars.sh
  ```
- Clone the bloomfilter library inside the Hermes home directory and build it. This should create the static library `libbf.a` inside the `build` directory.

  ```bash
  cd $HERMES_HOME/
  git clone https://github.com/mavam/libbf.git libbf
  cd libbf/
  mkdir build && cd build/
  cmake ../
  make clean && make -j
  ```
- Build Hermes using the build script as follows. This should create the executable inside the `bin` directory.

  ```bash
  cd $HERMES_HOME
  # ./build_champsim.sh <uarch> <l1d_pref> <l2c_pref> <llc_pref> <llc_repl> <ncores> <DRAM-channels> <log-DRAM-channels>
  ./build_champsim.sh glc multi multi multi multi 1 1 0
  ```
Currently, we support two core microarchitectures:
- `glc` (modeled after Intel Golden Cove)
- `firestorm` (modeled after Apple A14)
- Install the megatools executable:

  ```bash
  cd $HERMES_HOME/scripts
  wget --no-check-certificate https://megatools.megous.com/builds/megatools-1.11.1.20230212.tar.gz
  tar -xvf megatools-1.11.1.20230212.tar.gz
  ```

  Note: The megatools link might change in the future depending on the latest release. Please recheck the link if the download fails.
- Use the `download_traces.pl` Perl script to download the ChampSim traces used in our paper:

  ```bash
  cd $HERMES_HOME/traces/
  perl $HERMES_HOME/scripts/download_traces.pl --csv artifact_traces.csv --dir ./
  ```

  Note: The script should download 110 traces. Please check the final log for any incomplete downloads. The total size of all traces is ~36 GB.
- Once the trace download completes, verify the checksums as follows. Please make sure all traces pass the checksum test.

  ```bash
  cd $HERMES_HOME/traces
  md5sum -c artifact_traces.md5
  ```
- If the traces are downloaded to some other path, please change the full path in `experiments/MICRO22_AE.tlist` accordingly.
Our experimental workflow consists of two stages: (1) launching experiments, and (2) rolling up statistics from experiment outputs.
- To create the necessary experiment commands in bulk, we use the `scripts/create_jobfile.pl` script. Please see `scripts/README` for a detailed list of supported arguments and their intended use cases.
- `create_jobfile.pl` requires three arguments:
  - `exe`: the full path of the executable to run
  - `tlist`: contains the trace definitions
  - `exp`: contains the knobs of the experiments to run
- Create the experiments as follows. Please make sure the paths used in the tlist and exp files are correct.

  ```bash
  cd $HERMES_HOME/experiments/
  perl $HERMES_HOME/scripts/create_jobfile.pl --exe $HERMES_HOME/bin/glc-perceptron-no-multi-multi-multi-multi-1core-1ch --tlist MICRO22_AE.tlist --exp MICRO22_AE.exp --local 1 > jobfile.sh
  ```
- Go to a run directory (or create one) inside `experiments` and launch the runs:

  ```bash
  cd $HERMES_HOME/experiments/outputs/
  source ../jobfile.sh
  ```
- If you have slurm support to launch multiple jobs on a compute cluster, pass `--local 0` to `create_jobfile.pl`.
- To roll up stats in bulk, we use the `scripts/rollup.pl` script. Please see `scripts/README` for a detailed list of supported arguments and their intended use cases.
- `rollup.pl` requires three arguments:
  - `tlist`
  - `exp`
  - `mfile`: specifies the stat names and the reduction method for the rollup
- Roll up the statistics as follows. Please make sure the paths used in the tlist and exp files are correct.

  ```bash
  cd $HERMES_HOME/experiments/outputs/
  perl ../../scripts/rollup.pl --tlist ../MICRO22_AE.tlist --exp ../rollup_perf_hermes.exp --mfile ../rollup_perf.mfile > rollup.csv
  ```
- Open the `rollup.csv` file in your favorite data processor (Python pandas, Excel, Numbers, etc.) to gain insights.
McPAT requires an XML file that contains all the necessary statistics (e.g., number of L1D hits, number of L2C hits, etc.) to compute the runtime dynamic power consumption of the processor. We have already provided a template XML file that models our GLC core configuration. Note that this file is only a template, meaning it uses placeholder names for the key statistics (e.g., total cycles).
To generate power consumption stats using McPAT, we follow three key steps: (1) replace the placeholders in the template XML file with the corresponding statistics from the `.out` file generated by the simulator, (2) run McPAT on the generated XML file, and (3) roll up statistics from the McPAT output. We provide scripts to automate this process. Please use the following instructions to run them.
- Check out McPAT in the `mcpat/` directory and compile it:

  ```bash
  cd $HERMES_HOME/mcpat
  git clone https://github.com/HewlettPackard/mcpat
  cd mcpat/
  git checkout v1.3.0
  make
  ```
- Now create the jobfile to run the McPAT experiments:

  ```bash
  cd $HERMES_HOME/experiments
  perl $HERMES_HOME/scripts/create_mcpat_jobfile.pl --exe $HERMES_HOME/scripts/run_mcpat.pl --tlist MICRO22_AE.tlist --exp MICRO22_AE.exp --xmltemplate $HERMES_HOME/mcpat/hermes_glc_template.xml --mcpatexe $HERMES_HOME/mcpat/mcpat/mcpat --statsdir $HERMES_HOME/experiments/outputs/ --outdir $HERMES_HOME/experiments/outputs/ --local 1 > mcpat_jobfile.sh
  ```
This creates a set of jobs, where each job runs the script `run_mcpat.pl` on a `.out` file generated by the simulator. The `run_mcpat.pl` script does three things: (1) creates a new XML file by replacing all placeholder stats in the template XML with the real stats from the `.out` file, (2) saves this new XML file in `outdir`, and (3) runs the McPAT executable on the generated XML file.
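The placeholder-replacement step can be illustrated with a small sketch. The real logic lives in the Perl script `run_mcpat.pl`; the placeholder name `TOTAL_CYCLES` below is made up for this example and need not match the actual template.

```cpp
#include <map>
#include <string>

// Illustrative sketch (not the actual run_mcpat.pl logic): replace every
// occurrence of each placeholder name in a template XML string with the
// corresponding concrete stat value.
std::string fill_template(std::string tmpl,
                          const std::map<std::string, std::string>& stats) {
    for (const auto& [placeholder, value] : stats) {
        size_t pos = 0;
        while ((pos = tmpl.find(placeholder, pos)) != std::string::npos) {
            tmpl.replace(pos, placeholder.size(), value);
            pos += value.size(); // continue searching after the replacement
        }
    }
    return tmpl;
}
```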
- Launch the jobs:

  ```bash
  cd $HERMES_HOME/experiments/outputs/
  source ../mcpat_jobfile.sh
  ```
- Once the runs are complete, roll up the necessary statistics from the McPAT output files using the following script:

  ```bash
  cd $HERMES_HOME/experiments/outputs/
  perl ../../scripts/rollup_mcpat.pl --tlist ../MICRO22_AE.tlist --exp ../MICRO22_AE.exp > rollup_mcpat.csv
  ```
Be careful: the `rollup_mcpat.pl` script is heavily hardcoded. It extracts specific stats (e.g., power consumption of the dcache) from the McPAT-generated output files by grepping and relying on the line numbers of the grepped output. This is not the most robust way to write code, so if you want to make it more flexible, please open a pull request and I will be very happy to merge your contribution.
Hermes was code-named DDRP (Direct DRAM Prefetch) during development, so any mention of DDRP anywhere in the code refers to Hermes.
- The off-chip prediction mechanism is implemented with an extensible interface in mind. The base off-chip predictor class is defined in `inc/offchip_pred_base.h`.
- Nine implementations of the off-chip predictor are shipped out of the box:

  | Predictor type | Description |
  |---|---|
  | Base | Always predicts NO |
  | Basic | Simple confidence counter-based threshold |
  | Random | Random hit-miss predictor with a given positive probability |
  | HMP-Local | Hit-miss predictor [Yoaz+, ISCA'99] with local prediction |
  | HMP-GShare | Hit-miss predictor with GShare prediction |
  | HMP-GSkew | Hit-miss predictor with GSkew prediction |
  | HMP-Ensemble | Hit-miss predictor with all three types combined |
  | TTP | Tag-tracking based predictor |
  | Perc | Perceptron-based OCP used in this paper |
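For background, a GShare-style hit-miss predictor along the lines of [Yoaz+, ISCA'99] can be sketched as follows. This is an illustrative analogue, not the Hermes source: the global history of recent load outcomes is XORed with the load PC to index a table of 2-bit saturating counters. The table size and counter threshold are assumptions for the example.

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative GShare-style hit-miss predictor: global hit/miss history XORed
// with the load PC selects a 2-bit saturating counter. Sizes are made up.
class GShareHMP {
    static constexpr size_t TABLE_BITS = 12;
    static constexpr size_t TABLE_SIZE = size_t{1} << TABLE_BITS;
    uint8_t counters[TABLE_SIZE] = {}; // 2-bit saturating counters, 0..3
    uint64_t history = 0;              // global history of hit/miss outcomes

    size_t index(uint64_t pc) const {
        return (pc ^ history) & (TABLE_SIZE - 1);
    }

public:
    // Predict "miss" (i.e., likely off-chip) when the counter is 2 or above.
    bool predict_miss(uint64_t pc) const { return counters[index(pc)] >= 2; }

    void train(uint64_t pc, bool was_miss) {
        uint8_t& c = counters[index(pc)];
        if (was_miss && c < 3) ++c;
        else if (!was_miss && c > 0) --c;
        history = (history << 1) | (was_miss ? 1 : 0); // shift in the outcome
    }
};
```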
- You can also quickly implement your own off-chip predictor just by extending the `OffchipPredBase` class and implementing your own `predict()` and `train()` functions. For a new type of off-chip predictor, please call the initialization function in `src/offchip_pred.cc`.
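As a rough illustration of this extension pattern, here is a self-contained analogue. The real base class and its method signatures live in `inc/offchip_pred_base.h`; the simplified `predict()`/`train()` signatures below are placeholders, not the actual ChampSim/Hermes interface.

```cpp
#include <cstddef>
#include <cstdint>

// Self-contained analogue of the extension pattern. The signatures are
// simplified placeholders, not those of the real OffchipPredBase class.
class OffchipPredBase {
public:
    virtual ~OffchipPredBase() = default;
    virtual bool predict(uint64_t pc, uint64_t vaddr) = 0; // at LQ-entry creation
    virtual void train(uint64_t pc, uint64_t vaddr, bool went_offchip) = 0; // at LQ-entry release
};

// Example predictor: remember, per hashed PC, whether the last load from that
// PC went off-chip, and predict the same outcome next time.
class LastOutcomePred : public OffchipPredBase {
    static constexpr size_t SIZE = 4096;
    bool last_outcome[SIZE] = {};

public:
    bool predict(uint64_t pc, uint64_t) override {
        return last_outcome[pc % SIZE];
    }
    void train(uint64_t pc, uint64_t, bool went_offchip) override {
        last_outcome[pc % SIZE] = went_offchip;
    }
};
```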
- The off-chip predictor's `predict()` function is called at `src/ooo_cpu.cc:1354`, when an LQ entry gets created. The `train()` function is called at `src/ooo_cpu.cc:2281`, when an LQ entry gets released.
- Please note that, in the out-of-the-box Hermes configuration, only the memory request that goes out of the LLC is marked as off-chip. If a memory request gets merged with another memory request that has already gone off-chip, the waiting memory request will NOT be marked as off-chip. This behavior can be toggled by setting `offchip_pred_mark_merged_load=true`.
- Hermes issues the speculative load requests directly to the main memory controller using the function `issue_ddrp_request()`. This function is only called after the address translation has completed.
1. How much memory and timeout should I allocate for each job?

While most of the experiments needed to reproduce the MICRO'22 key results finish within 4 hours on our Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz, some experiments (e.g., those with different prefetchers) might take considerably longer. We suggest a timeout of 12 hours and a maximum memory of 4 GB per job.
2. A slurm job failed with the error "STEPD TERMINATED DUE TO JOB NOT ENDING WITH SIGNALS". What should I do?

This likely stems from the slurm scheduler. Please rerun the job either in slurm or on a local machine.
3. Some experiments do not end correctly. They show the output "Reached end of trace for Core: 0 Repeating trace"... and the error log says "/usr/bin/xz: Argument list too long". What should I do?

We have occasionally encountered this problem while running jobs in slurm. Please check the xz version on the local machine and rerun the job locally.
4. The `create_jobfile.pl` Perl script cannot find the Trace module. What should I do?

Have you sourced `setvars.sh`? The `setvars.sh` script should set the `PERL5LIB` path variable appropriately to make the library discoverable. If the problem persists, execute the Perl script as follows:

```bash
perl -I$HERMES_HOME/scripts $HERMES_HOME/scripts/create_jobfile.pl # the remaining command
```
If you use this framework, please cite the following paper:
```bibtex
@inproceedings{bera2022,
  author    = {Bera, Rahul and Kanellopoulos, Konstantinos and Balachandran, Shankar and Novo, David and Olgun, Ataberk and Sadrosadati, Mohammad and Mutlu, Onur},
  title     = {{Hermes: Accelerating Long-Latency Load Requests via Perceptron-Based Off-Chip Load Prediction}},
  booktitle = {Proceedings of the 55th Annual IEEE/ACM International Symposium on Microarchitecture},
  year      = {2022}
}
```
Distributed under the MIT License. See `LICENSE` for more information.
Rahul Bera - write2bera@gmail.com
We acknowledge support from SAFARI Research Group's industrial partners.