PatternsHaswellEP
Thomas Roehl edited this page Sep 17, 2015 · 2 revisions
Pattern | Desired events | Available events |
---|---|---|
ALU saturation | Amount of UOPs executed per port, Amount of load/store UOPs, Amount of calculation UOPs | UOPS_EXECUTED_PORT.PORT_(0-7), INST_RETIRED.ANY, CPU_CLK_UNHALTED.THREAD_P, MEM_UOPS_RETIRED.ALL_LOADS, MEM_UOPS_RETIRED.ALL_STORES, AVX_INSTS.CALC |
Bandwidth saturation | Amount of transferred cache lines between L1 and L2, L2 and L3, L3 and Memory including prefetches, snoops, ..., Amount of scalar/packed/vector loads/stores | L1D.REPLACEMENT, L2_TRANS.L1_WB, L2_LINES_IN.ALL, L2_TRANS.L2_WB, UNC_M_CAS_COUNT.RD, UNC_M_CAS_COUNT.WR, UNC_H_IMC_READS.NORMAL, UNC_H_BYPASS_IMC.TAKEN, UNC_H_IMC_WRITES.ALL, AVX_INSTS.LOADS, AVX_INSTS.STORES |
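
As an illustration of how the memory controller events in the bandwidth saturation row might be turned into a metric, the following is a minimal sketch, not part of the original page. It assumes that the `UNC_M_CAS_COUNT.RD` and `UNC_M_CAS_COUNT.WR` counts have already been summed over all memory channels, that each CAS command moves one 64-byte cache line, and that the runtime was measured separately. The function name and its parameters are hypothetical. For example, `memory_bandwidth(1000, 500, 2.0)` yields 32000, 16000 and 48000 bytes/s.

```python
CACHE_LINE_BYTES = 64  # assumption: one CAS command transfers one 64-byte cache line


def memory_bandwidth(cas_rd, cas_wr, runtime_s):
    """Return (read, write, total) memory bandwidth in bytes per second.

    cas_rd, cas_wr: UNC_M_CAS_COUNT.RD / UNC_M_CAS_COUNT.WR counts,
                    already summed over all memory channels (hypothetical inputs).
    runtime_s:      measured runtime of the code region in seconds.
    """
    if runtime_s <= 0:
        raise ValueError("runtime_s must be positive")
    read_bw = cas_rd * CACHE_LINE_BYTES / runtime_s
    write_bw = cas_wr * CACHE_LINE_BYTES / runtime_s
    return read_bw, write_bw, read_bw + write_bw
```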
Pattern | Desired events | Available events |
---|---|---|
Inefficient data access due to excess data volume | Amount of cache lines transferred between cache levels (in and out), Amount of cache hits, Amount of cache misses | L1D.REPLACEMENT, L2_TRANS.L1_WB, L2_LINES_IN.ALL, L2_TRANS.L2_WB, UNC_M_CAS_COUNT.RD, UNC_M_CAS_COUNT.WR, MEM_LOAD_UOPS_RETIRED.L(1/2/3)_HIT, MEM_LOAD_UOPS_RETIRED.L(1/2/3)_MISS |
Inefficient data access due to latency-bound accesses | Latency in cycles for loads and stores, Amount of cache lines transferred between cache levels (in and out), Amount of cache hits, Amount of cache misses | Latency measurements are only available in kernel space with PEBS, L1D.REPLACEMENT, L2_TRANS.L1_WB, L2_LINES_IN.ALL, L2_TRANS.L2_WB, UNC_M_CAS_COUNT.RD, UNC_M_CAS_COUNT.WR, MEM_LOAD_UOPS_RETIRED.L(1/2/3)_HIT, MEM_LOAD_UOPS_RETIRED.L(1/2/3)_MISS |
Limited instruction throughput | Stall and used cycles at decoder, reservation station, all execution ports, reorder buffer and store buffer | UOPS_ISSUED.THREAD, UOPS_EXECUTED.THREAD, UOPS_RETIRED.THREAD, RESOURCE_STALLS.(RS, SB, ROB), UOPS_ISSUED.THREAD:CMASK=0x1:INV=1, UOPS_EXECUTED.THREAD:CMASK=0x1:INV=1, UOPS_RETIRED.THREAD:CMASK=0x1:INV=1, RESOURCE_STALLS.(RS, SB, ROB):CMASK=0x1:INV=1 |
Micro-architectural anomalies | Amount of memory aliasing stalls, Amount of conflict misses, Amount of unaligned loads and stores, Amount of requeues of UOPs, Amount of any other performance-degrading hardware behavior | RESOURCE_STALLS.(RS, SB, ROB), MISALIGN_MEM_REF.ANY, LD_BLOCKS_PARTIAL.ADDRESS_ALIAS, LOCK_CYCLES.CACHE_LOCK_DURATION |
False sharing of cache lines | Amount of modified cache lines transferred from one CPU's private cache to another CPU's cache, Amount of modified cache lines transferred between CPU sockets | MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM, MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM, OFFCORE_RESPONSE:LLC_HIT:HITM_OTHER_CORE, OFFCORE_RESPONSE:LLC_MISS:REMOTE_HITM |
Bad ccNUMA page placement | Amount of cache lines transferred from local memory to a CPU core, Amount of cache lines transferred from remote memory to a CPU core (best with filtering for source memory domain), Amount of data transferred over socket interconnect | UNC_M_CAS_COUNT.RD, UNC_M_CAS_COUNT.WR, RXL_FLITS_G0.DATA, TXL_FLITS_G0.DATA, OFFCORE_RESPONSE:L3_MISS:LOCAL_DRAM, OFFCORE_RESPONSE:L3_MISS:REMOTE_DRAM |
Control flow issues | Amount of all branches, Amount of all mispredicted branches, Amount of retired instructions | BR_INST_RETIRED.ALL_BRANCHES, BR_MISP_RETIRED.ALL_BRANCHES, INST_RETIRED.ANY |
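
The three events listed for control flow issues are usually combined into ratios before they are interpreted. The sketch below shows one way to do that; it is an illustration under assumptions, not a method prescribed by this page. The function name, the dictionary keys and the choice of ratios are hypothetical, and the inputs are assumed to be raw counts of `BR_INST_RETIRED.ALL_BRANCHES`, `BR_MISP_RETIRED.ALL_BRANCHES` and `INST_RETIRED.ANY` for the same measurement region.

```python
def branch_metrics(branches, mispredicted, instructions):
    """Derive branch ratios from raw counter values (hypothetical helper).

    branches:     BR_INST_RETIRED.ALL_BRANCHES
    mispredicted: BR_MISP_RETIRED.ALL_BRANCHES
    instructions: INST_RETIRED.ANY
    """
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return {
        # fraction of retired instructions that are branches
        "branch_rate": branches / instructions,
        # mispredicted branches per retired instruction
        "misprediction_rate": mispredicted / instructions,
        # fraction of branches that were mispredicted (0.0 if no branches retired)
        "misprediction_ratio": mispredicted / branches if branches else 0.0,
    }
```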
Pattern | Desired events | Available events |
---|---|---|
Load imbalance / serial fraction | Amount of "work" instructions, e.g. floating point operations or bit shifts, Amount of cache lines transferred between L1 and CPU core | AVX_INSTS.CALC, L1D.REPLACEMENT, L2_TRANS.L1D_WB |
Synchronization overhead | Amount of "work" instructions, e.g. floating point operations or bit shifts, Amount of halted cycles, Amount of unhalted cycles, Amount of retired instructions | AVX_INSTS.CALC, INST_RETIRED.ANY, CPU_CLK_UNHALTED.THREAD_P, CPU_CLK_UNHALTED.THREAD_P:CMASK=0x1:INV=1 |
Instruction overhead | Amount of "long-latency" instructions, Amount of issued/executed/retired instructions, Amount of floating-point instructions | INST_RETIRED.ANY, CPU_CLK_UNHALTED.THREAD_P, AVX_INSTS.CALC |
Bad code composition due to expensive instructions | Amount of expensive UOPs like divide, sqrt, rand, ..., Amount of retired instructions, Amount of retired UOPs | ARITH.DIVIDER_UOPS, INST_RETIRED.ANY, UOPS_RETIRED.ANY |
Bad code composition due to ineffective instructions | Amount of not work-related instructions, Amount of retired instructions, floating-point instructions separated by scalar, packed and vectorized | INST_RETIRED.ANY, CPU_CLK_UNHALTED.THREAD_P, AVX_INSTS.CALC, AVX_INSTS.LOADS, AVX_INSTS.STORES |
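
For the load imbalance pattern, the per-thread counts of a "work" event such as `AVX_INSTS.CALC` can be compared against each other. The following is a minimal sketch of one possible indicator, the ratio of the busiest thread's count to the mean count. This indicator and the function name are assumptions chosen for illustration and are not defined by this page; the input list stands for per-thread counter values collected for the same region.

```python
def load_imbalance(per_thread_work):
    """Return max/mean of per-thread work counts (hypothetical indicator).

    1.0 means perfectly balanced work; a value equal to the number of
    threads means a single thread did all of the work (serial execution).
    """
    if not per_thread_work:
        raise ValueError("need at least one per-thread count")
    mean = sum(per_thread_work) / len(per_thread_work)
    if mean == 0:
        return 0.0  # no work instructions counted on any thread
    return max(per_thread_work) / mean
```

With four threads, `load_imbalance([100, 100, 100, 100])` gives 1.0, while `load_imbalance([400, 0, 0, 0])` gives 4.0, which would point to a serial fraction.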