Add support for Intel SapphireRapids (SPR) (#524)
* Intel Sapphire Rapids: Core files with FIXED, PMC, VOLTAGE, THERMAL and RAPL

* Allow same registers for access as Intel IcelakeX

* Add Intel Sapphire Rapids IDs and strings

* By default add the fixed TOPDOWN_SLOTS event

* Add CPU feature detection

* Add energy monitoring interface

* Add general in-core hardware performance monitoring

* Add TOPDOWN_SLOTS to perf_event backend

* Add support for hardware thread monitoring on Intel SapphireRapids

* Remove MEM* groups, no support for uncore yet

* Full Uncore support for Intel SapphireRapids

* Fixed for multi-socket systems

* Fix for direct access mode

* Fix for debug output in perf_event backend

* Add MEM groups

* Fixes for HBM units and HBM group

* Combined group for DDR and HBM measurements

* Add HBM_SP and HBM_DP group, similar to MEM_SP/DP on SPR

* Add unit [FLOP/Byte] to operational intensity. See #541

* Remove empty line.

* Reset all bits of temporary variable

* Some more checks in Intel's uncore discovery method

* Revert setting return value to zero. Breaks lookup

* Add missing M2PCIe units

* Uncore Discovery: don't use memcpy but own byte-wise copy for reliable results

* Add missing MDF units

* Complete event file

* Need more register indices
TomTheBear authored Oct 20, 2023
1 parent 734cb94 commit 2c684a9
Showing 47 changed files with 13,418 additions and 126 deletions.
915 changes: 915 additions & 0 deletions doc/archs/sapphirerapids.md

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions doc/likwid-doxygen.md
@@ -106,6 +106,7 @@ Optionally, a global configuration file \ref likwid.cfg can be given to modify s
- \subpage icelake
- \subpage icelakesp
- \subpage tigerlake
- \subpage sapphirerapids

\subsubsection Architectures_AMD AMD®
- \subpage k8
32 changes: 32 additions & 0 deletions groups/SPR/BRANCH.txt
@@ -0,0 +1,32 @@
SHORT Branch prediction miss rate/ratio

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
FIXC3 TOPDOWN_SLOTS
PMC0 BR_INST_RETIRED_ALL_BRANCHES
PMC1 BR_MISP_RETIRED_ALL_BRANCHES

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Branch rate PMC0/FIXC0
Branch misprediction rate PMC1/FIXC0
Branch misprediction ratio PMC1/PMC0
Instructions per branch FIXC0/PMC0

LONG
Formulas:
Branch rate = BR_INST_RETIRED_ALL_BRANCHES/INSTR_RETIRED_ANY
Branch misprediction rate = BR_MISP_RETIRED_ALL_BRANCHES/INSTR_RETIRED_ANY
Branch misprediction ratio = BR_MISP_RETIRED_ALL_BRANCHES/BR_INST_RETIRED_ALL_BRANCHES
Instructions per branch = INSTR_RETIRED_ANY/BR_INST_RETIRED_ALL_BRANCHES
-
The rates state how often, on average, a branch or a mispredicted branch occurred
per retired instruction. The branch misprediction ratio states what fraction of
all branch instructions were mispredicted.
Instructions per branch is 1/branch rate.
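
The four BRANCH formulas above can be checked by hand with a small script. This is only an illustrative sketch, not part of LIKWID: the function name `branch_metrics` and the sample counts are made up, and the arguments stand in for the raw values of FIXC0, PMC0 and PMC1.

```python
def branch_metrics(instr_retired, branches, mispredicted):
    """Compute the BRANCH group metrics from raw event counts.

    instr_retired -> INSTR_RETIRED_ANY (FIXC0)
    branches      -> BR_INST_RETIRED_ALL_BRANCHES (PMC0)
    mispredicted  -> BR_MISP_RETIRED_ALL_BRANCHES (PMC1)
    """
    if instr_retired <= 0 or branches <= 0:
        raise ValueError("instruction and branch counts must be positive")
    return {
        "Branch rate": branches / instr_retired,
        "Branch misprediction rate": mispredicted / instr_retired,
        "Branch misprediction ratio": mispredicted / branches,
        "Instructions per branch": instr_retired / branches,
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to make the arithmetic easy to follow.
    for name, value in branch_metrics(1_000_000, 200_000, 10_000).items():
        print(f"{name}: {value:.4f}")
```

With these sample counts one in five instructions is a branch and one in twenty branches is mispredicted, so the rate and the ratio differ by a factor of five.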

24 changes: 24 additions & 0 deletions groups/SPR/CLOCK.txt
@@ -0,0 +1,24 @@
SHORT Power and Energy consumption

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
FIXC3 TOPDOWN_SLOTS
PWR0 PWR_PKG_ENERGY

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Energy [J] PWR0
Power [W] PWR0/time

LONG
Formulas:
Power = PWR_PKG_ENERGY / time
-
Sapphire Rapids implements the RAPL interface. This interface enables monitoring
of the consumed energy at the package (socket) level.
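
The METRICS lines of this group can be reproduced outside LIKWID as a sanity check. The sketch below assumes that `inverseClock` is the reciprocal of the nominal CPU clock in Hz; the function name `clock_metrics`, the parameter `base_clock_hz` and all sample values are hypothetical.

```python
def clock_metrics(fixc0, fixc1, fixc2, pkg_energy_joule, time_s, base_clock_hz):
    """Evaluate the CLOCK group metrics for one hardware thread.

    fixc0 -> INSTR_RETIRED_ANY, fixc1 -> CPU_CLK_UNHALTED_CORE,
    fixc2 -> CPU_CLK_UNHALTED_REF, pkg_energy_joule -> PWR_PKG_ENERGY.
    """
    # Assumption: inverseClock = 1 / nominal clock frequency in Hz.
    inverse_clock = 1.0 / base_clock_hz
    return {
        "Runtime unhalted [s]": fixc1 * inverse_clock,
        "Clock [MHz]": 1.0e-06 * (fixc1 / fixc2) / inverse_clock,
        "CPI": fixc1 / fixc0,
        "Energy [J]": pkg_energy_joule,
        "Power [W]": pkg_energy_joule / time_s,
    }


if __name__ == "__main__":
    # Made-up values: a 2 GHz nominal clock with the core running at 1.5x.
    result = clock_metrics(6.0e9, 3.0e9, 2.0e9, 150.0, 2.0, 2.0e9)
    for name, value in result.items():
        print(f"{name}: {value:.3f}")
```

The ratio FIXC1/FIXC2 scales the nominal frequency, which is why the clock metric can exceed the nominal value when the core runs at a turbo frequency.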

23 changes: 23 additions & 0 deletions groups/SPR/DATA.txt
@@ -0,0 +1,23 @@
SHORT Load to store ratio

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
FIXC3 TOPDOWN_SLOTS
PMC0 MEM_INST_RETIRED_ALL_LOADS
PMC1 MEM_INST_RETIRED_ALL_STORES

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Load to store ratio PMC0/PMC1

LONG
Formulas:
Load to store ratio = MEM_INST_RETIRED_ALL_LOADS/MEM_INST_RETIRED_ALL_STORES
-
This is a metric to determine your load to store ratio.

112 changes: 112 additions & 0 deletions groups/SPR/DDR_HBM.txt
@@ -0,0 +1,112 @@
SHORT Memory bandwidth in MBytes/s for DDR and HBM

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
FIXC3 TOPDOWN_SLOTS
MBOX0C0 CAS_COUNT_RD
MBOX0C1 CAS_COUNT_WR
MBOX1C0 CAS_COUNT_RD
MBOX1C1 CAS_COUNT_WR
MBOX2C0 CAS_COUNT_RD
MBOX2C1 CAS_COUNT_WR
MBOX3C0 CAS_COUNT_RD
MBOX3C1 CAS_COUNT_WR
MBOX4C0 CAS_COUNT_RD
MBOX4C1 CAS_COUNT_WR
MBOX5C0 CAS_COUNT_RD
MBOX5C1 CAS_COUNT_WR
MBOX6C0 CAS_COUNT_RD
MBOX6C1 CAS_COUNT_WR
MBOX7C0 CAS_COUNT_RD
MBOX7C1 CAS_COUNT_WR
MBOX8C0 CAS_COUNT_RD
MBOX8C1 CAS_COUNT_WR
MBOX9C0 CAS_COUNT_RD
MBOX9C1 CAS_COUNT_WR
MBOX10C0 CAS_COUNT_RD
MBOX10C1 CAS_COUNT_WR
MBOX11C0 CAS_COUNT_RD
MBOX11C1 CAS_COUNT_WR
MBOX12C0 CAS_COUNT_RD
MBOX12C1 CAS_COUNT_WR
MBOX13C0 CAS_COUNT_RD
MBOX13C1 CAS_COUNT_WR
MBOX14C0 CAS_COUNT_RD
MBOX14C1 CAS_COUNT_WR
MBOX15C0 CAS_COUNT_RD
MBOX15C1 CAS_COUNT_WR
HBM0C0 CAS_COUNT_RD
HBM0C1 CAS_COUNT_WR
HBM1C0 CAS_COUNT_RD
HBM1C1 CAS_COUNT_WR
HBM2C0 CAS_COUNT_RD
HBM2C1 CAS_COUNT_WR
HBM3C0 CAS_COUNT_RD
HBM3C1 CAS_COUNT_WR
HBM4C0 CAS_COUNT_RD
HBM4C1 CAS_COUNT_WR
HBM5C0 CAS_COUNT_RD
HBM5C1 CAS_COUNT_WR
HBM6C0 CAS_COUNT_RD
HBM6C1 CAS_COUNT_WR
HBM7C0 CAS_COUNT_RD
HBM7C1 CAS_COUNT_WR
HBM8C0 CAS_COUNT_RD
HBM8C1 CAS_COUNT_WR
HBM9C0 CAS_COUNT_RD
HBM9C1 CAS_COUNT_WR
HBM10C0 CAS_COUNT_RD
HBM10C1 CAS_COUNT_WR
HBM11C0 CAS_COUNT_RD
HBM11C1 CAS_COUNT_WR
HBM12C0 CAS_COUNT_RD
HBM12C1 CAS_COUNT_WR
HBM13C0 CAS_COUNT_RD
HBM13C1 CAS_COUNT_WR
HBM14C0 CAS_COUNT_RD
HBM14C1 CAS_COUNT_WR
HBM15C0 CAS_COUNT_RD
HBM15C1 CAS_COUNT_WR


METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
DDR read bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX8C0+MBOX9C0+MBOX10C0+MBOX11C0+MBOX12C0+MBOX13C0+MBOX14C0+MBOX15C0)*64.0/time
DDR read data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX8C0+MBOX9C0+MBOX10C0+MBOX11C0+MBOX12C0+MBOX13C0+MBOX14C0+MBOX15C0)*64.0
DDR write bandwidth [MBytes/s] 1.0E-06*(MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1+MBOX8C1+MBOX9C1+MBOX10C1+MBOX11C1+MBOX12C1+MBOX13C1+MBOX14C1+MBOX15C1)*64.0/time
DDR write data volume [GBytes] 1.0E-09*(MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1+MBOX8C1+MBOX9C1+MBOX10C1+MBOX11C1+MBOX12C1+MBOX13C1+MBOX14C1+MBOX15C1)*64.0
DDR bandwidth [MBytes/s] 1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX8C0+MBOX9C0+MBOX10C0+MBOX11C0+MBOX12C0+MBOX13C0+MBOX14C0+MBOX15C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1+MBOX8C1+MBOX9C1+MBOX10C1+MBOX11C1+MBOX12C1+MBOX13C1+MBOX14C1+MBOX15C1)*64.0/time
DDR data volume [GBytes] 1.0E-09*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX8C0+MBOX9C0+MBOX10C0+MBOX11C0+MBOX12C0+MBOX13C0+MBOX14C0+MBOX15C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1+MBOX8C1+MBOX9C1+MBOX10C1+MBOX11C1+MBOX12C1+MBOX13C1+MBOX14C1+MBOX15C1)*64.0
HBM read bandwidth [MBytes/s] 1.0E-06*(HBM0C0+HBM1C0+HBM2C0+HBM3C0+HBM4C0+HBM5C0+HBM6C0+HBM7C0+HBM8C0+HBM9C0+HBM10C0+HBM11C0+HBM12C0+HBM13C0+HBM14C0+HBM15C0)*64.0/time
HBM read data volume [GBytes] 1.0E-09*(HBM0C0+HBM1C0+HBM2C0+HBM3C0+HBM4C0+HBM5C0+HBM6C0+HBM7C0+HBM8C0+HBM9C0+HBM10C0+HBM11C0+HBM12C0+HBM13C0+HBM14C0+HBM15C0)*64.0
HBM write bandwidth [MBytes/s] 1.0E-06*(HBM0C1+HBM1C1+HBM2C1+HBM3C1+HBM4C1+HBM5C1+HBM6C1+HBM7C1+HBM8C1+HBM9C1+HBM10C1+HBM11C1+HBM12C1+HBM13C1+HBM14C1+HBM15C1)*64.0/time
HBM write data volume [GBytes] 1.0E-09*(HBM0C1+HBM1C1+HBM2C1+HBM3C1+HBM4C1+HBM5C1+HBM6C1+HBM7C1+HBM8C1+HBM9C1+HBM10C1+HBM11C1+HBM12C1+HBM13C1+HBM14C1+HBM15C1)*64.0
HBM bandwidth [MBytes/s] 1.0E-06*(HBM0C0+HBM1C0+HBM2C0+HBM3C0+HBM4C0+HBM5C0+HBM6C0+HBM7C0+HBM8C0+HBM9C0+HBM10C0+HBM11C0+HBM12C0+HBM13C0+HBM14C0+HBM15C0+HBM0C1+HBM1C1+HBM2C1+HBM3C1+HBM4C1+HBM5C1+HBM6C1+HBM7C1+HBM8C1+HBM9C1+HBM10C1+HBM11C1+HBM12C1+HBM13C1+HBM14C1+HBM15C1)*64.0/time
HBM data volume [GBytes] 1.0E-09*(HBM0C0+HBM1C0+HBM2C0+HBM3C0+HBM4C0+HBM5C0+HBM6C0+HBM7C0+HBM8C0+HBM9C0+HBM10C0+HBM11C0+HBM12C0+HBM13C0+HBM14C0+HBM15C0+HBM0C1+HBM1C1+HBM2C1+HBM3C1+HBM4C1+HBM5C1+HBM6C1+HBM7C1+HBM8C1+HBM9C1+HBM10C1+HBM11C1+HBM12C1+HBM13C1+HBM14C1+HBM15C1)*64.0

LONG
Formulas:
DDR read bandwidth [MBytes/s] = 1.0E-06*(SUM(MBOX*C0))*64.0/runtime
DDR read data volume [GBytes] = 1.0E-09*(SUM(MBOX*C0))*64.0
DDR write bandwidth [MBytes/s] = 1.0E-06*(SUM(MBOX*C1))*64.0/runtime
DDR write data volume [GBytes] = 1.0E-09*(SUM(MBOX*C1))*64.0
DDR bandwidth [MBytes/s] = 1.0E-06*(SUM(MBOX*C0)+SUM(MBOX*C1))*64.0/runtime
DDR data volume [GBytes] = 1.0E-09*(SUM(MBOX*C0)+SUM(MBOX*C1))*64.0
HBM read bandwidth [MBytes/s] = 1.0E-06*(SUM(HBM*C0))*64.0/runtime
HBM read data volume [GBytes] = 1.0E-09*(SUM(HBM*C0))*64.0
HBM write bandwidth [MBytes/s] = 1.0E-06*(SUM(HBM*C1))*64.0/runtime
HBM write data volume [GBytes] = 1.0E-09*(SUM(HBM*C1))*64.0
HBM bandwidth [MBytes/s] = 1.0E-06*(SUM(HBM*C0)+SUM(HBM*C1))*64.0/runtime
HBM data volume [GBytes] = 1.0E-09*(SUM(HBM*C0)+SUM(HBM*C1))*64.0
--
Profiling group to measure the memory bandwidth drawn by all cores of a socket for DDR
as well as HBM. Since this group is based on Uncore events, it can only be measured on a
per-socket basis. Some of the counters may not be available on your system.
The group also outputs the total data volume transferred for each memory technology.
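
The SUM(MBOX*C0) and SUM(HBM*C1) shorthand in the formulas above expands to a sum over all 16 units of each type, with every CAS count worth one 64-byte cache line. A minimal sketch of that arithmetic follows. The function name `memory_metrics`, the dictionary layout and the sample counts are hypothetical, and treating a missing counter as zero is an assumption of this sketch, not documented LIKWID behaviour.

```python
CACHE_LINE_BYTES = 64.0


def memory_metrics(counts, prefix, runtime_s, units=16):
    """Bandwidth and volume for one memory type ("MBOX" for DDR or "HBM").

    counts maps counter names such as "MBOX3C0" to raw CAS counts.
    C0 holds CAS_COUNT_RD and C1 holds CAS_COUNT_WR, as in the EVENTSET.
    """
    # Assumption: counters absent on this system contribute zero.
    reads = sum(counts.get(f"{prefix}{i}C0", 0) for i in range(units))
    writes = sum(counts.get(f"{prefix}{i}C1", 0) for i in range(units))
    return {
        "read bandwidth [MBytes/s]": 1.0e-06 * reads * CACHE_LINE_BYTES / runtime_s,
        "read data volume [GBytes]": 1.0e-09 * reads * CACHE_LINE_BYTES,
        "write bandwidth [MBytes/s]": 1.0e-06 * writes * CACHE_LINE_BYTES / runtime_s,
        "write data volume [GBytes]": 1.0e-09 * writes * CACHE_LINE_BYTES,
        "bandwidth [MBytes/s]": 1.0e-06 * (reads + writes) * CACHE_LINE_BYTES / runtime_s,
        "data volume [GBytes]": 1.0e-09 * (reads + writes) * CACHE_LINE_BYTES,
    }


if __name__ == "__main__":
    # Made-up counts for a two-second measurement.
    sample = {"MBOX0C0": 1_000_000, "MBOX0C1": 500_000, "HBM0C0": 2_000_000}
    for prefix, label in (("MBOX", "DDR"), ("HBM", "HBM")):
        for name, value in memory_metrics(sample, prefix, 2.0).items():
            print(f"{label} {name}: {value:.3f}")
```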


35 changes: 35 additions & 0 deletions groups/SPR/DIVIDE.txt
@@ -0,0 +1,35 @@
SHORT Divide unit information

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
FIXC3 TOPDOWN_SLOTS
PMC0 ARITH_FPDIV_COUNT
PMC1 ARITH_FPDIV_ACTIVE
PMC2 ARITH_IDIV_COUNT
PMC3 ARITH_IDIV_ACTIVE


METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Number of FP divide ops PMC0
Avg. FP divide unit usage duration PMC1/PMC0
Number of INT divide ops PMC2
Avg. INT divide unit usage duration PMC3/PMC2

LONG
Formulas:
Number of FP divide ops = ARITH_FPDIV_COUNT
Avg. FP divide unit usage duration = ARITH_FPDIV_ACTIVE/ARITH_FPDIV_COUNT
Number of INT divide ops = ARITH_IDIV_COUNT
Avg. INT divide unit usage duration = ARITH_IDIV_ACTIVE/ARITH_IDIV_COUNT
-
This performance group measures the average latency of divide operations.
The Intel Sapphire Rapids architecture performs FP and INT divide operations
on different ports (P0 and P1 respectively).
The COUNT events are the ACTIVE event with the edge detect bit set to count only
the activation of the unit.
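
Because COUNT records activations of the divide unit and ACTIVE records the cycles it was busy, their quotient is the average number of cycles per divide. A small sketch of that calculation follows. The function name `avg_divide_duration` and the sample values are hypothetical, and returning `None` when no divide was counted is a choice made for this sketch only.

```python
def avg_divide_duration(active_cycles, activations):
    """Average cycles the divide unit stays busy per operation.

    active_cycles -> ARITH_FPDIV_ACTIVE or ARITH_IDIV_ACTIVE
    activations   -> ARITH_FPDIV_COUNT or ARITH_IDIV_COUNT
    """
    if activations == 0:
        # No divide was executed, so there is no meaningful average.
        return None
    return active_cycles / activations


if __name__ == "__main__":
    # Made-up counts: 200 divides that kept the unit busy for 4000 cycles.
    print(avg_divide_duration(4000, 200))
```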
41 changes: 41 additions & 0 deletions groups/SPR/ENERGY.txt
@@ -0,0 +1,41 @@
SHORT Power and Energy consumption

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
FIXC3 TOPDOWN_SLOTS
TMP0 TEMP_CORE
PWR0 PWR_PKG_ENERGY
PWR1 PWR_PP0_ENERGY
PWR3 PWR_DRAM_ENERGY
PWR4 PWR_PLATFORM_ENERGY



METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Temperature [C] TMP0
Energy [J] PWR0
Power [W] PWR0/time
Energy PP0 [J] PWR1
Power PP0 [W] PWR1/time
Energy DRAM [J] PWR3
Power DRAM [W] PWR3/time
Energy PLATFORM [J] PWR4
Power PLATFORM [W] PWR4/time

LONG
Formulas:
Power = PWR_PKG_ENERGY / time
Power PP0 = PWR_PP0_ENERGY / time
Power DRAM = PWR_DRAM_ENERGY / time
Power PLATFORM = PWR_PLATFORM_ENERGY / time
-
Sapphire Rapids implements the RAPL interface. This interface enables monitoring
of the consumed energy at the package (socket), PP0, DRAM and
platform level.

26 changes: 26 additions & 0 deletions groups/SPR/FLOPS_AVX.txt
@@ -0,0 +1,26 @@
SHORT Packed AVX MFLOP/s

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
FIXC3 TOPDOWN_SLOTS
PMC0 FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE
PMC1 FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE
PMC2 FP_ARITH_INST_RETIRED_512B_PACKED_SINGLE
PMC3 FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
Packed SP [MFLOP/s] 1.0E-06*(PMC0*8.0+PMC2*16.0)/time
Packed DP [MFLOP/s] 1.0E-06*(PMC1*4.0+PMC3*8.0)/time

LONG
Formulas:
Packed SP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE*8+FP_ARITH_INST_RETIRED_512B_PACKED_SINGLE*16)/runtime
Packed DP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE*4+FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE*8)/runtime
-
Packed 256-bit and 512-bit AVX FLOP rates for single and double precision.
35 changes: 35 additions & 0 deletions groups/SPR/FLOPS_DP.txt
@@ -0,0 +1,35 @@
SHORT Double Precision MFLOP/s

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
FIXC3 TOPDOWN_SLOTS
PMC0 FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE
PMC1 FP_ARITH_INST_RETIRED_SCALAR_DOUBLE
PMC2 FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE
PMC3 FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
DP [MFLOP/s] 1.0E-06*(PMC0*2.0+PMC1+PMC2*4.0+PMC3*8.0)/time
AVX DP [MFLOP/s] 1.0E-06*(PMC2*4.0+PMC3*8.0)/time
AVX512 DP [MFLOP/s] 1.0E-06*(PMC3*8.0)/time
Packed [MUOPS/s] 1.0E-06*(PMC0+PMC2+PMC3)/time
Scalar [MUOPS/s] 1.0E-06*PMC1/time
Vectorization ratio 100*(PMC0+PMC2+PMC3)/(PMC0+PMC1+PMC2+PMC3)

LONG
Formulas:
DP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE*2+FP_ARITH_INST_RETIRED_SCALAR_DOUBLE+FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE*4+FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE*8)/runtime
AVX DP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE*4+FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE*8)/runtime
AVX512 DP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE*8)/runtime
Packed [MUOPS/s] = 1.0E-06*(FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE+FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE+FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE)/runtime
Scalar [MUOPS/s] = 1.0E-06*FP_ARITH_INST_RETIRED_SCALAR_DOUBLE/runtime
Vectorization ratio = 100*(FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE+FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE+FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE)/(FP_ARITH_INST_RETIRED_SCALAR_DOUBLE+FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE+FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE+FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE)
-
Scalar and packed (SSE, AVX and AVX-512) double precision FLOP rates.
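
The weights 2, 4 and 8 in the formulas above are the number of 64-bit elements in a 128-, 256- and 512-bit vector. The sketch below evaluates the group's metrics from raw counts. The function name `flops_dp_metrics` and the sample counts are made up, and returning 0.0 for the vectorization ratio when nothing was counted is an assumption of this sketch.

```python
def flops_dp_metrics(p128, scalar, p256, p512, time_s):
    """Evaluate the FLOPS_DP metrics.

    p128 -> PMC0, scalar -> PMC1, p256 -> PMC2, p512 -> PMC3,
    matching the EVENTSET order of the group file.
    """
    packed = p128 + p256 + p512
    total = packed + scalar
    return {
        "DP [MFLOP/s]": 1.0e-06 * (p128 * 2.0 + scalar + p256 * 4.0 + p512 * 8.0) / time_s,
        "AVX DP [MFLOP/s]": 1.0e-06 * (p256 * 4.0 + p512 * 8.0) / time_s,
        "AVX512 DP [MFLOP/s]": 1.0e-06 * (p512 * 8.0) / time_s,
        "Packed [MUOPS/s]": 1.0e-06 * packed / time_s,
        "Scalar [MUOPS/s]": 1.0e-06 * scalar / time_s,
        # Assumption: report 0.0 instead of dividing by zero.
        "Vectorization ratio": 100 * packed / total if total else 0.0,
    }


if __name__ == "__main__":
    # Made-up counts for a one-second run.
    for name, value in flops_dp_metrics(1_000_000, 2_000_000, 1_000_000, 1_000_000, 1.0).items():
        print(f"{name}: {value:.2f}")
```

Note that the vectorization ratio counts instructions, not FLOPs, so a single 512-bit instruction weighs the same as a scalar one.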

37 changes: 37 additions & 0 deletions groups/SPR/FLOPS_HP.txt
@@ -0,0 +1,37 @@
SHORT Half Precision MFLOP/s

EVENTSET
FIXC0 INSTR_RETIRED_ANY
FIXC1 CPU_CLK_UNHALTED_CORE
FIXC2 CPU_CLK_UNHALTED_REF
FIXC3 TOPDOWN_SLOTS
PMC0 FP_ARITH_INST_RETIRED2_SCALAR
PMC1 FP_ARITH_INST_RETIRED2_128B_PACKED_HALF
PMC2 FP_ARITH_INST_RETIRED2_256B_PACKED_HALF
PMC3 FP_ARITH_INST_RETIRED2_512B_PACKED_HALF

METRICS
Runtime (RDTSC) [s] time
Runtime unhalted [s] FIXC1*inverseClock
Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
CPI FIXC1/FIXC0
HP [MFLOP/s] 1.0E-06*(PMC0+PMC1*8.0+PMC2*16.0+PMC3*32.0)/time
128B HP [MFLOP/s] 1.0E-06*(PMC1*8.0)/time
256B HP [MFLOP/s] 1.0E-06*(PMC2*16.0)/time
512B HP [MFLOP/s] 1.0E-06*(PMC3*32.0)/time
Packed [MUOPS/s] 1.0E-06*(PMC1+PMC2+PMC3)/time
Scalar [MUOPS/s] 1.0E-06*PMC0/time
Vectorization ratio 100*(PMC1+PMC2+PMC3)/(PMC0+PMC1+PMC2+PMC3)

LONG
Formulas:
HP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED2_SCALAR+FP_ARITH_INST_RETIRED2_128B_PACKED_HALF*8+FP_ARITH_INST_RETIRED2_256B_PACKED_HALF*16+FP_ARITH_INST_RETIRED2_512B_PACKED_HALF*32)/runtime
128B HP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED2_128B_PACKED_HALF*8)/runtime
256B HP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED2_256B_PACKED_HALF*16)/runtime
512B HP [MFLOP/s] = 1.0E-06*(FP_ARITH_INST_RETIRED2_512B_PACKED_HALF*32)/runtime
Packed [MUOPS/s] = 1.0E-06*(FP_ARITH_INST_RETIRED2_128B_PACKED_HALF+FP_ARITH_INST_RETIRED2_256B_PACKED_HALF+FP_ARITH_INST_RETIRED2_512B_PACKED_HALF)/runtime
Scalar [MUOPS/s] = 1.0E-06*FP_ARITH_INST_RETIRED2_SCALAR/runtime
Vectorization ratio [%] = 100*(FP_ARITH_INST_RETIRED2_128B_PACKED_HALF+FP_ARITH_INST_RETIRED2_256B_PACKED_HALF+FP_ARITH_INST_RETIRED2_512B_PACKED_HALF)/(FP_ARITH_INST_RETIRED2_SCALAR+FP_ARITH_INST_RETIRED2_128B_PACKED_HALF+FP_ARITH_INST_RETIRED2_256B_PACKED_HALF+FP_ARITH_INST_RETIRED2_512B_PACKED_HALF)
-
Scalar and packed half precision FLOP rates new in Sapphire Rapids.
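
The factors 8, 16 and 32 used in the METRICS section follow from dividing the vector width by the 16-bit width of a half precision element. The sketch below derives the factors that way and evaluates the group's metrics. The names `lanes` and `flops_hp_metrics` and the sample counts are hypothetical, and returning 0.0 for an empty vectorization ratio is an assumption of this sketch.

```python
HALF_BITS = 16  # width of one half precision element


def lanes(vector_bits, element_bits=HALF_BITS):
    """Number of elements that fit into one vector register."""
    return vector_bits // element_bits


def flops_hp_metrics(scalar, p128, p256, p512, time_s):
    """Evaluate the FLOPS_HP metrics (scalar -> PMC0, p128..p512 -> PMC1..PMC3)."""
    packed = p128 + p256 + p512
    total = packed + scalar
    flops = scalar + p128 * lanes(128) + p256 * lanes(256) + p512 * lanes(512)
    return {
        "HP [MFLOP/s]": 1.0e-06 * flops / time_s,
        "128B HP [MFLOP/s]": 1.0e-06 * p128 * lanes(128) / time_s,
        "256B HP [MFLOP/s]": 1.0e-06 * p256 * lanes(256) / time_s,
        "512B HP [MFLOP/s]": 1.0e-06 * p512 * lanes(512) / time_s,
        "Packed [MUOPS/s]": 1.0e-06 * packed / time_s,
        "Scalar [MUOPS/s]": 1.0e-06 * scalar / time_s,
        # Assumption: report 0.0 instead of dividing by zero.
        "Vectorization ratio": 100 * packed / total if total else 0.0,
    }


if __name__ == "__main__":
    # Made-up counts for a one-second run.
    for name, value in flops_hp_metrics(1_000_000, 1_000_000, 1_000_000, 1_000_000, 1.0).items():
        print(f"{name}: {value:.2f}")
```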
