Skip to content

Latest commit

 

History

History
458 lines (287 loc) · 20.4 KB

Zfinx_spec.adoc

File metadata and controls

458 lines (287 loc) · 20.4 KB

Zfinx Specification

Overview

Zfinx is an extension which changes all existing and future floating point extensions which use the F floating point registers, so that instead they use the X registers. Hence the name F-in-X. This does not affect floating point instructions which are implemented as part of the Vector (v) extension. Zfinx additionally removes all

  1. floating point load instructions (e.g. FLW)

  2. floating point store instructions (e.g. FSW)

  3. integer to/from floating point register move instructions (e.g. FMV.X.W)

In all cases the integer versions of these are required instead.

On a Zfinx core the assembler syntax of floating point instructions changes so that they only refer to X registers. Therefore on an RV32F core this is legal syntax:

FLW     fa4, 12(sp)        //load floating point data
FMADD.S fa1, fa2, fa3, fa4 //floating point arithmetic for RV32F

On a Zfinx core, this syntax must be used as the F registers are not implemented, and FLW is not a supported instruction.

LW      a4, 12(sp)     //load integer data
FMADD.S a1, a2, a3, a4 //floating point arithmetic for RV32F Zfinx

Note that only the assembler syntax differs between the two FMADD.S instructions, the encoding is the same.

The assembler syntax changes to avoid code-porting bugs, so that the registers must be updated and not just reused from non-Zfinx code

Zfinx may be used with any extensions which uses F registers. The number of integer registers does not affect Zfinx (I or E extensions) although the relative sizes of XLEN and FLEN do affect the specification.

This specification uses D for 64-bit floating point, F for 32-bit floating point and Zfh for 16-bit floating point. Zfinx behviour is only affected by the data width so future formats are implicitly supported, e.g. 64 or 32-bit POSIT formats.

Table 1. supported Zfinx configurations
Architecture Comment

RV32IFD Zfinx

XLEN<FLEN

RV32IFD Zfh Zfinx

XLEN<FLEN

RV32F Zfinx

XLEN==FLEN

RV32F Zfh Zfinx

XLEN==FLEN

RV64FD Zfinx

XLEN==FLEN

RV64FD Zfh Zfinx

XLEN==FLEN

RV64F Zfinx

XLEN>FLEN

RV64F Zfh Zfinx

XLEN>FLEN

Note that RV32FD [Zfh] Zfinx requires register pairs so is more complex than the other cases.

RV128 and the Q extension are not covered by this specification, but it is simple to extend this specification to include them.

Semantic Differences

The NaN-boxing behaviour of floating point arithmetic instructions is modified to suppress checking of sources only. Floating point results are always NaN-boxed to XLEN bits.

NaN-boxing checking is removed as integer loads do not NaN-box their result, and so loading fewer than XLEN bits (for example using LW to load floating point data on an RV64 core) would otherwise require NaN-boxing in software which wastes performance and code-size

There are no other semantic differences for floating point instruction behaviour between a Zfinx and a non-Zfinx core, but there are some differences for special cases (such as x0 handling) as listed later in this specification.

Discovery

If Zfinx is specified then the compiler will have the following #define set

__riscv_zfinx

So software can use this to choose between Zfinx or normal versions of floating point code.

Privileged code can detect whether Zfinx is implemented by checking if:

  1. mstatus.FS is hardwired to zero, and

  2. misa.F is 1 at reset, or is writeable

Non-privileged code can detect whether Zfinx is implemented as follows.

li a0, 0 # set a0 to zero

#ifdef __riscv_zfinx

fneg.s a0, a0 # this will invert a0

#else

fneg.s fa0, fa0 # this will invert fa0

#endif

If a0 is non-zero then it’s a Zfinx core, otherwise it’s a non-Zfinx core. Both branches result in the same encoding, but the assembly syntax is different for each variant

mstatus.fs

For Zfinx cores mstatus.fs is hardwired to zero, because all the integer registers already form part of the current context. Note however that fcsr needs to be saved and restored. This gives a performance advantage when saving/restoring contexts.

Floating point instructions and fcsr accesses do not trap if mstatus.fs=0. This is different to non-Zfinx cores.

Register pair handling for XLEN < FLEN

For RV32D, all D-extension instructions which are implemented with Zfinx will access register pairs:

  1. The specified register must be even, odd registers will cause an illegal instruction exception

  2. Even registers will cause an even/odd pair to be accessed

    1. Accessing Xn will cause the {Xn+1, Xn} pair to be accessed. For example if n = 2

      1. X2 is the least significant half (bits [31:0]) for little endian mode

      2. X3 the most significant half (bits [63:32]) for little endian mode

    2. For big endian mode the register mapping is reversed, so X2 is the most significant half, and X3 is the least significant half.

  3. X0 has special handling

    1. Reading {X1, X0} will read all zeros

    2. Writing {X1, X0} will discard the entire result, it will not write to X1

The register pairs are only used by the floating point arithmetic instructions. All integer loads and stores will only access XLEN bits, not FLEN.

Note:

  1. Zp64 from the P-extension specifies consistent register pair handling.

  2. Big endian mode is enabled in M-mode if mstatus.MBE=1, in S-mode if mstatus.SBE=1, or in U-mode if mstatus.UBE=1

x0 register target

If a floating point instruction targets x0 then it will still execute, and will set any required flags in fcsr. It will not write to a target register. This matches the non-Zfinx behaviour for

fcvt.w.s x0, f0

If the floating point source is invalid then it will set the fflags.NV bit, regardless of whether Zfinx is implemented. The target register is not written as it is x0.

If fcsr.RM is in an illegal state then floating point instruction behaviour is the same whether the target register is x0 is not, i.e. targetting x0 doesn’t disable any execution side effects.

In the case of RV32D Zfinx, register pairs are used. See above for x0 handling.

NaN-boxing

For Zfinx the NaN-boxing is limited to XLEN bits, not FLEN bits. Therefore a FADD.S executed on an RV64D core will write a 64-bit value (the MSH will be all 1’s). On an RV32D Zfinx core it will write a 32-bit register, i.e. a single X register only. This means there is semantic difference between these code sequences:

#ifdef __riscv_zfinx

fadd.s x2, x3, x4 # only write x2 (32-bits), x3 is not written

#else

fadd.s f2, f3, f4 # NaN-box 64-bit f2 register to 64-bits

#endif

NaN-box generation is supported by Zfinx implementations. NaN-box checking is not supported by scalar floating point instructions. For example for RV64F:

#ifdef __riscv_zfinx

lw[u] x1, 0(sp)   # load 32-bits into x1 and sign / zero extend upper 32-bits
fadd.s x1, x1, x1 # use x1 but do not check source is Nan-boxed, NaN-box output

#else

flw.s  f1, 0(sp)  # load 32-bits into f1 and NaN-box to 64-bits (set upper 32-bits to 0xFFFFFFFF)
fadd.s f2, f1, f1 # check f1 is NaN-boxed, NaN-box output

#endif

Floating point loads are not supported on Zfinx cores so x1 is not NaN-boxed in the example above, therefore the FADD.S instruction does not check the input for NaN-boxing. The result of FADD.S is NaN-boxed, which means setting the upper half of the output register to all 1’s.

The table shows the effect of writing each possible width of value to the register file for all supported combinations. Note that Verilog syntax is used in the final column.

Table 2. NaN-boxing for supports configurations
XLEN Width of write to Xreg from FP instruction Value written to Xreg

64

16

{48{1’b1}, result[15:0]}

32

16

{16{1’b1}, result[15:0]}

64

32

{32{1’b1}, result[31:0]}

32

32

result[31:0]

64

64

result[63:0]

Little endian

32

64

EvenXreg: result[31:0]

Odd Xreg: result[63:32]

special handling Xreg={0, 1}

Big endian

32

64

Odd Xreg: result[31:0]

EvenXreg: result[63:32]

special handling Xreg={0, 1}

Therefore, for example, if an FADD.S instruction is issued on an RV64F core then the upper 32-bits will be set to one in the target integer register, or an FADD.H (floating point add half-word) instruction will set the upper 48-bits to one.

Assembly Syntax and Code Porting

Any references to F registers, or removed instructions will cause assembler errors.

For example, the encoding for

FMADD.S <1>, <2>, <3>, <4>

will disassemble and execute as

FMADD.S f1, f2, f3, f4

on a non-Zfinx core, or

FMADD.S x1, x2, x3, x4

on a Zfinx core.

We considered allowing pseudo-instructions for the deleted instructions for easier code porting. For example allowing FLW to be a pseudo-instruction for LW, but decided not to. Because the register specifiers must change to integer registers, it makes sense to also remove the use of FLW etc. In this way the user is forced to rewrite their code for a Zfinx core, reducing the chance of undiscovered porting bugs. This only affects assembly code, high level language code is unaffected as the compiler will target the correct architecture.

Replaced Instructions

All floating point loads, stores and floating point to integer moves are removed on a Zfinx core. The following three tables give suggested replacements.

Table 3. replacements for floating point load instructions
Instruction RV32F Zfh Zfinx RV32D Zfh Zfinx RV64F Zfh Zfinx RV64D Zfh Zfinx RV32F Zfinx RV32D Zfinx RV64F Zfinx RV64D Zfinx

loads

suggested replacement instructions

FLD frd, offset(xrs1)

reserved

LW,LW

LD

reserved

LW, LW

LD

FLW frd, offset(xrs1)

LW

LW[U] and NaN-box in software

LW

LW[U] and NaN-box in software

FLH frd, offset(xrs1)

LH[U] and NaN-box in software

reserved

C.FLD frd’, offset(xrs1’)

reserved

[C.]LW,[C.]LW

[C.]LD

reserved

[C.]LW,[C.]LW

[C.]LD

C.FLDSP frd, uimm(x2)

reserved

C.LWSP,C.LWSP

C.LDSP

reserved

C.LWSP,C.LWSP

C.LDSP

C.FLW frd, offset(xrs1)

C.LW

C.LW and NaN-box in software

C.LW

C.LW and NaN-box in software

C.FLWSP frd, uimm(x2)

C.LWSP

C.LWSP and NaN-box in software

C.LWSP

C.LWSP and NaN-box in software

Table 4. replacements for floating point store instructions
Instruction RV32F Zfh Zfinx RV32D Zfh Zfinx RV64F Zfh Zfinx RV64D Zfh Zfinx RV32F Zfinx RV32D Zfinx RV64F Zfinx RV64D Zfinx

stores

suggested replacement instructions

FSD frd, offset(xrs1)

reserved

SW,SW

SD

reserved

SW, SW

SD

FSW frd, offset(xrs1)

SW

FSH frd, offset(xrs1)

SH

reserved

C.FSD frd’, offset(xrs1’)

reserved

[C.]SW,[C.]SW

[C.]SD

reserved

[C.]SW,[C.]SW

[C.]SD

C.FSDSP frd, uimm(x2)

reserved

C.SWSP,C.SWSP

C.SDSP

reserved

C.SWSP,C.SWSP

C.SDSP

C.FSW frd, offset(xrs1)

C.SW

C.FSWSP frd, uimm(x2)

C.SWSP

Table 5. replacements for floating point move instructions
Instruction RV32F Zfh Zfinx RV32D Zfh Zfinx RV64F Zfh Zfinx RV64D Zfh Zfinx RV32F Zfinx RV32D Zfinx RV64F Zfinx RV64D Zfinx

moves

suggested replacement instructions

FMV.X.D xrd, frs1

reserved

MV,MV

reserved

MV

reserved

MV,MV

reserved

MV

FMV.D.X frd, xrs1

reserved

MV,MV

reserved

MV

reserved

MV,MV

reserved

MV

FMV.X.W xrd, frs1

MV

MV and sign extend in software

MV

MV and sign extend in software

FMV.W.X frd, xrs1

MV

MV and NaN-box in software

MV

MV and NaN-box in software

FMV.X.H xrd, frs1

MV and sign extend in software

reserved

FMV.H.X frd, xrs1

MV and NaN-box in software

reserved

Notes:

  1. Where a floating point load loads fewer than XLEN bits then software NaN-boxing in software is required to get the same semantics as a non-Zfinx core

  2. Where a floating point move moves fewer than XLEN bits then either sign extension (if the target is an X register) or NaN-boxing (if the target is an F register) is required in software to get the same semantics

The B-extension is useful for sign extending and NaN-boxing.

To sign-extend using the B-extension:

FMV.X.H rd, rs1

is replaced by

SEXT.H rd, rs1

Without the B-extension two instructions are required: shift left 16 places, then arithmetic shift right 16 places.

NaN boxing in software is more involved, as the upper part of the register must be set to 1. The B-extension is also helpful in this case.

FMV.H.X a0, a1

is replaced by

C.ADDI a2, zero, -1

PACK a0, a1, a2

Emulation

A non-Zfinx core can run a Zfinx binary. M-mode software can do this:

  1. Set mstatus.fs=0 to cause every floating point instruction to trap

  2. When a floating point instruction traps, move the source operands from the X registers to the equivalent F registers (i.e. the same register numbers)

  3. Set mstatus.fs to be non-zero

  4. Execute the original instruction which caused the trap

  5. Move the result from the destination F register to the X register / X register pair (For RV32D)

  6. Set mstatus.fs=0

  7. MRET

There are corner cases around the use of x0 and register pairs for RV32D

  1. Two 32-bit X registers must be transferred to a single 64-bit F register to set up the source operands. This must be done by saving each X register to consecutive memory locations, and using a 64-bit floating point load (FLD or C.FLD) to load the data

  2. One 64-bit F register must be transferred to two 32-bit X registers to receive the result. This must be done with a 64-bit floating point store (FSD or C.FSD) and then two 32-bit loads (such as LW or C.LW).

  3. If the source register pair is {x1,x0}, the source data will read as all zeroes. Therefore f0 must be loaded with a 64-bit zero constant from memory.

  4. If the destination register pair is {x1,x0} then the full output is discarded, do not transfer the resulting data to the {x1,x0} register pair which would result in the upper half being written to x1

A Zfinx core cannot trap on floating point instructions by setting mstatus.fs=0, so the reverse emulation isn’t possible. The code must be recompiled (or ported for assembler).

ABI

For details of the current calling conventions see:

https://github.com/riscv/riscv-elf-psabi-doc/blob/master/riscv-elf.md C The ABI when using Zfinx is the standard integer calling convention as listed in the table below.

The Zfinx ABI can be thought of as being similar to using the softfloat routines to execute floating point functionality, but replacing the call to the softfloat function with the actual floating point ISA instruction.

Note that RV32D Zfinx requires register pair handling. This does not require an ABI change as long types are already supported using register pairs. It is likely to require some work in the compiler (according to Jim Wilson).

Floating Point Configurations To Reduce Area

To reduce the area overhead of FPU hardware new configurations will make the F[N]MADD.*, F[N]MSUB.* and FDIV.*, FSQRT.*` instructions optional in hardware. This then gives the choice of implementing them in software instead by:

  1. Taking an illegal instruction trap, and calling the required software routine in the trap handler. This requires that the opcodes are not reallocated and gives binary compatibility between cores with/without hardware support for F[N]MADD.*, F[N]MSUB.* and FDIV.*, FSQRT.*, but is lower performance than option 2

  2. Use the GCC options below so that a software library is used to execute them

This argument already exists for RISCV

gcc -mno-fdiv

This argument exists for other architectures (e.g. MIPs) but not for RISCV, so it needs to be added

gcc -mno-fused-madd

To achieve this we break all current and future floating point extensions into three parts: Zf*base, Zfma and Zfdiv. Zfinx is orthogonal, and so is an additional modifier to these as described below.

Options, all start with Zf Meaning

Zfhbase

Support half precision base instructions

Zffbase

Support single precision base instructions

Zfdbase

Support double precision base instructions

Zfqbase

Support quad precision base instructions

Zfldstmv

Support load,store and integer to/from FP move for all FP extensions

Zfma

Support multiply-add for all FP extensions

Zfdiv

Support div/sqrt for all FP extensions

Zfinx

Share the integer register file for all FP extensions

So the Zfldstmv, Zfma, Zfdiv, Zfinx options apply to all floating point extensions, including future ones. This keeps the support regular across the different options.

Therefore RV32FD Zfh Zfinx can also be expressed as:

rv32_Zfhbase_Zffbase_Zfdbase_Zfma_Zfdiv_Zfinx

Also RV32FD Zfh can be expressed as:

rv32_Zfhbase_Zffbase_Zfdbase_Zfldstmv_Zfma_Zfdiv

The options are designed to be additive, none of them remove instructions.

Rationale, why implement Zfinx?

Small embedded cores which need to implement floating point extensions have some options:

  1. Use software emulation of floating point instructions, so don’t implement a hardware FPU which gives minimum core area

    1. The floating point library can be large, and expensive in terms of ROM or flash storage, costing power and energy consumption

    2. The performance of this solution is very low

  2. Low core area floating point implementations

    1. Share the integer registers for floating point instructions (Zfinx)

      1. Will cause more register spills/fills than having a separate register file, but the effect of this is application dependant

      2. No need for special instructions such as load and stores to access floating point registers, and moves between integer and floating point registers

    2. There are still performance/area tradeoffs to make for the FPU design itself

      1. e.g. pipelined versus iterative

    3. Optionally remove multiply-add instructions to save area in the FPU and a register file read port

    4. Optionally remove divide/square root instructions to to save area in the FPU

  3. Dedicated FPU registers, and higher performance FPU implementations use the most area

    1. Separate floating point registers allow fewer register spills/fills, and can also be used for integer code to prevent spilling to memory

    2. There are the same performance/area tradeoffs for the FPU design

Zfinx is implemented to allow core area reduction as the area of the F register file is significant, for example:

  1. RV32IF Zfinx saves 1/2 the register file state compared to RV32IF

  2. RV32EF Zfinx saves 2/3 the register file state compared to RV32EF

Therefore Zfinx should allow for small embedded cores to support floating point with

  1. Minimal area increase

  2. Similar context switch time as an integer only core

    1. there are no F registers to save/restore

  3. Reduced code size by removing the floating point library