RISC-V LR/SC Translation #75

mastercaution · 2022-03-24T16:01:08Z

In contrast to ARM, an LR/SC sequence (the code between LR and SC) is very limited on RISC-V platforms. It may contain at most 16 instructions, and only a subset of the base "I" and "C" instruction sets is permitted. Since additional loads and stores are also excluded, instrumenting an instruction inside the sequence will most likely turn it into an "unconstrained LR/SC loop", causing the trailing SC to always fail on our test device. The ISA only guarantees that "constrained LR/SC loops" eventually succeed.

The way unconstrained LR/SC loops are handled is considered a hardware implementation detail. On a SiFive U54, unconstrained LR/SC loops will never succeed, resulting in deadlocks in some cases.

The approach to fix this issue is to translate the LR/SC sequence into a mixture of a software-emulated sequence and a hardware atomic sequence. The following figure hopefully gives you an idea of how it works:

The actual implementation stores the value of register x into the dbm_thread structure and only uses one temporary scratch register. The ordering flags aq and rl were not considered in the software emulation part (where LR is replaced by LD), which may lead to side effects, although we did not encounter any.

Benchmarks

In terms of performance, the implementation seems to have no negative effect on real-world applications. In all four applications, LR/SC sequences were executed 40-60 times per run.

| Benchmark | ref | dbm | dbm + Atomic Translation |
|---|---|---|---|
| Primes (execution time) | 1 | 1.04 | 1.04 |
| GCC (execution time) | 1 | 1.04 | 1.04 |
| SHA1 (execution time) | 1 | 12.70 | 12.76 |
| CoreMark (score) | 1 | 13.52 | 13.52 |

Fix unhandled ELF vector types on Linux kernel 5.12+ with glibc 2.34+.

In contrast to ARM, an LR/SC sequence (the code between LR and SC) is very limited on RISC-V platforms. It may contain at most 16 instructions, and only a subset of the base "I" and "C" instruction sets is permitted. Since additional loads and stores are also excluded, instrumenting an instruction inside the sequence will most likely turn it into an "unconstrained LR/SC loop", causing the trailing SC to always fail on our test device. The ISA only guarantees that "constrained LR/SC loops" eventually succeed. An LR/SC loop may span two or more basic blocks, which makes the translation somewhat complex. For now, one scratch register is used to save the original value read by LR, and the loop is translated into a mix of a software-emulated and a hardware atomic sequence. The scratch register is hardcoded to x31 (t6), which could interfere with a function that uses x31 and contains this translation, but it seems to work for most programs (luckily).

Changes the translation of atomic sequences with lr/sc to use a shadow register in memory (in `dbm_thread` struct) instead of a hard-coded CPU register.

mastercaution · 2022-03-25T14:12:44Z

An issue with the dbm reference benchmark led to far better scores in SHA1. I corrected the values in both PRs.

mastercaution added 3 commits March 23, 2022 13:17

Ignore ELF cache vectors (glibc 2.34+)

2b7c676

Fix atomic translation to use shadow register

9ab7e06

mastercaution marked this pull request as ready for review March 25, 2022 14:12
