Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

initial implementation of the Sail-generated RISCV disassembler module #2498

Draft
wants to merge 4 commits into
base: next
Choose a base branch
from

Conversation

moste00
Copy link

@moste00 moste00 commented Oct 4, 2024

Your checklist for this pull request

  • I've documented or updated the documentation of every API function and struct this PR changes.
  • I've added tests that prove my fix is effective or that my feature works (if possible)

Detailed description

This PR aims to replace the LLVM-derieved RISCV module with a Sail-derieved RISCV module. The generator tool is being developed here, and for the Sail model of RISCV is here.

Sail is an architecture description language being developed here, it's an imperative language inspired in syntax and semantics by OCaml, with some syntax sugar and innovative features designed specifically for describing computer architectures. See here for a detailed tour and explanation of major features.

The RISCV foundation has adopted the Sail model of RISCV as the "official" definition of the architecture, and therefore it's desirable to generate a C implementation of the any RISCV-related logic from the sail-riscv model, as it will be up-to-date and compliant by construction.

Test plan

The current state of the module doesn't compile, this will be updated as work continues on the module. The initial goal of the work is to be able to invoke cstool and obtain useful results (e.g. the instruction in string form, as a start). Hopefully this goal is not too far.

Closing issues

...

@@ -2,13 +2,8 @@
/* RISC-V Backend By Rodrigo Cortes Porto <porto703@gmail.com> &
Shawn Chang <citypw@gmail.com>, HardenedLinux@2018 */

#ifdef CAPSTONE_HAS_RISCV
//#ifdef CAPSTONE_HAS_RISCV
Copy link
Contributor

@wargio wargio Oct 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ifdef is required by capstone to slim down the build when you compile only for certain archs.

Comment on lines 1 to 18
#include <stdint.h>

#include <stddef.h>

#include <string.h>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add at the beginning a multiline comment saying that this SAIL generated from the riscv repository YYYYYY with commit UUUUU

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also the copyright. Will be necessary for the SPDX in the future: #2132

uint64_t imm_18_13 = (binary_stream & 0x000000007E000000)>>25 ;
uint64_t imm_19 = (binary_stream & 0x0000000080000000)>>31 ;
tree->ast_node_type = RISCV_JAL ;
tree->ast_node.riscv_jal.imm = (imm_19 << 23) | (imm_7_0 << 15) | (imm_8 << 14) | (imm_18_13 << 8) | (imm_12_9 << 4) | (0x0 << 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm sure the compiler is smart enough to compile out | (0x0 << 0) but this is essentially dead code, maybe add a check in the sail2c code that checks for this pattern and omit the value.

Copy link
Collaborator

@Rot127 Rot127 Oct 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can also go into a helper function imho. Something like inline void concat_bits(uint8_t *dest, ...);. For which the vargs should be of the form: uint64_t bits, uint64_t offset.

I think the following reads better:

concat_bits((uint8_t*)&tree->ast_node.riscv_jal.imm, imm_19, 23, imm_7_0, 15, imm_8, 14, imm_18_13, 8, imm_12_9, 4);

Also passing bytes as pointer to uint8_t array to support future 192 bit extension.

What are your guys opinions @XVilka @wargio?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if concat_bits is needed since is just an OR. i think as is now is ok.

Copy link
Contributor

@wargio wargio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall is a very good progress, i would suggest tho to maybe split arch/RISCV/riscv_ast2str.gen.inc into multiple files since it too big, maybe split it by RV32 and RV64

@XVilka

Copy link
Contributor

@XVilka XVilka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thestr4ng3r take a look too, please, when you have time.

Comment on lines 1 to 18
#include <stdint.h>

#include <stddef.h>

#include <string.h>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also the copyright. Will be necessary for the SPDX in the future: #2132

uint8_t riscv_zicboz /* bits : 5 */;
} ast_node;

} ;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary space. Also, I think, enums could be extracted to the top level

tree->ast_node.rtype.op = RISCV_SLT;
return ;
}
if ((binary_stream & 0x000000000000007F == 0x33) && ((binary_stream & 0x0000000000007000)>>12 == 0x3) && ((binary_stream & 0x00000000FE000000)>>25 == 0x00)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please split these too long lines in the generator into two.

Copy link
Collaborator

@Rot127 Rot127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the prototype/first implementation the decode is fine like this. But before we can merge it into next we need to optimize two things:

Size

      uint64_t rd = (binary_stream & 0x0000000000000F80)>>7 ;
      uint64_t rs1 = (binary_stream & 0x00000000000F8000)>>15 ;
      uint64_t rs2 = (binary_stream & 0x0000000001F00000)>>20 ;
      tree->ast_node_type = RISCV_RTYPE ;
      tree->ast_node.rtype.rs2 = rs2;
      tree->ast_node.rtype.rs1 = rs1;
      tree->ast_node.rtype.rd = rd;

These specific lines are repeated 10 times in the decoder.
I assume there are other decoding patterns happening just as often. In the final version we should not have any duplicated code in here.

Runtime complexity

I greped for ^ if and found 505 if cases in the decode function. This means for an illegal instructions it does at least 505 comparisons (assuming the compiler doesn't optimize something out). Which is something more than ~O(n * 10) (n = number of bits).
But we should reach in worst case O(n * 1) and O(log(n)) on average before we merge it to next.

The current structure is fine. Also because you have the RzIL task as well. So no worries.

What is important though, is that the decoded details (operand details) are stable. No matter how the architecture of this decoder is. Because on once you finished RzIL we would not want to refactor the whole RzIL work, just because we optimized the Capstone decoder :)

That said, good job! Looks like a lot of work! Well done!

} else if ((first_byte >> 6) & 0x1 == 0x0) {
insn->size = 8;
} else {
return false;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fprintf a warning here that instructions >64bit are not supported yet.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or do it below where you check the result of this function. Either way, just please inform the user about it.

@@ -0,0 +1,3 @@
#include "capstone.h"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#include "capstone.h"
#include <capstone/capstone.h>

@@ -0,0 +1,5 @@
#include "../../include/capstone/capstone.h"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#include "../../include/capstone/capstone.h"
#include <capstone/capstone.h>

RISCV_AMOMAXU
} op;

uint8_t aq /* bits : 1 */;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These /* bits : 1 */ comments, what do they mean?

  1. aq encodes bit 1 of instruction.
  2. aq is one bit wide.

Please make this more clear. E.g. for the first meaning you could replace it with insn_bits[1:1]. And for the second meaning: bit_width : 1.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guess it means the last, though being more descriptive doesn't hurt. This one has a low priority though.


}
if (op != 0xFFFFFFFFFFFFFFFF) {
uint64_t rd = (binary_stream & 0x0000000000000F80)>>7 ;
Copy link
Collaborator

@Rot127 Rot127 Oct 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please implement a helper function like inline uint64_t get_bit_field(uint8_t *bytes, size_t start, size_t n); for those. It can be in utils.h

Mind that the bytes are passed as a pointer! So we can go beyond 64bit in the future. The array size of bytes can be assumed by the requested bits. If start > 64 we know the array should be at >8bytes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These helper functions are important because they allow bit extraction of >64bit instructions. The currently generated code has a maximum of 64bit supported.

uint64_t imm_18_13 = (binary_stream & 0x000000007E000000)>>25 ;
uint64_t imm_19 = (binary_stream & 0x0000000080000000)>>31 ;
tree->ast_node_type = RISCV_JAL ;
tree->ast_node.riscv_jal.imm = (imm_19 << 23) | (imm_7_0 << 15) | (imm_8 << 14) | (imm_18_13 << 8) | (imm_12_9 << 4) | (0x0 << 0);
Copy link
Collaborator

@Rot127 Rot127 Oct 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can also go into a helper function imho. Something like inline void concat_bits(uint8_t *dest, ...);. For which the vargs should be of the form: uint64_t bits, uint64_t offset.

I think the following reads better:

concat_bits((uint8_t*)&tree->ast_node.riscv_jal.imm, imm_19, 23, imm_7_0, 15, imm_8, 14, imm_18_13, 8, imm_12_9, 4);

Also passing bytes as pointer to uint8_t array to support future 192 bit extension.

What are your guys opinions @XVilka @wargio?

tree->ast_node.riscv_jal.rd = rd;
return ;
}
if ((binary_stream & 0x000000000000007F == 0x67) && ((binary_stream & 0x0000000000007000)>>12 == 0x0)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The binary_stream & 0x000000000000007F == 0x67 also in a helper please. E.g.
inline bool test_bits64(uint64_t bytes, size_t start, size_t n, uint64_t expected)

@XVilka
Copy link
Contributor

XVilka commented Oct 23, 2024

@moste00 please update the PR with your latest state of the generated code


RISCV_INS_ENDING,
} riscv_insn;
#include "riscv_insn.gen.inc"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not ok, but we can write a script to update this.

Copy link
Collaborator

@Rot127 Rot127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very much like what I see! Awesome job!
Especially the changes to riscv_decode.gen.inc :)

Please focus on just making it work. You can ignore my comments for now. They are just there so we don't forget about it.

Because I could only take a shallow look, I'll check again in the next days.

@@ -1,3 +1,3 @@
#include "capstone.h"
#include "../include/capstone/capstone.h"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#include "../include/capstone/capstone.h"
#include <capstone/capstone.h>


} else if (insn->size == 4) {
struct ast instruction;
decode(&instruction, code[3] << 24 | code[2] << 16 | code[1] << 8 | code[0]);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For these byte extractions we have readBytes32 and others in utils.c. Please use those.

Comment on lines 37 to 40
while (*curr != ' ') {
mnemonic_len++;
curr++;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use strchr for this. You can get the length by subtracting the pointers.

Comment on lines 42 to 45
while (*curr) {
operand_len++;
curr++;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

Comment on lines 49 to 28
} else {

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
} else {

@@ -2,26 +2,38 @@
/* RISC-V Backend By Rodrigo Cortes Porto <porto703@gmail.com> &
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Set your copyright. Please use SPDX header style:

# Copyright © 2022 Rot127 <unisono@quyllur.org>
# SPDX-License-Identifier: BSD-3

DEF_HEX_BITS(31)
DEF_HEX_BITS(32)

void hex_bits_signed(uint64_t bitvec, char **s, size_t *len, uint8_t bvlen_bits,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either make these functions inline, or do not put into header file

@@ -0,0 +1,984 @@
/*=======================================================================*/
/*This code was generated by the tool riscv_disasm_from_sail (see
* https://github.com/moste00/riscv_disasm_from_sail)*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Should rename the project, maybe since it's capstone auto-sync, something like capstone-autosync-sail?
  2. Once the repository transferred to rizinorg, also update the URL

cc @wargio

/*SPDX-License-Identifier: BSD-3-Clause*/
/*=======================================================================*/

#ifndef __Riscv_insn_gen_inc__
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use capital letters for include guards

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also above in the other files please.

#ifndef __Riscv_insn_gen_inc__
#define __Riscv_insn_gen_inc__
#include <stdint.h>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary empty line between includes

#include <string.h>

enum riscv_insn {
//--------------------- RISCV_REV8---------------------
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe think about converting these into Doxygen, if it's ever possible?


void decode_compressed(struct ast *tree, uint64_t binary_stream) {
// ---------------------------C_NOP-------------------------------
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary additional {} braces

Copy link
Author

@moste00 moste00 Nov 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh those are necessary, because many of the If conditions and the associated context use variables with the same identifier. This wasn't a problem in the original Sail because every rule match forms a different scope, but it's a problem now in C.

The single easiest solution is adding those braces to start a new scope for each rule. I guess I could also solve the problem by making a new function decode_<instruction_case> for every rule to achieve the same effect, but that would complicate the generator a bit.

*ps = " , "; \
*plen = 3

static inline void hex_bits(uint64_t bitvec, char **s, size_t *len,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add Doxygen comments for these helper functions. Also, maybe move to the capstone utils instead? cc @Rot127

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, please move it to `utils.c

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, you can just use sprintf(). We depend on libc anyways.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, ignore it. See my comment in the final review message.

Copy link
Collaborator

@Rot127 Rot127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • For all the string operations: It is better to use SStream everywhere. And don't do direct operations on char *. This is what it was implement for anyways. Is better tested and is convenient to use.

Edit: Sorry, pressed the "review" button by accident. Will add some more comments.

CMakeLists.txt Outdated
arch/RISCV/RISCVDisassembler.h
arch/RISCV/RISCVInstPrinter.h
arch/RISCV/RISCVMapping.h
arch/RISCV/riscv_helpers_ast2str.h
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should rename them before merging to CamelCase

@@ -0,0 +1,36 @@
#include "RISCVDetails.h"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add your copyright please in SPDX form:

// Copyright © 2024 Rot127 <unisono@quyllur.org>
// SPDX-License-Identifier: BSD-3

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add include guard and your copyright please.

@@ -4,29 +4,28 @@

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add your copyright.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copyright please

* version 0b9c639f19da48734cbf14b61f6ad200b7c70de9.*/
/*DO NOT MODIFY THIS CODE MANUALLY.*/
/* */
/*SPDX-License-Identifier: BSD-3-Clause*/
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your copyright.

Comment on lines 65 to 66
if (((binary_stream & 0x000000000000007F) == 0x67) &&
(((binary_stream & 0x0000000000007000) >> 12) == 0x0)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These if cases are ok for the prototype. Especially, since they more comprehensible. But for the final version we have to reduce runtime complexity again to something like O(log n).

*ps = " , "; \
*plen = 3

static inline void hex_bits(uint64_t bitvec, char **s, size_t *len,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, please move it to `utils.c

*ps = " , "; \
*plen = 3

static inline void hex_bits(uint64_t bitvec, char **s, size_t *len,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, you can just use sprintf(). We depend on libc anyways.

*ps = " , "; \
*plen = 3

static inline void hex_bits(uint64_t bitvec, char **s, size_t *len,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, ignore it. See my comment in the final review message.

Copy link
Collaborator

@Rot127 Rot127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well done! This is a really nice basis to develop further.
I just pushed a new experimental branch (based on newest next). Please rebase your PR on top of it.


typedef struct riscv_conf {
Void2Bool sys_enable_fdext;
Void2Bool sys_enable_zfinx;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doxygen please. Also for the Void2Bool callback.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copyright.

/*SPDX-License-Identifier: BSD-3-Clause*/
/*=======================================================================*/

#ifndef __Riscv_insn_gen_inc__
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also above in the other files please.

…s in riscv.h, changed name to reflect other archs conventio
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants