ARM opcode & operand encoding information (based on auto-sync) #2045

Closed
wants to merge 2,260 commits into from

Conversation

AngelDev06

I implemented two new structures.
The first one being:
Screenshot 2023-06-16 051447
and the second one being:
Screenshot 2023-06-16 051641
where both are defined in capstone.h and are available for use by all auto-sync archs. An instance of the first struct is provided as part of the detail member as shown here:
Screenshot 2023-06-16 052108
while the instance of the second struct is provided directly as part of the arch's operand (detail only) such as cs_arm_op. Currently only the ARM arch fills these structs with info as this is the only arch I have generated the required tables for. However this is entirely based on auto-sync so once I publish my modified copy of llvm-capstone (containing the function used to generate the encoding strings), it should be easy to generate the tables for the rest of the auto-sync archs.

  • The info for the opcode encoding is mapped from the ARMGenCSMappingInsn.inc file (and is also wrapped under CAPSTONE_DIET) in the function ARM_set_instr_map_data (which is invoked right after the instruction is decoded).
  • The info for the operand, however, is mapped from the ARMGenCSMappingInsnOp.inc file (like the rest of the operand info) and only after ARM_add_cs_detail is called. The mapping takes place in functions such as ARM_set_detail_op_reg and ARM_set_detail_op_imm, and in some special cases it is hardcoded inside the other add_cs_detail functions (such as add_cs_detail_general). Note that for the reglist operand, since capstone breaks it down into a list of register operands (each one being a separate cs_arm_op), I did the same for the encoding. I added a comment explaining this in that part of the code.
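
To illustrate the reglist handling described in the last bullet, here is a minimal sketch and not the code from this PR. It assumes a 16-bit register_list field in bits 15..0 where bit i selects Ri (as in A32 LDM/STM); `reglist_entry` and `split_reglist` are hypothetical names made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-register record: which register is listed and which
 * single bit of the instruction word encodes its presence. */
typedef struct {
	uint8_t reg; /* register number, 0..15 */
	uint8_t bit; /* bit position inside the instruction word */
} reglist_entry;

/* Split a register_list mask into one entry per listed register.
 * Returns the number of entries written to out (at most 16). */
size_t split_reglist(uint16_t mask, reglist_entry out[16])
{
	assert(out != NULL);
	size_t n = 0;
	for (uint8_t i = 0; i < 16; i++) {
		if (mask & (1u << i)) {
			out[n].reg = i;
			out[n].bit = i; /* assumption: bit i corresponds to Ri */
			n++;
		}
	}
	return n;
}
```

Each entry would then back one register operand, so the encoding reported for that operand is a single bit rather than the whole list field.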

A test I did myself on a few instructions to verify that it works includes the following output:
Screenshot 2023-06-16 044229
Screenshot 2023-06-16 042025
and yes I verified the results from the official ARM documentation. More info:

Let me know if there are any issues.

For better understanding purposes
@XVilka
Contributor

XVilka commented Jul 27, 2023

Please fix fuzzing bugs as well:

fuzz_disasmnext: /src/capstonenext/arch/ARM/ARMMapping.c:1861: void ARM_set_detail_op_imm(MCInst *, unsigned int, arm_op_type, int64_t): Assertion `!(map_get_op_type(MI, OpNum) & CS_OP_MEM)' failed.
AddressSanitizer:DEADLYSIGNAL
=================================================================
==27==ERROR: AddressSanitizer: ABRT on unknown address 0x00000000001b (pc 0x7fb46332c00b bp 0x7fb4634a1588 sp 0x7ffc74ea3570 T0)
SCARINESS: 10 (signal)
    #0 0x7fb46332c00b in raise (/lib/x86_64-linux-gnu/libc.so.6+0x4300b) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)
    #1 0x7fb46330b858 in abort (/lib/x86_64-linux-gnu/libc.so.6+0x22858) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)
    #2 0x7fb46330b728  (/lib/x86_64-linux-gnu/libc.so.6+0x22728) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)
    #3 0x7fb46331cfd5 in __assert_fail (/lib/x86_64-linux-gnu/libc.so.6+0x33fd5) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)
    #4 0x598546 in ARM_set_detail_op_imm /src/capstonenext/arch/ARM/ARMMapping.c:1861:2
    #5 0x596cbc in add_cs_detail_general /src/capstonenext/arch/ARM/ARMMapping.c
    #6 0x593145 in ARM_add_cs_detail /src/capstonenext/arch/ARM/ARMMapping.c:1808:2
    #7 0x7e011b in add_cs_detail /src/capstonenext/arch/ARM/ARMMapping.h:62:2
    #8 0x7da65b in printCImmediate /src/capstonenext/arch/ARM/ARMInstPrinter.c:1145:2
    #9 0x7d713b in printInstruction /src/capstonenext/arch/ARM/ARMGenAsmWriter.inc
    #10 0x7e1985 in printInst /src/capstonenext/arch/ARM/ARMInstPrinter.c
    #11 0x7e1985 in ARM_LLVM_printInstruction /src/capstonenext/arch/ARM/ARMInstPrinter.c:1895:2
    #12 0x586e58 in ARM_printer /src/capstonenext/arch/ARM/ARMMapping.c:437:2
    #13 0x56e8a6 in cs_disasm /src/capstonenext/cs.c:944:4
    #14 0x56c636 in LLVMFuzzerTestOneInput /src/capstonenext/suite/fuzz/fuzz_disasm.c:56:20
    #15 0x43ddc3 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:611:15
    #16 0x43d5aa in fuzzer::Fuzzer::RunOne(unsigned char const*, unsigned long, bool, fuzzer::InputInfo*, bool, bool*) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:514:3
    #17 0x43f414 in fuzzer::Fuzzer::ReadAndExecuteSeedCorpora(std::__Fuzzer::vector<fuzzer::SizedFile, std::__Fuzzer::allocator<fuzzer::SizedFile> >&) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:826:7
    #18 0x43f649 in fuzzer::Fuzzer::Loop(std::__Fuzzer::vector<fuzzer::SizedFile, std::__Fuzzer::allocator<fuzzer::SizedFile> >&) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:857:3
    #19 0x42ecaf in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:912:6
    #20 0x458302 in main /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10
    #21 0x7fb46330d082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)
    #22 0x41f6ed in _start (build-out/fuzz_disasmnext+0x41f6ed)

DEDUP_TOKEN: raise--abort--
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: ABRT (/lib/x86_64-linux-gnu/libc.so.6+0x4300b) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee) in raise
==27==ABORTING
MS: 0 ; base unit: 0000000000000000000000000000000000000000
0x3,0x33,0xed,0x1e,0x7f,
\0033\355\036\177
artifact_prefix='/tmp/tmpet2wpn1m/'; Test unit written to /tmp/tmpet2wpn1m/crash-7529d18d8f1d51daf81fcc7a43063df595571beb
Base64: AzPtHn8=
stat::number_of_executed_units: 4685
stat::average_exec_per_sec:     4685
stat::new_units_added:          0
stat::slowest_unit_time_sec:    0
stat::peak_rss_mb:              104
/github/workspace/build-out/fuzz_disasmnext -max_len=4096 -timeout=25 -rss_limit_mb=2560 -len_control=0 -seed=1337 -artifact_prefix=/tmp/tmpet2wpn1m/ -max_total_time=300 -print_final_stats=1 /github/workspace/cifuzz-corpus/fuzz_disasmnext >fuzz-1.log 2>&1
================== Job 1 exited with exit code 77 ============
INFO: Running with entropic power schedule (0xFF, 100).
INFO: Seed: 1337
INFO: Loaded 1 modules   (38312 inline 8-bit counters): 38312 [0x1723b20, 0x172d0c8), 
INFO: Loaded 1 PC tables (38312 PCs): 38312 [0x14b1198,0x1546c18), 
INFO:    63359 files found in /github/workspace/cifuzz-corpus/fuzz_disasmnext
INFO: seed corpus: files: 63359 min: 1b max: 4096b total: 8870283b rss: 78Mb
fuzz_disasmnext: /src/capstonenext/arch/ARM/ARMMapping.c:1861: void ARM_set_detail_op_imm(MCInst *, unsigned int, arm_op_type, int64_t): Assertion `!(map_get_op_type(MI, OpNum) & CS_OP_MEM)' failed.
AddressSanitizer:DEADLYSIGNAL
=================================================================
==31==ERROR: AddressSanitizer: ABRT on unknown address 0x00000000001f (pc 0x7f623e53b00b bp 0x7f623e6b0588 sp 0x7ffd10872d50 T0)
SCARINESS: 10 (signal)
    #0 0x7f623e53b00b in raise (/lib/x86_64-linux-gnu/libc.so.6+0x4300b) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)
    #1 0x7f623e51a858 in abort (/lib/x86_64-linux-gnu/libc.so.6+0x22858) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)
    #2 0x7f623e51a728  (/lib/x86_64-linux-gnu/libc.so.6+0x22728) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)
    #3 0x7f623e52bfd5 in __assert_fail (/lib/x86_64-linux-gnu/libc.so.6+0x33fd5) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)
    #4 0x598546 in ARM_set_detail_op_imm /src/capstonenext/arch/ARM/ARMMapping.c:1861:2
    #5 0x596cbc in add_cs_detail_general /src/capstonenext/arch/ARM/ARMMapping.c
    #6 0x593145 in ARM_add_cs_detail /src/capstonenext/arch/ARM/ARMMapping.c:1808:2
    #7 0x7e011b in add_cs_detail /src/capstonenext/arch/ARM/ARMMapping.h:62:2
    #8 0x7da65b in printCImmediate /src/capstonenext/arch/ARM/ARMInstPrinter.c:1145:2
    #9 0x7d713b in printInstruction /src/capstonenext/arch/ARM/ARMGenAsmWriter.inc
    #10 0x7e1985 in printInst /src/capstonenext/arch/ARM/ARMInstPrinter.c
    #11 0x7e1985 in ARM_LLVM_printInstruction /src/capstonenext/arch/ARM/ARMInstPrinter.c:1895:2
    #12 0x586e58 in ARM_printer /src/capstonenext/arch/ARM/ARMMapping.c:437:2
    #13 0x56e8a6 in cs_disasm /src/capstonenext/cs.c:944:4
    #14 0x56c636 in LLVMFuzzerTestOneInput /src/capstonenext/suite/fuzz/fuzz_disasm.c:56:20
    #15 0x43ddc3 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:611:15
    #16 0x43d5aa in fuzzer::Fuzzer::RunOne(unsigned char const*, unsigned long, bool, fuzzer::InputInfo*, bool, bool*) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:514:3
    #17 0x43f414 in fuzzer::Fuzzer::ReadAndExecuteSeedCorpora(std::__Fuzzer::vector<fuzzer::SizedFile, std::__Fuzzer::allocator<fuzzer::SizedFile> >&) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:826:7
    #18 0x43f649 in fuzzer::Fuzzer::Loop(std::__Fuzzer::vector<fuzzer::SizedFile, std::__Fuzzer::allocator<fuzzer::SizedFile> >&) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:857:3
    #19 0x42ecaf in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:912:6
    #20 0x458302 in main /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10
    #21 0x7f623e51c082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)
    #22 0x41f6ed in _start (build-out/fuzz_disasmnext+0x41f6ed)

DEDUP_TOKEN: raise--abort--
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: ABRT (/lib/x86_64-linux-gnu/libc.so.6+0x4300b) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee) in raise
==31==ABORTING
MS: 0 ; base unit: 0000000000000000000000000000000000000000
0x3,0x33,0xed,0x1e,0x7f,
\0033\355\036\177
artifact_prefix='/tmp/tmpet2wpn1m/'; Test unit written to /tmp/tmpet2wpn1m/crash-7529d18d8f1d51daf81fcc7a43063df595571beb
Base64: AzPtHn8=
stat::number_of_executed_units: 4685
stat::average_exec_per_sec:     4685
stat::new_units_added:          0
stat::slowest_unit_time_sec:    0
stat::peak_rss_mb:              103
2023-07-26 06:32:32,801 - root - INFO - Re

@AngelDev06
Author

Pretty sure this has to do with the commit onto which I rebased. Shouldn't be an issue on future rebase

@Rot127
Collaborator

Rot127 commented Jul 27, 2023

I'll do a second round of review soon.
But do you think we can add a reference to each cs_arm_op indicating which piece(s) of the encoding are its source?

It would be nice, because when printing the details we could also print the variable names of each operand.

If every encoding piece has an OpNum assigned to it, the code could simply be moved to ARM_set_detail_op_<reg/imm/mem>(), right?

@AngelDev06
Author

Each OpNum doesn't necessarily correspond to a single piece.
That depends on how that specific operand is encoded.
Some operands don't have all of their bits next to each other and instead have some of them at one position and the rest at another. My favourite example is encoding T1 of mov (register), which has one bit of the destination register operand (Rd) at position 7 while the other 3 are at the beginning (positions 0-2).
So in that case OpNum would correspond to two pieces of the same operand. And yes because each OpNum corresponds to pieces of the same operand I have moved all of the mapping operations (regarding the encoding) to ARM_set_detail_op_<reg/imm/mem>() except for a few exceptions such as the reglist operand.
Because capstone adds an operand entry for each individual register in the reglist, I had to make the encoding specific to each of these registers and not the entire reglist. Also, yes, in each cs_arm_op entry all of the operand pieces provided are part of that specific operand, so you know which operand those pieces belong to.
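
The split Rd field of mov (register) T1 mentioned above can be sketched as follows. This is only an illustration under assumptions: `enc_piece` and `gather_operand` are hypothetical names, not the structs or functions added in this PR, and the pieces are assumed to be listed most significant first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical piece descriptor (not the struct from this PR). */
typedef struct {
	uint8_t lsb;   /* position of the piece's lowest bit in the instruction word */
	uint8_t width; /* number of bits in the piece */
} enc_piece;

/* Concatenate the pieces, most significant piece first, into one value. */
uint32_t gather_operand(uint32_t insn, const enc_piece *pieces, size_t count)
{
	uint32_t value = 0;
	for (size_t i = 0; i < count; i++) {
		assert(pieces[i].width > 0 && pieces[i].width < 32);
		uint32_t mask = (1u << pieces[i].width) - 1u;
		value = (value << pieces[i].width) | ((insn >> pieces[i].lsb) & mask);
	}
	return value;
}

/* Rd of Thumb mov (register), encoding T1: bit 7 followed by bits 2..0. */
const enc_piece mov_t1_rd[] = { { 7, 1 }, { 0, 3 } };

/* Rm of the same encoding is contiguous: bits 6..3. */
const enc_piece mov_t1_rm[] = { { 3, 4 } };
```

With these tables, the halfword 0x4688 (`mov r8, r1`) gives 8 for Rd and 1 for Rm, which shows why a single OpNum can own more than one piece.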

There is only one problem and that's why I added my second commit.
Since memory operands are complicated and therefore the *.td files don't provide extra info about the encoding of their suboperands, I can only get the encoding of the entire memory operand. That's ok considering that capstone does add an entry for the entire memory operand and sets its type as ARM_OP_MEM, however some user might want to know whether a piece provided corresponds to the base register of the memory operand or to the index one, and so on. To add this information, I also added a field to arm_op_mem named format which holds some hardcoded enum value specifying how the memory operand is formatted.
This tells you exactly which piece should be the base register, which one the displacement or index register etc.
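
A rough sketch of how such a format value could be used to label the pieces of a memory operand. The enum values, the `piece_role` type and `mem_piece_role` are all hypothetical and chosen for illustration; the actual values of the format field are not shown in this thread.

```c
#include <stddef.h>

/* Hypothetical memory operand layouts, in piece order. */
typedef enum {
	MEM_FMT_BASE_ONLY,  /* [Rn]        pieces: base */
	MEM_FMT_BASE_DISP,  /* [Rn, #imm]  pieces: base, displacement */
	MEM_FMT_BASE_INDEX, /* [Rn, Rm]    pieces: base, index */
	MEM_FMT_COUNT
} mem_format;

typedef enum { ROLE_NONE, ROLE_BASE, ROLE_INDEX, ROLE_DISP } piece_role;

#define MAX_MEM_PIECES 2

/* Role of each piece for every layout above. */
static const piece_role role_table[MEM_FMT_COUNT][MAX_MEM_PIECES] = {
	[MEM_FMT_BASE_ONLY]  = { ROLE_BASE, ROLE_NONE },
	[MEM_FMT_BASE_DISP]  = { ROLE_BASE, ROLE_DISP },
	[MEM_FMT_BASE_INDEX] = { ROLE_BASE, ROLE_INDEX },
};

/* Look up what the piece at piece_idx means for the given layout.
 * Out-of-range arguments yield ROLE_NONE. */
piece_role mem_piece_role(mem_format fmt, size_t piece_idx)
{
	if ((unsigned)fmt >= MEM_FMT_COUNT || piece_idx >= MAX_MEM_PIECES)
		return ROLE_NONE;
	return role_table[fmt][piece_idx];
}
```

A user iterating over the pieces of an ARM_OP_MEM operand could call a lookup like this to tell the base register bits from the index or displacement bits.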

/// The bit positions of each piece that form the full operand in order. If
/// there is only one piece then there is only one index as well. Likewise
/// if there are 4 pieces, there are 4 indexes and so on.
/// Also note that these indices DON'T necessarily match the indices in the ISA
Collaborator

I think we should always save bits at the same index as the ISA does. Otherwise there is no reference point for people.

@Rot127
Collaborator

Rot127 commented Aug 5, 2023

@AngelDev06 Just rebased #2026. You shouldn't get conflicts anymore.

@Rot127
Collaborator

Rot127 commented Aug 5, 2023

Ok, so there are some more changes coming (nothing ARM related). You might as well wait with the rebase. I can give you the go-ahead sometime later this month, if this is fine for you.

@Rot127
Collaborator

Rot127 commented Aug 5, 2023

Is it fine for you, if I cherry-pick your commits into the auto-sync PR? It is easier to work on it this way. Also it would probably be merged faster.
You can still work on it by opening PRs on the auto-sync branch

@AngelDev06
Author

AngelDev06 commented Aug 5, 2023

I might be on holiday for about two weeks starting from the 15th of this month, so I might not be able to rebase. But in the worst-case scenario, I will be available either at the end of this month or at the start of the next one. Yes, you can cherry-pick, although I can still rebase onto your changes of the arm64 PR if you want.

Also note that I will try to fix the indexes to be according to the ARM documentation & also open a brand new pr on llvm-capstone that includes my changes.

Up to you if you want me to rebase now or wait for more changes to come as you mentioned.

@Rot127
Collaborator

Rot127 commented Aug 5, 2023

My problem is that llvm-capstone is out of sync and generates newer tables (with the encoding). And I am tired of cherry-picking the right commits into other branches. Also I'd like to have your encoding work merged at the end of the month (if @kabeor finds time to review it).
If you want to change little things they can be done after the merge as well.

So I would cherry pick them then. Especially if you are on vacation (wish you a nice one btw).

also open a brand new pr on llvm-capstone that includes my changes.

If you manage to do this before your vacation it would be great. Or does it change a lot?

Also note that I will try to fix the indexes to be according to the ARM documentation

Sure. If this happens after #2026 it should be fine as well.
Better check before if the indices in LLVM maybe already match the ones in the ISA. Then we just fix the doc string.

@AngelDev06
Author

AngelDev06 commented Aug 5, 2023

My problem is that llvm-capstone is out of sync and generates newer tables (with the encoding). And I am tired of cherry-picking the right commits into other branches.

I can make some small patches to my current clone of llvm-capstone to bring it in sync with what capstone needs right now and push the newer commits, so hopefully you won't have to do any manual work. After all, the only files I generate are ARMGenCSMappingInsn.inc and ARMGenCSMappingInsnOp.inc.

Also I'd like to have your encoding work merged end of the month (if @kabeor finds time to review it). If you want to change little things they can be done after the merge as well.

Sure. Is my work going to be included as part of #2026 or merged directly from here?

If you manage to do this before your vacation it would be great. Or does it change a lot?

It doesn't change a lot. I can get it done today. Done!

Sure. If this happens after #2026 it should be fine as well. Better check before if the indices in LLVM maybe already match the ones in the ISA. Then we just fix the doc string.

This can be done now if needed. It's a very small patch which I should have done before my pr on llvm-capstone. Also done

@Rot127
Collaborator

Rot127 commented Apr 22, 2024

Sorry for letting this PR stall for so long. I looked into it for quite a while, and we discussed it internally as well.
Unfortunately, we won't merge it.

The problem is not that we don't need it, but that the current implementation doesn't work for other architectures.
This is not really a problem with your design; it is first and foremost due to the flawed definitions in LLVM.

As you know, you had to do quite some manipulation of the encodings in ARMMapping.c to make the result usable.
Additionally, I couldn't manage to get a one-to-one mapping from the ISA encoding classes to the ones we have in LLVM/Capstone
(see: cff87a3. Never succeeded with it).

While these points would be acceptable for a single architecture module, we cannot do it for all the others.
If we add the instruction encodings as a feature, it must fit the paradigm of being easy to update and be comparable to LLVM output and the ISA.
But I couldn't find a way to make this possible for all other modules.

Due to the faulty encoding information in LLVM (and ARM is actually not the worst here), this is not maintainable without extra effort.
We lack maintainers as you probably know, so adding a relatively niche feature, which would consume more time is currently not an option, unfortunately.

However, it would be great if you would keep this feature up to date in your repository, so we can refer to it if other people need it for ARM.
I will move the relevant commits in llvm-capstone to another branch as well.

Also, we plan to add a generator for the SAIL definitions of several architectures. They should have way less flawed encoding information.
So once this is done, I would be delighted if you want to add it again.

Thanks a lot for the effort you put into it! And I am sorry I figured out too late that it is not transferable to the other architecture modules.

@AngelDev06
Author

It's understandable.
I can tell that it was quite painful to try and get all the encodings correct when there were multiple inaccuracies and missing definitions in the .td files making automation a lot harder than it should have been.
Considering that all of that was just for ARM I can imagine how time-consuming it can be to get it done for each architecture.

Sadly, I can't easily keep my fork up to date since I barely have any time for my own project, so I will probably just leave it as is for the time being.

I truly hope this project gets more maintainers in the future, since it has a lot of potential and I personally couldn't find a better disassembler for my needs.
Despite all that, thanks for your effort in trying to integrate my changes into the library.
