Arm AArch64 instruction set plugin. #1141

zhaoyonghe · 2020-06-19T00:12:56Z

This pr presents a draft of Core Theory/KB based ARM AArch64 instructions' lifter, which is mostly a incomplete skeleton presenting how the final lifter will be. Now it can only lift one instruction LDR Xn addr. No Float number and SIMD related supports for now.

Because the instruction support two set of registers: Xn and Wn, but they actually share the same physical register, so Xn and Wn will be mapped onto same register xn(see a64_env_registers.ml:Line86).

Question:
Which manual I should reference?
https://www.element14.com/community/servlet/JiveServlet/previewBody/41836-102-1-229511/ARM.Reference_Manual.pdf This one has the detailed instruction set (a lot of instructions, which parts are needed?), but seems too old?
https://static.docs.arm.com/den0024/a/DEN0024A_v8_architecture_PG.pdf This one seems very offical, but has much more less about the instructions.

XVilka · 2020-06-19T04:19:40Z

@zhaoyonghe here is the latest official ARMv8 instruction set reference:

This one is for procedure call https://developer.arm.com/docs/ihi0055/latest/procedure-call-standard-for-the-arm-64-bit-architecture

Here is the quick reference: https://courses.cs.washington.edu/courses/cse469/18wi/Materials/arm64.pdf

Be aware of the different addressing modes - pre-index, index, post-index.

XVilka · 2020-06-19T09:08:56Z

It will require some tests as well, see e.g.:

ivg · 2020-06-19T13:52:55Z

and unit tests as well,

https://github.com/BinaryAnalysisPlatform/bap/tree/master/lib_test/powerpc

In general, for the development setup you can follow the powerpc lifter. So far it looks ok, as far as I understand you're not building from the BAP source repository and installing bap from opam, is that right?

plugins/arm_a64/a64.ml

zhaoyonghe · 2020-06-19T15:44:33Z

and unit tests as well,

https://github.com/BinaryAnalysisPlatform/bap/tree/master/lib_test/powerpc

In general, for the development setup you can follow the powerpc lifter. So far it looks ok, as far as I understand you're not building from the BAP source repository and installing bap from opam, is that right?

Yes, I just install bap from opam.

plugins/arm_a64/a64_defs.ml

ivg

It looks like we made a bad start as this lifter is too much influenced by the armv7 lifter instead of mips and powerpc lifters, which in turn influenced core theory. That is totally my fault as I didn't communicate that the arm lifter is a total failure and an example of how lifters shall not be written. The lifter shall not look as OCaml it shall not do any OCamlish programming, it should be a specification. In other words, a lifter is not a program but a declaration. The main reason why our x86 and arm lifters are failures is that they try to build abstractions of machine instructions, e.g., they define OCaml functions that are reused to describe the semantics of different instructions. The premise was to enable code reuse, which is good when you develop and design programs but is totally not applicable when your task to encode the knowledge. The whole idea of the lifter is that it shall describe/encode the already existing program, which is microcoded and hardwired in the CPU. That program is already designed and built (sometimes for years) by the developers of the CPU, with limitations and design approaches that are specific to building hardware and which are totally different from techniques used in software engineering.

To be more specific, both arm and x86 lifters were trying to write generic functions for loading and storing, assuming that there exists some abstract load/store operation that is reused by all machine operations that do loading and storing. It resulted in an ugly function with tons of parameters that works equally bad for all instructions. And it is not because the abstraction was chosen wrongly, it is because it didn't exist on the first hand, in the hardware. Therefore we have this impedance mismatch when we try to describe two physically different hardware units with the same abstraction. That is not to say, of course, that hardware designers are not using abstractions, they also do reuse IP cores but they apply those abstractions in a completely different dimension in which a small change of parameters results in a big change of semantics (which is our dimension).

So what we should focus on, is to make the Aarch64 lifter look as close as possible to the pseudo-code in the specification. To enable this we shall embrace the specification centric approach and build a theory of the Aarch64 processor, which will encode operations from the reference manual in terms of the Core Theory. Taking LDR as a concrete example, we should follow closely (if not literal) the provided pseudocode. Most of the operations that it uses are already provided by the Core Theory. You might find yourself building some operations on top of them which are more abstract/sugary. The pseudocode in the reference documentation even advise which common operations you will reuse between different loads and stores. Those would be decoders so you will need a library of decoders. Basically, instead of trying to build the one function that rules them all (as they were doing in the old arm lifter), we will build a pack of small and easy to use functions and express the semantics of each instruction independently of any other instructions but using those primitives.

ivg · 2020-06-24T12:21:51Z

plugins/arm_a64/a64.ml

+module Env  = A64_env.Env
+
+(* maybe a wrong definition *)
+type insn_t = simple_insn_t * (operand_t list)


please do not use the _t suffix for types, this convention is totally not necessary in OCaml where the names of the types resides in its own namespace and syntactically the type could never be confused with any other syntactic entity.

ivg · 2020-06-24T12:25:38Z

plugins/arm_a64/a64.ml

+    let open Mem in
+    match insn, ops with
+    | `LDRi_pre_64, [rt; rn_sp; imm] ->
+      Mem.lift_LDRi_wrapper ~rt:rt ~rn_sp:rn_sp ~imm:imm ~wback:true ~post_index:false ~is_64_reg:true ~to_load_bv:X


Please, stick to 70 characters per line. It's ok to sometimes exceed 70 up to 80 max, consider 70..80 as the yellow zone.

The motivation is simple (and it is not about the editor/display size), it is that it is very hard to understand such long expressions. Consider 70 characters as the complexity threshold anything longer shall be abstracted and refactored into smaller pieces.

Also, in OCaml, ~rt:rt could be written as just ~rt.

ivg · 2020-06-24T12:46:40Z

plugins/arm_a64/a64_env.ml

+  let byte_sort: byte_t Theory.Bitv.t Theory.Value.sort = Theory.Bitv.define 8
+  let half_word_sort: half_word_t Theory.Bitv.t Theory.Value.sort = Theory.Bitv.define 16
+  let word_sort: word_t Theory.Bitv.t Theory.Value.sort = Theory.Bitv.define 32
+  let double_word_sort: bv64_t Theory.Bitv.t Theory.Value.sort = Theory.Bitv.define 64


The suffix _sort in byte_sort, half_sort etc, serves as a poor man namespace, to which we have to reside when we code in C, in a language that doesn't have namespaces and modules. In OCaml, whenever you see a pattern of adding certain suffixes or prefixes to a group of definitions you should consider factoring those definitions into a module so that instead of writing byte_sort or half_sort you can write Sort.byte, Sort.half, etc. With that said, I sometimes rely on those poor-man namespaces, but keep it much more concise and in a very limited scope. For example, I would define sort values bytes, halfs, words, dwords, mems, etc. So yes, s here denotes sort (and also matches the intuition behind the sort that it denotes a set of values, so bytes is the set of bytes and words is the set of words). This approach borrows from the idea of defining a mathematical notation for some framework and selecting certain meta-variables or syntactic forms to denote entities of one group. Once this notation is clearly described and followed it makes it much easier to understand the code. This comes with a price that you're inventing a meta-language that a reader should learn before proceeding, so such approach should be reserved only for cases where this notation will be used a lot (and therefore learning the notation will be payed off from using it in many places). Here is an example of notation. And yes, a lifter is perfect example where such notations can make our code more readable and understandable.

Also, it is important to maintain a notation that is not noisy and doesn't clutter expressions with unnecessary suffixes and prefixes. Yes it is more or less an art, but all these _sort or _t are not contributing to readability, to my opinion.

It also looks like that you're trying to distinguish between types and sorts of the same entity, e.g., you're introducing word_t and word_sort. I am not sure why you're trying to separate them, but in OCaml it is perfectly fine to have a function and a type that have the same name. And in our case, word_t is the reflection of Aarch64 word sort into the OCaml type system, so it is like a proxy or representative, therefore it is clearly a good idea to give them the same names.

ivg · 2020-06-24T13:21:37Z

plugins/arm_a64/a64_exceptions.ml

@@ -0,0 +1,2 @@
+exception Unexpected_Situation


We use This_notation for exceptions, modules, and constructors (syntactic entities that shall start shall be capitalized) and this_notation for all other entities.

Also, those two exceptions are not really needed as they do not add anything specific on top of the existing Assert_failure and Failure of string. Also, we use exceptions only of unexpected situations, which are programmer errors, so exception Unexpected_situation is sort of tautology.

In other words there is no need to introduce new exceptions unless you're going to handle them. For programmer errors use failwith and failwithf (the latter takes the formatted input).

Also, when you see an invalid instruction it is not an exception or error it is just an instruction with an undefined semantics. So you can use special if it is an underspecified effect or unknown if it is a value. To summarize, the lifter shall be able to indefinitely process the output of /dev/urandom without any exceptions. Any exception is a bug.

ivg · 2020-06-24T13:33:25Z

plugins/arm_a64/a64_mem.ml

+      | W -> f Env.word_sort
+      | X -> f Env.double_word_sort 
+
+  let lift_LDRi ~rt ~rn_sp ~imm ~wback ~post_index ~is_64_reg ~to_load_bv = 


I think you got too much inspiration from the existing ARM lifter. It is my fault that I didn't communicate it well, but our ARM lifter is a total failure and is an example of how a lifter shall not be written. An ideal lifter should be a specification of the instruction semantics, expressed in an eDSL, which should be close to the DSL that is used by the specification manual. Our powerpc and mips lifters are good examples. And they should be used as the starting point. They are also were written in pre-theory times, so we can't really borrow their eDSL, but what we should do is develop the eDSL specialized to AArch64 that uses core theory primitive operations underneath the hood.

ivg · 2020-06-24T14:45:50Z

plugins/arm_a64/a64_defs.ml

@@ -0,0 +1,213 @@
+open Core_kernel


We don't really need this module at all (or at least its current contents). There is no need to maintain two sets of definitions for registers as variables and as polymorphic constructors. This is a design error inherited from the ARMv7 lifter.

btw, by this module I didn't mean Core_kernel as it may look like, it was a comment to the whole file, i.e., a64_defs.ml

ivg · 2020-06-25T14:30:29Z

plugins/arm_a64/a64_utils.ml

+    let extend = if unsign then unsigned else signed in
+    extend assem_reg_sort x 
+
+  let extend_reg ~r ~extend_type ~is_64_reg ~shift_len = 


For the reference, here's the reference implementation:

/ ExtendReg() // =========== // Perform a register extension and shift bits(N) ExtendReg(integer reg, ExtendType exttype, integer shift) assert shift >= 0 && shift <= 4; bits(N) val = X[reg]; boolean unsigned; integer len; case exttype of when ExtendType_SXTB unsigned = FALSE; len = 8; when ExtendType_SXTH unsigned = FALSE; len = 16; when ExtendType_SXTW unsigned = FALSE; len = 32; when ExtendType_SXTX unsigned = FALSE; len = 64; when ExtendType_UXTB unsigned = TRUE; len = 8; when ExtendType_UXTH unsigned = TRUE; len = 16; when ExtendType_UXTW unsigned = TRUE; len = 32; when ExtendType_UXTX unsigned = TRUE; len = 64; // Note the extended width of the intermediate value and // that sign extension occurs from bit <len+shift-1>, not // from bit <len-1>. This is equivalent to the instruction // [SU]BFIZ Rtmp, Rreg, #shift, #len // It may also be seen as a sign/zero extend followed by a shift: // LSL(Extend(val<len-1:0>, N, unsigned), shift); len = Min(len, N - shift); return Extend(val<len-1:0> : Zeros(shift), N, unsigned);

I like very much that your implementation follows closely the reference so we can match it one-to-one and verify its correctness.

A few ways to improve it:

do not abuse labeled parameter (i.e., ~) they will just clutter the code in the place of use (and we want it to be as close to the specification pseudocode as possible). Especially in this function, they are not needed, as all arguments are of different types so we can't mess with them at the call site. Also, whenever you find that you want to add labels to your arguments, it is a good indicator that the function is not simple anymore and has too complicated interface.

the is_64_reg parameter is superfluous here and may introduce potential anomalies. The reason is that the register, which is passed to the function, already has the sort (and correspondingly size) attached to it. So you should use the information provided by the register.

there is a mismatch between the extend_type from the reference and our representation. Why is LSL there? We need to rectify this asap.

at the epilogue of the function we start to diverge from the reference. You shall use the extend function here, and I fail to see a reason why it could be used here.

// Extend() // ======== bits(N) Extend(bits(M) x, integer N, boolean unsigned) return if unsigned then ZeroExtend(x, N) else SignExtend(x, N); // Extend() // ======== bits(N) Extend(bits(M) x, boolean unsigned) return Extend(x, N, unsigned);

Don't forget that you can always get the sort for any value or register, so you can go, _ bitv => _ sort => int and from _ var => _ sort => int and the other way around, int -> _ sort => _ bitv, etc. We commonly define the following operator (which we should probably add to the Core Theory, given how useful it is),

Basically, here's I would define this function,

let extend_reg reg etype shift = let s = Theory.Var.sort reg in let n = Theory.Bitv.size s in let unsigned,len = match etype with | SXTB -> false,8 | SXTH -> false,16 | SXTW -> false,32 | SXTX -> false,64 | UXTB -> true,8 | UXTH -> true,16 | UXTW -> true,32 | UXTX -> true,64 in let len = Int.min len (n - shift) in let zeros = zero (Theory.Bitv.define shift) in let s' = Theory.Bitv.define (len + shift) in let v = append s' (static_extract (len-1) 0 (var reg)) zeros in extend v s unsigned

I had to define a few helpers, first of all, we have the extract function which is a little bit more generic than we need for our purposes. The Theory's extract takes runtime values for the hi and lo bits of the range of bytes that are extracted but in our case they are static constants defined by the code itself. So we will define a helper called static_extract that will take to ML integers (denoting statically known boundaries) and use the generic extract to extract the range,

let static_extract hi lo x = let s = Theory.Bitv.define (hi-lo+1) in extract s (byte hi) (byte lo) x

The extract operation requires us to provide the sort of the output value, which we can statically determine as hi-lo+1. As you can see, we can define sorts in runtime, as long as you can compute the size of the bitvector. We also need to wrap ML integers into the target language bitvectors. Since we know that they are going to be small (they denote the register sizes) and since the type of extract,

val extract : 'a Bitv.t Value.sort -> 'b bitv -> 'b bitv -> _ bitv -> 'a bitv

doesn't require hi and lo to be of the same sort as the extractee, we will define some helpers (which we will certainly use a lot in other places) to create small bitvectors from integers. We pick 8 bits for small integers and add the following definitions to our theory,

module Byte = Bitvec.M8 let bytes = Theory.Bitv.define 8 let byte x = int bytes (Byte.int x) let static_extract hi lo x = let s = Theory.Bitv.define (hi-lo+1) in extract s (byte hi) (byte lo) x

which corresponds to

Here, module Byte are ML bitvectors with modulus 8 arithmetic. (Do not confuse Bitv, which is the Theory (aka the target language) bitvectors with Bitvec wich is ML (aka OCaml, aka the host language) bitvectors aka words, the latter are concrete representation the former is an abstraction). We also defined bytes as the sort for bytes (you may find a better place for it). And the small helper byte that takes an ML integer and lifts it to the Theory byte value that represents a byte constant with the given bits.

One more comment about this final expression of our extract_reg function

let s' = Theory.Bitv.define (len + shift) in let v = append s' (static_extract (len-1) 0 (var reg)) zeros in extend v s unsigned

which corresponds to

return Extend(val<len-1:0> : Zeros(shift), N, unsigned);

our notation is a little bit heavier:

x<n-1:0> we represent as static_extract (n-1) 0 x,

x:y we represent as append s x y, where s is the result sort, which in our case is len+shift.

You can see that many generic operators in Core Theory are parameterized by sorts which you have to provide explicitly to ensure that they are statically (from the perspective of the target language).

Side question: is your setup OK or you need any help? Like does merlin work for you? Showing type annotation, indicating no errors (when there are none), jumping to the definition (including your own definitions), etc? Indenter working automatically? It is very important that tools are not failing you so that you can focus on the main topic without spending your focus on things that can be done by automation. So if anything is wrong, just tell what and we will guide you how to set it up.

Hi @ivg, thanks for your detailed directions!

For the setup, I use VSCode on Ubuntu and have installed a plugin just called OCaml.

Like does merlin work for you? Showing type annotation, indicating no errors (when there are none), jumping to the definition (including your own definitions), etc?

These things works well. One awkward thing is when I write a function in file A and use it in file B, this function will not be recognized(with read wave underline). And the "red wave underline" will only disappear after bapbuild.

Indenter working automatically?

It can automatically indent, but does not fit the ocp-indent format you need. Now I just have a shell file to run cd ../../ then make indent then cd ./plugins/arm_a64/.

@ivg Another question. We have one function per instruction, but what are the parameters of the function?

Still use instruction LDR (register) as an example. It seems that using a bit vector with 32 bits as an input is the best fit. Because we have pseudo-code like this:

Decode for all variants of this encoding

integer scale = UInt(size); if option<1> == '0' then UNDEFINED; // sub-word index ExtendType extend_type = DecodeRegExtend(option); integer shift = if S == '1' then scale else 0;

size, option and S are all bits in the 32-bit instruction encoding. We can simply extract them out and use them. But the thing is, we are lifting from assembly code, which is essentially characters. So I am confused now.

By the way, how to deal with the UNDEFINED? Raise a failure?

Hi @ivg,

there is a mismatch between the extend_type from the reference and our representation. Why is LSL there? We need to rectify this asap.

In the description of LDR (register), there is a field "Assembler symbols" says:

Is the index extend/shift specifier, defaulting to LSL, and which must be omitted for the LSL option
when is omitted. encoded in the "option" field. It can have the following values:
UXTW when option = 010
LSL when option = 011
SXTW when option = 110
SXTX when option = 111

It is strange, but it does can have LSL, but at here its behavior is same as UXTX when register size is 64 bits and UXTW when 32 bits. So I write code like this LSL.

Hi @ivg,
Also, when I try to implement CheckSPAlignment, I need to define a lot of things:

let bv64s = Theory.Bitv.define 64 let bvs_of_size (size: int) = Theory.Bitv.define size type bv1 type bv2 type bv4 type bv8 let bv_of (s: int) (n: int) = let module M = Bitvec.Make(struct let modulus = Bitvec.modulus s end) in int (bvs_of_size s) (M.int n) let _EL3 = bv_of 2 3 let _EL2 = bv_of 2 2 let _EL1 = bv_of 2 1 let _EL0 = bv_of 2 0 let sctlr_el1 = Theory.Var.define bv64s "SCTLR_EL1" let sctlr_el2 = Theory.Var.define bv64s "SCTLR_EL2" let sctlr_el3 = Theory.Var.define bv64s "SCTLR_EL3" type proc_state = { _N: bv1 Theory.bitv; _Z: bv1 Theory.bitv; _C: bv1 Theory.bitv; _V: bv1 Theory.bitv; _D: bv1 Theory.bitv; _A: bv1 Theory.bitv; _I: bv1 Theory.bitv; _F: bv1 Theory.bitv; _SP: bv1 Theory.bitv; _EL: bv2 Theory.bitv (* todo more *) } let _PSTATE: proc_state = { _N = zero (bvs_of_size 1); (* maybe some other default value *) _Z = zero (bvs_of_size 1); _C = zero (bvs_of_size 1); _V = zero (bvs_of_size 1); _D = zero (bvs_of_size 1); _A = zero (bvs_of_size 1); _I = zero (bvs_of_size 1); _F = zero (bvs_of_size 1); _SP = zero (bvs_of_size 1); _EL = zero (bvs_of_size 2) } let sctlr_el1 = Theory.Var.define bv64s "SCTLR_EL1" let sctlr_el2 = Theory.Var.define bv64s "SCTLR_EL2" let sctlr_el3 = Theory.Var.define bv64s "SCTLR_EL3" let sctlr regime = match regime with | _EL1 -> sctlr_el1 | _EL2 -> sctlr_el2 | _EL3 -> sctlr_el3 | _ -> raise (Failure "unreachable") let have_el = (* TODO *) let s1_translation_regime el = if not is_zero (logxor el _EL0) then el else if (* TODO *) let check_SP_alignment = if is_zero (logxor _PSTATE._EL _EL0) then (* TODO *)

I try to model the ProcState just like the pseudo-code, and I found it is awkward to compare two bit vectors. Now I use is_zero (logxor a b) as ==. Is there any elegant way? I think it should have but I cannot find it :(

I found the way to define bit vector types like _N: bv1 Theory.bitv; is awkward. What do think of my defines? These are important because these things and the code style will exist everywhere in the lifter.

Am I on the right way? Because I found this function seems "endless-nested" and a lot of small functions to be implemented. (I know ARM is complex, but just want to make sure.)

size, option and S are all bits in the 32-bit instruction encoding. We can simply extract them out and use them. But the thing is, we are lifting from assembly code, which is essentially characters. So I am confused now.

This part is handled by the LLVM-MC library so we are already handled with decoded parts of the instructions, see the output of llvm-mc or bap-mc.

By the way, how to deal with the UNDEFINED? Raise a failure?

Certainly not! Any valid sequence of bits is an expected input to the lifter/disassembler. And undefined expression has unk semantics. To represent an undefined side-effect, you can use perform

This part is handled by the LLVM-MC library so we are already handled with decoded parts of the instructions, see the output of llvm-mc or bap-mc.

I know this. What I mean is that we are lifting from assembly code and the assembly code is characters like LDR <Wt>, [<Xn|SP>, (<Wm>|<Xm>){, <extend> {<amount>}}]. So, we still need a step to transfer these characters to those variant:

integer scale = UInt(size); if option<1> == '0' then UNDEFINED; // sub-word index ExtendType extend_type = DecodeRegExtend(option); integer shift = if S == '1' then scale else 0;

And then pass them to a function like this to implement LDR (register):

let ldr_register scale option extend_type shift regsize datasize = bits(64) offset = ExtendReg(m, extend_type, shift); if HaveMTEExt() then SetTagCheckedInstruction(TRUE); bits(64) address; bits(datasize) data; if n == 31 then CheckSPAlignment(); address = SP[]; else address = X[n]; address = address + offset; data = Mem[address, datasize DIV 8, AccType_NORMAL]; X[t] = ZeroExtend(data, regsize);

I just want to say this step is not close to the specification and not that obvious to be implemented(although, yes, we can interpret those variants from the characters' instruction). I want to check if this step is needed.

I try to model the ProcState just like the pseudo-code, and I found it is awkward to compare two bit vectors. Now I use is_zero (logxor a b) as ==. Is there any elegant way? I think it should have but I cannot find it :(

We have the full set of ordering operations, including equality and comparison.

I found the way to define bit vector types like _N: bv1 Theory.bitv; is awkward. What do think of my defines? These are important because these things and the code style will exist everywhere in the lifter.

Leading underscores are having special meaning for OCaml compiler, it means that the value is not used, so it won't trigger the warning if it is really unused. So please do not start any of your identifiers with an undescore, unless you really not going to use them.

Also, don't pack CPU state into an OCaml structure, you're confusing here the meta language (ML) and the target language (that we model). The CPU state is a record in the target language so we shall represent it is a record in the target language. We don't have any special provisions for structures or bit-fields in Core Theory but we commonly model CPU state using a single word and access/modify its bits using extract and bit-wise logic operations.

let bv_of (s: int) (n: int) =
let module M = Bitvec.Make(struct let modulus = Bitvec.modulus s end) in
int (bvs_of_size s) (M.int n)

In case when you don't know the modulus beforehand (which is a rare case) you could use the generic interface that works for any modulus, e.g.,

let small ~width x = Bitvec.(int x mod modulus width)

Though it is rare that you have unknown and fully parameterized witdh, in most cases, e.g., you can use the existing modulus, e.g., to create a one-bit bitvector of an integer you can use Bitvec.(int x mod m1). Also, there are Bitvec.one and Bitvec.zero for such well-known constants.

Also, when I try to implement CheckSPAlignment, I need to define a lot of things:

Please don't implement it, we don't need that level of detail. You can either totally ignore this function or model it as,

let aligment_check = Theory.Effect.Sort.ctrl "Aarch64.SPAlignmentCheck"

and then just

perform alignment_check

This will preserve enough information for us so that we can implement this check in runtime, but won't clutter the semantics with service details.

This part is handled by the LLVM-MC library so we are already handled with decoded parts of the instructions, see the output of llvm-mc or bap-mc.

I know this. What I mean is that we are lifting from assembly code and the assembly code is characters like LDR <Wt>, [<Xn|SP>, (<Wm>|<Xm>){, <extend> {<amount>}}]. So, we still need a step to transfer these characters to those variant:

integer scale = UInt(size); if option<1> == '0' then UNDEFINED; // sub-word index ExtendType extend_type = DecodeRegExtend(option); integer shift = if S == '1' then scale else 0;

And then pass them to a function like this to implement LDR (register):

let ldr_register scale option extend_type shift regsize datasize = bits(64) offset = ExtendReg(m, extend_type, shift); if HaveMTEExt() then SetTagCheckedInstruction(TRUE); bits(64) address; bits(datasize) data; if n == 31 then CheckSPAlignment(); address = SP[]; else address = X[n]; address = address + offset; data = Mem[address, datasize DIV 8, AccType_NORMAL]; X[t] = ZeroExtend(data, regsize);

I just want to say this step is not close to the specification and not that obvious to be implemented(although, yes, we can interpret those variants from the characters' instruction). I want to check if this step is needed.

Just to prevent any confusion, let's say it clear that we are not lifting from the assembly (which is indeed a string of characters) but from the LLVM Machine Code (MC), represented as Disasm_expert.Basic.Insn, which is an algebraic data type and is an n-tuple of opcode and a number of operands that are a sum type of registers, bitvectors, and floating-point constants. So no characters are involved here. Subfields of an instruction encoding, when are relevant are usually provided as immediate operands and we commonly translate them to sum types that more or less correspond to the types in the original specification. We can translate directly in the semantic definition function if such a translation is needed only once (in that case probably even defining a type is not necessary) or we can write a helper function when the variant is used in more than one instruction.

I think that we should stick to tid -> op array -> insn knowledge as the type for all our semantic definition functions. The lifter will maintain a hashtable from strings, that represent opcodes, to semantic functions of type tid -> op array -> insn knowledge. There will be a driver function, that will extract from the knowledge base the MC instruction and select a proper semantic function based on opcode. It will also handle exceptions (e.g., from array indexing operations) and map them to undefined semantics (or even mark them as data in the knowledge base).

As a side note, it is possible that LLVM decoder misses some of the fields or decodes them incorrectly, in that case, we can always retrieve the memory representation and access any bits of it.

Phosphorus15 · 2020-09-28T11:51:14Z

The works on this PR should now be transferred to this repo https://github.com/Phosphorus15/arm-isa, which is an implementation & interpretation to ARM ISA's standard machine-readable instructions' semantics, and finally gets the semantics translated to BAP's Core Theory.

Currently this repo is still WIP, with a full set of functions on parse & generation done, some delicate issues on collaborating with both ARM ISA and BAP+LLVM sides still have to be addressed.

ivg · 2021-04-02T20:13:07Z

closing in favor of #1291

Thanks a lot for this contribution I was relying on it, when I was writing the Primus Lisp-based lifter.

zhaoyonghe added 2 commits June 18, 2020 00:25

simple defs

ea99abf

can lift only LDR now

5bda9b6

ivg reviewed Jun 19, 2020

View reviewed changes

plugins/arm_a64/a64.ml Show resolved Hide resolved

zhaoyonghe and others added 4 commits June 23, 2020 00:55

real LDR

7f51b61

Full LDR(immediate)

1b11a97

Merge branch 'master' into a64

52c07a9

LDR,LDRB,LDRH

1d84848

zhaoyonghe commented Jun 24, 2020

View reviewed changes

plugins/arm_a64/a64_defs.ml Show resolved Hide resolved

run indent

7d5cc22

ivg requested changes Jun 24, 2020

View reviewed changes

ivg mentioned this pull request Jun 24, 2020

adds symbolizer based on radare2 #1112

Merged

ivg reviewed Jun 25, 2020

View reviewed changes

This was referenced Jun 25, 2020

enables ARM Thumb support #1122

Closed

ARMv8 (64bit) architecture support #1087

Closed

ivg closed this Apr 2, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Arm AArch64 instruction set plugin. #1141

Arm AArch64 instruction set plugin. #1141

zhaoyonghe commented Jun 19, 2020 •

edited

Loading

XVilka commented Jun 19, 2020 •

edited

Loading

XVilka commented Jun 19, 2020

ivg commented Jun 19, 2020

zhaoyonghe commented Jun 19, 2020

ivg left a comment •

edited

Loading

ivg Jun 24, 2020

ivg Jun 24, 2020

ivg Jun 24, 2020

ivg Jun 24, 2020

ivg Jun 24, 2020 •

edited

Loading

ivg Jun 24, 2020

ivg Jun 25, 2020

ivg Jun 25, 2020

zhaoyonghe Jun 26, 2020

zhaoyonghe Jun 26, 2020 •

edited

Loading

zhaoyonghe Jun 29, 2020 •

edited

Loading

zhaoyonghe Jun 29, 2020 •

edited

Loading

ivg Jun 29, 2020

zhaoyonghe Jun 29, 2020

ivg Jun 29, 2020

ivg Jun 30, 2020

Phosphorus15 commented Sep 28, 2020 •

edited

Loading

ivg commented Apr 2, 2021

Arm AArch64 instruction set plugin. #1141

Arm AArch64 instruction set plugin. #1141

Conversation

zhaoyonghe commented Jun 19, 2020 • edited Loading

XVilka commented Jun 19, 2020 • edited Loading

XVilka commented Jun 19, 2020

ivg commented Jun 19, 2020

zhaoyonghe commented Jun 19, 2020

ivg left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ivg Jun 24, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zhaoyonghe Jun 26, 2020 • edited Loading

Choose a reason for hiding this comment

zhaoyonghe Jun 29, 2020 • edited Loading

Choose a reason for hiding this comment

zhaoyonghe Jun 29, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Phosphorus15 commented Sep 28, 2020 • edited Loading

ivg commented Apr 2, 2021

zhaoyonghe commented Jun 19, 2020 •

edited

Loading

XVilka commented Jun 19, 2020 •

edited

Loading

ivg left a comment •

edited

Loading

ivg Jun 24, 2020 •

edited

Loading

zhaoyonghe Jun 26, 2020 •

edited

Loading

zhaoyonghe Jun 29, 2020 •

edited

Loading

zhaoyonghe Jun 29, 2020 •

edited

Loading

Phosphorus15 commented Sep 28, 2020 •

edited

Loading