overhauls the target/architecture abstraction (1/n) #1225

Merged
merged 4 commits into from
Sep 25, 2020

ivg commented Sep 24, 2020

Introduces Theory.Target.t, which supersedes Bap.Std.Arch.t. The old
representation suffered from a few problems that we inherited from
LLVM. The main issue is that Arch.t is not extensible: in order
to add a new architecture, the Bap.Std code must be changed in a
backward-compatibility-breaking manner. Arch.t is also unable to
represent the whole variety of computing devices, which is especially
relevant to micro-controllers (AVR, PIC) and IoT devices, on which we
are currently focusing. Finally, Arch.t is not precise enough to
capture the information that is necessary for code generation, a new
avenue that we are currently exploring.
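
To illustrate the extensibility problem, here is a minimal, self-contained sketch that contrasts a closed variant with a run-time registry hidden behind an abstract type. The `Target` module below and its `declare`, `lookup`, `name`, and `bits` functions are hypothetical names chosen for the illustration; they are not the actual Theory.Target interface.

```ocaml
(* A closed variant in the style of Arch.t: adding a constructor means
   editing this definition and every exhaustive match over it. *)
type closed_arch = Arm | Mips | X86

let closed_name = function
  | Arm -> "arm"
  | Mips -> "mips"
  | X86 -> "x86"

(* An abstract, extensible alternative: targets are registered at run
   time, so a plugin can add one without touching this module.
   All names here are hypothetical and are not the BAP interface. *)
module Target : sig
  type t
  val declare : ?bits:int -> string -> t
  val lookup : string -> t option
  val name : t -> string
  val bits : t -> int
end = struct
  type t = { name : string; bits : int }

  let registry : (string, t) Hashtbl.t = Hashtbl.create 16

  (* Registers a new target; declaring the same name twice is an error. *)
  let declare ?(bits = 32) name =
    if Hashtbl.mem registry name
    then invalid_arg ("target already declared: " ^ name);
    let t = { name; bits } in
    Hashtbl.add registry name t;
    t

  let lookup name = Hashtbl.find_opt registry name
  let name t = t.name
  let bits t = t.bits
end

(* A "plugin" adds a micro-controller target without changing Target. *)
let avr = Target.declare ~bits:8 "avr8"
```

With the closed variant, supporting AVR would require a new constructor and a change to `closed_name`; with the registry, it is a single `declare` call made from outside the module.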

As a first attempt, which didn't really work, we introduced arch, sub,
and other properties to the core-theory:unit class in #1119. The problem
with that approach was the stringly typed interface, as arch was
represented as a simple string. In addition, the proposed properties
weren't able to describe uncommon architectures. Finally, it was very
awkward to use, since all fields were optional with no good
defaults.

This is the second attempt and it will be split into several pull
requests. The first PR, this one, introduces Theory.Target.t but
still keeps Arch.t alive, i.e., it is still used by all internal and
external components of BAP. This is to ensure that switching to
Target.t doesn't break any existing code. The subsequent pull requests
will gradually deprecate functions that use Arch.t and switch to
Target.t everywhere. The most important switch will affect the
disassembler/decoder framework, which is currently still stuck on
Arch.t. Just to be clear, after this work is finished and until BAP
3.0, and maybe even thereafter, Arch.t will still work as it used to
work and no code will break or require updates. However, newly added
architectures, such as AVR or PIC, i.e., those that cannot be
represented with Arch.t, will not be available to code that still
relies on it.

In addition to Theory.Target.t, we add a few more abstractions and
convenience functions, e.g., Project.empty and a completely new
interface for Project.Input.t generation, which makes it easier to
create projects from strings or other custom data, e.g.,
Project.Input.from_string.

We also add Source, Language, and Compiler abstractions to the
knowledge base Core Theory. These abstractions, together with Target,
describe the full cycle of the program transformation using the
compiler from source code in the given language to the program for the
specified target (and the other way around). The Target abstraction
itself comes with a few more data types that describe various aspects
of the target system, including file formats, ABI, floating-point
ABI (FABI), endianness, which is no longer limited to the binary
choice of little and big endianness, and an extensible data type for
storing target-specific options.

Finally, all targets are formed into hierarchies and families, which
helps in controlling the vast zoo of computer architectures and
devices.
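
As a rough illustration of what a hierarchy of targets can look like, here is a small standalone sketch in which every target optionally names its parent. The record type and the `declare` and `belongs` functions are assumptions made for this example only and do not reproduce the real Theory.Target interface.

```ocaml
(* Hypothetical sketch: each target optionally names its parent, which
   forms a tree; a family is everything below a common ancestor. *)
type target = { name : string; parent : target option }

let declare ?parent name = { name; parent }

(* [belongs ancestor t] is true if [t] is [ancestor] or descends from it. *)
let rec belongs ancestor t =
  t.name = ancestor.name ||
  (match t.parent with
   | None -> false
   | Some p -> belongs ancestor p)

let arm = declare "arm"
let armv7 = declare ~parent:arm "armv7"
let armv7_le = declare ~parent:armv7 "armv7+le"
let mips = declare "mips"
```

An analysis that only cares about the family can then ask `belongs arm t` instead of enumerating every concrete variant.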

Target.t is an abstract, self-describing data type that includes
enough information to describe all the details of the
architecture. We also provide four library modules, for arm, mips,
powerpc, and x86, that expose the currently declared targets.

Our LLVM backend is not yet precise enough to recognize many of the
supported targets, and we don't yet have analyses that infer
the target from the binary, but we will add the --target option in
the next PRs (when we switch to Target.t everywhere).

As usual, comments, questions, reviews are very welcome.

The new implementation leaves the document untouched if the other one is
an empty document or is the same. When the documents are different and
non-trivial, we take the larger one and update it with the smaller
one, which is also faster. In the end, this makes merge more than 10
times faster under certain scenarios.
The sexp uses the OGRE syntax, not just a derivation from the
representation.
ivg merged commit 0095046 into BinaryAnalysisPlatform:master Sep 25, 2020
ivg added a commit to ivg/bap that referenced this pull request Sep 30, 2020
In the second patch of this series (BinaryAnalysisPlatform#1225) we completely got rid of
the Arch.t dependency in the disassembler engine, which finally opens the
path for seamless integration of targets that are not representable
with Arch.t.

To achieve this, we introduced proper dependency injection into the
disassembler driver so that it is no longer responsible for creating
the llvm MC disassembler. Instead, a plugin that implements a target,
aka the target support package, has to create a disassembler and is
now in full control of all parameters: it can choose the backend and specify
the CPU and other details of the encoding. The encoding is a new
abstraction in our knowledge base that breaks the tight connection
between the target and the way the program for that target is
encoded. Unlike the target, which is a property of a unit of code, the
encoding is associated with the program itself, i.e., it is a property
of each instruction. That enables targets with context-dependent
encodings, such as ARM's thumb mode and MIPS16e, for binary encodings, and
also paves the road for non-binary encodings for the same
architecture, e.g., text assembly (which may also have several
encodings of its own, cf. att vs intel syntax). We base this branch on
the enable-interworking branch (BinaryAnalysisPlatform#1188), and this branch fully supersedes and
includes it, since encodings made it much more natural. It is still
largely untested how it will work with real thumb binaries, but we will
get back to it when we merge BinaryAnalysisPlatform#1178.

Another big update is that the disassembler backend (which is
responsible for translating bits into machine instructions) is no
longer required to be implemented in C++, and it is now possible to
write your own backends/disassemblers in pure OCaml, e.g., to support
PIC microcontrollers. The Backend interface is pretty low-level and we
might provide higher-level interfaces later; see
`Disasm_expert.Backend` for the interface and detailed comments.

Finally, we rectify the interface introduced in the previous PR and
flatten the hierarchy of the abstractions newly introduced to the Core
Theory, i.e., instead of `Theory.Target.Endianness` we now have
`Theory.Endianness` and so on. We also made the `Enum` module public,
which introduces enumerated types built on top of `Knowledge.Value`s.

In the next episodes of this series we will gradually remove Arch.t
from other bap components and further clean up the newly introduced
interfaces.
ivg added a commit that referenced this pull request Oct 1, 2020
* enables interworking in the disassembler driver

What is interworking
--------------------

Interworking is a feature of some architectures that enables mixing
several instruction sets in the same compilation unit. An example is the
arm and thumb interworking that this branch is trying to add.

What is done
-------------

1. We add the switch primitive to the basic interface that changes the
disassembler in the current disassembling state. It is a bold move
and can have consequences, so it should be carefully reviewed.

2. We attribute each destination in the disassembler driver state with
the architecture and call switch every time we are going to
disassemble the next chunk of memory.

3. The default rule that extends the unit architecture to all
instructions in that unit is disabled for ARM/Thumb and is overridden
in the arm plugin with the following behavior: if an arm unit has a file
and that file has a symbol table, then we provide information based on
the last bit of that symbol table (todo: we should also check for
abi); otherwise we propagate the unit arch to instructions.
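
The rule in item 3 can be sketched as follows. This is a standalone illustration, not the arm plugin code: it assumes that "the last bit" refers to the least significant bit of a symbol's value, following the common ARM ELF convention in which a set low bit marks a Thumb function, and the names `encoding_of_symbol`, `code_address`, and `classify` are hypothetical.

```ocaml
(* The two instruction sets that can be mixed in one unit. *)
type encoding = A32 | T32

(* Assumption: a set low bit in the symbol value marks Thumb code. *)
let encoding_of_symbol value =
  if value land 1 = 1 then T32 else A32

(* The actual code address is the symbol value with the low bit cleared. *)
let code_address value = value land (lnot 1)

(* [classify ~unit_encoding symbols] returns a function from addresses to
   encodings. Addresses covered by the symbol table get the encoding of
   their symbol; everything else (including the case of an empty symbol
   table) falls back to the encoding of the whole unit. *)
let classify ~unit_encoding symbols =
  let table = Hashtbl.create 16 in
  List.iter (fun (_name, value) ->
      Hashtbl.replace table (code_address value) (encoding_of_symbol value))
    symbols;
  fun addr ->
    match Hashtbl.find_opt table addr with
    | Some enc -> enc
    | None -> unit_encoding

(* Example symbol table: one ARM function and one Thumb function. *)
let encoding_at =
  classify ~unit_encoding:A32 ["main", 0x8000; "thumb_fn", 0x8101]
```

Here `encoding_at 0x8100` yields `T32` because the `thumb_fn` symbol has its low bit set, while an address that has no symbol inherits the unit encoding.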

What is to be done
------------------

Next, the arm lifter shall provide a promise to compute
destinations (which itself will require destinations, because we don't
really want to compute them) and provide the destination architecture,
based on the source encoding. We can safely examine any representation
of the instruction since it will already be lifted by that moment.

* flattens the target interface, publishes the Enum module

also makes Enum more strict by checking that the element is indeed a
member of the set of elements and by preventing double declarations.
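
The two strictness checks mentioned above (membership and no double declarations) can be sketched with a small generative functor. This is only an illustration built on the standard library; the `Make_enum` functor and its `declare`, `read`, and `name` functions are hypothetical and are not the published `Enum` interface, which is built on `Knowledge.Value`s.

```ocaml
(* Hypothetical sketch of a strict, extensible enumeration. Each
   application of the functor creates a fresh enumerated type. *)
module Make_enum () : sig
  type t
  val declare : string -> t   (* fails if the element already exists *)
  val read : string -> t      (* fails if the element was never declared *)
  val name : t -> string
end = struct
  type t = string

  let elements : (string, unit) Hashtbl.t = Hashtbl.create 16

  let declare name =
    if Hashtbl.mem elements name
    then invalid_arg ("already declared: " ^ name);
    Hashtbl.add elements name ();
    name

  let read name =
    if not (Hashtbl.mem elements name)
    then invalid_arg ("not a member: " ^ name);
    name

  let name t = t
end

module Endianness = Make_enum ()

let le = Endianness.declare "le"
let eb = Endianness.declare "eb"

(* Helper for the examples: true if [f ()] raises Invalid_argument. *)
let rejects f =
  try ignore (f ()); false with Invalid_argument _ -> true
```

With these definitions, `Endianness.read "le"` succeeds, whereas both `Endianness.declare "le"` (a double declaration) and `Endianness.read "middle"` (a non-member) raise `Invalid_argument`.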

* adds an llvm decode for x86

* drops the dependency on arch from the disassembler driver

* overhauls the target/architecture abstraction (2/n)

ivg added a commit to ivg/bap that referenced this pull request Feb 22, 2021
This PR is the continuation of the BinaryAnalysisPlatform#1225, BinaryAnalysisPlatform#1226, and BinaryAnalysisPlatform#1227 series of
changes that were focused on substituting the old and inextensible
`Arch.t` abstraction with the new `Theory.Target.t` representation.

This episode is instigated by the upcoming implementation of the
RISCV target. Since RISCV is a target that is not supported by
Arch.t, it became a good test of the new Theory.Target.t abstraction.

As the RISCV work showed, we still have lots of code that depends on
Arch.t, most importantly Primus, which was fully dependent on
Arch.t. The main issue was that Theory.Target.t didn't provide any
means to encode register classes, which prevented us from using it
everywhere in Primus, e.g., we need to know which register is the
stack pointer in order to set up the stack.

To implement this, we introduce a new abstraction called _role_. A
_role_ could generally be applied to any entity, but so far we are only
talking about the roles of registers in various targets. The target
definition now accepts the `regs` parameter, which takes the register
file specification with each register assigned one or more roles,
e.g., here is the register file specification for 8086,

```ocaml
(* main, index, segment, and flags are register sets defined elsewhere *)
Theory.Role.Register.[
  [general; integer], main @< index @< segment;
  [stack_pointer], untyped [reg r16 "SP"];
  [frame_pointer], untyped [reg r16 "BP"];
  [Role.index], untyped index;
  [Role.segment], untyped segment;
  [status], untyped flags;
  [integer], untyped [
    reg bool "CF";
    reg bool "PF";
    reg bool "AF";
    reg bool "ZF";
    reg bool "SF";
    reg bool "OF";
  ];
]
```

I.e., we assign a set of roles to a set of registers. We also now have
two new functions, `Theory.Target.regs` and `Theory.Target.reg`, that
enable querying the register file of the target for registers that
fulfill one or more roles. Whilst we publish a limited number of
well-known (blessed) roles in the `Theory.Role.Register` module, more
roles could be added as users need them. For example, in the code snippet
above we have two non-standard roles that are specific to the x86
architectures, `Role.index` and `Role.segment`.

With roles we can drop the dependency on Target in most of the places
where it makes sense (I still left it in x86 and other target-specific
plugins, which obviously are independent of the newly added
architectures).
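
To make the idea of querying a register file by roles concrete, here is a self-contained sketch that models roles as plain strings. It does not use the BAP API: the `regs` and `unique_reg` functions below only mimic the intent of `Theory.Target.regs` and `Theory.Target.reg`, their signatures are assumptions, and the register file is a reduced version of the 8086 example.

```ocaml
(* A register is just a name and a width in this sketch. *)
type reg = { name : string; bits : int }

let reg bits name = { name; bits }

let flags = [reg 1 "CF"; reg 1 "ZF"]

(* A register file assigns a set of roles to a set of registers. The same
   register may appear in several entries and accumulates their roles. *)
let i8086 = [
  ["general"; "integer"],
  [reg 16 "AX"; reg 16 "BX"; reg 16 "CX"; reg 16 "DX"];
  ["stack_pointer"], [reg 16 "SP"];
  ["frame_pointer"], [reg 16 "BP"];
  ["status"], flags;
  ["integer"], flags;
]

(* All roles assigned to register [r] anywhere in the file. *)
let roles_of file r =
  List.concat_map
    (fun (roles, regs) -> if List.mem r regs then roles else [])
    file

(* Every register of the file, without duplicates, sorted by name. *)
let all_regs file =
  List.concat_map snd file |> List.sort_uniq compare

(* Registers that fulfill all of the requested roles. *)
let regs ~roles file =
  all_regs file |> List.filter (fun r ->
      let have = roles_of file r in
      List.for_all (fun role -> List.mem role have) roles)

(* The register that fulfills the roles, if there is exactly one. *)
let unique_reg ~roles file =
  match regs ~roles file with
  | [r] -> Some r
  | _ -> None
```

A client such as a stack initializer can then ask `unique_reg ~roles:["stack_pointer"] i8086` and obtain `SP` without knowing which architecture it is dealing with.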
ivg deleted the revamps-the-target-theory branch December 1, 2021 19:14