introduces a generic byte pattern matcher based on Ghidra #1364

Merged (28 commits) on Nov 17, 2021

@ivg ivg commented Nov 16, 2021

This PR implements a byte pattern matcher that can be fed with Ghidra patterns and used to identify function starts, instruction encodings (e.g., Thumb vs. ARM), function names, code boundaries, and more. Beyond that, our implementation allows attaching arbitrary actions to the patterns. The semantics of the actions is specified in Primus Lisp, with defmethod.

The PR is independent of Ghidra, and the patterns themselves will be packed in the bap-signatures package in opam (with corresponding attributions). Once merged, and assuming that bap-signatures, libghidra-data, or both packages are installed, it will work out of the box and improve function start identification and, most importantly, significantly improve the disassembly of ARM/Thumb interworking binaries.

The pattern language is quite versatile and supports active patterns, i.e., patterns that, in addition to matching a byte sequence, must satisfy some non-trivial conditions. We implement actions in Primus Lisp to enable extensibility and fast prototyping. So far, we have (trivially) implemented only two actions. Here is the current implementation, to give you the flavor of actions:

(defmethod bap:patterns-action (action addr attrs)
  (when (is-in action 'funcstart 'possiblefuncstart)
    (promise-function-start addr)))

(defmethod bap:patterns-action (action addr attrs)
  (when (= action 'setcontext)
    (let ((name (patterns-attribute attrs :name))
          (mode (patterns-attribute attrs :value)))
      (when (and name mode (= name 'TMode))
        (let ((lang (if (= mode '1) :T32 :A32)))
          (arm-set-encoding addr lang))))))
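
For readers less familiar with Primus Lisp, the dispatch behind these two methods can be modelled in Python. This is only an illustrative sketch, not BAP code: `defmethod`, `send`, `function_starts`, and `encodings` are hypothetical stand-ins, and it assumes that every method defined for a signal runs when the signal is sent and that attributes arrive as a mapping of strings.

```python
from collections import defaultdict

# Hypothetical registry: signal name -> list of handlers.
_methods = defaultdict(list)


def defmethod(signal):
    """Register a handler for a signal; several handlers may share one signal."""
    def register(fn):
        _methods[signal].append(fn)
        return fn
    return register


def send(signal, *args):
    """Call every handler registered for the signal (registration order is an assumption)."""
    for fn in _methods[signal]:
        fn(*args)


# Stand-ins for the facts that promise-function-start and arm-set-encoding record.
function_starts = set()
encodings = {}


@defmethod("bap:patterns-action")
def mark_function_start(action, addr, attrs):
    if action in ("funcstart", "possiblefuncstart"):
        function_starts.add(addr)


@defmethod("bap:patterns-action")
def set_arm_encoding(action, addr, attrs):
    name, mode = attrs.get("name"), attrs.get("value")
    if action == "setcontext" and name == "TMode" and mode is not None:
        encodings[addr] = "T32" if mode == "1" else "A32"


# Three matches reported by an imaginary matcher.
send("bap:patterns-action", "funcstart", 0x1000, {})
send("bap:patterns-action", "setcontext", 0x1000, {"name": "TMode", "value": "1"})
send("bap:patterns-action", "setcontext", 0x2000, {"name": "TMode", "value": "0"})
print(sorted(function_starts), encodings)
```

Both handlers see every action and ignore the ones they do not care about, which mirrors the `when` guards in the Lisp methods above.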

So far it works perfectly even without the complex constraints that some of the patterns impose, possibly because bap is quite resilient to function start false positives, thanks to byteweight. We might later implement those constraints as necessary. In addition to the vast repository of patterns available from Ghidra (again, there is no need to install Ghidra; we will have them in our sigs.tar.gz on the release page, installable via opam install bap-signatures), it is possible to define your own patterns and your own actions. The format is described in the bap --patterns-help man page, copied here for convenience:

patterns(3)                 BAP Programmer's Manual                patterns(3)



DESCRIPTION
       Applies semantic actions to the matching byte patterns. The patterns
       are specified in an XML format, described below, and the actions are
       implemented with Primus Lisp methods. Used to identify function starts,
       instruction encodings, function names, etc.

INPUT FORMAT
       The patterns are represented with XML files, each corresponding to a
       specific target. The patternconstraints.xml files are used as the table
       of contents and contain mapping between targets and files that provide
       patterns for that target.

       The file scheme itself is derived from the Ghidra bytesearch patterns,
       so that the patterns provided by Ghidra could be used as is.

PATTERNS SCHEME
       The file with patterns must have a single element, patternlist, which
       contains a list of pattern or patternpairs elements. The pattern
       element is composed of a list of data elements and a list of action
       elements. The data element describes the ditted pattern, and an action
       is any element (other than data), which is translated to a Primus Lisp
       signal, bap:patterns-action name addr attrs, where name is the element
       name, addr is the address where the pattern matches, and attrs is the
       list of element attributes, accessible via patterns-attribute function.

       The patternpairs element is very similar to the pattern element
       except that the list of patterns is described by a cartesian product of
       two sets of patterns, prepatterns and postpatterns. Both prepatterns
       and postpatterns must contain a non-empty list of data elements and
       nothing more. The resulting patterns are made by concatenating every
       prepattern with every postpattern and leaving only those combinations
       that have the total number of non-masked bits equal to totalbits and
       the total number of non-masked bits in the postpattern part equal to
       postbits. When the resulting pattern matches with a sequence of bytes,
       the address of a byte that matches with the start of the postpattern is
       passed to the bap:patterns-action method.

PATTERNCONSTRAINTS SCHEME
       The patternconstraints.xml file is used as the table of contents and
       contains a mapping between targets (languages in Ghidra parlance) and
       paths to corresponding files, relative to the location of the
       patternconstraints.xml file.

       The file must have a single patternconstraints element that contains a
       list of language elements. The language element is required to have the
       id attribute which must be either the name of a BAP target (see bap
       list targets) or a Ghidra language specification, which is a four-tuple
       of elements, separated with :. The first element is the architecture
       name, the second is the endianness, the third is the bitness, and the
       last is the variant. Any field except the architecture could use
       default or just * as the wildcard character that matches with anything.
       The endianness is specified as either LE for little or BE for big
       endianness. For instructions in little endian and data in big endian,
       use LEBE.

       The language element contains a list of patternfile or compiler
       elements. The patternfile element contains the path to the patterns
       file, and the compiler element contains the patternfile element with
       patterns specific to a compiler, which is specified in the required id
       attribute of the compiler element.

DITTED PATTERNS
       Each pattern is described as a ditted sequence of bits or nibbles. Each
       bit is represented with 0, 1, and . that match, correspondingly, with
       zero, one, and any bit. And a nibble is a sequence of hexadecimal
       digits, matching with their corresponding four-bit representation, and
       ., which matches with any four bits (i.e., with any binary number in
       the range from 0 to 0xF). The nibble sequence must start with 0x and
       continues until the next whitespace character. If the sequence doesn't
       start with 0x then it is assumed to be a sequence of bits. Sequences
       could be separated by the arbitrary number of whitespace characters.

BUILTIN ACTIONS
       All actions in the bap namespace, which is set as the default namespace
       when parsing the patterns file, are reserved to BAP. It is possible to
       add arbitrary actions, provided that they are not using the bap
       namespace. The following set of actions have predefined meaning.

       funcstart and possiblefuncstart mark the matching sequence as the
       function start. The attributes could be used to impose an extra
       constraint. The current implementation ignores them, but they will be
       implemented later.

       setcontext is used to control the disassembler context and currently
       the following two attributes are recognized, name and value. When the
       name is set to TMode then the matching sequence has the encoding T32 if
       the value is 1 and A32 otherwise.

 2.4.0-alpha+175870a                                               patterns(3)
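
The ditted pattern syntax and the patternpairs filtering rule described in the man page can be sketched in Python. This is a minimal illustration under stated assumptions, not the matcher implemented in this PR: the function names (`parse_ditted`, `matches`, `find_all`, `fixed_bits`, `combine_pairs`) are hypothetical, bits are assumed to be written most significant first, and a pattern is assumed to span a whole number of bytes.

```python
from itertools import product


def parse_ditted(text):
    """Turn a ditted pattern into a list of (mask, value) pairs, one per byte."""
    bits = []
    for token in text.split():
        if token.startswith("0x"):
            # Nibble sequence: each hex digit is four bits, '.' is any four bits.
            for ch in token[2:]:
                bits.extend("...." if ch == "." else format(int(ch, 16), "04b"))
        else:
            # Bit sequence made of 0, 1, and '.' (any bit).
            if not set(token) <= set("01."):
                raise ValueError(f"bad bit sequence: {token!r}")
            bits.extend(token)
    if len(bits) % 8:
        raise ValueError("pattern is not a whole number of bytes")
    pattern = []
    for i in range(0, len(bits), 8):
        byte = bits[i:i + 8]
        mask = int("".join("0" if b == "." else "1" for b in byte), 2)
        value = int("".join("0" if b == "." else b for b in byte), 2)
        pattern.append((mask, value))
    return pattern


def matches(pattern, data, offset=0):
    """Check whether the pattern matches data starting at offset."""
    if offset + len(pattern) > len(data):
        return False
    return all((data[offset + i] & mask) == value
               for i, (mask, value) in enumerate(pattern))


def find_all(pattern, data):
    """Return every offset at which the pattern matches (naive scan)."""
    return [off for off in range(len(data) - len(pattern) + 1)
            if matches(pattern, data, off)]


def fixed_bits(pattern):
    """Count the non-masked bits of a pattern."""
    return sum(bin(mask).count("1") for mask, _ in pattern)


def combine_pairs(pres, posts, totalbits, postbits):
    """Concatenate every prepattern with every postpattern and keep the
    combinations allowed by totalbits and postbits, as the man page states.
    Each result carries the byte offset where the postpattern starts."""
    return [(pre + post, len(pre))
            for pre, post in product(pres, posts)
            if fixed_bits(post) == postbits
            and fixed_bits(pre) + fixed_bits(post) == totalbits]


pat = parse_ditted("0x1. 0000.001")
data = bytes([0x00, 0x1F, 0x09, 0x12, 0x01, 0x13, 0x02])
print(find_all(pat, data))  # [1, 3]

pairs = combine_pairs([parse_ditted("0x00"), parse_ditted("0x0.")], [pat],
                      totalbits=19, postbits=11)
print([(len(p), off) for p, off in pairs])  # [(3, 1)]
```

The second element of each pair is the offset that would be added to the match position before reporting it, since the address passed to the action is that of the postpattern start.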

ivg added 28 commits November 16, 2021 15:02
we need this to properly handle nested loops

also adds more combinators and hides the implementation with an interface
We will eventually release it as a separate library as we will heavily
rely on it in the future for parsing the decompiler output.
it may also include compilers; now the target description parser
should also be updated, as soon as I figure out what the fourth
element of the triple means in ghidra.
tries do not really work correctly with masked data
Now an action could be any element other than `<data>`. The actions
are represented as a pair of a name and an attribute mapping. Both
action names and attribute names are represented with KB.Name.t. The
namespaces are supported but we take only the host part of the XML
namespace name if it is represented as a URI.
not ideal, but at least some documentation :)
so that it will now point to the updated signatures and won't
overwrite them in the next week alpha release
ivg added a commit to BinaryAnalysisPlatform/opam-repository that referenced this pull request Nov 16, 2021
@ivg ivg merged commit 8e5f867 into BinaryAnalysisPlatform:master Nov 17, 2021
ivg added a commit to BinaryAnalysisPlatform/opam-repository that referenced this pull request Nov 17, 2021
ivg added a commit to ivg/opam-repository that referenced this pull request Dec 8, 2021
This release brings Ghidra as the new disassembler
and lifting backend, significantly improves our Thumb
lifter (especially with respect to interworking), adds
forward-chaining rules and context variables to the knowledge base,
support for LLVM 12, a pass that flattens IR, and a new framework for
pattern matching on bytes that leverages the available patterns and
actions from the Ghidra project.

It also contains many bug fixes and improvements, most notably
performance improvements that make bap 30 to 50 percent
faster. See below for the full list of changes.

Package-wise, we split bap into three parts: `bap-core`, `bap`, and
`bap-extra`. The `bap-core` metapackage contains the minimal set of
core packages that are necessary to disassemble a binary; the `bap`
package extends this set with various analyses; finally, `bap-extra`
includes rarely used or hard-to-install packages, such as the symbolic
executor, which is very heavy on installation, and `bap-ghidra`, which
is right now in a very experimental stage and is only installable on
Ubuntu 18.04, since it requires the libghidra-dev package available
from a PPA:

```
sudo add-apt-repository ppa:ivg/ghidra -y
sudo apt-get install libghidra-dev -y
sudo apt-get install libghidra-data -y
```

Changelog
=========

Features
--------

- BinaryAnalysisPlatform/bap#1325 adds armeb abi
- BinaryAnalysisPlatform/bap#1326 adds experimental Ghidra disassembler and lifting backend
- BinaryAnalysisPlatform/bap#1332 adds the flatten pass
- BinaryAnalysisPlatform/bap#1341 adds context variables to the knowledge base
- BinaryAnalysisPlatform/bap#1343 adds register aliases to the Core Theory
- BinaryAnalysisPlatform/bap#1358 adds LLVM 12 support
- BinaryAnalysisPlatform/bap#1360 extends the knowledge monad interface
- BinaryAnalysisPlatform/bap#1363 adds forward-chaining rules and Primus Lisp methods
- BinaryAnalysisPlatform/bap#1364 adds a generic byte pattern matcher based on Ghidra
- BinaryAnalysisPlatform/bap#1365 adds support for the Thumb IT blocks
- BinaryAnalysisPlatform/bap#1369 adds some missing `t2LDR.-i12` instructions to the Thumb lifter

Improvements
------------

- BinaryAnalysisPlatform/bap#1336 improves the `main` function discovery heuristics
- BinaryAnalysisPlatform/bap#1337 adds more Primus Lisp stubs and fixes some existing
- BinaryAnalysisPlatform/bap#1342 uses context variables to store the current theory
- BinaryAnalysisPlatform/bap#1344 uses the context variables to store the Primus Lisp state
- BinaryAnalysisPlatform/bap#1355 tweaks symbolization and function start identification facilities
- BinaryAnalysisPlatform/bap#1353 improves arm-family support
- BinaryAnalysisPlatform/bap#1356 stops proposing aliases as potential subroutine names
- BinaryAnalysisPlatform/bap#1361 rewrites knowledge and primus monads
- BinaryAnalysisPlatform/bap#1370 tweaks Primus Lisp method resolution to keep super methods
- BinaryAnalysisPlatform/bap#1375 error handling and performance tweaks
- BinaryAnalysisPlatform/bap#1378 improves reification of calls in the IR theory (part I)
- BinaryAnalysisPlatform/bap#1379 improves semantics of some ITT instructions
- BinaryAnalysisPlatform/bap#1380 fixes handling of fallthroughs in IR theory

Bug Fixes
---------

- BinaryAnalysisPlatform/bap#1328 fixes C.ABI.Args `popn` and `align_even` operators
- BinaryAnalysisPlatform/bap#1329 fixes frame layout calculation in the Primus loader
- BinaryAnalysisPlatform/bap#1330 fixes the address size computation in the llvm backend
- BinaryAnalysisPlatform/bap#1333 fixes and improves label handling in the IR theory
- BinaryAnalysisPlatform/bap#1338 fixes core:eff theory
- BinaryAnalysisPlatform/bap#1340 fixes the Node.update for graphs with unlabeled nodes
- BinaryAnalysisPlatform/bap#1347 fixes a knowledge base race condition in the run plugin
- BinaryAnalysisPlatform/bap#1348 fixes endianness in the raw loader
- BinaryAnalysisPlatform/bap#1349 short-circuits evaluation of terms in Bap_main.init
- BinaryAnalysisPlatform/bap#1350 fixes variable rewriter and some Primus Lisp symbolic functions
- BinaryAnalysisPlatform/bap#1351 fixes and improves aarch64 lifter
- BinaryAnalysisPlatform/bap#1352 fixes several Primus Lisp stubs
- BinaryAnalysisPlatform/bap#1357 fixes some T32 instructions that access the PC
- BinaryAnalysisPlatform/bap#1359 fixes handling of let-bound variables in flatten pass
- BinaryAnalysisPlatform/bap#1366 fixes a bug in the `cmp` semantics
- BinaryAnalysisPlatform/bap#1374 fixes handling modified immediate constants in ARM T32 encoding
- BinaryAnalysisPlatform/bap#1376 fixes fresh variable generation
- BinaryAnalysisPlatform/bap#1377 fixes the IR theory implementation

Tooling
-------

- BinaryAnalysisPlatform/bap#1319 fixes the shared folder in deb packages
- BinaryAnalysisPlatform/bap#1320 removes sudo from postinst and postrm actions in the deb packages
- BinaryAnalysisPlatform/bap#1321 enables push flag in the publish-docker-image action
- BinaryAnalysisPlatform/bap#1323 fixes the ppx_bap version in the dev-repo opam file
- BinaryAnalysisPlatform/bap#1331 fixes the docker publisher, also enables manual triggering
- BinaryAnalysisPlatform/bap#1327 fixes a typo in the ubuntu dockerfiles
- BinaryAnalysisPlatform/bap#1345 fixes bapdoc
- BinaryAnalysisPlatform/bap#1346 nightly tests are failing due to a bug upstream
@ivg ivg mentioned this pull request Mar 1, 2022
@ivg ivg deleted the adds-pattern-matcher branch March 9, 2022 17:17