This directory contains a dynamic analysis framework built using DynamoRIO. Together, these tools and techniques are called YARN.

Please note that this README and the one at tracetools/README.md may be out-of-date.

This is messy research code. Use at your own risk.

Requirements

  • Docker, installed and running.
  • Git LFS extension, installed.
  • Binary Ninja license (see the "Installing" section for where it should be copied). To avoid API version mismatch issues, Binary Ninja binaries are included in this repository. The license is only required if you plan on developing or using anything that relies on signatures/moment-of-recognition, namely tracetools/tools/pt_tracker.py or tracetools/tools/poppler_jpeg.py.

The DynamoRIO-based YARN instrumentation tool does not work on macOS, even when run indirectly via Docker.

git clone-ing the source code

Make sure you have the git LFS extension installed before cloning anything. If it isn't installed then none of the *.zip files in ./parsers will be valid zip files.
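A quick way to confirm that the LFS content was actually fetched is to check each archive. The following is a minimal sketch, not part of the repository: it assumes the archives sit somewhere under ./parsers and relies on the fact that a Git LFS pointer file begins with the line "version https://git-lfs.github.com/spec/v1". The function names are hypothetical.

```python
from pathlib import Path
import zipfile

# First line of a Git LFS pointer file (what you get when LFS was not installed).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def classify_archive(path):
    """Return 'zip', 'lfs-pointer', or 'unknown' for a single file."""
    path = Path(path)
    if zipfile.is_zipfile(path):
        return "zip"
    with path.open("rb") as fh:
        head = fh.read(len(LFS_POINTER_PREFIX))
    if head == LFS_POINTER_PREFIX:
        return "lfs-pointer"
    return "unknown"


def check_parsers_dir(parsers_dir="./parsers"):
    """Map each *.zip below parsers_dir (relative path) to its classification."""
    root = Path(parsers_dir)
    return {
        str(p.relative_to(root)): classify_archive(p)
        for p in sorted(root.rglob("*.zip"))
    }
```

Any entry reported as lfs-pointer means the clone happened without LFS; installing the extension and running git lfs pull in the repository should replace the pointers with the real archives.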

Installing

If you have one, copy your Binary Ninja license to third-party/binaryninja/license.dat (both headless and regular Binary Ninja packages work).

After you have properly installed Docker on your system, build the Docker image:

./build.sh

This will build a Docker image for the DynamoRIO-based tools by default. The image is named mr_memtrace-analysis-dev.

If you find yourself needing to debug the build.sh file, run it with the --no-cache option to force docker to rebuild the image from scratch.

Running the YARN docker container

Starting the YARN docker container:

docker run -it --rm mr_memtrace-analysis-dev:latest

If you plan on doing any tool development, you can mount your local memtrace directory (repository root) to /processor.

docker run -it --rm -v"$(pwd):/processor" mr_memtrace-analysis-dev:latest

This should be run from the root of the memtrace directory in which you will be working on memtrace or memtrace-tool scripts. If you edit any files that relate to the DynamoRIO instrumentation (e.g., mem-trace.c), you will need to run make from your container's /processor directory.

Note: if the filesystem where your docker containers live has limited storage you may wish to tell docker to store the results/logs generated by memtrace (stored in the container's /results/ directory) elsewhere using the -v (volume) option to specify the host directory, e.g.,

docker run -it --rm -v"/media/largedisk/results:/results" -v"$(pwd):/processor" mr_memtrace-analysis-dev:latest
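If you launch the container from scripts, the mount variants above can be assembled in one place. This is a hypothetical helper, not something shipped in the repository; it only uses the image name and the /results and /processor mount points mentioned in this README.

```python
import os

IMAGE = "mr_memtrace-analysis-dev:latest"  # image name used throughout this README


def docker_run_argv(repo_root=None, results_dir=None, image=IMAGE):
    """Build the `docker run` argv; a mount is added only when its path is given."""
    argv = ["docker", "run", "-it", "--rm"]
    if results_dir is not None:
        # Host directory that receives the results/logs written to /results.
        argv += ["-v", f"{os.path.abspath(results_dir)}:/results"]
    if repo_root is not None:
        # Local checkout mounted at /processor for tool development.
        argv += ["-v", f"{os.path.abspath(repo_root)}:/processor"]
    argv.append(image)
    return argv
```

The returned list can be passed to subprocess.run(argv, check=True) from an interactive terminal, since -it expects a TTY.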

Quick start to an instrumented parser run

To quickly get started, run the following commands (the first three on the host, the last one inside the container started by docker run):

mkdir results
./build.sh
docker run -it --rm -v "$(pwd)/results:/results" mr_memtrace-analysis-dev:latest
./run_trace.py /pdfs/hello.pdf

Next proceed to the Postprocessing instrumentation results section to learn how to run analyses on the tracer's results. If you are interested in generating and viewing a parse tree for the results continue on to the Parse tree analysis and viewing section.

Running an instrumented parser run

Use run_trace.py within a mr_memtrace-analysis-dev container to execute an instrumented run of a parser. It supports a small number of parsers/parser families, including poppler's pdftotext and pdftops as well as mupdf's mutool conversion to ps and text.

run_trace.py wraps the output generated by the DynamoRIO tools in a structured layout that all the processing tools in memtrace-tools understand.

run_trace.py still has a lot of hard-coded cruft in there, so for now it is best to run it inside the provided docker container.

run_trace.py can be run on the included example pdf:

./run_trace.py /pdfs/hello.pdf

You must specify the path to at least one input file as an argument to run_trace.py, e.g.,

./run_trace.py path/to/foo.pdf path/to/bar.pdf

By default, run_trace.py will execute poppler's pdftops. To see which other parser families/binaries are supported, execute ./run_trace.py --list. If you would like to run a non-default parser, specify the parser family using the -p option, the version using -v, and the binary using -b, e.g.,

> ./run_trace.py --list
Parser family mupdf:
	 input type: pdf:
	 version: 1.18.0
		 supported binaries: (name/command)
			 - mutool: mutool clean -s -ggg {in_file} out.pdf
			 - mutops: mutool convert -F ps -o out.ps {in_file}
			 - mutotext: mutool convert -F txt -o out.txt {in_file}
			 - mutotext-decrypt-user: mutool convert -p user -F txt -o out.txt {in_file}
			 - mutotext-decrypt-owner: mutool convert -p owner -F txt -o out.txt {in_file}
Parser family poppler:
	 input type: pdf:
	 version: 0840
		 supported binaries: (name/command)
			 - pdftops: utils/pdftops {in_file} out.ps
			 - pdf-fullrewrite: test/pdf-fullrewrite {in_file} out.pdf
			 - pdftocairo: utils/pdftocairo -png {in_file} out
			 - pdftotext: utils/pdftotext {in_file} out.txt
			 - pdftotext-decrypt-user: utils/pdftotext -upw user {in_file} out.txt
			 - pdftotext-decrypt-owner: utils/pdftotext -opw owner {in_file} out.txt
	 version: eval1_sri
		 supported binaries: (name/command)
			 - pdftops: utils/pdftops {in_file} out.ps
> ./run_trace.py -p mupdf -v 1.18.0 -b mutops path/to/foo.pdf path/to/bar.pdf

(Note: if support needs to be added for a different parser family, version, and/or binary, it needs to be added to a JSON configuration file in ./parser-settings. The contents/semantics/format of these files are currently undocumented.)
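When batch-tracing many inputs it can help to build the run_trace.py command line programmatically. The sketch below is a hypothetical wrapper that only uses the options this README mentions (-p, -v, -b, plus -t and -r); the function name and keyword names are made up for illustration.

```python
def run_trace_argv(inputs, family=None, version=None, binary=None,
                   tag=None, results_dir=None):
    """Build an argv list for ./run_trace.py from the documented options."""
    inputs = list(inputs)
    if not inputs:
        raise ValueError("run_trace.py needs at least one input file")
    argv = ["./run_trace.py"]
    if family is not None:
        argv += ["-p", family]       # parser family, e.g. mupdf
    if version is not None:
        argv += ["-v", version]      # parser version, e.g. 1.18.0
    if binary is not None:
        argv += ["-b", binary]       # binary name as shown by --list
    if tag is not None:
        argv += ["-t", tag]          # memorable name for the results
    if results_dir is not None:
        argv += ["-r", results_dir]  # alternative to /results
    return argv + inputs
```

For example, run_trace_argv(["path/to/foo.pdf", "path/to/bar.pdf"], family="mupdf", version="1.18.0", binary="mutops") reproduces the mupdf invocation shown above.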

Tracing output

./run_trace.py will create a directory containing the run's results under /results (or the directory specified by the -r option). The subdirectory will be given a randomly generated name that starts with res_. You may use -t <name> to tag the generated results with a more memorable name (this merely creates a symbolic link). Most memtrace postprocessing tools require the path (or symbolic link) to the result directory to be processed.

The generated results directory contains information including:

  • process's address space layout (address map, in mmap.*.log)
  • Binary event log generated by instrumentation (in memcalltrace.*.log, one per thread)
  • command's standard output/error content (in subprocess.out)
  • command invoked, exit value, runtime, etc (in info.txt)
  • a copy of the input file

All binaries/libraries loaded by the parser will be cached in a /results/bins_* directory (by default) -- this is done once per instrumented parser binary. Each /results/bins_* directory contains all results directories generated by its corresponding parser binary (cached in the /results/bins_*/data directory). The /results/res_* directories are merely symbolic links.
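To see what a given run produced, the file names listed above can be collected with a few globs. This is a minimal sketch under the assumption that the files sit directly inside the result directory; the function name and dictionary keys are hypothetical.

```python
from pathlib import Path


def inventory(result_dir):
    """List the documented artifacts found in one result directory."""
    # /results/res_* is a symbolic link, so resolve it to the real location.
    root = Path(result_dir).resolve()

    def existing(name):
        candidate = root / name
        return candidate if candidate.exists() else None

    return {
        "resolved_path": root,
        "address_maps": sorted(root.glob("mmap.*.log")),
        "event_logs": sorted(root.glob("memcalltrace.*.log")),  # one per thread
        "subprocess_out": existing("subprocess.out"),
        "info": existing("info.txt"),
    }
```

An empty event_logs list is a quick hint that the instrumentation never started logging for that run.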

Postprocessing instrumentation results

Tools for postprocessing instrumentation results live in the memtrace subrepo/directory.

A successful run of run_trace.py in the Docker container creates a directory named res_* in /results containing the memory tracing log and other artifacts. The path to this directory is passed via the --parse_result/-R option to the postprocessing tools (which live in tracetools/tools).

The memtrace source tree contains a run-analysis.sh script (in the root directory) that is a handy wrapper for running the postprocessing tools within the mr_memtrace-analysis-dev Docker container. (It invokes the postprocessing tools within a PyPy environment, which is significantly faster than using the default Python interpreter.)

For example, suppose you ran an instance of run_trace.py in the docker container that saved its results to /results/res_40a7286ba6614a33ba658115ec8c719c and you wanted to print out its memtrace log in a human readable format. You can use the tracetools/tools/print_log.py tool to do this. Within the docker container, invoke the tool this way:

./run-analysis.sh print_log.py -R /results/res_40a7286ba6614a33ba658115ec8c719c

Here ./run-analysis.sh is merely a wrapper around tracetools/tools/print_log.py, so you can also use it to view the tool's documentation:

./run-analysis.sh print_log.py --help

See tracetools/README.md for more information on analyzing the output of YARN's instrumentation with YARN's postprocessing tools.
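Because result directories get random names, a small helper for picking the most recent one and building the run-analysis.sh command line can save some copying and pasting. This is a hypothetical sketch: it only looks at entries named res_* (results tagged with -t are ignored) and uses each entry's own modification time as a proxy for "most recent".

```python
from pathlib import Path


def latest_result(results_root="/results"):
    """Return the most recently modified res_* entry, or None if there is none."""
    candidates = list(Path(results_root).glob("res_*"))
    if not candidates:
        return None
    # lstat() looks at the symbolic link itself rather than its target.
    return max(candidates, key=lambda p: p.lstat().st_mtime)


def analysis_argv(tool, result_dir, *extra):
    """Build the argv for ./run-analysis.sh <tool> -R <result_dir> [extra...]."""
    return ["./run-analysis.sh", tool, "-R", str(result_dir), *extra]
```

For instance, analysis_argv("print_log.py", latest_result()) yields the print_log.py invocation shown above for whichever run finished last.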

Parse tree analysis and viewing postprocessing tools

Although parse tree analysis and viewing normally requires a Binary Ninja license to calculate the addresses of important parsing code, we've included files containing pre-generated addresses in this repository so that one can bypass the Binary Ninja requirement. Installing these pre-generated files in the proper location isn't a straightforward task.

First spin up a docker container with a directory of sample PDFs mounted at /pdfs, e.g.,

docker run -it --rm -v"$HOME/pdfs:/pdfs" -v"$(pwd)/results:/results" mr_memtrace-analysis-dev:latest

Then, inside the container, run the tracer on one of the PDFs, e.g.,

./run_trace.py /pdfs/sample.pdf

(This will trace /pdfs/sample.pdf as it's parsed by poppler's pdftops.)

If this runs successfully, then at the end of stdout you should see something like the following:

Results saved to /results/bins_9ad307edb9ca430e814bef40d09fd232/res_40a7286ba6614a33ba658115ec8c719c

This is where the tracing results from the tracer were saved. Note that ./run_trace.py also creates a symbolic link to this directory at /results/res_40a7286ba6614a33ba658115ec8c719c.

Next, copy the pre-generated address metadata in parser-metadata to the /results/bins_*/data directory, e.g.,

cp parser-metadata/* /results/bins_9ad307edb9ca430e814bef40d09fd232/data/
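The bins_* directory name differs per parser binary, so it can be derived from the res_* link instead of being typed by hand. The following is a hedged sketch, not a repository tool: it assumes the res_* link resolves to a directory directly inside its bins_* directory, as the "Results saved to" line above suggests, and the function names are hypothetical.

```python
import shutil
from pathlib import Path


def bins_data_dir(result_dir):
    """Return the bins_*/data directory that belongs to a res_* result."""
    resolved = Path(result_dir).resolve()  # follow the /results/res_* link
    bins_dir = resolved.parent
    if not bins_dir.name.startswith("bins_"):
        raise ValueError(f"{resolved} is not inside a bins_* directory")
    data_dir = bins_dir / "data"
    if not data_dir.is_dir():
        raise FileNotFoundError(data_dir)
    return data_dir


def install_metadata(metadata_dir, result_dir):
    """Copy every regular file in metadata_dir into the matching data directory."""
    dest = bins_data_dir(result_dir)
    copied = []
    for src in sorted(Path(metadata_dir).iterdir()):
        if src.is_file():
            shutil.copy2(src, dest / src.name)
            copied.append(src.name)
    return copied
```

Calling install_metadata("parser-metadata", "/results/res_40a7286ba6614a33ba658115ec8c719c") would then have the same effect as the cp command above.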

Finally you should now be able to use the parse tree postprocessing tool to generate and save a copy of the parse tree:

./run-analysis.sh  pt_tracker.py -R /results/res_40a7286ba6614a33ba658115ec8c719c -s

If you see output similar to the following when you run this tool, then you haven't copied all the parser metadata to the proper bins_*/data directory.

running pt_tracker.py
ERROR:root:Cannot import bin_info/binary ninja
ERROR:root:No module named 'binaryninja'
ERROR:root:Note: binja is not supported by pypy3
WARNING:root:Address version cache for these results have not been createed yet for libraries: pdftops, libpoppler.so.94, libc-2.31.so. Try rerunning pt tracker with '-g' option to generate cache
INFO:root:adding tracing for library /results/bins_2d4c6dc5341c4defadd0bdea43b85f60/data/libpoppler.so.94 at /results/bins_2d4c6dc5341c4defadd0bdea43b85f60/data/libpoppler.so.94.otherdb
INFO:root:adding tracing for library /results/bins_2d4c6dc5341c4defadd0bdea43b85f60/data/libjpeg.so.8.2.2 at /results/bins_2d4c6dc5341c4defadd0bdea43b85f60/data/libjpeg.so.8.2.2.otherdb
INFO:root:importing cache of symbol info /results/bins_2d4c6dc5341c4defadd0bdea43b85f60/data/pdftops.otherdb
INFO:root:... done
INFO:root:importing cache of symbol info /results/bins_2d4c6dc5341c4defadd0bdea43b85f60/data/libpoppler.so.94.otherdb
INFO:root:... done
INFO:root:importing cache of symbol info /results/bins_2d4c6dc5341c4defadd0bdea43b85f60/data/libjpeg.so.8.2.2.otherdb
INFO:root:... done
Traceback (most recent call last):
  File "/processor/tracetools/tools/pt_tracker.py", line 134, in <module>
    run(args)
  File "/processor/tracetools/tools/pt_tracker.py", line 111, in run
    p = PTTracker(a)
  File "/processor/tracetools/tools/pt_tracker.py", line 69, in __init__
    print_offset=a.print_offset)
  File "/processor/tracetools/tracetools/signatures/versions.py", line 550, in create_tracker
    return tracker_cls(ml, unique_only, **kwargs)
  File "/processor/tracetools/tracetools/signatures/xpdf_poppler.py", line 842, in __init__
    **kwargs)
  File "/processor/tracetools/tracetools/signatures/evaluator.py", line 268, in __init__
    super(SigPTEval, self).__init__(parse_log, **kwargs)
  File "/processor/tracetools/tracetools/signatures/evaluator.py", line 34, in __init__
    self.signatures.setup_sig_classes(self, self.ml)
  File "/processor/tracetools/tracetools/signatures/signatures.py", line 253, in setup_sig_classes
    self.setup_sig_classes(manager, parselog, subcls)
  File "/processor/tracetools/tracetools/signatures/signatures.py", line 253, in setup_sig_classes
    self.setup_sig_classes(manager, parselog, subcls)
  File "/processor/tracetools/tracetools/signatures/signatures.py", line 253, in setup_sig_classes
    self.setup_sig_classes(manager, parselog, subcls)
  File "/processor/tracetools/tracetools/signatures/signatures.py", line 259, in setup_sig_classes
    self.setup_sig_classes(manager, parselog, subcls)
  File "/processor/tracetools/tracetools/signatures/signatures.py", line 246, in setup_sig_classes
    cls.setup_sig_class(manager, parselog, callback)
  File "/processor/tracetools/tracetools/signatures/signatures.py", line 89, in setup_sig_class
    cls._setup()
  File "/processor/tracetools/tracetools/signatures/signatures.py", line 688, in _setup
    super(NewFrameMoment, cls)._setup()
  File "/processor/tracetools/tracetools/signatures/signatures.py", line 473, in _setup
    cls.setup()
  File "/processor/tracetools/tracetools/signatures/xpdf_poppler.py", line 902, in setup
    cls.xref_fetch_objs = cls.addrs_of("xref_fetch_obj")
  File "/processor/tracetools/tracetools/signatures/signatures.py", line 479, in addrs_of
    absolute)
  File "/processor/tracetools/tracetools/signatures/versions.py", line 353, in addrs_of
    + str(addrs))
tracetools.signatures.utils.BinaryInfoException: Issue looking up addrs for libpoppler.so.94:xref_fetch_obj, found []
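When pt_tracker.py is run from a script, the two telltale lines of this failure can be picked out of the captured output. The sketch below is hypothetical and matches only the message formats shown in the log above (including the "createed" spelling); other versions of the tools may word these messages differently.

```python
import re

# Final exception line: "... Issue looking up addrs for <lib>:<symbol>, found []"
MISSING_ADDR_RE = re.compile(
    r"BinaryInfoException: Issue looking up addrs for "
    r"(?P<lib>[^:\s]+):(?P<symbol>\w+), found \[\]"
)
# Warning line that lists libraries without an address cache.
MISSING_CACHE_RE = re.compile(
    r"not been createe?d yet for libraries: (?P<libs>.+?)\. Try rerunning"
)


def missing_address_lookups(output):
    """Return (library, symbol) pairs whose address lookup came back empty."""
    return [(m["lib"], m["symbol"]) for m in MISSING_ADDR_RE.finditer(output)]


def libraries_without_cache(output):
    """Return the library names listed in the missing-cache warning."""
    libs = []
    for m in MISSING_CACHE_RE.finditer(output):
        libs.extend(name.strip() for name in m["libs"].split(","))
    return libs
```

A non-empty result from either function is a cue to recheck that the parser metadata was copied into the correct bins_*/data directory.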

A successful run will save the derived parse tree in res_*/derived-pt.json. You can then view a textual representation of the parse tree via:

./run-analysis.sh  pt_tracker.py -R /results/res_40a7286ba6614a33ba658115ec8c719c -j
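The layout of derived-pt.json is not described in this README, so the following sketch deliberately assumes nothing about it. It loads the file as ordinary JSON and counts how many objects, arrays, and scalar values it contains, which gives a rough first impression of the tree's size. The function names are hypothetical.

```python
import json
from collections import Counter


def shape_summary(value):
    """Count JSON values by Python type, walking nested dicts and lists."""
    counts = Counter()
    stack = [value]  # explicit stack, so deep trees do not hit the recursion limit
    while stack:
        node = stack.pop()
        counts[type(node).__name__] += 1
        if isinstance(node, dict):
            stack.extend(node.values())
        elif isinstance(node, list):
            stack.extend(node)
    return counts


def summarize_json_file(path):
    """Load a JSON file and return its shape summary."""
    with open(path, "r", encoding="utf-8") as fh:
        return shape_summary(json.load(fh))
```

For example, summarize_json_file("/results/res_40a7286ba6614a33ba658115ec8c719c/derived-pt.json") returns a Counter keyed by type name (dict, list, str, int, and so on).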

You can also browse the parse tree via a GUI if you execute docker from the host so that it has access to the display:

xhost +local:
docker run -it --rm  -e DISPLAY=$DISPLAY -v /tmp/.X11-unix:/tmp/.X11-unix:ro  -v"$(pwd)/results:/results"  mr_memtrace-analysis-dev:latest

Inside the docker container:

./run-analysis.sh view_parsetree.py  -R /results/res_40a7286ba6614a33ba658115ec8c719c

The left panel of the GUI displays each parse tree. You can browse the parse trees using the up/down arrow keys and expand or hide the tree by a generation using the right and left arrow keys. Rows are sorted and assigned an ID in parsing order. For each node, you can see the object type and value (if it is a leaf node). The right panel shows details about the object highlighted in the left panel, including a unique ID number, object type, object value (if it isn't a parent object), and File Taint (the byte offset in the input file from which the node was built).

Running instrumentation on arbitrary executables

Use ./test_trace.py directly to apply memtrace instrumentation to arbitrary executables. E.g., to instrument a binary located at ./ls,

./test_trace.py -R -b --parser ./ls --parser-args '' .

This will perform an instrumented run of ./ls (because of --parser ./ls) called with no arguments (--parser-args ''). Tracing will include basic block information (because -b is specified). Tracing will begin when main is called and end when it returns (use the -e [fn] argument to override), and then the path to the result directory will be printed (because -R is specified). The trailing dot (.) is treated as the binary's input file by the instrumentation. If the binary doesn't process any input files, this final positional argument can be any arbitrary file. If the binary does process an input file, this argument should be the path to the input file. If the binary needs to take the path as a command-line argument, update the value of --parser-args to reflect this. E.g., if you want ./ls -l /root to be called, then specify the argument using the {in_file} placeholder in --parser-args, i.e.,

./test_trace.py -b -R --parser ./ls --parser-args '-l {in_file}' /root

If you get the following error: "tracetools.results_data.ResultsException: Something went wrong and no mmap log exists. Did memory tracker log ever get enabled/populated?"

This means that nothing ever got logged. This is likely due to the entrypoint (by default "main", otherwise specified using the -e parameter) never being invoked. Check the spelling of the symbol name and try running the application within gdb to determine which functions do get invoked.
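Before reaching for gdb, one cheap check is whether the binary defines the entrypoint symbol at all. The sketch below is an assumption-laden aid, not part of the tooling: it expects the binutils nm utility to be installed, only sees symbols that have not been stripped, and compares against raw (for C++, mangled) names. Whether the instrumentation resolves the entrypoint the same way is not documented here.

```python
import subprocess


def defined_functions(nm_output):
    """Parse `nm` output and return names of symbols in the text (code) section."""
    names = set()
    for line in nm_output.splitlines():
        parts = line.split()
        # Defined symbols have three fields: address, type letter, name.
        # T/t mark global/local symbols in the text section.
        if len(parts) == 3 and parts[1] in ("T", "t"):
            names.add(parts[2])
    return names


def has_entrypoint(binary, symbol="main"):
    """Run nm on the binary and report whether `symbol` is a defined function."""
    result = subprocess.run(
        ["nm", "--defined-only", binary],
        capture_output=True, text=True, check=True,
    )
    return symbol in defined_functions(result.stdout)
```

If has_entrypoint("./ls") returns False, the binary is probably stripped or the name passed via -e is misspelled, and running the application within gdb remains the fallback.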

Running tools outside a container

This is left as an exercise for the reader.

The Makefile builds the DynamoRIO-based memory and callstack tracing tools. It also has three "test" targets (test1, test2, test3) that run pdfto{text,html} against PDFs in ../tests. Output is saved in ./build/memcalltrace.pdfto*.log. Be aware that the output generated by these tests can range from several hundred megabytes up to several hundred gigabytes (and possibly larger).

Printing/parsing trace output

tracetools/tools/print_log.py is a standalone python3 tool that simply parses the output generated by the memcalltrace tool and prints out the contents in a human-readable format.

Please see tracetools/README.md for more information.

License

This code is released under the MIT License

The MIT License (MIT)

Copyright (c) 2022 Narf Industries LLC

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.