Fuzzing xlnt project with sydr fuzz for fun and profit
"Syrio says that every hurt is a lesson, and every lesson makes you better"
― George R. R. Martin, A Song of Ice and Fire
This article is dedicated to fuzzing open source software. There are many proven fuzzers (libFuzzer, AFLplusplus, Honggfuzz, etc.) and many well-written tutorials about them. However, I'd like to show how to apply a hybrid fuzzing approach (a combination of fuzzing and symbolic execution) to test open source software. For this purpose I will use our hybrid fuzzing tool sydr-fuzz, which combines the power of Sydr, a dynamic symbolic execution tool, and libFuzzer, an in-process, coverage-guided, evolutionary fuzzing engine. We will learn not only how to prepare fuzz targets and run hybrid fuzzing, but also how to use our crash triage tool casr, how to collect code coverage reports, and how to apply Sydr to check security predicates, finding interesting bugs with symbolic execution techniques.
It is a hard task to find a single open source project that shows off all the features declared above at once. The xlnt project suits this purpose very well. So, we are ready to start our journey for the glory of bug finding!
"It is no easy thing to slay a dragon, but it can be done."
― George R.R. Martin, A Song of Ice and Fire
The first thing we might want to do is find a function or a code snippet for fuzzing. For complex projects (suricata, postgresql, nginx, etc.) this is a hard task: you need a good comprehension of the code internals and the build system. Fortunately, our project is a library. It has a nice API and is easy to build.
So, what is xlnt? Xlnt is a modern C++ library for manipulating spreadsheets in memory and reading/writing them from/to XLSX files. Let's look at the API. Maybe we can find some function that loads an .xlsx file?
The xlnt.hpp header includes a very interesting header:
#include <xlnt/workbook/workbook.hpp>
This header declares a bunch of interesting functions that work with XLSX files. The following function caught my eye:
/// <summary>
/// Interprets byte vector data as an XLSX file and sets the content of this
/// workbook to match that file.
/// </summary>
void load(const std::vector<std::uint8_t> &data);
Yes, that's exactly what we need! The function interprets a byte vector as an XLSX file. In other words, it parses XLSX input, and its parameters suit libFuzzer well. Ok, we have a target function; now we have a plan:
- Create a fuzz target for libFuzzer. If you aren't familiar with libFuzzer, you may look at this tutorial.
- Create a fuzz target for Sydr and code coverage. For this purpose, you just need to add a main function that reads a file and passes its contents to LLVMFuzzerTestOneInput.
- Build three executable binaries for libFuzzer, Sydr, and coverage.
- Prepare corpus.
Xlnt is already added to oss-sydr-fuzz. So, you can just clone this repository and build the docker container following the instructions.
Next I will describe the basic concepts of preparing a project for fuzzing in oss-sydr-fuzz. The fuzzing process runs in a docker container for convenience and reproducibility. Let's take a look at the Dockerfile for xlnt:
FROM sweetvishnya/ubuntu20.04-sydr-fuzz
MAINTAINER Alexey Vishnyakov
# Clone target from GitHub.
RUN git clone https://github.com/tfussell/xlnt
WORKDIR xlnt
# Checkout specified commit. It could be updated later.
RUN git checkout d88c901faa539f9272a81ba0bab72def70ca18d7 && git submodule update --init --recursive
# Copy build script and targets.
COPY load_fuzzer.cc load_sydr.cc build.sh ./
# Build fuzz targets.
RUN ./build.sh
WORKDIR ..
# Prepare seed corpus.
RUN mkdir /corpus && find /xlnt -name "*.xlsx" | xargs -I {} cp {} /corpus
We use our base image with clang-14 installed. The Dockerfile contains commands that clone the xlnt repository, check out a fixed commit for reproducibility, run the build script, and prepare the seed corpus. Let's take a look at the build.sh script. This script builds three executables: a libFuzzer binary, a Sydr binary, and a coverage binary. I will not show the whole script here due to its size (66 lines). The main idea is to compile the xlnt library three times with different flags (with sanitizers+libFuzzer, without sanitizers, and with llvm-cov instrumentation):
# libFuzzer
cmake -D STATIC=ON -D TESTS=OFF \
-DCMAKE_CXX_COMPILER=clang++ \
-DCMAKE_CXX_FLAGS="-g -fsanitize=fuzzer-no-link,address,bounds,integer,undefined,null,float-divide-by-zero" \
..
# Sydr
cmake -DSTATIC=ON -D TESTS=OFF \
-DCMAKE_CXX_COMPILER=clang++ \
-DCMAKE_CXX_FLAGS=-g \
..
# Coverage
cmake -DSTATIC=ON -D TESTS=OFF \
-DCMAKE_CXX_COMPILER=clang++ \
-DCMAKE_CXX_FLAGS="-fprofile-instr-generate -fcoverage-mapping" \
..
Now we are ready to look at the fuzz target:
#include <xlnt/xlnt.hpp>
#include <libstudxml/parser.hxx>
extern "C" int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
  std::vector<uint8_t> v_data(data, data + size);
  xlnt::workbook excelWorkbook;
  try {
    excelWorkbook.load(v_data);
  } catch (const xlnt::exception &e) {
    return 0;
  } catch (const xml::parsing &e) {
    return 0;
  }
  return 0;
}
Here we perform two obvious operations: create a vector and load the workbook. A very important thing when you fuzz C++ code: it has exceptions. Some exceptions should be treated as bugs, but some should not. Unhandled exceptions from the standard library can definitely be considered bugs. Exceptions escaping from libraries used internally by your fuzz target can be treated as bugs, too: the target library itself should handle them. But external (documented) exceptions should be handled in LLVMFuzzerTestOneInput, as you can see in the xlnt example. You may also look at the target for Sydr and coverage.
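This rule of thumb can be illustrated with a toy harness (purely illustrative, no xlnt involved): the library's documented exception is swallowed, while anything else, for example an exception from the standard library, escapes so that libFuzzer reports it as a crash:

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Toy stand-in for a library's documented exception type.
struct parse_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Toy parser: parse_error models a documented failure on bad input;
// std::out_of_range models an internal bug the library failed to handle.
void toy_load(const std::vector<uint8_t> &data) {
    if (data.empty())
        throw parse_error("empty input");   // documented, not a bug
    if (data[0] == 'B')
        throw std::out_of_range("oops");    // internal bug, should escape
}

extern "C" int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    std::vector<uint8_t> v(data, data + size);
    try {
        toy_load(v);
    } catch (const parse_error &) {
        return 0;  // documented exception: handled, not a crash
    }
    // std::out_of_range is deliberately NOT caught: libFuzzer treats the
    // unhandled exception as a crash, which is exactly what we want.
    return 0;
}
```

Catching `...` instead would hide real bugs behind the documented error path, so the catch list should stay as narrow as the library's documentation allows.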
Before starting fuzzing we should prepare an initial corpus. There are often files of the needed format inside the target repository. We already saw the command in the Dockerfile that creates the initial corpus:
# Prepare seed corpus.
RUN mkdir /corpus && find /xlnt -name "*.xlsx" | xargs -I {} cp {} /corpus
Alright, we have prepared docker container for fuzzing. Now let the fuzzing begin!
"The night is dark and full of terrors."
― George R.R. Martin, A Song of Ice and Fire
Before starting sydr-fuzz we have to write a simple config in TOML format. Below is the configuration file sydr-fuzz.toml for xlnt:
exit-on-time = 7200
[sydr]
args = "-s 90"
target = "/load_sydr @@"
jobs = 2
[libfuzzer]
path = "/load_fuzzer"
args = "-rss_limit_mb=8192 -timeout=10 -jobs=1000 -workers=8 /corpus"
[cov]
target = "/load_cov @@"
Let's have a brief look at this config file:
- exit-on-time is an optional parameter that takes a time in seconds. If the coverage does not increase during this time (2 hours in our case), fuzzing is automatically terminated.
- The [sydr] table may contain the following parameters:
  - args is an arguments string for Sydr. Options for log files and input files are set automatically. It is recommended to use the -s option for uniform input processing by Sydr.
  - target is the command line to run the target program. Use @@ instead of the input file name.
  - jobs is the number of Sydr instances to run. Default is 1.
- The [libfuzzer] table contains arguments for libFuzzer.
- The [cov] table contains the run string for the code coverage binary.
To sum up, we will start fuzzing with 8 libFuzzer workers and 2 Sydr jobs. The fuzzing process will stop if coverage (cov:) does not increase for 2 hours or libFuzzer finds 1000 crashes/OOMs/timeouts. Okay, let's run docker:
$ sudo docker run --privileged --network host -v /etc/localtime:/etc/localtime:ro --rm -it -v $PWD:/fuzz oss-sydr-fuzz-xlnt /bin/bash
Change directory to /fuzz:
# cd /fuzz
Run hybrid fuzzing:
# sydr-fuzz run
At last we started sydr-fuzz. First, sydr-fuzz merges the initial corpus directories into its project corpus directory. Let's look at the logs. After the merge step fuzzing starts. We see pretty-colored libFuzzer logs, and it has already found a crash! A true segmentation fault, cool! Also, we can see information about Sydr. reloads{unique} shows how many inputs from Sydr are useful for libFuzzer (libFuzzer reloaded these inputs). Files from all Sydr instances are copied to the project corpus directory. One file from Sydr can be reloaded by many libFuzzer workers, so we also count unique files among all reloads. We update the reloads statistics on a timer to see the profit from Sydr in real time, while information about generated inputs is updated per Sydr execution. Let's wait till fuzzing ends and look at the logs again.
According to the logs, 1000 libFuzzer jobs exited after 25 minutes and found 5 crashes (with different input data). Sydr was executed 17 times and generated 2579 new inputs (26 of them were useful). Before we start collecting coverage and checking security predicates, let's minimize the resulting corpus:
$ sydr-fuzz cmin
Good, we have a minimized corpus (185 files). Now we can collect code coverage and check security predicates.
"Never do what they expect"
― George R.R. Martin, A Song of Ice and Fire
To collect coverage in lcov format we run this command:
# sydr-fuzz cov-export -- -format=lcov > load.lcov
After getting the .lcov file, let's generate an HTML report:
# genhtml -o load-html load.lcov
Great, now we can see which parts of code sydr-fuzz has reached. It's time to check security predicates!
"Never forget what you are, the rest of the world will not. Wear it like armor and it can never be used to hurt you."
― George R.R. Martin, A Song of Ice and Fire
The idea of using symbolic execution for a purposeful search for errors is not new. We use Sydr and sydr-fuzz to search directly for errors after the fuzzing process has ended. First, I have to tell you about security predicates in Sydr. Sydr uses security predicates to find these types of errors:
- integer overflows
- out of bounds accesses
- and more
Integer overflow is a very interesting situation. Sometimes it is normal, e.g. when you do hashing computations, but sometimes it may lead to very critical bugs like buffer overflows. Also, it is hard to recognize really buggy integer overflows in binary code: integer overflow is a perfectly valid situation in large-number arithmetic. To deal with this, we check a subset of arithmetic instructions for integer overflow (we call them sources). When an overflow occurs, we check whether the overflowed value is later used in:
- function argument
- memory address access
- branch condition
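To make the source/sink idea concrete, here is a toy example (illustrative, not from xlnt): a 32-bit multiplication silently wraps around (the source), and the wrapped value then flows into an allocation size, i.e. a function argument followed by memory accesses (the sinks):

```cpp
#include <cstdint>
#include <vector>

// Source: unsigned 32-bit multiplication that can silently wrap around.
// For example, payload_bytes(0x20000000u, 8u) wraps to 0.
uint32_t payload_bytes(uint32_t count, uint32_t elem_size) {
    return count * elem_size;  // overflow happens here (mod 2^32)
}

// Sink: the possibly-wrapped value is used as an allocation size.
// A caller believing the buffer holds `count` elements would then
// read or write far out of bounds.
std::vector<uint8_t> make_buffer(uint32_t count, uint32_t elem_size) {
    return std::vector<uint8_t>(payload_bytes(count, elem_size));
}
```

An overflow like this one is worth reporting precisely because the wrapped value reaches a sink; an overflow whose result is never used that way is far more likely to be benign arithmetic.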
This technique helps us reduce false positives on the Sydr side, but they can still occur. Do you remember that we built the Sydr target with debug info? Sydr uses only binary code for symbolic execution, but we can map the found integer overflows to source code. Afterwards we can check whether an integer overflow is real by running the libFuzzer target with sanitizers on the input from Sydr and matching the lines of the sanitizer's warnings against the security predicates. This is how we avoid false positives. Now we need to insert the security predicates check into the fuzzing pipeline, and here is where sydr-fuzz comes into play. Checking security predicates is a heavy task, heavier than code exploration, so we don't want to waste resources. The idea is to check security predicates on a well-fuzzed, minimized input corpus.
Ok, it seems I have already explained the idea behind security predicates. Let's run sydr-fuzz! We have already minimized the corpus, so just run the sydr-fuzz security subcommand with 8 workers:
$ sydr-fuzz security --jobs 8
We see that after several executions sydr-fuzz found some integer overflows that lead our fuzz target to crash. Awesome!
Let's wait a while till sydr-fuzz finishes its work. Wow, after sydr-fuzz security we have 126 crashes. Too many for me to look at each single crash manually. I want some automation, and the casr subcommand is exactly what I need!
"Fear cuts deeper than swords. The man who fears losing has already lost."
― George R.R. Martin, A Song of Ice and Fire
Let's start crash triage using this command:
$ sydr-fuzz casr
First, a crash report (.casrep) is created for each crash. We use the libFuzzer binary with sanitizers for that. The main component of a report is the stack trace, and our crash triaging is based on stack trace comparison. The next step is deduplication: if the stack traces are equal, we consider the crashes to be the same. After the deduplication phase, crash clustering begins. Then crash inputs are copied next to the reports and some information is updated. At the last step, for each clustered crash input our casr tool is used to get a crash report on the binary without sanitizers. When analyzing a crash, it might be useful to see how it behaves on an executable without sanitizers.
Good, in the casr project directory we have 7 different clusters with 12 crashes. Each cluster contains similar crashes. So 7 clusters sounds much better than 126 crashes for manual analysis. Let's look at some crash report. For that I use the casr-cli tool.
$ casr-cli ./sydr-fuzz-out/casr/cl5/crash-sydr_2cb2768bb7bd020f982362a74344d8e133d4089e_int_overflow_14_signed.casrep
So here we see the ASAN report and the part of the source code where the violation occurred. It seems the security predicates check helped us find an interesting bug where an std::vector with a very large size is constructed. From my point of view, using such crash triage tools can definitely save your time while debugging found crashes.
In this article I tried to shed some light on how we use fuzzing and symbolic execution techniques for analyzing open source software. I also want to note that modern symbolic execution tools (Sydr, Fuzzolic, etc.) can stand side by side with fuzzers, making them a little bit better :). Interesting symbolic execution techniques like security predicates can find novel crashes that might be missed during fuzzing. Crash triage is a necessary part of code analysis; without it, there is so much pain in looking at each crash under gdb. I hope it was interesting to read and you learned something new :).
Andrey Fedotov