Decompiler outputs
This page describes various RetDec decompilation outputs.
A default decompilation (without any special options listed below) of an input file input.exe
produces the following output files:
- input.exe.dsm: Disassembly output in our custom format. Instruction mnemonics are in the default Capstone format.
- input.exe.bc: The final product of the Core decompilation part in the LLVM bitcode format.
- input.exe.ll: Human-readable disassembly of LLVM bitcode in the LLVM IR format.
- input.exe.config.json: Metadata produced by the decompilation process.
- input.exe.c: The decompiled C code. This is the main output.
As you can see, the output file names are generated simply by adding proper suffixes to the input file name: <input_file>.{dsm, bc, ll, config.json, c}.
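The naming rule above can be sketched in a few lines of Python. This is only an illustration of the suffix scheme described on this page; the helper name default_output_files is hypothetical and not part of RetDec.

```python
# Suffixes listed above for a default decompilation.
DEFAULT_SUFFIXES = ("dsm", "bc", "ll", "config.json", "c")


def default_output_files(input_file):
    """Return the output file names expected for a default run.

    Each name is the complete input file name with one suffix appended,
    so the original extension (e.g. ".exe") is kept.
    """
    return [f"{input_file}.{suffix}" for suffix in DEFAULT_SUFFIXES]


if __name__ == "__main__":
    for name in default_output_files("input.exe"):
        print(name)
```

Such a helper can be handy in scripts that need to check which files a decompilation run left behind.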
The following options of the retdec-decompiler application control the output generation process:
- -o FILE, --output FILE
  If specified, the main decompilation output is stored to FILE instead of <input_file>.c. Furthermore, FILE (without a potential suffix) is used as a base name to generate other output file names: <FILE_w/o_suffix>.{dsm, bc, ll, config.json}.
- -f OUTPUT_FORMAT, --output-format OUTPUT_FORMAT
  The default plain option generates the main decompilation output directly as a high-level-language source code into an associated text file (e.g. C source code into a *.c file). The json and json-human options generate the output source code as a stream of lexer tokens, plus additional information. See the section below for a detailed format description. The suffix of the main decompilation output file is changed to .json.
- --cleanup
  Removes temporary files created during decompilation. Only the main decompilation output file and the disassembly file are preserved.
Run retdec-decompiler --help for more info on all the available options.
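As a rough illustration of how these options combine, the following Python sketch assembles a command line and derives the secondary output names for a given -o FILE. Two points are assumptions rather than documented behavior: that the input file is passed as a trailing positional argument, and that "without a potential suffix" means stripping the last extension the way os.path.splitext does. The helper names build_command and derived_outputs are hypothetical.

```python
import os

# Output formats named in the option description above.
OUTPUT_FORMATS = ("plain", "json", "json-human")


def build_command(input_file, output_file=None, output_format="plain", cleanup=False):
    """Build an argument list for retdec-decompiler.

    Assumption: the input file is given as the last positional argument.
    """
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unknown output format: {output_format}")
    cmd = ["retdec-decompiler"]
    if output_file is not None:
        cmd += ["--output", output_file]
    if output_format != "plain":
        cmd += ["--output-format", output_format]
    if cleanup:
        cmd.append("--cleanup")
    cmd.append(input_file)
    return cmd


def derived_outputs(output_file):
    """Map each secondary suffix to the file name derived from -o FILE.

    Assumption: only the last extension of FILE is stripped.
    """
    base, _ext = os.path.splitext(output_file)
    return {s: f"{base}.{s}" for s in ("dsm", "bc", "ll", "config.json")}


if __name__ == "__main__":
    print(" ".join(build_command("input.exe", "result.json", "json-human")))
    print(derived_outputs("result.json"))
```

The resulting list can be passed to subprocess.run once the assumptions have been checked against retdec-decompiler --help.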
Parsing high-level-language source code is not trivial. However, 3rd-party reversing tools might need to do just that in order to make use of output from RetDec. Furthermore, additional meta-information may be required to enhance user experience or automated analysis - information that is hard to convey in a traditional high-level-language source code. Usage examples:
- Syntax highlighting in RetDec IDA plugin.
- Relations between decompiled output lines/elements and the original disassembly instructions in RetDec IDA plugin and RetDec Radare2 plugin.
In order to make these applications possible, RetDec offers an option (see the previous section) to generate its output as a sequence of annotated lexer tokens into a JSON format. Two JSON flavors can be generated:
- Human-readable JSON containing proper indentation (option json-human).
- Machine-readable JSON without any indentation (option json).
In order to parse both flavors with a single parser implementation, they both use the same keys and values.
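Because the two flavors differ only in whitespace, any standard JSON parser should treat them identically. The minimal Python sketch below illustrates this with a small hand-written sample document (not real RetDec output); the helper name load_tokens is hypothetical.

```python
import json

# Hand-written sample in the compact flavor (illustrative only).
COMPACT = '{"language":"C","tokens":[{"addr":"0x8048577"},{"kind":"keyw","val":"return"}]}'

# The same document re-serialized with indentation, standing in for json-human.
PRETTY = json.dumps(json.loads(COMPACT), indent=4)


def load_tokens(text):
    """Parse either flavor and return (language, list of token objects)."""
    doc = json.loads(text)
    return doc["language"], doc["tokens"]


if __name__ == "__main__":
    # Both flavors yield the same parsed structure.
    print(load_tokens(COMPACT) == load_tokens(PRETTY))
```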
The current JSON schema is the following:
{
    "language" : "<language_ID>",
    "tokens" :
    [
        {
            "addr" : "<address_format>",
            "kind" : "<kind_values>",
            "val" : "<value>"
        },
        // ...
    ]
}
- All values are of the string data type.
- language key identifies the high-level language being tokenized. Possible <language_ID> values:
  | Value | Description |
  |-------|-------------|
  | C | C language |
- Source code is serialized in an array of token objects in a tokens array.
- Token object contains the following entries:
  - Assembly address associated with the token with key addr and value in the prefixed hexadecimal format (e.g. 0x8048577).
  - Value val which holds the actual token string as would appear in the source code.
  - Token type with key kind and the following possible values:
  | Value | Description | Example(s) |
  |-------|-------------|------------|
  | nl | New line. | "\n" |
  | ws | Any consecutive sequence of white spaces. | " " |
  | punc | A single punctuation character. | "(" ")" "{" "}" "[" "]" ";" |
  | op | Operator. | "==" "-" "+" "*" "->" "." |
  | i_var | Global/Local variable identifier. | "global_var" |
  | i_mem | Structure/Class member identifier. | "entry_1" |
  | i_lab | Label identifier. | "label_0x1234" |
  | i_fnc | Function identifier. | "ackermann" |
  | i_arg | Function argument identifier. | "argv" |
  | keyw | High-level-language keyword. | "while" |
  | type | Data type. | "uint64_t" |
  | preproc | Preprocessor directive. | "#include" |
  | inc | String used in an #include preprocessor directive. Including <>. | "<stdlib.h>" |
  | l_bool | Boolean literal. | "true" "false" |
  | l_int | Integer literal. Including potential prefixes and suffixes. | "123" "0x213A" "1234567890123456789LL" |
  | l_fp | Floating point literal. Including potential prefixes and suffixes. | "3.14" "123.456e-67" |
  | l_str | String literal. Including properly escaped "". | "\"ackerman( %d , %d ) = %d\\n\"" |
  | l_sym | Symbolic literal. | "UNDEFINED" |
  | l_ptr | Pointer literal. | "NULL" |
  | cmnt | Comment. Including delimiter like // or /* */. | "// Detected compiler/packer: gcc (4.7.2)" |
- Token kind and val entries must always be used together, i.e. one is never used without the other.
- Concatenating all the val entries produces exactly the same string as would be generated using the plain format option.
- Address entry does not have to be present in every token object. If it is missing, the token is associated with the last address entry.
- Address entry can be used without kind and val entries. In such a case, it effectively sets associated address for the upcoming tokens.
      { "addr" : "0x80498e8" },
      // sequence of tokens associated with address 0x80498e8
      { "addr" : "0x80498f4" },
      // sequence of tokens associated with address 0x80498f4
- Not all tokens must (or can) be associated with an assembly address. Such tokens are associated with an empty address:
      { "addr" : "0x80498e8" },
      // sequence of tokens associated with address 0x80498e8
      { "addr" : "" },
      // sequence of tokens unassociated with any address
- Token-to-address association is not intrinsically line-based. For example, the following line:
      printf("ackerman( %d , %d ) = %d\n", x, y, result);
  can be broken down to several pieces, each associated with a different assembly instruction:
  - printf( - associated with CALL instruction.
  - "ackerman( %d , %d ) = %d\n" - associated with LOAD instruction loading the call argument.
  - , x - associated with LOAD instruction loading the call argument.
  - , y - associated with LOAD instruction loading the call argument.
  - , result - associated with LOAD instruction loading the call argument.
  - ); - associated with CALL instruction.
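The rules above (address carry-over, address-only entries, the empty address, and val concatenation) can be combined into a small consumer. The following Python sketch is one possible reading of those rules, not RetDec's own implementation; the function names and the sample token list with its addresses are hypothetical.

```python
def resolve_tokens(tokens):
    """Yield (addr, kind, val) for every token object that carries text.

    addr is the most recently seen address string, or None when the token
    is unassociated (no address seen yet, or the last one was empty).
    """
    current = None
    for tok in tokens:
        if "addr" in tok:
            # An empty string marks tokens with no associated address.
            current = tok["addr"] or None
        if "kind" in tok:
            # kind and val always appear together, so val is safe to read.
            yield current, tok["kind"], tok["val"]


def reconstruct_source(tokens):
    """Concatenate all val entries to recover the plain output text."""
    return "".join(val for _addr, _kind, val in resolve_tokens(tokens))


def group_by_address(tokens):
    """Collect the text attributed to each address (None = unassociated)."""
    groups = {}
    for addr, _kind, val in resolve_tokens(tokens):
        groups.setdefault(addr, []).append(val)
    return {addr: "".join(parts) for addr, parts in groups.items()}


# Hypothetical token stream for the statement "return result;".
SAMPLE = [
    {"addr": "0x80498e8", "kind": "keyw", "val": "return"},
    {"kind": "ws", "val": " "},
    {"addr": "0x80498f4"},
    {"kind": "i_var", "val": "result"},
    {"addr": "0x80498e8", "kind": "punc", "val": ";"},
    {"addr": ""},
    {"kind": "nl", "val": "\n"},
]

if __name__ == "__main__":
    print(repr(reconstruct_source(SAMPLE)))
    for addr, text in group_by_address(SAMPLE).items():
        print(addr, repr(text))
```

Note that group_by_address merges non-adjacent pieces that share an address, which mirrors the observation that a single instruction can own several separate fragments of a line.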