Decompiler outputs

Peter Matula edited this page May 4, 2023 · 10 revisions

This page describes various RetDec decompilation outputs.

Generated files

A default decompilation (without any special options listed below) of an input file input.exe produces the following output files:

  • input.exe.dsm: Disassembly output in our custom format. Instruction mnemonics are in the default Capstone format.
  • input.exe.bc: The final product of the Core decompilation part in the LLVM bitcode format.
  • input.exe.ll: Human-readable disassembly of LLVM bitcode in the LLVM IR format.
  • input.exe.config.json: Metadata produced by the decompilation process.
  • input.exe.c: The decompiled C code. This is the main output.

As you can see, the output file names are generated simply by adding proper suffixes to the input file name: <input_file>.{dsm, bc, ll, config.json, c}.
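The naming rule above can be sketched in a few lines of Python. This is only an illustration of the suffix scheme described in this section; the helper name default_output_names is hypothetical and not part of RetDec.

```python
# Suffixes listed in the "Generated files" section above.
OUTPUT_SUFFIXES = ("dsm", "bc", "ll", "config.json", "c")


def default_output_names(input_file: str) -> list[str]:
    """Return the file names a default decompilation of input_file produces.

    Hypothetical helper: it only mirrors the documented naming rule
    <input_file>.{dsm, bc, ll, config.json, c}.
    """
    return [f"{input_file}.{suffix}" for suffix in OUTPUT_SUFFIXES]


if __name__ == "__main__":
    for name in default_output_names("input.exe"):
        print(name)
```

A script that post-processes RetDec results could use such a helper to locate the files it expects next to the input.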

Output generation options

The following options of the retdec-decompiler application control the output generation process:

  • -o FILE, --output FILE If specified, the main decompilation output is stored to FILE instead of <input_file>.c. Furthermore, FILE (without a potential suffix) is used as a base name to generate other output file names: <FILE_w/o_suffix>.{dsm, bc, ll, config.json}.
  • -f OUTPUT_FORMAT, --output-format OUTPUT_FORMAT Selects the format of the main decompilation output. The default value, plain, writes the output directly as high-level-language source code into an associated text file (e.g. C source code into a *.c file). The json and json-human values write the source code as a stream of lexer tokens plus additional information, and change the suffix of the main decompilation output file to .json. See the section below for a detailed format description.
  • --cleanup Removes temporary files created during decompilation. Only the main decompilation output file and the disassembly file are preserved.

Run retdec-decompiler --help for more information on all the available options.
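As a rough sketch, a script that drives the decompiler could assemble its command line from the options above as follows. The helper build_command is hypothetical, and two things are assumptions rather than facts stated on this page: that retdec-decompiler is available on PATH, and that the input file is passed as a positional argument.

```python
# Values of --output-format documented above.
OUTPUT_FORMATS = ("plain", "json", "json-human")


def build_command(input_file, output=None, output_format=None, cleanup=False):
    """Assemble a retdec-decompiler command line as a list of arguments.

    Hypothetical helper; it only uses the options documented above.
    """
    cmd = ["retdec-decompiler"]
    if output is not None:
        cmd += ["--output", output]
    if output_format is not None:
        if output_format not in OUTPUT_FORMATS:
            raise ValueError(f"unknown output format: {output_format}")
        cmd += ["--output-format", output_format]
    if cleanup:
        cmd.append("--cleanup")
    # Assumption: the input file is given as a positional argument.
    cmd.append(input_file)
    return cmd


if __name__ == "__main__":
    print(" ".join(build_command("input.exe", output_format="json-human")))
```

The returned list can be handed to subprocess.run(cmd, check=True) once the tool is installed; building the list separately keeps the option handling easy to test without running the decompiler.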

JSON output file format

Parsing high-level-language source code is not trivial. However, 3rd-party reversing tools might need to do just that in order to make use of output from RetDec. Furthermore, additional meta-information may be required to enhance user experience or automated analysis - information that is hard to convey in a traditional high-level-language source code. Usage examples:

In order to make these applications possible, RetDec offers an option (see the previous section) to generate its output as a sequence of annotated lexer tokens into a JSON format. Two JSON flavors can be generated:

  • Human-readable JSON containing proper indentation (option json-human).
  • Machine-readable JSON without any indentation (option json).

Both flavors use the same keys and values, so a single parser implementation can handle either of them.

The current JSON schema is the following:

{
    "language" : "<language_ID>",
    "tokens" :
    [
        {
            "addr" : "<address_format>",
            "kind" : "<kind_values>",
            "val" : "<value>"
        },
        // ...
    ]
}
  • All values are of the string data type.

  • language key identifies the high-level language being tokenized. Possible <language_ID> values:

    Value    Description
    C        C language
  • Source code is serialized as an array of token objects stored under the tokens key.

  • Each token object contains the following entries:

    • Assembly address associated with the token with key addr and value in the prefixed hexadecimal format (e.g. 0x8048577).
    • Value val which holds the actual token string as would appear in the source code.
    • Token type with key kind and the following possible values:
    Value    Description                                                         Example(s)
    nl       New line.                                                           "\n"
    ws       Any consecutive sequence of white spaces.                           " "
    punc     A single punctuation character.                                     "(" ")" "{" "}" "[" "]" ";"
    op       Operator.                                                           "==" "-" "+" "*" "->" "."
    i_var    Global/Local variable identifier.                                   "global_var"
    i_mem    Structure/Class member identifier.                                  "entry_1"
    i_lab    Label identifier.                                                   "label_0x1234"
    i_fnc    Function identifier.                                                "ackermann"
    i_arg    Function argument identifier.                                       "argv"
    keyw     High-level-language keyword.                                        "while"
    type     Data type.                                                          "uint64_t"
    preproc  Preprocessor directive.                                             "#include"
    inc      String used in an #include preprocessor directive. Including <>.    "<stdlib.h>"
    l_bool   Boolean literal.                                                    "true" "false"
    l_int    Integer literal. Including potential prefixes and suffixes.         "123" "0x213A" "1234567890123456789LL"
    l_fp     Floating point literal. Including potential prefixes and suffixes.  "3.14" "123.456e-67"
    l_str    String literal. Including properly escaped "".                      "\"ackerman( %d , %d ) = %d\\n\""
    l_sym    Symbolic literal.                                                   "UNDEFINED"
    l_ptr    Pointer literal.                                                    "NULL"
    cmnt     Comment. Including delimiter like // or /* */.                      "// Detected compiler/packer: gcc (4.7.2)"
  • Token kind and val entries must always be used together, i.e. one is never used without the other.

  • Concatenating all the val entries produces exactly the same string as would be generated using the plain format option.

  • Address entry does not have to be present in every token object. If it is missing, the token is associated with the last address entry.

  • Address entry can be used without kind and val entries. In such a case, it effectively sets associated address for the upcoming tokens.

    {
        "addr" : "0x80498e8"
    },
    // sequence of tokens associated with address 0x80498e8
    {
        "addr" : "0x80498f4"
    },
    // sequence of tokens associated with address 0x80498f4
  • Not all tokens must (or can) be associated with an assembly address. Such tokens are associated with an empty address:

    {
        "addr" : "0x80498e8"
    },
    // sequence of tokens associated with address 0x80498e8
    {
        "addr" : ""
    },
    // sequence of tokens unassociated with any address
  • Token-to-address association is not intrinsically line-based. For example, the following line:

    printf("ackerman( %d , %d ) = %d\n", x, y, result);

    can be broken down into several pieces, each associated with a different assembly instruction:

    • printf( - associated with the CALL instruction.
    • "ackerman( %d , %d ) = %d\n" - associated with a LOAD instruction loading the call argument.
    • , x - associated with a LOAD instruction loading the call argument.
    • , y - associated with a LOAD instruction loading the call argument.
    • , result - associated with a LOAD instruction loading the call argument.
    • ); - associated with the CALL instruction.
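The rules above (address carry-over, address-only objects, empty addresses, and concatenation of val entries) can be combined into a small consumer. The following Python code is a minimal sketch, not RetDec's own code: the function names are hypothetical, and treating tokens that appear before any addr entry as unassociated is an assumption.

```python
from itertools import groupby


def resolve_tokens(tokens):
    """Yield (addr, kind, val) for every token object carrying kind/val.

    addr is the most recent "addr" entry seen so far, or None when the token
    is not associated with any address (empty address, or, as an assumption,
    no addr entry seen yet).
    """
    current = None
    for token in tokens:
        if "addr" in token:
            current = token["addr"] or None  # "" means no address
        if "kind" in token:  # kind and val always appear together
            yield current, token["kind"], token["val"]


def reconstruct_source(tokens):
    """Concatenate all val entries, which should match the plain output."""
    return "".join(val for _, _, val in resolve_tokens(tokens))


def chunks_by_address(tokens):
    """Group consecutive tokens sharing an address into (addr, text) pairs."""
    return [
        (addr, "".join(val for _, _, val in group))
        for addr, group in groupby(resolve_tokens(tokens), key=lambda t: t[0])
    ]


if __name__ == "__main__":
    # Hand-written sample that follows the schema; not real RetDec output.
    sample = [
        {"addr": "0x80498e8"},
        {"kind": "i_fnc", "val": "printf"},
        {"kind": "punc", "val": "("},
        {"addr": "0x80498f4", "kind": "i_var", "val": "x"},
        {"addr": "0x80498e8", "kind": "punc", "val": ")"},
        {"kind": "punc", "val": ";"},
        {"addr": ""},
        {"kind": "nl", "val": "\n"},
    ]
    print(reconstruct_source(sample), end="")
    for addr, text in chunks_by_address(sample):
        print(addr, repr(text))
```

For a real file, the tokens list would come from json.load applied to the JSON output, using its "tokens" key. Grouping by consecutive address rather than by line reflects the point made above: one source line may map to several instructions.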