Skip to content

Deobfuscation Passes

Lander Brandt edited this page Aug 12, 2021 · 2 revisions

At a high-level, unfuck operates as follows:

  1. The bytecode is walked and disassembled from the very first instruction.
  2. When a jump is encountered, both branches are queued for parsing unless the jump is to an invalid target.
  3. When a "terminating" instruction (such as a RETURN_VALUE) is encountered no more parsing happens down that execution path.
  4. When all instructions are parsed, a graph structure is built to construct all basic blocks (BBs).

Once the graph is constructed:

  1. BBs with bad/invalid instructions are replaced with a POP_TOP. This sometimes helps with decompilation by balancing the stack if constant propagation fails later on and a code path cannot be eliminated.
  2. Unnecesary JUMP_FORWARD instructions are eliminated and BBs that can be joined into one are joined.
  3. Const conditions/opaque predicates are removed. This is where VM execution occurs and taint tracking happens. The process of taint tracking is fairly simple: if a value is constant or derived from constants, its value is pushed to the VM stack as (Some(Value), Vec<InstructionIndex>) where the vector contains every instruction which helped construct the Some(Value). If the value cannot be determined, its value is simply (None, vec![]). When a conditional jump is encountered, we look at the top-of-stack (TOS) value and if it is Some(...), we can evaluate which branch would be taken and remove the other code path (taking into consideration paths that are provably always executed and paths that we cannot prove will not be executed). -- An example of something we can figure out is if {1, 2, 3} & {2}: .... The two sets are loaded as consts and an intersecting set is created via the & operator. The result is a set containing {2}, which as a truthy value, evaluates to True. -- An example of something we cannot figure out is:
def some_custom_function():
    True
if some_custom_function():
    ...

In theory this would not be too difficult to evaluate, but is not simple to architect in a clean way.

  1. After opaque predicates are removed, we simplify the BBs again.
  2. We update the BB offsets to be "normal". We just removed a bunch of code, so we need any relative/absolute offsets referenced in the instructions to match what they would be in a file.
  3. RETURN_VALUE calls are "massaged" so that a decompiler can make better sense of it. e.g. if a value is returned in a for loop and outside of the for loop, sometimes the decompiler will have a difficult time representing this cleanly. This will duplicate the RETURN_VALUE into its own basic block.
  4. BB offsets are updated
  5. JUMP_FORWARD 0 instructions are inserted in certain locations to help match "interesting" bytecode generated by the compiler. This JUMP_FORWARD may have been lost when we joined basic blocks, or perhaps it was never picked up because it came after a terminating instruction... but the decompiler may depend on this in order to figure out if something is an if/else.
  6. BB offsets are updated.
  7. New bytecode is serialized.

This process means that if an instruction cannot be naturally exercised through the VM, it will not be parsed. For example:

JUMP_FORWARD 3
POP_TOP
POP_TOP
POP_TOP
LOAD_CONST 0
RETURN_VALUE

The 3 POP_TOP instructions will never be queued for disassembly since there is no way for a real VM to ever exercise this code path. It also means that there may be some cases where dead code cannot safely be eliminated.

Limitations

unfuck currently does not handle the following scenarios:

  • Some set operations.
  • Some dict operations.
  • Constant propagation across code objects.
  • No out-of-box support for remapped opcodes in the unfuck CLI tool, although this could be done trivially.
  • Loop unrolling.
  • No tracing through generators.
  • No handling of exceptions.
  • No automatic unwrapping of nested, serialized code objects that are decoded at runtime and exec'd. This can be handled specifically for your application by leveraging the const_jmp_instruction_walker. e.g.:
trace!("Found the target function -- finding constant index");
const_jmp_instruction_walker(
    function_code.code.as_slice(),
    Arc::clone(&function_code.consts),
    |instr, _offset| {
        if let TargetOpcode::LOAD_CONST = instr.opcode {
            constant_index = Some(instr.arg.unwrap() as usize);
            // stop walking instructions
            WalkerState::Break
        } else {
            WalkerState::Continue
        }
    },
)
.expect("failed to walk function instructions");

While this is a very naive example, the walker provides a very powerful state machine that can be abused to single-step the VM by returning various WalkerState results and modifying your own state machine as you encounter instructions in the VM.

Clone this wiki locally