-
Notifications
You must be signed in to change notification settings - Fork 12
Deobfuscation Passes
At a high-level, unfuck
operates as follows:
- The bytecode is walked and disassembled from the very first instruction.
- When a jump is encountered, both branches are queued for parsing unless the jump is to an invalid target.
- When a "terminating" instruction (such as a
RETURN_VALUE
) is encountered no more parsing happens down that execution path. - When all instructions are parsed, a graph structure is built to construct all basic blocks (BBs).
Once the graph is constructed:
- BBs with bad/invalid instructions are replaced with a
POP_TOP
. This sometimes helps with decompilation by balancing the stack if constant propagation fails later on and a code path cannot be eliminated. - Unnecesary
JUMP_FORWARD
instructions are eliminated and BBs that can be joined into one are joined. - Const conditions/opaque predicates are removed. This is where VM execution occurs and taint tracking happens. The process of taint tracking is fairly simple: if a value is constant or derived from constants, its value is pushed to the VM stack as
(Some(Value), Vec<InstructionIndex>)
where the vector contains every instruction which helped construct theSome(Value)
. If the value cannot be determined, its value is simply(None, vec![])
. When a conditional jump is encountered, we look at the top-of-stack (TOS) value and if it isSome(...)
, we can evaluate which branch would be taken and remove the other code path (taking into consideration paths that are provably always executed and paths that we cannot prove will not be executed). -- An example of something we can figure out isif {1, 2, 3} & {2}: ...
. The two sets are loaded as consts and an intersecting set is created via the&
operator. The result is a set containing{2}
, which as a truthy value, evaluates toTrue
. -- An example of something we cannot figure out is:
def some_custom_function():
True
if some_custom_function():
...
In theory this would not be too difficult to evaluate, but is not simple to architect in a clean way.
- After opaque predicates are removed, we simplify the BBs again.
- We update the BB offsets to be "normal". We just removed a bunch of code, so we need any relative/absolute offsets referenced in the instructions to match what they would be in a file.
-
RETURN_VALUE
calls are "massaged" so that a decompiler can make better sense of it. e.g. if a value is returned in a for loop and outside of the for loop, sometimes the decompiler will have a difficult time representing this cleanly. This will duplicate theRETURN_VALUE
into its own basic block. - BB offsets are updated
-
JUMP_FORWARD 0
instructions are inserted in certain locations to help match "interesting" bytecode generated by the compiler. ThisJUMP_FORWARD
may have been lost when we joined basic blocks, or perhaps it was never picked up because it came after a terminating instruction... but the decompiler may depend on this in order to figure out if something is anif/else
. - BB offsets are updated.
- New bytecode is serialized.
This process means that if an instruction cannot be naturally exercised through the VM, it will not be parsed. For example:
JUMP_FORWARD 3
POP_TOP
POP_TOP
POP_TOP
LOAD_CONST 0
RETURN_VALUE
The 3 POP_TOP
instructions will never be queued for disassembly since there is no way for a real VM to ever exercise this code path. It also means that there may be some cases where dead code cannot safely be eliminated.
unfuck
currently does not handle the following scenarios:
- Some set operations.
- Some dict operations.
- Constant propagation across code objects.
- No out-of-box support for remapped opcodes in the
unfuck
CLI tool, although this could be done trivially. - Loop unrolling.
- No tracing through generators.
- No handling of exceptions.
- No automatic unwrapping of nested, serialized code objects that are decoded at runtime and
exec
'd. This can be handled specifically for your application by leveraging the const_jmp_instruction_walker. e.g.:
trace!("Found the target function -- finding constant index");
const_jmp_instruction_walker(
function_code.code.as_slice(),
Arc::clone(&function_code.consts),
|instr, _offset| {
if let TargetOpcode::LOAD_CONST = instr.opcode {
constant_index = Some(instr.arg.unwrap() as usize);
// stop walking instructions
WalkerState::Break
} else {
WalkerState::Continue
}
},
)
.expect("failed to walk function instructions");
While this is a very naive example, the walker provides a very powerful state machine that can be abused to single-step the VM by returning various WalkerState
results and modifying your own state machine as you encounter instructions in the VM.