This is an unofficial implementation of the asm2vec model as a standalone Python package. The details of the model can be found in the original paper: (sp'19) Asm2Vec: Boosting Static Representation Robustness for Binary Clone Search against Code Obfuscation and Compiler Optimization.
This implementation is written in Python 3.7, and Python 3.7 or later is recommended. The only dependency of this package is numpy, which can be installed as follows:
python3 -m pip install numpy
To install the package, clone the repository:
git clone https://github.com/lancern/asm2vec.git
Add the following line to your .bashrc file to add asm2vec to your Python interpreter's search path for external packages:
export PYTHONPATH="path/to/asm2vec:$PYTHONPATH"
Replace path/to/asm2vec with the directory you cloned asm2vec into. Then run the following command to update PYTHONPATH:
source ~/.bashrc
Alternatively, add the following snippet to any Python source file that uses asm2vec so that the interpreter can find the package:
import sys
sys.path.append('path/to/asm2vec')
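A slightly more defensive variant of the snippet above is sketched here. It resolves the directory to an absolute path and avoids adding the same entry twice. The helper name add_package_dir is hypothetical and not part of asm2vec, and path/to/asm2vec is still the placeholder to replace with your clone directory.

```python
import sys
from pathlib import Path


def add_package_dir(package_dir):
    """Append package_dir to sys.path as an absolute path, at most once.

    Hypothetical helper for illustration; it is not part of asm2vec.
    """
    resolved = str(Path(package_dir).expanduser().resolve())
    if resolved not in sys.path:
        sys.path.append(resolved)
    return resolved


# Placeholder path: replace it with the directory you cloned asm2vec into.
added = add_package_dir('path/to/asm2vec')
```

Calling the helper repeatedly with the same directory leaves sys.path unchanged after the first call.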
In your Python code, use the following import statement to import this package:
import asm2vec.<module-name>
There are two ways to define the binary program that will be passed to the asm2vec model. The first is to build the CFG manually, as shown below:
from asm2vec.asm import BasicBlock
from asm2vec.asm import Function
from asm2vec.asm import parse_instruction
block1 = BasicBlock()
block1.add_instruction(parse_instruction('mov eax, ebx'))
block1.add_instruction(parse_instruction('jmp _loc'))
block2 = BasicBlock()
block2.add_instruction(parse_instruction('xor eax, eax'))
block2.add_instruction(parse_instruction('ret'))
block1.add_successor(block2)
block3 = BasicBlock()
block3.add_instruction(parse_instruction('sub eax, [ebp]'))
f1 = Function(block1, 'some_func')
f2 = Function(block3, 'another_func')
# block4 is omitted here for clarity; construct it like the blocks above
f3 = Function(block4, 'estimate_func')
And then you can train a model with the following code:
from asm2vec.model import Asm2Vec
model = Asm2Vec(d=200)
train_repo = model.make_function_repo([f1, f2, f3])
model.train(train_repo)
The second approach is to use the parse module provided by asm2vec to build CFGs automatically from an assembly source file:
from asm2vec.parse import parse_fp
with open('source.asm', 'r') as fp:
    funcs = parse_fp(fp)
And then you can train a model with the following code:
from asm2vec.model import Asm2Vec
model = Asm2Vec(d=200)
train_repo = model.make_function_repo(funcs)
model.train(train_repo)
You can use the asm2vec.model.Asm2Vec.to_vec method to convert a function into its vector representation.
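Function vectors are typically compared with cosine similarity when searching for clones. The sketch below shows that comparison on stand-in vectors. It assumes that to_vec returns a one-dimensional sequence of floats (this README does not document the return type), and the helper cosine_similarity is written here for illustration and is not part of asm2vec.

```python
import math


def cosine_similarity(v1, v2):
    """Return the cosine similarity of two equal-length numeric sequences."""
    if len(v1) != len(v2):
        raise ValueError('vectors must have the same length')
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        raise ValueError('cosine similarity is undefined for a zero vector')
    return dot / (norm1 * norm2)


# Stand-in vectors for illustration only. With a trained model they would
# come from calls such as model.to_vec(f1) and model.to_vec(f2).
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]   # same direction as vec_a
vec_c = [3.0, 0.0, -1.0]  # orthogonal to vec_a

similar = cosine_similarity(vec_a, vec_b)
unrelated = cosine_similarity(vec_a, vec_c)
```

Values close to 1 indicate vectors pointing in nearly the same direction, and values close to 0 indicate unrelated vectors.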
The implementation supports serialization of many of its internal data structures, so you can save the internal state of a trained model to disk for future use.
You can serialize two data structures to primitive data: the function repository and the model memento.
To be finished.
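Once a structure has been reduced to primitive data, writing it to disk needs only the standard library. The sketch below assumes the primitive data consists of JSON-compatible values (dictionaries, lists, strings and numbers), which this README does not confirm. The helpers save_primitive and load_primitive are hypothetical, the dictionary is a placeholder rather than asm2vec's actual format, and the asm2vec calls that produce the primitive data are not shown.

```python
import json
import os
import tempfile


def save_primitive(data, path):
    """Write JSON-compatible primitive data to path (hypothetical helper)."""
    with open(path, 'w', encoding='utf-8') as fp:
        json.dump(data, fp)


def load_primitive(path):
    """Read primitive data previously written by save_primitive."""
    with open(path, 'r', encoding='utf-8') as fp:
        return json.load(fp)


# Placeholder data for illustration; this is not asm2vec's serialized format.
placeholder = {'label': 'example', 'values': [0.5, -1.25, 3.0]}

with tempfile.TemporaryDirectory() as tmp_dir:
    state_path = os.path.join(tmp_dir, 'state.json')
    save_primitive(placeholder, state_path)
    restored = load_primitive(state_path)
```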
The constructor of the asm2vec.model.Asm2Vec class accepts keyword arguments that set the hyperparameters of the model. The following table lists all available hyperparameters:
Parameter Name | Type | Meaning | Default Value |
---|---|---|---|
d | int | The dimension of the vectors for tokens. | 200 |
initial_alpha | float | The initial learning rate. | 0.05 |
alpha_update_interval | int | How many tokens can be processed before changing the learning rate? | 10000 |
rnd_walks | int | How many random walks to perform to sequentialize a function? | 3 |
neg_samples | int | How many samples to take during negative sampling? | 25 |
iteration | int | How many iterations to perform? (This parameter is reserved for future use and is not implemented now) | 1 |
jobs | int | How many tasks to execute concurrently during training? | 4 |
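Because the hyperparameters are plain keyword arguments, they can be collected in a dictionary and unpacked into the constructor. The sketch below copies the defaults from the table and rejects misspelled names before they reach the model. The names DEFAULT_HYPERPARAMETERS and make_hyperparameters are hypothetical and not part of asm2vec.

```python
# Defaults copied from the hyperparameter table in this README.
DEFAULT_HYPERPARAMETERS = {
    'd': 200,
    'initial_alpha': 0.05,
    'alpha_update_interval': 10000,
    'rnd_walks': 3,
    'neg_samples': 25,
    'iteration': 1,
    'jobs': 4,
}


def make_hyperparameters(**overrides):
    """Return the defaults with selected values overridden.

    Hypothetical helper for illustration; it is not part of asm2vec.
    """
    unknown = set(overrides) - set(DEFAULT_HYPERPARAMETERS)
    if unknown:
        raise ValueError('unknown hyperparameters: ' + ', '.join(sorted(unknown)))
    params = dict(DEFAULT_HYPERPARAMETERS)
    params.update(overrides)
    return params


# Smaller vectors and more random walks; everything else keeps its default.
params = make_hyperparameters(d=100, rnd_walks=5)

# With asm2vec on the search path, the dictionary can be unpacked into the
# constructor described above:
#     from asm2vec.model import Asm2Vec
#     model = Asm2Vec(**params)
```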
For simplicity, Selective Callee Expansion is not implemented in this early version. You have to perform it manually before passing a CFG to asm2vec.