Hi there, welcome to this page!
This page contains the code and data used in the paper "Vulnerability Discovery with Function Representation Learning from Unlabeled Projects" by Guanjun Lin, Jun Zhang, Wei Luo, Lei Pan and Yang Xiang.
- Tensorflow
- Keras
- Python >= 2.7
- CodeSensor
The dependencies can be installed using Anaconda. For example:
$ bash Anaconda3-5.0.1-Linux-x86_64.sh
The Vulnerabilities_info.xlsx file contains information about the collected function-level vulnerabilities. These vulnerabilities come from 3 open source projects: FFmpeg, LibTIFF and LibPNG. The vulnerability information was collected from the National Vulnerability Database (NVD) up to mid-July 2017.
The "Data" folder contains one Zip file for each of the 3 projects, holding the source code of its vulnerable and non-vulnerable functions. After unzipping the files, one will find that the source file of each vulnerable function is named with its CVE ID. The source files of non-vulnerable functions are named in the "{filename}_{functionname}.c" format.
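The naming convention above is enough to recover a label for each source file. The following is a minimal sketch, assuming that a vulnerable file name starts with a CVE ID of the usual form (`CVE-<year>-<number>`); the helper name `label_from_filename` and the example file names are hypothetical and are not part of this repository.

```python
import re

# Assumption: a vulnerable function's file name begins with a CVE ID,
# e.g. "CVE-2017-0000.c". Any other name is treated as non-vulnerable.
CVE_PATTERN = re.compile(r"^CVE-\d{4}-\d{4,}", re.IGNORECASE)


def label_from_filename(name):
    """Return 1 if the file name starts with a CVE ID, otherwise 0."""
    return 1 if CVE_PATTERN.match(name) else 0


# Hypothetical file names, used only for illustration.
examples = ["CVE-2017-0000.c", "example_decode_frame.c"]
labels = [label_from_filename(name) for name in examples]
```

Here `labels` holds one integer per file, which can be paired with the processed function representations later in the pipeline.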
The "Code" folder contains the Python code samples.
- ProcessCFilesWithCodeSensor.py invokes CodeSensor to parse functions into ASTs in serialized format (for details on CodeSensor and its usage, please visit the author's blog: http://codeexploration.blogspot.com.au/).
- ProcessRawASTs_DFT.py processes the output of ProcessCFilesWithCodeSensor.py and converts the serialized ASTs to textual vectors.
- BlurProjectSpecific.py blurs the project-specific content and converts the textual vectors (the output of ProcessRawASTs_DFT.py) to numeric vectors that can be used as input to ML algorithms.
- LSTM.py contains a Python code sample that implements an LSTM network based on Keras with a Tensorflow backend.
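To make the data flow between these scripts more concrete, here is a simplified sketch of the three processing ideas: a depth-first traversal of an AST into a textual vector, blurring of project-specific names, and mapping tokens to integers. It is an illustration only. The nested-tuple AST format, the function names, the `PROJ_ID` placeholder and the example tree are all assumptions; they do not reproduce CodeSensor's actual output format or the scripts in the "Code" folder.

```python
# Hypothetical AST node: (node_type, value, children).
# This is NOT CodeSensor's real serialization; it only illustrates the idea.


def depth_first_tokens(node):
    """Flatten an AST into a token list using pre-order depth-first traversal."""
    node_type, value, children = node
    # Use the node's value when it has one, otherwise fall back to its type.
    tokens = [value if value else node_type]
    for child in children:
        tokens.extend(depth_first_tokens(child))
    return tokens


def blur(tokens, project_names, placeholder="PROJ_ID"):
    """Replace project-specific names with a generic placeholder token."""
    return [placeholder if tok in project_names else tok for tok in tokens]


def to_numeric(sequences):
    """Map each distinct token to an integer; 0 is left free for padding."""
    vocab = {}
    encoded = []
    for seq in sequences:
        encoded.append([vocab.setdefault(tok, len(vocab) + 1) for tok in seq])
    return encoded, vocab


# Illustrative tree for a small function.
ast = ("FunctionDef", "png_read_row", [
    ("ReturnType", "void", []),
    ("CallExpression", "memcpy", [("Argument", "row", [])]),
])

tokens = depth_first_tokens(ast)
blurred = blur(tokens, project_names={"png_read_row"})
encoded, vocab = to_numeric([blurred])
```

The integer sequences in `encoded` would still need padding to a common length before being fed to a sequence model such as an LSTM.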
We used Understand, a commercial code enhancement tool, to extract function-level code metrics. The CodeMetrics.xlsx file includes 23 code metrics extracted from the vulnerable functions of the 3 projects.
- In our paper, we randomly selected one code metric, "essential complexity", as the proxy (used as a substitute for the actual label). It would be interesting to examine whether performance can be further improved by combining multiple code metrics, since multiple metrics provide more information and may be more indicative of potential vulnerability (e.g., overly complex code is difficult to understand and therefore harder to debug and maintain).
- The proposed LSTM network structure is fairly simple. We believe that performance can be improved by implementing a more complex network structure, for instance by adding pooling layers and/or dropout. One can even try an attention mechanism with the LSTM.
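As a starting point for the first suggestion above (combining multiple code metrics into one proxy), the sketch below standardizes each metric and averages the results per function. This is only one possible combination scheme and was not evaluated in the paper. The function names, the second metric name and all values are made up for illustration.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize a list of values to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant metric carries no ranking information.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def combined_proxy(metrics):
    """Average the standardized metrics for each function.

    metrics: dict mapping a metric name to a list of per-function values,
    with all lists in the same function order.
    """
    columns = [zscores(vals) for vals in metrics.values()]
    return [mean(per_function) for per_function in zip(*columns)]


# Made-up values for three functions; not taken from CodeMetrics.xlsx.
metrics = {
    "essential_complexity": [1, 3, 5],
    "cyclomatic_complexity": [2, 4, 12],
}
proxy = combined_proxy(metrics)
```

Standardizing first keeps a metric with a large numeric range from dominating the average; weighting the metrics differently would be an equally reasonable choice to explore.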
You are welcome to improve our code as well as our method. Please cite our paper if you use the code/data in your work. For more data or other inquiries, please contact: junzhang@swin.edu.au.
Thanks!