This repository contains a PyTorch implementation of a network that combines code2vec: Learning Distributed Representations of Code and code2seq: Generating Sequences from Structured Representations of Code.
The implementation is based on bentrevett's PyTorch code2vec implementation (https://github.com/bentrevett/code2vec).
It adds the LSTM path encoding from code2seq and softmax label classification.
- Python 3+
- PyTorch
- A CUDA-compatible GPU
- CometML
Run
```
./download_preprocess.sh
```
to get the datasets from the code2seq paper, then
```
./preprocess.sh
```
to create the necessary dictionary and format the data (this also saves the files with the suffix '.c2c'). Finally, run
```
python run.py
```
We have a training, testing and validation file, where:
- Each row is an example.
- Each example is a space-delimited list of fields, where:
- The first field is the target label, internally delimited by the "|" character
- Each of the following fields is a context, where each context has three components separated by commas (","). None of these components can contain spaces or commas.
We refer to these three components as a token, a path, and another token, but in general other types of ternary contexts can be considered.
Each token is a token in the code.
Each path is a path between two tokens, split into path nodes (or other kinds of building blocks) using the "|" character.
One example would look like:
<label-1>|...|<label-n> <context-1> ... <context-m>
Where each context is:
<left-token>,<path-node-1>|...|<path-node-p>,<right-token>
Here, <left-token> and <right-token> are tokens, and <path-node-1>|...|<path-node-p> is the syntactic path that connects them.
One row/example in a file could look like:
target1|target2 token1,path|that|leads|to,token2 token3,another|path,token2
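As a minimal sketch of the format above, a single row can be parsed like this (the function and variable names are illustrative, not part of the repository's code):

```python
def parse_example(line):
    """Split one space-delimited .c2c example into its labels and contexts."""
    fields = line.strip().split(" ")
    labels = fields[0].split("|")  # target label, internally "|"-delimited
    contexts = []
    for field in fields[1:]:
        # each context is <left-token>,<path>,<right-token>
        left_token, path, right_token = field.split(",")
        contexts.append((left_token, path.split("|"), right_token))
    return labels, contexts

labels, contexts = parse_example(
    "target1|target2 token1,path|that|leads|to,token2 token3,another|path,token2"
)
print(labels)       # ['target1', 'target2']
print(contexts[0])  # ('token1', ['path', 'that', 'leads', 'to'], 'token2')
```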
The examples are split up into 3 files:
<data_dir>/<data>/<data>.train.c2c
<data_dir>/<data>/<data>.test.c2c
<data_dir>/<data>/<data>.val.c2c
A dictionary (<data_dir>/<data>/<data>.dict.c2c) is also required. It is created by running ./preprocess.sh
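The exact layout of the dictionary file written by ./preprocess.sh is not specified here; as an illustration of the kind of vocabulary such a dictionary provides, a minimal token-to-index builder over the .c2c format might look like this (build_vocab and min_count are our own names):

```python
from collections import Counter

def build_vocab(lines, min_count=1):
    """Count every sub-label, token, and path node, then assign indices."""
    counts = Counter()
    for line in lines:
        fields = line.strip().split(" ")
        counts.update(fields[0].split("|"))      # target sub-labels
        for context in fields[1:]:
            left, path, right = context.split(",")
            counts.update([left, right])         # the two context tokens
            counts.update(path.split("|"))       # individual path nodes
    kept = [tok for tok, c in counts.most_common() if c >= min_count]
    return {tok: idx + 1 for idx, tok in enumerate(kept)}  # index 0 reserved for padding

vocab = build_vocab(["target1|target2 token1,path|that|leads|to,token2"])
```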