The ptgnn library offers model implementations for four sample tasks. This file
describes these tasks and how to run these models. We welcome external contributions for
other tasks.
The protein-protein interaction (PPI) task is a graph-labeling task where all nodes of the graph need to be labelled. To train and test a model, run:
python -m ptgnn.implementations.ppi.train DATA_PATH MODEL_FILENAME
where DATA_PATH contains the data extracted from the original work of Zitnik and Leskovec, 2017 and MODEL_FILENAME is the filename (of the form filename.pkl.gz) where the trained model will be stored.
The variable misuse task (Allamanis et al., 2018) is the problem of detecting variable misuse bugs in source code. The task is formulated as a classification problem: pick the correct node among a few candidate nodes for a given location in a program (a sort of fill-in-the-blank task). Each candidate node represents a single variable that could be placed at that location. The decision is made by considering the context of the location, given as a graph representation of the program.
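The candidate-selection formulation above can be illustrated with a minimal sketch. It assumes a model has already produced one score per candidate node. The function name pick_candidate, the candidate names, and the scores are hypothetical and are not part of the ptgnn API.

```python
import math
from typing import List, Tuple


def pick_candidate(scores: List[float]) -> Tuple[int, List[float]]:
    """Return the index of the highest-scoring candidate and the softmax probabilities.

    Hypothetical helper for illustration only; not a ptgnn function.
    """
    # Subtract the maximum score for numerical stability before exponentiating.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Ties are resolved in favour of the first candidate with the maximum probability.
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Hypothetical scores for three candidate variables at one program location.
candidate_names = ["i", "j", "length"]
best_index, probabilities = pick_candidate([0.2, 2.5, -1.0])
print(candidate_names[best_index])
```

In this sketch each location is treated independently: the scores are normalized only over the candidates for that location, which matches the "pick one among a few candidates" framing of the task.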
To train and test a model, run
python -m ptgnn.implementations.varmisuse.train TRAIN_DATA_PATH VALID_DATA_PATH TEST_DATA_PATH MODEL_FILENAME
where the data paths point to the train/validation/test folders and MODEL_FILENAME is the
target filename of the trained model.
The data used in Allamanis et al., 2018 can be downloaded from here.
The input data format is documented in the VarMisuseSample raw data type here.
The goal of the Graph2Sequence model is to predict a sequence given an input graph structure.
To achieve this, a GNN processes a graph and a GRU predicts the output sequence
step-by-step. The GRU includes an attention mechanism and a copying mechanism similar
to standard sequence-to-sequence models.
The ptgnn implementation is a variation of the GNN->GRU model of
Fernandes et al., 2019.
To train and test a model, run
python -m ptgnn.implementations.graph2seq.trainandtest TRAIN_DATA_PATH VALID_DATA_PATH TEST_DATA_PATH MODEL_FILENAME
where the data paths point to the train/validation/test .jsonl.gz files
and MODEL_FILENAME is the target filename of the trained model.
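Assuming that .jsonl.gz denotes gzip-compressed JSON Lines (one JSON value per line), such a file can be inspected with the Python standard library alone. The sketch below is not how ptgnn loads its data. The helper read_jsonl_gz is hypothetical, and the {"id": ...} records are placeholders rather than the actual sample schema.

```python
import gzip
import json
import os
import tempfile
from typing import Any, Iterator


def read_jsonl_gz(path: str) -> Iterator[Any]:
    """Yield one decoded JSON value per non-empty line of a gzip-compressed file.

    Hypothetical helper for illustration only; not a ptgnn function.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Round-trip demonstration with placeholder records (not the real data schema).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.jsonl.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write(json.dumps({"id": 0}) + "\n")
        f.write(json.dumps({"id": 1}) + "\n")
    samples = list(read_jsonl_gz(path))

print(len(samples))
```

Reading lazily with a generator keeps memory use flat for large files; materialize the result with list(...) only when the whole dataset is needed at once.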
The input data used in Fernandes et al., 2019 can be generated using these scripts.
The input data format is documented in the CodeGraph2Seq raw data type here.
The goal of graph2class is to classify a subset of graph nodes. Each to-be-classified
node represents a symbol (variable, parameter, function) of a Python program and the goal is
to classify each symbol to its type (e.g. int, str).
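The idea of classifying only a subset of the nodes can be sketched as follows. The sketch assumes that some model has already produced one score per type for every node of the graph. The function classify_target_nodes, the type vocabulary, and the scores are hypothetical and do not reflect the ptgnn implementation.

```python
from typing import Dict, List


def classify_target_nodes(
    node_scores: List[List[float]],
    target_nodes: List[int],
    type_vocabulary: List[str],
) -> Dict[int, str]:
    """Map each to-be-classified node index to its highest-scoring type.

    Hypothetical helper for illustration only; not a ptgnn function.
    """
    predictions = {}
    for node_idx in target_nodes:
        scores = node_scores[node_idx]
        # Index of the best-scoring type for this node.
        best = max(range(len(scores)), key=scores.__getitem__)
        predictions[node_idx] = type_vocabulary[best]
    return predictions


# Hypothetical scores for a four-node graph; only nodes 1 and 3 represent symbols.
type_vocabulary = ["int", "str", "bool"]
node_scores = [
    [0.1, 0.2, 0.3],
    [2.0, 0.5, -1.0],
    [0.0, 0.0, 0.0],
    [-0.5, 1.5, 0.3],
]
predictions = classify_target_nodes(node_scores, [1, 3], type_vocabulary)
print(predictions)
```

Nodes that do not represent symbols (nodes 0 and 2 here) still take part in the graph, but no prediction is made for them.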
To train and evaluate a model, run
python -m ptgnn.implementations.typilus.train TRAIN_DATA_PATH VALID_DATA_PATH TEST_DATA_PATH MODEL_FILENAME
The data used in Typilus can be generated following these steps.
The data generation process will create folders with .jsonl.gz files containing the graphs.
The input data format is documented in the TypilusGraph raw data type here.