This repository has been archived by the owner on Feb 24, 2022. It is now read-only.

CoDesc

This is the anonymous repository used for blind review, and it is no longer active. Please visit https://github.com/csebuetnlp/CoDesc for the updated code and dataset.

A large dataset of 4.2m Java source code snippets paired with their natural language descriptions, collected from code search and code summarization studies.

This is the public release of the code and data for our paper "CoDesc: Large Code-Description Parallel Dataset", submitted to ACL 2021.

Quickstart

# clone this repository
git clone https://github.com/code-desc/CoDesc.git

# change permission of scripts
sudo chmod -R +x CoDesc
cd CoDesc/

# setup
./Setup/setup.sh

Introduction

CoDesc is a large, noise-removed parallel dataset of source code and corresponding natural language descriptions. It is compiled from several similar but noisy datasets, including CodeSearchNet, FunCom, DeepCom, and CONCODE. We have developed and released the noise removal and preprocessing source code along with the dataset. We also demonstrate the usefulness of the CoDesc dataset on two popular tasks: natural language code search and source code summarization.

CoDesc Dataset

After the initial setup described in Quickstart, our dataset is downloaded to the data/ folder, along with the preprocessed data for the code search and code summarization tasks. We also provide the source datasets here. The dataset and preprocessed data are described below.

  1. CoDesc: This file contains our 4.2m dataset. The details of this dataset are given in our paper as well as on the Dataset Description page.

  2. Original_data: This file contains the source data from which we collected and preprocessed our 4.2m dataset.

  3. CSN_preprocessed_data: This file contains the preprocessed data for the CodeSearchNet challenge. Here the test and validation sets are the preprocessed datapoints from the original CodeSearchNet test and validation sets.

  4. CSN_preprocessed_data_balanced_partition: This file contains the preprocessed data for CodeSearchNet networks. Here the train, test, and validation sets come from the balanced partition described in our paper.

  5. NCS_preprocessed_data: This file contains the preprocessed data for neural code summarization networks.

  6. BPE_Tokenized_NCS_preprocessed_data: This file contains the preprocessed data for neural code summarization networks with BPE tokenization.

Python to Java Translation

We have created a forked repository of Transcoder that facilitates parallel translation of source code and speeds up the process by a factor of 16. Instructions for using Transcoder can be found in that repository. The original work is published under the title "Unsupervised Translation of Programming Languages".

CoDesc Dataset Creation

As mentioned above, the original data from the source datasets is provided in the data/original_data/ folder. To create the 4.2m CoDesc dataset from the original data, use the following command.

python Dataset_Preparation/Merge_Datasets.py

Preprocess CoDesc for Code Search

The following command preprocesses the CoDesc dataset for the CodeSearchNet challenge. It also preprocesses the CodeSearchNet validation and test sets using the filters defined in our paper.

python Dataset_Preparation/Preprocess_CSN.py

To create a balanced train-valid-test split for CodeSearchNet networks, use the following command.

python Dataset_Preparation/Preprocess_CSN_Balanced_Partition.py

Preprocess CoDesc for Code Summarization

The following command preprocesses the CoDesc dataset for NeuralCodeSum networks.

python Dataset_Preparation/Preprocess_NCS.py

To train a BPE tokenizer and create the tokenized files, use the following command.

python Tokenizer/huggingface_bpe.py
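
The dataset creation and preprocessing commands above can be chained from a single place. The following is a minimal sketch, assuming the scripts are run from the repository root, take no arguments, and are executed in the order this README presents them. The helper name `run_pipeline` and its `dry_run` flag are hypothetical and not part of the repository.

```python
import subprocess
import sys

# Script paths as listed in this README, in the order the sections present them.
PIPELINE = [
    "Dataset_Preparation/Merge_Datasets.py",
    "Dataset_Preparation/Preprocess_CSN.py",
    "Dataset_Preparation/Preprocess_CSN_Balanced_Partition.py",
    "Dataset_Preparation/Preprocess_NCS.py",
    "Tokenizer/huggingface_bpe.py",
]


def run_pipeline(dry_run=True):
    """Build (and optionally run) one command per preparation script.

    Hypothetical helper: with dry_run=True nothing is executed and the
    commands are only returned, so the order can be inspected first.
    """
    commands = [[sys.executable, script] for script in PIPELINE]
    if not dry_run:
        for command in commands:
            # check=True stops at the first script that exits with an error.
            subprocess.run(command, check=True)
    return commands
```

Calling `run_pipeline()` only returns the command list. Passing `dry_run=False` would execute the scripts one after another, which presumably takes a long time on the full 4.2m dataset.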

Tokenizer

The tokenizers for source code and natural language descriptions are given in the Tokenizer/ directory. To use the tokenizers in Python, import the code_filter and nl_filter functions from Tokenizer/CodePreprocess_final.py and Tokenizer/NLPreprocess_final.py. In addition, two JSON files named code_filter_flag.json and nl_filter_flag.json, which contain the options for preprocessing code and description data, must be present in the working directory. These two files must follow the formats given in the Tokenizer/ folder. The flag options are also briefly described in those JSON files.
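
Because the tokenizers depend on the two flag files being in the working directory, it can help to check for them before importing anything. The following is a minimal sketch under that assumption. The helper name `load_flag_files` is hypothetical, and nothing is assumed about the keys inside the JSON files.

```python
import json
from pathlib import Path

# File names taken from this README; their contents are not assumed here.
FLAG_FILES = ("code_filter_flag.json", "nl_filter_flag.json")


def load_flag_files(workdir="."):
    """Load both flag files from workdir, failing early if one is missing."""
    flags = {}
    for name in FLAG_FILES:
        path = Path(workdir) / name
        if not path.is_file():
            raise FileNotFoundError(
                f"{name} must be present in {Path(workdir).resolve()}"
            )
        with path.open(encoding="utf-8") as handle:
            flags[name] = json.load(handle)
    return flags
```

Once this check passes, code_filter and nl_filter can be imported from the two modules named above. Their call signatures are not documented in this README, so they are not shown here.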

The code for BPE tokenization is given in Tokenizer/huggingface_bpe.py.

Code Search

During the initial setup described in Quickstart, a forked version of CodeSearchNet is cloned into the working directory, and the preprocessed CoDesc data is copied to the CodeSearchNet/resources/data/ directory. To use the balanced-partition data instead, clear that directory and copy the contents of data/csn_preprocessed_data_balanced_partition/ into it.
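
The swap to the balanced partition can also be scripted. The following is a minimal sketch, assuming it is run from the repository root and that the two directory paths are exactly as written above. The helper name `use_balanced_partition` is hypothetical.

```python
import shutil
from pathlib import Path


def use_balanced_partition(
    source="data/csn_preprocessed_data_balanced_partition",
    target="CodeSearchNet/resources/data",
):
    """Clear target, then copy everything inside source into it."""
    source, target = Path(source), Path(target)
    if not source.is_dir():
        raise FileNotFoundError(f"source folder not found: {source}")
    # Remove the data that setup copied in, including the folder itself.
    if target.exists():
        shutil.rmtree(target)
    # copytree recreates target and copies the contents of source into it.
    shutil.copytree(source, target)
    return sorted(p.name for p in target.iterdir())
```

The function deletes the existing contents of the target directory, so any data that should be kept needs to be backed up first.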

Then the following commands will train and test code search networks:

cd CodeSearchNet/

script/console
wandb login

python train.py --model neuralbowmodel --run-name nbow_CoDesc
python train.py --model rnnmodel --run-name rnn_CoDesc
python train.py --model selfattentionmodel --run-name attn_CoDesc
python train.py --model convolutionalmodel --run-name conv_CoDesc
python train.py --model convselfattentionmodel --run-name convattn_CoDesc

Code Summarization

We used the original NeuralCodeSum implementation for code summarization. Please refer to this guide for instructions on how to train the code summarization network.

Licenses

Code, datasets, and models from CodeSearchNet and NeuralCodeSum are used under the licenses provided in their respective repositories.
The code, dataset, and preprocessed data in this repository are released under the MIT license.
