[NLBSE 2024] Dopamin: Transformer-based Comment Classifiers through Domain Post-training and Multi-level layer aggregation (Best Tool Award)

Dopamin: Transformer-based Comment Classifiers through Domain Post-training and Multi-level layer aggregation

This repository contains our implementation for training, testing, and using Dopamin, our submission to the NLBSE'24 Tool Competition: Code Comment Classification.

Quickstart Guide

Set up

Clone Dopamin repo:

git clone https://github.com/FSoft-AI4Code/Dopamin.git
cd Dopamin

Python >= 3.8

Install requirements: pip install -r requirements.txt

Note: We trained the model on 2 NVIDIA A100 GPUs with a batch size of 32 per GPU, for a total batch size of 64. Our results may not be reproducible on a single GPU, even with a batch size of 64.

Data preparation

Create data for the post-training stage:

python process_data.py --save_dir ./code-comment-classification/processed_data/all --post_training

Create the training and evaluation sets:

python process_data.py --save_dir ./code-comment-classification/processed_data/valid --validation

Create the original data:

python process_data.py --save_dir ./code-comment-classification/processed_data/novalid
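
The three data preparation commands above can also be driven from a single script. The following is a minimal sketch, not part of the repository: the helper names `build_data_commands` and `run_all` are hypothetical, and only the script name, flags, and paths shown above are used. By default it prints the commands without running them.

```python
import subprocess
import sys

# Base directory copied from the commands above.
DATA_ROOT = "./code-comment-classification/processed_data"


def build_data_commands(python=sys.executable):
    """Return the three process_data.py invocations in the order listed above.

    Hypothetical helper: it only assembles argument lists and runs nothing.
    """
    script = [python, "process_data.py"]
    return [
        script + ["--save_dir", f"{DATA_ROOT}/all", "--post_training"],
        script + ["--save_dir", f"{DATA_ROOT}/valid", "--validation"],
        script + ["--save_dir", f"{DATA_ROOT}/novalid"],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(build_data_commands(), dry_run=True)
```

Call `run_all(build_data_commands(), dry_run=False)` from the repository root to actually execute the steps.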

Training

All training and evaluation scripts can be found in the training directory.

Post-training stage

python training/autorun.py --output_dir ./models/Dopamin_post_training --post_training

To skip this stage, reuse the published post-trained model, dopamin-post-training.

Training Dopamin for each category

  1. Train the model with the validation set to obtain the best checkpoint step:
python training/autorun.py --output_dir ./models/Dopamin_valid --validation
  2. Train the model on the original training data, using the optimal step found in step 1:
python training/autorun.py --output_dir ./models/Dopamin --optimal_step_dir ./models/Dopamin_valid
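
The stages depend on each other: the final run reads its optimal step from the directory that the validation run writes to. The sketch below makes that dependency explicit by sharing one `valid_dir` value between the two commands. The function `build_training_commands` and its parameters are hypothetical and not part of the repository; only the flags and paths shown above are used. How `autorun.py` locates a reused post-trained model is not described here, so the sketch only omits the post-training command when asked to.

```python
import sys


def build_training_commands(
    skip_post_training=False,
    valid_dir="./models/Dopamin_valid",
    final_dir="./models/Dopamin",
    python=sys.executable,
):
    """Assemble the training/autorun.py invocations in dependency order.

    Hypothetical helper: it returns argument lists and runs nothing.
    """
    script = [python, "training/autorun.py"]
    commands = []
    if not skip_post_training:
        # Stage 0: domain post-training (optional if a post-trained model is reused).
        commands.append(
            script + ["--output_dir", "./models/Dopamin_post_training", "--post_training"]
        )
    # Stage 1: train with the validation set to find the best checkpoint step.
    commands.append(script + ["--output_dir", valid_dir, "--validation"])
    # Stage 2: retrain on the original training data, reading the step from stage 1.
    commands.append(
        script + ["--output_dir", final_dir, "--optimal_step_dir", valid_dir]
    )
    return commands


if __name__ == "__main__":
    for cmd in build_training_commands():
        print(" ".join(cmd))
```

Because `valid_dir` is passed to both stage 1 and stage 2, changing the validation output location cannot leave `--optimal_step_dir` pointing at a stale path.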

Evaluation

To evaluate Dopamin, refer to the evaluation notebook, or use the script:

python training/predict.py --model_name codebert-hsum \
                           --model_path ./models/Dopamin

All model checkpoints are publicly available on the Hugging Face Hub (Dopamin) for replication purposes.

Citation

@software{
  Dopamin_2024,
  author = {Hai, Nam Le and Bui, Nghi DQ},
  year = {2024},
  title = {Dopamin: Transformer-based Comment Classifiers through Domain Post-training and Multi-level layer aggregation},
  url = {https://github.com/FSoft-AI4Code/Dopamin},
  huggingface= {https://huggingface.co/collections/Fsoft-AIC/dopamin-6575bdeb7068a850897e4404}
}
