
Google AI4Code competition. 31st/34th place solution

A point-wise ranking approach: the model is based on the CodeT5-base encoder with a sequence length of 1024.
A writeup and discussion are available on Kaggle.
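
As a rough illustration of the point-wise setup (a sketch only, not the code in train.py; the pooling and head details here are assumptions), each cell's text is encoded and regressed to a single score representing its relative position, and cells are later sorted by that score:

import tensorflow as tf
from transformers import TFT5EncoderModel

def build_ranker(model_dir="model-codet5base", max_len=1024):
    # Encoder side of the pretrained CodeT5-base checkpoint
    encoder = TFT5EncoderModel.from_pretrained(model_dir)
    input_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_ids")
    attention_mask = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="attention_mask")
    hidden = encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
    # Mean-pool the token states, ignoring padding positions
    pooled = tf.keras.layers.GlobalAveragePooling1D()(hidden, mask=tf.cast(attention_mask, tf.bool))
    # Single sigmoid output: the cell's predicted relative position in [0, 1]
    score = tf.keras.layers.Dense(1, activation="sigmoid")(pooled)
    return tf.keras.Model([input_ids, attention_mask], score)

Scoring each cell independently keeps inference simple compared to pair-wise or list-wise ranking.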

Requirements

Hardware

Training: 4 cores, 16 GB RAM, TPUv3-8
Inference: 2 cores, 12 GB RAM, P100

Software

Ubuntu 18.04
Python: 3.9.7
CUDA: 11.2 (for GPU inference)
cuDNN: 8.1.1 (for GPU inference)

Install

I used TF 2.8.0 for training, but newer versions should also work:

git clone https://github.com/vecxoz/ai4code
cd ai4code
pip3 install -r requirements.txt

Inference using trained weights

Inference time is about 7 hours for Kaggle’s hidden test dataset.

kaggle competitions download -c AI4Code
kaggle datasets download vecxoz/model-codet5base
kaggle datasets download vecxoz/ai4code-weights

unzip -q AI4Code.zip -d AI4Code
unzip -q model-codet5base.zip -d model-codet5base
unzip -q ai4code-weights.zip -d ai4code-weights

python3 infer.py --data_dir=AI4Code --weight_dir=ai4code-weights --model_dir_or_name=model-codet5base
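
Conceptually, the per-cell scores become a submission by sorting each notebook's cells by predicted position. A simplified sketch (the preds.csv frame is hypothetical; the id/cell_order columns follow the competition's sample submission format):

import pandas as pd

preds = pd.read_csv("preds.csv")  # hypothetical frame: one row per cell with id, cell_id, score

# Order cells within each notebook by predicted position and join the
# cell ids into the space-separated cell_order column of the submission
sub = (preds.sort_values(["id", "score"])
            .groupby("id")["cell_id"]
            .apply(" ".join)
            .reset_index()
            .rename(columns={"cell_id": "cell_order"}))
sub.to_csv("submission.csv", index=False)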

On Kaggle, choose a P100 GPU notebook, attach the two datasets model-codet5base and ai4code-weights, and set the paths accordingly.
If you use newly trained models for inference, adjust the ensemble coefficients according to their performance, for example as sketched below.
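
A minimal blending sketch (the coefficient values and file names are placeholders, not what infer.py uses):

import numpy as np

# Hypothetical per-fold score arrays, one score per markdown cell
preds_fold0 = np.load("preds_fold0.npy")
preds_fold1 = np.load("preds_fold1.npy")

# Weight the folds by validation performance; these values are
# placeholders and should be tuned on your own validation scores
coefs = np.array([0.6, 0.4])
blended = coefs[0] * preds_fold0 + coefs[1] * preds_fold1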

Create training data

Creating the data takes about 3 hours on a GCP VM.
For unclear reasons it can take much longer in Kaggle's latest notebook environment.

python3 create_data.py --data_dir=AI4Code --out_dir=ai4code-tfrec

A prebuilt dataset is available on Kaggle. You can attach it to your notebook or download it:

kaggle datasets download vecxoz/ai4code-tfrec
unzip -q ai4code-tfrec.zip -d ai4code-tfrec
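
To sanity-check the TFRecords before training, you can list the feature names of one example (the *.tfrec file pattern is an assumption; see create_data.py for the actual layout and schema):

import tensorflow as tf

# Parse the first serialized example and print its feature names
ds = tf.data.TFRecordDataset(tf.io.gfile.glob("ai4code-tfrec/*.tfrec"))
for raw in ds.take(1):
    example = tf.train.Example()
    example.ParseFromString(raw.numpy())
    print(list(example.features.feature.keys()))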

Train

I trained the first two folds (0 and 1) for 20 and 7 full epochs respectively.
Both runs were interrupted before full convergence.
Training takes about 3.5 hours per epoch.

python3 train.py --data_tfrec_dir=ai4code-tfrec --initial_fold=0 --final_fold=2

On Kaggle, choose a TPU notebook, attach the ai4code-tfrec dataset, and set the path accordingly.
Due to Kaggle's time limits, each fold has to be trained across several separate sessions.
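
For reference, the standard TF setup for a TPUv3-8 looks like the following (a generic sketch, not the exact code in train.py; build_ranker and the MSE loss are assumptions carried over from the model sketch above):

import tensorflow as tf

# Standard TPU initialization; on Kaggle the resolver discovers the TPU automatically
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    model = build_ranker()  # hypothetical builder, see the sketch above
    model.compile(optimizer="adam", loss="mse")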

Acknowledgement

Many thanks to the TRC (TPU Research Cloud) program for TPU resources.
