Google AI4Code competition: 31st (public) / 34th (private) place solution.
Point-wise ranking approach. The model is based on the CodeT5-base encoder with a sequence length of 1024.
Writeup and discussion on Kaggle.
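To recap the formulation: each code cell keeps its given order and is assigned its normalized position in [0, 1]; the model predicts an analogous fractional position for every markdown cell, and sorting all cells by these scores restores the notebook order. A minimal sketch of the ordering step (names are illustrative, not the actual repo code):

def order_cells(code_cells, md_cells, md_scores):
    # Code cells keep their original relative order; each gets its
    # normalized position as its score.
    n = len(code_cells)
    scored = [(i / max(n - 1, 1), c) for i, c in enumerate(code_cells)]
    # Markdown cells use the model's predicted fractional positions.
    scored += list(zip(md_scores, md_cells))
    # Sorting by score interleaves markdown cells among the code cells.
    return [c for _, c in sorted(scored)]

# Example: the model places 'md1' right after the first code cell.
print(order_cells(['c1', 'c2', 'c3'], ['md1'], [0.2]))  # ['c1', 'md1', 'c2', 'c3']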
Hardware
Training: 4 cores, 16 GB RAM, TPUv3-8
Inference: 2 cores, 12 GB RAM, P100
Software
Ubuntu 18.04
Python: 3.9.7
CUDA: 11.2 (for GPU inference)
cuDNN: 8.1.1 (for GPU inference)
I used TF 2.8.0 for training, but newer versions should work as well.
git clone https://github.com/vecxoz/ai4code
cd ai4code
pip3 install -r requirements.txt
Inference time is about 7 hours for Kaggle’s hidden test dataset.
kaggle competitions download -c AI4Code
kaggle datasets download vecxoz/model-codet5base
kaggle datasets download vecxoz/ai4code-weights
unzip -q AI4Code.zip -d AI4Code
unzip -q model-codet5base.zip -d model-codet5base
unzip -q ai4code-weights.zip -d ai4code-weights
python3 infer.py --data_dir=AI4Code --weight_dir=ai4code-weights --model_dir_or_name=model-codet5base
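infer.py produces a submission in the competition format: one row per notebook id with the predicted cell ids in order, space-separated. A hedged sketch of that final step (variable contents are toy examples):

import pandas as pd

# Toy example: one notebook with its cells already sorted by predicted score.
notebook_ids = ['nb_001']
ordered_cells = [['c1', 'md1', 'c2', 'c3']]
sub = pd.DataFrame({'id': notebook_ids,
                    'cell_order': [' '.join(c) for c in ordered_cells]})
sub.to_csv('submission.csv', index=False)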
If you run inference with newly trained models, adjust the ensemble coefficients according to their validation performance.
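For example, with two fold models the blend is just a weighted average of their predicted positions (the coefficients below are placeholders, not the tuned values):

import numpy as np

# Hypothetical per-cell predictions from the two fold models.
preds_fold0 = np.array([0.10, 0.55, 0.90])
preds_fold1 = np.array([0.12, 0.50, 0.95])
coefs = [0.6, 0.4]  # placeholder weights; set them from validation scores
ensemble = coefs[0] * preds_fold0 + coefs[1] * preds_fold1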
On Kaggle, choose a P100 GPU notebook, attach the model-codet5base and ai4code-weights datasets, and set the paths accordingly.
Creating the data takes about 3 hours on a GCP VM.
For unclear reasons, it may take much longer in Kaggle's latest notebook environment.
python3 create_data.py --data_dir=AI4Code --out_dir=ai4code-tfrec
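Conceptually, each TFRecord example stores the tokenized cell text plus the target fractional position. A hedged sketch of the serialization (feature names and details are assumptions, not the exact create_data.py logic):

import tensorflow as tf

def make_example(input_ids, attention_mask, target):
    # One point-wise training example: token ids, attention mask,
    # and the target fractional position of the cell.
    feat = {
        'input_ids': tf.train.Feature(int64_list=tf.train.Int64List(value=input_ids)),
        'attention_mask': tf.train.Feature(int64_list=tf.train.Int64List(value=attention_mask)),
        'target': tf.train.Feature(float_list=tf.train.FloatList(value=[target])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feat))

# Toy example: a 3-token cell whose true position is 25% into the notebook.
with tf.io.TFRecordWriter('toy.tfrecord') as writer:
    writer.write(make_example([1, 2, 3], [1, 1, 1], 0.25).SerializeToString())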
There is a prebuilt dataset on Kaggle. You can attach it to your notebook or download:
kaggle datasets download vecxoz/ai4code-tfrec
unzip -q ai4code-tfrec.zip -d ai4code-tfrec
I trained the first two folds (0 and 1) for 20 and 7 full epochs, respectively.
Both runs were interrupted before full convergence.
Training time is about 3.5 hours per epoch.
python3 train.py --data_tfrec_dir=ai4code-tfrec --initial_fold=0 --final_fold=2
On Kaggle, choose a TPU notebook, attach the ai4code-tfrec dataset, and set the path accordingly.
Due to Kaggle's session time limits, each fold has to be trained across several separate sessions.
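For reference, TPU training in TF 2.x follows the standard pattern below (a generic sketch, not the actual train.py code; the model is a trivial stand-in, while the real one wraps the CodeT5 encoder with a regression head):

import tensorflow as tf

# Standard TPU initialization; on Kaggle and GCP the resolver
# discovers the TPU from the environment.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    # Stand-in model: a single regression output in [0, 1].
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation='sigmoid')])
    model.compile(optimizer='adam', loss='mae')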
Many thanks to the TRC program for TPU resources.