Training through Kaggle GPU

Fabian Fichter edited this page Apr 19, 2024 · 1 revision

If you don't have a CUDA-capable GPU, or if you're having trouble setting up your Python environment, Kaggle could be a good option for you. CUDA, PyTorch, and the rest of the required environment are included by default, so you can follow the instructions below to train NNUE without any local setup.

Notebook: https://www.kaggle.com/fabianfichter/variant-nnue-demo

What is Kaggle?

Kaggle is a crowd-sourced platform where data scientists work on data problems. It is provided by Google and offers a free GPU quota to everyone (30 hr/week), which is usually enough for NNUE training. You must register a Kaggle account before you proceed.

See Kaggle Docs for further information.

Preparation

  1. Follow training data generation to generate enough data.
  2. Fork this repository.
  3. Adjust the code according to the output of the data generator.

Training

1. Upload the training data (Homepage -> Datasets -> new dataset).

2. Open the notebook and click Edit at the top.

3. Make sure to enable "Internet" in the settings; this requires verifying your phone number first.

4. Turn on the GPU.

5. Add the training data (sidebar -> add data -> your data).

You should then see it under Input.

6. Change the repository in the second command to your own fork.

In my case, it would be

7. Change the fifth command to:

!cd variant-nnue-pytorch && python train.py --gpus 1 --max_epochs 10 location_of_your_data location_of_your_data 

In my case, it's:

!cd variant-nnue-pytorch && python train.py --gpus 1 --max_epochs 10 /kaggle/input/xiangqi-data/xiangqi_data.bin /kaggle/input/xiangqi-data/xiangqi_data.bin

You can copy the location by clicking the copy button next to your training data.

For more information about the parameters, see https://github.com/fairy-stockfish/variant-nnue-pytorch/wiki/NNUE-training#training-example.

8. Run all the cells.
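
If you prefer not to copy the data location by hand in step 7, the path can also be looked up from a notebook cell. The following is only a sketch: the helper names `find_bin_files` and `build_train_command` are made up for illustration, and it assumes your training data has a `.bin` extension and is mounted under `/kaggle/input`, as in the example above.

```python
from pathlib import Path
import shlex


def find_bin_files(input_root="/kaggle/input"):
    """Return all .bin files below input_root, sorted by path (hypothetical helper)."""
    root = Path(input_root)
    if not root.is_dir():
        return []
    return sorted(root.rglob("*.bin"))


def build_train_command(train_path, val_path=None, max_epochs=10, gpus=1):
    """Build the command from step 7 as a string (hypothetical helper)."""
    # The guide passes the same file as training and validation data.
    val_path = val_path or train_path
    args = [
        "python", "train.py",
        "--gpus", str(gpus),
        "--max_epochs", str(max_epochs),
        str(train_path), str(val_path),
    ]
    return "cd variant-nnue-pytorch && " + " ".join(shlex.quote(a) for a in args)


files = find_bin_files()
if files:
    print(build_train_command(files[0]))
else:
    print("no .bin files found under /kaggle/input")
```

The printed string can be pasted after a `!` in the training cell. If several datasets are attached, check which file was picked before running it.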

Result

After training finishes, you can add this command to check whether the checkpoint files were stored successfully.

!cd variant-nnue-pytorch/logs/default/version_0/checkpoints && ls

If the directory contains files like checkpoint_0.ckpt or last.ckpt, then the training is complete. You may want to export a checkpoint into an NNUE file:

!cd variant-nnue-pytorch && python serialize.py --features HalfKAv2^ /kaggle/working/variant-nnue-pytorch/logs/default/version_0/checkpoints/last.ckpt Your_NNUE.nnue

Note that the name of the nnue file should follow the naming rule.
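
The checkpoint check above can also be done from Python, which avoids an error message when the directory does not exist yet. This is a minimal sketch: `list_checkpoints` is a hypothetical helper, and the path assumes the default `logs/default/version_0` layout shown in the commands above (the version number may differ if you trained more than once in the same session).

```python
from pathlib import Path


def list_checkpoints(ckpt_dir):
    """Return the names of all .ckpt files in ckpt_dir, sorted (hypothetical helper)."""
    directory = Path(ckpt_dir)
    if not directory.is_dir():
        return []
    return sorted(p.name for p in directory.glob("*.ckpt"))


# Assumed default location, taken from the commands in this section.
CKPT_DIR = "variant-nnue-pytorch/logs/default/version_0/checkpoints"

names = list_checkpoints(CKPT_DIR)
print(names if names else f"no checkpoints in {CKPT_DIR}")
```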

Save & Export

Click the Save version button at the top, then select Quick Save.

Exit and go back to the main page of the notebook. Click Data and you'll see the output. You can download everything by clicking the three dots.

The .nnue file is located in the variant-nnue-pytorch directory.

Troubleshooting

1. The process stops automatically without warning

Kaggle resets the session every 12 hr, so you'll have to keep track of the time and save before it resets. Note that everything is cleared except the files you have saved.
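
One way to keep track of the time is to record a start timestamp in the first cell and check the remaining budget later. This is only a sketch: the function names are hypothetical, the 12-hour limit is taken from the note above, and the timestamp marks when the cell ran, which may be somewhat later than the real session start.

```python
import time

SESSION_LIMIT_HOURS = 12.0  # session limit as stated in this guide


def remaining_hours(start_time, now=None, limit_hours=SESSION_LIMIT_HOURS):
    """Hours left before the limit, never negative (hypothetical helper)."""
    now = time.time() if now is None else now
    elapsed_hours = (now - start_time) / 3600.0
    return max(0.0, limit_hours - elapsed_hours)


def should_stop(start_time, now=None, safety_margin_hours=0.5):
    """True once less than the safety margin remains, leaving time to save."""
    return remaining_hours(start_time, now) <= safety_margin_hours


# Run this in the first cell; call remaining_hours(start) again later.
start = time.time()
print(f"{remaining_hours(start):.2f} h left in this session")
```

The half-hour safety margin is an arbitrary choice; pick one that leaves enough time to export and save your files.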

2. I copied the whole notebook and followed all the commands, but a runtime error occurs during training

Kaggle updates its packages once in a while. Make sure you change the environment to the same one the notebook uses.

Also, make sure you've turned on the GPU.
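
To compare environments, it can help to print the installed package versions in both your copy and the original notebook. The sketch below uses only the standard library; `installed_versions` is a hypothetical helper, and the package names `torch` and `pytorch-lightning` are an assumption about what the trainer depends on, so adjust the list to match the error you see.

```python
from importlib import metadata


def installed_versions(package_names):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed package list; replace with the packages relevant to your error.
for name, version in installed_versions(["torch", "pytorch-lightning"]).items():
    print(f"{name}: {version or 'not installed'}")
```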