To use the GPU machines on cluster, you need to log in to cluster. Here is the nice introduction of how to log in: https://www.alessandravalcarcel.com/blog/2019-04-23-ssh/
lpcgpu01
, which is the host for GPU machines, is accessible via takim
server. You may enter the takim
server by typing this:
ssh -X <pennkey>@takim.pmacs.upenn.edu
*** Don't read this if you are not familiar with GPU computing ***
For advanced user, looking for additional GPU power, we have six additional GPU cores, exclusively available to us. takim2
is a submit host (not a excutible host) that can be used for GPU computing. You can access it by typing this:
ssh -X <pennkey>@takim2.pmacs.upenn.edu
or
ssh -X <pennkey>@takim2
Note that this is a submit host, not a executable host. You can directly use it as a interactive session, but can't submit a normal job to takim2
.
If you intend to use an interactive session, consider using screen so that you don't lose your work. You can open a screen by
screen -S <Screen-name>
and start your work. You can find the details about how to use a screen at: https://www.alessandravalcarcel.com/blog/2019-06-12-interactivesession1/
Once you are in the executable host, you can open an interactive session by typing this:
bsub -Is -q lpcgpu -gpu "num=1" -n 1 "bash"
Make sure you request gpu. "num=1" requests the number of GPU whereas "-n 1" requests the number of CPU.
Next, load torch and tensorflow to activate CUDA
module load torch
module load tensorflow/2.3-GPU
To check whether your CUDA is running, run this:
python
In Python, run this:
import torch
torch.cuda.is_available()
torch.cuda.device_count()
torch.cuda.current_device()
torch.cuda.device(0)
torch.cuda.get_device_name(0)
The response should be
torch.cuda.is_available() : True
torch.cuda.device_count() : 1
torch.cuda.current_device() : 0
torch.cuda.device(0) : <torch.cuda.device object at 0x2ae68db7cb20>
torch.cuda.get_device_name(0) : 'NVIDIA GeForce RTX 2080 Ti'
Now, you are ready to use the GPU!
Once you are in the executable host, you can submit a normal job, usually with the bash file.
bsub -q lpcgpu -gpu "num=1" -n 1 -J "orig[1-3]" -o <where to save your log file> <location of your bash file>
For example, I save my log file at /home/ecbae/nnUNet.txt
and bash file at /home/ecbae/orig.sh
. Note that my job index is [1-3], which in result requests 3 GPU cores and 3 CPU cores.
My bash file looks like this:
module load torch
module load tensorflow/2.3-GPU
nnUNet_plan_and_preprocess -t $(( $LSB_JOBINDEX +149 ))
nnUNet_train 2d nnUNetTrainerV2 $(( $LSB_JOBINDEX +149 )) 0 --npz
nnUNet_train 2d nnUNetTrainerV2 $(( $LSB_JOBINDEX +149 )) 1 --npz
nnUNet_train 2d nnUNetTrainerV2 $(( $LSB_JOBINDEX +149 )) 2 --npz
nnUNet_train 2d nnUNetTrainerV2 $(( $LSB_JOBINDEX +149 )) 3 --npz
nnUNet_train 2d nnUNetTrainerV2 $(( $LSB_JOBINDEX +149 )) 4 --npz
Note that we have 10 GPU cores in lpcgpu01
host. If you need more GPU cores, you may want to use the takim2
host, discussed in the first section.
Also, I am running a pre-installed python package nnUNet
. To install the existing package, you should submit a ticket to PMACS or send an email to Martin Das.
We need more GPU!