Singularity brings containers to traditional HPC use cases and centers. (Note: the project has since moved into the Linux Foundation and been renamed Apptainer.)
We first need to download one of the images created by Pangeo; they are all hosted on Docker Hub.
After ssh-ing into your HPC system, load Singularity:
module load singularity
Pull the desired image (for example ml-notebook, which includes TensorFlow and GPU support) under a name of our choice (in our case tensorflow.sif):
singularity pull tensorflow.sif docker://pangeo/ml-notebook
Note I: Depending on the size of the image, this could take some time and some warnings may appear. It may be a good idea to do some other work in the meantime.
Note II: To use a different image, just change what follows docker:// to the name of the image as it appears on Docker Hub.
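For example, to pull the base Pangeo notebook image instead (assuming pangeo/pangeo-notebook is the image you want, saved here under the arbitrary name pangeo.sif):
singularity pull pangeo.sif docker://pangeo/pangeo-notebook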
After being patient, the file tensorflow.sif should be available in your home folder.
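As an optional sanity check, you can confirm the image file is there and see its size:
ls -lh ~/tensorflow.sif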
To request resources and run a Jupyter Notebook on a compute node, you need a batch script, named here for example batch_tflw_v100s.sh.
To create it, run vi batch_tflw_v100s.sh and paste the text below.
Important: Make sure all paths and account settings are adjusted to your own case.
#!/bin/sh
#
#SBATCH --account=abernathey # The account name for the job.
#SBATCH --job-name=jupyter # The job name.
#SBATCH --gres=gpu:1 # Request 1 GPU (up to 2 GPUs per GPU node).
#SBATCH --partition=ocp_gpu
#SBATCH --constraint=v100s
#SBATCH -c 32 # The number of CPU cores to use.
#SBATCH --time=0-04:00 # Maximum run time in D-HH:MM.
#SBATCH --output=/home/%u/jupyter.log # %u expands to your username (sbatch does not expand $USER in #SBATCH lines). This log is needed to retrieve the node and port of the notebook; if omitted, output goes to a slurm-<jobid>.out file.
module load singularity
cat /etc/hosts # Print the compute node's host entries.
singularity exec --nv --cleanenv --bind /home/$USER:/run/user tensorflow.sif jupyter notebook --notebook-dir=/home/$USER --no-browser --ip=0.0.0.0
To exit the Vi editor, make sure to be in command mode by pressing ESC, then type :wq to write and quit.
In this case, V100s GPUs are allocated if available. The Ginsburg official guide describes how to manage resource requests.
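If you want to request a different GPU type, the available partitions, GPU resources, and node features can be listed with a standard SLURM query (the exact partition and constraint names depend on your cluster):
sinfo -o "%P %G %f"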
To enter the queue, run:
sbatch batch_tflw_v100s.sh
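To check whether the job is still queued or already running, you can use the standard SLURM command:
squeue -u $USER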
Once the job is running, the jupyter.log file shows which node you are using and on which port the notebook is listening. To see it, run:
cat jupyter.log
A line like [I 15:14:51.868 NotebookApp] http://g051:8888/ should appear. In this case, g051 is the node name and 8888 is the port.
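If the log is verbose, you can filter for the lines containing the notebook address:
grep http jupyter.log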
In your local computer's terminal, forward the port by running the following (replace USER@domain.edu with your account and login address, and g051:8888 with the node and port from your log):
ssh -N -L localhost:8080:g051:8888 USER@domain.edu
This forwards port 8888 on the HPC system to port 8080 on your local machine.
Then, in a web browser, you should be able to access the Jupyter Notebook by writing http://localhost:8080 in the address bar (if a token is requested, copy it from jupyter.log). Once connected, verify that:
- The requested GPU is available to your TensorFlow notebook (see the check below).
- The Python kernel is Python 3 (ipykernel), which points to the Python environment inside the container.
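As a quick GPU check, you can open a terminal from the Jupyter interface (or prefix the commands with ! in a notebook cell) and run the following; a minimal sketch, assuming the ml-notebook image ships TensorFlow 2.x:
nvidia-smi
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"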
If you are using VSCode, it is possible to connect to the allocated node directly without forwarding the port.
Make sure the Remote-SSH, Python, and Jupyter extensions are installed in VSCode.
- Connect to your HPC system with Remote-SSH in VSCode (a sample SSH configuration entry is sketched after this list).
- Open a Jupyter Notebook.
- Select the kernel on the top right under the gear wheel and then Connect to a Jupyter Server.
- Introduce the URL (http://g051:8888/) and select the Python 3 (ipykernel) kernel.
- Check for GPUs, for example with the commands shown above.
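For the Remote-SSH connection, a minimal entry can be appended to your local ~/.ssh/config as sketched below, using the same placeholder login address as above (the alias hpc-login is arbitrary; adjust HostName and User to your cluster):
cat >> ~/.ssh/config <<'EOF'
Host hpc-login
    HostName domain.edu
    User USER
EOF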