This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Debugging on AzureML

Anton Schwaighofer edited this page Sep 29, 2020 · 4 revisions

Necessary setup

Create the AzureML cluster with SSH enabled

When creating the AzureML cluster, tick the "Enable ssh" option and pick your authentication method.

Instrumenting your Python code

import logging

import rpdb

# Port on which the remote debugger listens once the process receives SIGTRAP.
rpdb_port = 4444
rpdb.handle_trap(port=rpdb_port)
logging.info(f"rpdb is handling traps. To debug: identify the main runner.py process, then as root: "
             f"kill -TRAP <process_id>; nc 127.0.0.1 {rpdb_port}")

The InnerEye toolbox already does this; the snippet is included here for completeness.
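To illustrate the general mechanism behind a trap handler, here is a minimal sketch using only the standard library. It is not rpdb's actual implementation: the names `captured_stacks`, `on_trap`, and `send_trap_to_self` are made up for this example, and where a remote debugger would start listening on a port, this handler only records the current stack. It assumes a Unix-like system and must be run from the main thread.

```python
import os
import signal
import traceback

# Hypothetical container for this example: one stack dump per received trap.
captured_stacks = []


def on_trap(signum, frame):
    """Record where the process was when SIGTRAP arrived.

    A remote debugger would open its listening socket here instead.
    """
    captured_stacks.append("".join(traceback.format_stack(frame)))


# Register the handler. Python only allows this from the main thread.
signal.signal(signal.SIGTRAP, on_trap)


def send_trap_to_self():
    """Do what `kill -TRAP <pid>` does, but targeting this very process."""
    os.kill(os.getpid(), signal.SIGTRAP)
```

Calling `send_trap_to_self()` appends one entry to `captured_stacks`, showing that the handler ran while the main code was in progress.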

Identifying the AzureML node that runs your job  

  • From the "Details" tab in the run's page, note the Run ID, then click on the target name under "Compute target".
  • Click on the "Nodes" tab, and identify the node whose "Current run ID" is that of your run.
  • Copy the contents of the "Connection string" column for that node to the clipboard (ssh user@...) and execute it in a shell. This requires an ssh client on your machine.
  • Type "bash" for a nicer command shell (optional).

Debugging with pdb

  • Run sudo docker ps to check that Docker is running. The output should list exactly one Docker container ID.
  • Identify the main Python process with a command such as:
ps aux | grep 'python.*runner.py' | grep -Ewv 'bash|grep'

You may need to vary this if it does not yield exactly one line of output.

  • Note the process identifier (the value in the PID column, generally the second column).
  • Issue the commands
kill -TRAP nnnn
nc 127.0.0.1 4444

where nnnn is the process identifier. If the python process is in a state where it can accept the connection, the "nc" command will print a prompt from which you can issue pdb commands.
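The process lookup above can also be sketched in Python, which may be handy if you want to script it. The function below approximately mirrors the shell pipeline (keep lines that look like the runner, drop lines mentioning bash or grep) and reads the PID from the second column. The function name and the sample `ps aux` output are made up for illustration; real output will differ.

```python
import re

# Made-up `ps aux` output, for illustration only.
SAMPLE_PS_OUTPUT = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      4300  0.0  0.0   2345   678 ?        Ss   10:00   0:00 /bin/bash -c python runner.py
root      4321 98.0  5.1 123456 65432 ?        Rl   10:00   1:23 python runner.py
root      4400  0.0  0.0   1234   567 pts/0    S+   10:05   0:00 grep python.*runner.py
"""


def find_runner_pids(ps_output):
    """Return the PIDs of `ps aux` lines that look like the main runner.

    Keeps lines matching 'python.*runner.py' and drops lines containing
    the whole words 'bash' or 'grep', similar to the shell pipeline.
    """
    wanted = re.compile(r"python.*runner\.py")
    unwanted = re.compile(r"\b(bash|grep)\b")
    pids = []
    for line in ps_output.splitlines():
        if wanted.search(line) and not unwanted.search(line):
            # In `ps aux` output the PID is the second column.
            pids.append(int(line.split()[1]))
    return pids


print(find_runner_pids(SAMPLE_PS_OUTPUT))
```

If the returned list does not contain exactly one PID, adjust the patterns before sending the trap signal.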

Notes:

  • The last step (kill and nc) can be issued successfully at most once for a given process. If a colleague may need to carry out the debugging instead, think carefully before issuing these commands yourself.

Inside pdb

See the pdb documentation for the full list of commands.

Quick summary:

  • w for where, full stack trace
  • u and d for up and down, go one frame up/down
  • s for step, execute one step
  • help for the list of available commands

If you exit the session with Ctrl-C, the process remains stuck at the pdb prompt and you cannot reconnect, so the job has to be killed.
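Because the remote session can be attached only once, it may help to rehearse the commands locally first. The sketch below drives the standard library `pdb` with a scripted list of commands and captures the transcript. The helper name `run_scripted_pdb` and the toy functions `inner` and `outer` are made up for this example; it uses plain `pdb`, not rpdb.

```python
import io
import pdb


def run_scripted_pdb(commands):
    """Run a tiny call chain under pdb, feeding it the given commands.

    Returns the value computed by the debugged code and the text that
    pdb printed (prompts, stack traces, and so on).
    """
    transcript = io.StringIO()
    debugger = pdb.Pdb(
        stdin=io.StringIO("\n".join(commands) + "\n"),
        stdout=transcript,
        nosigint=True,   # do not install a SIGINT handler on "continue"
        readrc=False,    # ignore any local .pdbrc file
    )

    def inner():
        debugger.set_trace()  # execution pauses here, as with a breakpoint
        return "done"

    def outer():
        return inner()

    result = outer()
    return result, transcript.getvalue()


# w: print the stack, u: move one frame up, c: continue running.
result, transcript = run_scripted_pdb(["w", "u", "c"])
print(transcript)
```

The printed transcript shows the stack with both `outer` and `inner`, which is the same kind of output you would see after `w` in the remote session.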

Debug in interactive session inside the Docker container

  • Run sudo docker ps to see the container ID.
  • Run sudo docker exec -it <containerID> /bin/bash to start bash inside the container
  • Install additional tools:
apt-get update
apt-get install htop gdb vim netcat
  • Run htop to see a multi-CPU utilization chart and info
  • Run the kill and nc commands as described above

Installing gdb

  • Go inside the Docker container as described above.
  • Install gdb via apt-get install gdb (see the previous section)
  • Install Cython via pip install cython
  • Execute which python. This will print something like /azureml-envs/azureml_1234abc/bin/python
  • Using vim or another editor, edit ~/.gdbinit and add this line (replacing azureml_1234abc with the folder where your Python resides, and python3.7 with your Python version)
source /azureml-envs/azureml_1234abc/lib/python3.7/site-packages/Cython/Debugger/libpython.py
  • Start gdb via gdb python nnn, where nnn is the process ID of the Python job (check via top)
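Rather than typing the path by hand, you can ask the job's Python interpreter where its site-packages folder is. The sketch below builds the `source` line for ~/.gdbinit. The function name `gdbinit_source_line` is made up for this example, and it assumes Cython was installed into the interpreter's platform-specific site-packages folder, which is the usual case for pip.

```python
import sysconfig
from pathlib import Path


def gdbinit_source_line(site_packages=None):
    """Build the `source ...` line to add to ~/.gdbinit.

    By default, uses the site-packages folder of the interpreter that
    runs this function, so run it with the same Python as the job.
    """
    if site_packages is None:
        site_packages = sysconfig.get_paths()["platlib"]
    helper = Path(site_packages) / "Cython" / "Debugger" / "libpython.py"
    return f"source {helper}"


print(gdbinit_source_line())
```

Run this with the interpreter reported by which python, check that the printed file exists, and paste the line into ~/.gdbinit.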

Inside of gdb

  • py-bt to get a trace of where the process presently is
  • info th to see which threads are running
  • thread 2 to switch to thread 2; you can then run py-bt to see where that thread is
  • Traces are printed with the innermost stack frame at the top
  • Watch out for "Waiting for the GIL" at the top of the stack trace: this would indicate thread contention
  • py-up to move up the stack
  • c to continue running, Ctrl-C to interrupt
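If you only need a one-off snapshot of all threads, gdb can also run non-interactively. The sketch below assembles such a command line using gdb's `-p`, `-batch`, and `-ex` options. The helper names are made up for this example, and it assumes ~/.gdbinit has been set up as described in the previous section so that `py-bt` is available.

```python
import shlex


def gdb_batch_argv(pid, gdb_commands=("info threads", "thread apply all py-bt")):
    """Build an argument list that attaches gdb to `pid`, runs the given
    gdb commands in order, and then exits."""
    argv = ["gdb", "-p", str(pid), "-batch"]
    for command in gdb_commands:
        argv += ["-ex", command]
    return argv


def gdb_batch_command_line(pid):
    """Same as gdb_batch_argv, but as a single shell-quoted string."""
    return shlex.join(gdb_batch_argv(pid))


# 1234 is a placeholder process ID.
print(gdb_batch_command_line(1234))
```

The printed command can be pasted into the shell inside the container, or the argument list can be passed to `subprocess.run`.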

General diagnostics

  • Run top to see if there's a Python process running
  • Run nvidia-smi to check GPU status
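For repeated checks, the GPU status can be collected in machine-readable form. The sketch below uses the `--query-gpu` and `--format=csv,noheader,nounits` options of nvidia-smi and parses the result into dictionaries. The function names and the choice of fields are assumptions made for this example, and the sample text in the usage line is invented.

```python
import subprocess

# Fields requested from nvidia-smi; chosen for illustration.
QUERY_FIELDS = ["index", "utilization.gpu", "memory.used", "memory.total"]


def parse_gpu_status(csv_text):
    """Turn nvidia-smi CSV output (no header, no units) into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        values = []
        for value in line.split(","):
            value = value.strip()
            # Keep non-numeric entries such as "[N/A]" as plain strings.
            values.append(int(value) if value.isdigit() else value)
        rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def query_gpu_status():
    """Run nvidia-smi and parse its output. Requires an NVIDIA driver."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    return parse_gpu_status(subprocess.check_output(cmd, text=True))


# Invented sample output for two GPUs.
print(parse_gpu_status("0, 97, 15000, 16160\n1, 0, 3, 16160\n"))
```

A GPU that reports 0 utilization while the job is supposedly training is a hint that the process is stuck elsewhere, for example in data loading.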