This repository has been archived by the owner on Mar 21, 2024. It is now read-only.
Debugging on AzureML
Anton Schwaighofer edited this page Sep 29, 2020 · 4 revisions
When creating the AzureML cluster, you need to tick "Enable ssh" and pick your authentication method.
import logging
import rpdb

# Port on which rpdb will listen once the process receives SIGTRAP
rpdb_port = 4444
rpdb.handle_trap(port=rpdb_port)
logging.info(f"rpdb is handling traps. To debug: identify the main runner.py process, then as root: "
             f"kill -TRAP <process_id>; nc 127.0.0.1 {rpdb_port}")
The InnerEye toolbox already does this; it is shown here only for completeness.
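To illustrate what "handling traps" means, here is a minimal sketch of a SIGTRAP handler built only on the standard library. It is an assumption that `rpdb.handle_trap` works along similar lines; the names `on_trap`, `install_trap_handler` and `trap_events` are hypothetical, and instead of opening a remote pdb session the handler merely records where the process was interrupted. It only runs on Unix-like systems, where `SIGTRAP` exists.

```python
import os
import signal

# Records (signal number, interrupted function name) for each trap received.
trap_events = []


def on_trap(signum, frame):
    # A real handler would start a debugger on `frame`;
    # this sketch only notes where execution was interrupted.
    trap_events.append((signum, frame.f_code.co_name))


def install_trap_handler():
    """Register on_trap for SIGTRAP and return the previously installed handler."""
    return signal.signal(signal.SIGTRAP, on_trap)


if __name__ == "__main__":
    install_trap_handler()
    # Equivalent to running `kill -TRAP <pid>` against this process from a shell.
    os.kill(os.getpid(), signal.SIGTRAP)
    print(trap_events)
```

Because Python only allows signal handlers to be registered from the main thread, a registration like this has to happen early in the main script.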
- From the "Details" tab in the run's page, note the Run ID, then click on the target name under "Compute target".
- Click on the "Nodes" tab, and identify the node whose "Current run ID" is that of your run.
- Copy the contents of the "Connection string" column for that node to the clipboard (`ssh user@...`) and execute it in a shell. You need to have `ssh` installed.
- Type "bash" for a nicer command shell (optional).
- Run `sudo docker ps` to see if Docker is running. The output should list 1 Docker container ID.
- Identify the main python process with a command such as

      ps aux | grep 'python.*runner.py' | egrep -wv 'bash|grep'

  You may need to vary this if it does not yield exactly one line of output.
- Note the process identifier (the value in the PID column, generally the second one).
- Issue the commands

      kill -TRAP nnnn
      nc 127.0.0.1 4444

  where `nnnn` is the process identifier. If the python process is in a state where it can accept the connection, the `nc` command will print a prompt from which you can issue `pdb` commands.
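The process-identification step above can be sketched in Python, which may help when the `grep` pipeline needs adjusting. This is only an illustration: `find_runner_pids` and `debug_commands` are hypothetical helpers, the sample `ps aux` output is made up, and the word-boundary regex only approximates `egrep -w`.

```python
import re
from typing import List


def find_runner_pids(ps_output: str) -> List[int]:
    """Return the PIDs of `ps aux` lines that look like the main runner.py process."""
    pids = []
    for line in ps_output.splitlines():
        # Mirrors: grep 'python.*runner.py'
        if not re.search(r"python.*runner.py", line):
            continue
        # Mirrors: egrep -wv 'bash|grep' (drop wrapper shells and the grep itself)
        if re.search(r"\b(bash|grep)\b", line):
            continue
        # In `ps aux` output the PID is the second column.
        pids.append(int(line.split()[1]))
    return pids


def debug_commands(pid: int, port: int = 4444) -> List[str]:
    """Build the two shell commands from the last step for a given PID."""
    return [f"kill -TRAP {pid}", f"nc 127.0.0.1 {port}"]


if __name__ == "__main__":
    sample = (
        "USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND\n"
        "root 101 0.0 0.1 1000 500 ? Ss 10:00 0:00 /bin/bash -c python runner.py\n"
        "root 123 95.0 4.2 9000 4000 ? Rl 10:00 5:00 python runner.py\n"
        "user 456 0.0 0.0 100 50 pts/0 S+ 10:05 0:00 grep python.*runner.py\n"
    )
    for pid in find_runner_pids(sample):
        print(debug_commands(pid))
```

If the function returns more than one PID, the filter needs tightening before any `kill -TRAP` is sent.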
Notes:
- The last step (`kill` and `nc`) can be successfully issued at most once for a given process. Thus if you might want a colleague to carry out the debugging, think carefully before issuing these commands yourself.
Quick summary:
- `w` for `where`, full stack trace
- `u` and `d` for `up` and `down`, go one frame up/down
- `s` for `step`, execute one step
- `help`
When you exit the `nc` session via Ctrl-C, the process remains stuck at the pdb prompt and we cannot reconnect to it, so the job has to be killed.
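Since a live job allows only one debugging session, it can be worth trying the pdb commands locally first. The sketch below feeds a scripted list of commands to the standard-library `pdb.Pdb` and captures what it prints; `run_pdb_commands` and `sample_job` are hypothetical names, and this runs pdb in-process rather than over the `nc` connection described above.

```python
import io
import pdb


def run_pdb_commands(func, commands):
    """Run func under pdb, feed it the given commands, and return pdb's output."""
    script = io.StringIO("\n".join(commands) + "\n")
    output = io.StringIO()
    # readrc=False keeps a local ~/.pdbrc from interfering with the scripted session.
    debugger = pdb.Pdb(stdin=script, stdout=output, readrc=False)
    debugger.runcall(func)
    return output.getvalue()


def sample_job():
    total = 0
    for value in range(3):
        total += value
    return total


if __name__ == "__main__":
    # "w" prints the stack, "s" executes one step, "c" resumes the function.
    print(run_pdb_commands(sample_job, ["w", "s", "c"]))
```

Ending the script with `c` matters: it lets the function run to completion instead of leaving it paused at the prompt.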
- Run `sudo docker ps` to see the container ID.
- Run `sudo docker exec -it <containerID> /bin/bash` to start bash inside the container.
- Install additional tools:

      apt-get update
      apt-get install htop gdb vim netcat

- Run `htop` to see a multi-CPU utilization chart and info.
- Run the `kill`/`nc` commands as described above.
- Go inside the Docker container as described above.
- Install `gdb`
- Install Cython: `pip install cython`
- Execute `which python`. This will print something like `/azureml-envs/azureml_1234abc/bin/python`
- Using `vim` or a reasonable editor, edit `~/.gdbinit` and add this line (replacing `azureml_1234abc` with the folder where your Python resides)

      source /azureml-envs/azureml_1234abc/lib/python3.7/site-packages/Cython/Debugger/libpython.py

- Start `gdb` via `gdb python nnn`, where `nnn` is the process ID of the Python job (check via `top`)
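The `.gdbinit` line can be derived from the `which python` output rather than typed by hand. The following is a small sketch under stated assumptions: `gdbinit_source_line` is a hypothetical helper, the environment is assumed to follow the usual `bin/` and `lib/pythonX.Y/site-packages/` layout, and the Python version defaults to 3.7 as in the example path.

```python
from pathlib import PurePosixPath


def gdbinit_source_line(python_executable: str, python_version: str = "3.7") -> str:
    """Build the `source ...` line for ~/.gdbinit from the path printed by `which python`."""
    # /azureml-envs/<env>/bin/python -> /azureml-envs/<env>
    env_root = PurePosixPath(python_executable).parent.parent
    libpython = (
        env_root / "lib" / f"python{python_version}" / "site-packages"
        / "Cython" / "Debugger" / "libpython.py"
    )
    return f"source {libpython}"


if __name__ == "__main__":
    print(gdbinit_source_line("/azureml-envs/azureml_1234abc/bin/python"))
```

It is worth checking that the resulting file actually exists inside the container before starting `gdb`, since the layout may differ between environments.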
- `py-bt` to get a trace of where the process presently is
- `info th` to see which threads are running
- `thread 2` to switch to thread 2, then you can run `py-bt` to see where that thread is
- Traces are printed out with the innermost stack frame at the top
- Watch out for "Waiting for the GIL" at the top of the stack trace: this would indicate thread contention
- `py-up` to move up the stack
- `c` to continue running, Ctrl-C to interrupt
- Run `top` to see if there's a Python process running
- Run `nvidia-smi` to check GPU status