Requirements:
conda install cython pandas numba && pip install fire pyinstrument && pip install -e torch-tracer/recorder
Test:
CUDA_LAUNCH_BLOCKING=1 /usr/local/cuda-9.1/bin/nvprof --profile-from-start on -f -o cuda.prof -- python torch-tracer/torchtracer.py torch-tracer/test.py
This will create cpu.db
(hardcoded) and cuda.prof
.
To see the results:
python torch-tracer/merge.py --cpu-file cpu.db --cuda-file cuda.prof --output out.json
├─ 4.703 backward torch/tensor.py:74
│ [10 frames hidden] torch
│ 4.702 backward torch/autograd/__init__.py:38
│ ├─ 2.604 AddmmBackward (addmm:1) ../<cuda>:0
│ │ └─ 2.499 mm:0 ../<cuda>:0
│ ├─ 1.019 AddmmBackward (addmm:0) ../<cuda>:0
│ │ └─ 0.978 mm:1 ../<cuda>:0
│ ├─ 0.532 add:0 ../<cuda>:0
│ ├─ 0.295 sum:1 ../<cuda>:0
├─ 2.639 __call__ torch/nn/modules/module.py:483
│ [4 frames hidden] torch
│ 2.639 forward test.py:18
│ ├─ 1.891 second test.py:14
│ │ └─ 1.862 __call__ torch/nn/modules/module.py:483
│ │ [14 frames hidden] torch
│ │ 1.797 linear torch/nn/functional.py:1336
│ │ └─ 1.728 addmm:1 ../<cuda>:0
│ └─ 0.747 first test.py:10
│ └─ 0.735 __call__ torch/nn/modules/module.py:483
│ [14 frames hidden] torch
│ 0.707 linear torch/nn/functional.py:1336
│ └─ 0.678 addmm:0 ../<cuda>:0
The aggregating part of merger.py
can take a lot of time to finish, usually much longer than the original script that was profiled. You can use a c++ implementation of aggregation to get the same results. F.e., on ubuntu 18.04:
sudo apt-get install libsqlite3-dev
mkdir torch-tracer/bin
g++ torch-tracer/aggregate.cpp -o torch-tracer/bin/aggregate --std=c++17 -l sqlite3 -O2
and then:
torch-tracer/bin/aggregate cpu.db cuda.prof out.json
python torch-tracer/merge.py --json-file out.json
CUDA operations in forward and backward passes can be matched by the sequence number. F.e., addmm: 0
in backward pass is a result of linear
called in first()
.
Profiler overhead:
- cuda launch blocking
- record buffer resizing
- recording