GitHub - n-waves/torch-tracer

Requirements:

conda install cython pandas numba && pip install fire pyinstrument && pip install -e torch-tracer/recorder

Test:

CUDA_LAUNCH_BLOCKING=1 /usr/local/cuda-9.1/bin/nvprof --profile-from-start on -f -o cuda.prof -- python torch-tracer/torchtracer.py torch-tracer/test.py

This will create cpu.db (hardcoded) and cuda.prof.
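
Before merging, it can help to check that both files were written and contain data. The sketch below assumes that cpu.db and cuda.prof are SQLite databases. The README does not say so outright; the assumption rests on the libsqlite3 dependency of aggregate.cpp and on nvprof's export format. The helper name list_tables is hypothetical and not part of torch-tracer.

```python
import sqlite3
from pathlib import Path


def list_tables(path):
    """Return the sorted table names of the SQLite file at `path`.

    The file is opened read-only, so a mistyped path raises
    sqlite3.OperationalError instead of silently creating an empty database.
    """
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    con = sqlite3.connect(uri, uri=True)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    return [name for (name,) in rows]
```

Calling list_tables("cpu.db") and list_tables("cuda.prof") after the profiling run should each return a non-empty list if the assumption holds; the table names themselves depend on the recorder and on the nvprof version.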

To see the results:

python torch-tracer/merge.py --cpu-file cpu.db --cuda-file cuda.prof --output out.json

   ├─ 4.703 backward  torch/tensor.py:74
   │     [10 frames hidden]  torch
   │        4.702 backward  torch/autograd/__init__.py:38
   │        ├─ 2.604 AddmmBackward (addmm:1)  ../<cuda>:0
   │        │  └─ 2.499 mm:0  ../<cuda>:0
   │        ├─ 1.019 AddmmBackward (addmm:0)  ../<cuda>:0
   │        │  └─ 0.978 mm:1  ../<cuda>:0
   │        ├─ 0.532 add:0  ../<cuda>:0
   │        ├─ 0.295 sum:1  ../<cuda>:0
   ├─ 2.639 __call__  torch/nn/modules/module.py:483
   │     [4 frames hidden]  torch
   │        2.639 forward  test.py:18
   │        ├─ 1.891 second  test.py:14
   │        │  └─ 1.862 __call__  torch/nn/modules/module.py:483
   │        │        [14 frames hidden]  torch
   │        │           1.797 linear  torch/nn/functional.py:1336
   │        │           └─ 1.728 addmm:1  ../<cuda>:0
   │        └─ 0.747 first  test.py:10
   │           └─ 0.735 __call__  torch/nn/modules/module.py:483
   │                 [14 frames hidden]  torch
   │                    0.707 linear  torch/nn/functional.py:1336
   │                    └─ 0.678 addmm:0  ../<cuda>:0

The aggregation step of merge.py can take a long time to finish, usually much longer than the profiled script itself. A C++ implementation of the aggregation produces the same results. For example, on Ubuntu 18.04:

sudo apt-get install libsqlite3-dev
mkdir torch-tracer/bin
g++ torch-tracer/aggregate.cpp -o torch-tracer/bin/aggregate --std=c++17 -l sqlite3 -O2

and then:

torch-tracer/bin/aggregate cpu.db cuda.prof out.json
python torch-tracer/merge.py --json-file out.json

CUDA operations in the forward and backward passes can be matched by their sequence numbers. For example, addmm:0 in the backward pass corresponds to the linear call made in first().
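
The matching can be sketched in a few lines of Python. This is only an illustration built on the labels in the sample output above: it assumes forward nodes are labelled "op:seq" and the corresponding backward nodes "SomethingBackward (op:seq)". The function match_passes and its input dictionaries are hypothetical and not part of torch-tracer; the times are copied from the sample tree.

```python
import re

BACKWARD = re.compile(r"^(\w+)\s+\((\w+):(\d+)\)$")  # e.g. "AddmmBackward (addmm:0)"
FORWARD = re.compile(r"^(\w+):(\d+)$")               # e.g. "addmm:0"


def match_passes(forward, backward):
    """Pair forward ops with backward nodes that share the same (op, seq) key.

    Both arguments map a node label to its reported time. The result maps
    (op, seq) to (forward_time, backward_function, backward_time).
    """
    fwd = {}
    for label, time in forward.items():
        m = FORWARD.match(label.strip())
        if m:
            fwd[(m.group(1), int(m.group(2)))] = time

    pairs = {}
    for label, time in backward.items():
        m = BACKWARD.match(label.strip())
        if m:
            key = (m.group(2), int(m.group(3)))
            if key in fwd:
                pairs[key] = (fwd[key], m.group(1), time)
    return pairs


# Labels and times taken from the sample tree above.
forward = {"addmm:1": 1.728, "addmm:0": 0.678}
backward = {
    "AddmmBackward (addmm:1)": 2.604,
    "AddmmBackward (addmm:0)": 1.019,
    "mm:0": 2.499,  # plain op inside the backward pass, not a reference to a forward op
}
pairs = match_passes(forward, backward)
for (op, seq), (f_time, fn, b_time) in sorted(pairs.items()):
    print(f"{op}:{seq}  forward {f_time:.3f}  {fn} {b_time:.3f}")
```

Only labels of the form "XBackward (op:seq)" are treated as references to a forward op; entries such as mm:0 under the backward pass carry their own sequence numbers and are left unmatched.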

Profiler overhead:

CUDA launch blocking (CUDA_LAUNCH_BLOCKING=1)
record buffer resizing
recording
