Can't profile benchmark with ncu, nsys #183

WyldeCat · 2023-10-29T07:03:21Z

I tried to profile the llama_7b benchmark but failed to obtain any reports.

root@nf5688m7-release:/code/tensorrt_llm/benchmarks/python# ncu --target-processes all python benchmark.py  -m llama_7b  --mode plugin --batch_size "64" --input_output_len "128,128" --enable_fp8 --fp8_kv_cache
==PROF== Connected to process 36055 (/usr/bin/nvidia-smi)
==PROF== Disconnected from process 36055
==PROF== Target process 36054 terminated before first instrumented API call.
==PROF== Connected to process 35989 (/usr/bin/python3.10)
==ERROR== ERR_NVGPUCTRPERM - The user does not have permission to access NVIDIA GPU Performance Counters on the target device 0. For instructions on enabling permissions and to get more information see https://developer.nvidia.com/ERR_NVGPUCTRPERM
/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py:658: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
  torch.nested.nested_tensor(split_ids_list,
==PROF== Target process 41832 terminated before first instrumented API call.
[BENCHMARK] model_name llama_7b world_size 1 num_heads 32 num_kv_heads 32 num_layers 32 hidden_size 4096 vocab_size 32000 precision float16 batch_size 64 input_length 128 output_length 128 gpu_peak_mem(gb) 21.35 build_time(s) 367.44 tokens_per_sec 260.23 percentile95(ms) 31915.803 percentile99(ms) 31915.803 latency(ms) 31479.389 compute_cap sm90
==PROF== Target process 41835 terminated before first instrumented API call.
==PROF== Disconnected from process 35989
==WARNING== No kernels were profiled.

When using nsys, the following error occurs:

root@nf5688m7-release:/code/tensorrt_llm/benchmarks/python# /opt/nvidia/nsight-systems/2023.3.1/bin/nsys profile python benchmark.py  -m llama_7b  --mode plugin  --batch_size "64" --input_output_len "128,128" --enable_fp8 --fp8_kv_cache
[10/29/2023-07:07:28] [TRT] [E] 1: [defaultAllocator.cpp::deallocate::62] Error Code 1: Cuda Runtime (unspecified launch failure)
[10/29/2023-07:07:28] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::55] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:279] Error 719 destroying stream '0x560e6297eb80'.)
[10/29/2023-07:07:28] [TRT] [E] 9: Skipping tactic0x0000000000000000 due to exception unspecified launch failure
[10/29/2023-07:07:31] [TRT] [E] 1: [defaultAllocator.cpp::allocate::20] Error Code 1: Cuda Runtime (unspecified launch failure)
[10/29/2023-07:07:31] [TRT] [E] 9: Skipping tactic0x0000000000000000 due to exception [tunable_graph.cpp:create:114] autotuning: User allocator error allocating 54002688-byte buffer
[10/29/2023-07:07:31] [TRT] [E] 9: Skipping tactic0x0000000000000000 due to exception Assertion engine failed.
[10/29/2023-07:07:31] [TRT] [E] 10: Could not find any implementation for node {ForeignNode[LLaMAForCausalLM/layers/0/attention/dense/CONSTANT_1...LLaMAForCausalLM/layers/1/attention/qkv/MATRIX_MULTIPLY_0]}.
[10/29/2023-07:07:31] [TRT] [E] 1: [cudaResources.cpp::~ScopedCudaStream::47] Error Code 1: Cuda Runtime (unspecified launch failure)
[10/29/2023-07:07:31] [TRT] [E] 10: [optimizer.cpp::computeCosts::4051] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[LLaMAForCausalLM/layers/0/attention/dense/CONSTANT_1...LLaMAForCausalLM/layers/1/attention/qkv/MATRIX_MULTIPLY_0]}.)
[10/29/2023-07:07:31] [TRT-LLM] [E] Engine building failed, please check the error log.
Traceback (most recent call last):
  File "/code/tensorrt_llm/benchmarks/python/benchmark.py", line 322, in <module>
    main(args)
  File "/code/tensorrt_llm/benchmarks/python/benchmark.py", line 219, in main
    benchmarker = GPTBenchmark(
  File "/code/tensorrt_llm/benchmarks/python/gpt_benchmark.py", line 144, in __init__
    assert engine_buffer is not None
AssertionError
Generating '/tmp/nsys-report-17ae.qdstrm'
[1/1] [========================100%] report2.nsys-rep
Generated:
    /code/tensorrt_llm/benchmarks/python/report2.nsys-rep

What do I need to do to get reports?


juney-nvidia · 2023-10-29T10:55:00Z

@WyldeCat You can follow the guide at the documentation link you posted: pass --cap-add=SYS_ADMIN when you start the Docker container, for example:

docker run --cap-add=SYS_ADMIN ...
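Before rerunning ncu, it can help to confirm that the container really received the capability. The following is a minimal sketch, assuming a Linux container: it reads the effective capability mask from /proc/self/status and tests the CAP_SYS_ADMIN bit (bit 21). The function names are hypothetical helpers, not part of TensorRT-LLM or Nsight, and a missing capability is only one possible cause of ERR_NVGPUCTRPERM.

```python
from pathlib import Path

CAP_SYS_ADMIN = 21  # bit index of CAP_SYS_ADMIN in the Linux capability sets


def has_cap_sys_admin(cap_eff_hex: str) -> bool:
    """Return True if the hex CapEff mask has the CAP_SYS_ADMIN bit set."""
    return bool((int(cap_eff_hex, 16) >> CAP_SYS_ADMIN) & 1)


def read_cap_eff(status_path: str = "/proc/self/status") -> str:
    """Read the effective capability mask of the current process (Linux only)."""
    for line in Path(status_path).read_text().splitlines():
        if line.startswith("CapEff:"):
            return line.split()[1]
    raise RuntimeError("CapEff not found in " + status_path)


if __name__ == "__main__":
    try:
        present = has_cap_sys_admin(read_cap_eff())
    except OSError:
        print("Could not read /proc/self/status (not a Linux system?)")
    else:
        if present:
            print("CAP_SYS_ADMIN is present")
        else:
            print("CAP_SYS_ADMIN is missing; restart with --cap-add=SYS_ADMIN")
```

Running the script inside the container prints whether the capability is available to the current process.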

WyldeCat · 2023-10-29T13:02:35Z

@juney-nvidia Thanks, the ncu profiler now works, but nsys still doesn't. Is there any way to use the nsys profiler?

juney-nvidia · 2023-10-29T13:46:41Z

Have you tried running with a smaller batch size and smaller input/output lengths to see whether the issue still exists? And what hardware are you using?

June

WyldeCat · 2023-10-29T13:59:32Z

@juney-nvidia
I tried batch size 1 and the problem still exists. I'm using an H100 80GB.
I found that running the benchmark with an explicit engine directory makes nsys work
(by passing the --engine_dir my_engine_dir argument to the command).
So I think that building the engine on the fly before benchmarking is what causes the nsys error.

Because running benchmarks with an on-the-fly engine build is much more convenient, it would be great if there were a way to use it with nsys.
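The workaround described above can be sketched as a small helper that assembles the nsys command line for a pre-built engine. This is only an illustration under stated assumptions: nsys_command is a hypothetical helper, "my_engine_dir" and "llama_7b_report" are placeholder names, and the benchmark arguments are copied from the commands earlier in this thread. It assumes the engine directory was populated by an earlier, unprofiled run.

```python
import shlex


def nsys_command(engine_dir, report_name, benchmark_args):
    """Build an `nsys profile` command line that benchmarks a pre-built engine."""
    return [
        "nsys", "profile",
        "-o", report_name,            # name of the generated report
        "python", "benchmark.py",
        *benchmark_args,
        "--engine_dir", engine_dir,   # reuse an existing engine (flag from this thread)
    ]


if __name__ == "__main__":
    # Benchmark arguments copied from the commands in this thread.
    args = ["-m", "llama_7b", "--mode", "plugin",
            "--batch_size", "64", "--input_output_len", "128,128"]
    # "my_engine_dir" and "llama_7b_report" are placeholder names.
    print(shlex.join(nsys_command("my_engine_dir", "llama_7b_report", args)))
```

The script only prints the command; paste it into the shell, or pass the list to subprocess.run, once the engine directory exists.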

jdemouth-nvidia · 2023-10-30T05:16:21Z

Thanks for the feedback, @WyldeCat. Building the engine during the profiled run would also mean that you "pollute" your nsys trace with a lot of extra kernel launches that are not relevant to your application (all the auto-tuning done by TensorRT), and you would end up with a much bigger nsys output file. I'm pretty sure it would make the analysis of the nsys report a lot harder. For now, I'm going to close the issue, but feel free to open a feature request if you think it's really a needed feature.

WyldeCat changed the title ~~No kernels profiled with ncu~~ Can't profile benchmark with ncu, nsys Oct 29, 2023

juney-nvidia self-assigned this Oct 29, 2023

juney-nvidia added the triaged Issue has been triaged by maintainers label Oct 29, 2023

jdemouth-nvidia closed this as completed Oct 30, 2023
