
Failed CI on A100 #2064

Closed
xuzhao9 opened this issue Nov 27, 2023 · 1 comment
xuzhao9 commented Nov 27, 2023

The test test_llama_v2_7b_16h_example_cuda started failing between the 20231115 and 20231116 nightly runs.

Failed workflow: https://github.com/pytorch/benchmark/actions/runs/7006721966/job/19059198530

Detailed error and command to reproduce:

$ python run.py llama_v2_7b_16h -d cuda --accuracy
fp64 golden ref were not generated for llama_v2_7b_16h. Setting accuracy check to cosine
CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
Traceback (most recent call last):
  File "/data/users/xzhao9/git/benchmark/torchbenchmark/util/env_check.py", line 510, in check_accuracy
    correct_result = run_n_iterations(
  File "/data/users/xzhao9/git/benchmark/torchbenchmark/util/env_check.py", line 395, in run_n_iterations
    _model_iter_fn(mod, inputs, contexts, optimizer, collect_outputs=False)
  File "/data/users/xzhao9/git/benchmark/torchbenchmark/util/env_check.py", line 393, in _model_iter_fn
    return forward_pass(mod, inputs, contexts, collect_outputs)
  File "/data/users/xzhao9/git/benchmark/torchbenchmark/util/env_check.py", line 370, in forward_pass
    return mod(*inputs)
  File "/home/xzhao9/.conda/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/xzhao9/.conda/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/xzhao9/.conda/envs/py38/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 820, in forward
    outputs = self.model(
  File "/home/xzhao9/.conda/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/xzhao9/.conda/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/xzhao9/.conda/envs/py38/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 708, in forward
    layer_outputs = decoder_layer(
  File "/home/xzhao9/.conda/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/xzhao9/.conda/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/xzhao9/.conda/envs/py38/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 424, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/xzhao9/.conda/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/xzhao9/.conda/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/xzhao9/.conda/envs/py38/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 321, in forward
    query_states = self.q_proj(hidden_states)
  File "/home/xzhao9/.conda/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/xzhao9/.conda/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/xzhao9/.conda/envs/py38/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 116, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
Running eval method from llama_v2_7b_16h on cuda in eager mode with input batch size 1 and precision fp16.
Accuracy:              eager_1st_run_fail

Bisection workflow: https://github.com/pytorch/benchmark/actions/runs/6985353191
Root cause commit: 12b2dd16b050e6495910fc564517fbb51dde1f20 (pytorch/pytorch@12b2dd1)
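For anyone triaging a similar failure, the error surfaces the first time a GPU matmul forces `cublasCreate()`. A minimal smoke test that reproduces that code path, independent of the benchmark harness, might look like the sketch below. This is illustrative only (the helper name and messages are not from this issue), and it assumes PyTorch is installed:

```python
# Hypothetical smoke test: run a tiny fp16 matmul on the GPU to force
# cuBLAS handle creation, and report the outcome instead of crashing.
def cublas_smoke_test():
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if not torch.cuda.is_available():
        return "no CUDA device"
    try:
        a = torch.randn(8, 8, device="cuda", dtype=torch.float16)
        (a @ a).sum().item()  # matmul routes through cuBLAS, triggering cublasCreate()
        return "ok"
    except RuntimeError as e:  # e.g. CUBLAS_STATUS_NOT_INITIALIZED
        return f"cuBLAS error: {e}"

if __name__ == "__main__":
    print(cublas_smoke_test())
```

On a healthy A100 setup this prints `ok`; when the environment hits the bug above it would instead report the `CUBLAS_STATUS_NOT_INITIALIZED` error, which is commonly caused by insufficient free GPU memory or a driver/toolkit mismatch.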

xuzhao9 commented Jan 15, 2024

Fixed by upstream.

@xuzhao9 xuzhao9 closed this as completed Jan 15, 2024
facebook-github-bot pushed a commit that referenced this issue Jan 25, 2024
Summary:
This PR partially reverts #2095, since #2064 seems not to be an issue anymore.

Pull Request resolved: #2124

Reviewed By: suez1224

Differential Revision: D53093766

Pulled By: xuzhao9

fbshipit-source-id: 157a01dec22e48b5ee1cb1260070a6d270aec4f8