Got this error when running llama_inference.py:
```
$ CUDA_VISIBLE_DEVICES=0 python llama_inference.py ${MODEL_DIR} --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --text "this is llama"
Loading model ...
Found 3 unique KN Linear values.
Warming up autotune cache ...
  0%|          | 0/12 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/prasanna/src/third-party/GPTQ-for-LLaMa/quant/custom_autotune.py", line 72, in _bench
    return triton.testing.do_bench(kernel_call, percentiles=(0.5, 0.2, 0.8), rep=40)
TypeError: do_bench() got an unexpected keyword argument 'percentiles'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/prasanna/src/third-party/GPTQ-for-LLaMa/llama_inference.py", line 110, in <module>
    model = load_quant(args.model, args.load, args.wbits, args.groupsize, fused_mlp=args.fused_mlp)
  File "/home/prasanna/src/third-party/GPTQ-for-LLaMa/llama_inference.py", line 66, in load_quant
    quant.autotune_warmup_linear(model, transpose=not (eval))
  File "/home/prasanna/src/third-party/GPTQ-for-LLaMa/quant/quant_linear.py", line 419, in autotune_warmup_linear
    matmul248(a, qweight, scales, qzeros, g_idx, bits, maxq)
  File "/home/prasanna/src/third-party/GPTQ-for-LLaMa/quant/quant_linear.py", line 267, in matmul248
    matmul_248_kernel[grid](input, qweight, output, scales, qzeros, g_idx, input.shape[0], qweight.shape[1], input.shape[1], bits, maxq, input.stride(0), input.stride(1), qweight.stride(0),
  File "/home/prasanna/src/third-party/GPTQ-for-LLaMa/quant/custom_autotune.py", line 90, in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
  File "/home/prasanna/src/third-party/GPTQ-for-LLaMa/quant/custom_autotune.py", line 90, in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
  File "/home/prasanna/src/third-party/GPTQ-for-LLaMa/quant/custom_autotune.py", line 73, in _bench
    except triton.compiler.OutOfResources:
AttributeError: module 'triton.compiler' has no attribute 'OutOfResources'
```
The issue is in quant/custom_autotune.py, line 72: the `percentiles` keyword argument of `triton.testing.do_bench()` was renamed to `quantiles` in newer Triton releases, so the old keyword raises a `TypeError`.
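One way to patch `_bench` so it works on both old and new Triton versions is to probe `do_bench`'s signature and pass whichever keyword it accepts. This is a minimal sketch, not the repo's code: `bench_compat` is a hypothetical helper name, and it assumes only the rename described above.

```python
import inspect

def bench_compat(do_bench, kernel_call, qs=(0.5, 0.2, 0.8), rep=40):
    """Call do_bench with whichever keyword the installed Triton expects.

    Newer Triton renamed the `percentiles` argument of
    triton.testing.do_bench to `quantiles`; inspect the signature
    and use the matching name instead of hard-coding either one.
    """
    params = inspect.signature(do_bench).parameters
    if "quantiles" in params:
        return do_bench(kernel_call, quantiles=qs, rep=rep)
    return do_bench(kernel_call, percentiles=qs, rep=rep)
```

In `custom_autotune.py` this would replace the hard-coded `percentiles=(0.5, 0.2, 0.8)` call; alternatively, simply renaming the keyword to `quantiles` fixes it if you only target a recent Triton.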
@prasanna Maybe you can try this: huggingface/text-generation-inference@773aabd