[Bug]: vllm.engine.async_llm_engine.AsyncEngineDeadError #8194
Comments
Does this also happen without multi-step scheduling?
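A quick way to check that (a sketch; the model name and port are placeholders, not taken from the report): relaunch the server without the multi-step flag, since `--num-scheduler-steps` defaults to 1.

```bash
# Hypothetical commands; model name and port are placeholders.
# Reported failing setup: FlashInfer backend + multi-step scheduling.
VLLM_ATTENTION_BACKEND=FLASHINFER python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct --port 8000 --num-scheduler-steps 8

# Same launch without multi-step scheduling (scheduler steps default to 1).
VLLM_ATTENTION_BACKEND=FLASHINFER python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct --port 8000
```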
Try removing it; flash attn is not supported on Volta and Turing GPUs, therefore this assertion will fail.
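For context: flash-attention 2 requires a GPU with compute capability 8.0 or higher (Ampere or newer), while Volta is 7.0 and Turing is 7.5. A quick sketch for checking the local GPU:

```bash
# Prints e.g. (8, 9) for an L40S (Ada Lovelace); flash-attn needs >= (8, 0).
python -c "import torch; print(torch.cuda.get_device_capability())"
```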
Hi, thank you all for the answers. Here are a few points:
Right. What I really mean is that only Ampere or newer GPUs are supported; for earlier GPUs, vllm will use …
OK, but Ada Lovelace is Ampere's successor generation. It's a “consumer” architecture, even if the L40S cards are server GPUs. If someone assures me that this particular architecture is not supported, fine, but for the moment I have the impression that my hardware fully meets the requirements for flash attention (and indeed it does) and for flashinfer.
My bad, I didn't know Ada Lovelace is newer. Currently, this new feature only supports … If you are using …
I don't know whether this would work around the issue, since my GPU is Volta architecture, but you can give it a try.
Yes, I passed the environment variable VLLM_ATTENTION_BACKEND=FLASHINFER, as you can see in my command, and I can confirm that I see the corresponding log line.
Try removing this environment variable; then vllm will use the default attention backend for your GPU.
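A minimal sketch of that suggestion (the model name is a placeholder): with `VLLM_ATTENTION_BACKEND` unset, vLLM picks an attention backend automatically for the detected GPU.

```bash
# Hypothetical; the model name is a placeholder.
unset VLLM_ATTENTION_BACKEND
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B-Instruct
```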
I need the flashinfer backend together with multi-step scheduling.
Any news about this error?
Maybe you should ask the contributor of this feature.
FlashInfer + multi-step scheduling will be supported by PR #7928.
The PR is merged now.
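A minimal follow-up sketch (the model name is a placeholder; which released version first includes the PR is not stated in this thread): verify the installed build is new enough, then re-enable both options together.

```bash
# Check the installed vLLM version; it must include the merged PR #7928.
python -c "import vllm; print(vllm.__version__)"

# Then combine the FlashInfer backend with multi-step scheduling again.
VLLM_ATTENTION_BACKEND=FLASHINFER python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct --num-scheduler-steps 8
```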
Your current environment
The output of `python collect_env.py`
🐛 Describe the bug
Hi,
I am using vLLM v0.6.0 from commit 8685ba1, and I built the Docker image using this command:
DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm/vllm-openai:v0.6.0-flashinfer --build-arg max_jobs=32 --build-arg nvcc_threads=8 --build-arg torch_cuda_arch_list=""
I built the image myself because of this, but that is beside the point.
Here is the command I try to use:
Here is the bug I get when I send a request:
I am opening this issue because of