Your current environment

PyTorch version: 2.5.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35
Python version: 3.10.12 (main, Nov 6 2024, 20:22:13) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-130-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 11.5.119
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A40
GPU 1: NVIDIA A40
Nvidia driver version: 565.57.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
How would you like to use vllm
I'm toying around with setting up a small self-hosted LLM for a small pool of users. I have an Ubuntu VM with 2x NVIDIA A40 GPUs.

I'm using open-webui in Docker as the front end and the vLLM Docker container to serve the LLM. It all works fine, but no matter what I try and whichever model I use, whenever the context window fills up the LLM grinds to a halt and then crashes with an error saying the context window is full (for that chat window). Am I doing something stupid, or have I missed something? I would have thought there would be some kind of sliding context window or some other way of managing it without having to start a new chat every time it fills up. I just want each user to have a context window of around 4000 tokens at any one time.
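For illustration, here is a minimal client-side sketch of the kind of sliding window being asked about. vLLM itself rejects any request whose prompt exceeds the server's maximum model length (e.g. when started with --max-model-len 4096), so trimming old turns is normally the job of the frontend (open-webui or a custom client), not the server. The base URL, model name, and the rough 4-characters-per-token estimate below are placeholders, not the actual setup:

```python
# Sketch: keep each conversation under a token budget by dropping the oldest
# turns before sending the request to a vLLM OpenAI-compatible server.
# Assumes the server was started with roughly --max-model-len 4096.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

MAX_PROMPT_TOKENS = 3000  # leave headroom below the 4096-token context limit


def rough_token_count(messages):
    # Crude estimate: ~4 characters per token. A real client would use the
    # model's tokenizer for an exact count.
    return sum(len(m["content"]) for m in messages) // 4


def trim_history(messages):
    # Keep the system prompt (first message) and drop the oldest
    # user/assistant turns until the estimated prompt fits the budget.
    system, rest = messages[:1], messages[1:]
    while rest and rough_token_count(system + rest) > MAX_PROMPT_TOKENS:
        rest = rest[1:]
    return system + rest


history = [{"role": "system", "content": "You are a helpful assistant."}]


def chat(user_text):
    history.append({"role": "user", "content": user_text})
    trimmed = trim_history(history)
    reply = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
        messages=trimmed,
        max_tokens=512,
    )
    answer = reply.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer
```

With trimming like this in place, the prompt never exceeds the server's limit, so each chat can continue indefinitely instead of erroring out once the context fills up.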
Before submitting a new issue...
Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.