Disable weight compression on optimum-intel conversion path #25

Merged
`use_with_openvino.md` — 10 changes: 9 additions & 1 deletion
@@ -52,7 +52,7 @@ python3 benchmark_serving.py --backend openai --endpoint /v1/completions --port
```


## Use vLLM offline

_All steps below assume you are in the `vllm` root directory._

@@ -82,3 +82,11 @@ docker run --rm -it --entrypoint python3 -v $HOME/.cache/huggingface:/root/.cach
# --num-prompts <number of requests to send> (default: 1000)
# --swap-space <GiB for KV cache> (default: 50)
```

## Use Int-8 Weights Compression

Int-8 weight compression is disabled by default. For better performance and lower memory consumption, weight compression can be enabled by setting the environment variable `VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=1`.
To pass the variable in Docker, add `-e VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=1` as an additional argument to the `docker run` commands in the examples above.

The variable enables the weight-compression logic described in [optimum-intel 8-bit weight quantization](https://huggingface.co/docs/optimum/intel/optimization_ov#8-bit).
Hence, even when the variable is set, compression is applied only to models above a certain size; smaller models are left uncompressed, since compressing them would cause a significant accuracy drop.
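For offline runs, the same switch can also be set from Python before the model is loaded. Below is a minimal sketch; it assumes the OpenVINO-enabled vLLM build from this repository, and the model id is purely illustrative:

```python
import os

# Must be set before the model is loaded: '1' lets optimum-intel decide
# whether to compress weights to int-8 based on the model's size.
os.environ["VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS"] = "1"

from vllm import LLM

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # illustrative model id
outputs = llm.generate("What is OpenVINO?")
print(outputs[0].outputs[0].text)
```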
`vllm/model_executor/openvino_model_loader.py` — 2 changes: 2 additions & 0 deletions
@@ -599,10 +599,12 @@ def get_model(model_config: ModelConfig,
     else:
         print(f'[ INFO ] OpenVINO IR is available for provided model id {model_config.model}. '
               'This IR will be used for inference as-is, all possible options that may affect model conversion are ignored.')
+    load_in_8bit = None if os.environ.get('VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS', '0') == '1' else False
     pt_model = OVModelForCausalLM.from_pretrained(
         model_config.model,
         export=export,
         compile=False,
+        load_in_8bit=load_in_8bit,
         trust_remote_code=model_config.trust_remote_code
     )
     patch_stateful_model(pt_model.model, kv_cache_dtype, device_config.device.type == "cpu")
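The new `load_in_8bit` value is tri-state: `None` defers the decision to optimum-intel, which compresses only models above its built-in size threshold, while `False` forces compression off. A standalone sketch of the same call, assuming optimum-intel is installed and using an illustrative model id:

```python
import os

from optimum.intel import OVModelForCausalLM

# None  -> optimum-intel decides; only sufficiently large models get int-8 weights
# False -> compression is forced off (the new default on this conversion path)
load_in_8bit = None if os.environ.get("VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS", "0") == "1" else False

model = OVModelForCausalLM.from_pretrained(
    "facebook/opt-125m",   # illustrative model id
    export=True,           # convert the PyTorch checkpoint to OpenVINO IR
    compile=False,
    load_in_8bit=load_in_8bit,
)
```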