Openvino model executor #1

ilya-lavrenov · 2024-03-12T13:20:59Z

Branch with the PagedAttention operation: openvinotoolkit/openvino_contrib#867
Branch with the PyTorch FE ModuleExtension: https://github.com/slyalin/openvino/tree/pytorch_module_extension

Install:

# Note: install openvino with pytorch_module_extension first!
cd vllm
export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu"
VLLM_OPENVINO=1 pip install -e .

Run sample via vLLM modeling:

python3 examples/offline_inference.py

Run via optimum modeling:

VLLM_OPENVINO_OPTIMUM=1 python3 examples/offline_inference.py
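
The two run commands above differ only in the `VLLM_OPENVINO_OPTIMUM` environment variable. A minimal sketch of switching between the two modeling paths from Python might look like the following. The helper names `openvino_env` and `run_sample` are hypothetical and not part of this PR, and the sketch assumes that leaving the variable unset selects the vLLM modeling path, as the commands above suggest.

```python
import os
import subprocess
import sys


def openvino_env(use_optimum=False, base=None):
    """Return a copy of the environment for one of the two modeling paths.

    Hypothetical helper: only VLLM_OPENVINO_OPTIMUM comes from the PR text.
    """
    env = dict(os.environ if base is None else base)
    if use_optimum:
        # Optimum modeling path, as in the command shown above.
        env["VLLM_OPENVINO_OPTIMUM"] = "1"
    else:
        # Assumption: an unset variable means the vLLM modeling path.
        env.pop("VLLM_OPENVINO_OPTIMUM", None)
    return env


def run_sample(use_optimum=False, script="examples/offline_inference.py"):
    """Run the sample script in a child process with the chosen path."""
    return subprocess.run(
        [sys.executable, script],
        env=openvino_env(use_optimum),
        check=True,
    )
```

Calling `run_sample()` and then `run_sample(use_optimum=True)` from the `vllm` checkout would run the same sample through both paths, which is convenient when comparing their outputs.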

zhuohan123 and others added 28 commits March 5, 2024 06:27

Add distributed model executor abstraction

d8c0998

fix

16de289

fix

15a1fe7

Merge branch 'main' into add-executor-abstraction

ac2e888

format

675190d

health check

2592130

pull out common functionalities and fix tests

e381ca3

Merge branch 'main' into add-executor-abstraction

3bdda0b

fix lora test

c348371

Fix style

002c67f

fix review comments

198e794

rename

f82841b

Merge branch 'main' into add-executor-abstraction

390dbaf

Add base class

ebcd813

refactor async executors

fe2ef93

fix async style

89e0cac

Merge branch 'add-executor-abstraction' of https://github.com/vllm-project/vllm into add-executor-abstraction

22ee8ca

lazy import

1c77da8

Merge branch 'main' into add-executor-abstraction

4b4206d

Generic changes ported from 1028

f4f0162

Added OpenVINO backend

1ee42ca

Added OpenVINO worker

261f23d

Reverted some changes

9d0a984

Added OpenVINO cache engine

42833a9

Merge remote-tracking branch 'vllm/main' into openvino-model-executor

30345f0

Restore cuda memory profiling

d946943

Compare with HuggingFace

894bae9

Moved patching to get_model, removed custom InputMetadata

228e3c0

ilya-lavrenov force-pushed the openvino-model-executor branch from 735fcf4 to 228cbf7 Compare March 12, 2024 20:49

Added GPU profiling

797f1b1

ilya-lavrenov and others added 28 commits April 4, 2024 13:16

Merge pull request #23 from slyalin/paged_attention_in_openvino

dbed638

Use PagedAttentionExtension from OV without contrib dependency

Disable weight compression on optimum-intel path if model is being converted on-the-fly (not from IR).

0acb46c

Disable NPU merged to OV master recently

90b8aca

Merge pull request #26 from ilya-lavrenov/disable-npu

818e384

Disable NPU merged to OV master recently

Merge branch 'openvino-model-executor' into produce_artifacts

22d0f2b

revert contrib building

26f5b28

add bitsandbytes to requirements, use fixed vllm version in the client

601115c

relax openvino requirement to support latest master

2547b12

Merge pull request #27 from mzegla/missing_req

948137a

Add bitsandbytes to requirements and use fixed vllm version in the client

Enable int8 weight compression via env var

0bb4a52

Merge remote-tracking branch 'lavrenov/openvino-model-executor' into disable_int8

786c6e5

Update optimum-intel

ea934ee

Merge pull request #29 from ilya-lavrenov/update-optimum

f73cfd2

Update optimum-intel

Describe weight compression option in the documentation

02a108a

Merge pull request #25 from slyalin/disable_int8

f35263f

Disable weight compression on optimum-intel conversion path

Merge pull request #20 from mzegla/produce_artifacts

3570043

Produce artifacts for bare metal installation in Dockerfile.openvino

int8 kvcache support

f2f839a

use 'VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8' instead of 'kv_cache_dtype=u8' to enable u8 kvcache

378de66

add description for 'VLLM_OPENVINO_CPU_KV_CACHE_PRECISION'

14ea134

Merge pull request #28 from luo-cheng2021/luocheng/pa-kv-u8

307a6d1

[CPU] PagedAttention support u8 kvcache

Merge branch 'openvino-model-executor' into transformers_4_39

fc6302a

Revert "Produce artifacts for bare metal installation in Dockerfile.openvino"

30ea687

Merge pull request #30 from ilya-lavrenov/revert-20-produce_artifacts

388450f

Revert "Produce artifacts for bare metal installation in Dockerfile.openvino"

Merge pull request #12 from slyalin/transformers_4_39

d848897

Transformers 4.39

Enabled int8 weights by default for performance benchmarking purposes

4931727

Merge pull request #31 from slyalin/int8_enabled_by_default

469a4d0

Enabled int8 weights by default

comment for u8 kvcache layout

560c2ce

Merge pull request #32 from luo-cheng2021/luocheng/pa-kv-u8-desc

2e5648a

[CPU] Add comment for u8 kvcache layout
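
Several commits above enable a u8 KV cache through `VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8`. The PR text does not describe the actual cache layout, so the following is only a generic illustration of what storing floating-point values as unsigned 8-bit integers involves: asymmetric quantization with a scale and an offset. The function names are hypothetical, and this is not the scheme implemented by the CPU PagedAttention kernel.

```python
def quantize_u8(values):
    """Map floats to integers in [0, 255] with a shared scale and offset.

    Generic asymmetric quantization, shown for illustration only.
    """
    lo, hi = min(values), max(values)
    # Fall back to a scale of 1.0 when all values are equal.
    scale = (hi - lo) / 255.0 or 1.0
    quantized = [min(255, max(0, round((v - lo) / scale))) for v in values]
    return quantized, scale, lo


def dequantize_u8(quantized, scale, offset):
    """Recover approximate floats from the u8 codes."""
    return [q * scale + offset for q in quantized]
```

Each restored value differs from the original by at most half of `scale`, which is the precision traded for a cache that needs one byte per element plus the stored scale and offset.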

ilya-lavrenov closed this May 21, 2024

ilya-lavrenov deleted the openvino-model-executor branch October 9, 2024 06:13
