[MFM] Merge changes from vllm-project/vllm main (#28)
* [Misc] Use VisionArena Dataset for VLM Benchmarking (vllm-project#12389)

Signed-off-by: Roger Wang <ywang@roblox.com>

* [ci/build] fix wheel size check (vllm-project#12396)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Hardware][Gaudi][Doc] Add missing step in setup instructions (vllm-project#12382)

* [ci/build] sync default value for wheel size (vllm-project#12398)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Misc] Enable proxy support in benchmark script (vllm-project#12356)

Signed-off-by: Junichi Sato <junichi.sato@sbintuitions.co.jp>

* [Bugfix][Kernel] Fix CUDA 11.8 being broken by FA3 build (vllm-project#12375)

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* [Misc] Remove deprecated code (vllm-project#12383)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Bugfix][Kernel] FA3 Fix - RuntimeError: This flash attention build only supports pack_gqa (for build size reasons). (vllm-project#12405)

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* [Bugfix][Kernel] Fix moe align block issue for mixtral (vllm-project#12413)

* [Bugfix] Fix BLIP-2 processing (vllm-project#12412)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [ROCm][MoE] MI300 tuned configs Mixtral-8x(7B,22B) | fp16, fp8 (vllm-project#12408)

Signed-off-by: Divakar Verma <divakar.verma@amd.com>

* [Misc] Add FA2 support to ViT MHA layer (vllm-project#12355)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [TPU][CI] Update torchxla version in requirement-tpu.txt (vllm-project#12422)

Signed-off-by: Siyuan Liu <lsiyuan@google.com>

* [Misc][Bugfix] FA3 support to ViT MHA layer (vllm-project#12435)

Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>

* [V1][Perf] Reduce scheduling overhead in model runner after cuda sync (vllm-project#12094)

Signed-off-by: Keyun Tong <tongkeyun@gmail.com>

* [V1][Bugfix] Fix assertion when mm hashing is turned off (vllm-project#12439)

Signed-off-by: Roger Wang <ywang@roblox.com>

* [Misc] Revert FA on ViT vllm-project#12355 and vllm-project#12435 (vllm-project#12445)

* [Frontend] generation_config.json for maximum tokens (vllm-project#12242)

Signed-off-by: Matthew Hendrey <matthew.hendrey@gmail.com>
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: shangmingc <caishangming@linux.alibaba.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Yuan Tang <terrytangyuan@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>

* [Bugfix] Disable w16a16 2of4 sparse CompressedTensors24 (vllm-project#12417)

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>

* [Bugfix/CI] Fix broken kernels/test_mha.py (vllm-project#12450)

* [Bugfix][Kernel] Fix perf regression caused by PR vllm-project#12405 (vllm-project#12434)

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* [Build/CI] Fix libcuda.so linkage (vllm-project#12424)

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

* [Frontend] Rerank API (Jina- and Cohere-compatible API)  (vllm-project#12376)

Signed-off-by: Kyle Mistele <kyle@mistele.com>

* [DOC] Add link to vLLM blog (vllm-project#12460)

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>

* [V1] Avoid list creation in input preparation (vllm-project#12457)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

* [Frontend] Support scores endpoint in run_batch (vllm-project#12430)

Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>

* [Bugfix] Fix Granite 3.0 MoE model loading (vllm-project#12446)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Bugfix] Fix missing seq_start_loc in xformers prefill metadata (vllm-project#12464)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [V1][Minor] Minor optimizations for update_from_output (vllm-project#12454)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

* [Bugfix] Fix gpt2 GGUF inference (vllm-project#12467)

Signed-off-by: Isotr0py <2037008807@qq.com>

---------

Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: Junichi Sato <junichi.sato@sbintuitions.co.jp>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
Signed-off-by: Keyun Tong <tongkeyun@gmail.com>
Signed-off-by: Matthew Hendrey <matthew.hendrey@gmail.com>
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Signed-off-by: Kyle Mistele <kyle@mistele.com>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Mohit Deopujari <mdeopujari@habana.ai>
Co-authored-by: Junichi Sato <junichi.sato@sbintuitions.co.jp>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Siyuan Liu <lsiyuan@google.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Keyun Tong <tongkeyun@gmail.com>
Co-authored-by: Matthew Hendrey <matthew.hendrey@gmail.com>
Co-authored-by: shangmingc <caishangming@linux.alibaba.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Yuan Tang <terrytangyuan@gmail.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Kyle Mistele <kyle@mistele.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Pooya Davoodi <pooya.davoodi@parasail.io>
1 parent a1a36f3 commit e8e548c
Showing 68 changed files with 3,639 additions and 526 deletions.
7 changes: 5 additions & 2 deletions .buildkite/check-wheel-size.py
@@ -2,8 +2,11 @@
 import sys
 import zipfile
 
-# Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 250 MB
-VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 250))
+# Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 300 MiB
+# Note that we have 400 MiB quota, please use it wisely.
+# See https://github.com/pypi/support/issues/3792 .
+# Please also sync the value with the one in Dockerfile.
+VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 300))
 
 
 def print_top_10_largest_files(zip_file):
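For context, the wheel-size gate reads `VLLM_MAX_SIZE_MB` and fails the build when the built wheel exceeds it, listing the largest members of the wheel to help diagnose regressions. The sketch below is illustrative only; the function name and messages are not the actual contents of `.buildkite/check-wheel-size.py` (which exposes `print_top_10_largest_files` as shown in the diff above).

```python
import os
import sys
import zipfile

# Default mirrors the new 300 MiB limit; override with the env var,
# e.g. VLLM_MAX_SIZE_MB=400 for a one-off build.
MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 300))


def check_wheel_size(wheel_path: str) -> int:
    """Return 0 if the wheel fits under the limit, 1 otherwise."""
    size_mb = os.path.getsize(wheel_path) / (1024 * 1024)
    if size_mb > MAX_SIZE_MB:
        # Wheels are zip archives; list the largest members to show
        # where the size went.
        with zipfile.ZipFile(wheel_path) as zf:
            largest = sorted(zf.infolist(), key=lambda i: i.file_size,
                             reverse=True)[:10]
            for info in largest:
                print(f"{info.filename}: "
                      f"{info.file_size / (1024 * 1024):.2f} MiB")
        print(f"Wheel is {size_mb:.2f} MiB, limit is {MAX_SIZE_MB} MiB")
        return 1
    print(f"Wheel is {size_mb:.2f} MiB, within the {MAX_SIZE_MB} MiB limit")
    return 0


if __name__ == "__main__":
    sys.exit(check_wheel_size(sys.argv[1]))
```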
6 changes: 5 additions & 1 deletion CMakeLists.txt
100644 → 100755
@@ -446,6 +446,9 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
 endif()
 
 message(STATUS "Enabling C extension.")
+if(VLLM_GPU_LANG STREQUAL "CUDA")
+  list(APPEND VLLM_C_LIBS cuda)
+endif()
 define_gpu_extension_target(
   _C
   DESTINATION vllm
@@ -454,6 +457,7 @@ define_gpu_extension_target(
   COMPILE_FLAGS ${VLLM_GPU_FLAGS}
   ARCHITECTURES ${VLLM_GPU_ARCHES}
   INCLUDE_DIRECTORIES ${CUTLASS_INCLUDE_DIR};${CUTLASS_TOOLS_UTIL_INCLUDE_DIR}
+  LIBRARIES ${VLLM_C_LIBS}
   USE_SABI 3
   WITH_SOABI)
 
@@ -576,7 +580,7 @@ else()
   FetchContent_Declare(
     vllm-flash-attn
     GIT_REPOSITORY https://github.com/vllm-project/flash-attention.git
-    GIT_TAG 90eacc1af2a7c3de62ea249e929ed5faccf38954
+    GIT_TAG d4e09037abf588af1ec47d0e966b237ee376876c
     GIT_PROGRESS TRUE
     # Don't share the vllm-flash-attn build between build types
     BINARY_DIR ${CMAKE_BINARY_DIR}/vllm-flash-attn
4 changes: 2 additions & 2 deletions Dockerfile
@@ -126,8 +126,8 @@ RUN --mount=type=cache,target=/root/.cache/ccache \
 
 # Check the size of the wheel if RUN_WHEEL_CHECK is true
 COPY .buildkite/check-wheel-size.py check-wheel-size.py
-# Default max size of the wheel is 250MB
-ARG VLLM_MAX_SIZE_MB=250
+# sync the default value with .buildkite/check-wheel-size.py
+ARG VLLM_MAX_SIZE_MB=300
 ENV VLLM_MAX_SIZE_MB=$VLLM_MAX_SIZE_MB
 ARG RUN_WHEEL_CHECK=true
 RUN if [ "$RUN_WHEEL_CHECK" = "true" ]; then \
2 changes: 1 addition & 1 deletion Dockerfile.tpu
@@ -1,4 +1,4 @@
-ARG NIGHTLY_DATE="20250122"
+ARG NIGHTLY_DATE="20250124"
 ARG BASE_IMAGE="us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.10_tpuvm_$NIGHTLY_DATE"
 
 FROM $BASE_IMAGE
15 changes: 10 additions & 5 deletions benchmarks/backend_request_func.py
@@ -51,7 +51,8 @@ async def async_request_tgi(
     api_url = request_func_input.api_url
     assert api_url.endswith("generate_stream")
 
-    async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
+    async with aiohttp.ClientSession(trust_env=True,
+                                     timeout=AIOHTTP_TIMEOUT) as session:
         params = {
             "best_of": request_func_input.best_of,
             "max_new_tokens": request_func_input.output_len,
@@ -123,7 +124,8 @@ async def async_request_trt_llm(
     api_url = request_func_input.api_url
     assert api_url.endswith("generate_stream")
 
-    async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
+    async with aiohttp.ClientSession(trust_env=True,
+                                     timeout=AIOHTTP_TIMEOUT) as session:
         assert request_func_input.best_of == 1
         payload = {
             "accumulate_tokens": True,
@@ -187,7 +189,8 @@ async def async_request_deepspeed_mii(
     request_func_input: RequestFuncInput,
     pbar: Optional[tqdm] = None,
 ) -> RequestFuncOutput:
-    async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
+    async with aiohttp.ClientSession(trust_env=True,
+                                     timeout=AIOHTTP_TIMEOUT) as session:
         assert request_func_input.best_of == 1
 
         payload = {
@@ -235,7 +238,8 @@ async def async_request_openai_completions(
         ("completions", "profile")
     ), "OpenAI Completions API URL must end with 'completions' or 'profile'."
 
-    async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
+    async with aiohttp.ClientSession(trust_env=True,
+                                     timeout=AIOHTTP_TIMEOUT) as session:
         payload = {
             "model": request_func_input.model_name \
                 if request_func_input.model_name else request_func_input.model,
@@ -333,7 +337,8 @@ async def async_request_openai_chat_completions(
         "chat/completions"
     ), "OpenAI Chat Completions API URL must end with 'chat/completions'."
 
-    async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
+    async with aiohttp.ClientSession(trust_env=True,
+                                     timeout=AIOHTTP_TIMEOUT) as session:
         content = [{"type": "text", "text": request_func_input.prompt}]
         if request_func_input.multi_modal_content:
             content.append(request_func_input.multi_modal_content)
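The `trust_env=True` argument added above is what enables proxy support in the benchmark client: aiohttp then honors the standard `HTTP_PROXY`/`HTTPS_PROXY`/`NO_PROXY` environment variables (and `.netrc`). A standalone sketch of the behavior, with a placeholder proxy address and URL:

```python
import asyncio
import os

import aiohttp


async def main() -> None:
    # With trust_env=True, aiohttp reads proxy settings from the environment,
    # so benchmark traffic can be routed through a proxy without changing
    # any of the request code. The proxy address below is a placeholder.
    os.environ.setdefault("HTTPS_PROXY", "http://proxy.example.com:3128")

    timeout = aiohttp.ClientTimeout(total=60)  # generous timeout for long runs
    async with aiohttp.ClientSession(trust_env=True,
                                     timeout=timeout) as session:
        async with session.get("https://example.com/health") as resp:
            print(resp.status)


if __name__ == "__main__":
    asyncio.run(main())
```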
32 changes: 12 additions & 20 deletions benchmarks/benchmark_serving.py
@@ -200,7 +200,7 @@ def sample_sonnet_requests(
     return sampled_requests
 
 
-def sample_mmmu_pro_vision_requests(
+def sample_vision_arena_requests(
     dataset,
     num_requests: int,
     tokenizer: PreTrainedTokenizerBase,
@@ -212,13 +212,7 @@ def sample_mmmu_pro_vision_requests(
         if len(sampled_requests) == num_requests:
             break
 
-        # MMMU-Pro vision direct prompt
-        # Ref: https://github.com/MMMU-Benchmark/MMMU/blob/6ce42f4d8f70c1841c67867152648974415b5cac/mmmu-pro/prompts.yaml#L5
-        prompt = (
-            "Answer with the option letter from the given choices directly. "
-            "The last line of your response should be of the following "
-            "format: 'Answer: $LETTER' (without quotes) where LETTER is one of "
-            "options.")
+        prompt = data["turns"][0][0]['content']
 
         prompt_token_ids = tokenizer(prompt).input_ids
         if fixed_output_len is None:
@@ -230,10 +224,10 @@
             output_len = fixed_output_len
 
         assert isinstance(
-            data["image"],
+            data["images"][0],
             Image), ("Input image format must be `PIL.Image.Image`, "
                      f"given {type(data['image'])}.")
-        image: Image = data["image"]
+        image: Image = data["images"][0]
         image = image.convert("RGB")
         image_data = io.BytesIO()
         image.save(image_data, format='JPEG')
@@ -252,27 +246,25 @@
 
 def sample_hf_requests(
     dataset_path: str,
-    dataset_subset: str,
+    dataset_subset: Optional[str],
     dataset_split: str,
     num_requests: int,
     tokenizer: PreTrainedTokenizerBase,
     random_seed: int,
     fixed_output_len: Optional[int] = None,
 ) -> List[Tuple[str, str, int, Optional[Dict[str, Collection[str]]]]]:
 
-    # Special case for MMMU-Pro vision dataset
-    if dataset_path == 'MMMU/MMMU_Pro' and dataset_subset == 'vision':
-        assert dataset_split == "test"
+    # Special case for vision_arena dataset
+    if dataset_path == 'lmarena-ai/vision-arena-bench-v0.1' \
+            and dataset_subset is None:
+        assert dataset_split == "train"
         dataset = load_dataset(dataset_path,
                                name=dataset_subset,
                                split=dataset_split,
                                streaming=True)
-        assert "image" in dataset.features, (
-            "MMMU/MMMU_Pro vision dataset must have 'image' column.")
-        filter_func = lambda x: isinstance(x["image"], Image)
-        dataset = dataset.shuffle(seed=random_seed).filter(filter_func)
-        return sample_mmmu_pro_vision_requests(dataset, num_requests,
-                                               tokenizer, fixed_output_len)
+        dataset = dataset.shuffle(seed=random_seed)
+        return sample_vision_arena_requests(dataset, num_requests, tokenizer,
+                                            fixed_output_len)
 
     dataset = load_dataset(dataset_path,
                            name=dataset_subset,
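For reference, the Vision Arena rows consumed by the sampler above carry the prompt under `turns` and a `PIL.Image.Image` under `images`. A minimal sketch of loading a single row (assumes the `datasets` and `Pillow` packages, access to the Hugging Face Hub, and that the dataset schema matches the fields used in the diff):

```python
import io

from datasets import load_dataset

# Stream the dataset so we do not download the whole benchmark up front.
dataset = load_dataset("lmarena-ai/vision-arena-bench-v0.1",
                       split="train",
                       streaming=True)

for row in dataset:
    prompt = row["turns"][0][0]["content"]   # first user turn of the conversation
    image = row["images"][0].convert("RGB")  # decoded as PIL.Image.Image
    buf = io.BytesIO()
    image.save(buf, format="JPEG")
    print(prompt[:60], "...", len(buf.getvalue()), "bytes of JPEG")
    break
```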
4 changes: 3 additions & 1 deletion csrc/moe/moe_align_sum_kernels.cu
@@ -33,7 +33,9 @@ __global__ void moe_align_block_size_kernel(scalar_t* __restrict__ topk_ids,
 
   extern __shared__ int32_t shared_mem[];
   int32_t* cumsum = shared_mem;  // 1d tensor with shape (num_experts + 1)
-  token_cnts_t* tokens_cnts = (token_cnts_t*)(shared_mem + blockDim.x + 1);
+  token_cnts_t* tokens_cnts =
+      (token_cnts_t*)(shared_mem + num_experts +
+                      1);  // 2d tensor with shape (blockDim.x + 1, num_experts)
 
   for (int i = 0; i < num_experts; ++i) {
     tokens_cnts[index(num_experts, threadIdx.x + 1, i)] = 0;
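The fix above corrects where `tokens_cnts` starts in shared memory: `cumsum` occupies the first `num_experts + 1` integers, so the `(blockDim.x + 1, num_experts)` counts table must begin at offset `num_experts + 1`, not `blockDim.x + 1` (the two only coincide when the block size happens to equal the expert count). A small Python sketch of the intended layout, illustrative only:

```python
# Emulate the kernel's flat shared-memory buffer to show the layout.
num_experts = 8
block_dim_x = 4  # number of threads cooperating in the kernel

cumsum_len = num_experts + 1                         # 1-D: (num_experts + 1,)
tokens_cnts_shape = (block_dim_x + 1, num_experts)   # 2-D counts table

shared_mem = [0] * (cumsum_len + tokens_cnts_shape[0] * tokens_cnts_shape[1])


def index(num_experts: int, row: int, col: int) -> int:
    """Row-major index into the (blockDim.x + 1, num_experts) counts table."""
    return row * num_experts + col


# tokens_cnts must start right after cumsum, i.e. at offset num_experts + 1.
# Using block_dim_x + 1 instead (the old code) makes the two regions overlap
# or leaves a gap whenever blockDim.x != num_experts.
tokens_cnts_offset = cumsum_len

for tid in range(block_dim_x):
    for e in range(num_experts):
        shared_mem[tokens_cnts_offset + index(num_experts, tid + 1, e)] = 0
```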
3 changes: 3 additions & 0 deletions docs/source/community/blog.md
@@ -0,0 +1,3 @@
# vLLM Blog

vLLM blog posts are published [here](https://blog.vllm.ai/).
Intel Gaudi installation documentation
@@ -59,6 +59,7 @@ To build and install vLLM from source, run:
```console
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements-hpu.txt
python setup.py develop
```

@@ -68,6 +69,7 @@ Currently, the latest features and performance optimizations are developed in Ga
git clone https://github.com/HabanaAI/vllm-fork.git
cd vllm-fork
git checkout habana_main
pip install -r requirements-hpu.txt
python setup.py develop
```

1 change: 1 addition & 0 deletions docs/source/index.md
@@ -184,6 +184,7 @@ api/model/index
:caption: Community
:maxdepth: 1
community/blog
community/meetups
community/sponsors
```
92 changes: 92 additions & 0 deletions docs/source/serving/openai_compatible_server.md
@@ -50,6 +50,11 @@ In addition, we have the following custom APIs:
- Applicable to all [pooling models](../models/pooling_models.md).
- [Score API](#score-api) (`/score`)
- Only applicable to [cross-encoder models](../models/pooling_models.md) (`--task score`).
- [Re-rank API](#rerank-api) (`/rerank`, `/v1/rerank`, `/v2/rerank`)
- Implements [Jina AI's v1 re-rank API](https://jina.ai/reranker/)
- Also compatible with [Cohere's v1 & v2 re-rank APIs](https://docs.cohere.com/v2/reference/rerank)
- Jina and Cohere's APIs are very similar; Jina's includes extra information in the rerank endpoint's response.
- Only applicable to [cross-encoder models](../models/pooling_models.md) (`--task score`).

(chat-template)=

@@ -473,3 +478,90 @@ The following extra parameters are supported:
:start-after: begin-score-extra-params
:end-before: end-score-extra-params
```

(rerank-api)=

### Re-rank API

Our Re-rank API applies a cross-encoder model to predict relevant scores between a single query, and
each of a list of documents. Usually, the score for a sentence pair refers to the similarity between two sentences, on
a scale of 0 to 1.

You can find the documentation for these kind of models at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html).

The rerank endpoints support popular re-rank models such as `BAAI/bge-reranker-base` and other models supporting the
`score` task. Additionally, `/rerank`, `/v1/rerank`, and `/v2/rerank`
endpoints are compatible with both [Jina AI's re-rank API interface](https://jina.ai/reranker/) and
[Cohere's re-rank API interface](https://docs.cohere.com/v2/reference/rerank) to ensure compatibility with
popular open-source tools.

Code example: <gh-file:examples/online_serving/jinaai_rerank_client.py>

#### Example Request

Note that the `top_n` request parameter is optional and will default to the length of the `documents` field.
Result documents will be sorted by relevance, and the `index` property can be used to determine original order.

Request:

```bash
curl -X 'POST' \
'http://127.0.0.1:8000/v1/rerank' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "BAAI/bge-reranker-base",
"query": "What is the capital of France?",
"documents": [
"The capital of Brazil is Brasilia.",
"The capital of France is Paris.",
"Horses and cows are both animals"
]
}'
```

Response:

```bash
{
"id": "rerank-fae51b2b664d4ed38f5969b612edff77",
"model": "BAAI/bge-reranker-base",
"usage": {
"total_tokens": 56
},
"results": [
{
"index": 1,
"document": {
"text": "The capital of France is Paris."
},
"relevance_score": 0.99853515625
},
{
"index": 0,
"document": {
"text": "The capital of Brazil is Brasilia."
},
"relevance_score": 0.0005860328674316406
}
]
}
```

#### Extra parameters

The following [pooling parameters](#pooling-params) are supported.

```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-rerank-pooling-params
:end-before: end-rerank-pooling-params
```

The following extra parameters are supported:

```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-rerank-extra-params
:end-before: end-rerank-extra-params
```
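For Python users, the same request as the curl example in the Re-rank API section above can be issued with the `requests` library. This is a minimal sketch; it assumes a local server started with `vllm serve BAAI/bge-reranker-base` listening on the default port, as in the example request:

```python
import requests

response = requests.post(
    "http://127.0.0.1:8000/v1/rerank",
    json={
        "model": "BAAI/bge-reranker-base",
        "query": "What is the capital of France?",
        "documents": [
            "The capital of Brazil is Brasilia.",
            "The capital of France is Paris.",
            "Horses and cows are both animals",
        ],
    },
)
response.raise_for_status()

for result in response.json()["results"]:
    # Results come back sorted by relevance; `index` points at the original
    # position in the `documents` list.
    print(result["index"], result["relevance_score"])
```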
33 changes: 32 additions & 1 deletion examples/offline_inference/openai/openai_batch.md
@@ -13,7 +13,7 @@ The OpenAI batch file format consists of a series of json objects on new lines.
 Each line represents a separate request. See the [OpenAI package reference](https://platform.openai.com/docs/api-reference/batch/requestInput) for more details.
 
 ```{note}
-We currently only support `/v1/chat/completions` and `/v1/embeddings` endpoints (completions coming soon).
+We currently support `/v1/chat/completions`, `/v1/embeddings`, and `/v1/score` endpoints (completions coming soon).
 ```
 
## Pre-requisites
@@ -203,3 +203,34 @@ $ cat results.jsonl
{"id":"vllm-db0f71f7dec244e6bce530e0b4ef908b","custom_id":"request-1","response":{"status_code":200,"request_id":"vllm-batch-3580bf4d4ae54d52b67eee266a6eab20","body":{"id":"embd-33ac2efa7996430184461f2e38529746","object":"list","created":444647,"model":"intfloat/e5-mistral-7b-instruct","data":[{"index":0,"object":"embedding","embedding":[0.016204833984375,0.0092010498046875,0.0018358230590820312,-0.0028228759765625,0.001422882080078125,-0.0031147003173828125,...]}],"usage":{"prompt_tokens":8,"total_tokens":8,"completion_tokens":0}}},"error":null}
...
```

## Example 5: Using score endpoint

### Additional prerequisites

* Ensure you are using `vllm >= 0.7.0`.

### Step 1: Create your batch file

Add score requests to your batch file. The following is an example:

```
{"custom_id": "request-1", "method": "POST", "url": "/v1/score", "body": {"model": "BAAI/bge-reranker-v2-m3", "text_1": "What is the capital of France?", "text_2": ["The capital of Brazil is Brasilia.", "The capital of France is Paris."]}}
{"custom_id": "request-2", "method": "POST", "url": "/v1/score", "body": {"model": "BAAI/bge-reranker-v2-m3", "text_1": "What is the capital of France?", "text_2": ["The capital of Brazil is Brasilia.", "The capital of France is Paris."]}}
```

You can mix chat completion, embedding, and score requests in the batch file, as long as the model you are using supports them all (note that all requests must use the same model).

### Step 2: Run the batch

You can run the batch using the same command as in earlier examples.

### Step 3: Check your results

You can check your results by running `cat results.jsonl`

```
$ cat results.jsonl
{"id":"vllm-f87c5c4539184f618e555744a2965987","custom_id":"request-1","response":{"status_code":200,"request_id":"vllm-batch-806ab64512e44071b37d3f7ccd291413","body":{"id":"score-4ee45236897b4d29907d49b01298cdb1","object":"list","created":1737847944,"model":"BAAI/bge-reranker-v2-m3","data":[{"index":0,"object":"score","score":0.0010900497436523438},{"index":1,"object":"score","score":1.0}],"usage":{"prompt_tokens":37,"total_tokens":37,"completion_tokens":0,"prompt_tokens_details":null}}},"error":null}
{"id":"vllm-41990c51a26d4fac8419077f12871099","custom_id":"request-2","response":{"status_code":200,"request_id":"vllm-batch-73ce66379026482699f81974e14e1e99","body":{"id":"score-13f2ffe6ba40460fbf9f7f00ad667d75","object":"list","created":1737847944,"model":"BAAI/bge-reranker-v2-m3","data":[{"index":0,"object":"score","score":0.001094818115234375},{"index":1,"object":"score","score":1.0}],"usage":{"prompt_tokens":37,"total_tokens":37,"completion_tokens":0,"prompt_tokens_details":null}}},"error":null}
```
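To post-process batch score results programmatically, the `results.jsonl` file can be parsed line by line. A minimal sketch based on the output format shown above:

```python
import json

with open("results.jsonl") as f:
    for line in f:
        result = json.loads(line)
        body = result["response"]["body"]
        # Each score request returns one entry per text_2 item, in order.
        scores = [item["score"] for item in body["data"]]
        print(result["custom_id"], scores)
```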
32 changes: 32 additions & 0 deletions examples/online_serving/cohere_rerank_client.py
@@ -0,0 +1,32 @@
"""
Example of using the OpenAI entrypoint's rerank API which is compatible with
the Cohere SDK: https://github.com/cohere-ai/cohere-python
run: vllm serve BAAI/bge-reranker-base
"""
import cohere

# cohere v1 client
co = cohere.Client(base_url="http://localhost:8000", api_key="sk-fake-key")
rerank_v1_result = co.rerank(
model="BAAI/bge-reranker-base",
query="What is the capital of France?",
documents=[
"The capital of France is Paris", "Reranking is fun!",
"vLLM is an open-source framework for fast AI serving"
])

print(rerank_v1_result)

# or the v2
co2 = cohere.ClientV2("sk-fake-key", base_url="http://localhost:8000")

v2_rerank_result = co2.rerank(
model="BAAI/bge-reranker-base",
query="What is the capital of France?",
documents=[
"The capital of France is Paris", "Reranking is fun!",
"vLLM is an open-source framework for fast AI serving"
])

print(v2_rerank_result)