[MFM] Merge changes from vllm-project/vllm main (#28)
* [Misc] Use VisionArena Dataset for VLM Benchmarking (vllm-project#12389)

Signed-off-by: Roger Wang <ywang@roblox.com>

* [ci/build] fix wheel size check (vllm-project#12396)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Hardware][Gaudi][Doc] Add missing step in setup instructions (vllm-project#12382)

* [ci/build] sync default value for wheel size (vllm-project#12398)

Signed-off-by: youkaichao <youkaichao@gmail.com>

* [Misc] Enable proxy support in benchmark script (vllm-project#12356)

Signed-off-by: Junichi Sato <junichi.sato@sbintuitions.co.jp>

* [Bugfix][Kernel] Fix CUDA 11.8 being broken by FA3 build (vllm-project#12375)

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* [Misc] Remove deprecated code (vllm-project#12383)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Bugfix][Kernel] FA3 Fix - RuntimeError: This flash attention build only supports pack_gqa (for build size reasons). (vllm-project#12405)

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* [Bugfix][Kernel] Fix moe align block issue for mixtral (vllm-project#12413)

* [Bugfix] Fix BLIP-2 processing (vllm-project#12412)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [ROCm][MoE] MI300 tuned configs Mixtral-8x(7B,22B) | fp16, fp8 (vllm-project#12408)

Signed-off-by: Divakar Verma <divakar.verma@amd.com>

* [Misc] Add FA2 support to ViT MHA layer (vllm-project#12355)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [TPU][CI] Update torchxla version in requirement-tpu.txt (vllm-project#12422)

Signed-off-by: Siyuan Liu <lsiyuan@google.com>

* [Misc][Bugfix] FA3 support to ViT MHA layer (vllm-project#12435)

Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>

* [V1][Perf] Reduce scheduling overhead in model runner after cuda sync (vllm-project#12094)

Signed-off-by: Keyun Tong <tongkeyun@gmail.com>

* [V1][Bugfix] Fix assertion when mm hashing is turned off (vllm-project#12439)

Signed-off-by: Roger Wang <ywang@roblox.com>

* [Misc] Revert FA on ViT vllm-project#12355 and vllm-project#12435 (vllm-project#12445)

* [Frontend] generation_config.json for maximum tokens (vllm-project#12242)

Signed-off-by: Matthew Hendrey <matthew.hendrey@gmail.com>
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: shangmingc <caishangming@linux.alibaba.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Yuan Tang <terrytangyuan@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>

* [Bugfix] Disable w16a16 2of4 sparse CompressedTensors24 (vllm-project#12417)

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>

* [Bugfix/CI] Fix broken kernels/test_mha.py (vllm-project#12450)

* [Bugfix][Kernel] Fix perf regression caused by PR vllm-project#12405 (vllm-project#12434)

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* [Build/CI] Fix libcuda.so linkage (vllm-project#12424)

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

* [Frontend] Rerank API (Jina- and Cohere-compatible API)  (vllm-project#12376)

Signed-off-by: Kyle Mistele <kyle@mistele.com>

* [DOC] Add link to vLLM blog (vllm-project#12460)

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>

* [V1] Avoid list creation in input preparation (vllm-project#12457)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

* [Frontend] Support scores endpoint in run_batch (vllm-project#12430)

Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>

* [Bugfix] Fix Granite 3.0 MoE model loading (vllm-project#12446)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Bugfix] Fix missing seq_start_loc in xformers prefill metadata (vllm-project#12464)

Signed-off-by: Isotr0py <2037008807@qq.com>

* [V1][Minor] Minor optimizations for update_from_output (vllm-project#12454)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

* [Bugfix] Fix gpt2 GGUF inference (vllm-project#12467)

Signed-off-by: Isotr0py <2037008807@qq.com>

---------

Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: Junichi Sato <junichi.sato@sbintuitions.co.jp>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
Signed-off-by: Keyun Tong <tongkeyun@gmail.com>
Signed-off-by: Matthew Hendrey <matthew.hendrey@gmail.com>
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Signed-off-by: Kyle Mistele <kyle@mistele.com>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Mohit Deopujari <mdeopujari@habana.ai>
Co-authored-by: Junichi Sato <junichi.sato@sbintuitions.co.jp>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Siyuan Liu <lsiyuan@google.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Keyun Tong <tongkeyun@gmail.com>
Co-authored-by: Matthew Hendrey <matthew.hendrey@gmail.com>
Co-authored-by: shangmingc <caishangming@linux.alibaba.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Yuan Tang <terrytangyuan@gmail.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Kyle Mistele <kyle@mistele.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Pooya Davoodi <pooya.davoodi@parasail.io>
1 parent a1a36f3 commit e8e548c
Showing 68 changed files with 3,639 additions and 526 deletions.
7 changes: 5 additions & 2 deletions .buildkite/check-wheel-size.py
@@ -2,8 +2,11 @@
 import sys
 import zipfile
 
-# Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 250 MB
-VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 250))
+# Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 300 MiB
+# Note that we have 400 MiB quota, please use it wisely.
+# See https://github.com/pypi/support/issues/3792 .
+# Please also sync the value with the one in Dockerfile.
+VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 300))
 
 
 def print_top_10_largest_files(zip_file):
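For context, the wheel-size gate reads `VLLM_MAX_SIZE_MB` and fails the build when the built wheel exceeds it, listing the largest members of the wheel to help diagnose regressions. The sketch below is illustrative only; the function name and messages are not the actual contents of `.buildkite/check-wheel-size.py` (which exposes `print_top_10_largest_files` as shown in the diff above).

```python
import os
import sys
import zipfile

# Default mirrors the new 300 MiB limit; override with the env var,
# e.g. VLLM_MAX_SIZE_MB=400 for a one-off build.
MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 300))


def check_wheel_size(wheel_path: str) -> int:
    """Return 0 if the wheel fits under the limit, 1 otherwise."""
    size_mb = os.path.getsize(wheel_path) / (1024 * 1024)
    if size_mb > MAX_SIZE_MB:
        # Wheels are zip archives; list the largest members to show
        # where the size went.
        with zipfile.ZipFile(wheel_path) as zf:
            largest = sorted(zf.infolist(), key=lambda i: i.file_size,
                             reverse=True)[:10]
            for info in largest:
                print(f"{info.filename}: "
                      f"{info.file_size / (1024 * 1024):.2f} MiB")
        print(f"Wheel is {size_mb:.2f} MiB, limit is {MAX_SIZE_MB} MiB")
        return 1
    print(f"Wheel is {size_mb:.2f} MiB, within the {MAX_SIZE_MB} MiB limit")
    return 0


if __name__ == "__main__":
    sys.exit(check_wheel_size(sys.argv[1]))
```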
6 changes: 5 additions & 1 deletion CMakeLists.txt
100644 → 100755
@@ -446,6 +446,9 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
 endif()
 
 message(STATUS "Enabling C extension.")
+if(VLLM_GPU_LANG STREQUAL "CUDA")
+  list(APPEND VLLM_C_LIBS cuda)
+endif()
 define_gpu_extension_target(
   _C
   DESTINATION vllm
@@ -454,6 +457,7 @@ define_gpu_extension_target(
   COMPILE_FLAGS ${VLLM_GPU_FLAGS}
   ARCHITECTURES ${VLLM_GPU_ARCHES}
   INCLUDE_DIRECTORIES ${CUTLASS_INCLUDE_DIR};${CUTLASS_TOOLS_UTIL_INCLUDE_DIR}
+  LIBRARIES ${VLLM_C_LIBS}
   USE_SABI 3
   WITH_SOABI)
 
@@ -576,7 +580,7 @@ else()
   FetchContent_Declare(
     vllm-flash-attn
     GIT_REPOSITORY https://github.com/vllm-project/flash-attention.git
-    GIT_TAG 90eacc1af2a7c3de62ea249e929ed5faccf38954
+    GIT_TAG d4e09037abf588af1ec47d0e966b237ee376876c
     GIT_PROGRESS TRUE
     # Don't share the vllm-flash-attn build between build types
     BINARY_DIR ${CMAKE_BINARY_DIR}/vllm-flash-attn
4 changes: 2 additions & 2 deletions Dockerfile
@@ -126,8 +126,8 @@ RUN --mount=type=cache,target=/root/.cache/ccache \
 
 # Check the size of the wheel if RUN_WHEEL_CHECK is true
 COPY .buildkite/check-wheel-size.py check-wheel-size.py
-# Default max size of the wheel is 250MB
-ARG VLLM_MAX_SIZE_MB=250
+# sync the default value with .buildkite/check-wheel-size.py
+ARG VLLM_MAX_SIZE_MB=300
 ENV VLLM_MAX_SIZE_MB=$VLLM_MAX_SIZE_MB
 ARG RUN_WHEEL_CHECK=true
 RUN if [ "$RUN_WHEEL_CHECK" = "true" ]; then \
2 changes: 1 addition & 1 deletion Dockerfile.tpu
@@ -1,4 +1,4 @@
-ARG NIGHTLY_DATE="20250122"
+ARG NIGHTLY_DATE="20250124"
 ARG BASE_IMAGE="us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.10_tpuvm_$NIGHTLY_DATE"
 
 FROM $BASE_IMAGE
15 changes: 10 additions & 5 deletions benchmarks/backend_request_func.py
@@ -51,7 +51,8 @@ async def async_request_tgi(
     api_url = request_func_input.api_url
     assert api_url.endswith("generate_stream")
 
-    async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
+    async with aiohttp.ClientSession(trust_env=True,
+                                     timeout=AIOHTTP_TIMEOUT) as session:
         params = {
             "best_of": request_func_input.best_of,
             "max_new_tokens": request_func_input.output_len,
@@ -123,7 +124,8 @@ async def async_request_trt_llm(
     api_url = request_func_input.api_url
     assert api_url.endswith("generate_stream")
 
-    async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
+    async with aiohttp.ClientSession(trust_env=True,
+                                     timeout=AIOHTTP_TIMEOUT) as session:
         assert request_func_input.best_of == 1
         payload = {
             "accumulate_tokens": True,
@@ -187,7 +189,8 @@ async def async_request_deepspeed_mii(
     request_func_input: RequestFuncInput,
     pbar: Optional[tqdm] = None,
 ) -> RequestFuncOutput:
-    async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
+    async with aiohttp.ClientSession(trust_env=True,
+                                     timeout=AIOHTTP_TIMEOUT) as session:
         assert request_func_input.best_of == 1
 
         payload = {
@@ -235,7 +238,8 @@ async def async_request_openai_completions(
         ("completions", "profile")
     ), "OpenAI Completions API URL must end with 'completions' or 'profile'."
 
-    async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
+    async with aiohttp.ClientSession(trust_env=True,
+                                     timeout=AIOHTTP_TIMEOUT) as session:
         payload = {
             "model": request_func_input.model_name \
                 if request_func_input.model_name else request_func_input.model,
@@ -333,7 +337,8 @@ async def async_request_openai_chat_completions(
         "chat/completions"
     ), "OpenAI Chat Completions API URL must end with 'chat/completions'."
 
-    async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
+    async with aiohttp.ClientSession(trust_env=True,
+                                     timeout=AIOHTTP_TIMEOUT) as session:
         content = [{"type": "text", "text": request_func_input.prompt}]
         if request_func_input.multi_modal_content:
             content.append(request_func_input.multi_modal_content)
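The `trust_env=True` argument added above is what enables proxy support in the benchmark client: aiohttp then honors the standard `HTTP_PROXY`/`HTTPS_PROXY`/`NO_PROXY` environment variables (and `.netrc`). A standalone sketch of the behavior, with a placeholder proxy address and URL:

```python
import asyncio
import os

import aiohttp


async def main() -> None:
    # With trust_env=True, aiohttp reads proxy settings from the environment,
    # so benchmark traffic can be routed through a proxy without changing
    # any of the request code. The proxy address below is a placeholder.
    os.environ.setdefault("HTTPS_PROXY", "http://proxy.example.com:3128")

    timeout = aiohttp.ClientTimeout(total=60)  # generous timeout for long runs
    async with aiohttp.ClientSession(trust_env=True,
                                     timeout=timeout) as session:
        async with session.get("https://example.com/health") as resp:
            print(resp.status)


if __name__ == "__main__":
    asyncio.run(main())
```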
32 changes: 12 additions & 20 deletions benchmarks/benchmark_serving.py
@@ -200,7 +200,7 @@ def sample_sonnet_requests(
     return sampled_requests
 
 
-def sample_mmmu_pro_vision_requests(
+def sample_vision_arena_requests(
     dataset,
     num_requests: int,
     tokenizer: PreTrainedTokenizerBase,
@@ -212,13 +212,7 @@ def sample_mmmu_pro_vision_requests(
         if len(sampled_requests) == num_requests:
             break
 
-        # MMMU-Pro vision direct prompt
-        # Ref: https://github.com/MMMU-Benchmark/MMMU/blob/6ce42f4d8f70c1841c67867152648974415b5cac/mmmu-pro/prompts.yaml#L5
-        prompt = (
-            "Answer with the option letter from the given choices directly. "
-            "The last line of your response should be of the following "
-            "format: 'Answer: $LETTER' (without quotes) where LETTER is one of "
-            "options.")
+        prompt = data["turns"][0][0]['content']
 
         prompt_token_ids = tokenizer(prompt).input_ids
         if fixed_output_len is None:
@@ -230,10 +224,10 @@
             output_len = fixed_output_len
 
         assert isinstance(
-            data["image"],
+            data["images"][0],
             Image), ("Input image format must be `PIL.Image.Image`, "
                      f"given {type(data['image'])}.")
-        image: Image = data["image"]
+        image: Image = data["images"][0]
         image = image.convert("RGB")
         image_data = io.BytesIO()
         image.save(image_data, format='JPEG')
@@ -252,27 +246,25 @@
 
 def sample_hf_requests(
     dataset_path: str,
-    dataset_subset: str,
+    dataset_subset: Optional[str],
     dataset_split: str,
     num_requests: int,
     tokenizer: PreTrainedTokenizerBase,
     random_seed: int,
     fixed_output_len: Optional[int] = None,
 ) -> List[Tuple[str, str, int, Optional[Dict[str, Collection[str]]]]]:
 
-    # Special case for MMMU-Pro vision dataset
-    if dataset_path == 'MMMU/MMMU_Pro' and dataset_subset == 'vision':
-        assert dataset_split == "test"
+    # Special case for vision_arena dataset
+    if dataset_path == 'lmarena-ai/vision-arena-bench-v0.1' \
+            and dataset_subset is None:
+        assert dataset_split == "train"
         dataset = load_dataset(dataset_path,
                                name=dataset_subset,
                                split=dataset_split,
                                streaming=True)
-        assert "image" in dataset.features, (
-            "MMMU/MMMU_Pro vision dataset must have 'image' column.")
-        filter_func = lambda x: isinstance(x["image"], Image)
-        dataset = dataset.shuffle(seed=random_seed).filter(filter_func)
-        return sample_mmmu_pro_vision_requests(dataset, num_requests,
-                                               tokenizer, fixed_output_len)
+        dataset = dataset.shuffle(seed=random_seed)
+        return sample_vision_arena_requests(dataset, num_requests, tokenizer,
+                                            fixed_output_len)
 
     dataset = load_dataset(dataset_path,
                            name=dataset_subset,
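For reference, the Vision Arena rows consumed by the sampler above carry the prompt under `turns` and a `PIL.Image.Image` under `images`. A minimal sketch of loading a single row (assumes the `datasets` and `Pillow` packages, access to the Hugging Face Hub, and that the dataset schema matches the fields used in the diff):

```python
import io

from datasets import load_dataset

# Stream the dataset so we do not download the whole benchmark up front.
dataset = load_dataset("lmarena-ai/vision-arena-bench-v0.1",
                       split="train",
                       streaming=True)

for row in dataset:
    prompt = row["turns"][0][0]["content"]   # first user turn of the conversation
    image = row["images"][0].convert("RGB")  # decoded as PIL.Image.Image
    buf = io.BytesIO()
    image.save(buf, format="JPEG")
    print(prompt[:60], "...", len(buf.getvalue()), "bytes of JPEG")
    break
```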
4 changes: 3 additions & 1 deletion csrc/moe/moe_align_sum_kernels.cu
@@ -33,7 +33,9 @@ __global__ void moe_align_block_size_kernel(scalar_t* __restrict__ topk_ids,
 
   extern __shared__ int32_t shared_mem[];
   int32_t* cumsum = shared_mem;  // 1d tensor with shape (num_experts + 1)
-  token_cnts_t* tokens_cnts = (token_cnts_t*)(shared_mem + blockDim.x + 1);
+  token_cnts_t* tokens_cnts =
+      (token_cnts_t*)(shared_mem + num_experts +
+                      1);  // 2d tensor with shape (blockDim.x + 1, num_experts)
 
   for (int i = 0; i < num_experts; ++i) {
     tokens_cnts[index(num_experts, threadIdx.x + 1, i)] = 0;
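The fix above corrects where `tokens_cnts` starts in shared memory: `cumsum` occupies the first `num_experts + 1` integers, so the `(blockDim.x + 1, num_experts)` counts table must begin at offset `num_experts + 1`, not `blockDim.x + 1` (the two only coincide when the block size happens to equal the expert count). A small Python sketch of the intended layout, illustrative only:

```python
# Emulate the kernel's flat shared-memory buffer to show the layout.
num_experts = 8
block_dim_x = 4  # number of threads cooperating in the kernel

cumsum_len = num_experts + 1                         # 1-D: (num_experts + 1,)
tokens_cnts_shape = (block_dim_x + 1, num_experts)   # 2-D counts table

shared_mem = [0] * (cumsum_len + tokens_cnts_shape[0] * tokens_cnts_shape[1])


def index(num_experts: int, row: int, col: int) -> int:
    """Row-major index into the (blockDim.x + 1, num_experts) counts table."""
    return row * num_experts + col


# tokens_cnts must start right after cumsum, i.e. at offset num_experts + 1.
# Using block_dim_x + 1 instead (the old code) makes the two regions overlap
# or leaves a gap whenever blockDim.x != num_experts.
tokens_cnts_offset = cumsum_len

for tid in range(block_dim_x):
    for e in range(num_experts):
        shared_mem[tokens_cnts_offset + index(num_experts, tid + 1, e)] = 0
```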
3 changes: 3 additions & 0 deletions docs/source/community/blog.md
@@ -0,0 +1,3 @@
# vLLM Blog

vLLM blog posts are published [here](https://blog.vllm.ai/).
Intel Gaudi installation documentation
@@ -59,6 +59,7 @@ To build and install vLLM from source, run:
```console
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements-hpu.txt
python setup.py develop
```

@@ -68,6 +69,7 @@ Currently, the latest features and performance optimizations are developed in Ga
git clone https://github.com/HabanaAI/vllm-fork.git
cd vllm-fork
git checkout habana_main
pip install -r requirements-hpu.txt
python setup.py develop
```

1 change: 1 addition & 0 deletions docs/source/index.md
@@ -184,6 +184,7 @@ api/model/index
:caption: Community
:maxdepth: 1
community/blog
community/meetups
community/sponsors
```
92 changes: 92 additions & 0 deletions docs/source/serving/openai_compatible_server.md
@@ -50,6 +50,11 @@ In addition, we have the following custom APIs:
- Applicable to all [pooling models](../models/pooling_models.md).
- [Score API](#score-api) (`/score`)
- Only applicable to [cross-encoder models](../models/pooling_models.md) (`--task score`).
- [Re-rank API](#rerank-api) (`/rerank`, `/v1/rerank`, `/v2/rerank`)
- Implements [Jina AI's v1 re-rank API](https://jina.ai/reranker/)
- Also compatible with [Cohere's v1 & v2 re-rank APIs](https://docs.cohere.com/v2/reference/rerank)
- Jina and Cohere's APIs are very similar; Jina's includes extra information in the rerank endpoint's response.
- Only applicable to [cross-encoder models](../models/pooling_models.md) (`--task score`).

(chat-template)=

@@ -473,3 +478,90 @@ The following extra parameters are supported:
:start-after: begin-score-extra-params
:end-before: end-score-extra-params
```

(rerank-api)=

### Re-rank API

Our Re-rank API applies a cross-encoder model to predict relevant scores between a single query, and
each of a list of documents. Usually, the score for a sentence pair refers to the similarity between two sentences, on
a scale of 0 to 1.

You can find the documentation for these kind of models at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html).

The rerank endpoints support popular re-rank models such as `BAAI/bge-reranker-base` and other models supporting the
`score` task. Additionally, `/rerank`, `/v1/rerank`, and `/v2/rerank`
endpoints are compatible with both [Jina AI's re-rank API interface](https://jina.ai/reranker/) and
[Cohere's re-rank API interface](https://docs.cohere.com/v2/reference/rerank) to ensure compatibility with
popular open-source tools.

Code example: <gh-file:examples/online_serving/jinaai_rerank_client.py>

#### Example Request

Note that the `top_n` request parameter is optional and will default to the length of the `documents` field.
Result documents will be sorted by relevance, and the `index` property can be used to determine original order.

Request:

```bash
curl -X 'POST' \
'http://127.0.0.1:8000/v1/rerank' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "BAAI/bge-reranker-base",
"query": "What is the capital of France?",
"documents": [
"The capital of Brazil is Brasilia.",
"The capital of France is Paris.",
"Horses and cows are both animals"
]
}'
```

Response:

```bash
{
"id": "rerank-fae51b2b664d4ed38f5969b612edff77",
"model": "BAAI/bge-reranker-base",
"usage": {
"total_tokens": 56
},
"results": [
{
"index": 1,
"document": {
"text": "The capital of France is Paris."
},
"relevance_score": 0.99853515625
},
{
"index": 0,
"document": {
"text": "The capital of Brazil is Brasilia."
},
"relevance_score": 0.0005860328674316406
}
]
}
```

#### Extra parameters

The following [pooling parameters](#pooling-params) are supported.

```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-rerank-pooling-params
:end-before: end-rerank-pooling-params
```

The following extra parameters are supported:

```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-rerank-extra-params
:end-before: end-rerank-extra-params
```
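For Python users, the same request as the curl example in the Re-rank API section above can be issued with the `requests` library. This is a minimal sketch; it assumes a local server started with `vllm serve BAAI/bge-reranker-base` listening on the default port, as in the example request:

```python
import requests

response = requests.post(
    "http://127.0.0.1:8000/v1/rerank",
    json={
        "model": "BAAI/bge-reranker-base",
        "query": "What is the capital of France?",
        "documents": [
            "The capital of Brazil is Brasilia.",
            "The capital of France is Paris.",
            "Horses and cows are both animals",
        ],
    },
)
response.raise_for_status()

for result in response.json()["results"]:
    # Results come back sorted by relevance; `index` points at the original
    # position in the `documents` list.
    print(result["index"], result["relevance_score"])
```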
33 changes: 32 additions & 1 deletion examples/offline_inference/openai/openai_batch.md
@@ -13,7 +13,7 @@ The OpenAI batch file format consists of a series of json objects on new lines.
 Each line represents a separate request. See the [OpenAI package reference](https://platform.openai.com/docs/api-reference/batch/requestInput) for more details.
 
 ```{note}
-We currently only support `/v1/chat/completions` and `/v1/embeddings` endpoints (completions coming soon).
+We currently support `/v1/chat/completions`, `/v1/embeddings`, and `/v1/score` endpoints (completions coming soon).
 ```
 
## Pre-requisites
@@ -203,3 +203,34 @@ $ cat results.jsonl
{"id":"vllm-db0f71f7dec244e6bce530e0b4ef908b","custom_id":"request-1","response":{"status_code":200,"request_id":"vllm-batch-3580bf4d4ae54d52b67eee266a6eab20","body":{"id":"embd-33ac2efa7996430184461f2e38529746","object":"list","created":444647,"model":"intfloat/e5-mistral-7b-instruct","data":[{"index":0,"object":"embedding","embedding":[0.016204833984375,0.0092010498046875,0.0018358230590820312,-0.0028228759765625,0.001422882080078125,-0.0031147003173828125,...]}],"usage":{"prompt_tokens":8,"total_tokens":8,"completion_tokens":0}}},"error":null}
...
```

## Example 5: Using score endpoint

### Additional prerequisites

* Ensure you are using `vllm >= 0.7.0`.

### Step 1: Create your batch file

Add score requests to your batch file. The following is an example:

```
{"custom_id": "request-1", "method": "POST", "url": "/v1/score", "body": {"model": "BAAI/bge-reranker-v2-m3", "text_1": "What is the capital of France?", "text_2": ["The capital of Brazil is Brasilia.", "The capital of France is Paris."]}}
{"custom_id": "request-2", "method": "POST", "url": "/v1/score", "body": {"model": "BAAI/bge-reranker-v2-m3", "text_1": "What is the capital of France?", "text_2": ["The capital of Brazil is Brasilia.", "The capital of France is Paris."]}}
```

You can mix chat completion, embedding, and score requests in the batch file, as long as the model you are using supports them all (note that all requests must use the same model).

### Step 2: Run the batch

You can run the batch using the same command as in earlier examples.

### Step 3: Check your results

You can check your results by running `cat results.jsonl`

```
$ cat results.jsonl
{"id":"vllm-f87c5c4539184f618e555744a2965987","custom_id":"request-1","response":{"status_code":200,"request_id":"vllm-batch-806ab64512e44071b37d3f7ccd291413","body":{"id":"score-4ee45236897b4d29907d49b01298cdb1","object":"list","created":1737847944,"model":"BAAI/bge-reranker-v2-m3","data":[{"index":0,"object":"score","score":0.0010900497436523438},{"index":1,"object":"score","score":1.0}],"usage":{"prompt_tokens":37,"total_tokens":37,"completion_tokens":0,"prompt_tokens_details":null}}},"error":null}
{"id":"vllm-41990c51a26d4fac8419077f12871099","custom_id":"request-2","response":{"status_code":200,"request_id":"vllm-batch-73ce66379026482699f81974e14e1e99","body":{"id":"score-13f2ffe6ba40460fbf9f7f00ad667d75","object":"list","created":1737847944,"model":"BAAI/bge-reranker-v2-m3","data":[{"index":0,"object":"score","score":0.001094818115234375},{"index":1,"object":"score","score":1.0}],"usage":{"prompt_tokens":37,"total_tokens":37,"completion_tokens":0,"prompt_tokens_details":null}}},"error":null}
```
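To post-process batch score results programmatically, the `results.jsonl` file can be parsed line by line. A minimal sketch based on the output format shown above:

```python
import json

with open("results.jsonl") as f:
    for line in f:
        result = json.loads(line)
        body = result["response"]["body"]
        # Each score request returns one entry per text_2 item, in order.
        scores = [item["score"] for item in body["data"]]
        print(result["custom_id"], scores)
```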
32 changes: 32 additions & 0 deletions examples/online_serving/cohere_rerank_client.py
@@ -0,0 +1,32 @@
"""
Example of using the OpenAI entrypoint's rerank API which is compatible with
the Cohere SDK: https://github.com/cohere-ai/cohere-python
run: vllm serve BAAI/bge-reranker-base
"""
import cohere

# cohere v1 client
co = cohere.Client(base_url="http://localhost:8000", api_key="sk-fake-key")
rerank_v1_result = co.rerank(
model="BAAI/bge-reranker-base",
query="What is the capital of France?",
documents=[
"The capital of France is Paris", "Reranking is fun!",
"vLLM is an open-source framework for fast AI serving"
])

print(rerank_v1_result)

# or the v2
co2 = cohere.ClientV2("sk-fake-key", base_url="http://localhost:8000")

v2_rerank_result = co2.rerank(
model="BAAI/bge-reranker-base",
query="What is the capital of France?",
documents=[
"The capital of France is Paris", "Reranking is fun!",
"vLLM is an open-source framework for fast AI serving"
])

print(v2_rerank_result)