
[V1] AsyncLLM Implementation #9826

Merged · 175 commits · Nov 11, 2024

Commits
8f8662e
prototype
robertgshaw2-neuralmagic Oct 26, 2024
01c4ca8
revert spurious 2.5 changes
robertgshaw2-neuralmagic Oct 26, 2024
1ad8a48
stash
robertgshaw2-neuralmagic Oct 26, 2024
f9084f6
cleanup
robertgshaw2-neuralmagic Oct 26, 2024
72bccd9
add MQLLMEnginev1
robertgshaw2-neuralmagic Oct 26, 2024
a6cab52
work with MQLLMEngine
robertgshaw2-neuralmagic Oct 27, 2024
885ed16
format
robertgshaw2-neuralmagic Oct 27, 2024
3ed66cf
cleanup formatting
robertgshaw2-neuralmagic Oct 27, 2024
8ae8ce9
revert exmple change
robertgshaw2-neuralmagic Oct 27, 2024
5c72515
update comment
robertgshaw2-neuralmagic Oct 27, 2024
f9b33fa
formatting
robertgshaw2-neuralmagic Oct 27, 2024
82539b9
updated
robertgshaw2-neuralmagic Oct 27, 2024
d42a54e
stash
robertgshaw2-neuralmagic Oct 27, 2024
3a2d02a
format
robertgshaw2-neuralmagic Oct 27, 2024
6028ee1
Merge branch 'main' into rs-prototype-2
robertgshaw2-neuralmagic Oct 27, 2024
6bd37c1
update
robertgshaw2-neuralmagic Oct 27, 2024
196d822
revert bind/connect
robertgshaw2-neuralmagic Oct 27, 2024
a089cd1
revert comment
robertgshaw2-neuralmagic Oct 27, 2024
974aa06
formatting
robertgshaw2-neuralmagic Oct 27, 2024
fe1e1b4
formatting tweaks
robertgshaw2-neuralmagic Oct 27, 2024
9c27fbb
move detokenizer into engine
robertgshaw2-neuralmagic Oct 27, 2024
95b5af1
format
robertgshaw2-neuralmagic Oct 27, 2024
3999279
stash
robertgshaw2-neuralmagic Oct 27, 2024
b4dd571
revert bad import
robertgshaw2-neuralmagic Oct 27, 2024
f01f992
format
robertgshaw2-neuralmagic Oct 28, 2024
be333fa
format
robertgshaw2-neuralmagic Oct 28, 2024
aefb498
add files
robertgshaw2-neuralmagic Oct 28, 2024
6d7f473
stash
robertgshaw2-neuralmagic Oct 28, 2024
f431f8a
update
robertgshaw2-neuralmagic Oct 29, 2024
be431e4
update
robertgshaw2-neuralmagic Oct 29, 2024
36b7fa5
fix api client example to work with v1
robertgshaw2-neuralmagic Oct 29, 2024
3a5ce74
formatting
robertgshaw2-neuralmagic Oct 29, 2024
0d0251e
updated
robertgshaw2-neuralmagic Oct 29, 2024
046d78f
update
robertgshaw2-neuralmagic Oct 29, 2024
34c0665
update
robertgshaw2-neuralmagic Oct 29, 2024
52b790f
stash
robertgshaw2-neuralmagic Oct 30, 2024
4f9a86e
Stash
robertgshaw2-neuralmagic Oct 30, 2024
697b98f
stash
robertgshaw2-neuralmagic Oct 30, 2024
fa5c01d
LLMEngineWorking
robertgshaw2-neuralmagic Oct 30, 2024
0ca42d8
format
robertgshaw2-neuralmagic Oct 30, 2024
b6497d5
updated
robertgshaw2-neuralmagic Oct 30, 2024
ae88c73
updated
robertgshaw2-neuralmagic Oct 30, 2024
2161152
update
robertgshaw2-neuralmagic Oct 31, 2024
6a57297
aded processor
robertgshaw2-neuralmagic Oct 31, 2024
3665602
udpated
robertgshaw2-neuralmagic Oct 31, 2024
ed567ca
updated
robertgshaw2-neuralmagic Oct 31, 2024
f4005da
updated formats
robertgshaw2-neuralmagic Oct 31, 2024
67a53ed
revert
robertgshaw2-neuralmagic Oct 31, 2024
458b54f
finished
robertgshaw2-neuralmagic Oct 31, 2024
75ff707
updated
robertgshaw2-neuralmagic Oct 31, 2024
669648f
split core process into separate class
njhill Oct 31, 2024
127f09c
stash
robertgshaw2-neuralmagic Oct 31, 2024
99f683e
Merge pull request #22 from njhill/rework-splitcore
robertgshaw2-neuralmagic Oct 31, 2024
dc6163c
updated
robertgshaw2-neuralmagic Oct 31, 2024
d21cb8f
updated
robertgshaw2-neuralmagic Oct 31, 2024
565ffa6
working again
robertgshaw2-neuralmagic Oct 31, 2024
2960fbc
format
robertgshaw2-neuralmagic Oct 31, 2024
5d23709
updated
robertgshaw2-neuralmagic Oct 31, 2024
f2f2e40
updated
robertgshaw2-neuralmagic Oct 31, 2024
c10c9d8
better interface
robertgshaw2-neuralmagic Oct 31, 2024
b8767a9
formatting
robertgshaw2-neuralmagic Oct 31, 2024
ab783e1
format
robertgshaw2-neuralmagic Oct 31, 2024
423f47d
update
robertgshaw2-neuralmagic Oct 31, 2024
7c977d3
updated
robertgshaw2-neuralmagic Nov 1, 2024
3c14bdf
format
robertgshaw2-neuralmagic Nov 1, 2024
2ff6fb4
update
robertgshaw2-neuralmagic Nov 1, 2024
8e4fb05
make incremental detokenization OO, handle stop strings
njhill Oct 31, 2024
a155dd8
fix new_char_count
njhill Oct 31, 2024
5bb79f1
Address @varun-sundar-rabindranath's comment
njhill Nov 1, 2024
cdcb746
don't reuse RequestOutput objects
njhill Nov 1, 2024
70c8344
fix finish_reason string
njhill Nov 1, 2024
ed8ef9d
Merge pull request #23 from njhill/stop-strings
robertgshaw2-neuralmagic Nov 2, 2024
2c90b6f
working with OpenAI client
robertgshaw2-neuralmagic Nov 2, 2024
1f5bf42
fix
robertgshaw2-neuralmagic Nov 2, 2024
2681105
remove protocol
robertgshaw2-neuralmagic Nov 2, 2024
dcecba6
actually remove protocol
robertgshaw2-neuralmagic Nov 2, 2024
c9b8aeb
stash
robertgshaw2-neuralmagic Nov 2, 2024
8809228
Merge branch 'main' into rework-rs-proto
robertgshaw2-neuralmagic Nov 2, 2024
54f930d
updated
robertgshaw2-neuralmagic Nov 2, 2024
341479a
added core client
robertgshaw2-neuralmagic Nov 2, 2024
33e58ab
updated
robertgshaw2-neuralmagic Nov 2, 2024
abbaa57
remove
robertgshaw2-neuralmagic Nov 2, 2024
d4acc95
update
robertgshaw2-neuralmagic Nov 2, 2024
5e3b0f2
updated
robertgshaw2-neuralmagic Nov 2, 2024
4159ee3
setup env vars for disable
robertgshaw2-neuralmagic Nov 2, 2024
05e7523
revert changes to example
robertgshaw2-neuralmagic Nov 2, 2024
ae7769d
revert changes to api_server.py
robertgshaw2-neuralmagic Nov 2, 2024
237e648
further cleanup
robertgshaw2-neuralmagic Nov 2, 2024
489f3cb
fixed issue
robertgshaw2-neuralmagic Nov 2, 2024
e100f06
updated
robertgshaw2-neuralmagic Nov 2, 2024
9e5f9e0
updated
robertgshaw2-neuralmagic Nov 2, 2024
103011b
updated
robertgshaw2-neuralmagic Nov 2, 2024
2ea9355
added missing
robertgshaw2-neuralmagic Nov 2, 2024
bb1a75b
update
robertgshaw2-neuralmagic Nov 2, 2024
981d3db
don't do zmq i/o on critical path
njhill Nov 2, 2024
a904bad
also move ser/deser into separate threads
njhill Nov 2, 2024
e3014e2
Merge pull request #25 from njhill/overlap_io
robertgshaw2-neuralmagic Nov 4, 2024
8e811f2
updated
robertgshaw2-neuralmagic Nov 4, 2024
d9bdb1d
updated
robertgshaw2-neuralmagic Nov 4, 2024
f30cd0c
make both false
robertgshaw2-neuralmagic Nov 4, 2024
7bb4aad
update to both true
robertgshaw2-neuralmagic Nov 4, 2024
0529e8d
added detokenizer test
robertgshaw2-neuralmagic Nov 5, 2024
14ecf11
testing
robertgshaw2-neuralmagic Nov 5, 2024
a49e769
undo changes
robertgshaw2-neuralmagic Nov 5, 2024
6dad3fc
updated
robertgshaw2-neuralmagic Nov 5, 2024
ab3c63e
updated
robertgshaw2-neuralmagic Nov 5, 2024
38127fc
updated
robertgshaw2-neuralmagic Nov 5, 2024
3ad8684
added engine core test
robertgshaw2-neuralmagic Nov 5, 2024
35580c6
Plumb aborts
Nov 3, 2024
d517017
remove unused import
Nov 5, 2024
ac7b8a7
Merge pull request #24 from neuralmagic/varun/9826-stop-strings
robertgshaw2-neuralmagic Nov 5, 2024
2d61a3f
update test cases
robertgshaw2-neuralmagic Nov 5, 2024
c671231
cleanup
robertgshaw2-neuralmagic Nov 5, 2024
8548a75
fixed!
robertgshaw2-neuralmagic Nov 5, 2024
be8540c
added test case
robertgshaw2-neuralmagic Nov 5, 2024
fa128cc
test engine core
robertgshaw2-neuralmagic Nov 5, 2024
777b49b
updated
robertgshaw2-neuralmagic Nov 5, 2024
b4ed9c3
updated
robertgshaw2-neuralmagic Nov 5, 2024
b3c9c06
updated
robertgshaw2-neuralmagic Nov 5, 2024
cfa1e58
updated
robertgshaw2-neuralmagic Nov 5, 2024
8a43bd1
add test coverage for EngineCoreClientAsync
robertgshaw2-neuralmagic Nov 5, 2024
0b6651f
updated
robertgshaw2-neuralmagic Nov 5, 2024
f3f8e56
updated
robertgshaw2-neuralmagic Nov 5, 2024
248e936
updated
robertgshaw2-neuralmagic Nov 5, 2024
01a6bc2
stash
robertgshaw2-neuralmagic Nov 5, 2024
1a51ffa
added load test
robertgshaw2-neuralmagic Nov 5, 2024
1650b38
Merge branch 'main' into rework-rs-proto
robertgshaw2-neuralmagic Nov 5, 2024
61c239e
merged
robertgshaw2-neuralmagic Nov 5, 2024
ca6f09b
format
robertgshaw2-neuralmagic Nov 5, 2024
71e6a11
updated
robertgshaw2-neuralmagic Nov 5, 2024
222062d
mypy
robertgshaw2-neuralmagic Nov 5, 2024
203989b
switch env var to the opposite
robertgshaw2-neuralmagic Nov 5, 2024
f354867
updated
robertgshaw2-neuralmagic Nov 6, 2024
71ef0a7
fix mypy
robertgshaw2-neuralmagic Nov 6, 2024
31aee9f
removed polling
robertgshaw2-neuralmagic Nov 6, 2024
2d59f50
camel case
robertgshaw2-neuralmagic Nov 6, 2024
8f2efe1
Update vllm/v1/engine/core.py
robertgshaw2-neuralmagic Nov 6, 2024
6a45772
updated
robertgshaw2-neuralmagic Nov 6, 2024
4da787f
Merge branch 'rework-rs-proto' of https://github.com/neuralmagic/vllm…
robertgshaw2-neuralmagic Nov 6, 2024
90cba0f
Merge branch 'main' into rework-rs-proto
robertgshaw2-neuralmagic Nov 6, 2024
9350d5a
updated
robertgshaw2-neuralmagic Nov 6, 2024
7d3c114
skip v1 tests on non-cuda
robertgshaw2-neuralmagic Nov 6, 2024
cf5e63c
format
robertgshaw2-neuralmagic Nov 6, 2024
11498be
updated
robertgshaw2-neuralmagic Nov 6, 2024
e5414b4
Various updates
njhill Nov 6, 2024
1591996
Merge pull request #28 from njhill/v1/updates
robertgshaw2-neuralmagic Nov 6, 2024
d0587d4
added e2e accuracy test
robertgshaw2-neuralmagic Nov 7, 2024
6f346ae
correctness testing with asyncllmengine
robertgshaw2-neuralmagic Nov 7, 2024
1d1b5ca
added entrypoints LLM test
robertgshaw2-neuralmagic Nov 7, 2024
7d6bba4
passing end-to-end tests
robertgshaw2-neuralmagic Nov 7, 2024
9fab540
fix accuracy test
robertgshaw2-neuralmagic Nov 7, 2024
2db0494
fix for TPU
robertgshaw2-neuralmagic Nov 7, 2024
ec9ba3d
added basic logging
robertgshaw2-neuralmagic Nov 7, 2024
1b18a8b
stashing current state
robertgshaw2-neuralmagic Nov 7, 2024
068295f
format
robertgshaw2-neuralmagic Nov 7, 2024
2024ad9
updated
robertgshaw2-neuralmagic Nov 7, 2024
1e651ee
Merge branch 'main' into rework-rs-proto
robertgshaw2-neuralmagic Nov 7, 2024
e4916dd
merged
robertgshaw2-neuralmagic Nov 7, 2024
dabb89d
updated
robertgshaw2-neuralmagic Nov 7, 2024
34659cb
remove prints
robertgshaw2-neuralmagic Nov 7, 2024
41e5735
updated
robertgshaw2-neuralmagic Nov 7, 2024
f799854
updated
robertgshaw2-neuralmagic Nov 7, 2024
b2e5554
merged
robertgshaw2-neuralmagic Nov 7, 2024
9715943
update
robertgshaw2-neuralmagic Nov 7, 2024
7a8c9e2
Fix cancellation propagation, log req completion consistently
njhill Nov 7, 2024
e9c77f8
Merge pull request #30 from njhill/v1/fix-cancel
njhill Nov 7, 2024
10d0e48
fix import
njhill Nov 7, 2024
957cc1e
don't use NOBLOCK in zmq socket ops
njhill Nov 8, 2024
9bd8ac6
move v1 tests
njhill Nov 8, 2024
2f7bb4d
Merge remote-tracking branch 'refs/remotes/origin/main' into rework-r…
njhill Nov 8, 2024
8ddd7a0
revert unrelated test changes inadvertently committed
njhill Nov 8, 2024
a903a56
Merge remote-tracking branch 'origin/main' into rework-rs-proto
njhill Nov 8, 2024
2fb76d9
Update tests/v1/engine/test_engine_core.py
robertgshaw2-neuralmagic Nov 8, 2024
8c47b3c
Merge remote-tracking branch 'refs/remotes/origin/main' into rework-r…
njhill Nov 11, 2024
7cb08b7
Address some minor review comments
njhill Nov 11, 2024
8 changes: 8 additions & 0 deletions .buildkite/test-pipeline.yaml
@@ -165,6 +165,14 @@ steps:
# OOM in the CI unless we run this separately
- pytest -v -s tokenization

- label: V1 Test
#mirror_hardwares: [amd]
source_file_dependencies:
- vllm/
- tests/v1
commands:
- pytest -v -s v1

- label: Examples Test # 15min
working_dir: "/vllm-workspace/examples"
#mirror_hardwares: [amd]
56 changes: 56 additions & 0 deletions tests/entrypoints/llm/test_accuracy.py
@@ -0,0 +1,56 @@
"""
This file test accuracy of the vLLM server via LMEval.
It uses local-completions, which interacts with vLLM
through the OAI API with N concurrent connections.
This simulates real work usage of the API and makes
sure that the zmq frontend mp RPC message passing and
AsyncLLMEngine are working correctly.
"""

import lm_eval
import pytest

from vllm.platforms import current_platform

MODEL_NAME = "Qwen/Qwen2-1.5B-Instruct"
NUM_CONCURRENT = 500
TASK = "gsm8k"
FILTER = "exact_match,strict-match"
RTOL = 0.03
EXPECTED_VALUE = 0.58


def run_test():
"""Run the end to end accuracy test."""

model_args = f"pretrained={MODEL_NAME},max_model_len=2048"

results = lm_eval.simple_evaluate(
model="vllm",
model_args=model_args,
tasks="gsm8k",
batch_size="auto",
)

measured_value = results["results"][TASK][FILTER]
assert (measured_value - RTOL < EXPECTED_VALUE
and measured_value + RTOL > EXPECTED_VALUE
), f"Expected: {EXPECTED_VALUE} | Measured: {measured_value}"


@pytest.mark.skipif(not current_platform.is_cuda(),
reason="V1 is currently only supported on CUDA.")
def test_lm_eval_accuracy_v1_engine(monkeypatch):
"""Run with the V1 Engine."""

with monkeypatch.context() as m:
m.setenv("VLLM_USE_V1", "1")
run_test()


def test_lm_eval_accuracy_v0_engine(monkeypatch):
"""Run with the V0 Engine."""

with monkeypatch.context() as m:
m.setenv("VLLM_USE_V1", "0")
run_test()
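
For context on the flow the docstring above describes: a rough sketch of evaluating through an already-running, OpenAI-compatible vLLM server with lm_eval's local-completions backend. The server address, port, and concurrency values below are illustrative assumptions, not part of this PR.

    import lm_eval

    # Hypothetical sketch: lm_eval drives a running vLLM server over the
    # OpenAI completions API with N concurrent connections, exercising the
    # zmq frontend and AsyncLLMEngine under load.
    results = lm_eval.simple_evaluate(
        model="local-completions",
        model_args=("model=Qwen/Qwen2-1.5B-Instruct,"
                    "base_url=http://localhost:8000/v1/completions,"
                    "num_concurrent=500"),
        tasks="gsm8k",
    )
    print(results["results"]["gsm8k"]["exact_match,strict-match"])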
25 changes: 22 additions & 3 deletions tests/entrypoints/openai/test_accuracy.py
@@ -37,11 +37,11 @@
MAX_WAIT_SECONDS = 600


@pytest.mark.parametrize("more_args", MORE_ARGS_LIST)
def test_lm_eval_accuracy(more_args):
def run_test(more_args):
"""Run the end to end accuracy test."""

args = list(DEFAULT_ARGS)
args.extend(more_args)

print(f"Running with: {args}")

with RemoteOpenAIServer(
@@ -64,3 +64,22 @@ def test_lm_eval_accuracy(more_args):
assert (measured_value - RTOL < EXPECTED_VALUE
and measured_value + RTOL > EXPECTED_VALUE
), f"Expected: {EXPECTED_VALUE} | Measured: {measured_value}"


@pytest.mark.skipif(not current_platform.is_cuda(),
reason="V1 currently only supported on CUDA")
def test_lm_eval_accuracy_v1_engine(monkeypatch):
"""Run with the V1 Engine."""

with monkeypatch.context() as m:
m.setenv("VLLM_USE_V1", "1")
run_test([])


@pytest.mark.parametrize("more_args", MORE_ARGS_LIST)
def test_lm_eval_accuracy_v0_engine(monkeypatch, more_args):
"""Run with the V0 Engine."""

with monkeypatch.context() as m:
m.setenv("VLLM_USE_V1", "0")
run_test(more_args)
File renamed without changes.
Empty file added tests/v1/engine/__init__.py
66 changes: 66 additions & 0 deletions tests/v1/engine/test_async_llm.py
@@ -0,0 +1,66 @@
import asyncio
from typing import Tuple

import pytest

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.platforms import current_platform
from vllm.v1.engine.async_llm import AsyncLLM

if not current_platform.is_cuda():
pytest.skip(reason="V1 currently only supported on CUDA.",
allow_module_level=True)

ENGINE_ARGS = AsyncEngineArgs(model="meta-llama/Llama-3.2-1B",
disable_log_requests=True)


async def generate(engine: AsyncLLM, request_id: str,
max_tokens: int) -> Tuple[int, str]:
count = 0
async for _ in engine.generate(request_id=request_id,
prompt="Hello my name is Robert and",
sampling_params=SamplingParams(
max_tokens=max_tokens, temperature=0)):

count += 1
await asyncio.sleep(0.)

return count, request_id


@pytest.mark.asyncio
async def test_load(monkeypatch):
with monkeypatch.context() as m:
m.setenv("VLLM_USE_V1", "1")

engine = AsyncLLM.from_engine_args(ENGINE_ARGS)
Contributor:

Definitely out of scope for this PR, but regarding the follow-up point

    More AsyncLLM and LLMEngine tests (abort, stop string, other unit)

we should be able to reuse all the existing tests since the interfaces are the same, right? It'd just be a matter of hooking up a fixture to set the appropriate V1 environment variables and making sure we're initializing the engine under test with a method that returns the appropriate one. I'd be happy to make that my project for the next few days while y'all focus on building the fast stuff.

Collaborator (Author):

Yes, but there are features (e.g. logprobs) that need to land before we can turn on the existing unit tests. Additionally, some of the internal classes that are used for testing (e.g. the scheduler) are now in engine core, so we cannot access them directly from llm_engine. So a lot of the tests may need refactoring given these changes.

Contributor:

Ah, yeah, of course. There's a @skip_v1 mark for tests of unsupported features.

    some of the internal classes that are used for testing (e.g. the scheduler) are now in engine core, so we cannot access them directly from llm_engine

Oof, yeah, those sound like some unideal nosy tests; I'll have to look more at them.


NUM_REQUESTS = 10000
NUM_EXPECTED_TOKENS = 10

request_ids = [f"request-{i}" for i in range(NUM_REQUESTS)]

# Create concurrent requests.
tasks = []
for request_id in request_ids:
tasks.append(
asyncio.create_task(
generate(engine, request_id, NUM_EXPECTED_TOKENS)))

# Confirm that we got all the EXPECTED tokens from the requests.
failed_request_id = None
tokens = None
for task in tasks:
num_generated_tokens, request_id = await task
if (num_generated_tokens != NUM_EXPECTED_TOKENS
and failed_request_id is None):
failed_request_id = request_id
tokens = num_generated_tokens

assert failed_request_id is None, (
f"{failed_request_id} generated {tokens} but "
f"expected {NUM_EXPECTED_TOKENS}")

engine.shutdown()
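
Following up on the review thread above, a minimal sketch of the suggested fixture for running existing tests against both engines. The fixture name and parametrization are assumptions, not part of this PR.

    import pytest

    # Hypothetical fixture sketch: run an existing test module against both
    # engines by toggling VLLM_USE_V1 before the engine under test is built.
    @pytest.fixture(params=["0", "1"], ids=["engine-v0", "engine-v1"])
    def use_v1(request, monkeypatch):
        monkeypatch.setenv("VLLM_USE_V1", request.param)
        return request.param == "1"

    def test_generate(use_v1):
        ...  # construct the engine here; it sees the VLLM_USE_V1 setting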
205 changes: 205 additions & 0 deletions tests/v1/engine/test_detokenizer.py
@@ -0,0 +1,205 @@
from typing import List

import pytest
from transformers import AutoTokenizer

from vllm.sampling_params import RequestOutputKind
from vllm.v1.engine import EngineCoreOutput
from vllm.v1.engine.detokenizer import Detokenizer, DetokenizerRequest

TOKENIZER_NAME = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME)

FULL_STRINGS = [
"My name is Robert from Neural Magic and I love working on vLLM so much!",
"Red Hat is the best open source company by far across Linux, K8s, and AI.",
"Nick is the name of my brother in addition to my colleague from Red Hat.",
]

STOP_STRINGS = ["I love working on", "company by far", "brother in"]

FULL_TOKENS = [tokenizer(text).input_ids for text in FULL_STRINGS]
PROMPT_LEN = 5
PROMPT_TOKENS = [
tokenizer(text).input_ids[:PROMPT_LEN] for text in FULL_STRINGS
]
GENERATION_TOKENS = [
tokenizer(text).input_ids[PROMPT_LEN:] for text in FULL_STRINGS
]
PROMPT_STRINGS = [
tokenizer.decode(prompt_tokens, skip_special_tokens=True)
for prompt_tokens in PROMPT_TOKENS
]
PROMPT_STRINGS_LEN = [len(prompt_string) for prompt_string in PROMPT_STRINGS]
GENERATION_STRINGS = [
text[prompt_len:]
for text, prompt_len in zip(FULL_STRINGS, PROMPT_STRINGS_LEN)
]


class MockEngineCore:
"""Mock outputs form premade tokens lists."""

def __init__(self, tokens_list: List[List[int]]):
self.tokens_list = tokens_list
self.current_idx = 0

def get_outputs(self) -> List[EngineCoreOutput]:
token_idx = self.current_idx
self.current_idx += 1

outputs = []
for req_idx, token_ids in enumerate(self.tokens_list):
if len(token_ids) > token_idx:
output = EngineCoreOutput(request_id=f"request-{req_idx}",
new_token_ids=[token_ids[token_idx]],
finished=False)
if token_idx == len(token_ids) - 1:
output.finished = True
output.finish_reason = "stopped"
outputs.append(output)

return outputs


@pytest.mark.parametrize(
"request_output_kind",
[RequestOutputKind.DELTA, RequestOutputKind.FINAL_ONLY])
def test_incremental_detokenization(request_output_kind: RequestOutputKind):
detokenizer = Detokenizer(TOKENIZER_NAME)
engine_core = MockEngineCore(GENERATION_TOKENS)

# Make N requests.
requests = [
DetokenizerRequest(
request_id=f"request-{idx}",
prompt=prompt,
prompt_token_ids=prompt_tokens,
skip_special_tokens=False,
spaces_between_special_tokens=False,
output_kind=request_output_kind,
stop=[],
include_stop_str_in_output=False,
) for idx, (
prompt,
prompt_tokens) in enumerate(zip(PROMPT_STRINGS, PROMPT_TOKENS))
]

# Add requests to the detokenizer.
for request in requests:
detokenizer.add_request(request)

gen_strings = {}
gen_tokens = {}
while True:
# Mock output from the EngineCore.
outputs = engine_core.get_outputs()
if len(outputs) == 0:
break

# Step the Detokenizer.
request_outputs, requests_to_abort = detokenizer.step(outputs)
assert len(requests_to_abort) == 0

# Update tracking.
for request_output in request_outputs:
request_id = request_output.request_id
new_text = request_output.outputs[0].text
new_tokens = request_output.outputs[0].token_ids
if request_id not in gen_strings:
gen_strings[request_id] = new_text
gen_tokens[request_id] = new_tokens
else:
gen_strings[request_id] += new_text
gen_tokens[request_id].extend(new_tokens)

# Confirm that the tracked values match what we expected.
for idx, (ref_gen_str, ref_gen_toks) in enumerate(
zip(GENERATION_STRINGS, GENERATION_TOKENS)):
gen_str = gen_strings[f"request-{idx}"]
gen_toks = gen_tokens[f"request-{idx}"]

assert gen_str == ref_gen_str, f"{gen_str=}, {ref_gen_str=}"
assert gen_toks == ref_gen_toks, f"{gen_toks=}, {ref_gen_toks=}"

assert detokenizer.get_num_unfinished_requests() == 0
assert not detokenizer.has_unfinished_requests()


@pytest.mark.parametrize("include_stop_str_in_output", [True, False])
def test_stop_string(include_stop_str_in_output: bool):
detokenizer = Detokenizer(TOKENIZER_NAME)
engine_core = MockEngineCore(GENERATION_TOKENS)

# Make N requests.
requests = [
DetokenizerRequest(
request_id=f"request-{idx}",
prompt=prompt,
prompt_token_ids=prompt_tokens,
skip_special_tokens=False,
spaces_between_special_tokens=False,
output_kind=RequestOutputKind.DELTA,
stop=STOP_STRINGS,
include_stop_str_in_output=include_stop_str_in_output,
) for idx, (
prompt,
prompt_tokens) in enumerate(zip(PROMPT_STRINGS, PROMPT_TOKENS))
]

# Add requests to the detokenizer.
for request in requests:
detokenizer.add_request(request)

gen_strings = {}
aborted = []
while True:
# Mock output from the EngineCore.
outputs = engine_core.get_outputs()
if len(outputs) == 0:
break

# Step the Detokenizer.
request_outputs, requests_to_abort = detokenizer.step(outputs)
for request_output in request_outputs:
# If aborted, we should not get a request output.
assert request_output.request_id not in aborted
aborted.extend(requests_to_abort)

# Update tracking.
for request_output in request_outputs:
if request_output.finished:
assert request_output.outputs[0].finish_reason == "stop"

request_id = request_output.request_id
new_text = request_output.outputs[0].text
if request_id not in gen_strings:
gen_strings[request_id] = new_text
else:
gen_strings[request_id] += new_text

# Confirm that the tracked values match what we expected.
for idx, (ref_gen_str,
stop_str) in enumerate(zip(GENERATION_STRINGS, STOP_STRINGS)):

# Request should be aborted.
request_id = f"request-{idx}"
assert request_id in aborted

# Collected values that were generated.
gen_str = gen_strings[request_id]

# Construct reference strings.
stop_str_idx = ref_gen_str.find(stop_str)
ref_str_exc_stop = ref_gen_str[:stop_str_idx]
ref_str_inc_stop = ref_gen_str[:stop_str_idx] + stop_str

if include_stop_str_in_output:
assert gen_str == ref_str_inc_stop, (
f"{gen_str=}, {ref_str_inc_stop=}")
else:
assert gen_str == ref_str_exc_stop, (
f"{gen_str=}, {ref_str_exc_stop=}")

assert detokenizer.get_num_unfinished_requests() == 0
assert not detokenizer.has_unfinished_requests()