[V1] AsyncLLM Implementation #9826
Merged: robertgshaw2-neuralmagic merged 175 commits into vllm-project:main from neuralmagic:rework-rs-proto on Nov 11, 2024
Commits (175), showing changes from all commits:
8f8662e
prototype
robertgshaw2-neuralmagic 01c4ca8
revert spurious 2.5 changes
robertgshaw2-neuralmagic 1ad8a48
stash
robertgshaw2-neuralmagic f9084f6
cleanup
robertgshaw2-neuralmagic 72bccd9
add MQLLMEnginev1
robertgshaw2-neuralmagic a6cab52
work with MQLLMEngine
robertgshaw2-neuralmagic 885ed16
format
robertgshaw2-neuralmagic 3ed66cf
cleanup formatting
robertgshaw2-neuralmagic 8ae8ce9
revert exmple change
robertgshaw2-neuralmagic 5c72515
update comment
robertgshaw2-neuralmagic f9b33fa
formatting
robertgshaw2-neuralmagic 82539b9
updated
robertgshaw2-neuralmagic d42a54e
stash
robertgshaw2-neuralmagic 3a2d02a
format
robertgshaw2-neuralmagic 6028ee1
Merge branch 'main' into rs-prototype-2
robertgshaw2-neuralmagic 6bd37c1
update
robertgshaw2-neuralmagic 196d822
revert bind/connect
robertgshaw2-neuralmagic a089cd1
revert comment
robertgshaw2-neuralmagic 974aa06
formatting
robertgshaw2-neuralmagic fe1e1b4
formatting tweaks
robertgshaw2-neuralmagic 9c27fbb
move detokenizer into engine
robertgshaw2-neuralmagic 95b5af1
format
robertgshaw2-neuralmagic 3999279
stash
robertgshaw2-neuralmagic b4dd571
revert bad import
robertgshaw2-neuralmagic f01f992
format
robertgshaw2-neuralmagic be333fa
format
robertgshaw2-neuralmagic aefb498
add files
robertgshaw2-neuralmagic 6d7f473
stash
robertgshaw2-neuralmagic f431f8a
update
robertgshaw2-neuralmagic be431e4
update
robertgshaw2-neuralmagic 36b7fa5
fix api client example to work with v1
robertgshaw2-neuralmagic 3a5ce74
formatting
robertgshaw2-neuralmagic 0d0251e
updated
robertgshaw2-neuralmagic 046d78f
update
robertgshaw2-neuralmagic 34c0665
update
robertgshaw2-neuralmagic 52b790f
stash
robertgshaw2-neuralmagic 4f9a86e
Stash
robertgshaw2-neuralmagic 697b98f
stash
robertgshaw2-neuralmagic fa5c01d
LLMEngineWorking
robertgshaw2-neuralmagic 0ca42d8
format
robertgshaw2-neuralmagic b6497d5
updated
robertgshaw2-neuralmagic ae88c73
updated
robertgshaw2-neuralmagic 2161152
update
robertgshaw2-neuralmagic 6a57297
aded processor
robertgshaw2-neuralmagic 3665602
udpated
robertgshaw2-neuralmagic ed567ca
updated
robertgshaw2-neuralmagic f4005da
updated formats
robertgshaw2-neuralmagic 67a53ed
revert
robertgshaw2-neuralmagic 458b54f
finished
robertgshaw2-neuralmagic 75ff707
updated
robertgshaw2-neuralmagic 669648f
split core process into separate class
njhill 127f09c
stash
robertgshaw2-neuralmagic 99f683e
Merge pull request #22 from njhill/rework-splitcore
robertgshaw2-neuralmagic dc6163c
updated
robertgshaw2-neuralmagic d21cb8f
updated
robertgshaw2-neuralmagic 565ffa6
working again
robertgshaw2-neuralmagic 2960fbc
format
robertgshaw2-neuralmagic 5d23709
updated
robertgshaw2-neuralmagic f2f2e40
updated
robertgshaw2-neuralmagic c10c9d8
better interface
robertgshaw2-neuralmagic b8767a9
formatting
robertgshaw2-neuralmagic ab783e1
format
robertgshaw2-neuralmagic 423f47d
update
robertgshaw2-neuralmagic 7c977d3
updated
robertgshaw2-neuralmagic 3c14bdf
format
robertgshaw2-neuralmagic 2ff6fb4
update
robertgshaw2-neuralmagic 8e4fb05
make incremental detokenization OO, handle stop strings
njhill a155dd8
fix new_char_count
njhill 5bb79f1
Address @varun-sundar-rabindranath's comment
njhill cdcb746
don't reuse RequestOutput objects
njhill 70c8344
fix finish_reason string
njhill ed8ef9d
Merge pull request #23 from njhill/stop-strings
robertgshaw2-neuralmagic 2c90b6f
working with OpenAI client
robertgshaw2-neuralmagic 1f5bf42
fix
robertgshaw2-neuralmagic 2681105
remove protocol
robertgshaw2-neuralmagic dcecba6
actually remove protocol
robertgshaw2-neuralmagic c9b8aeb
stash
robertgshaw2-neuralmagic 8809228
Merge branch 'main' into rework-rs-proto
robertgshaw2-neuralmagic 54f930d
updated
robertgshaw2-neuralmagic 341479a
added core client
robertgshaw2-neuralmagic 33e58ab
updated
robertgshaw2-neuralmagic abbaa57
remove
robertgshaw2-neuralmagic d4acc95
update
robertgshaw2-neuralmagic 5e3b0f2
updated
robertgshaw2-neuralmagic 4159ee3
setup env vars for disable
robertgshaw2-neuralmagic 05e7523
revert changes to example
robertgshaw2-neuralmagic ae7769d
revert changes to api_server.py
robertgshaw2-neuralmagic 237e648
further cleanup
robertgshaw2-neuralmagic 489f3cb
fixed issue
robertgshaw2-neuralmagic e100f06
updated
robertgshaw2-neuralmagic 9e5f9e0
updated
robertgshaw2-neuralmagic 103011b
updated
robertgshaw2-neuralmagic 2ea9355
added missing
robertgshaw2-neuralmagic bb1a75b
update
robertgshaw2-neuralmagic 981d3db
don't do zmq i/o on critical path
njhill a904bad
also move ser/deser into separate threads
njhill e3014e2
Merge pull request #25 from njhill/overlap_io
robertgshaw2-neuralmagic 8e811f2
updated
robertgshaw2-neuralmagic d9bdb1d
updated
robertgshaw2-neuralmagic f30cd0c
make both false
robertgshaw2-neuralmagic 7bb4aad
update to both true
robertgshaw2-neuralmagic 0529e8d
added detokenizer test
robertgshaw2-neuralmagic 14ecf11
testing
robertgshaw2-neuralmagic a49e769
undo changes
robertgshaw2-neuralmagic 6dad3fc
updated
robertgshaw2-neuralmagic ab3c63e
updated
robertgshaw2-neuralmagic 38127fc
updated
robertgshaw2-neuralmagic 3ad8684
added engine core test
robertgshaw2-neuralmagic 35580c6
Plumb aborts
d517017
remove unused import
ac7b8a7
Merge pull request #24 from neuralmagic/varun/9826-stop-strings
robertgshaw2-neuralmagic 2d61a3f
update test cases
robertgshaw2-neuralmagic c671231
cleanup
robertgshaw2-neuralmagic 8548a75
fixed!
robertgshaw2-neuralmagic be8540c
added test case
robertgshaw2-neuralmagic fa128cc
test engine core
robertgshaw2-neuralmagic 777b49b
updated
robertgshaw2-neuralmagic b4ed9c3
updated
robertgshaw2-neuralmagic b3c9c06
updated
robertgshaw2-neuralmagic cfa1e58
updated
robertgshaw2-neuralmagic 8a43bd1
add test coverage for EngineCoreClientAsync
robertgshaw2-neuralmagic 0b6651f
updated
robertgshaw2-neuralmagic f3f8e56
updated
robertgshaw2-neuralmagic 248e936
updated
robertgshaw2-neuralmagic 01a6bc2
stash
robertgshaw2-neuralmagic 1a51ffa
added load test
robertgshaw2-neuralmagic 1650b38
Merge branch 'main' into rework-rs-proto
robertgshaw2-neuralmagic 61c239e
merged
robertgshaw2-neuralmagic ca6f09b
format
robertgshaw2-neuralmagic 71e6a11
updated
robertgshaw2-neuralmagic 222062d
mypy
robertgshaw2-neuralmagic 203989b
switch env var to the opposite
robertgshaw2-neuralmagic f354867
updated
robertgshaw2-neuralmagic 71ef0a7
fix mypy
robertgshaw2-neuralmagic 31aee9f
removed polling
robertgshaw2-neuralmagic 2d59f50
camel case
robertgshaw2-neuralmagic 8f2efe1
Update vllm/v1/engine/core.py
robertgshaw2-neuralmagic 6a45772
updated
robertgshaw2-neuralmagic 4da787f
Merge branch 'rework-rs-proto' of https://github.com/neuralmagic/vllm…
robertgshaw2-neuralmagic 90cba0f
Merge branch 'main' into rework-rs-proto
robertgshaw2-neuralmagic 9350d5a
updated
robertgshaw2-neuralmagic 7d3c114
skip v1 tests on non-cuda
robertgshaw2-neuralmagic cf5e63c
format
robertgshaw2-neuralmagic 11498be
updated
robertgshaw2-neuralmagic e5414b4
Various updates
njhill 1591996
Merge pull request #28 from njhill/v1/updates
robertgshaw2-neuralmagic d0587d4
added e2e accuracy test
robertgshaw2-neuralmagic 6f346ae
correctness testing with asyncllmengine
robertgshaw2-neuralmagic 1d1b5ca
added entrypoints LLM test
robertgshaw2-neuralmagic 7d6bba4
passing end-to-end tests
robertgshaw2-neuralmagic 9fab540
fix accuracy test
robertgshaw2-neuralmagic 2db0494
fix for TPU
robertgshaw2-neuralmagic ec9ba3d
added basic logging
robertgshaw2-neuralmagic 1b18a8b
stashing current state
robertgshaw2-neuralmagic 068295f
format
robertgshaw2-neuralmagic 2024ad9
updated
robertgshaw2-neuralmagic 1e651ee
Merge branch 'main' into rework-rs-proto
robertgshaw2-neuralmagic e4916dd
merged
robertgshaw2-neuralmagic dabb89d
updated
robertgshaw2-neuralmagic 34659cb
remove prints
robertgshaw2-neuralmagic 41e5735
updated
robertgshaw2-neuralmagic f799854
updated
robertgshaw2-neuralmagic b2e5554
merged
robertgshaw2-neuralmagic 9715943
update
robertgshaw2-neuralmagic 7a8c9e2
Fix cancellation propagation, log req completion consistently
njhill e9c77f8
Merge pull request #30 from njhill/v1/fix-cancel
njhill 10d0e48
fix import
njhill 957cc1e
don't use NOBLOCK in zmq socket ops
njhill 9bd8ac6
move v1 tests
njhill 2f7bb4d
Merge remote-tracking branch 'refs/remotes/origin/main' into rework-r…
njhill 8ddd7a0
revert unrelated test changes inadvertently committed
njhill a903a56
Merge remote-tracking branch 'origin/main' into rework-rs-proto
njhill 2fb76d9
Update tests/v1/engine/test_engine_core.py
robertgshaw2-neuralmagic 8c47b3c
Merge remote-tracking branch 'refs/remotes/origin/main' into rework-r…
njhill 7cb08b7
Address some minor review comments
njhill
New file (56 lines): end-to-end accuracy test of the vLLM server via LM Eval.
""" | ||
This file test accuracy of the vLLM server via LMEval. | ||
It uses local-completions, which interacts with vLLM | ||
through the OAI API with N concurrent connections. | ||
This simulates real work usage of the API and makes | ||
sure that the zmq frontend mp RPC message passing and | ||
AsyncLLMEngine are working correctly. | ||
""" | ||
|
||
import lm_eval | ||
import pytest | ||
|
||
from vllm.platforms import current_platform | ||
|
||
MODEL_NAME = "Qwen/Qwen2-1.5B-Instruct" | ||
NUM_CONCURRENT = 500 | ||
TASK = "gsm8k" | ||
FILTER = "exact_match,strict-match" | ||
RTOL = 0.03 | ||
EXPECTED_VALUE = 0.58 | ||
|
||
|
||
def run_test(): | ||
"""Run the end to end accuracy test.""" | ||
|
||
model_args = f"pretrained={MODEL_NAME},max_model_len=2048" | ||
|
||
results = lm_eval.simple_evaluate( | ||
model="vllm", | ||
model_args=model_args, | ||
tasks="gsm8k", | ||
batch_size="auto", | ||
) | ||
|
||
measured_value = results["results"][TASK][FILTER] | ||
assert (measured_value - RTOL < EXPECTED_VALUE | ||
and measured_value + RTOL > EXPECTED_VALUE | ||
), f"Expected: {EXPECTED_VALUE} | Measured: {measured_value}" | ||
|
||
|
||
@pytest.mark.skipif(not current_platform.is_cuda(), | ||
reason="V1 is currently only supported on CUDA.") | ||
def test_lm_eval_accuracy_v1_engine(monkeypatch): | ||
"""Run with the V1 Engine.""" | ||
|
||
with monkeypatch.context() as m: | ||
m.setenv("VLLM_USE_V1", "1") | ||
run_test() | ||
|
||
|
||
def test_lm_eval_accuracy_v0_engine(monkeypatch): | ||
"""Run with the V0 Engine.""" | ||
|
||
with monkeypatch.context() as m: | ||
m.setenv("VLLM_USE_V1", "0") | ||
run_test() |
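The docstring above mentions LM Eval's local-completions backend; for reference, a minimal hedged sketch of driving it against a separately launched OpenAI-compatible vLLM server (the port, concurrency value, and exact model_args keys are assumptions, not taken from this PR):

```python
import lm_eval

# Assumes a vLLM OpenAI-compatible server is already running, e.g.:
#   vllm serve Qwen/Qwen2-1.5B-Instruct --port 8000
results = lm_eval.simple_evaluate(
    model="local-completions",
    model_args=("model=Qwen/Qwen2-1.5B-Instruct,"
                "base_url=http://127.0.0.1:8000/v1/completions,"
                "num_concurrent=500,tokenized_requests=False"),
    tasks="gsm8k",
)
print(results["results"]["gsm8k"]["exact_match,strict-match"])
```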
File renamed without changes.
Empty file.
New file (66 lines): AsyncLLM load test.
import asyncio
from typing import Tuple

import pytest

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.platforms import current_platform
from vllm.v1.engine.async_llm import AsyncLLM

if not current_platform.is_cuda():
    pytest.skip(reason="V1 currently only supported on CUDA.",
                allow_module_level=True)

ENGINE_ARGS = AsyncEngineArgs(model="meta-llama/Llama-3.2-1B",
                              disable_log_requests=True)


async def generate(engine: AsyncLLM, request_id: str,
                   max_tokens: int) -> Tuple[int, str]:
    count = 0
    async for _ in engine.generate(request_id=request_id,
                                   prompt="Hello my name is Robert and",
                                   sampling_params=SamplingParams(
                                       max_tokens=max_tokens, temperature=0)):

        count += 1
        await asyncio.sleep(0.)

    return count, request_id


@pytest.mark.asyncio
async def test_load(monkeypatch):
    with monkeypatch.context() as m:
        m.setenv("VLLM_USE_V1", "1")

        engine = AsyncLLM.from_engine_args(ENGINE_ARGS)

        NUM_REQUESTS = 10000
        NUM_EXPECTED_TOKENS = 10

        request_ids = [f"request-{i}" for i in range(NUM_REQUESTS)]

        # Create concurrent requests.
        tasks = []
        for request_id in request_ids:
            tasks.append(
                asyncio.create_task(
                    generate(engine, request_id, NUM_EXPECTED_TOKENS)))

        # Confirm that we got all the EXPECTED tokens from the requests.
        failed_request_id = None
        tokens = None
        for task in tasks:
            num_generated_tokens, request_id = await task
            if (num_generated_tokens != NUM_EXPECTED_TOKENS
                    and failed_request_id is None):
                failed_request_id = request_id
                tokens = num_generated_tokens

        assert failed_request_id is None, (
            f"{failed_request_id} generated {tokens} but "
            f"expected {NUM_EXPECTED_TOKENS}")

        engine.shutdown()
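Side note on the fan-out pattern above: the per-task await loop could also be written with asyncio.gather. A hedged sketch, reusing the generate helper from the test above (run_load is a hypothetical name, not part of this PR):

```python
import asyncio


async def run_load(engine, num_requests: int, expected_tokens: int):
    # Fan out all requests concurrently and collect (count, request_id) pairs.
    results = await asyncio.gather(*(
        generate(engine, f"request-{i}", expected_tokens)
        for i in range(num_requests)))

    # Every request should have produced exactly the expected token count.
    bad = [(request_id, count) for count, request_id in results
           if count != expected_tokens]
    assert not bad, f"Requests with unexpected token counts: {bad}"
```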
New file (205 lines): Detokenizer tests for incremental detokenization and stop strings.
from typing import List

import pytest
from transformers import AutoTokenizer

from vllm.sampling_params import RequestOutputKind
from vllm.v1.engine import EngineCoreOutput
from vllm.v1.engine.detokenizer import Detokenizer, DetokenizerRequest

TOKENIZER_NAME = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME)

FULL_STRINGS = [
    "My name is Robert from Neural Magic and I love working on vLLM so much!",
    "Red Hat is the best open source company by far across Linux, K8s, and AI.",
    "Nick is the name of my brother in addition to my colleague from Red Hat.",
]

STOP_STRINGS = ["I love working on", "company by far", "brother in"]

FULL_TOKENS = [tokenizer(text).input_ids for text in FULL_STRINGS]
PROMPT_LEN = 5
PROMPT_TOKENS = [
    tokenizer(text).input_ids[:PROMPT_LEN] for text in FULL_STRINGS
]
GENERATION_TOKENS = [
    tokenizer(text).input_ids[PROMPT_LEN:] for text in FULL_STRINGS
]
PROMPT_STRINGS = [
    tokenizer.decode(prompt_tokens, skip_special_tokens=True)
    for prompt_tokens in PROMPT_TOKENS
]
PROMPT_STRINGS_LEN = [len(prompt_string) for prompt_string in PROMPT_STRINGS]
GENERATION_STRINGS = [
    text[prompt_len:]
    for text, prompt_len in zip(FULL_STRINGS, PROMPT_STRINGS_LEN)
]


class MockEngineCore:
    """Mock outputs from premade token lists."""

    def __init__(self, tokens_list: List[List[int]]):
        self.tokens_list = tokens_list
        self.current_idx = 0

    def get_outputs(self) -> List[EngineCoreOutput]:
        token_idx = self.current_idx
        self.current_idx += 1

        outputs = []
        for req_idx, token_ids in enumerate(self.tokens_list):
            if len(token_ids) > token_idx:
                output = EngineCoreOutput(request_id=f"request-{req_idx}",
                                          new_token_ids=[token_ids[token_idx]],
                                          finished=False)
                if token_idx == len(token_ids) - 1:
                    output.finished = True
                    output.finish_reason = "stopped"
                outputs.append(output)

        return outputs


@pytest.mark.parametrize(
    "request_output_kind",
    [RequestOutputKind.DELTA, RequestOutputKind.FINAL_ONLY])
def test_incremental_detokenization(request_output_kind: RequestOutputKind):
    detokenizer = Detokenizer(TOKENIZER_NAME)
    engine_core = MockEngineCore(GENERATION_TOKENS)

    # Make N requests.
    requests = [
        DetokenizerRequest(
            request_id=f"request-{idx}",
            prompt=prompt,
            prompt_token_ids=prompt_tokens,
            skip_special_tokens=False,
            spaces_between_special_tokens=False,
            output_kind=request_output_kind,
            stop=[],
            include_stop_str_in_output=False,
        ) for idx, (
            prompt,
            prompt_tokens) in enumerate(zip(PROMPT_STRINGS, PROMPT_TOKENS))
    ]

    # Add requests to the detokenizer.
    for request in requests:
        detokenizer.add_request(request)

    gen_strings = {}
    gen_tokens = {}
    while True:
        # Mock output from the EngineCore.
        outputs = engine_core.get_outputs()
        if len(outputs) == 0:
            break

        # Step the Detokenizer.
        request_outputs, requests_to_abort = detokenizer.step(outputs)
        assert len(requests_to_abort) == 0

        # Update tracking.
        for request_output in request_outputs:
            request_id = request_output.request_id
            new_text = request_output.outputs[0].text
            new_tokens = request_output.outputs[0].token_ids
            if request_id not in gen_strings:
                gen_strings[request_id] = new_text
                gen_tokens[request_id] = new_tokens
            else:
                gen_strings[request_id] += new_text
                gen_tokens[request_id].extend(new_tokens)

    # Confirm that the tracked values match what we expect.
    for idx, (ref_gen_str, ref_gen_toks) in enumerate(
            zip(GENERATION_STRINGS, GENERATION_TOKENS)):
        gen_str = gen_strings[f"request-{idx}"]
        gen_toks = gen_tokens[f"request-{idx}"]

        assert gen_str == ref_gen_str, f"{gen_str=}, {ref_gen_str=}"
        assert gen_toks == ref_gen_toks, f"{gen_toks=}, {ref_gen_toks=}"

    assert detokenizer.get_num_unfinished_requests() == 0
    assert not detokenizer.has_unfinished_requests()


@pytest.mark.parametrize("include_stop_str_in_output", [True, False])
def test_stop_string(include_stop_str_in_output: bool):
    detokenizer = Detokenizer(TOKENIZER_NAME)
    engine_core = MockEngineCore(GENERATION_TOKENS)

    # Make N requests.
    requests = [
        DetokenizerRequest(
            request_id=f"request-{idx}",
            prompt=prompt,
            prompt_token_ids=prompt_tokens,
            skip_special_tokens=False,
            spaces_between_special_tokens=False,
            output_kind=RequestOutputKind.DELTA,
            stop=STOP_STRINGS,
            include_stop_str_in_output=include_stop_str_in_output,
        ) for idx, (
            prompt,
            prompt_tokens) in enumerate(zip(PROMPT_STRINGS, PROMPT_TOKENS))
    ]

    # Add requests to the detokenizer.
    for request in requests:
        detokenizer.add_request(request)

    gen_strings = {}
    aborted = []
    while True:
        # Mock output from the EngineCore.
        outputs = engine_core.get_outputs()
        if len(outputs) == 0:
            break

        # Step the Detokenizer.
        request_outputs, requests_to_abort = detokenizer.step(outputs)
        for request_output in request_outputs:
            # If aborted, we should not get a request output.
            assert request_output.request_id not in aborted
        aborted.extend(requests_to_abort)

        # Update tracking.
        for request_output in request_outputs:
            if request_output.finished:
                assert request_output.outputs[0].finish_reason == "stop"

            request_id = request_output.request_id
            new_text = request_output.outputs[0].text
            if request_id not in gen_strings:
                gen_strings[request_id] = new_text
            else:
                gen_strings[request_id] += new_text

    # Confirm that the tracked values match what we expect.
    for idx, (ref_gen_str,
              stop_str) in enumerate(zip(GENERATION_STRINGS, STOP_STRINGS)):

        # Request should be aborted.
        request_id = f"request-{idx}"
        assert request_id in aborted

        # Collected values that were generated.
        gen_str = gen_strings[request_id]

        # Construct reference strings.
        stop_str_idx = ref_gen_str.find(stop_str)
        ref_str_exc_stop = ref_gen_str[:stop_str_idx]
        ref_str_inc_stop = ref_gen_str[:stop_str_idx] + stop_str

        if include_stop_str_in_output:
            assert gen_str == ref_str_inc_stop, (
                f"{gen_str=}, {ref_str_inc_stop=}")
        else:
            assert gen_str == ref_str_exc_stop, (
                f"{gen_str=}, {ref_str_exc_stop=}")

    assert detokenizer.get_num_unfinished_requests() == 0
    assert not detokenizer.has_unfinished_requests()
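For intuition, the reference-string construction at the end of test_stop_string mirrors the truncation a detokenizer performs when a stop string appears in the generated text. A standalone sketch of that logic (plain Python, independent of the vLLM Detokenizer API; truncate_at_stop is a hypothetical helper):

```python
from typing import List


def truncate_at_stop(text: str, stop_strings: List[str],
                     include_stop_str: bool) -> str:
    """Cut `text` at the first matching stop string, if any."""
    for stop in stop_strings:
        idx = text.find(stop)
        if idx != -1:
            return text[:idx + len(stop)] if include_stop_str else text[:idx]
    return text


# Example with one of the stop strings used in the test above.
assert truncate_at_stop("... and I love working on vLLM so much!",
                        ["I love working on"],
                        include_stop_str=False) == "... and "
```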
Review discussion:
Definitely out of scope for this PR, but regarding the follow-up point: we should be able to reuse all the existing tests since the interfaces are the same, right? It'd just be a matter of hooking up a fixture to set the appropriate v1 environment variables and making sure we're initializing the engine under test with a method that returns the appropriate one. I'd be happy to make that my project for the next few days while y'all focus on building the fast stuff.
Yes, but there are features (e.g. logprobs) that need to land before we can turn on the existing unit tests. Additionally, some of the internal classes that are used for testing (e.g. the scheduler) are now in the engine core, so we cannot access them directly from llm_engine. So a lot of the tests may need refactoring given these changes.
Ah, yeah, of course. There's a @skip_v1 mark for tests of unsupported features. Oof, yeah, those sound like some unideal nosy tests; I'll have to look more at them.
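For reference, a minimal sketch of the fixture idea discussed above (the fixture name engine_version and the sample test are hypothetical; only the VLLM_USE_V1 variable comes from this PR):

```python
import pytest


@pytest.fixture(params=["0", "1"], ids=["v0-engine", "v1-engine"])
def engine_version(request, monkeypatch):
    """Run each decorated test twice, once per engine, via VLLM_USE_V1."""
    monkeypatch.setenv("VLLM_USE_V1", request.param)
    return request.param


def test_existing_behavior(engine_version):
    # An existing test body would go here unchanged; vLLM reads
    # VLLM_USE_V1 when the engine under test is constructed.
    ...
```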