
[CI/Build] Update CPU tests to include all "standard" tests #5481

Merged
29 commits merged into vllm-project:main on Nov 8, 2024

Conversation

DarkLight1337
Member

@DarkLight1337 DarkLight1337 commented Jun 13, 2024

This change should help catch issues related to VLMs that are specific to CPU (e.g. #5451, #7735, #8061).

Edit: Updated the list of related issues in light of recent PRs.

@Isotr0py
Collaborator

@DarkLight1337 I noticed that when running test_llava.py in a CPU environment, the hf_runner is locked to a single thread for the model forward pass (while the vllm_runner doesn't have this issue):

[screenshot]

I'm afraid that this will significantly slow down the test. What do you think about this?
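
For reference, a minimal way to inspect (and, if needed, override) the PyTorch thread count, assuming the single-thread behaviour comes from torch's intra-op thread settings rather than from the HF model code itself:

import os
import torch

# How many threads PyTorch will use for intra-op parallelism (matmuls, etc.)
print("intra-op threads:", torch.get_num_threads())
print("inter-op threads:", torch.get_num_interop_threads())

# Hypothetical workaround: pin the thread count to the number of available cores
# before running the HF forward pass.
torch.set_num_threads(os.cpu_count() or 1)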

@DarkLight1337
Member Author

@DarkLight1337 I noticed that when running test_llava.py in a CPU environment, the hf_runner is locked to a single thread for the model forward pass (while the vllm_runner doesn't have this issue): [screenshot]

I'm afraid that this will significantly slow down the test. What do you think about this?

Is this specific to the LLaVA model, or does it also occur for the other models? If it's the latter, then I think this change would have a relatively small impact compared to the baseline.

@Isotr0py
Collaborator

It looks like this is specific to vision models; it occurred in both the llava and phi3v tests.

I also tested test_models.py with phi-2, and the threads are handled normally:

[screenshot]

@DarkLight1337
Member Author

DarkLight1337 commented Jun 13, 2024

Hmm, this may be because the vision tower and multi-modal projector have not been optimized for vLLM yet. Let's wait for it to be implemented as described in #4194.

@DarkLight1337 DarkLight1337 marked this pull request as draft June 13, 2024 07:58
@zhouyuan
Contributor

@DarkLight1337 @Isotr0py Hi, thanks for looking into this. Initially, when enabling the CPU CI, I found there were some issues with the LLaVA CPU backend, so I disabled that part at first.

[screenshot]

@DarkLight1337
Member Author

DarkLight1337 commented Jun 13, 2024

@DarkLight1337 @Isotr0py Hi, thanks for looking into this. Initially, when enabling the CPU CI, I found there were some issues with the LLaVA CPU backend, so I disabled that part at first.

Hmm... maybe CPUModelRunner.device somehow does not match the device of some model parameters.
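
As a quick sanity check for that hypothesis, one could compare the runner's device against every parameter's device. This is only a sketch; model and expected_device stand in for the loaded model and CPUModelRunner.device:

from collections import Counter

def summarize_param_devices(model, expected_device):
    # Count parameters per device and collect any that are not on the expected one.
    per_device = Counter(str(p.device) for p in model.parameters())
    mismatched = {name: str(p.device)
                  for name, p in model.named_parameters()
                  if p.device != expected_device}
    return per_device, mismatched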

@zhouyuan
Contributor

@DarkLight1337 Hi, I just did a quick check locally. With the latest code, the llava test fails because of a result mismatch between float16 and bfloat16. The diff below can fix this issue.

diff --git a/tests/models/test_llava.py b/tests/models/test_llava.py
index cc0685ca..57a92a7a 100644
--- a/tests/models/test_llava.py
+++ b/tests/models/test_llava.py
@@ -76,9 +76,14 @@ def sanitize_vllm_output(vllm_output: Tuple[List[int], str],
     return sanitized_input_ids, sanitzied_output_str


+#TODO: remove this after CPU float16 support ready
+target_dtype = "float"
+if torch.cuda.is_available():
+    target_dtype = "half"
+
 @pytest.mark.parametrize("worker_use_ray", [False])
 @pytest.mark.parametrize("model_and_config", model_and_vl_config)
-@pytest.mark.parametrize("dtype", ["half"])
+@pytest.mark.parametrize("dtype", [target_dtype])
 @pytest.mark.parametrize("max_tokens", [128])
 def test_models(hf_runner, vllm_runner, hf_image_prompts, hf_images,
                 vllm_image_prompts, vllm_images, model_and_config, dtype: str,

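As a self-contained restatement of the same idea (the diff above assumes torch is already imported in the test module; the variable name matches the patch):

import torch

# Use float32 on CPU until float16 support is ready; keep half precision on GPU.
target_dtype = "half" if torch.cuda.is_available() else "float"
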
@DarkLight1337 DarkLight1337 changed the title from [CI/Build] Enable LLaVA test in CPU to [CI/Build] Enable LLaVA CPU test on Jun 14, 2024
@DarkLight1337
Member Author

DarkLight1337 commented Jun 14, 2024

According to the CI log, it currently takes 5-10 seconds for each LLaVA-1.5 iteration and 20-40 seconds for each LLaVA-NeXT iteration. This is much longer than for the other models, which take less than 2 seconds per iteration (you can verify this by searching for the s/it string output by tqdm).

@DarkLight1337
Member Author

#5591 has been merged. Let's see the performance now...

@DarkLight1337
Member Author

Getting this error:

Error in calling custom op gelu_quick: '_OpNamespace' '_C' object has no attribute 'gelu_quick'

Does the CPU test not recompile vLLM? @WoosukKwon
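
For anyone hitting the same error, a minimal check of whether the custom op is registered in the current build (a sketch, not part of this PR; it assumes vLLM was built from source so that importing vllm._C registers the ops under torch.ops._C):

import torch
import vllm._C  # noqa: F401 - importing the extension registers the custom ops

# False here means the installed extension predates the gelu_quick kernel
# and vLLM needs to be recompiled.
print(hasattr(torch.ops._C, "gelu_quick"))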

@ywang96
Member

ywang96 commented Jun 20, 2024

Getting this error:

Error in calling custom op gelu_quick: '_OpNamespace' '_C' object has no attribute 'gelu_quick'

Does the CPU test not recompile vLLM? @WoosukKwon

Hmm... gelu_quick was actually added in #5591 as well, though I'm not sure how to make it compatible with CPU.

Edit: I see activation.cpp under the cpu directory now; I will add to it.

@zhouyuan
Contributor

zhouyuan commented Jun 21, 2024 via email

Hi Roger, the patch in #5591 adds the CUDA kernel only - it should be OK to add the CPU-related kernel under csrc/cpu. We could also help to do this if required. CC @bigPYJ1151 Thanks, -yuan

@ywang96
Member

ywang96 commented Jun 21, 2024

@zhouyuan yea - I already made a PR #5717 and it's just waiting for review now

@DarkLight1337
Member Author

There is no observable speed increase so far. Perhaps the multi-modal projector also has to be optimized?

@DarkLight1337 DarkLight1337 changed the title from [CI/Build] Enable LLaVA CPU test to [CI/Build] Enable CPU test for VLMs on Jun 21, 2024
@ywang96
Member

ywang96 commented Jun 21, 2024

There is no observable speed increase so far. Perhaps the multi-modal projector also has to be optimized?

Hmm... yea - the other place to optimize is CLIPAttention itself. Right now it's still imported from transformers.
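
For context, the kind of optimization being discussed is replacing the eager attention math in the vision encoder with PyTorch's fused scaled_dot_product_attention kernel. A rough sketch of the core computation (projections and masking omitted; this is illustrative, not the actual vLLM implementation):

import torch
import torch.nn.functional as F

def clip_style_attention(q, k, v, num_heads):
    # q, k, v: (batch, seq_len, hidden). Split into heads and call the fused SDPA kernel,
    # which dispatches to an optimized CPU/GPU implementation instead of eager matmul + softmax.
    b, s, h = q.shape
    head_dim = h // num_heads
    q, k, v = (t.view(b, s, num_heads, head_dim).transpose(1, 2) for t in (q, k, v))
    out = F.scaled_dot_product_attention(q, k, v)  # no causal mask for a vision encoder
    return out.transpose(1, 2).reshape(b, s, h)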

@DarkLight1337
Member Author

DarkLight1337 commented Nov 7, 2024

Waiting for #10108

@DarkLight1337
Member Author

DarkLight1337 commented Nov 7, 2024

The test duration has gone up from 18 minutes to 30 minutes. Given that we currently merge 10-20 PRs per day, and assuming the AWS CI is triggered about 3x per commit (the minimum is 2x - once pre-merge and once post-merge - but it's unlikely that the CI passes on the first try after the ready label is added), we may accumulate a backlog if the CPU tests are only backed by a single agent.

@bigPYJ1151 Do you know whether it's possible to increase the number of agents to 2? Otherwise, I'll prune some tests from this PR.
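
To make the backlog concern concrete, a rough back-of-the-envelope calculation (15 PRs/day is just the midpoint of the 10-20 range above, and 3 CI runs is the per-commit assumption above):

prs_per_day = 15        # midpoint of 10-20 PRs per day
runs_per_merge = 3      # assumed CI triggers per merged change
minutes_per_run = 30    # current CPU test duration

demand = prs_per_day * runs_per_merge * minutes_per_run  # 1350 minutes of CPU-test time per day
capacity = 24 * 60                                        # 1440 minutes available to a single agent
print(demand, capacity)  # already close to saturation; at 20 PRs/day demand exceeds capacity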

@bigPYJ1151
Contributor

bigPYJ1151 commented Nov 8, 2024

@DarkLight1337 The hf_runner is too slow on CPU. I suggest only using a few small, typical VLM tests (e.g., qwen2_vl) and marking them with core_model_cpu. The main purpose is to verify the VLM support in cpu_model_runner and TorchSDPA, so some typical cases are enough.

It looks like some audio language model tests require chunked prefill; I will open a PR for that soon.

@DarkLight1337
Member Author

After removing the tests for unsupported models (involving embedding and chunked prefill), the test duration is down to 26 minutes, which should be OK for now.

@DarkLight1337
Member Author

@DarkLight1337 The hf_runner is too slow on CPU. I suggest only using a few small, typical VLM tests (e.g., qwen2_vl) and marking them with core_model_cpu. The main purpose is to verify the VLM support in cpu_model_runner and TorchSDPA, so some typical cases are enough.

It looks like some audio language model tests require chunked prefill; I will open a PR for that soon.

I have added the cpu_model tag to further trim down the tests. The latest run now takes only 21 minutes.
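
For context, the marker-based trimming looks roughly like the following in pytest (the cpu_model marker name matches this PR; the test body and selection command are only illustrative):

import pytest

@pytest.mark.cpu_model  # marker should be registered (e.g. in pyproject.toml) to avoid warnings
def test_small_vlm_generates_text():
    # Illustrative placeholder for one of the trimmed-down VLM tests.
    assert True

# The CPU CI can then select only the marked tests, e.g.:
#   pytest tests/models -m cpu_model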

@DarkLight1337
Member Author

@Isotr0py PTAL and see if this looks ok to you as well.

@Isotr0py Isotr0py (Collaborator) left a comment

LGTM!

@Isotr0py Isotr0py merged commit b489fc3 into vllm-project:main Nov 8, 2024
71 checks passed
@DarkLight1337 DarkLight1337 deleted the test-llava-cpu branch November 8, 2024 15:31
Commits referencing this pull request were subsequently pushed to the following forks: Isotr0py/vllm (Nov 8, 2024), omer-dayan/vllm (Nov 10, 2024), JC1DA/vllm (Nov 11, 2024), jeejeelee/vllm (Nov 11, 2024), rickyyx/vllm (Nov 13, 2024), sumitd2/vllm (Nov 14, 2024), KuntaiDu/vllm (Nov 20, 2024), mfournioux/vllm (Nov 20, 2024), neuralmagic/vllm (Nov 23, 2024), and sleepwalker2017/vllm (Dec 13, 2024).
Labels: ci/build, ready (ONLY add when PR is ready to merge/full CI is needed), x86 CPU
Projects: None yet
7 participants