
Fix/vllm dependency #3249

Merged
8 commits merged into master on Jul 16, 2024

Conversation

@mreso (Collaborator) commented Jul 16, 2024

Description

This PR adds vllm as a dependency of the LLM examples, covering both the manual installation path and the Dockerfile.llm image.

Fixes #3247 (vllm part)
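
For reference, a minimal sketch of the manual installation path is shown below. The exact pin introduced by this PR is not reproduced here; vllm==0.5.0 is assumed only because the test log further down reports "Initializing an LLM engine (v0.5.0)".

# Sketch of the manual setup (assumed commands, not taken from this PR).
# The vllm version is inferred from the "(v0.5.0)" line in the test log below.
python -m venv venv && . venv/bin/activate
pip install torchserve torch-model-archiver
pip install vllm==0.5.0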

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • This change requires a documentation update

Feature/Issue validation/testing

Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

  • Test A
docker build --pull . -f docker/Dockerfile.llm -t ts/llm
docker run --rm -ti --shm-size 1g --gpus all -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:8080 -v data:/data ts/llm --model_id meta-llama/Meta-Llama-3-8B-Instruct --disable_token_auth
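
A note on the flags (my reading, not stated in the PR text): --shm-size 1g gives vLLM's multiprocessing workers enough shared memory (cf. the "Increase shm size" commit in the timeline), HUGGING_FACE_HUB_TOKEN authorizes downloading the gated Meta-Llama-3 weights, and -v data:/data maps a host volume onto the download_dir (/data) reported in the engine config in the log.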

Log:

TorchServe is not currently running.
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
xpu-smi not available or failed: Cannot run program "xpu-smi": error=2, No such file or directory
2024-07-16T12:14:24,405 [WARN ] main org.pytorch.serve.util.ConfigManager - Your torchserve instance can access any URL to load models. When deploying to production, make sure to limit the set of allowed_urls in config.properties
2024-07-16T12:14:24,409 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2024-07-16T12:14:24,454 [INFO ] main org.pytorch.serve.metrics.configuration.MetricConfiguration - Successfully loaded metrics configuration from /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml
2024-07-16T12:14:24,568 [INFO ] main org.pytorch.serve.ModelServer -
Torchserve version: 0.11.0
TS Home: /home/venv/lib/python3.9/site-packages
Current directory: /home/model-server
Temp directory: /home/model-server/tmp
Metrics config path: /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml
Number of GPUs: 4
Number of CPUs: 48
Max heap size: 30688 M
Python executable: /home/venv/bin/python
Config file: config.properties
Inference address: http://0.0.0.0:8080
Management address: http://0.0.0.0:8081
Metrics address: http://0.0.0.0:8082
Model Store: /home/model-server/model_store
Initial Models: model
Log dir: /home/model-server/logs
Metrics dir: /home/model-server/logs
Netty threads: 32
Netty client threads: 0
Default workers per model: 4
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: false
Enable metrics API: true
Metrics mode: LOG
Disable system metrics: false
Workflow Store: /home/model-server/wf-store
CPP log config: N/A
Model config: N/A
System metrics command: default
Model API enabled: false
2024-07-16T12:14:24,574 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager -  Loading snapshot serializer plugin...
2024-07-16T12:14:24,589 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: model
2024-07-16T12:14:24,595 [INFO ] main org.pytorch.serve.archive.model.ModelArchive - createTempDir /home/model-server/tmp/models/89800c74e9514a2eabe182b171c3540f
2024-07-16T12:14:24,596 [INFO ] main org.pytorch.serve.archive.model.ModelArchive - createSymbolicDir /home/model-server/tmp/models/89800c74e9514a2eabe182b171c3540f/model
2024-07-16T12:14:24,605 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model model
2024-07-16T12:14:24,605 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model model
2024-07-16T12:14:24,605 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model model loaded.
2024-07-16T12:14:24,606 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: model, count: 1
2024-07-16T12:14:24,610 [DEBUG] W-9000-model_1.0 org.pytorch.serve.wlm.AsyncWorkerThread - Device Ids: null
2024-07-16T12:14:24,611 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2024-07-16T12:14:24,612 [DEBUG] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml, --async]
2024-07-16T12:14:24,656 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://0.0.0.0:8080
2024-07-16T12:14:24,656 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2024-07-16T12:14:24,657 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://0.0.0.0:8081
2024-07-16T12:14:24,658 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2024-07-16T12:14:24,658 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://0.0.0.0:8082
Model server started.
2024-07-16T12:14:24,829 [WARN ] pool-3-thread-1 org.pytorch.serve.metrics.MetricCollector - worker pid is not available yet.
2024-07-16T12:14:25,762 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9000, pid=85
2024-07-16T12:14:25,763 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2024-07-16T12:14:25,772 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Successfully loaded /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml.
2024-07-16T12:14:25,773 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - [PID]85
2024-07-16T12:14:25,773 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Torch worker started.
2024-07-16T12:14:25,773 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Python runtime: 3.9.19
2024-07-16T12:14:25,773 [DEBUG] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-model_1.0 State change null -> WORKER_STARTED
2024-07-16T12:14:25,777 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.AsyncWorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2024-07-16T12:14:25,789 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2024-07-16T12:14:25,790 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - handle_connection_async
2024-07-16T12:14:25,792 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.AsyncBatchAggregator - Getting requests from model: org.pytorch.serve.wlm.Model@4115ab4f
2024-07-16T12:14:25,792 [DEBUG] W-9000-model_1.0 org.pytorch.serve.wlm.AsyncBatchAggregator - Adding job to jobs: cf347c40-a758-4f00-af4b-1f8c9e598498
2024-07-16T12:14:25,792 [DEBUG] W-9000-model_1.0 org.pytorch.serve.wlm.AsyncWorkerThread - Flushing req.cmd LOAD repeats 1 to backend at: 1721132065792
2024-07-16T12:14:25,812 [DEBUG] W-9000-model_1.0 org.pytorch.serve.wlm.AsyncWorkerThread - Successfully flushed req
2024-07-16T12:14:25,812 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - model_name: model, batchSize: 1
2024-07-16T12:14:26,271 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:160ca5b308a9,timestamp:1721132066
2024-07-16T12:14:26,272 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:20.433353424072266|#Level:Host|#hostname:160ca5b308a9,timestamp:1721132066
2024-07-16T12:14:26,272 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:464.1805725097656|#Level:Host|#hostname:160ca5b308a9,timestamp:1721132066
2024-07-16T12:14:26,273 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:95.8|#Level:Host|#hostname:160ca5b308a9,timestamp:1721132066
2024-07-16T12:14:26,273 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:0.0|#Level:Host,DeviceId:0|#hostname:160ca5b308a9,timestamp:1721132066
2024-07-16T12:14:26,273 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:0.0|#Level:Host,DeviceId:0|#hostname:160ca5b308a9,timestamp:1721132066
2024-07-16T12:14:26,274 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:0.0|#Level:Host,DeviceId:1|#hostname:160ca5b308a9,timestamp:1721132066
2024-07-16T12:14:26,274 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:0.0|#Level:Host,DeviceId:1|#hostname:160ca5b308a9,timestamp:1721132066
2024-07-16T12:14:26,274 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:0.0|#Level:Host,DeviceId:2|#hostname:160ca5b308a9,timestamp:1721132066
2024-07-16T12:14:26,274 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:0.0|#Level:Host,DeviceId:2|#hostname:160ca5b308a9,timestamp:1721132066
2024-07-16T12:14:26,274 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:0.0|#Level:Host,DeviceId:3|#hostname:160ca5b308a9,timestamp:1721132066
2024-07-16T12:14:26,275 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:0.0|#Level:Host,DeviceId:3|#hostname:160ca5b308a9,timestamp:1721132066
2024-07-16T12:14:26,275 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:0.0|#Level:Host,DeviceId:0|#hostname:160ca5b308a9,timestamp:1721132066
2024-07-16T12:14:26,275 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:0.0|#Level:Host,DeviceId:1|#hostname:160ca5b308a9,timestamp:1721132066
2024-07-16T12:14:26,275 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:0.0|#Level:Host,DeviceId:2|#hostname:160ca5b308a9,timestamp:1721132066
2024-07-16T12:14:26,275 [INFO ] pool-3-thread-1 TS_METRICS - GPUUtilization.Percent:0.0|#Level:Host,DeviceId:3|#hostname:160ca5b308a9,timestamp:1721132066
2024-07-16T12:14:26,276 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:181211.515625|#Level:Host|#hostname:160ca5b308a9,timestamp:1721132066
2024-07-16T12:14:26,276 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:3235.33203125|#Level:Host|#hostname:160ca5b308a9,timestamp:1721132066
2024-07-16T12:14:26,276 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:2.6|#Level:Host|#hostname:160ca5b308a9,timestamp:1721132066
2024-07-16T12:14:27,576 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Enabled tensor cores
2024-07-16T12:14:27,577 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - OpenVINO is not enabled
2024-07-16T12:14:27,577 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - proceeding without onnxruntime
2024-07-16T12:14:27,577 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Torch TensorRT not enabled
2024-07-16T12:14:29,685 [WARN ] W-9000-model_1.0-stderr MODEL_LOG - 2024-07-16 12:14:29,685	INFO worker.py:1788 -- Started a local Ray instance.
2024-07-16T12:14:30,333 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 07-16 12:14:30 config.py:623] Defaulting to use mp for distributed inference
2024-07-16T12:14:30,335 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 07-16 12:14:30 llm_engine.py:161] Initializing an LLM engine (v0.5.0) with config: model='meta-llama/Meta-Llama-3-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir='/data', load_format=LoadFormat.AUTO, tensor_parallel_size=4, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=meta-llama/Meta-Llama-3-8B-Instruct)
2024-07-16T12:14:31,601 [WARN ] W-9000-model_1.0-stderr MODEL_LOG - Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-07-16T12:14:34,607 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=3146) INFO 07-16 12:14:34 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
2024-07-16T12:14:34,646 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=3147) INFO 07-16 12:14:34 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
2024-07-16T12:14:34,723 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=3148) INFO 07-16 12:14:34 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
2024-07-16T12:14:34,977 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 07-16 12:14:34 utils.py:623] Found nccl from library libnccl.so.2
2024-07-16T12:14:34,977 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=3146) INFO 07-16 12:14:34 utils.py:623] Found nccl from library libnccl.so.2
2024-07-16T12:14:34,978 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=3146) INFO 07-16 12:14:34 pynccl.py:65] vLLM is using nccl==2.20.5
2024-07-16T12:14:34,978 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=3147) INFO 07-16 12:14:34 utils.py:623] Found nccl from library libnccl.so.2
2024-07-16T12:14:34,978 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 07-16 12:14:34 pynccl.py:65] vLLM is using nccl==2.20.5
2024-07-16T12:14:34,978 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=3148) INFO 07-16 12:14:34 utils.py:623] Found nccl from library libnccl.so.2
2024-07-16T12:14:34,978 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=3147) INFO 07-16 12:14:34 pynccl.py:65] vLLM is using nccl==2.20.5
2024-07-16T12:14:34,979 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=3148) INFO 07-16 12:14:34 pynccl.py:65] vLLM is using nccl==2.20.5
2024-07-16T12:14:35,077 [WARN ] W-9000-model_1.0-stderr MODEL_LOG - Traceback (most recent call last):
2024-07-16T12:14:35,078 [WARN ] W-9000-model_1.0-stderr MODEL_LOG -   File "/usr/lib/python3.9/multiprocessing/resource_tracker.py", line 201, in main
2024-07-16T12:14:35,078 [WARN ] W-9000-model_1.0-stderr MODEL_LOG -     cache[rtype].remove(name)
2024-07-16T12:14:35,078 [WARN ] W-9000-model_1.0-stderr MODEL_LOG - KeyError: '/psm_78683ebe'
2024-07-16T12:14:35,078 [WARN ] W-9000-model_1.0-stderr MODEL_LOG - Traceback (most recent call last):
2024-07-16T12:14:35,078 [WARN ] W-9000-model_1.0-stderr MODEL_LOG -   File "/usr/lib/python3.9/multiprocessing/resource_tracker.py", line 201, in main
2024-07-16T12:14:35,079 [WARN ] W-9000-model_1.0-stderr MODEL_LOG -     cache[rtype].remove(name)
2024-07-16T12:14:35,079 [WARN ] W-9000-model_1.0-stderr MODEL_LOG - KeyError: '/psm_78683ebe'
2024-07-16T12:14:35,079 [WARN ] W-9000-model_1.0-stderr MODEL_LOG - Traceback (most recent call last):
2024-07-16T12:14:35,079 [WARN ] W-9000-model_1.0-stderr MODEL_LOG -   File "/usr/lib/python3.9/multiprocessing/resource_tracker.py", line 201, in main
2024-07-16T12:14:35,079 [WARN ] W-9000-model_1.0-stderr MODEL_LOG -     cache[rtype].remove(name)
2024-07-16T12:14:35,079 [WARN ] W-9000-model_1.0-stderr MODEL_LOG - KeyError: '/psm_78683ebe'
2024-07-16T12:14:35,079 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - WARNING 07-16 12:14:35 custom_all_reduce.py:170] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
2024-07-16T12:14:35,080 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=3148) WARNING 07-16 12:14:35 custom_all_reduce.py:170] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
2024-07-16T12:14:35,080 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=3146) WARNING 07-16 12:14:35 custom_all_reduce.py:170] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
2024-07-16T12:14:35,081 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=3147) WARNING 07-16 12:14:35 custom_all_reduce.py:170] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
2024-07-16T12:14:35,337 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 07-16 12:14:35 weight_utils.py:218] Using model weights format ['*.safetensors']
2024-07-16T12:14:35,403 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=3147) INFO 07-16 12:14:35 weight_utils.py:218] Using model weights format ['*.safetensors']
2024-07-16T12:14:35,407 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=3148) INFO 07-16 12:14:35 weight_utils.py:218] Using model weights format ['*.safetensors']
2024-07-16T12:14:35,430 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=3146) INFO 07-16 12:14:35 weight_utils.py:218] Using model weights format ['*.safetensors']
2024-07-16T12:14:36,486 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 07-16 12:14:36 model_runner.py:159] Loading model weights took 3.7417 GB
2024-07-16T12:14:36,764 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=3147) INFO 07-16 12:14:36 model_runner.py:159] Loading model weights took 3.7417 GB
2024-07-16T12:14:36,961 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=3148) INFO 07-16 12:14:36 model_runner.py:159] Loading model weights took 3.7417 GB
2024-07-16T12:14:37,171 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=3146) INFO 07-16 12:14:37 model_runner.py:159] Loading model weights took 3.7417 GB
2024-07-16T12:14:39,710 [WARN ] W-9000-model_1.0-stderr MODEL_LOG - (raylet) [2024-07-16 12:14:39,638 E 398 427] (raylet) file_system_monitor.cc:111: /tmp/ray/session_2024-07-16_12-14-27_828493_85 is over 95% full, available space: 21930299392; capacity: 520367017984. Object creation will fail if spilling is required.
2024-07-16T12:14:40,427 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 07-16 12:14:40 distributed_gpu_executor.py:56] # GPU blocks: 31159, # CPU blocks: 8192
2024-07-16T12:14:41,968 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=3146) INFO 07-16 12:14:41 model_runner.py:878] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
2024-07-16T12:14:41,969 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=3146) INFO 07-16 12:14:41 model_runner.py:882] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
2024-07-16T12:14:41,975 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=3147) INFO 07-16 12:14:41 model_runner.py:878] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
2024-07-16T12:14:41,975 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=3147) INFO 07-16 12:14:41 model_runner.py:882] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
2024-07-16T12:14:42,000 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=3148) INFO 07-16 12:14:42 model_runner.py:878] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
2024-07-16T12:14:42,001 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=3148) INFO 07-16 12:14:42 model_runner.py:882] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
2024-07-16T12:14:42,006 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 07-16 12:14:42 model_runner.py:878] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
2024-07-16T12:14:42,007 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 07-16 12:14:42 model_runner.py:882] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
2024-07-16T12:14:43,770 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=3148) INFO 07-16 12:14:43 model_runner.py:954] Graph capturing finished in 2 secs.
2024-07-16T12:14:43,771 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=3147) INFO 07-16 12:14:43 model_runner.py:954] Graph capturing finished in 2 secs.
2024-07-16T12:14:43,799 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - (VllmWorkerProcess pid=3146) INFO 07-16 12:14:43 model_runner.py:954] Graph capturing finished in 2 secs.
2024-07-16T12:14:43,811 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - INFO 07-16 12:14:43 model_runner.py:954] Graph capturing finished in 2 secs.
2024-07-16T12:14:43,899 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.AsyncBatchAggregator - Predictions is empty. This is from initial load....
2024-07-16T12:14:43,900 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.AsyncWorkerThread - Worker loaded the model successfully
2024-07-16T12:14:43,900 [DEBUG] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - W-9000-model_1.0 State change WORKER_STARTED -> WORKER_MODEL_LOADED
curl -X POST -d '{"prompt":"Hello, my name is", "max_new_tokens": 50}' --header "Content-Type: application/json" "http://localhost:8080/predictions/model"

Result

{"text": " Helen", "tokens": 43881}{"text": " and", "tokens": 323}{"text": " I", "tokens": 358}{"text": " am", "tokens": 1097}{"text": " a", "tokens": 264}{"text": " devoted", "tokens": 29329}{"text": " animal", "tokens": 10065}{"text": " lover", "tokens": 31657}{"text": " and", "tokens": 323}{"text": " wildlife", "tokens": 30405}{"text": " artist", "tokens": 10255}{"text": ".", "tokens": 13}{"text": " I", "tokens": 358}{"text": " have", "tokens": 617}{"text": " always", "tokens": 2744}{"text": " been", "tokens": 1027}

Checklist:

  • Did you have fun?
  • Have you made corresponding changes to the documentation?

@mreso marked this pull request as ready for review July 16, 2024 12:17
@mreso requested a review from agunapal July 16, 2024 12:18
Review thread on README.md (outdated, resolved)
Increase shm size
@agunapal (Collaborator) left a comment
LGTM

@agunapal added this pull request to the merge queue Jul 16, 2024
Merged via the queue into master with commit dc7e455 Jul 16, 2024
12 checks passed