
[Bug Regression] segfault in turbomind for OpenGVLab/InternVL2-Llama3-76B and OpenGVLab/InternVL-Chat-V1-5 #2164

Closed
pseudotensor opened this issue Jul 26, 2024 · 12 comments

pseudotensor commented Jul 26, 2024

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

Text handling by the model is fine, but any image request leads to a crash.

Reproduction

Just make any single-image request as one would for other vision models. This is sufficient to cause a crash every time:

from openai import OpenAI

client = OpenAI(base_url='http://<fill_IP>/v1', api_key='EMPTY')  # dummy key: the OpenAI client requires one even when the server doesn't check it


from PIL import Image
import base64
import requests
from io import BytesIO


# The encoding function I linked previously - but we actually don't use this function in the API server
def encode_image_base64(image: Image.Image, format: str = 'JPEG') -> str:
    """encode image to base64 format."""

    buffered = BytesIO()
    if format == 'JPEG':
        image = image.convert('RGB')
    image.save(buffered, format)
    return base64.b64encode(buffered.getvalue()).decode('utf-8')


# load image from url
url = "https://h2o-release.s3.amazonaws.com/h2ogpt/bigben.jpg"
image = Image.open(BytesIO(requests.get(url).content))

# correct way to encode an image from url
response = requests.get(url)
base64_correct = base64.b64encode(response.content).decode('utf-8')

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What do you see?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": 'data:image/jpeg;base64,' + base64_correct,
                },
            },
        ],
    }
]

response = client.chat.completions.create(
    model="OpenGVLab/InternVL2-Llama3-76B",
    messages=messages,
    temperature=0.0,
    max_tokens=300,
)

print(response.choices[0])

I haven't used the latest lmdeploy with other models like InternVL-Chat-V1-5 that work fine with the older version, so it's possible those are broken too. I'll try InternVL-Chat-V1-5 to see if lmdeploy is generally broken.

Environment

Using the latest docker image, with an extra build for the vision dependencies, on 4×H100.

docker stop internvl2_llama3_76b_lmdeploy ; docker remove internvl2_llama3_76b_lmdeploy
docker run -d --restart=always --runtime nvidia --gpus '"device=0,1,2,3"' \
    -v $HOME/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN" \
    -p 23343:23333 \
    --ipc=host \
    --name internvl2_llama3_76b_lmdeploy \
    internvlmain2 \
    lmdeploy serve api_server OpenGVLab/InternVL2-Llama3-76B \
    --tp 4 \
    --model-name OpenGVLab/InternVL2-Llama3-76B

See how the docker image is built here: #2163

Error traceback

INFO:     172.16.0.83:34664 - "POST /v1/completions HTTP/1.1" 200 OK
terminate called after throwing an instance of 'std::runtime_error'
  what():  [TM][ERROR] CUDA runtime error: CUBLAS_STATUS_EXECUTION_FAILED /opt/lmdeploy/src/turbomind/utils/cublasMMWrapper.cc:307 

terminate called recursively
terminate called recursively
terminate called recursively
[dfe606afa87e:1    :0:414] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
INFO:     172.16.0.83:14234 - "GET /health HTTP/1.1" 200 OK

pseudotensor commented Jul 26, 2024

If I try pytorch backend, I get on startup:

language_model.model.layers.79.input_layernorm.weight:  54%|?2024-07-26 20:45:57,609 - lmdeploy - ERROR - RuntimeError: Internal Triton PTX codegen error: 
ptxas fatal   : Value 'sm_90a' is not defined for option 'gpu-name'

2024-07-26 20:45:57,609 - lmdeploy - ERROR - <Triton> test failed!
Please ensure it has been installed correctly.

Same for the InternVL-Chat-V1-5 model.

My nvidia-smi output, if helpful:

Fri Jul 26 20:51:43 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000000:0F:00.0 Off |                    0 |
| N/A   31C    P0             119W / 700W |  67915MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  | 00000000:2D:00.0 Off |                    0 |
| N/A   36C    P0             120W / 700W |  68135MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  | 00000000:44:00.0 Off |                    0 |
| N/A   33C    P0             124W / 700W |  68135MiB / 81559MiB |      1%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  | 00000000:5B:00.0 Off |                    0 |
| N/A   37C    P0             124W / 700W |  67557MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  | 00000000:89:00.0 Off |                    0 |
| N/A   31C    P0             114W / 700W |  71661MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  | 00000000:A8:00.0 Off |                    0 |
| N/A   36C    P0             118W / 700W |  78617MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  | 00000000:C0:00.0 Off |                    0 |
| N/A   37C    P0             123W / 700W |  80371MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  | 00000000:D8:00.0 Off |                   On |
| N/A   32C    P0             120W / 700W |  63356MiB / 81559MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  7    1   0   0  |           32519MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
|                  |               3MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  7    2   0   1  |           30837MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
|                  |               3MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   2083468      C   /opt/py38/bin/python3                     67908MiB |
|    1   N/A  N/A   2083468      C   /opt/py38/bin/python3                     68128MiB |
|    2   N/A  N/A   2083468      C   /opt/py38/bin/python3                     68128MiB |
|    3   N/A  N/A   2083468      C   /opt/py38/bin/python3                     67550MiB |
|    4   N/A  N/A   1172218      C   python3                                   71642MiB |
|    5   N/A  N/A   1171696      C   python3                                   78600MiB |
|    6   N/A  N/A   1181117      C   /opt/py38/bin/python3                     80364MiB |
|    7    1    0    1465874      C   python                                    32484MiB |
|    7    2    0    1475745      C   python3                                   30802MiB |
+---------------------------------------------------------------------------------------+

This is after the model is already loaded on GPUs 0-3.

But some suggest CUDA 11.8 is fine for sm_90: https://discuss.pytorch.org/t/cuda-version-conundrum/185714/3
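A quick sanity check for the ptxas error above (a minimal sketch, assuming the serving container has the CUDA toolkit and torch available; sm_90a targets generally need a CUDA 12.x ptxas, while the CUDA 11.8 toolchain only knows plain sm_90):

# run inside the serving container
ptxas --version
python3 -c "import torch; print(torch.version.cuda, torch.cuda.get_device_capability(0))"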


pseudotensor commented Jul 26, 2024

Ah, even OpenGVLab/InternVL-Chat-V1-5 segfaults.

So lmdeploy is broken somehow: I'm using the same docker build scripts as for deployments that are already running fine; the only difference is using the latest lmdeploy repo hash.

pseudotensor changed the title [Bug] segfault in turbomind for OpenGVLab/InternVL2-Llama3-76B [Bug] segfault in turbomind for OpenGVLab/InternVL2-Llama3-76B and OpenGVLab/InternVL-Chat-V1-5 (regression) Jul 26, 2024
pseudotensor changed the title [Bug] segfault in turbomind for OpenGVLab/InternVL2-Llama3-76B and OpenGVLab/InternVL-Chat-V1-5 (regression) [Bug Regression] segfault in turbomind for OpenGVLab/InternVL2-Llama3-76B and OpenGVLab/InternVL-Chat-V1-5 Jul 26, 2024
@pseudotensor (Author)

I'm also confused by this in the readme:

Since v0.3.0, The default prebuilt package is compiled on CUDA 12. However, if CUDA 11+ is required, you can install lmdeploy by

But the docker/Dockerfile still references cu118 and (I guess) uses a tritonserver base image that only has CUDA 11.8.

Is this a problem for deploying on H100? It worked with lmdeploy from (maybe) 2-3 weeks ago, so I guess not, but maybe the pytorch backend issue is related.
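If it helps narrow down the CUDA question, a minimal check of what the built image actually ships (a sketch, assuming the container name from the docker run above and that the toolkit binaries are present in the image):

docker exec internvl2_llama3_76b_lmdeploy nvcc --version
docker exec internvl2_llama3_76b_lmdeploy python3 -c "import torch; print(torch.version.cuda)"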


pseudotensor commented Jul 26, 2024

Why can't the docker image use an updated triton server image?

https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel-24-06.html#rel-24-06

which uses CUDA 12.5?

And why is the triton server image used as the base image at all? It seems overly complicated, and you don't even use the triton server. Why not just a normal Ubuntu image with Python 3.10?

@pseudotensor (Author)

The exact same build process on f613814 works fine, with no segfault, so this is definitely a regression.
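For reference, one way to narrow this down further (a sketch, assuming a checkout of the lmdeploy repo and re-running the same docker build plus the single-image request at each step):

git bisect start
git bisect bad HEAD          # current main: segfaults on any image request
git bisect good f613814      # known-good commit used in the older build
# rebuild the image and repeat the request at each step, marking each commit
# `git bisect good` or `git bisect bad` until the offending commit is isolated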

@RunningLeon (Collaborator)

@pseudotensor hi, thanks for your feedback. It looks like this only happens in the tritonserver-based docker image. #1971 can fix it.
As for the docker image, we will consider providing a CUDA 12.x image later.

@RunningLeon (Collaborator)

@pseudotensor hi, could you kindly try the updated dockerfile from #2182? Any feedback would be greatly appreciated.

@lvhan028 (Collaborator)

Why can't the docker image use an updated triton server image?

https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel-24-06.html#rel-24-06

which uses CUDA 12.5?

And why is the triton server image used as the base image at all? It seems overly complicated, and you don't even use the triton server. Why not just a normal Ubuntu image with Python 3.10?

The initial version of lmdeploy inherited FasterTransformer and the triton inference server.
As lmdeploy develops, we are gradually removing them: #1986
The cu12 docker image (#2182) won't be released until the full test suite passes.

@pseudotensor (Author)

Still hitting segfaults; unsure if it's the same issue: #2223

Probably the same, so not fixed.


lvhan028 commented Aug 6, 2024

Hi, @pseudotensor
We tried A100 (×8) but were not able to reproduce this issue.
I was wondering if there is any way we can access your env and debug it?

lvhan028 self-assigned this Aug 6, 2024
@pseudotensor (Author)

Hi, I plan to do the debugging thing: #2223 (comment)

Just busy with other stuff.

I'm unable to give access to the machine directly, but we can do a shared debugging session if that's helpful. You can email me at pseudotensor@gmail.com to set up the details.

@pseudotensor (Author)

#2223 (comment)
