
[Bug Regression] segfault in turbomind for OpenGVLab/InternVL2-Llama3-76B and OpenGVLab/InternVL-Chat-V1-5 #2164

Closed
pseudotensor opened this issue Jul 26, 2024 · 12 comments

pseudotensor commented Jul 26, 2024

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

Text handling by the model is fine, but any image request leads to a crash.

Reproduction

Just make any single-image request as one would for other vision models. This is sufficient to cause a crash every time:

from openai import OpenAI

client = OpenAI(base_url='http://<fill_IP>/v1', api_key='EMPTY')  # dummy key: the OpenAI client requires one even when the server doesn't check it


from PIL import Image
import base64
import requests
from io import BytesIO


# The encoding function I linked previously - but we actually don't use this function in the API server
def encode_image_base64(image: Image.Image, format: str = 'JPEG') -> str:
    """encode image to base64 format."""

    buffered = BytesIO()
    if format == 'JPEG':
        image = image.convert('RGB')
    image.save(buffered, format)
    return base64.b64encode(buffered.getvalue()).decode('utf-8')


# load image from url
url = "https://h2o-release.s3.amazonaws.com/h2ogpt/bigben.jpg"
image = Image.open(BytesIO(requests.get(url).content))

# correct way to encode an image from url
response = requests.get(url)
base64_correct = base64.b64encode(response.content).decode('utf-8')

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What do you see?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": 'data:image/jpeg;base64,' + base64_correct,
                },
            },
        ],
    }
]

response = client.chat.completions.create(
    model="OpenGVLab/InternVL2-Llama3-76B",
    messages=messages,
    temperature=0.0,
    max_tokens=300,
)

print(response.choices[0])

I haven't used the latest lmdeploy with other models like InternVL-Chat-V1-5 that work fine with the older version, so it's possible those are broken too. I'll try InternVL-Chat-V1-5 to see if lmdeploy is generally broken.

Environment

Using the latest docker image, with an extra build for the vision dependencies, on 4×H100.

docker stop internvl2_llama3_76b_lmdeploy ; docker remove internvl2_llama3_76b_lmdeploy
docker run -d --restart=always --runtime nvidia --gpus '"device=0,1,2,3"' \
    -v $HOME/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN" \
    -p 23343:23333 \
    --ipc=host \
    --name internvl2_llama3_76b_lmdeploy \
    internvlmain2 \
    lmdeploy serve api_server OpenGVLab/InternVL2-Llama3-76B \
    --tp 4 \
    --model-name OpenGVLab/InternVL2-Llama3-76B

See how the docker image is built here: #2163

Error traceback

INFO:     172.16.0.83:34664 - "POST /v1/completions HTTP/1.1" 200 OK
terminate called after throwing an instance of 'std::runtime_error'
  what():  [TM][ERROR] CUDA runtime error: CUBLAS_STATUS_EXECUTION_FAILED /opt/lmdeploy/src/turbomind/utils/cublasMMWrapper.cc:307 

terminate called recursively
terminate called recursively
terminate called recursively
[dfe606afa87e:1    :0:414] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
INFO:     172.16.0.83:14234 - "GET /health HTTP/1.1" 200 OK

pseudotensor commented Jul 26, 2024

If I try pytorch backend, I get on startup:

language_model.model.layers.79.input_layernorm.weight:  54%|?2024-07-26 20:45:57,609 - lmdeploy - ERROR - RuntimeError: Internal Triton PTX codegen error: 
ptxas fatal   : Value 'sm_90a' is not defined for option 'gpu-name'

2024-07-26 20:45:57,609 - lmdeploy - ERROR - <Triton> test failed!
Please ensure it has been installed correctly.

Same for the InternVL-Chat-V1-5 model.

My nvidia-smi output, if helpful:

Fri Jul 26 20:51:43 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000000:0F:00.0 Off |                    0 |
| N/A   31C    P0             119W / 700W |  67915MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  | 00000000:2D:00.0 Off |                    0 |
| N/A   36C    P0             120W / 700W |  68135MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  | 00000000:44:00.0 Off |                    0 |
| N/A   33C    P0             124W / 700W |  68135MiB / 81559MiB |      1%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  | 00000000:5B:00.0 Off |                    0 |
| N/A   37C    P0             124W / 700W |  67557MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  | 00000000:89:00.0 Off |                    0 |
| N/A   31C    P0             114W / 700W |  71661MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  | 00000000:A8:00.0 Off |                    0 |
| N/A   36C    P0             118W / 700W |  78617MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  | 00000000:C0:00.0 Off |                    0 |
| N/A   37C    P0             123W / 700W |  80371MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  | 00000000:D8:00.0 Off |                   On |
| N/A   32C    P0             120W / 700W |  63356MiB / 81559MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  7    1   0   0  |           32519MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
|                  |               3MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  7    2   0   1  |           30837MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
|                  |               3MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   2083468      C   /opt/py38/bin/python3                     67908MiB |
|    1   N/A  N/A   2083468      C   /opt/py38/bin/python3                     68128MiB |
|    2   N/A  N/A   2083468      C   /opt/py38/bin/python3                     68128MiB |
|    3   N/A  N/A   2083468      C   /opt/py38/bin/python3                     67550MiB |
|    4   N/A  N/A   1172218      C   python3                                   71642MiB |
|    5   N/A  N/A   1171696      C   python3                                   78600MiB |
|    6   N/A  N/A   1181117      C   /opt/py38/bin/python3                     80364MiB |
|    7    1    0    1465874      C   python                                    32484MiB |
|    7    2    0    1475745      C   python3                                   30802MiB |
+---------------------------------------------------------------------------------------+

This is after the model is already loaded on GPUs 0-3.

But some suggest CUDA 11.8 is fine for sm_90: https://discuss.pytorch.org/t/cuda-version-conundrum/185714/3
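A quick sanity check for the ptxas error above (a minimal sketch, assuming the serving container has the CUDA toolkit and torch available; sm_90a targets generally need a CUDA 12.x ptxas, while the CUDA 11.8 toolchain only knows plain sm_90):

# run inside the serving container
ptxas --version
python3 -c "import torch; print(torch.version.cuda, torch.cuda.get_device_capability(0))"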


pseudotensor commented Jul 26, 2024

Ah, even OpenGVLab/InternVL-Chat-V1-5 segfaults.

So lmdeploy is broken somehow: I'm using the same docker build scripts as for deployments that are already running fine; the only difference is using the latest lmdeploy repo hash.

pseudotensor changed the title [Bug] segfault in turbomind for OpenGVLab/InternVL2-Llama3-76B [Bug] segfault in turbomind for OpenGVLab/InternVL2-Llama3-76B and OpenGVLab/InternVL-Chat-V1-5 (regression) Jul 26, 2024
pseudotensor changed the title [Bug] segfault in turbomind for OpenGVLab/InternVL2-Llama3-76B and OpenGVLab/InternVL-Chat-V1-5 (regression) [Bug Regression] segfault in turbomind for OpenGVLab/InternVL2-Llama3-76B and OpenGVLab/InternVL-Chat-V1-5 Jul 26, 2024
@pseudotensor (Author)

I'm also confused by this in the readme:

Since v0.3.0, The default prebuilt package is compiled on CUDA 12. However, if CUDA 11+ is required, you can install lmdeploy by

But the docker/Dockerfile still references cu118 and (I guess) uses a tritonserver base image that only has CUDA 11.8.

Is this a problem for deploying on H100? It worked with lmdeploy from (maybe) 2-3 weeks ago, so I guess not, but maybe the pytorch backend issue is related.
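If it helps narrow down the CUDA question, a minimal check of what the built image actually ships (a sketch, assuming the container name from the docker run above and that the toolkit binaries are present in the image):

docker exec internvl2_llama3_76b_lmdeploy nvcc --version
docker exec internvl2_llama3_76b_lmdeploy python3 -c "import torch; print(torch.version.cuda)"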


pseudotensor commented Jul 26, 2024

Why can't the docker image use an updated triton server image?

https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel-24-06.html#rel-24-06

which uses CUDA 12.5?

And why is the triton server image used as the base image at all? It seems overly complicated, and you don't even use the triton server. Why not just a normal Ubuntu image with Python 3.10?

@pseudotensor (Author)

The exact same build process on f613814 works fine, with no segfault, so this is definitely a regression.
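For reference, one way to narrow this down further (a sketch, assuming a checkout of the lmdeploy repo and re-running the same docker build plus the single-image request at each step):

git bisect start
git bisect bad HEAD          # current main: segfaults on any image request
git bisect good f613814      # known-good commit used in the older build
# rebuild the image and repeat the request at each step, marking each commit
# `git bisect good` or `git bisect bad` until the offending commit is isolated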

@RunningLeon (Collaborator)

@pseudotensor hi, thanks for your feedback. It looks like this only happens in the tritonserver-based docker image. #1971 can fix it.
As for the docker image, we will consider providing a CUDA 12.x image later.

@RunningLeon (Collaborator)

@pseudotensor hi, could you kindly try the updated dockerfile from #2182? Any feedback would be greatly appreciated.

@lvhan028 (Collaborator)

Why can't the docker image use an updated triton server image?

https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel-24-06.html#rel-24-06

which uses CUDA 12.5?

And why is the triton server image used as the base image at all? It seems overly complicated, and you don't even use the triton server. Why not just a normal Ubuntu image with Python 3.10?

The initial version of lmdeploy inherited FasterTransformer and the triton inference server.
As lmdeploy develops, we are gradually removing them: #1986
The cu12 docker image (#2182) won't be released until the full test suite passes.

@pseudotensor (Author)

Still hitting segfaults; unsure if it's the same issue: #2223

Probably the same, so not fixed.


lvhan028 commented Aug 6, 2024

Hi, @pseudotensor
We tried A100 (×8) but were not able to reproduce this issue.
I was wondering if there is any way we can access your env and debug it?

lvhan028 self-assigned this Aug 6, 2024
@pseudotensor (Author)

Hi, I plan to do the debugging thing: #2223 (comment)

Just busy with other stuff.

I'm unable to give access to the machine directly, but we can do a shared debugging session if that's helpful. You can email me at pseudotensor@gmail.com to set up the details.

@pseudotensor (Author)

#2223 (comment)
