GPU Memory is not freed after embedding operations #514

Aelentel · 2025-01-14T08:36:14Z

System Info

Infinity docker image : michaelf34/infinity:0.0.74

Docker compose command and deploy parts

command: [
      "v2",
      "--model-id","Alibaba-NLP/gte-multilingual-base",
      "--batch-size","4",
      "--dtype","float16",
      "--device","cuda",
      "--engine","torch",
      "--port","7997"
      ]
 deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities:
                - gpu
                - utility
                - compute
                - video

GPU Card

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05              Driver Version: 560.35.05      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:0A:00.0 Off |                  Off |
|  0%   37C    P8             13W /  450W |    1174MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Information

Docker + cli
pip + cli
pip + usage of Python interface

Tasks

An officially supported CLI command
My own modifications

Reproduction

embedded around 13k documents, waited 6 hours after the embedding batches, the memory GPU memory is still allocated to the process and not freed by embedding memory artifact (not talking about the model itself).

is there a way to free the memory outside of restarting the container ?

Memory just after the Container Start

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05              Driver Version: 560.35.05      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:0A:00.0 Off |                  Off |
|  0%   39C    P2             73W /  450W |    1174MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    128020      C   /app/.venv/bin/python                        1164MiB |
+-----------------------------------------------------------------------------------------+

Memory 6 hours after the encoding batch is completed

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05              Driver Version: 560.35.05      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:0A:00.0 Off |                  Off |
|  0%   37C    P8             13W /  450W |   10516MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1931      C   /app/.venv/bin/python                       10506MiB |
+-----------------------------------------------------------------------------------------+

The text was updated successfully, but these errors were encountered:

michaelfeil · 2025-01-14T17:26:57Z

#324

Potentially related to unused kv cache..

Aelentel · 2025-01-14T19:34:42Z

According to the linked issue, power of 2 and multiple of 8, I'll try 16 for the batch size.

However the memory usage is still problematic, the model itself is around 1.2 GB when loaded (as saw in the startup nvidia-smi) is there a way to cleanup the used memory after heavy usage ?.

Maybe restart/respawn the model inference process ? Can it be done via api ? Or is that a problem at a lower level (Cuda or torch)

Give or take the container get ready to process in less than a minute, so I think I can implement a restart after batching, but a less forcefull solution would help long term running and availability in production.

Note that by freeing memory, I talk about the inferred vectors, not the memory taken by the model.

Aelentel · 2025-01-28T13:42:51Z

@michaelfeil changed the batch to 8, made sure that the API call also limit the request on 8 doc each call, still memory issue, is there a way to disable the kv-cache on the v2 command flags ?

Aelentel changed the title ~~GPU Memory is not freed after embdding operations~~ GPU Memory is not freed after embedding operations Jan 14, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GPU Memory is not freed after embedding operations #514

GPU Memory is not freed after embedding operations #514

Aelentel commented Jan 14, 2025 •

edited

Loading

michaelfeil commented Jan 14, 2025

Aelentel commented Jan 14, 2025 •

edited

Loading

Aelentel commented Jan 28, 2025 •

edited

Loading

GPU Memory is not freed after embedding operations #514

GPU Memory is not freed after embedding operations #514

Comments

Aelentel commented Jan 14, 2025 • edited Loading

System Info

Information

Tasks

Reproduction

michaelfeil commented Jan 14, 2025

Aelentel commented Jan 14, 2025 • edited Loading

Aelentel commented Jan 28, 2025 • edited Loading

Aelentel commented Jan 14, 2025 •

edited

Loading

Aelentel commented Jan 14, 2025 •

edited

Loading

Aelentel commented Jan 28, 2025 •

edited

Loading