
Infinity embed crashes too easily #517

Open
2 of 5 tasks
taoari opened this issue Jan 15, 2025 · 6 comments

@taoari commented Jan 15, 2025

System Info

0.0.74

Information

  • Docker + cli
  • pip + cli
  • pip + usage of Python interface

Tasks

  • An officially supported CLI command
  • My own modifications

Reproduction

docker with command: >
  v2
  --model-id Alibaba-NLP/gte-large-en-v1.5
  --batch-size 8
  --url-prefix "/v1"
  --port 80

Initially, the GPU memory usage starts at just a few gigabytes. However, after running hundreds of calls, the memory consumption gradually increases to over 40GB, eventually resulting in an OOM (Out of Memory) error.

The API should be robust enough to handle heavy usage without crashing or becoming unresponsive, as such issues hinder its usability and reliability. A potential solution could involve implementing a restriction, such as automatically truncating documents that exceed a specified size.
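
A minimal client-side sketch of that kind of truncation, assuming the OpenAI-compatible embeddings route under the "/v1" prefix from the reproduction above (the base URL, character budget, and helper name are illustrative, not Infinity settings):

  # Truncate documents client-side before sending them for embedding.
  # MAX_CHARS is an arbitrary per-document character budget (assumption).
  import requests

  MAX_CHARS = 8000

  def embed(docs, base_url="http://localhost:80/v1"):
      truncated = [d[:MAX_CHARS] for d in docs]
      resp = requests.post(
          f"{base_url}/embeddings",
          json={"model": "Alibaba-NLP/gte-large-en-v1.5", "input": truncated},
          timeout=60,
      )
      resp.raise_for_status()
      return [item["embedding"] for item in resp.json()["data"]]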

@kime541200

Same problem here.

@michaelfeil (Owner)

Note:

  • likely related to use_cache=True, which is a setting for causal LMs
  • potentially retains the KV cache from previous generations.

40GB does not make sense; however, 8192 tokens x 8 will cause decent utilization, which might be what you are seeing. The restriction would be to split batches with a max_num_tokens parameter. Since this happens before tokenization, it would be a max_chars_per_batch parameter.
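
For illustration only (not Infinity's internals), this is what checking and disabling use_cache looks like on the underlying Hugging Face model config; whether this actually changes the memory profile under Infinity is an assumption to verify:

  # Sketch with plain transformers: inspect and disable use_cache on the model
  # config so no past key/values are retained. gte-large-en-v1.5 ships custom
  # modeling code, hence trust_remote_code=True.
  import torch
  from transformers import AutoModel, AutoTokenizer

  model_id = "Alibaba-NLP/gte-large-en-v1.5"
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

  if getattr(model.config, "use_cache", False):
      model.config.use_cache = False

  with torch.inference_mode():
      batch = tokenizer(["hello world"], return_tensors="pt")
      embedding = model(**batch).last_hidden_state[:, 0]  # CLS pooling, as an example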

@luzhongqiu

Same problem using bge-m3 for embedding and rerank: OOM. How to deal with this? :(

@kime541200

Same problem using bge-m3 for embedding and rerank: OOM. How to deal with this? :(

I reduced the batch size and it stopped happening, but I don't think this is a good solution; I'm still looking for a better one.
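
For reference, that is the --batch-size flag already shown in the reproduction above; a lower value (4 here, purely as an example) trades throughput for memory headroom, e.g. for bge-m3:

docker with command: >
  v2
  --model-id BAAI/bge-m3
  --batch-size 4
  --url-prefix "/v1"
  --port 80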

@michaelfeil (Owner)

Same problem using bge-m3 for embedding and rerank: OOM. How to deal with this? :(

M3 should not have this issue at all. Can you send the logs here?

@taoari (Author) commented Jan 27, 2025

Note:

  • likely related to use_cache=True, which is a setting for causal LMs
  • potentially retains the KV cache from previous generations.

40GB does not make sense; however, 8192 tokens x 8 will cause decent utilization, which might be what you are seeing. The restriction would be to split batches with a max_num_tokens parameter. Since this happens before tokenization, it would be a max_chars_per_batch parameter.

@michaelfeil Thanks for the reply. Could you share the options for max_chars_per_batch in the CLI? I found no such option in https://github.com/michaelfeil/infinity/blob/main/libs/infinity_emb/infinity_emb/cli.py#L153
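
Until such a flag exists, a possible client-side stand-in is to cap the characters per request before calling the endpoint. A sketch (max_chars_per_request and the splitting logic are assumptions, not an Infinity option):

  # Split a document list into sub-batches whose combined character count stays
  # under a budget, then send each sub-batch as its own embedding request.
  def split_by_chars(docs, max_chars_per_request=16000):
      batch, used = [], 0
      for doc in docs:
          if batch and used + len(doc) > max_chars_per_request:
              yield batch
              batch, used = [], 0
          batch.append(doc)
          used += len(doc)
      if batch:
          yield batch

  # Usage with the embed() helper sketched earlier in this thread:
  # for sub_batch in split_by_chars(documents):
  #     vectors = embed(sub_batch)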
