
Infinity embed crashes too easily #517

Open
2 of 5 tasks
taoari opened this issue Jan 15, 2025 · 6 comments

@taoari commented Jan 15, 2025

System Info

0.0.74

Information

  • Docker + cli
  • pip + cli
  • pip + usage of Python interface

Tasks

  • An officially supported CLI command
  • My own modifications

Reproduction

docker with command: >
  v2
  --model-id Alibaba-NLP/gte-large-en-v1.5
  --batch-size 8
  --url-prefix "/v1"
  --port 80

Initially, the GPU memory usage starts at just a few gigabytes. However, after running hundreds of calls, the memory consumption gradually increases to over 40GB, eventually resulting in an OOM (Out of Memory) error.

The API should be robust enough to handle heavy usage without crashing or becoming unresponsive, as such issues hinder its usability and reliability. A potential solution could involve implementing a restriction, such as automatically truncating documents that exceed a specified size.
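
A minimal client-side sketch of that kind of truncation, assuming the OpenAI-compatible embeddings route under the "/v1" prefix from the reproduction above (the base URL, character budget, and helper name are illustrative, not Infinity settings):

  # Truncate documents client-side before sending them for embedding.
  # MAX_CHARS is an arbitrary per-document character budget (assumption).
  import requests

  MAX_CHARS = 8000

  def embed(docs, base_url="http://localhost:80/v1"):
      truncated = [d[:MAX_CHARS] for d in docs]
      resp = requests.post(
          f"{base_url}/embeddings",
          json={"model": "Alibaba-NLP/gte-large-en-v1.5", "input": truncated},
          timeout=60,
      )
      resp.raise_for_status()
      return [item["embedding"] for item in resp.json()["data"]]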

@kime541200

Same problem here.

@michaelfeil (Owner)

Note:

  • likely related to use_cache=True, which is a setting for causal LMs
  • potentially retains the KV cache from previous generations.

40GB does not make sense; however, 8192 tokens x 8 will cause decent utilization, which might be what you are seeing. The restriction would be to split batches with a max_num_tokens parameter. Since this happens before tokenization, it would be a max_chars_per_batch parameter.
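
For illustration only (not Infinity's internals), this is what checking and disabling use_cache looks like on the underlying Hugging Face model config; whether this actually changes the memory profile under Infinity is an assumption to verify:

  # Sketch with plain transformers: inspect and disable use_cache on the model
  # config so no past key/values are retained. gte-large-en-v1.5 ships custom
  # modeling code, hence trust_remote_code=True.
  import torch
  from transformers import AutoModel, AutoTokenizer

  model_id = "Alibaba-NLP/gte-large-en-v1.5"
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

  if getattr(model.config, "use_cache", False):
      model.config.use_cache = False

  with torch.inference_mode():
      batch = tokenizer(["hello world"], return_tensors="pt")
      embedding = model(**batch).last_hidden_state[:, 0]  # CLS pooling, as an example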

@luzhongqiu

Same problem using bge-m3 for embedding and rerank: OOM. How to deal with this? :(

@kime541200

Same problem using bge-m3 for embedding and rerank: OOM. How to deal with this? :(

I reduced the batch size and it stopped happening, but I don't think this is a good solution; I'm still looking for a better one.
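
For reference, that is the --batch-size flag already shown in the reproduction above; a lower value (4 here, purely as an example) trades throughput for memory headroom, e.g. for bge-m3:

docker with command: >
  v2
  --model-id BAAI/bge-m3
  --batch-size 4
  --url-prefix "/v1"
  --port 80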

@michaelfeil (Owner)

Same problem using bge-m3 for embedding and rerank: OOM. How to deal with this? :(

M3 should not have this issue at all. Can you send the logs here?

@taoari (Author) commented Jan 27, 2025

Note:

  • likely related to use_cache=True, which is a setting for causal LMs
  • potentially retains the KV cache from previous generations.

40GB does not make sense; however, 8192 tokens x 8 will cause decent utilization, which might be what you are seeing. The restriction would be to split batches with a max_num_tokens parameter. Since this happens before tokenization, it would be a max_chars_per_batch parameter.

@michaelfeil Thanks for the reply. Could you share the options for max_chars_per_batch in the CLI? I found no such option in https://github.com/michaelfeil/infinity/blob/main/libs/infinity_emb/infinity_emb/cli.py#L153
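
Until such a flag exists, a possible client-side stand-in is to cap the characters per request before calling the endpoint. A sketch (max_chars_per_request and the splitting logic are assumptions, not an Infinity option):

  # Split a document list into sub-batches whose combined character count stays
  # under a budget, then send each sub-batch as its own embedding request.
  def split_by_chars(docs, max_chars_per_request=16000):
      batch, used = [], 0
      for doc in docs:
          if batch and used + len(doc) > max_chars_per_request:
              yield batch
              batch, used = [], 0
          batch.append(doc)
          used += len(doc)
      if batch:
          yield batch

  # Usage with the embed() helper sketched earlier in this thread:
  # for sub_batch in split_by_chars(documents):
  #     vectors = embed(sub_batch)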
