inference time cost gap for FlagEmbedding 1.2 and 1.3 #1233

Open
Atlantic8 opened this issue Nov 15, 2024 · 7 comments

Comments

@Atlantic8

Atlantic8 commented Nov 15, 2024

I use the same reranker model bge-reranker-v2-m3 and the same Python script:

```python
from FlagEmbedding import FlagReranker

model = FlagReranker(model_path, use_fp16=True)

model.compute_score(qp_pairs, normalize=True)
```

The environments differ only in the FlagEmbedding version. However, the inference time with FlagEmbedding 1.3 is almost twice as long as with FlagEmbedding 1.2. Unfortunately, I have to use FlagEmbedding 1.3 because I need to finetune the model with query_instruction_for_rerank, passage_max_length and sep_token.

Can anyone help with this problem?

@hanhainebula
Collaborator

Hello, @Atlantic8! Could you provide more details, such as the devices used for inference and the number of sentence pairs? Then we will look into the cause of this problem. Thank you.

@Atlantic8
Author

> Hello, @Atlantic8! Could you provide more details, such as the devices used for inference and the number of sentence pairs? Then we will look into the cause of this problem. Thank you.

The device is an NVIDIA V100 32G. I used 20 <query, doc> pairs, where the query is long (around 1000 tokens) and the doc is short (around 20 tokens).
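For reference, the workload shape can be reproduced roughly like this (a synthetic sketch with made-up pair contents, not the real data):

```python
import time
from FlagEmbedding import FlagReranker

# Synthetic stand-ins for the real workload: 20 pairs, each with a
# ~1000-token query and a ~20-token passage.
long_query = " ".join(["token"] * 1000)
short_doc = " ".join(["word"] * 20)
qp_pairs = [(long_query, short_doc) for _ in range(20)]

model = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)

start = time.time()
scores = model.compute_score(qp_pairs, normalize=True)
print(f"{len(qp_pairs)} pairs scored in {time.time() - start:.3f}s")
```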

@hanhainebula
Collaborator

Hello, @Atlantic8. This is normal, since initializing multiple devices (refer to here) needs some time. Considering that you are only scoring 20 sentence pairs here, you can add the parameter devices="cuda:0" to use a single GPU and avoid the time spent initializing multiple devices. The modified code:

```python
from FlagEmbedding import FlagReranker

model = FlagReranker(model_path, use_fp16=True, devices="cuda:0")

model.compute_score(qp_pairs, normalize=True)
```

@Atlantic8
Author

Atlantic8 commented Nov 22, 2024

I tried your solution; unfortunately, it's not working.
I built a service, and the model is initialized only once. When I downgraded FlagEmbedding to 1.2, the time cost decreased by nearly 50%, so I think it must be something related to the FlagEmbedding version.
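The serving pattern is roughly like this (a minimal sketch, not the actual service code):

```python
from FlagEmbedding import FlagReranker

# The model is created once at service startup, not per request.
# (devices="cuda:0" is the 1.3-style single-GPU setting suggested above.)
model = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True, devices="cuda:0")

def rerank(qp_pairs):
    # Called per request: only compute_score runs here, so any one-time
    # initialization cost should not affect steady-state latency.
    return model.compute_score(qp_pairs, normalize=True)
```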

@hanhainebula
Collaborator

When using FlagEmbedding 1.2, how many devices did you use? If the number of devices is also 1, then this gap is not quite what we would expect 🤔.

@Atlantic8
Author

Atlantic8 commented Nov 22, 2024

Only 1.
The only difference is the FlagEmbedding version; all other variables are the same.

@hanhainebula
Collaborator

Hello, @Atlantic8. Here is an example for testing the inference time:

```python
import os
import time
import datasets
from FlagEmbedding import FlagReranker


def test_inference_time(reranker: FlagReranker, sentences: list, number: int = 20):
    if len(sentences) > number:
        sentences = sentences[:number]
    elif len(sentences) < number:
        sentences = sentences * (number // len(sentences) + 1)
        sentences = sentences[:number]
    start_time = time.time()
    scores = reranker.compute_score(sentences, batch_size=16, max_length=1024, normalize=True)
    end_time = time.time()
    print("=====================================")
    print("Number of pairs: ", number)
    print("Time cost: ", end_time - start_time)


def main():
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"
    
    reranker = FlagReranker(
        'BAAI/bge-reranker-v2-m3',
        use_fp16=True
    )

    cache_dir = "~/.cache"

    queries = datasets.load_dataset('Shitao/MLDR', "hi", cache_dir=cache_dir, trust_remote_code=True)["test"].select(range(20))
    corpus = datasets.load_dataset('Shitao/MLDR', "corpus-hi", cache_dir=cache_dir, trust_remote_code=True)["corpus"].select(range(20))

    sentences = [(q["query"], d["text"]) for q, d in zip(queries, corpus)]
    
    print("Warm up")
    reranker.compute_score([("hello world", "hello world")])
    
    test_inference_time(reranker, sentences, number=20)
    test_inference_time(reranker, sentences, number=1000)
    test_inference_time(reranker, sentences, number=10000)


if __name__ == '__main__':
    main()
```

For FlagEmbedding 1.2.10, the test result is:

```
=====================================
Number of pairs:  20
Time cost:  0.30451512336730957
=====================================
Number of pairs:  1000
Time cost:  12.109597444534302
=====================================
Number of pairs:  10000
Time cost:  121.79582500457764
```

For FlagEmbedding 1.3.2, the test result is:

```
=====================================
Number of pairs:  20
Time cost:  0.47469186782836914
=====================================
Number of pairs:  1000
Time cost:  12.363729476928711
=====================================
Number of pairs:  10000
Time cost:  120.30488777160645
```

The "Warm up" part moves the model to the target device so that the comparison is fair. For FlagEmbedding 1.2.10, this operation is performed in the __init__ function (refer to here). For FlagEmbedding 1.3.2, it is performed in the compute_score function (refer to here).

From the above results, we can observe that there is no significant gap between FlagEmbedding 1.2 and 1.3. I hope this result helps you.
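To see the one-time device-transfer cost described above directly, the first and second calls can be timed separately (a small sketch, assuming FlagEmbedding 1.3.x and a single GPU):

```python
import time
from FlagEmbedding import FlagReranker

model = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True, devices="cuda:0")
pair = [("hello world", "hello world")]

# On 1.3.x the model is moved to the device inside the first compute_score
# call, so the first call is expected to be noticeably slower than later ones.
for i in range(2):
    start = time.time()
    model.compute_score(pair, normalize=True)
    print(f"call {i + 1}: {time.time() - start:.3f}s")
```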
