feat: evaluate openai models on the remaining MTEB(Medical) tasks #71

Merged

Conversation

dbuades
Contributor

@dbuades dbuades commented Dec 13, 2024

Following up on the earlier PR, this PR adds results from evaluating text-embedding-3-small and text-embedding-3-large on the remaining tasks in the MTEB(Medical) benchmark.

As discussed here, results from revision 1 are equivalent to those from revision 2. Therefore, we only evaluated tasks that were not previously run.

Thank you @Muennighoff for providing an API key with credits!
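
For reference, a minimal sketch of how such a run might look with the mteb package (the benchmark and model names follow this repo's conventions; the exact entry points are assumptions and may differ from what was actually used):

import mteb

# Sketch only: select the MTEB(Medical) benchmark and evaluate both OpenAI models.
# Assumes OPENAI_API_KEY is set and that mteb exposes get_benchmark/get_model.
benchmark = mteb.get_benchmark("MTEB(Medical)")

for model_name in ("text-embedding-3-small", "text-embedding-3-large"):
    model = mteb.get_model(model_name)
    evaluation = mteb.MTEB(tasks=benchmark.tasks)
    evaluation.run(model, output_folder=f"results/{model_name}")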

Checklist

  • Run tests locally to make sure nothing is broken using make test.
  • Run the results files checker make pre-push.

Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment


Looks good!

I just noted that we track CO2 for APIs, which seems a bit misleading. Not sure what the best solution is here (my guess would be that we just filter it out in the leaderboard).

@KennethEnevoldsen KennethEnevoldsen enabled auto-merge (squash) December 14, 2024 00:03
@dbuades
Contributor Author

dbuades commented Dec 14, 2024

@Muennighoff I like your suggestion of also running both models with reduced dimensions. I was thinking 768 dimensions would be a good number, since it would put them in line with medium-sized models and test the compression accuracy of their Matryoshka training. What do you think?

@KennethEnevoldsen KennethEnevoldsen merged commit 8f3c2a3 into embeddings-benchmark:main Dec 14, 2024
2 checks passed
@Muennighoff
Contributor

@Muennighoff I like your suggestion of also running both models with reduced dimensions. I was thinking 768 dimensions would be a good number, since it would put them in line with medium-sized models and test the compression accuracy of their Matryoshka training. What do you think?

Feel free to run all of them if you want :)

@dbuades
Contributor Author

dbuades commented Dec 14, 2024

Looks good!

I just noted that we track CO2 for APIs, which seems a bit misleading. Not sure what the best solution is here (my guess would be that we just filter it out in the leaderboard).

That was fast!

@dbuades
Contributor Author

dbuades commented Dec 14, 2024

@Muennighoff I like your suggestion of also running both models with reduced dimensions. I was thinking 768 dimensions would be a good number, since it would put them in line with medium-sized models and test the compression accuracy of their Matryoshka training. What do you think?

Feel free to run all of them if you want :)

Thanks! I'll run all the common dimensions then (256, 512, 768, 1024).
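
A rough sketch of what those runs could look like, reusing the OpenAIWrapper parameters from the ModelMeta snippet further down this thread (import path and constructor usage are assumptions, not verified against the current mteb API):

import mteb
from mteb.models.openai_models import OpenAIWrapper  # import path assumed

# Sketch only: evaluate text-embedding-3-small at several truncated (Matryoshka)
# dimensions. embed_dim maps to OpenAI's `dimensions` parameter, which shortens
# the returned embedding.
benchmark = mteb.get_benchmark("MTEB(Medical)")

for dim in (256, 512, 768, 1024):
    model = OpenAIWrapper(
        model_name="text-embedding-3-small",
        tokenizer_name="cl100k_base",
        max_tokens=8191,
        embed_dim=dim,
    )
    evaluation = mteb.MTEB(tasks=benchmark.tasks)
    evaluation.run(model, output_folder=f"results/text-embedding-3-small-{dim}")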

@dbuades dbuades deleted the feat/medical-mteb-openai branch December 14, 2024 00:11
@KennethEnevoldsen
Contributor

That was fast!

Caught me at a boring conference talk ;)

@KennethEnevoldsen
Contributor

Re: running the dimensions, I just want to refer to embeddings-benchmark/mteb#1211

@dbuades
Contributor Author

dbuades commented Dec 14, 2024

That was fast!

caught me at boring conference talk ;)

By the way, regarding CodeCarbon, in addition to filtering it out in the leaderboard, we could also modify mteb.run() to check if the evaluated model is an instance of one of the API provider wrappers. If so, we could override the co2_tracker parameter to False. Somewhere around here.

@KennethEnevoldsen
Contributor

By the way, regarding CodeCarbon, in addition to filtering it out in the leaderboard, we could also modify mteb.run() to check if the evaluated model is an instance of one of the API provider wrappers

But I am not sure we know at that time (the encoder does not contain the metadata)

@dbuades
Contributor Author

dbuades commented Dec 14, 2024

Re running the dimensions: I just want to refer to embeddings-benchmark/mteb#1211

I hadn't seen this, thanks. I intended to simply create a PR in mteb doing something like:

# Imports for a self-contained snippet (paths assumed from the mteb package layout):
from functools import partial

from mteb.model_meta import ModelMeta
from mteb.models.openai_models import OpenAIWrapper

text_embedding_3_small_512 = ModelMeta(
    name="text-embedding-3-small-512",
    revision="1",
    release_date="2024-01-25",
    languages=None,  # supported languages not specified
    loader=partial(
        OpenAIWrapper,
        model_name="text-embedding-3-small",
        tokenizer_name="cl100k_base",
        max_tokens=8192,
        embed_dim=512,
    ),
    max_tokens=8191,
    embed_dim=512,
    open_weights=False,
    n_parameters=None,
    memory_usage=None,
    license=None,
    reference="https://openai.com/index/new-embedding-models-and-api-updates/",
    similarity_fn_name="cosine",
    framework=["API"],
    use_instructions=False,
)

text_embedding_3_small_768 = ModelMeta(
    name="text-embedding-3-small-768",
    revision="1",
    release_date="2024-01-25",
    languages=None,  # supported languages not specified
    loader=partial(
        OpenAIWrapper,
        model_name="text-embedding-3-small",
        tokenizer_name="cl100k_base",
        max_tokens=8192,
        embed_dim=768,
    ),
    max_tokens=8191,
    embed_dim=768,
    open_weights=False,
    n_parameters=None,
    memory_usage=None,
    license=None,
    reference="https://openai.com/index/new-embedding-models-and-api-updates/",
    similarity_fn_name="cosine",
    framework=["API"],
    use_instructions=False,
)

but the experiments approach is more flexible.

@KennethEnevoldsen
Contributor

I think this is perfectly fine for running them, but I don't think we would accept the PR, as we have been working on removing duplicates on the new leaderboard and this would add them again.

(would love to have them run though, then we can start experimenting with how to best display it)

@dbuades
Contributor Author

dbuades commented Dec 14, 2024

By the way, regarding CodeCarbon, in addition to filtering it out in the leaderboard, we could also modify mteb.run() to check if the evaluated model is an instance of one of the API provider wrappers

But I am not sure we know at that time (the encoder does not contain the metadata)

I agree that the encoder does not contain the metadata, but I believe mteb.run() does, since it receives the model as an argument. We could check the model meta and, if the framework is API, disable co2_tracker.

Maybe I'm not explaining myself clearly; I can open a small PR to illustrate it in the coming days.
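
Something along these lines (a sketch only: it assumes the loaded model carries its ModelMeta, e.g. as an mteb_model_meta attribute, that ModelMeta.framework is a list like ["API"], and that run() already accepts a co2_tracker argument; the helper name is made up):

from typing import Optional

from mteb.model_meta import ModelMeta  # import path assumed


def resolve_co2_tracker(model_meta: Optional[ModelMeta], requested: bool = True) -> bool:
    """Disable CodeCarbon tracking for API-served models, where local emissions say nothing about the model."""
    if model_meta is not None and model_meta.framework and "API" in model_meta.framework:
        return False
    return requested

mteb.run() could then pass the resolved value to the CodeCarbon tracker, so API models are skipped without every user having to set co2_tracker=False themselves.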

@dbuades
Contributor Author

dbuades commented Dec 14, 2024

I think this is perfectly fine for running them, but I don't think we would accept the PR, as we have been working on removing duplicates on the new leaderboard and this would add them again.

(would love to have them run though, then we can start experimenting with how to best display it)

Fair enough! I'll run the evaluations for now and open a draft PR here (they would still show up in different folders), but not in mteb.

@KennethEnevoldsen
Contributor

Would love a PR - should make it clear

@dbuades
Contributor Author

dbuades commented Dec 14, 2024

Would love a PR - should make it clear

Perfect, I'll open one.
