feat: evaluate openai models on the remaining MTEB(Medical) tasks #71
Conversation
Looks good!
I just noted that we track CO2 for APIs, which seems a bit misleading. Not sure what the best solution is here (my guess would be that we just filter it out in the leaderboard).
@Muennighoff I like your suggestion of also running both models with reduced dimensions. I was thinking 768 dimensions would be a good number since it would put it in line with medium-sized models, testing the compression accuracy of their Matryoshka training. What do you think?
Feel free to run all of them if you want :)
That was fast!
Thanks! I'll run all the common dimensions then (256, 512, 768, 1024).
Caught me at a boring conference talk ;)
Re running the dimensions: I just want to refer to embeddings-benchmark/mteb#1211.
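For reference, requesting reduced-dimension embeddings from the OpenAI API only needs the `dimensions` parameter. A minimal sketch (the input texts are illustrative, and the actual evaluation goes through mteb rather than raw API calls):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

texts = ["glucose tolerance test", "myocardial infarction"]  # illustrative inputs
for dim in (256, 512, 768, 1024):
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
        dimensions=dim,  # the v3 models support Matryoshka-style truncation
    )
    assert len(resp.data[0].embedding) == dim
```

The endpoint truncates the vector and re-normalizes it, which is exactly the Matryoshka property being tested here.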
By the way, regarding CodeCarbon, in addition to filtering it out in the leaderboard, we could also modify `mteb.run()` to check if the evaluated model is an instance of one of the API provider wrappers. If so, we could override the `co2_tracker` parameter to `False`. Somewhere around here.
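In code, the proposed check could look like the sketch below; `OpenAIWrapper` is only a placeholder for whatever API-provider wrapper classes mteb actually defines:

```python
class OpenAIWrapper:  # placeholder for mteb's real API wrapper class
    ...

API_WRAPPER_TYPES = (OpenAIWrapper,)

def resolve_co2_tracker(model: object, co2_tracker: bool) -> bool:
    """Force CO2 tracking off for API-backed models: CodeCarbon would only
    measure the local client process, not the provider's compute."""
    if isinstance(model, API_WRAPPER_TYPES):
        return False
    return co2_tracker
```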
But I am not sure we know that at evaluation time (the encoder does not contain the metadata).
I hadn't seen this, thanks. I intended to simply create a PR there, but the experiments approach is more flexible.
I think this is perfectly fine for running them, but I don't think we would accept the PR, as we have been working on removing duplicates on the new leaderboard and this would add them again. (Would love to have them run though; then we can start experimenting with how to best display it.)
I agree that the encoder does not contain the metadata, but I believe it is still doable. Maybe I'm not explaining myself clearly; I can open a small PR to illustrate it in the coming days.
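One possible shape for that: instead of inspecting the encoder instance, consult the model registry by name. A sketch, assuming a flag like `is_api` were added to `ModelMeta` (a hypothetical field, not an existing one):

```python
import mteb

meta = mteb.get_model_meta("openai/text-embedding-3-small")
# `is_api` is a hypothetical ModelMeta field marking API-backed models;
# with it, run() could decide before the encoder is ever instantiated.
co2_tracker = not getattr(meta, "is_api", False)
```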
Fair enough! I'll run the evaluations for now and open a draft PR here (they would still show up in different folders), but not there.
Would love a PR; that should make it clear.
Perfect, I'll open one.
Following up on the earlier PR, this PR adds results from evaluating `text-embedding-3-small` and `text-embedding-3-large` on the remaining tasks in the `MTEB(Medical)` benchmark.

As discussed here, results from revision 1 are equivalent to those from revision 2. Therefore, we only evaluated tasks that were not previously run.
Thank you @Muennighoff for providing an API key with credits!
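For anyone reproducing this, the runs follow the standard mteb flow; a sketch, assuming the benchmark and model names match mteb's registry identifiers (tasks whose results already sit in the output folder are skipped unless you ask to overwrite them):

```python
import mteb

benchmark = mteb.get_benchmark("MTEB(Medical)")  # assumed registry name
model = mteb.get_model("openai/text-embedding-3-large")

evaluation = mteb.MTEB(tasks=benchmark)
evaluation.run(model, output_folder="results")
```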
Checklist

- `make test`
- `make pre-push`