Added MMTEB (#275)

* restructuring the readme
* added mmteb
* removed unnecessary method
* Added docstring to metadata
* Updated outdated examples
* formatting documents
* fix: Updated form to be parsed correctly
* Updated based on feedback
* Apply suggestions from code review

  Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>
* updated based on feedback
* Added suggestion from review
* added correction based on review

---------

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>
embeddings-benchmark · Mar 24, 2024 · c0dc49a
1 parent b08913f
commit c0dc49a
Showing 129 changed files with 447 additions and 671 deletions.
diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md
@@ -0,0 +1,21 @@
+
+<!-- If you are not submitting for a dataset, feel free to remove the content below  -->
+
+
+<!-- add additional description, questions, etc. related to the new dataset -->
+
+## Checklist for adding MMTEB dataset
+<!-- 
+Before you commit here is a checklist you should complete before submitting
+if you are not 
+ -->
+
+- [ ] I have tested that the dataset runs with the `mteb` package.
+- [ ] I have run the following models on the task (adding the results to the PR). These can be run using the `mteb run -m {model_name} -t {task_name}` command.
+  - [ ] `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`
+  - [ ] `intfloat/multilingual-e5-small`
+- [ ] I have checked that the performance is neither trivial (both models achieve close to perfect scores) nor random (both models achieve close to random scores).
+- [ ] I have considered the size of the dataset and reduced it if it is too big (2048 examples is typically large enough for most tasks).
+- [ ] Run tests locally to make sure nothing is broken using `make test`. 
+- [ ] Run the formatter to format the code using `make lint`. 
+- [ ] I have added points for my submission to the [POINTS.md](https://github.com/embeddings-benchmark/mteb/tree/main/docs/mmteb/POINTS.md) file.
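
The checklist item on dataset size can be illustrated with a small sketch. The helper name `downsample`, its parameters, and the fixed seed are hypothetical and not part of `mteb`; the only value taken from the checklist is the 2048-example guideline.

```python
import random


def downsample(examples: list, max_examples: int = 2048, seed: int = 42) -> list:
    """Hypothetical helper: return at most `max_examples` items, sampled reproducibly."""
    if len(examples) <= max_examples:
        # Small datasets are returned unchanged (as a copy).
        return list(examples)
    # A fixed seed keeps the subset identical across runs, so scores stay comparable.
    rng = random.Random(seed)
    return rng.sample(examples, max_examples)
```

Sampling without replacement and with a fixed seed is one reasonable choice here; a task with class labels may instead want a stratified subset.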
diff --git a/Makefile b/Makefile
@@ -15,3 +15,8 @@ test-parallel:
 	@echo "--- 🧪 Running tests ---"
 	@echo "Note that parallel tests can sometimes cause issues with some tests."
 	pytest -n auto --dist=loadfile -s -v
+
+pr:
+	@echo "--- 🚀 Running requirements for a PR ---"
+	make lint
+	make test-parallel
diff --git a/README.md b/README.md
@@ -46,6 +46,8 @@ from sentence_transformers import SentenceTransformer
 
 # Define the sentence-transformers model name
 model_name = "average_word_embeddings_komninos"
+# or directly from huggingface:
+# model_name = "sentence-transformers/all-MiniLM-L6-v2"
 
 model = SentenceTransformer(model_name)
 evaluation = MTEB(tasks=["Banking77Classification"])
@@ -131,15 +133,15 @@ Models should implement the following interface, implementing an `encode` functi
 
 ```python
 class MyModel():
-    def encode(self, sentences, batch_size=32, **kwargs):
+    def encode(self, sentences: list[str], **kwargs) -> list[np.ndarray] | list[torch.Tensor]:
         """
         Returns a list of embeddings for the given sentences.
+        
         Args:
-            sentences (`List[str]`): List of sentences to encode
-            batch_size (`int`): Batch size for the encoding
+            sentences: List of sentences to encode
 
         Returns:
-            `List[np.ndarray]` or `List[tensor]`: List of embeddings for the given sentences
+            List of embeddings for the given sentences
         """
         pass
 
@@ -152,64 +154,48 @@ If you'd like to use different encoding functions for query and corpus when eval
 
 ```python
 class MyModel():
-    def encode_queries(self, queries, batch_size=32, **kwargs):
+    def encode_queries(self, queries: list[str], **kwargs) -> list[np.ndarray] | list[torch.Tensor]:
         """
         Returns a list of embeddings for the given sentences.
         Args:
-            queries (`List[str]`): List of sentences to encode
-            batch_size (`int`): Batch size for the encoding
+            queries: List of sentences to encode
 
         Returns:
-            `List[np.ndarray]` or `List[tensor]`: List of embeddings for the given sentences
+            List of embeddings for the given sentences
         """
         pass
 
-    def encode_corpus(self, corpus, batch_size=32, **kwargs):
+    def encode_corpus(self, corpus: list[str] | list[dict[str, str]], **kwargs) -> list[np.ndarray] | list[torch.Tensor]:
         """
         Returns a list of embeddings for the given sentences.
         Args:
-            corpus (`List[str]` or `List[Dict[str, str]]`): List of sentences to encode
+            corpus: List of sentences to encode
                 or list of dictionaries with keys "title" and "text"
-            batch_size (`int`): Batch size for the encoding
 
         Returns:
-            `List[np.ndarray]` or `List[tensor]`: List of embeddings for the given sentences
+            List of embeddings for the given sentences
         """
         pass
 ```
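
As an illustration of the interface in the hunk above, here is a minimal sketch of a model exposing `encode`, `encode_queries`, and `encode_corpus`. The class name `ToyHashingModel`, the `dim` parameter, and the hashing-based embedding are hypothetical stand-ins and not part of `mteb`; a real model would call its own encoder instead. Joining `"title"` and `"text"` with a space is also an assumption.

```python
from __future__ import annotations

import hashlib

import numpy as np


class ToyHashingModel:
    """Hypothetical stand-in model; not part of mteb."""

    def __init__(self, dim: int = 16):
        self.dim = dim

    def _embed(self, text: str) -> np.ndarray:
        # Deterministic bag-of-words hashing: each token increments one bucket.
        vec = np.zeros(self.dim, dtype=np.float32)
        for token in text.lower().split():
            digest = hashlib.md5(token.encode("utf-8")).digest()
            vec[int.from_bytes(digest[:4], "little") % self.dim] += 1.0
        norm = np.linalg.norm(vec)
        return vec / norm if norm > 0 else vec

    def encode(self, sentences: list[str], **kwargs) -> list[np.ndarray]:
        # Extra keyword arguments (e.g. batch_size) are accepted and ignored.
        return [self._embed(s) for s in sentences]

    def encode_queries(self, queries: list[str], **kwargs) -> list[np.ndarray]:
        return self.encode(queries, **kwargs)

    def encode_corpus(
        self, corpus: list[str] | list[dict[str, str]], **kwargs
    ) -> list[np.ndarray]:
        # Documents may be plain strings or dicts with "title" and "text" keys.
        texts = [
            doc if isinstance(doc, str)
            else f"{doc.get('title', '')} {doc['text']}".strip()
            for doc in corpus
        ]
        return self.encode(texts, **kwargs)
```

Accepting `**kwargs` in every method means arguments such as `batch_size` can still be passed by the caller without being part of the signature, which matches the change made in this diff.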
 
-### Evaluating on a custom task
+### Evaluating on a custom dataset
 
-To add a new task, you need to implement a new class that inherits from the `AbsTask` associated with the task type (e.g. `AbsTaskReranking` for reranking tasks). You can find the supported task types in [here](https://github.com/embeddings-benchmark/mteb-draft/tree/main/mteb/abstasks).
+To evaluate on a custom task, pass an instance of your task class to `MTEB` as shown below. See [how to add a new task](docs/adding_a_dataset.md) for how to create a new task in MTEB.
 
 ```python
 from mteb import MTEB
 from mteb.abstasks.AbsTaskReranking import AbsTaskReranking
 from sentence_transformers import SentenceTransformer
 
 
-class MindSmallReranking(AbsTaskReranking):
-    @property
-    def description(self):
-        return {
-            "name": "MindSmallReranking",
-            "hf_hub_name": "mteb/mind_small",
-            "description": "Microsoft News Dataset: A Large-Scale English Dataset for News Recommendation Research",
-            "reference": "https://www.microsoft.com/en-us/research/uploads/prod/2019/03/nl4se18LinkSO.pdf",
-            "type": "Reranking",
-            "category": "s2s",
-            "eval_splits": ["validation"],
-            "eval_langs": ["en"],
-            "main_score": "map",
-        }
+class MyCustomTask(AbsTaskReranking):
+    ...
 
 model = SentenceTransformer("average_word_embeddings_komninos")
-evaluation = MTEB(tasks=[MindSmallReranking()])
+evaluation = MTEB(tasks=[MyCustomTask()])
 evaluation.run(model)
 ```
 
-> **Note:** for multilingual tasks, make sure your class also inherits from the `MultilingualTask` class like in [this](https://github.com/embeddings-benchmark/mteb-draft/blob/main/mteb/tasks/Classification/MTOPIntentClassification.py) example.
-
 </details>
 
 <br /> 
@@ -221,12 +207,16 @@ evaluation.run(model)
 | 📋 [Tasks] | Overview of available tasks |
 | 📈 [Leaderboard] | The interactive leaderboard of the benchmark |
 | 🤖 [Adding a model] | Information related to how to submit a model to the leaderboard |
+| 👩‍💻 [Adding a dataset] | How to add a new task/dataset to MTEB | 
 | 🤝  [Contributing] | How to contribute to MTEB and set it up for development |
+<!-- | 🌐 [MMTEB] | An open-source effort to extend MTEB to cover a broad set of languages |   -->
 
 [Tasks]: docs/tasks.md
 [Contributing]: docs/contributing.md
 [Adding a model]: docs/adding_a_model.md
+[Adding a dataset]: docs/adding_a_dataset.md
 [Leaderboard]: https://huggingface.co/spaces/mteb/leaderboard
+[MMTEB]: docs/mmteb/readme.md
 
 ## Citing