Feat: Add jasper #1591

Merged
merged 20 commits into from
Dec 23, 2024

Conversation

Samoed
Collaborator

@Samoed Samoed commented Dec 14, 2024

Checklist

  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.

Adding a model checklist

  • I have filled out the ModelMeta object to the extent possible
  • I have ensured that my model can be loaded using
    • mteb.get_model(model_name, revision) and
    • mteb.get_model_meta(model_name, revision)
  • I have tested the implementation works on a representative set of tasks.

Full model results embeddings-benchmark/results#68

Some results are not matching, with a significant gap in AskUbuntuDupQuestions. @DunZhang, could you please take a look at what might be wrong?

I tried adding:

sentences = [i if i.strip() else "<|endoftext|>" for i in sentences]

but it had no effect. I also tried setting max_seq_length to 400, but that didn’t help either.

| Task | Results from embeddings-benchmark/results#68 | Results from this implementation |
|---|---|---|
| BIOSSES | 0.846598 | 0.847182 |
| STS17 (en-ar) | 0.52721 | 0.526445 |
| STS17 (fr-en) | 0.841421 | 0.841915 |
| STS17 (en-en) | 0.910079 | 0.910827 |
| STS17 (nl-en) | 0.841977 | 0.841734 |
| STS17 (es-en) | 0.869965 | 0.870213 |
| STS17 (it-en) | 0.863103 | 0.863171 |
| STS17 (en-de) | 0.858372 | 0.858703 |
| STS17 (en-tr) | 0.555649 | 0.555269 |
| AskUbuntuDupQuestions | 0.673812 | 0.67403 |
| SummEval | 0.314212 | 0.314331 |
| SciFact | 0.80372 | 0.80493 |
| TweetSentimentExtractionClassification | 0.772411 | 0.772722 |
| EmotionClassification | 0.8773 | 0.8772 |
| SprintDuplicateQuestions | 0.964021 | 0.963987 |
| SCIDOCS | 0.24638 | 0.24713 |

Full results jasper_results.zip

Eval code
import mteb

tasks = mteb.get_tasks(
    tasks=[
        "BIOSSES",
        "STS17",
        "STS16",
        "AskUbuntuDupQuestions",
        "SummEval",
        "SciFact",
        "SCIDOCS",
        "TweetSentimentExtractionClassification",
        "EmotionClassification",
        "SprintDuplicateQuestions"
    ],
    languages=["eng"]
)

models = [ 
    "infgrad/jasper_en_vision_language_v1",
]

evaluation = mteb.MTEB(tasks=tasks)

for model_name in models:
    model = mteb.get_model(model_name)
    evaluation.run(
        model,
        output_folder="results",
        verbosity=2,
        raise_error=False,
        encode_kwargs={"batch_size": 8},
        # overwrite_results=True,
    )

@Samoed Samoed marked this pull request as ready for review December 14, 2024 21:15
@Samoed
Collaborator Author

Samoed commented Dec 14, 2024

I think my implementation scores lower on AskUbuntuDupQuestions because, in the authors' implementation, passage prompts are omitted only for retrieval tasks. In my implementation, passage prompts are not applied to any task (including retrieval and reranking), resulting in worse performance. I'm not sure what to do in this case.

@KennethEnevoldsen
Contributor

I think it is perfectly fine to apply prompts only to some tasks (as long as it is clear in the implementation)

@Samoed
Collaborator Author

Samoed commented Dec 14, 2024

I agree, but it is unclear why the passage prompt is skipped for retrieval only. Should it be applied to InstructionsRetrieval or InstructionReranking?

@KennethEnevoldsen
Contributor

Yea that is a somewhat arbitrary decision (again that is why it is nice to have the implementation). I would probably add it in both cases

@Samoed
Collaborator Author

Samoed commented Dec 15, 2024

I suggest waiting for @DunZhang's input to hear his opinion on this

@DunZhang

@Samoed
Hi, It's an interesting thing 😄.

In English-MTEB, the Rerank tasks are really more like STS tasks, which means that the queries and 'passages' are symmetrical.
In other words, the so-called 'passages' are actually questions.

Below are the task type of AskUbuntuDupQuestions and some example data:

[two screenshots: the AskUbuntuDupQuestions task type and example data]

Since the data are symmetrical, both sides need a prompt, just as in an STS task!

By contrast, the Rerank tasks in Chinese-MTEB pair a query with a passage (not relevant to the present matter, so I won't discuss it at length).

Finally, my model's usage:

For s2p tasks (e.g. retrieval), s needs a prompt and p does not.

For s2s tasks (e.g. STS), both sides need a prompt.
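
The usage rule above can be sketched as a small helper. This is only an illustration under stated assumptions: `needs_prompt` and the `"query"`/`"passage"` labels are hypothetical names, not the code merged in this PR, and any category other than s2s is treated like s2p for passages.

```python
def needs_prompt(category: str, side: str) -> bool:
    """Return True if text on this side should be prefixed with the task prompt.

    category: "s2p" (asymmetric, e.g. retrieval) or "s2s" (symmetric, e.g. STS).
    side: "query" (the "s" side) or "passage" (the second side).
    """
    if side not in ("query", "passage"):
        raise ValueError(f"unknown side: {side!r}")
    # Queries get a prompt in both settings.
    if side == "query":
        return True
    # Passages get a prompt only when the task is symmetric (s2s).
    return category == "s2s"


for category in ("s2p", "s2s"):
    for side in ("query", "passage"):
        print(category, side, needs_prompt(category, side))
```

For AskUbuntuDupQuestions, where the 'passages' are themselves questions, this rule would treat the task as s2s and prompt both sides.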

Reference:
https://huggingface.co/datasets/mteb/askubuntudupquestions-reranking?row=4
https://github.com/embeddings-benchmark/mteb/blob/main/docs/tasks.md

@DunZhang

As for the other mismatched tasks, the cause is hard to pin down; if the overall difference in averages isn't too large, I think it's negligible.
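
One way to check how large the differences are is to compute the per-task absolute gap and its mean from the scores posted in the table earlier in this thread. The sketch below copies those numbers as-is; the variable names are illustrative only.

```python
# (task, score from embeddings-benchmark/results#68, score from this implementation)
scores = [
    ("BIOSSES", 0.846598, 0.847182),
    ("STS17 (en-ar)", 0.52721, 0.526445),
    ("STS17 (fr-en)", 0.841421, 0.841915),
    ("STS17 (en-en)", 0.910079, 0.910827),
    ("STS17 (nl-en)", 0.841977, 0.841734),
    ("STS17 (es-en)", 0.869965, 0.870213),
    ("STS17 (it-en)", 0.863103, 0.863171),
    ("STS17 (en-de)", 0.858372, 0.858703),
    ("STS17 (en-tr)", 0.555649, 0.555269),
    ("AskUbuntuDupQuestions", 0.673812, 0.67403),
    ("SummEval", 0.314212, 0.314331),
    ("SciFact", 0.80372, 0.80493),
    ("TweetSentimentExtractionClassification", 0.772411, 0.772722),
    ("EmotionClassification", 0.8773, 0.8772),
    ("SprintDuplicateQuestions", 0.964021, 0.963987),
    ("SCIDOCS", 0.24638, 0.24713),
]

# Absolute gap between the two runs for every task.
gaps = {task: abs(reference - ours) for task, reference, ours in scores}
mean_gap = sum(gaps.values()) / len(gaps)
worst_task = max(gaps, key=gaps.get)

print(f"mean |diff| = {mean_gap:.6f}")
print(f"largest gap: {worst_task} ({gaps[worst_task]:.6f})")
```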

Below are some reproduction details:

  • In my test, the model is bfloat16
  • max_length=400
  • attn_implementation=sdpa
  • vector_dim=12288
  • padding_side=right

Finally, normalize inside SentenceTransformers' encode function:

encode_multi_process(...., normalize_embeddings=True)

or

encode(...., normalize_embeddings=True)

then convert to fp32:
vectors = vectors.astype(dtype=np.float32)

Actually, converting to fp32 first and then normalizing always gives a slightly higher score (the difference is not statistically significant).
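
The two orderings can be compared on toy data with NumPy alone. This is a minimal sketch, not the evaluation code: NumPy has no native bfloat16, so float16 is used as a stand-in, and `l2_normalize` is a hypothetical helper rather than a SentenceTransformers function.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm, computing in the array's own dtype."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / norms


rng = np.random.default_rng(0)
# float16 stands in for bfloat16 here (assumption; NumPy lacks bfloat16).
low_precision = rng.standard_normal((4, 64)).astype(np.float16)

# Order described first: normalize in low precision, then convert to fp32.
normalize_then_cast = l2_normalize(low_precision).astype(np.float32)

# Alternative order: convert to fp32 first, then normalize.
cast_then_normalize = l2_normalize(low_precision.astype(np.float32))

max_diff = float(np.abs(normalize_then_cast - cast_then_normalize).max())
print(f"max elementwise difference: {max_diff:.2e}")
```

Both results are fp32 arrays of approximately unit-norm rows; they differ only by the rounding introduced when the normalization itself runs in low precision.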

@Samoed
Collaborator Author

Samoed commented Dec 17, 2024

I think I will try applying the prompt based on the retrieval type. Thank you for the feedback!

instruction = self.get_task_instruction(task_name, prompt_type)

# so that prompts won't be applied to passages
if prompt_type == PromptType.passage and task.metadata.type == "s2p":
Collaborator Author

I've updated it to apply the passage prompt only if the task type is s2s or p2p.

@KennethEnevoldsen
Contributor

KennethEnevoldsen commented Dec 22, 2024

Just noting here that it is perfectly valid to change the prompt conditional on the task.

E.g.

if prompt_type == PromptType.passage and task.metadata.name not in SYMETRIC_RETRIEVAL_TASKS:
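
A self-contained sketch of that name-based variant might look like the following. Everything here is hypothetical: the set name, the helper, and the example instruction string are placeholders, and the only task named in this thread as symmetric is AskUbuntuDupQuestions.

```python
from typing import Optional

# Hypothetical allowlist of tasks whose "passages" are really questions.
SYMMETRIC_TASKS = frozenset({"AskUbuntuDupQuestions"})


def passage_instruction(task_name: str, instruction: str) -> Optional[str]:
    """Return the instruction to prepend to passages, or None to skip it."""
    if task_name not in SYMMETRIC_TASKS:
        # Asymmetric task: passages are encoded without a prompt.
        return None
    return instruction


print(passage_instruction("AskUbuntuDupQuestions", "example instruction"))
print(passage_instruction("SciFact", "example instruction"))
```

Compared with filtering on the task category, an explicit list is stricter but has to be maintained by hand as tasks are added.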

@Samoed
Collaborator Author

Samoed commented Dec 22, 2024

Yes, I've changed it: the passage prompt is now skipped for all s2p tasks. Alternatively, I can filter strictly on a selected list of tasks.

@KennethEnevoldsen
Contributor

Then I believe this is ready to merge?

@Samoed
Collaborator Author

Samoed commented Dec 22, 2024

I think so, but I was waiting to see whether @DunZhang has something to add.

@DunZhang

I think so, but I was waiting to see whether @DunZhang has something to add.

Hi. I have nothing more to add 😄

@KennethEnevoldsen
Contributor

Perfect - will merge then - thanks for taking the time

@KennethEnevoldsen KennethEnevoldsen merged commit ef5a068 into main Dec 23, 2024
10 checks passed
@KennethEnevoldsen KennethEnevoldsen deleted the add_jasper branch December 23, 2024 05:59