Feature specs discussion board for umBRELA #1

Open
UShivani3 opened this issue May 4, 2024 · 5 comments

Comments

@UShivani3
Member

I am starting this thread for feature-spec discussion for umBRELA. cc @lintool @ronakice.

Suggestions from my side:

  • A parameter specifying the number of samples to draw at inference time, with majority voting over the sampled labels to get the final result (see the sketch after this list).
  • Maybe add some additional instructions to the prompt to guide the LLM. We already have an option to specify a prompt file, though, so I'm not sure how useful this would be.
  • If the input dictionary already includes a relevance label, we can add a key to the output indicating the correctness of that label. This would allow verifying already-available relevance assessments.
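For the first suggestion, here is a rough sketch of the sampling-plus-voting idea. The judge_with_voting helper and num_samples parameter are hypothetical (nothing like this exists in umbrela yet), and it assumes judge() returns a list of judgment dicts, one per candidate:

from collections import Counter

# Hypothetical helper: run the judge several times and majority-vote
# the parsed relevance labels per candidate.
def judge_with_voting(judge, input_dict, num_samples=5):
    # One list of judgment dicts per sampling run.
    runs = [judge.judge(input_dict) for _ in range(num_samples)]
    voted = []
    # Group the runs by candidate position and take the most common label.
    for per_candidate in zip(*runs):
        labels = [j["judgment"] for j in per_candidate]
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted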
@ronakice
Member

ronakice commented May 5, 2024

@UShivani3 can you give some demo usage here so @lintool is aware of the exact state of the framework so far? A few snippets would help.

@UShivani3
Member Author

UShivani3 commented May 5, 2024

Yes, my bad!

Here are the snippets.

Setting up the model judge:

from umbrela.vicuna_judge import VicunaJudge

judge_vicuna = VicunaJudge("dl19-passage")

Passing qrel-passages for evaluations:

input_dict = {
    "query": {"text": "how long is life cycle of flea", "qid": "264014"},
    "candidates": [
        {
            "doc": {
                "segment": "The life cycle of a flea can last anywhere from 20 days to an entire year. It depends on how long the flea remains in the dormant stage (eggs, larvae, pupa). Outside influences, such as weather, affect the flea cycle. A female flea can lay around 20 to 25 eggs in one day."
            },
            "docid": "4834547",
            "score": 14.971799850463867,
        },
    ]
}

judgments = judge_vicuna.judge(input_dict)

Output format for each judgment:

judgment = {
    "model": model_name,
    "query": query,
    "passage": passage,
    "prompt": prompt,
    "prediction": model_response,
    "judgment": relevance_label_after_parsing_model_response,
}
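So downstream code can read the parsed label directly, e.g. (assuming judge() returns one such dict per candidate):

for j in judgments:
    print(j["query"], "->", j["judgment"])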

I have also added sample code using the OSLLMJudge class here: https://github.com/castorini/umbrela/blob/main/src/eval/test.py.

@ronakice
Member

ronakice commented May 6, 2024

@thakur-nandan can you give your thoughts on the design so far too?

@thakur-nandan
Member

Sure, thanks @UShivani3! Overall, I like the minimalistic code and easy-to-use repository design. Both prompts look good, and the installation instructions in the README are helpful.

One suggestion I have is to decouple the prompts from the LLM judge code. Otherwise this will get complicated in the future, as one would need to keep updating the base LLMJudge with every new prompt via branches like the one shown below:

if prompt_type:

How I think we can restructure the design:

  • PromptTemplate class: takes prompt_type, prompt_file, and fewshot_count as input and outputs any prompt we like (either bing or basic) for the query-passage pair.
  • LLMJudge class: takes the prompt from the PromptTemplate class as input and outputs the relevance judgment.
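A rough sketch of what this split could look like (all names and signatures here are hypothetical, just to illustrate the idea, not the actual umbrela API):

class PromptTemplate:
    """Builds a prompt for a query-passage pair, independent of any judge."""

    def __init__(self, prompt_type="basic", prompt_file=None, fewshot_count=0):
        self.prompt_type = prompt_type
        self.prompt_file = prompt_file
        # fewshot_count would control how many few-shot examples get prepended.
        self.fewshot_count = fewshot_count

    def build(self, query: str, passage: str) -> str:
        if self.prompt_file:
            with open(self.prompt_file) as f:
                template = f.read()
        else:
            # Placeholder for the built-in templates keyed by prompt_type
            # (bing or basic).
            template = "Query: {query}\nPassage: {passage}\nRelevance (0-3):"
        return template.format(query=query, passage=passage)


class LLMJudge:
    """Consumes a ready-made PromptTemplate; knows nothing about prompt internals."""

    def __init__(self, template: PromptTemplate):
        self.template = template

    def judge(self, query: str, passage: str) -> int:
        prompt = self.template.build(query, passage)
        response = self._call_model(prompt)
        return self._parse_label(response)

    def _call_model(self, prompt: str) -> str:
        raise NotImplementedError  # model-specific subclasses (e.g. a Vicuna judge)

    def _parse_label(self, response: str) -> int:
        raise NotImplementedError  # parse the relevance label from the model response

This way, new prompts only touch PromptTemplate, and new models only touch LLMJudge subclasses.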

@ronakice @UShivani3, I would be happy to hear your suggestions.

@thakur-nandan
Member

One more question, @UShivani3: what does the score field in input_dict signify? Is it a retrieval/reranking score?

Does it affect the LLMJudge response?
