
Set up the ability to run eval suites #114

Merged

Conversation

@polm-stability (Collaborator)

This PR includes changes to allow the running of eval suites with a single command. An example command looks like this:

python scripts/run_suite.py my_model my_eval_suite my_prompt

The suite is specified as a list of tasks, with versions and fewshot specs, in a config file. Because the spec is in a file, it can be versioned and shared across models, while each model can vary the prompt it uses (as well as args related to loading the model). Prompts are specified using names rather than numbers to make it clear what they refer to and avoid mistakes.
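As a hedged illustration of the idea (the actual config format, field names, and loader in this PR are not shown here, so everything below is an assumption), a suite file could pin tasks, versions, and fewshot counts while the prompt stays a per-run argument:

```python
import json

# Hypothetical suite config: tasks with pinned versions and fewshot
# counts. The prompt is deliberately NOT in the file, so the same
# suite can be shared across models that each use a different prompt.
suite_json = """
{
  "tasks": [
    {"name": "jsquad", "version": "1.1", "fewshot": 2},
    {"name": "jcommonsenseqa", "version": "1.1", "fewshot": 3}
  ]
}
"""

suite = json.loads(suite_json)
for task in suite["tasks"]:
    # Each entry fully specifies a task, so nothing needs to be
    # copied by hand between runs or models.
    print(f'{task["name"]}-{task["version"]}', "fewshot:", task["fewshot"])
```

Versioning this file alongside the repo means every model is evaluated against an identical task list.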

This is a barely functional wrapper for running "test suites", which are just a list of preconfigured tasks. You can specify the prompt and model. This needs more testing and UI cleanup.

This moves suite config handling code into the library proper instead of the script, and creates a subdir for suite configs.
@polm-stability polm-stability marked this pull request as ready for review November 9, 2023 08:18
@polm-stability polm-stability removed the request for review from jon-tow November 9, 2023 08:18
@polm-stability (Collaborator, Author)

This is still pretty bare-bones, but it's functional and should be good for automating eval across different models. Basically we can run the same eval we've been running, but with a simpler invocation, and without worrying about copying versions or fewshot parameters the wrong way.

Comment on lines 1 to 10
PROMPT_CODES = {
"user": "0.0",
"jgpt": "0.1",
"fintan": "0.2",
"fintan2": "0.2.1",
"ja-alpaca": "0.3",
"rinna-sft": "0.4",
"rinna-bilingual": "0.5",
"llama2": "0.6",
}
@polm-stability (Collaborator, Author)

Let me know if these names make sense or could be improved.
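For context, a minimal sketch of how a prompt name might be resolved to its version code (the `resolve_prompt` helper is hypothetical and only illustrates the lookup; the mapping itself is abbreviated from the diff above):

```python
# Human-readable prompt names map to the numeric version codes used
# internally (subset of the PROMPT_CODES mapping in the diff).
PROMPT_CODES = {
    "user": "0.0",
    "jgpt": "0.1",
    "fintan": "0.2",
    "llama2": "0.6",
    # (remaining entries omitted)
}

def resolve_prompt(name: str) -> str:
    """Hypothetical helper: turn a prompt name into its version code,
    failing loudly on typos instead of silently running the wrong prompt."""
    try:
        return PROMPT_CODES[name]
    except KeyError:
        raise ValueError(f"Unknown prompt name {name!r}; known: {sorted(PROMPT_CODES)}")

print(resolve_prompt("jgpt"))  # → 0.1
```

Using names at the command line and resolving them once, centrally, is what avoids the copy-the-wrong-number mistakes mentioned above.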

@mkshing left a comment

@polm-stability thank you for this PR!

  • Can you add instructions for using suites somewhere? Maybe you could just change the example script in the README to one using suites.
  • Can you fix docs/prompt_templates.md?

@polm-stability (Collaborator, Author)

Good point, docs should be updated now.

@mkshing commented Nov 10, 2023

Awesome! LGTM!

@mrorii left a comment

Generally LGTM 👍, but let me double check one point just in case 🙏

Resolved review threads: docs/prompt_templates.md, scripts/run_suite.py
This introduces a style for handling complex prompts, and specifically handles the case of JSLM Beta. It works by using a function that takes the name of the task as input, which allows full customization without requiring a detailed specification when actually running an eval suite.

The style is simple: instead of mapping to a numeric version like 0.2, a prompt shortname can map to a callable that takes the task name. This allows any kind of custom logic.

This may not be the simplest or best approach, but it required few
changes, keeps everything in one place, and touches nothing else in the
code base, so it should be easy to change later if necessary.
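A minimal sketch of that style, under stated assumptions: the `jslm_beta_prompt` callable and its per-task branching are invented for illustration; only the string-or-callable mapping idea comes from the commit message above.

```python
# Prompt shortnames map either to a fixed version code (str) or to a
# callable taking the task name, which allows per-task custom logic.
def jslm_beta_prompt(task_name: str) -> str:
    # Hypothetical example: pick a different prompt version per task.
    return "0.2.1" if task_name.startswith("jsquad") else "0.2"

PROMPT_CODES = {
    "fintan": "0.2",                # plain version code
    "jslm-beta": jslm_beta_prompt,  # callable for complex cases
}

def resolve(prompt_name: str, task_name: str) -> str:
    """Resolve a prompt for a given task, calling through if needed."""
    code = PROMPT_CODES[prompt_name]
    return code(task_name) if callable(code) else code

print(resolve("fintan", "jsquad"))     # → 0.2
print(resolve("jslm-beta", "jsquad"))  # → 0.2.1
```

Because callers only ever see the resolved string, the rest of the code base stays untouched, which is what makes this easy to replace later.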
@mrorii left a comment

LGTM, thanks! 👍

@polm-stability polm-stability merged commit e68527f into Stability-AI:jp-stable Nov 20, 2023
1 check passed

4 participants