add llama2 format #100

Merged: 5 commits merged into jp-stable from mkshing/llama2-prompt on Oct 10, 2023

Conversation

@mkshing commented Oct 10, 2023

Description

Added a new prompt version for llama2-chat, following https://huggingface.co/blog/llama2#how-to-prompt-llama-2. This PR makes it possible to evaluate all Llama 2 variants, including ELYZA-japanese-Llama-2-7b-instruct.
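
For context, the llama2-chat format described in that blog post wraps each request in [INST] ... [/INST] markers, with the system prompt enclosed in <<SYS>> tags. A minimal sketch of the template (illustrative only; build_llama2_prompt is a hypothetical helper, not the harness's actual implementation):

import os

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

def build_llama2_prompt(user_message: str) -> str:
    # Read the system prompt from the SYSTEM_PROMPT environment variable used in the commands below.
    system_prompt = os.environ["SYSTEM_PROMPT"]
    return f"{B_INST} {B_SYS}{system_prompt}{E_SYS}{user_message} {E_INST}"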

Usage

Two key points:

  1. Set the correct system prompt in the SYSTEM_PROMPT environment variable.
  2. Use 0.6 as the prompt version.
# Make sure to set the correct system prompt 
export SYSTEM_PROMPT="You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."

MODEL_ARGS="pretrained=meta-llama/Llama-2-7b-chat-hf"
TASK="jsquad-1.1-0.6,jcommonsenseqa-1.1-0.6,jnli-1.1-0.6,marc_ja-1.1-0.6,jaqket_v2-0.2-0.6,xlsum_ja-1.0-0.6,xwinograd_ja,mgsm-1.0-0.6"
python main.py \
   --model hf-causal \
   --model_args $MODEL_ARGS \
   --tasks $TASK \
   --num_fewshot "2,3,0,0,1,1,0,5" \
   --device "cuda" \
   --output_path "models/llama2/llama2-7b-chat/result.json" 

Comparison between prompt versions 0.3 and 0.6 on JCommonsenseQA for ELYZA-japanese-Llama-2-7b-instruct

| Task | Version | Metric | Value | Stderr |
|------|---------|--------|-------|--------|
| jcommonsenseqa-1.1-0.6 | 1.1 | acc | 0.7087 | ± 0.0136 |
| | | acc_norm | 0.7015 | ± 0.0137 |
| jcommonsenseqa-1.1-0.3 | 1.1 | acc | 0.6506 | ± 0.0143 |
| | | acc_norm | 0.3539 | ± 0.0143 |
  • For reference: 65.15 in their blog (see the "lm-eval-harness" section)
export SYSTEM_PROMPT="あなたは誠実で優秀な日本人のアシスタントです。"
export MODEL_ARGS="pretrained=elyza/ELYZA-japanese-Llama-2-7b-instruct"
export TASK="jcommonsenseqa-1.1-0.6,jcommonsenseqa-1.1-0.3"
python main.py  \
    --model hf-causal \
    --model_args $MODEL_ARGS \
    --tasks $TASK \
    --num_fewshot "3,3" \
    --device "cuda" \
    --output_path models/elyza/ELYZA-japanese-Llama-2-7b-instruct/result.json
{
  "results": {
    "jcommonsenseqa-1.1-0.6": {
      "acc": 0.7086684539767649,
      "acc_stderr": 0.013589216112682913,
      "acc_norm": 0.7015192135835567,
      "acc_norm_stderr": 0.013685386698397504
    },
    "jcommonsenseqa-1.1-0.3": {
      "acc": 0.6505808757819481,
      "acc_stderr": 0.014259460025628168,
      "acc_norm": 0.353887399463807,
      "acc_norm_stderr": 0.01430097848599956
    }
  },
  "versions": {
    "jcommonsenseqa-1.1-0.6": 1.1,
    "jcommonsenseqa-1.1-0.3": 1.1
  },
  "config": {
    "model": "hf-causal",
    "model_args": "pretrained=elyza/ELYZA-japanese-Llama-2-7b-instruct",
    "num_fewshot": [
      3,
      3
    ],
    "batch_size": null,
    "device": "cuda",
    "no_cache": false,
    "limit": null,
    "bootstrap_iters": 100000,
    "description_dict": {}
  }
}
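
For quick inspection, the per-task metrics can be read straight from the result.json shown above. A minimal sketch (assuming the file path passed to --output_path in the command above):

import json

# Load the harness output written via --output_path and print each task's accuracy.
with open("models/elyza/ELYZA-japanese-Llama-2-7b-instruct/result.json") as f:
    result = json.load(f)

for task, metrics in result["results"].items():
    print(f"{task}: acc={metrics['acc']:.4f} ± {metrics['acc_stderr']:.4f}")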

@mkshing added the enhancement label Oct 10, 2023
@mkshing requested a review from jon-tow as a code owner October 10, 2023 03:54
@mkshing removed the request for review from jon-tow October 10, 2023 03:54
@mkshing mentioned this pull request Oct 10, 2023
@mkshing self-assigned this Oct 10, 2023
@leemengtw (Collaborator) commented

@mkshing Quick comment, but maybe we could start using llama2 instead of 0.6 to improve readability? This was suggested by @polm-stability on Slack as well.

@mkshing (Author) commented Oct 10, 2023

@leemengtw Yeah, I agree with your point, but we would need to deprecate the integer version and rename everything, which takes extra work. So, at least in this PR, I want to keep the integer format. Does that make sense?

@mkshing removed the request for review from fujiki-saij October 10, 2023 06:20
@mrorii left a comment

LGTM 👍

lm_eval/tasks/ja/jaqket_v2.py (comment resolved)
@leemengtw (Collaborator) left a comment

> @leemengtw Yeah, I agree with your point, but we would need to deprecate the integer version and rename everything, which takes extra work. So, at least in this PR, I want to keep the integer format. Does that make sense?

@mkshing Yes, that totally makes sense. Let's keep 0.6 for this release.

Also, the PR LGMeng.

@mkshing merged commit c517f0e into jp-stable Oct 10, 2023
1 check passed
@mkshing deleted the mkshing/llama2-prompt branch October 10, 2023 06:58
polm-stability pushed a commit to polm-stability/lm-evaluation-harness that referenced this pull request Oct 11, 2023
* add llama2 format

* add 0.6 in prompt_templates.md

* make pre-commit pass

* remove debugging line
polm-stability added a commit that referenced this pull request Nov 6, 2023
* Initial working refactor

This just pulls the argparse stuff into a separate function.

* Do some rearrangement for the refactor

Eval args are necessary, other params are optional.

The print output is only needed when called from the cli, plus it
assumes that various keys are present (even if None), which is not the
case when calling from Python.

* Move main script to scripts dir, add symlink

Other scripts can't import the main script since it's in the top level.
This moves it into the scripts dir and adds a symlink so it's still
usable at the old location.

* Work on adding example Python harness script

* Add notify script

* Fix arg

* task cleanup

* Add versions to tasks

* Fix typo

* Fix versions

* Read webhook URL from env var

* evaluate line-corporation large models (#81)

* compare results between Jsquad prompt with title and without title (#84)

* re-evaluate models with jsquad prompt with title

* update jsquad to include titles into the prompt

* re-evaluate models with jsquad prompt with title

* inherit JSQuAD v1.2 tasks from v1.1 for readability

* re-evaluate models with jsquad prompt with title

* won't need jsquad_v11

* revert result.json and harness.sh in models

* fix format

* Verbose output for more tasks (#92)

* Add output to jaqket v2

* Add details to jsquad

* Add verbose output to xlsum

---------

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Add gptq support (#87)

* add EleutherAI PR519 autoGPTQ

* add comma

* change type

* change type2

* change path

* Undo README modifications

---------

Co-authored-by: webbigdata-jp <dahara1@webbigdata.jp>

* Add Balanced Accuracy (#95)

* First implementation of balanced accuracy

* Add comment

* Make JNLI a balanced acc task

* Add mcc and balanced f1 scores

---------

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Remove 3.8 version spec from pre-commit config

The version here makes it so that pre-commit can only run in an
environment with python3.8 in the path, but there's no compelling reason
for that. Removing the spec just uses system python.

* Fix Linter Related Issues (#96)

* Change formatting to make the linter happy

This is mostly:

- newlines at end of files
- removing blank lines at end of files
- changing single to double quotes
- black multi-line formatting rules
- other whitespace edits

* Remove codespell

Has a lot of false positives

* boolean style issue

* bare except

These seem harmless enough, so just telling the linter to ignore them

* More linter suggestions

---------

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Simplify neologdn version

This was pointing to a commit, but the relevant PR has been merged and
released for a while now, so a normal version spec can be used.

* Update xwinograd dataset

The old dataset was deleted.

* won't need llama2/llama2-2.7b due to duplication (#99)

* add gekko (#98)

Co-authored-by: webbigdata-jp <dahara1@webbigdata.jp>

* add llama2 format (#100)

* add llama2 format

* add 0.6 in prompt_templates.md

* make pre-commit pass

* remove debugging line

* fix bug on `mgsm` for prompt version `0.3` (#101)

* Add JCoLA task (#93)

* WIP: need JCoLA

* Update harness.jcola.sh

* update prompt

* update prompt

* update prompt

* update prompt

* Revert "update prompt"

This reverts commit cd9a914.

* WIP: evaluate on JCoLA

* Add new metrics to cola

This modifies cola, since jcola just inherits this part. It's not a
problem to modify the parent task because it just adds some output.

* Linter edits

* evaluate on JCoLA

* need JCoLAWithLlama2

* JCoLA's prompt version should be 0.0

https://github.com/Stability-AI/lm-evaluation-harness/blob/jp-stable/docs/prompt_templates.md

* documentation

jptasks.md and prompt_templates.md

* won't need harness and result for JCoLA

* fix linter related issue

* Delete harness.jcola.sh

---------

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Co-authored-by: mkshing <33302880+mkshing@users.noreply.github.com>

* Linter fixes

* Remove example - script is used instead of function

* Cleanup

* Cleanup / linter fixes

There were some things related to the old shell script usage that
weren't working, this should fix it.

* Add README section describing cluster usage

---------

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Co-authored-by: kumapo <kumapo@users.noreply.github.com>
Co-authored-by: webbigdata-jp <87654083+webbigdata-jp@users.noreply.github.com>
Co-authored-by: webbigdata-jp <dahara1@webbigdata.jp>
Co-authored-by: mkshing <33302880+mkshing@users.noreply.github.com>
Labels: enhancement

4 participants