compare results between Jsquad prompt with title and without title #84
Conversation
kumapo commented on Aug 26, 2023 (edited)
| Model | JSQuAD w/o title | JSQuAD w/ title |
| --- | --- | --- |
| rinna-japanese-gpt-1b/EM | 26.18 | 30.19 |
| rinna-japanese-gpt-1b/F1 | 44.68 | 47.12 |
| rinna-gpt-neox-3.6b-instruction-sft-v2/EM | 44.91 | 47.55 |
| rinna-gpt-neox-3.6b-instruction-sft-v2/F1 | 59.38 | 61.63 |
| Llama2-2.7b/EM | 58.37 | 59.93 |
| Llama2-2.7b/F1 | 69.52 | 70.82 |
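To illustrate the change being measured above, here is a hypothetical sketch of building a JSQuAD prompt with and without the article title. The function and field names are assumptions for illustration, not the task's actual code.

```python
# Illustrative sketch: prepend the article title to the context when
# building a JSQuAD-style prompt. Names here are hypothetical.
def build_prompt(doc: dict, include_title: bool) -> str:
    context = doc["context"]
    if include_title and doc.get("title"):
        # The title gives the model extra grounding about the passage topic.
        context = f"[{doc['title']}]{context}"
    return f"文章:{context}\n質問:{doc['question']}\n回答:"

doc = {
    "title": "富士山",
    "context": "富士山は日本一高い山である。",
    "question": "日本一高い山は?",
}
print(build_prompt(doc, include_title=True))
```

The table suggests this small prompt change consistently improves both EM and F1 across the evaluated models.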
Could you check this PR?
@kumapo Thank you for the awesome investigation and PR! I left a few comments to organize the code.
Also, regarding the results, I'm wondering if we can skip those changes this time: not all models in the leaderboard have been re-evaluated yet, so the leaderboard scores wouldn't match if result.json
and harness.sh
are changed. I know you kindly evaluated as many models as you could, as I suggested before, and I feel sorry about that...
lm_eval/tasks/ja/jsquad_v11.py (Outdated)
    DYNAMIC_MAX_LENGTH = os.getenv("DYNAMIC_MAX_LENGTH", "true").lower()
    ...
    class JSQuADV11(Task):
Is it possible to put this inside jsquad.py
and inherit from the v1 class? Then we can see the diffs easily.
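The suggestion above could look roughly like the following: keep the v1.1 task next to the v1 class and subclass it, overriding only what changed. The class bodies here are a hypothetical sketch based on the snippets in this PR, not the actual task implementation.

```python
# Hypothetical sketch: the v1.1 task subclasses v1 and overrides only the
# prompt, so a diff of the two classes shows exactly what changed.
class JSQuAD:
    VERSION = 1.0

    def doc_to_text(self, doc):
        return f"文章:{doc['context']}\n質問:{doc['question']}\n回答:"

class JSQuADV11(JSQuAD):
    VERSION = 1.1

    def doc_to_text(self, doc):
        # Only difference from v1: the title is prepended to the context.
        return f"文章:[{doc['title']}]{doc['context']}\n質問:{doc['question']}\n回答:"
```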
lm_eval/tasks/ja/jsquad_v11.py (Outdated)
    return self._squad_metric(predictions=predictions, references=references)[key]
    ...
    class JSQuADV11WithFintanPrompt(JSQuADV11):
Is it possible to inherit from both JSQuADV11
and JSQuADWithFintanPrompt
to reduce duplication?
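The multiple-inheritance idea above could be sketched like this, with the v1.1 variant reusing both the v1.1 behavior and the existing Fintan prompt formatting. Class bodies and the prompt string are illustrative assumptions, not the repository's actual code.

```python
# Hypothetical sketch: deduplicate via multiple inheritance. The combined
# class picks up VERSION from JSQuADV11 and the prompt from
# JSQuADWithFintanPrompt via Python's MRO.
class JSQuAD:
    VERSION = 1.0

    def doc_to_target(self, doc):
        return doc["answer"]

class JSQuADV11(JSQuAD):
    VERSION = 1.1

class JSQuADWithFintanPrompt(JSQuAD):
    PROMPT = "質問に答えなさい。"

    def doc_to_text(self, doc):
        return f"{self.PROMPT}\n文脈:{doc['context']}\n質問:{doc['question']}\n回答:"

# MRO: JSQuADV11WithFintanPrompt -> JSQuADV11 -> JSQuADWithFintanPrompt -> JSQuAD
class JSQuADV11WithFintanPrompt(JSQuADV11, JSQuADWithFintanPrompt):
    pass
```

With this layout, each prompt variant needs no duplicated body at all; the subclass only combines the two parents.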
lm_eval/tasks/ja/jsquad_v11.py
Outdated
) | ||
|
||
|
||
class JSQuADV11WithJAAlpacaPrompt(JSQuADV11): |
Same as JSQuADV11WithFintanPrompt.
lm_eval/tasks/ja/jsquad_v11.py (Outdated)
    return f"### 指示:\n{self.INSTRUCTION}\n\n### 入力:\n{input_text}\n\n### 応答:\n"
    ...
    class JSQuADV11WithRinnaInstructionSFT(JSQuADV11):
Same as JSQuADV11WithFintanPrompt.
lm_eval/tasks/ja/jsquad_v11.py (Outdated)
    return f"ユーザー: {input_text}{self.SEP}システム: "
    ...
    class JSQuADV11WithRinnaBilingualInstructionSFT(JSQuADV11WithRinnaInstructionSFT):
Same as JSQuADV11WithFintanPrompt.
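For context, the two prompt templates visible in the review snippets above can be sketched as standalone functions. The `INSTRUCTION` and `SEP` values below are illustrative assumptions; only the template strings come from the snippets.

```python
# Sketch of the two prompt formats quoted in this review: the JA-Alpaca
# instruction format and the Rinna instruction-SFT chat format.
INSTRUCTION = "質問に答えなさい。"  # assumed placeholder instruction
SEP = "<NL>"  # assumed separator token

def ja_alpaca_prompt(input_text: str) -> str:
    return f"### 指示:\n{INSTRUCTION}\n\n### 入力:\n{input_text}\n\n### 応答:\n"

def rinna_sft_prompt(input_text: str) -> str:
    return f"ユーザー: {input_text}{SEP}システム: "
```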
Thanks for your comments.
Would you please check the updated PR? Please let me know if I missed something.
@kumapo Awesome! Thank you!
Sorry, I forgot that.
LGTM!
@kumapo Sorry, but could you try to resolve this error in the action?
@kumapo Thank you for trying to resolve it. But I see the errors come from the codebase as a whole rather than from your PR, so let me merge this PR now. Thank you so much for your effort again 🙏
Actually, I was wondering if it would have been better to fix all the problems.
…tability-AI#84)
* re-evaluate models with jsquad prompt with title
* update jsquad to include titles into the prompt
* re-evaluate models with jsquad prompt with title
* inherit JSQuAD v1.2 tasks from v1.1 for readability
* re-evaluate models with jsquad prompt with title
* wont need jsquad_v11
* revert result.json and harness.sh in models
* fix format
* Initial working refactor: This just pulls the argparse stuff into a separate function.
* Do some rearrangement for the refactor: Eval args are necessary, other params are optional. The print output is only needed when called from the CLI, plus it assumes that various keys are present (even if None), which is not the case when calling from Python.
* Move main script to scripts dir, add symlink: Other scripts can't import the main script since it's in the top level. This moves it into the scripts dir and adds a symlink so it's still usable at the old location.
* Work on adding example Python harness script
* Add notify script
* Fix arg
* task cleanup
* Add versions to tasks
* Fix typo
* Fix versions
* Read webhook url from env var
* evaluate line-corporation large models (#81)
* compare results between Jsquad prompt with title and without title (#84)
  * re-evaluate models with jsquad prompt with title
  * update jsquad to include titles into the prompt
  * re-evaluate models with jsquad prompt with title
  * inherit JSQuAD v1.2 tasks from v1.1 for readability
  * re-evaluate models with jsquad prompt with title
  * wont need jsquad_v11
  * revert result.json and harness.sh in models
  * fix format
* Verbose output for more tasks (#92)
  * Add output to jaqket v2
  * Add details to jsquad
  * Add verbose output to xlsum
* Add gptq support (#87)
  * add EleutherAI PR519 autoGPTQ
  * add comma
  * change type
  * change type2
  * change path
  * Undo README modifications
* Add Balanced Accuracy (#95)
  * First implementation of balanced accuracy
  * Add comment
  * Make JNLI a balanced acc task
  * Add mcc and balanced f1 scores
* Remove 3.8 version spec from pre-commit config: The version here makes it so that pre-commit can only run in an environment with python3.8 in the path, but there's no compelling reason for that. Removing the spec just uses system python.
* Fix Linter Related Issues (#96)
  * Change formatting to make the linter happy: This is mostly newlines at end of files, removing blank lines at end of files, changing single to double quotes, black multi-line formatting rules, and other whitespace edits.
  * Remove codespell: Has a lot of false positives.
  * boolean style issue
  * bare except: These seem harmless enough, so just telling the linter to ignore them.
  * More linter suggestions
* Simplify neologdn version: This was pointing to a commit, but the relevant PR has been merged and released for a while now, so a normal version spec can be used.
* Update xwinograd dataset: The old dataset was deleted.
* won't need llama2/llama2-2.7b due to duplication (#99)
* add gekko (#98)
* add llama2 format (#100)
  * add llama2 format
  * add 0.6 in prompt_templates.md
  * make pre-commit pass
  * remove debugging line
* fix bug on `mgsm` for prompt version `0.3` (#101)
* Add JCoLA task (#93)
  * WIP: need JCoLA
  * Update harness.jcola.sh
  * update prompt
  * update prompt
  * update prompt
  * update prompt
  * Revert "update prompt": This reverts commit cd9a914.
  * WIP: evaluate on JCoLA
  * Add new metrics to cola: This modifies cola, since jcola just inherits this part. It's not a problem to modify the parent task because it just adds some output.
  * Linter edits
  * evaluate on JCoLA
  * need JCoLAWithLlama2
  * JCoLA's prompt version should be 0.0 (https://github.com/Stability-AI/lm-evaluation-harness/blob/jp-stable/docs/prompt_templates.md)
  * documentation jptasks.md and prompt_templates.md
  * won't need harness and result for JCoLA
  * fix linter related issue
  * Delete harness.jcola.sh
* Linter fixes
* Remove example: script is used instead of function
* Cleanup
* Cleanup / linter fixes: There were some things related to the old shell script usage that weren't working; this should fix it.
* Add README section describing cluster usage

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Co-authored-by: kumapo <kumapo@users.noreply.github.com>
Co-authored-by: webbigdata-jp <87654083+webbigdata-jp@users.noreply.github.com>
Co-authored-by: webbigdata-jp <dahara1@webbigdata.jp>
Co-authored-by: mkshing <33302880+mkshing@users.noreply.github.com>