Arena hard elad2 #1026

eladven · 2024-07-17T06:01:17Z

LLMaJ:

Added examples for LLM as a judge using Arena Hard ("Evaluate your model on the Arena Hard benchmark using a custom LLMaJ" and "Evaluate a judge model's performance judging the Arena Hard benchmark")
Added tasks.response_assessment.pairwise_comparative_rating.single_turn to the possible tasks llama can run
Datasets added:
LLMaJ now passes the original dataset's privacy_policy to the newly generated dataset
Added WeightedWinRateCorrelation metric for tasks.response_assessment.pairwise_comparative_rating.single_turn

Datasets additions:

Added the ArenaHard dataset for evaluation and meta-evaluation of LLMaJ
Added Reward Bench (a dataset for meta-evaluation)

Templates:

Arena Hard template + processors
ArenaHard-like Prometheus template
Added class PairwiseComparativeRatingTemplate
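
For readers unfamiliar with Arena Hard-style judging, the sketch below shows the kind of post-processing the template's processors perform: pulling a bracketed verdict such as `[[A>>B]]` out of the judge's free text and turning it into a signed number. It is a standalone illustration, not the code added in this PR. The function name, the regex, and the score values (3 for a strong preference, 1 for a slight one, 0 for a tie or an unparseable answer) are assumptions made for the example.

```python
import re

# Assumed mapping from verdict labels to signed scores (illustrative values):
# positive favors response A, negative favors response B.
VERDICT_SCORES = {"A>>B": 3, "A>B": 1, "A=B": 0, "B>A": -1, "B>>A": -3}

# A verdict is expected inside double square brackets, e.g. "[[A>B]]".
VERDICT_PATTERN = re.compile(r"\[\[([AB<>=]+)\]\]")


def extract_numerical_judgment(judge_output: str) -> int:
    """Return the score of the first [[...]] verdict, or 0 if none is recognized."""
    match = VERDICT_PATTERN.search(judge_output)
    if match is None:
        return 0
    # Unknown labels (for example "A<B") fall back to a neutral score.
    return VERDICT_SCORES.get(match.group(1), 0)
```

Falling back to 0 treats a malformed judge answer as a tie; a stricter pipeline might instead flag such answers for inspection.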

Small Changes:

Added DuplicateSplit Operator
Instance Metadata now contains "data_classification_policy"

Consistency Breaking dataset changes

Added shuffle to the universal_ner card as the dataset comes sorted by classes from HF
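
To illustrate why a class-sorted dataset needs a shuffle, and why adding one changes results for anyone who evaluates on the first N instances, here is a minimal sketch on toy data. The labels, sizes, and helper names are made up for the example, and it does not use the card's actual operators.

```python
import random
from collections import Counter

# Toy stand-in for a dataset that arrives sorted by class label.
sorted_rows = [{"label": label} for label in ("PER", "LOC", "ORG") for _ in range(50)]


def first_n_label_counts(rows, n):
    """Count the labels among the first n rows."""
    return Counter(row["label"] for row in rows[:n])


def seeded_shuffle(rows, seed=42):
    """Return a shuffled copy; a fixed seed keeps the order reproducible."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Without shuffling, the first 30 rows all belong to a single class.
head_sorted = first_n_label_counts(sorted_rows, 30)
# After shuffling, the first 30 rows mix the classes.
head_shuffled = first_n_label_counts(seeded_shuffle(sorted_rows), 30)
```

Because the shuffle reorders instances, any previously reported scores computed on a prefix of the data will no longer be comparable, which is why this is listed as a consistency-breaking change.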

Small fixes:

Fixed the clapnq card by adding a test for unanswerable responses
Fixed the HF Space loader: _map_wildcard_path_to_full_paths did not use the revision
SelectFields operator now always keeps the fields ["data_classification_policy", "recipe_metadata"]
Join Operator won't delete ["data_classification_policy", "recipe_metadata"]
Bug fix in shuffle operation in PairwiseChoiceTemplate
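
The behavior described for SelectFields and Join (never dropping the two bookkeeping fields) can be pictured with the dictionary-based sketch below. It is a simplified illustration of the idea only; `select_fields` and `PROTECTED_FIELDS` are hypothetical names and this is not the operator's real implementation.

```python
# Fields that should survive any field selection (names taken from the PR text).
PROTECTED_FIELDS = ("data_classification_policy", "recipe_metadata")


def select_fields(instance, fields):
    """Keep the requested fields plus the protected ones, if present."""
    keep = set(fields) | set(PROTECTED_FIELDS)
    return {key: value for key, value in instance.items() if key in keep}


instance = {
    "question": "Who wrote Hamlet?",
    "answer": "Shakespeare",
    "source": "toy",
    "data_classification_policy": ["public"],
    "recipe_metadata": {"card": "toy_card"},
}

# Only "question" is requested, but the protected fields are retained too.
selected = select_fields(instance, ["question"])
```

Keeping these fields unconditionally means downstream steps that check the data classification policy do not break when a card narrows its fields.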

Remaining TODOs (open issues):

Support using both positions in llmaj
Add a metric for the judge's position bias in LLMaJ
Possibly also add a metric to the meta-evaluation in LLMaJ

prepare/cards/universal_ner.py

prepare/templates/rag/response_generation.py

Co-authored-by: Yotam Perlitz <perlitz@gmail.com>

examples/evaluate_a_judge_model_capabilities_on_arena_hard.py

tests/library/test_recipe.py

…d_elad2

* bug fixes in PairwiseChoiceTemplate * add arena hard regex parser operator * update mt bench card common * update mt bench card common * add reward bench * update metric to pairwise comarison task * arena hard tasks and cards * update mt bench template * add duplicate stream operator * add PairwiseComparativeRatingTemplate * add card * add card * add template * add winrate metrics * add comparative rating task * add ExtractArenaHardNumericalJudgment * add arena hard cards * add arena hard template * add weighted winrate metrics * delete file * update PairwiseComparativeRatingTemplate * add metric * add metric * update * update * update * fix template bug * update * llama 3 update * update * update * update jsons * update * update * update * update * update * update * update * update * update * update * update * update * update * update * fix * fix * fix * update * update * update * bluebench related changes * fix type issue Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com> * update * update * update * prometheus1 * update * fix * fix * merge with arena_branch Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com> * rebuild catalog Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com> * add debugging to clapnq * Reproduce all artifacts * Add missing artifacts to catalog * Add secrets baseline Signed-off-by: Elad Venezian <eladv@il.ibm.com> * Fix bugs with catalog creation * Remove areana hard examples from tests, since they don't pass * Add missing metadata to test mock * Add data_classification_policy and recipe_metadata to the steams tests * Fix test failures * Update multi_turn_gpt4_judgement.py * Update multi_turn_with_reference_gpt4_judgement.py * Update docs/docs/examples.rst Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com> * revert catalog consistecy and preperation yml files * revert catalog consistecy and preperation yml files * revert catalog consistecy and preperation yml files * revert catalog consistecy and preperation yml files 
* Update docs/docs/examples.rst Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com> * bug fix in LoadFromHFSpace * revert * revert * update examples * add coment to expain change * update to new params usage * pr fixes * pr fixes * update * update * update * update * update * update * Update prepare/templates/rag/response_generation.py Co-authored-by: Yotam Perlitz <perlitz@gmail.com> * Update prepare/templates/rag/response_generation.py Co-authored-by: Yotam Perlitz <perlitz@gmail.com> * update * cr fixes * llmaj format fix * llmaj format fix --------- Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com> Signed-off-by: Elad Venezian <eladv@il.ibm.com> Co-authored-by: ofirarviv <ofir.arviv@ibm.com> Co-authored-by: Yotam Perlitz <yotam.perlitz@ibm.com> Co-authored-by: michal <shmueli@il.ibm.com> Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com> Co-authored-by: Yotam Perlitz <perlitz@gmail.com>


@yoavkatz

* Fix bug in data classes and add support for field overriding in fields containing types or functions (#1027) Fix data classes not support field overriding in fields containing types or functions Signed-off-by: elronbandel <elron.bandel@ibm.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Added seed to LLM as judges for consistent results (#1029) Signed-off-by: Yoav Katz <katz@il.ibm.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * replace type and __type__ in type error (#1035) Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * add rag_end_to_end metrics Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * add rag_end_to_end metrics Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Add task rag_end_to_end Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * add card for clapnq end_to_end Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * add sandbox_benjams Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * add subset Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * add a reduction of clap_nq Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * add a reduction of clap_nq Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * remove constants Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * rename sandbox_benjams to sandbox Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * remove sandbox Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Add string to context id in rag (#1036) * allow strings (hash) as context id Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com> * save to catalog Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com> --------- Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Fixed issues with fresh install (#1037) Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * add validation to tldr, remove shuffle from billsum (#1038) * add validation to tldr, 
remove shuffle from billsum (shuffled by the SplitRandomMix) Signed-off-by: ALON HALFON <ALONHAL@il.ibm.com> * fix formatting Signed-off-by: ALON HALFON <ALONHAL@il.ibm.com> --------- Signed-off-by: ALON HALFON <ALONHAL@il.ibm.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Refactor Rouge and Meteor to InstanceMetric for faster score computation (#1011) * Remove confidence interval calculation for meteor metric by default added a new metric with interval calculations Signed-off-by: Yoav Katz <katz@il.ibm.com> * Added error mesage when metrics not a list Signed-off-by: Yoav Katz <katz@il.ibm.com> * Added error mesage when post processors are not a list Signed-off-by: Yoav Katz <katz@il.ibm.com> * Changed Rouge to be HuggingfaceBulkMetric to avoid recalculation of metric on every resample Signed-off-by: Yoav Katz <katz@il.ibm.com> * added meteor as an HuggingFaceInstanceMetric Signed-off-by: dafnapension <dafnashein@yahoo.com> * removed meteor_with_confidence_intervals.json Signed-off-by: dafnapension <dafnashein@yahoo.com> * fixed test_metric_utils.py by better concentrating on rougeL only Signed-off-by: dafnapension <dafnashein@yahoo.com> * comment about rounded floats in tested scores Signed-off-by: dafnapension <dafnashein@yahoo.com> * while generating metric meteor, compmare against HF implementation Signed-off-by: dafnapension <dafnashein@yahoo.com> * added a test comparing new Rouge with HF Rouge, nd per arielge's good advice, changed bootstrap method to percentile in case of 100 or more instances Signed-off-by: dafnapension <dafnashein@yahoo.com> * implemented Meteor and Rouge with inhouse code Signed-off-by: dafnapension <dafnashein@yahoo.com> * download quietly, and import in prepare Signed-off-by: dafnapension <dafnashein@yahoo.com> * trying to avoid .secrets.baseline Signed-off-by: dafnapension <dafnashein@yahoo.com> * secret.baseline how do I get rid of it? 
Signed-off-by: dafnapension <dafnashein@yahoo.com> --------- Signed-off-by: Yoav Katz <katz@il.ibm.com> Signed-off-by: dafnapension <dafnashein@yahoo.com> Co-authored-by: dafnapension <dafnashein@yahoo.com> Co-authored-by: Elron Bandel <elronbandel@gmail.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Add CloseTextSampler and FixedIndicesSampler (#1034) * Add CloseTextSampler That returns demos that are textually close to the current instance. Signed-off-by: Yoav Katz <katz@il.ibm.com> * Make sampler call pass current instance Added end 2 end test of sampler that depends on output Signed-off-by: Yoav Katz <katz@il.ibm.com> * Added FixedIndicesSampler(Sampler): Selects a fix set of samples based on a list of indices from the demo pool Signed-off-by: Yoav Katz <katz@il.ibm.com> * Made splitter currently use random_generators Signed-off-by: Yoav Katz <katz@il.ibm.com> * Changed all Sample randomization To use common code to create randomizer per instance Signed-off-by: Yoav Katz <katz@il.ibm.com> * Updated demos in test After a non backward compatible change Signed-off-by: Yoav Katz <katz@il.ibm.com> * Updated demos in test After a non backward compatible change Signed-off-by: Yoav Katz <katz@il.ibm.com> --------- Signed-off-by: Yoav Katz <katz@il.ibm.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * changed input and output of templates to "input_fields" and "reference_ fields" - Non backward compatible (#1030) * changed input and output of templates to "input_fields" and "reference_ fields" . This is to continue the work done on tasks. 
Signed-off-by: Yoav Katz <katz@il.ibm.com> * Fixed type hint Signed-off-by: Yoav Katz <katz@il.ibm.com> * Documentation update Signed-off-by: Yoav Katz <katz@il.ibm.com> --------- Signed-off-by: Yoav Katz <katz@il.ibm.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * FinQA - filter problematic examples (#1039) filter problematic examples Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Arena hard elad2 (#1026) * bug fixes in PairwiseChoiceTemplate * add arena hard regex parser operator * update mt bench card common * update mt bench card common * add reward bench * update metric to pairwise comarison task * arena hard tasks and cards * update mt bench template * add duplicate stream operator * add PairwiseComparativeRatingTemplate * add card * add card * add template * add winrate metrics * add comparative rating task * add ExtractArenaHardNumericalJudgment * add arena hard cards * add arena hard template * add weighted winrate metrics * delete file * update PairwiseComparativeRatingTemplate * add metric * add metric * update * update * update * fix template bug * update * llama 3 update * update * update * update jsons * update * update * update * update * update * update * update * update * update * update * update * update * update * update * fix * fix * fix * update * update * update * bluebench related changes * fix type issue Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com> * update * update * update * prometheus1 * update * fix * fix * merge with arena_branch Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com> * rebuild catalog Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com> * add debugging to clapnq * Reproduce all artifacts * Add missing artifacts to catalog * Add secrets baseline Signed-off-by: Elad Venezian <eladv@il.ibm.com> * Fix bugs with catalog creation * Remove areana hard examples from tests, since they don't pass * Add missing metadata to test mock * Add data_classification_policy and recipe_metadata to the steams 
tests * Fix test failures * Update multi_turn_gpt4_judgement.py * Update multi_turn_with_reference_gpt4_judgement.py * Update docs/docs/examples.rst Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com> * revert catalog consistecy and preperation yml files * revert catalog consistecy and preperation yml files * revert catalog consistecy and preperation yml files * revert catalog consistecy and preperation yml files * Update docs/docs/examples.rst Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com> * bug fix in LoadFromHFSpace * revert * revert * update examples * add coment to expain change * update to new params usage * pr fixes * pr fixes * update * update * update * update * update * update * Update prepare/templates/rag/response_generation.py Co-authored-by: Yotam Perlitz <perlitz@gmail.com> * Update prepare/templates/rag/response_generation.py Co-authored-by: Yotam Perlitz <perlitz@gmail.com> * update * cr fixes * llmaj format fix * llmaj format fix --------- Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com> Signed-off-by: Elad Venezian <eladv@il.ibm.com> Co-authored-by: ofirarviv <ofir.arviv@ibm.com> Co-authored-by: Yotam Perlitz <yotam.perlitz@ibm.com> Co-authored-by: michal <shmueli@il.ibm.com> Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com> Co-authored-by: Yotam Perlitz <perlitz@gmail.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * demo's target prefix is now taken from demo instance (#1031) * demo's target prefix is now taken from demo instance Signed-off-by: dafnapension <dafnashein@yahoo.com> * do not pop fields out of demo instances. Traditionally done for main instance, but not allowed for demo instance that should serve also other main instances in the stream Signed-off-by: dafnapension <dafnashein@yahoo.com> * simplified test-case per @yoavkatz idea. 
Still eagering samples different demos than non-eagering Signed-off-by: dafnapension <dafnashein@yahoo.com> --------- Signed-off-by: dafnapension <dafnashein@yahoo.com> Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * remove the reduced clap_nq Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * define an empty template for rag end_to_end Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Implement metrics ensemble (#1047) Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * add load_json_predictions as processor in the template Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * add the processors/load_json_predictions.json generated to the catalog Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Add flores101 (#1053) Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Added example for selection of demos (#1052) * Added example for selection of demos Signed-off-by: Yoav Katz <katz@il.ibm.com> * Added example doc Signed-off-by: Yoav Katz <katz@il.ibm.com> * Update docs/docs/examples.rst * Update docs/docs/examples.rst --------- Signed-off-by: Yoav Katz <katz@il.ibm.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * fix - building test is not working. The reason is that opendatasets points to kaggle without version, and currently kaggle-1.6.15 fails. 
We fix the version of kaggle to be 1.6.14 as a fix Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * add overwrite Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Update introduction.rst - - copy edits (grammar, consistency, clarity) (#1063) Signed-off-by: welisheva22 <welisheva22@gmail.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Fix typo in japanese_llama system prompt (issue #964) (#1056) Signed-off-by: Jonathan Bnayahu <bnayahu@il.ibm.com> Co-authored-by: Elron Bandel <elronbandel@gmail.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Allow assigning None in overwrites when fetching artifacts with modifications (#1062) allow =None in overwrites for fetch Signed-off-by: dafnapension <dafnashein@yahoo.com> Co-authored-by: Elron Bandel <elronbandel@gmail.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Make sure preparation times printed fully and nicely (#1046) Signed-off-by: elronbandel <elron.bandel@ibm.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * numeric nlg - template changes (#1041) Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * add judge input to the metric (#1064) * add judge input to the metric * add judge input to the metric * fix * fix test Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Unitxt capitalization adding_dataset.rst (#1057) making Unitxt capitalization consistent in text Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * fixed the score_ci inconsistency issue (#1065) * suggested fix for score_ci inconsistency issue Signed-off-by: dafnapension <dafnashein@yahoo.com> * unify with the update, and thus simplified the check Signed-off-by: dafnapension <dafnashein@yahoo.com> --------- Signed-off-by: dafnapension <dafnashein@yahoo.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Use of conventional python types in input definition of tasks and metrics (#1045) * Fix data 
classes not support field overriding in fields containing types or functions Signed-off-by: elronbandel <elron.bandel@ibm.com> * Make tasks types python types Signed-off-by: elronbandel <elron.bandel@ibm.com> * Fix errors Signed-off-by: elronbandel <elron.bandel@ibm.com> * Some fixes Signed-off-by: elronbandel <elron.bandel@ibm.com> * More fixes Signed-off-by: elronbandel <elron.bandel@ibm.com> * Update catalog Signed-off-by: elronbandel <elron.bandel@ibm.com> * Fix cards Signed-off-by: elronbandel <elron.bandel@ibm.com> * Revert change Signed-off-by: elronbandel <elron.bandel@ibm.com> * Fix typing in docs with new convention Signed-off-by: elronbandel <elron.bandel@ibm.com> * refactor of new asset to new convention Signed-off-by: elronbandel <elron.bandel@ibm.com> * Update secrets baseline Signed-off-by: elronbandel <elron.bandel@ibm.com> --------- Signed-off-by: elronbandel <elron.bandel@ibm.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Added prediction type to llm as jusdge to avoid warning (#1072) * Added prediction type to llm as jusdge to avoid warning Clarified the sandalone llm as judge example Signed-off-by: Yoav Katz <katz@il.ibm.com> * Removed accidentally added file Signed-off-by: Yoav Katz <katz@il.ibm.com> --------- Signed-off-by: Yoav Katz <katz@il.ibm.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Fixed clapnq to check with reasonable error values Also updated rag tasks to use new typing (instead of string types) Signed-off-by: Yoav Katz <katz@il.ibm.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * fix the type hint Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * update catalog Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Add metric "metrics.rag.retrieval_at_k" to catalog (#1074) * add metric "metrics.rag.retrieval_at_k" to catalog this is a wrapper around the retrieval_at_k for the ragas scheme * add corresponding json file for the new metric --------- Co-authored-by: Elron Bandel 
<elronbandel@gmail.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * merge - resolve conflict Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> --------- Signed-off-by: elronbandel <elron.bandel@ibm.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> Signed-off-by: Yoav Katz <katz@il.ibm.com> Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com> Signed-off-by: ALON HALFON <ALONHAL@il.ibm.com> Signed-off-by: dafnapension <dafnashein@yahoo.com> Signed-off-by: Elad Venezian <eladv@il.ibm.com> Signed-off-by: welisheva22 <welisheva22@gmail.com> Signed-off-by: Jonathan Bnayahu <bnayahu@il.ibm.com> Co-authored-by: Elron Bandel <elronbandel@gmail.com> Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com> Co-authored-by: Yotam Perlitz <perlitz@gmail.com> Co-authored-by: Benjamin Sznajder <benjams@il.ibm.com> Co-authored-by: Alon H <alonh@users.noreply.github.com> Co-authored-by: dafnapension <dafnashein@yahoo.com> Co-authored-by: ShirApp <58909189+ShirApp@users.noreply.github.com> Co-authored-by: Elad <eladv@il.ibm.com> Co-authored-by: ofirarviv <ofir.arviv@ibm.com> Co-authored-by: Yotam Perlitz <yotam.perlitz@ibm.com> Co-authored-by: michal <shmueli@il.ibm.com> Co-authored-by: dafnapension <46454972+dafnapension@users.noreply.github.com> Co-authored-by: welisheva22 <welisheva22@gmail.com> Co-authored-by: Jonathan Bnayahu <bnayahu@il.ibm.com> Co-authored-by: hanansinger <95229126+hanansinger@users.noreply.github.com> Co-authored-by: Yoav Katz <katz@il.ibm.com> Co-authored-by: matanor <55045955+matanor@users.noreply.github.com>

Still eagering samples different demos than non-eagering Signed-off-by: dafnapension <dafnashein@yahoo.com> --------- Signed-off-by: dafnapension <dafnashein@yahoo.com> Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * remove the reduced clap_nq Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * define an empty template for rag end_to_end Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Implement metrics ensemble (#1047) Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * add load_json_predictions as processor in the template Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * add the processors/load_json_predictions.json generated to the catalog Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Add flores101 (#1053) Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Added example for selection of demos (#1052) * Added example for selection of demos Signed-off-by: Yoav Katz <katz@il.ibm.com> * Added example doc Signed-off-by: Yoav Katz <katz@il.ibm.com> * Update docs/docs/examples.rst * Update docs/docs/examples.rst --------- Signed-off-by: Yoav Katz <katz@il.ibm.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * fix - building test is not working. The reason is that opendatasets points to kaggle without version, and currently kaggle-1.6.15 fails. 
We fix the version of kaggle to be 1.6.14 as a fix Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * add overwrite Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Update introduction.rst - - copy edits (grammar, consistency, clarity) (#1063) Signed-off-by: welisheva22 <welisheva22@gmail.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Fix typo in japanese_llama system prompt (issue #964) (#1056) Signed-off-by: Jonathan Bnayahu <bnayahu@il.ibm.com> Co-authored-by: Elron Bandel <elronbandel@gmail.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Allow assigning None in overwrites when fetching artifacts with modifications (#1062) allow =None in overwrites for fetch Signed-off-by: dafnapension <dafnashein@yahoo.com> Co-authored-by: Elron Bandel <elronbandel@gmail.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Make sure preparation times printed fully and nicely (#1046) Signed-off-by: elronbandel <elron.bandel@ibm.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * numeric nlg - template changes (#1041) Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * add judge input to the metric (#1064) * add judge input to the metric * add judge input to the metric * fix * fix test Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Unitxt capitalization adding_dataset.rst (#1057) making Unitxt capitalization consistent in text Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * fixed the score_ci inconsistency issue (#1065) * suggested fix for score_ci inconsistency issue Signed-off-by: dafnapension <dafnashein@yahoo.com> * unify with the update, and thus simplified the check Signed-off-by: dafnapension <dafnashein@yahoo.com> --------- Signed-off-by: dafnapension <dafnashein@yahoo.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Use of conventional python types in input definition of tasks and metrics (#1045) * Fix data 
classes not support field overriding in fields containing types or functions Signed-off-by: elronbandel <elron.bandel@ibm.com> * Make tasks types python types Signed-off-by: elronbandel <elron.bandel@ibm.com> * Fix errors Signed-off-by: elronbandel <elron.bandel@ibm.com> * Some fixes Signed-off-by: elronbandel <elron.bandel@ibm.com> * More fixes Signed-off-by: elronbandel <elron.bandel@ibm.com> * Update catalog Signed-off-by: elronbandel <elron.bandel@ibm.com> * Fix cards Signed-off-by: elronbandel <elron.bandel@ibm.com> * Revert change Signed-off-by: elronbandel <elron.bandel@ibm.com> * Fix typing in docs with new convention Signed-off-by: elronbandel <elron.bandel@ibm.com> * refactor of new asset to new convention Signed-off-by: elronbandel <elron.bandel@ibm.com> * Update secrets baseline Signed-off-by: elronbandel <elron.bandel@ibm.com> --------- Signed-off-by: elronbandel <elron.bandel@ibm.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Added prediction type to llm as jusdge to avoid warning (#1072) * Added prediction type to llm as jusdge to avoid warning Clarified the sandalone llm as judge example Signed-off-by: Yoav Katz <katz@il.ibm.com> * Removed accidentally added file Signed-off-by: Yoav Katz <katz@il.ibm.com> --------- Signed-off-by: Yoav Katz <katz@il.ibm.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Fixed clapnq to check with reasonable error values Also updated rag tasks to use new typing (instead of string types) Signed-off-by: Yoav Katz <katz@il.ibm.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * fix the type hint Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * update catalog Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * Add metric "metrics.rag.retrieval_at_k" to catalog (#1074) * add metric "metrics.rag.retrieval_at_k" to catalog this is a wrapper around the retrieval_at_k for the ragas scheme * add corresponding json file for the new metric --------- Co-authored-by: Elron Bandel 
<elronbandel@gmail.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> * merge - resolve conflict Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> --------- Signed-off-by: elronbandel <elron.bandel@ibm.com> Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com> Signed-off-by: Yoav Katz <katz@il.ibm.com> Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com> Signed-off-by: ALON HALFON <ALONHAL@il.ibm.com> Signed-off-by: dafnapension <dafnashein@yahoo.com> Signed-off-by: Elad Venezian <eladv@il.ibm.com> Signed-off-by: welisheva22 <welisheva22@gmail.com> Signed-off-by: Jonathan Bnayahu <bnayahu@il.ibm.com> Co-authored-by: Elron Bandel <elronbandel@gmail.com> Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com> Co-authored-by: Yotam Perlitz <perlitz@gmail.com> Co-authored-by: Benjamin Sznajder <benjams@il.ibm.com> Co-authored-by: Alon H <alonh@users.noreply.github.com> Co-authored-by: dafnapension <dafnashein@yahoo.com> Co-authored-by: ShirApp <58909189+ShirApp@users.noreply.github.com> Co-authored-by: Elad <eladv@il.ibm.com> Co-authored-by: ofirarviv <ofir.arviv@ibm.com> Co-authored-by: Yotam Perlitz <yotam.perlitz@ibm.com> Co-authored-by: michal <shmueli@il.ibm.com> Co-authored-by: dafnapension <46454972+dafnapension@users.noreply.github.com> Co-authored-by: welisheva22 <welisheva22@gmail.com> Co-authored-by: Jonathan Bnayahu <bnayahu@il.ibm.com> Co-authored-by: hanansinger <95229126+hanansinger@users.noreply.github.com> Co-authored-by: Yoav Katz <katz@il.ibm.com> Co-authored-by: matanor <55045955+matanor@users.noreply.github.com>

OfirArviv added 30 commits July 17, 2024 08:52

bug fixes in PairwiseChoiceTemplate

fd17810

add arena hard regex parser operator

0e2d7ce

update mt bench card common

95ec7fe

update mt bench card common

44f398a

add reward bench

a5ce257

update metric to pairwise comarison task

c019b6e

arena hard tasks and cards

d9c3960

update mt bench template

c5bbade

add duplicate stream operator

1f0e2da

add PairwiseComparativeRatingTemplate

5c868bc

add card

416d9e9

add card

3c89849

add template

186faa5

add winrate metrics

b5bf6a2

add comparative rating task

a02e23c

add ExtractArenaHardNumericalJudgment

8893eee

add arena hard cards

ddcb726

add arena hard template

a361172

add weighted winrate metrics

e144985

delete file

f5c2ccf

update PairwiseComparativeRatingTemplate

d53de7c

add metric

f9123d8

add metric

9c97815

update

111c6a6

update

35bac84

update

a56e8f8

fix template bug

7a4ecb0

update

4b461af

llama 3 update

c3278a6

update

5ed6fb4

OfirArviv added 6 commits July 21, 2024 13:08

update

a42c811

update

e321825

update

e33b475

Merge branch 'main' into arena_hard_elad2

d59600c

update

80088bf

Merge branch 'main' into arena_hard_elad2

ad32171

perlitz reviewed Jul 21, 2024

prepare/cards/universal_ner.py (resolved)

prepare/templates/rag/response_generation.py (outdated, resolved)

prepare/templates/rag/response_generation.py (outdated, resolved)

OfirArviv and others added 5 commits July 21, 2024 14:09

Update prepare/templates/rag/response_generation.py

af5ad98

Co-authored-by: Yotam Perlitz <perlitz@gmail.com>

Update prepare/templates/rag/response_generation.py

3ec947f

Co-authored-by: Yotam Perlitz <perlitz@gmail.com>

update

a2c428e

Merge branch 'main' into arena_hard_elad2

ab50245

Merge branch 'main' into arena_hard_elad2

ec0c0fd

yoavkatz reviewed Jul 22, 2024

examples/evaluate_a_judge_model_capabilities_on_arena_hard.py (resolved)

yoavkatz reviewed Jul 22, 2024

tests/library/test_recipe.py (resolved)

OfirArviv added 5 commits July 22, 2024 15:07

cr fixes

37e5ad2

Merge remote-tracking branch 'origin/arena_hard_elad2' into arena_hard_elad2

4f70ed8

llmaj format fix

b92b2a0

llmaj format fix

e69c772

Merge branch 'main' into arena_hard_elad2

c426040

OfirArviv approved these changes Jul 22, 2024

yoavkatz enabled auto-merge (squash) July 22, 2024 14:33

OfirArviv added 2 commits July 22, 2024 21:48

merge

aa78fdd

Merge branch 'main' into arena_hard_elad2

c7a48bc

yoavkatz merged commit 98fc005 into main Jul 22, 2024
8 checks passed

yoavkatz deleted the arena_hard_elad2 branch July 22, 2024 19:24
