Benjams/add rag task card and metric (#1044)
* Fix bug in data classes and add support for field overriding in fields containing types or functions (#1027)

Fix data classes not supporting field overriding in fields containing types or functions

Signed-off-by: elronbandel <elron.bandel@ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Added seed to LLM as judges for consistent results (#1029)

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* replace type and __type__ in type error (#1035)

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add rag_end_to_end metrics

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add rag_end_to_end metrics

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Add task rag_end_to_end

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add card for clapnq end_to_end

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add sandbox_benjams

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add subset

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add a reduction of clap_nq

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add a reduction of clap_nq

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* remove constants

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* rename sandbox_benjams to sandbox

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* remove sandbox

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Add string to context id in rag (#1036)

* allow strings (hash) as context id

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>

* save to catalog

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>

---------

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Fixed issues with fresh install (#1037)

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add validation to tldr, remove shuffle from billsum (#1038)

* add validation to tldr, remove shuffle from billsum
(shuffled by the SplitRandomMix)

Signed-off-by: ALON HALFON <ALONHAL@il.ibm.com>

* fix formatting

Signed-off-by: ALON HALFON <ALONHAL@il.ibm.com>

---------

Signed-off-by: ALON HALFON <ALONHAL@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Refactor Rouge and Meteor to InstanceMetric for faster score computation (#1011)

* Remove confidence interval calculation for meteor metric by default

added a new metric with interval calculations

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Added error message when metrics are not a list

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Added error message

when post processors are not a list

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Changed Rouge to be HuggingfaceBulkMetric

to avoid recalculation of metric on every resample

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* added meteor as a HuggingFaceInstanceMetric

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* removed meteor_with_confidence_intervals.json

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* fixed test_metric_utils.py by concentrating on rougeL only

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* comment about rounded floats in tested scores

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* while generating metric meteor, compare against HF implementation

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* added a test comparing new Rouge with HF Rouge, and per arielge's good advice, changed bootstrap method to percentile in case of 100 or more instances

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* implemented Meteor and Rouge with inhouse code

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* download quietly, and import in prepare

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* trying to avoid .secrets.baseline

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* secret.baseline how do I get rid of it?

Signed-off-by: dafnapension <dafnashein@yahoo.com>

---------

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: dafnapension <dafnashein@yahoo.com>
Co-authored-by: dafnapension <dafnashein@yahoo.com>
Co-authored-by: Elron Bandel <elronbandel@gmail.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Add CloseTextSampler and FixedIndicesSampler (#1034)

* Add CloseTextSampler

That returns demos that are textually close to the current instance.

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Make sampler call pass current instance

Added end 2 end test of sampler that depends on output

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Added FixedIndicesSampler(Sampler):

Selects a fixed set of samples based on a list of indices from the demo pool

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Made splitter use random_generators

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Changed all Sample randomization

To use common code to create randomizer per instance

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Updated demos in test

After a non backward compatible change

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Updated demos in test

After a non backward compatible change

Signed-off-by: Yoav Katz <katz@il.ibm.com>

---------

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* changed input and output of templates to "input_fields" and "reference_fields" - Non backward compatible (#1030)

* changed input and output of templates

to "input_fields" and "reference_fields".

This is to continue the work done on tasks.

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Fixed type hint

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Documentation update

Signed-off-by: Yoav Katz <katz@il.ibm.com>

---------

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* FinQA - filter problematic examples (#1039)

filter problematic examples

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Arena hard elad2 (#1026)

* bug fixes in PairwiseChoiceTemplate

* add arena hard regex parser operator

* update mt bench card common

* update mt bench card common

* add reward bench

* update metric to pairwise comparison task

* arena hard tasks and cards

* update mt bench template

* add duplicate stream operator

* add PairwiseComparativeRatingTemplate

* add card

* add card

* add template

* add winrate metrics

* add comparative rating task

* add ExtractArenaHardNumericalJudgment

* add arena hard cards

* add arena hard template

* add weighted winrate metrics

* delete file

* update PairwiseComparativeRatingTemplate

* add metric

* add metric

* update

* update

* update

* fix template bug

* update

* llama 3 update

* update

* update

* update jsons

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* fix

* fix

* fix

* update

* update

* update

* bluebench related changes

* fix type issue

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>

* update

* update

* update

* prometheus1

* update

* fix

* fix

* merge with arena_branch

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>

* rebuild catalog

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>

* add debugging to clapnq

* Reproduce all artifacts

* Add missing artifacts to catalog

* Add secrets baseline

Signed-off-by: Elad Venezian <eladv@il.ibm.com>

* Fix bugs with catalog creation

* Remove arena hard examples from tests, since they don't pass

* Add missing metadata to test mock

* Add data_classification_policy and recipe_metadata to the streams tests

* Fix test failures

* Update multi_turn_gpt4_judgement.py

* Update multi_turn_with_reference_gpt4_judgement.py

* Update docs/docs/examples.rst

Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>

* revert catalog consistency and preparation yml files

* revert catalog consistency and preparation yml files

* revert catalog consistency and preparation yml files

* revert catalog consistency and preparation yml files

* Update docs/docs/examples.rst

Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>

* bug fix in LoadFromHFSpace

* revert

* revert

* update examples

* add comment to explain change

* update to new params usage

* pr fixes

* pr fixes

* update

* update

* update

* update

* update

* update

* Update prepare/templates/rag/response_generation.py

Co-authored-by: Yotam Perlitz <perlitz@gmail.com>

* Update prepare/templates/rag/response_generation.py

Co-authored-by: Yotam Perlitz <perlitz@gmail.com>

* update

* cr fixes

* llmaj format fix

* llmaj format fix

---------

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>
Signed-off-by: Elad Venezian <eladv@il.ibm.com>
Co-authored-by: ofirarviv <ofir.arviv@ibm.com>
Co-authored-by: Yotam Perlitz <yotam.perlitz@ibm.com>
Co-authored-by: michal <shmueli@il.ibm.com>
Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>
Co-authored-by: Yotam Perlitz <perlitz@gmail.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* demo's target prefix is now taken from demo instance (#1031)

* demo's target prefix is now taken from demo instance

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* do not pop fields out of demo instances.
Traditionally done for the main instance, but not allowed for a demo instance that should also serve other main instances in the stream

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* simplified test-case per @yoavkatz idea. Still, eager mode samples different demos than non-eager mode

Signed-off-by: dafnapension <dafnashein@yahoo.com>

---------

Signed-off-by: dafnapension <dafnashein@yahoo.com>
Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* remove the reduced clap_nq

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* define an empty template for rag end_to_end

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Implement metrics ensemble (#1047)

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add load_json_predictions as processor in the template

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add the processors/load_json_predictions.json generated to the catalog

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Add flores101 (#1053)

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Added example for selection of demos (#1052)

* Added example for selection of demos

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Added example doc

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Update docs/docs/examples.rst

* Update docs/docs/examples.rst

---------

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* fix - building test is not working: opendatasets depends on kaggle without a version pin, and kaggle 1.6.15 currently fails. We pin kaggle to 1.6.14 as a fix

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add overwrite

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Update introduction.rst - copy edits (grammar, consistency, clarity) (#1063)

Signed-off-by: welisheva22 <welisheva22@gmail.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Fix typo in japanese_llama system prompt (issue #964) (#1056)

Signed-off-by: Jonathan Bnayahu <bnayahu@il.ibm.com>
Co-authored-by: Elron Bandel <elronbandel@gmail.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Allow assigning None in overwrites when fetching artifacts with modifications (#1062)

allow =None in overwrites for fetch

Signed-off-by: dafnapension <dafnashein@yahoo.com>
Co-authored-by: Elron Bandel <elronbandel@gmail.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Make sure preparation times printed fully and nicely (#1046)

Signed-off-by: elronbandel <elron.bandel@ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* numeric nlg - template changes (#1041)

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add judge input to the metric (#1064)

* add judge input to the metric

* add judge input to the metric

* fix

* fix test

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Unitxt capitalization adding_dataset.rst (#1057)

making Unitxt capitalization consistent in text

Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* fixed the score_ci inconsistency issue (#1065)

* suggested fix for score_ci inconsistency issue

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* unified with the update, and thus simplified the check

Signed-off-by: dafnapension <dafnashein@yahoo.com>

---------

Signed-off-by: dafnapension <dafnashein@yahoo.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Use of conventional python types in input definition of tasks and metrics (#1045)

* Fix data classes not supporting field overriding in fields containing types or functions

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Make tasks types python types

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Fix errors

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Some fixes

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* More fixes

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Update catalog

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Fix cards

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Revert change

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Fix typing in docs with new convention

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* refactor of new asset to new convention

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Update secrets baseline

Signed-off-by: elronbandel <elron.bandel@ibm.com>

---------

Signed-off-by: elronbandel <elron.bandel@ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Added prediction type to llm as judge to avoid warning (#1072)

* Added prediction type to llm as judge to avoid warning

Clarified the standalone llm as judge example

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Removed accidentally added file

Signed-off-by: Yoav Katz <katz@il.ibm.com>

---------

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Fixed clapnq to check with reasonable error values

Also updated rag tasks to use new typing (instead of string types)

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* fix the type hint

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* update catalog

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Add metric "metrics.rag.retrieval_at_k" to catalog (#1074)

* add metric "metrics.rag.retrieval_at_k" to catalog
this is a wrapper around the retrieval_at_k for the ragas scheme

* add corresponding json file for the new metric

---------

Co-authored-by: Elron Bandel <elronbandel@gmail.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* merge - resolve conflict

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

---------

Signed-off-by: elronbandel <elron.bandel@ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>
Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>
Signed-off-by: ALON HALFON <ALONHAL@il.ibm.com>
Signed-off-by: dafnapension <dafnashein@yahoo.com>
Signed-off-by: Elad Venezian <eladv@il.ibm.com>
Signed-off-by: welisheva22 <welisheva22@gmail.com>
Signed-off-by: Jonathan Bnayahu <bnayahu@il.ibm.com>
Co-authored-by: Elron Bandel <elronbandel@gmail.com>
Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>
Co-authored-by: Yotam Perlitz <perlitz@gmail.com>
Co-authored-by: Benjamin Sznajder <benjams@il.ibm.com>
Co-authored-by: Alon H <alonh@users.noreply.github.com>
Co-authored-by: dafnapension <dafnashein@yahoo.com>
Co-authored-by: ShirApp <58909189+ShirApp@users.noreply.github.com>
Co-authored-by: Elad <eladv@il.ibm.com>
Co-authored-by: ofirarviv <ofir.arviv@ibm.com>
Co-authored-by: Yotam Perlitz <yotam.perlitz@ibm.com>
Co-authored-by: michal <shmueli@il.ibm.com>
Co-authored-by: dafnapension <46454972+dafnapension@users.noreply.github.com>
Co-authored-by: welisheva22 <welisheva22@gmail.com>
Co-authored-by: Jonathan Bnayahu <bnayahu@il.ibm.com>
Co-authored-by: hanansinger <95229126+hanansinger@users.noreply.github.com>
Co-authored-by: Yoav Katz <katz@il.ibm.com>
Co-authored-by: matanor <55045955+matanor@users.noreply.github.com>
18 people authored and csrajmohan committed Aug 29, 2024
1 parent a61c53e commit 7e1ee30
Showing 17 changed files with 586 additions and 0 deletions.
123 changes: 123 additions & 0 deletions prepare/cards/rag/end_to_end/clapnq.py
@@ -0,0 +1,123 @@
import json
from dataclasses import dataclass

from unitxt import add_to_catalog
from unitxt.blocks import TaskCard, TemplatesDict
from unitxt.loaders import LoadCSV
from unitxt.operators import Copy, ListFieldValues, Set
from unitxt.templates import InputOutputTemplate
from unitxt.test_utils.card import test_card


@dataclass(frozen=True)
class ClapNqBenchmark:
    # Raw data
    TRAIN_RAW_FILE_URL: str = "https://raw.githubusercontent.com/primeqa/clapnq/main/retrieval/train/question_train_answerable.tsv"
    TEST_RAW_FILE_URL: str = "https://raw.githubusercontent.com/primeqa/clapnq/main/retrieval/dev/question_dev_answerable.tsv"

    # Fields
    ID: str = "id"
    QUESTION: str = "question"
    DOC_ID_LIST: str = "doc-id-list"
    ANSWERS: str = "answers"


@dataclass(frozen=True)
class ClapNqDocuments:
    # Raw data
    RAW_FILE_URL: str = "https://media.githubusercontent.com/media/primeqa/clapnq/main/retrieval/passages.tsv"

    # Fields
    ID: str = "id"
    TEXT: str = "text"
    TITLE: str = "title"

    ARTIFACT_NAME: str = "cards.rag.documents.clap_nq.en"


card = TaskCard(
    loader=LoadCSV(
        sep="\t",
        files={
            "train": ClapNqBenchmark.TRAIN_RAW_FILE_URL,
            "test": ClapNqBenchmark.TEST_RAW_FILE_URL,
        },
    ),
    preprocess_steps=[
        Copy(
            field_to_field={
                ClapNqBenchmark.QUESTION: "question",
                ClapNqBenchmark.ID: "question_id",
            },
        ),
        Set(
            fields={
                "reference_contexts": [],
                "is_answerable_label": True,
                "metadata_field": "",
            }
        ),
        ListFieldValues(
            fields=[ClapNqBenchmark.DOC_ID_LIST],
            to_field="reference_context_ids",
        ),
        ListFieldValues(
            fields=[ClapNqBenchmark.ANSWERS],
            to_field="reference_answers",
        ),
    ],
    task="tasks.rag.end_to_end",
    # templates=["templates.empty"],
    templates=TemplatesDict({"default": "templates.rag.end_to_end.json_predictions"}),
)

wrong_answer = {
    "contexts": ["hi"],
    "is_answerable": True,
    "answer": "Don't know",
    "context_ids": ["id0"],
}
test_card(
    card,
    strict=True,
    full_mismatch_prediction_values=[json.dumps(wrong_answer)],
    debug=False,
    demos_taken_from="test",
    demos_pool_size=5,
)

add_to_catalog(card, "cards.rag.benchmark.clap_nq.en", overwrite=True)

# Documents
card = TaskCard(
    loader=LoadCSV(sep="\t", files={"train": ClapNqDocuments.RAW_FILE_URL}),
    preprocess_steps=[
        Copy(
            field_to_field={
                ClapNqDocuments.ID: "document_id",
                ClapNqDocuments.TITLE: "title",
            },
        ),
        ListFieldValues(
            fields=[ClapNqDocuments.TEXT],
            to_field="passages",
        ),
        Set(
            fields={
                "metadata_field": "",
            }
        ),
    ],
    task="tasks.rag.corpora",
    templates=TemplatesDict(
        {
            "empty": InputOutputTemplate(
                input_format="",
                output_format="",
            ),
        }
    ),
)

# Not testing card, because documents are not evaluated.
add_to_catalog(card, "cards.rag.documents.clap_nq.en", overwrite=True)
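The `wrong_answer` dictionary above illustrates the shape of an end-to-end prediction. As a minimal sketch (not part of the commit; the answer text and ids below are invented for illustration), predictions travel as JSON strings carrying the same four fields, which `processors.load_json_predictions` parses back into dicts before the metrics run:

```python
import json

# Hypothetical prediction payload, mirroring the fields of `wrong_answer` above.
prediction = {
    "answer": "CLAP NQ is an answerable subset of Natural Questions.",
    "contexts": ["CLAP NQ is built from the Natural Questions dataset."],
    "context_ids": ["doc-123"],
    "is_answerable": True,
}

# Serialize as the model would emit it, then parse as the postprocessor would.
serialized = json.dumps(prediction)
parsed = json.loads(serialized)

assert parsed == prediction
assert set(parsed) == {"answer", "contexts", "context_ids", "is_answerable"}
```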
125 changes: 125 additions & 0 deletions prepare/metrics/rag.py
@@ -416,3 +416,128 @@
add_to_catalog(
    metric, f"metrics.rag.response_generation.{axis}.{base_metric}", overwrite=True
)

# end to end

end_to_end_artifact_name_to_main_score = {
    "metrics.rag.end_to_end.answer_correctness": "recall",
    "metrics.rag.end_to_end.answer_reward": "score",
    "metrics.rag.end_to_end.answer_faithfulness": "precision",
    "metrics.rag.end_to_end.context_correctness": "score",
    "metrics.rag.end_to_end.context_relevance": "score",
}

end_to_end_artifact_names_to_main_metric = {
    "metrics.rag.end_to_end.answer_correctness": "metrics.token_overlap",
    "metrics.rag.end_to_end.answer_reward": "metrics.reward.deberta_v3_large_v2",
    "metrics.rag.end_to_end.answer_faithfulness": "metrics.token_overlap",
    "metrics.rag.end_to_end.context_correctness": "metrics.mrr",
    "metrics.rag.end_to_end.context_relevance": "metrics.perplexity_q.flan_t5_small",
}

assert len(end_to_end_artifact_name_to_main_score) == len(
    end_to_end_artifact_names_to_main_metric
)

copy_field_prediction_answer_to_prediction = Copy(
    field_to_field=[
        (
            "prediction/answer",
            "prediction",
        )
    ],
)

copy_field_reference_answers_to_references = Copy(
    field_to_field={"task_data/reference_answers": "references"},
)

copy_field_reference_contexts_to_references = Copy(
    field_to_field={"task_data/reference_contexts": "references"}
)

copy_field_prediction_contexts_to_prediction = Copy(
    field_to_field=[
        (
            "prediction/contexts",
            "prediction",
        )
    ],
)

copy_field_prediction_context_ids_to_prediction = Copy(
    field_to_field=[
        (
            "prediction/context_ids",
            "prediction",
        )
    ],
)

copy_field_reference_context_ids_to_references_in_a_list = ListFieldValues(
    fields=["task_data/reference_context_ids"],
    to_field="references",
)

copy_field_prediction_contexts_to_references = Copy(
    field_to_field=[
        (
            "prediction/contexts",
            "references",
        )
    ],
)


copy_field_question_to_prediction = Copy(
    field_to_field=[
        (
            "task_data/question",
            "prediction",
        )
    ],
)

copy_field_question_to_references_in_a_list = ListFieldValues(
    fields=["task_data/question"],
    to_field="references",
)

end_to_end_artifact_names_to_preprocess_steps = {
    "metrics.rag.end_to_end.answer_correctness": [
        copy_field_prediction_answer_to_prediction,
        copy_field_reference_answers_to_references,
    ],
    "metrics.rag.end_to_end.answer_reward": [
        copy_field_prediction_answer_to_prediction,
        copy_field_question_to_references_in_a_list,
    ],
    "metrics.rag.end_to_end.answer_faithfulness": [
        copy_field_prediction_contexts_to_references,
        copy_field_prediction_answer_to_prediction,
    ],
    "metrics.rag.end_to_end.context_correctness": [
        copy_field_prediction_context_ids_to_prediction,
        copy_field_reference_context_ids_to_references_in_a_list,
    ],
    "metrics.rag.end_to_end.context_relevance": [
        copy_field_prediction_contexts_to_references,
        copy_field_question_to_prediction,
    ],
}


for artifact_name in end_to_end_artifact_names_to_preprocess_steps.keys():
    metric_short_name = artifact_name.split(".")[-1]
    if metric_short_name == "rouge":  # rouge does not need a prefix
        score_prefix = ""
    else:
        score_prefix = f"[score_prefix={metric_short_name}_]"

    metric = MetricPipeline(
        main_score=end_to_end_artifact_name_to_main_score[artifact_name],
        preprocess_steps=end_to_end_artifact_names_to_preprocess_steps[artifact_name],
        metric=f"{end_to_end_artifact_names_to_main_metric[artifact_name]}{score_prefix}",
    )

    add_to_catalog(metric, artifact_name, overwrite=True)
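The loop above derives a per-metric score prefix from the artifact name so that each sub-metric's scores stay distinguishable in the combined result. A small sketch of that naming logic in isolation (the helper name `score_prefix_for` is ours, not part of the commit):

```python
def score_prefix_for(artifact_name: str) -> str:
    # Mirrors the loop above: the last dotted component names the metric,
    # and every metric except rouge gets "<name>_" prefixed to its scores.
    metric_short_name = artifact_name.split(".")[-1]
    if metric_short_name == "rouge":
        return ""
    return f"[score_prefix={metric_short_name}_]"


assert (
    score_prefix_for("metrics.rag.end_to_end.answer_correctness")
    == "[score_prefix=answer_correctness_]"
)
assert score_prefix_for("metrics.rouge") == ""
```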
Empty file added prepare/tasks/rag/__init__.py
49 changes: 49 additions & 0 deletions prepare/tasks/rag/rag_end_to_end.py
@@ -0,0 +1,49 @@
from typing import Any, Dict, List

from unitxt import add_to_catalog
from unitxt.blocks import Task

add_to_catalog(
    Task(
        input_fields={
            "question": str,
            "question_id": Any,
            "metadata_field": str,
        },
        reference_fields={
            "reference_answers": List[str],
            "reference_contexts": List[str],
            "reference_context_ids": List[str],
            "is_answerable_label": bool,
        },
        metrics=[
            "metrics.rag.end_to_end.answer_correctness",
            "metrics.rag.end_to_end.answer_faithfulness",
            "metrics.rag.end_to_end.answer_reward",
            "metrics.rag.end_to_end.context_correctness",
            "metrics.rag.end_to_end.context_relevance",
        ],
        prediction_type=Dict[str, Any],
        augmentable_inputs=["question"],
    ),
    "tasks.rag.end_to_end",
    overwrite=True,
)

add_to_catalog(
    Task(
        input_fields={
            "document_id": str,
            "title": str,
            "passages": List[str],
            "metadata_field": str,
        },
        reference_fields={},
        prediction_type=Any,
        metrics=[
            "metrics.rouge"
        ],  # We cannot define an empty metric, so we use a simple one here, although rouge is unrelated
    ),
    "tasks.rag.corpora",
    overwrite=True,
)
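To make the declared typing concrete, here is a hedged sketch (not part of the commit; the question text and ids are invented) that checks a made-up instance against the field declarations above:

```python
from typing import Any

# Field declarations copied from the task definition above.
input_fields = {"question": str, "question_id": Any, "metadata_field": str}

# Hypothetical instance for illustration only.
instance = {
    "question": "What is CLAP NQ?",
    "question_id": "q-001",
    "metadata_field": "",
}

reference_instance = {
    "reference_answers": ["An answerable subset of Natural Questions."],
    "reference_contexts": [],
    "reference_context_ids": ["doc-123"],
    "is_answerable_label": True,
}

# Structural check: every declared input field is present, and the
# concretely-typed ones (str here) match their declared type.
for name, typ in input_fields.items():
    assert name in instance
    if typ is str:
        assert isinstance(instance[name], str)

assert isinstance(reference_instance["reference_answers"], list)
assert isinstance(reference_instance["is_answerable_label"], bool)
```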
28 changes: 28 additions & 0 deletions prepare/templates/rag/end_to_end.py
@@ -0,0 +1,28 @@
from unitxt import add_to_catalog
from unitxt.operator import SequentialOperator
from unitxt.struct_data_operators import LoadJson
from unitxt.templates import InputOutputTemplate

add_to_catalog(
    SequentialOperator(
        steps=[
            LoadJson(
                field="prediction",
                process_every_value=False,
            ),
        ]
    ),
    "processors.load_json_predictions",
    overwrite=True,
)

add_to_catalog(
    # For rag end-to-end tasks
    InputOutputTemplate(
        input_format="",
        output_format='{{"answer": "{reference_answers}", "contexts" : ["{reference_contexts}"], "context_ids" : ["{reference_context_ids}"]}}',
        postprocessors=["processors.load_json_predictions"],
    ),
    "templates.rag.end_to_end.json_predictions",
    overwrite=True,
)
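The doubled braces in `output_format` are Python format-string escapes: `{{` renders as a literal `{`, so the filled template is a JSON object string, which is why `processors.load_json_predictions` is registered as the postprocessor. A sketch with made-up values (note this simple substitution assumes the field values contain no unescaped double quotes):

```python
import json

# The same format string as in the template above.
output_format = (
    '{{"answer": "{reference_answers}", '
    '"contexts" : ["{reference_contexts}"], '
    '"context_ids" : ["{reference_context_ids}"]}}'
)

# Hypothetical reference values for illustration.
rendered = output_format.format(
    reference_answers="Paris",
    reference_contexts="France's capital is Paris.",
    reference_context_ids="doc-7",
)

# The doubled braces collapse to single braces, yielding parseable JSON.
parsed = json.loads(rendered)
assert parsed["answer"] == "Paris"
assert parsed["contexts"] == ["France's capital is Paris."]
assert parsed["context_ids"] == ["doc-7"]
```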
46 changes: 46 additions & 0 deletions src/unitxt/catalog/cards/rag/benchmark/clap_nq/en.json
@@ -0,0 +1,46 @@
{
    "__type__": "task_card",
    "loader": {
        "__type__": "load_csv",
        "sep": "\t",
        "files": {
            "train": "https://raw.githubusercontent.com/primeqa/clapnq/main/retrieval/train/question_train_answerable.tsv",
            "test": "https://raw.githubusercontent.com/primeqa/clapnq/main/retrieval/dev/question_dev_answerable.tsv"
        }
    },
    "preprocess_steps": [
        {
            "__type__": "copy",
            "field_to_field": {
                "question": "question",
                "id": "question_id"
            }
        },
        {
            "__type__": "set",
            "fields": {
                "reference_contexts": [],
                "is_answerable_label": true,
                "metadata_field": ""
            }
        },
        {
            "__type__": "list_field_values",
            "fields": [
                "doc-id-list"
            ],
            "to_field": "reference_context_ids"
        },
        {
            "__type__": "list_field_values",
            "fields": [
                "answers"
            ],
            "to_field": "reference_answers"
        }
    ],
    "task": "tasks.rag.end_to_end",
    "templates": {
        "default": "templates.rag.end_to_end.json_predictions"
    }
}