Benjams/add rag task card and metric (#1044)
* Fix bug in data classes and add support for field overriding in fields containing types or functions (#1027)

Fix data classes not supporting field overriding in fields containing types or functions

Signed-off-by: elronbandel <elron.bandel@ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Added seed to LLM as judges for consistent results (#1029)

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* replace type and __type__ in type error (#1035)

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add rag_end_to_end metrics

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add rag_end_to_end metrics

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Add task rag_end_to_end

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add card for clapnq end_to_end

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add sandbox_benjams

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add subset

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add a reduction of clap_nq

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add a reduction of clap_nq

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* remove constants

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* rename sandbox_benjams to sandbox

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* remove sandbox

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Add string to context id in rag (#1036)

* allow strings (hash) as context id

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>

* save to catalog

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>

---------

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Fixed issues with fresh install (#1037)

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add validation to tldr, remove shuffle from billsum (#1038)

* add validation to tldr, remove shuffle from billsum
(shuffled by the SplitRandomMix)

Signed-off-by: ALON HALFON <ALONHAL@il.ibm.com>

* fix formatting

Signed-off-by: ALON HALFON <ALONHAL@il.ibm.com>

---------

Signed-off-by: ALON HALFON <ALONHAL@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Refactor Rouge and Meteor to InstanceMetric for faster score computation (#1011)

* Remove confidence interval calculation for meteor metric by default

added a new metric with interval calculations

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Added error message when metrics are not a list

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Added error message

when post processors are not a list

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Changed Rouge to be HuggingfaceBulkMetric

to avoid recalculation of metric on every resample

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* added meteor as a HuggingFaceInstanceMetric

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* removed meteor_with_confidence_intervals.json

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* fixed test_metric_utils.py by concentrating on rougeL only

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* comment about rounded floats in tested scores

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* while generating metric meteor, compare against HF implementation

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* added a test comparing new Rouge with HF Rouge, and per arielge's good advice, changed bootstrap method to percentile in case of 100 or more instances

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* implemented Meteor and Rouge with inhouse code

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* download quietly, and import in prepare

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* trying to avoid .secrets.baseline

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* secret.baseline how do I get rid of it?

Signed-off-by: dafnapension <dafnashein@yahoo.com>

---------

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: dafnapension <dafnashein@yahoo.com>
Co-authored-by: dafnapension <dafnashein@yahoo.com>
Co-authored-by: Elron Bandel <elronbandel@gmail.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Add CloseTextSampler and FixedIndicesSampler (#1034)

* Add CloseTextSampler

That returns demos that are textually close to the current instance.

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Make sampler call pass current instance

Added end 2 end test of sampler that depends on output

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Added FixedIndicesSampler(Sampler):

Selects a fixed set of samples based on a list of indices from the demo pool

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Made splitter use random_generators

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Changed all Sample randomization

To use common code to create randomizer per instance

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Updated demos in test

After a non backward compatible change

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Updated demos in test

After a non backward compatible change

Signed-off-by: Yoav Katz <katz@il.ibm.com>

---------

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* changed input and output of templates to "input_fields" and "reference_fields" - Non backward compatible (#1030)

* changed input and output of templates

to "input_fields" and "reference_fields".

This is to continue the work done on tasks.

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Fixed type hint

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Documentation update

Signed-off-by: Yoav Katz <katz@il.ibm.com>

---------

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* FinQA - filter problematic examples (#1039)

filter problematic examples

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Arena hard elad2 (#1026)

* bug fixes in PairwiseChoiceTemplate

* add arena hard regex parser operator

* update mt bench card common

* update mt bench card common

* add reward bench

* update metric to pairwise comparison task

* arena hard tasks and cards

* update mt bench template

* add duplicate stream operator

* add PairwiseComparativeRatingTemplate

* add card

* add card

* add template

* add winrate metrics

* add comparative rating task

* add ExtractArenaHardNumericalJudgment

* add arena hard cards

* add arena hard template

* add weighted winrate metrics

* delete file

* update PairwiseComparativeRatingTemplate

* add metric

* add metric

* update

* update

* update

* fix template bug

* update

* llama 3 update

* update

* update

* update jsons

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* fix

* fix

* fix

* update

* update

* update

* bluebench related changes

* fix type issue

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>

* update

* update

* update

* prometheus1

* update

* fix

* fix

* merge with arena_branch

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>

* rebuild catalog

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>

* add debugging to clapnq

* Reproduce all artifacts

* Add missing artifacts to catalog

* Add secrets baseline

Signed-off-by: Elad Venezian <eladv@il.ibm.com>

* Fix bugs with catalog creation

* Remove arena hard examples from tests, since they don't pass

* Add missing metadata to test mock

* Add data_classification_policy and recipe_metadata to the streams tests

* Fix test failures

* Update multi_turn_gpt4_judgement.py

* Update multi_turn_with_reference_gpt4_judgement.py

* Update docs/docs/examples.rst

Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>

* revert catalog consistency and preparation yml files

* revert catalog consistency and preparation yml files

* revert catalog consistency and preparation yml files

* revert catalog consistency and preparation yml files

* Update docs/docs/examples.rst

Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>

* bug fix in LoadFromHFSpace

* revert

* revert

* update examples

* add comment to explain change

* update to new params usage

* pr fixes

* pr fixes

* update

* update

* update

* update

* update

* update

* Update prepare/templates/rag/response_generation.py

Co-authored-by: Yotam Perlitz <perlitz@gmail.com>

* Update prepare/templates/rag/response_generation.py

Co-authored-by: Yotam Perlitz <perlitz@gmail.com>

* update

* cr fixes

* llmaj format fix

* llmaj format fix

---------

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>
Signed-off-by: Elad Venezian <eladv@il.ibm.com>
Co-authored-by: ofirarviv <ofir.arviv@ibm.com>
Co-authored-by: Yotam Perlitz <yotam.perlitz@ibm.com>
Co-authored-by: michal <shmueli@il.ibm.com>
Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>
Co-authored-by: Yotam Perlitz <perlitz@gmail.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* demo's target prefix is now taken from demo instance (#1031)

* demo's target prefix is now taken from demo instance

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* do not pop fields out of demo instances.
Traditionally done for the main instance, but not allowed for a demo instance that should also serve other main instances in the stream

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* simplified test-case per @yoavkatz idea. Still, eager mode samples different demos than non-eager mode

Signed-off-by: dafnapension <dafnashein@yahoo.com>

---------

Signed-off-by: dafnapension <dafnashein@yahoo.com>
Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* remove the reduced clap_nq

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* define an empty template for rag end_to_end

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Implement metrics ensemble (#1047)

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add load_json_predictions as processor in the template

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add the processors/load_json_predictions.json generated to the catalog

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Add flores101 (#1053)

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Added example for selection of demos (#1052)

* Added example for selection of demos

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Added example doc

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Update docs/docs/examples.rst

* Update docs/docs/examples.rst

---------

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* fix - building test is not working: opendatasets depends on kaggle without a version pin, and kaggle 1.6.15 currently fails. We pin kaggle to 1.6.14 as a fix

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add overwrite

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Update introduction.rst - copy edits (grammar, consistency, clarity) (#1063)

Signed-off-by: welisheva22 <welisheva22@gmail.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Fix typo in japanese_llama system prompt (issue #964) (#1056)

Signed-off-by: Jonathan Bnayahu <bnayahu@il.ibm.com>
Co-authored-by: Elron Bandel <elronbandel@gmail.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Allow assigning None in overwrites when fetching artifacts with modifications (#1062)

allow =None in overwrites for fetch

Signed-off-by: dafnapension <dafnashein@yahoo.com>
Co-authored-by: Elron Bandel <elronbandel@gmail.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Make sure preparation times printed fully and nicely (#1046)

Signed-off-by: elronbandel <elron.bandel@ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* numeric nlg - template changes (#1041)

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add judge input to the metric (#1064)

* add judge input to the metric

* add judge input to the metric

* fix

* fix test

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Unitxt capitalization adding_dataset.rst (#1057)

making Unitxt capitalization consistent in text

Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* fixed the score_ci inconsistency issue (#1065)

* suggested fix for score_ci inconsistency issue

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* unified with the update, and thus simplified the check

Signed-off-by: dafnapension <dafnashein@yahoo.com>

---------

Signed-off-by: dafnapension <dafnashein@yahoo.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Use of conventional python types in input definition of tasks and metrics (#1045)

* Fix data classes not supporting field overriding in fields containing types or functions

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Make tasks types python types

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Fix errors

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Some fixes

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* More fixes

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Update catalog

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Fix cards

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Revert change

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Fix typing in docs with new convention

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* refactor of new asset to new convention

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Update secrets baseline

Signed-off-by: elronbandel <elron.bandel@ibm.com>

---------

Signed-off-by: elronbandel <elron.bandel@ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Added prediction type to llm as judge to avoid warning (#1072)

* Added prediction type to llm as judge to avoid warning

Clarified the standalone llm as judge example

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Removed accidentally added file

Signed-off-by: Yoav Katz <katz@il.ibm.com>

---------

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Fixed clapnq to check with reasonable error values

Also updated rag tasks to use new typing (instead of string types)

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* fix the type hint

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* update catalog

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Add metric "metrics.rag.retrieval_at_k" to catalog (#1074)

* add metric "metrics.rag.retrieval_at_k" to catalog
this is a wrapper around the retrieval_at_k for the ragas scheme

* add corresponding json file for the new metric

---------

Co-authored-by: Elron Bandel <elronbandel@gmail.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* merge - resolve conflict

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

---------

Signed-off-by: elronbandel <elron.bandel@ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>
Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>
Signed-off-by: ALON HALFON <ALONHAL@il.ibm.com>
Signed-off-by: dafnapension <dafnashein@yahoo.com>
Signed-off-by: Elad Venezian <eladv@il.ibm.com>
Signed-off-by: welisheva22 <welisheva22@gmail.com>
Signed-off-by: Jonathan Bnayahu <bnayahu@il.ibm.com>
Co-authored-by: Elron Bandel <elronbandel@gmail.com>
Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>
Co-authored-by: Yotam Perlitz <perlitz@gmail.com>
Co-authored-by: Benjamin Sznajder <benjams@il.ibm.com>
Co-authored-by: Alon H <alonh@users.noreply.github.com>
Co-authored-by: dafnapension <dafnashein@yahoo.com>
Co-authored-by: ShirApp <58909189+ShirApp@users.noreply.github.com>
Co-authored-by: Elad <eladv@il.ibm.com>
Co-authored-by: ofirarviv <ofir.arviv@ibm.com>
Co-authored-by: Yotam Perlitz <yotam.perlitz@ibm.com>
Co-authored-by: michal <shmueli@il.ibm.com>
Co-authored-by: dafnapension <46454972+dafnapension@users.noreply.github.com>
Co-authored-by: welisheva22 <welisheva22@gmail.com>
Co-authored-by: Jonathan Bnayahu <bnayahu@il.ibm.com>
Co-authored-by: hanansinger <95229126+hanansinger@users.noreply.github.com>
Co-authored-by: Yoav Katz <katz@il.ibm.com>
Co-authored-by: matanor <55045955+matanor@users.noreply.github.com>
18 people authored and csrajmohan committed Aug 29, 2024
1 parent a61c53e commit 7e1ee30
Showing 17 changed files with 586 additions and 0 deletions.
123 changes: 123 additions & 0 deletions prepare/cards/rag/end_to_end/clapnq.py
@@ -0,0 +1,123 @@
import json
from dataclasses import dataclass

from unitxt import add_to_catalog
from unitxt.blocks import TaskCard, TemplatesDict
from unitxt.loaders import LoadCSV
from unitxt.operators import Copy, ListFieldValues, Set
from unitxt.templates import InputOutputTemplate
from unitxt.test_utils.card import test_card


@dataclass(frozen=True)
class ClapNqBenchmark:
    # Raw data
    TRAIN_RAW_FILE_URL: str = "https://raw.githubusercontent.com/primeqa/clapnq/main/retrieval/train/question_train_answerable.tsv"
    TEST_RAW_FILE_URL: str = "https://raw.githubusercontent.com/primeqa/clapnq/main/retrieval/dev/question_dev_answerable.tsv"

    # Fields
    ID: str = "id"
    QUESTION: str = "question"
    DOC_ID_LIST: str = "doc-id-list"
    ANSWERS: str = "answers"


@dataclass(frozen=True)
class ClapNqDocuments:
    # Raw data
    RAW_FILE_URL: str = "https://media.githubusercontent.com/media/primeqa/clapnq/main/retrieval/passages.tsv"

    # Fields
    ID: str = "id"
    TEXT: str = "text"
    TITLE: str = "title"

    ARTIFACT_NAME: str = "cards.rag.documents.clap_nq.en"


card = TaskCard(
    loader=LoadCSV(
        sep="\t",
        files={
            "train": ClapNqBenchmark.TRAIN_RAW_FILE_URL,
            "test": ClapNqBenchmark.TEST_RAW_FILE_URL,
        },
    ),
    preprocess_steps=[
        Copy(
            field_to_field={
                ClapNqBenchmark.QUESTION: "question",
                ClapNqBenchmark.ID: "question_id",
            },
        ),
        Set(
            fields={
                "reference_contexts": [],
                "is_answerable_label": True,
                "metadata_field": "",
            }
        ),
        ListFieldValues(
            fields=[ClapNqBenchmark.DOC_ID_LIST],
            to_field="reference_context_ids",
        ),
        ListFieldValues(
            fields=[ClapNqBenchmark.ANSWERS],
            to_field="reference_answers",
        ),
    ],
    task="tasks.rag.end_to_end",
    # templates=["templates.empty"],
    templates=TemplatesDict({"default": "templates.rag.end_to_end.json_predictions"}),
)

wrong_answer = {
    "contexts": ["hi"],
    "is_answerable": True,
    "answer": "Don't know",
    "context_ids": ["id0"],
}
test_card(
    card,
    strict=True,
    full_mismatch_prediction_values=[json.dumps(wrong_answer)],
    debug=False,
    demos_taken_from="test",
    demos_pool_size=5,
)

add_to_catalog(card, "cards.rag.benchmark.clap_nq.en", overwrite=True)

# Documents
card = TaskCard(
    loader=LoadCSV(sep="\t", files={"train": ClapNqDocuments.RAW_FILE_URL}),
    preprocess_steps=[
        Copy(
            field_to_field={
                ClapNqDocuments.ID: "document_id",
                ClapNqDocuments.TITLE: "title",
            },
        ),
        ListFieldValues(
            fields=[ClapNqDocuments.TEXT],
            to_field="passages",
        ),
        Set(
            fields={
                "metadata_field": "",
            }
        ),
    ],
    task="tasks.rag.corpora",
    templates=TemplatesDict(
        {
            "empty": InputOutputTemplate(
                input_format="",
                output_format="",
            ),
        }
    ),
)

# Not testing card, because documents are not evaluated.
add_to_catalog(card, "cards.rag.documents.clap_nq.en", overwrite=True)
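The `wrong_answer` dictionary above illustrates the shape of an end-to-end prediction. As a minimal sketch (not part of the commit; the answer text and ids below are invented for illustration), predictions travel as JSON strings carrying the same four fields, which `processors.load_json_predictions` parses back into dicts before the metrics run:

```python
import json

# Hypothetical prediction payload, mirroring the fields of `wrong_answer` above.
prediction = {
    "answer": "CLAP NQ is an answerable subset of Natural Questions.",
    "contexts": ["CLAP NQ is built from the Natural Questions dataset."],
    "context_ids": ["doc-123"],
    "is_answerable": True,
}

# Serialize as the model would emit it, then parse as the postprocessor would.
serialized = json.dumps(prediction)
parsed = json.loads(serialized)

assert parsed == prediction
assert set(parsed) == {"answer", "contexts", "context_ids", "is_answerable"}
```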
125 changes: 125 additions & 0 deletions prepare/metrics/rag.py
@@ -416,3 +416,128 @@
add_to_catalog(
    metric, f"metrics.rag.response_generation.{axis}.{base_metric}", overwrite=True
)

# end to end

end_to_end_artifact_name_to_main_score = {
    "metrics.rag.end_to_end.answer_correctness": "recall",
    "metrics.rag.end_to_end.answer_reward": "score",
    "metrics.rag.end_to_end.answer_faithfulness": "precision",
    "metrics.rag.end_to_end.context_correctness": "score",
    "metrics.rag.end_to_end.context_relevance": "score",
}

end_to_end_artifact_names_to_main_metric = {
    "metrics.rag.end_to_end.answer_correctness": "metrics.token_overlap",
    "metrics.rag.end_to_end.answer_reward": "metrics.reward.deberta_v3_large_v2",
    "metrics.rag.end_to_end.answer_faithfulness": "metrics.token_overlap",
    "metrics.rag.end_to_end.context_correctness": "metrics.mrr",
    "metrics.rag.end_to_end.context_relevance": "metrics.perplexity_q.flan_t5_small",
}

assert len(end_to_end_artifact_name_to_main_score) == len(
    end_to_end_artifact_names_to_main_metric
)

copy_field_prediction_answer_to_prediction = Copy(
    field_to_field=[
        (
            "prediction/answer",
            "prediction",
        )
    ],
)

copy_field_reference_answers_to_references = Copy(
    field_to_field={"task_data/reference_answers": "references"},
)

copy_field_reference_contexts_to_references = Copy(
    field_to_field={"task_data/reference_contexts": "references"}
)

copy_field_prediction_contexts_to_prediction = Copy(
    field_to_field=[
        (
            "prediction/contexts",
            "prediction",
        )
    ],
)

copy_field_prediction_context_ids_to_prediction = Copy(
    field_to_field=[
        (
            "prediction/context_ids",
            "prediction",
        )
    ],
)

copy_field_reference_context_ids_to_references_in_a_list = ListFieldValues(
    fields=["task_data/reference_context_ids"],
    to_field="references",
)

copy_field_prediction_contexts_to_references = Copy(
    field_to_field=[
        (
            "prediction/contexts",
            "references",
        )
    ],
)


copy_field_question_to_prediction = Copy(
    field_to_field=[
        (
            "task_data/question",
            "prediction",
        )
    ],
)

copy_field_question_to_references_in_a_list = ListFieldValues(
    fields=["task_data/question"],
    to_field="references",
)

end_to_end_artifact_names_to_preprocess_steps = {
    "metrics.rag.end_to_end.answer_correctness": [
        copy_field_prediction_answer_to_prediction,
        copy_field_reference_answers_to_references,
    ],
    "metrics.rag.end_to_end.answer_reward": [
        copy_field_prediction_answer_to_prediction,
        copy_field_question_to_references_in_a_list,
    ],
    "metrics.rag.end_to_end.answer_faithfulness": [
        copy_field_prediction_contexts_to_references,
        copy_field_prediction_answer_to_prediction,
    ],
    "metrics.rag.end_to_end.context_correctness": [
        copy_field_prediction_context_ids_to_prediction,
        copy_field_reference_context_ids_to_references_in_a_list,
    ],
    "metrics.rag.end_to_end.context_relevance": [
        copy_field_prediction_contexts_to_references,
        copy_field_question_to_prediction,
    ],
}


for artifact_name in end_to_end_artifact_names_to_preprocess_steps.keys():
    metric_short_name = artifact_name.split(".")[-1]
    if metric_short_name == "rouge":  # rouge does not need a prefix
        score_prefix = ""
    else:
        score_prefix = f"[score_prefix={metric_short_name}_]"

    metric = MetricPipeline(
        main_score=end_to_end_artifact_name_to_main_score[artifact_name],
        preprocess_steps=end_to_end_artifact_names_to_preprocess_steps[artifact_name],
        metric=f"{end_to_end_artifact_names_to_main_metric[artifact_name]}{score_prefix}",
    )

    add_to_catalog(metric, artifact_name, overwrite=True)
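The loop above derives a per-metric score prefix from the artifact name so that each sub-metric's scores stay distinguishable in the combined result. A small sketch of that naming logic in isolation (the helper name `score_prefix_for` is ours, not part of the commit):

```python
def score_prefix_for(artifact_name: str) -> str:
    # Mirrors the loop above: the last dotted component names the metric,
    # and every metric except rouge gets "<name>_" prefixed to its scores.
    metric_short_name = artifact_name.split(".")[-1]
    if metric_short_name == "rouge":
        return ""
    return f"[score_prefix={metric_short_name}_]"


assert (
    score_prefix_for("metrics.rag.end_to_end.answer_correctness")
    == "[score_prefix=answer_correctness_]"
)
assert score_prefix_for("metrics.rouge") == ""
```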
Empty file added prepare/tasks/rag/__init__.py
49 changes: 49 additions & 0 deletions prepare/tasks/rag/rag_end_to_end.py
@@ -0,0 +1,49 @@
from typing import Any, Dict, List

from unitxt import add_to_catalog
from unitxt.blocks import Task

add_to_catalog(
    Task(
        input_fields={
            "question": str,
            "question_id": Any,
            "metadata_field": str,
        },
        reference_fields={
            "reference_answers": List[str],
            "reference_contexts": List[str],
            "reference_context_ids": List[str],
            "is_answerable_label": bool,
        },
        metrics=[
            "metrics.rag.end_to_end.answer_correctness",
            "metrics.rag.end_to_end.answer_faithfulness",
            "metrics.rag.end_to_end.answer_reward",
            "metrics.rag.end_to_end.context_correctness",
            "metrics.rag.end_to_end.context_relevance",
        ],
        prediction_type=Dict[str, Any],
        augmentable_inputs=["question"],
    ),
    "tasks.rag.end_to_end",
    overwrite=True,
)

add_to_catalog(
    Task(
        input_fields={
            "document_id": str,
            "title": str,
            "passages": List[str],
            "metadata_field": str,
        },
        reference_fields={},
        prediction_type=Any,
        metrics=[
            "metrics.rouge"
        ],  # We cannot define an empty metric, so we use a simple one here, although rouge is unrelated
    ),
    "tasks.rag.corpora",
    overwrite=True,
)
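To make the declared typing concrete, here is a hedged sketch (not part of the commit; the question text and ids are invented) that checks a made-up instance against the field declarations above:

```python
from typing import Any

# Field declarations copied from the task definition above.
input_fields = {"question": str, "question_id": Any, "metadata_field": str}

# Hypothetical instance for illustration only.
instance = {
    "question": "What is CLAP NQ?",
    "question_id": "q-001",
    "metadata_field": "",
}

reference_instance = {
    "reference_answers": ["An answerable subset of Natural Questions."],
    "reference_contexts": [],
    "reference_context_ids": ["doc-123"],
    "is_answerable_label": True,
}

# Structural check: every declared input field is present, and the
# concretely-typed ones (str here) match their declared type.
for name, typ in input_fields.items():
    assert name in instance
    if typ is str:
        assert isinstance(instance[name], str)

assert isinstance(reference_instance["reference_answers"], list)
assert isinstance(reference_instance["is_answerable_label"], bool)
```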
28 changes: 28 additions & 0 deletions prepare/templates/rag/end_to_end.py
@@ -0,0 +1,28 @@
from unitxt import add_to_catalog
from unitxt.operator import SequentialOperator
from unitxt.struct_data_operators import LoadJson
from unitxt.templates import InputOutputTemplate

add_to_catalog(
    SequentialOperator(
        steps=[
            LoadJson(
                field="prediction",
                process_every_value=False,
            ),
        ]
    ),
    "processors.load_json_predictions",
    overwrite=True,
)

add_to_catalog(
    # For rag end-to-end tasks
    InputOutputTemplate(
        input_format="",
        output_format='{{"answer": "{reference_answers}", "contexts" : ["{reference_contexts}"], "context_ids" : ["{reference_context_ids}"]}}',
        postprocessors=["processors.load_json_predictions"],
    ),
    "templates.rag.end_to_end.json_predictions",
    overwrite=True,
)
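The doubled braces in `output_format` are Python format-string escapes: `{{` renders as a literal `{`, so the filled template is a JSON object string, which is why `processors.load_json_predictions` is registered as the postprocessor. A sketch with made-up values (note this simple substitution assumes the field values contain no unescaped double quotes):

```python
import json

# The same format string as in the template above.
output_format = (
    '{{"answer": "{reference_answers}", '
    '"contexts" : ["{reference_contexts}"], '
    '"context_ids" : ["{reference_context_ids}"]}}'
)

# Hypothetical reference values for illustration.
rendered = output_format.format(
    reference_answers="Paris",
    reference_contexts="France's capital is Paris.",
    reference_context_ids="doc-7",
)

# The doubled braces collapse to single braces, yielding parseable JSON.
parsed = json.loads(rendered)
assert parsed["answer"] == "Paris"
assert parsed["contexts"] == ["France's capital is Paris."]
assert parsed["context_ids"] == ["doc-7"]
```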
46 changes: 46 additions & 0 deletions src/unitxt/catalog/cards/rag/benchmark/clap_nq/en.json
@@ -0,0 +1,46 @@
{
    "__type__": "task_card",
    "loader": {
        "__type__": "load_csv",
        "sep": "\t",
        "files": {
            "train": "https://raw.githubusercontent.com/primeqa/clapnq/main/retrieval/train/question_train_answerable.tsv",
            "test": "https://raw.githubusercontent.com/primeqa/clapnq/main/retrieval/dev/question_dev_answerable.tsv"
        }
    },
    "preprocess_steps": [
        {
            "__type__": "copy",
            "field_to_field": {
                "question": "question",
                "id": "question_id"
            }
        },
        {
            "__type__": "set",
            "fields": {
                "reference_contexts": [],
                "is_answerable_label": true,
                "metadata_field": ""
            }
        },
        {
            "__type__": "list_field_values",
            "fields": [
                "doc-id-list"
            ],
            "to_field": "reference_context_ids"
        },
        {
            "__type__": "list_field_values",
            "fields": [
                "answers"
            ],
            "to_field": "reference_answers"
        }
    ],
    "task": "tasks.rag.end_to_end",
    "templates": {
        "default": "templates.rag.end_to_end.json_predictions"
    }
}