Refactor Rouge and Meteor to InstanceMetric for faster score computation #1011
Conversation
@yoavkatz, maybe change it in the class itself, not only in the catalog? That way, people who fetch the metric as I did still get the default.
Meteor (unlike Rouge) does not have a class of its own; it is only an instantiation of the Hugging Face metric. I think the reason Rouge is slow is that it does its own bootstrapping internally: https://huggingface.co/spaces/evaluate-metric/rouge/blob/main/rouge.py#L134 I think the solution is to avoid using aggregation.
@eladven, @yoavkatz, I think the problem with how unitxt uses HF Rouge and Meteor (and others) is that they are made Global, even keeping the default arg of GlobalMetric: I think it is better to fix that issue than to flood the catalog with pairs of cards, with and without CI.
Please see my comment before you merge. I think it is not a good idea to run away from CI when this is not necessary.
I agree. I changed Rouge to be a HuggingfaceBulkMetric, and that solved the issue for Rouge. Please review and let me know what you think. Regarding Meteor, can you make the corresponding change?
Yes, coming up.
Hi @yoavkatz, I added Meteor as a HuggingFaceInstance, with CI, to see how this works.
In general, BulkInstanceMetric does not perform aggregation and assumes the results are returned per instance. It is just a matter of optimization: for example, in LLM-based metrics we send a batch of requests at a time instead of one by one, which is much slower. In Rouge, I removed use_aggregation=True so that the metric returns the Rouge score for each instance, which is then averaged by the generic BulkInstanceMetric code.
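The difference between the two bootstrap styles discussed above can be sketched in a few lines. This is only an illustration, not unitxt code: `token_overlap`, `CallCounter`, `bootstrap_global` and `bootstrap_instance` are hypothetical names, and the toy scorer merely stands in for Rouge or Meteor. It shows that resampling cached per-instance scores gives the same resampled means as re-running the scorer on every resample, with far fewer scorer calls.

```python
import random
from statistics import mean


def token_overlap(prediction: str, reference: str) -> float:
    """Toy per-instance score (stand-in for Rouge/Meteor):
    fraction of reference tokens that also appear in the prediction."""
    ref_tokens = reference.split()
    pred_tokens = set(prediction.split())
    return sum(tok in pred_tokens for tok in ref_tokens) / len(ref_tokens)


class CallCounter:
    """Wraps a scorer and counts how many times it is invoked."""

    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, *args):
        self.calls += 1
        return self.fn(*args)


def bootstrap_global(pairs, score_fn, n_resamples, rng):
    """Global-metric style: the scorer is re-run on every resampled instance."""
    n = len(pairs)
    means = []
    for _ in range(n_resamples):
        sample = [pairs[rng.randrange(n)] for _ in range(n)]
        means.append(mean([score_fn(p, r) for p, r in sample]))
    return means


def bootstrap_instance(pairs, score_fn, n_resamples, rng):
    """Instance-metric style: score each instance once, then resample the scores."""
    scores = [score_fn(p, r) for p, r in pairs]
    n = len(scores)
    return [
        mean([scores[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    ]


pairs = [
    ("the cat sat", "the cat sat down"),
    ("a dog", "the dog barked"),
    ("hello world", "hello world"),
]
global_scorer = CallCounter(token_overlap)
instance_scorer = CallCounter(token_overlap)

# Same seed, so both styles draw exactly the same resamples.
global_means = bootstrap_global(pairs, global_scorer, 50, random.Random(0))
instance_means = bootstrap_instance(pairs, instance_scorer, 50, random.Random(0))

print(global_scorer.calls, instance_scorer.calls)
```

The equivalence holds only when the aggregate is a plain average of per-instance scores; a corpus-level metric cannot be cached this way.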
@dafnapension, it would be good if we added a unit test that checks that wrapping a metric (like Rouge) in HuggingfaceInstanceMetric or HuggingfaceBulkInstanceMetric gives the same results.
Force-pushed from bbaf1ca to a63edfc.
A good idea, coming up!
Force-pushed from a63edfc to 8a3ad1d.
I added a comparison of Meteor with its previous global HF implementation, run when the metric is prepared (in Meteor.py).
Hi @yoavkatz, I added a test comparing the old and new Rouge, to show that they return the same result. Last: please note that for Rouge, if HF's
Force-pushed from 07f2ffd to 6040057.
src/unitxt/metrics.py
Outdated
@@ -327,6 +327,11 @@ def score_based_confidence_interval(
# otherwise, the aggregation_func needs to be applied AFTER resampling the instances;
# that is, re-form the groups, calculate the function, and take the mean of the group scores
aggregation_func = self.average_item_scores

method = "BCa"
if len(instances) > 100:
What is the difference in run time and accuracy between BCa and percentile?
I do not really know. I took the advice of @arielge and @assaftibm, here: #1008 (comment) and #1008 (comment)
I found information about percentile and BCa here.
From my understanding (@assaftibm, @arielge, please correct my mistakes):
'percentile' works by the following intuitive notion: draw n_resamples resamples of the population [the stream fed to the metric]. Each resample has the size of the population and is built by repeatedly selecting an item [instance] from the population, independently and identically distributed, with replacement [each selected item is returned to the population before the next draw].
Then, for each resample, compute the statistic (the metric) on it, the same way the metric is computed on the original population.
Then, given the n_resamples scores (one per resample), return their 0.05 and 0.95 quantiles (that is, the 5th and 95th percentiles) as the CI borders.
BCa starts from the same n_resamples scores and corrects the interval for the bias and skewness observed when comparing them to the original population.
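The percentile procedure described above can be written out directly. The following is a minimal standard-library sketch, not the SciPy or unitxt implementation; `quantile` and `percentile_bootstrap_ci` are illustrative names, and the 0.05 and 0.95 defaults simply mirror the numbers in the comment (they give a 90% interval).

```python
import random
from statistics import mean


def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list, 0 <= q <= 1."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


def percentile_bootstrap_ci(scores, statistic=mean, n_resamples=1000,
                            low_q=0.05, high_q=0.95, seed=0):
    """Percentile bootstrap: resample with replacement, recompute the
    statistic on each resample, and return two quantiles of the results."""
    rng = random.Random(seed)
    n = len(scores)
    resampled_stats = []
    for _ in range(n_resamples):
        # Each resample has the size of the original population.
        resample = [scores[rng.randrange(n)] for _ in range(n)]
        resampled_stats.append(statistic(resample))
    resampled_stats.sort()
    return quantile(resampled_stats, low_q), quantile(resampled_stats, high_q)


instance_scores = [0.2, 0.9, 0.4, 0.7, 0.5, 0.6, 0.3, 0.8, 0.55, 0.45]
ci_low, ci_high = percentile_bootstrap_ci(instance_scores)
print(f"mean={mean(instance_scores):.3f}  CI=({ci_low:.3f}, {ci_high:.3f})")
```

BCa would reuse the same resampled statistics but shift the two quantile levels before reading them off, which is why it needs extra computation beyond this loop.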
I'm not sure it matters, but I saw somewhere that BCa is the one that is commonly used. If the difference in runtime is small (I would think the main cost is recalculating the metric, not BCa vs. percentile), then I would keep it simple and not change behavior at some threshold.
So, can you try running twice (with BCa and with percentile) on, say, 200 instances, and see the difference in runtime?
Hi @dafnapension!
Yes, I think your understanding of the Percentile bootstrap is correct.
I would suggest the following:
1. When it's an InstanceMetric, use BCa bootstrap.
2. When it's a GlobalMetric, use Basic bootstrap (or Percentile, see discussion below).
In 2, I wouldn't condition the type of bootstrap on the size of the dataset, switching between BCa and a lighter bootstrap, because it can confuse people who test a full dataset vs. smaller samples of it: they will suddenly get different CIs due to a switch in the bootstrap type. I think it's better to stick to one for consistency.
Regarding Basic vs. Percentile bootstrap:
There is a description here. I have to admit that I don't fully understand the computation of the Basic bootstrap. However, note that the SciPy documentation includes this comment:
"While the 'percentile' method is the most intuitive, it is rarely used in practice. Two more common methods are available, 'basic' ('reverse percentile') and 'BCa' ('bias-corrected and accelerated'); they differ in how step 3 is performed."
So this is the main reason why I would lean towards Basic over Percentile. It would be easier if we could stick to BCa in GlobalMetrics as well, but unfortunately that doesn't seem practical.
As a side note, I think it makes sense to ask about the slowdown with BCa on StackOverflow, showing a short self-contained code sample that reproduces the problem using SciPy's implementation and asking whether this is expected. It's possible that the implementation in SciPy is not optimal.
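For readers unsure how the Basic ("reverse percentile") interval is computed: it uses the same resampled statistics as the Percentile method, but reflects their quantiles around the point estimate. The sketch below is a standard-library illustration under that textbook definition, not SciPy's code; the function names are made up for this example, and `alpha` here is the per-tail level.

```python
import random
from statistics import mean


def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list, 0 <= q <= 1."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


def bootstrap_stats(scores, statistic, n_resamples, seed):
    """Sorted statistic values over resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(scores)
    return sorted(
        statistic([scores[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )


def percentile_ci(scores, statistic=mean, n_resamples=1000, alpha=0.05, seed=0):
    stats = bootstrap_stats(scores, statistic, n_resamples, seed)
    return quantile(stats, alpha), quantile(stats, 1 - alpha)


def basic_ci(scores, statistic=mean, n_resamples=1000, alpha=0.05, seed=0):
    stats = bootstrap_stats(scores, statistic, n_resamples, seed)
    theta_hat = statistic(scores)  # point estimate on the original data
    # Reflect the bootstrap quantiles around the point estimate.
    return (2 * theta_hat - quantile(stats, 1 - alpha),
            2 * theta_hat - quantile(stats, alpha))


# A skewed toy sample, where the two intervals visibly differ.
scores = [0.0, 0.0, 0.1, 0.1, 0.2, 0.2, 0.3, 0.9, 1.0, 1.0]
p_low, p_high = percentile_ci(scores)
b_low, b_high = basic_ci(scores)
print("percentile:", (round(p_low, 3), round(p_high, 3)))
print("basic:     ", (round(b_low, 3), round(b_high, 3)))
```

Both methods cost the same number of statistic evaluations, which fits the suggestion that either is a lighter alternative to BCa for global metrics.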
In all my experiments, for lack of real data, I used prediction = target (the first reference). I think this only changes the constant time of computing each instance score, so I allowed myself to do so.
In my humble opinion, with the numbers of instances and resamples that I see in the cards, we can stick to BCa, which is said to be the most commonly used. We should just remember to check again if these numbers grow dramatically.
Thanks for the detailed analysis. BCa seems to increase the time about 10%-20% in instance metrics, but more than doubles the time in global metrics.
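One plausible contributor to this gap (an assumption offered for illustration, not something measured in this thread): BCa estimates its acceleration term from leave-one-out (jackknife) values of the statistic, so it evaluates the statistic once more per instance on top of the resamples. When each evaluation is a cheap average of cached scores this hardly matters; when each evaluation re-runs a global metric it can dominate. The sketch below only counts statistic evaluations; the 200 instances and 100 resamples are made-up numbers, and a full BCa interval also needs a bias-correction term that is not shown.

```python
import random
from statistics import mean


class CountingStatistic:
    """Wraps a statistic and counts how many times it is evaluated."""

    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, data):
        self.calls += 1
        return self.fn(data)


def bootstrap_replicates(data, statistic, n_resamples, rng):
    """Statistic values on resamples drawn with replacement."""
    n = len(data)
    return [
        statistic([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    ]


def jackknife_replicates(data, statistic):
    """Leave-one-out statistic values: one evaluation per instance."""
    return [statistic(data[:i] + data[i + 1:]) for i in range(len(data))]


def bca_acceleration(jackknife_values):
    """Textbook BCa acceleration estimate from jackknife values."""
    center = mean(jackknife_values)
    numerator = sum((center - v) ** 3 for v in jackknife_values)
    denominator = 6.0 * sum((center - v) ** 2 for v in jackknife_values) ** 1.5
    return numerator / denominator if denominator else 0.0


rng = random.Random(0)
scores = [rng.random() for _ in range(200)]  # hypothetical: 200 instances
N_RESAMPLES = 100                            # hypothetical resample count

# Percentile: only the resamples are evaluated.
percentile_stat = CountingStatistic(mean)
bootstrap_replicates(scores, percentile_stat, N_RESAMPLES, random.Random(1))
percentile_calls = percentile_stat.calls

# BCa: the same resamples plus one jackknife evaluation per instance.
bca_stat = CountingStatistic(mean)
bootstrap_replicates(scores, bca_stat, N_RESAMPLES, random.Random(1))
accel = bca_acceleration(jackknife_replicates(scores, bca_stat))
bca_calls = bca_stat.calls

print(percentile_calls, bca_calls)
```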
So do we agree to keep BCa always? I don't think we have many more global metrics that take a long time - and it's the simplest approach.
I, too, think so. Leave the change for a future consideration, if and when needed.
@elronbandel also asked that I simply copy over the code from what we find in HF to an in-house instance metric, for a further speedup.
I think the problem will appear with more complex global metrics like CorpusBLEU, but I agree it can be dealt with in a separate PR.
Force-pushed from 6040057 to 2381c4e.
Hi @yoavkatz, looking at these lines (Line 1305 in 2381c4e),
it seems that you assume that every score that is a list has one component per instance. That is true for Rouge, but not for Bleu, for example, where the score named precisions is an array whose length equals the input argument max_order. I think that this breakdown of list scores into instance scores should be done at the level of the individual metric, not at the BulkMetric level.
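The concern above can be made concrete with a small sketch. Neither function below is unitxt code: `naive_split` and `split_declared_scores` are hypothetical helpers, and the score dictionaries are hand-written to resemble Rouge-like and Bleu-like outputs. The naive version treats every list as per-instance data; the second only splits keys that the individual metric has declared as per-instance.

```python
def naive_split(scores: dict, n_instances: int) -> list:
    """Assumes every list-valued score holds one value per instance."""
    instance_scores = [{} for _ in range(n_instances)]
    for key, value in scores.items():
        if isinstance(value, list):
            for inst, v in zip(instance_scores, value):
                inst[key] = v
    return instance_scores


def split_declared_scores(scores: dict, n_instances: int,
                          per_instance_keys: set) -> list:
    """Splits only the keys the metric itself declares as per-instance."""
    instance_scores = [{} for _ in range(n_instances)]
    for key in per_instance_keys:
        values = scores[key]
        if len(values) != n_instances:
            raise ValueError(
                f"score '{key}' has {len(values)} values for {n_instances} instances"
            )
        for inst, v in zip(instance_scores, values):
            inst[key] = v
    return instance_scores


# Rouge-like output: one value per instance (3 instances).
rouge_like = {"rougeL": [0.5, 1.0, 0.25]}
# Bleu-like output: 'precisions' has one value per n-gram order
# (max_order = 4 here), regardless of how many instances were scored.
bleu_like = {"bleu": 0.4, "precisions": [0.8, 0.6, 0.4, 0.2]}

# With a batch of 4 instances, the naive split silently hands each
# instance one n-gram precision as if it were that instance's score.
wrong = naive_split(bleu_like, 4)
right = split_declared_scores(bleu_like, 4, per_instance_keys=set())
rouge_split = split_declared_scores(rouge_like, 3, per_instance_keys={"rougeL"})
print(wrong[0], right[0], rouge_split[1])
```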
I suggest we complete this PR on Rouge and Meteor, and then think about Bleu and about generalizing this.
Note that there is HuggingfaceInstanceMetric, which is used by Rouge and Meteor, and HuggingfaceBulkMetric, which is only used by BertScore as far as I understand.
Hi @yoavkatz, indeed I was referring to a generalization (for a future PR), trying to address your comment that, performance-wise, it is better to approach HF in batches where possible.
Force-pushed from 039f3b4 to 9357b60.
src/unitxt/metrics.py
Outdated
nltk.download("wordnet")
nltk.download("omw-1.4")
- nltk.download("wordnet")
- nltk.download("omw-1.4")
+ nltk.download("wordnet", quiet=True)
+ nltk.download("omw-1.4", quiet=True)
src/unitxt/metrics.py
Outdated
from nltk import word_tokenize
from nltk.translate import meteor_score
Move these to prepare() so that the import is performed only once.
Thanks, @elronbandel. At last I learned (from you, as always) how to avoid those "already up to date" messages scrolling on my screen.
@dafnapension, looks great. You only need to move the imports to prepare() and make the nltk downloads quiet (see my comments).
Also consider opening a new PR with only your changes, since .secrets.baseline seems to be affected by a wrong merge of main.
src/unitxt/metrics.py
Outdated
import nltk
from rouge_score import rouge_scorer

nltk.download("punkt")
self.sent_tokenize = nltk.sent_tokenize
move to prepare
You need to save it under self.scorer and then use it in compute().
Hi @elronbandel , Done.
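The pattern requested in this review thread (import inside prepare(), keep the scorer on self, use it from compute()) can be sketched as follows. This is a hypothetical stand-alone class, not the unitxt Rouge implementation: `ToyOverlapMetric` is an invented name, and `difflib` merely stands in for a heavy third-party dependency such as `rouge_score`.

```python
class ToyOverlapMetric:
    """Hypothetical metric illustrating the prepare()/compute() split."""

    def __init__(self):
        self.prepare_calls = 0
        self.prepare()

    def prepare(self):
        # Import here rather than inside compute(): the import, and any
        # one-off setup such as resource downloads, then runs once per
        # metric object instead of once per scored instance.
        import difflib  # stand-in for a heavy third-party dependency

        self.prepare_calls += 1
        # Keep whatever compute() needs on self.
        self.scorer = difflib.SequenceMatcher

    def compute(self, references, prediction):
        """Score one prediction against its references; best match wins."""
        pred_tokens = prediction.split()
        ratios = [
            self.scorer(None, ref.split(), pred_tokens).ratio()
            for ref in references
        ]
        return {"score": max(ratios)}


metric = ToyOverlapMetric()
results = [
    metric.compute(["the cat sat"], "the cat sat"),
    metric.compute(["a b c d"], "a b x y"),
]
print(results, metric.prepare_calls)
```

Because compute() is called once per instance in an instance metric, anything left inside it is repeated for every instance, which is why the setup belongs in prepare().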
@elronbandel @dafnapension - can we merge these?
Force-pushed from 9357b60 to d8e264c.
Force-pushed from 9c0f87b to 551cd76.
…ion (#1011)
* Remove confidence interval calculation for meteor metric by default; added a new metric with interval calculations. Signed-off-by: Yoav Katz <katz@il.ibm.com>
* Added error message when metrics not a list. Signed-off-by: Yoav Katz <katz@il.ibm.com>
* Added error message when post processors are not a list. Signed-off-by: Yoav Katz <katz@il.ibm.com>
* Changed Rouge to be HuggingfaceBulkMetric to avoid recalculation of metric on every resample. Signed-off-by: Yoav Katz <katz@il.ibm.com>
* added meteor as an HuggingFaceInstanceMetric. Signed-off-by: dafnapension <dafnashein@yahoo.com>
* removed meteor_with_confidence_intervals.json. Signed-off-by: dafnapension <dafnashein@yahoo.com>
* fixed test_metric_utils.py by better concentrating on rougeL only. Signed-off-by: dafnapension <dafnashein@yahoo.com>
* comment about rounded floats in tested scores. Signed-off-by: dafnapension <dafnashein@yahoo.com>
* while generating metric meteor, compare against HF implementation. Signed-off-by: dafnapension <dafnashein@yahoo.com>
* added a test comparing new Rouge with HF Rouge, and per arielge's good advice, changed bootstrap method to percentile in case of 100 or more instances. Signed-off-by: dafnapension <dafnashein@yahoo.com>
* implemented Meteor and Rouge with inhouse code. Signed-off-by: dafnapension <dafnashein@yahoo.com>
* download quietly, and import in prepare. Signed-off-by: dafnapension <dafnashein@yahoo.com>
* trying to avoid .secrets.baseline. Signed-off-by: dafnapension <dafnashein@yahoo.com>
* secret.baseline how do I get rid of it? Signed-off-by: dafnapension <dafnashein@yahoo.com>
---------
Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: dafnapension <dafnashein@yahoo.com>
Co-authored-by: dafnapension <dafnashein@yahoo.com>
Co-authored-by: Elron Bandel <elronbandel@gmail.com>
…ion (#1011)

* Remove confidence interval calculation for meteor metric by default
  added a new metric with interval calculations
  Signed-off-by: Yoav Katz <katz@il.ibm.com>
* Added error message when metrics not a list
  Signed-off-by: Yoav Katz <katz@il.ibm.com>
* Added error message when post processors are not a list
  Signed-off-by: Yoav Katz <katz@il.ibm.com>
* Changed Rouge to be HuggingfaceBulkMetric to avoid recalculation of metric on every resample
  Signed-off-by: Yoav Katz <katz@il.ibm.com>
* added meteor as an HuggingFaceInstanceMetric
  Signed-off-by: dafnapension <dafnashein@yahoo.com>
* removed meteor_with_confidence_intervals.json
  Signed-off-by: dafnapension <dafnashein@yahoo.com>
* fixed test_metric_utils.py by better concentrating on rougeL only
  Signed-off-by: dafnapension <dafnashein@yahoo.com>
* comment about rounded floats in tested scores
  Signed-off-by: dafnapension <dafnashein@yahoo.com>
* while generating metric meteor, compare against HF implementation
  Signed-off-by: dafnapension <dafnashein@yahoo.com>
* added a test comparing new Rouge with HF Rouge, and per arielge's good advice, changed bootstrap method to percentile in case of 100 or more instances
  Signed-off-by: dafnapension <dafnashein@yahoo.com>
* implemented Meteor and Rouge with in-house code
  Signed-off-by: dafnapension <dafnashein@yahoo.com>
* download quietly, and import in prepare
  Signed-off-by: dafnapension <dafnashein@yahoo.com>
* trying to avoid .secrets.baseline
  Signed-off-by: dafnapension <dafnashein@yahoo.com>
* secret.baseline how do I get rid of it?
  Signed-off-by: dafnapension <dafnashein@yahoo.com>

---------

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: dafnapension <dafnashein@yahoo.com>
Co-authored-by: dafnapension <dafnashein@yahoo.com>
Co-authored-by: Elron Bandel <elronbandel@gmail.com>
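The idea behind the commits above (score each instance once, then resample the cached scores instead of re-running the metric on every resample) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the unitxt implementation: `token_f1` is a hypothetical toy stand-in for Rouge or Meteor, and `percentile_ci` is a hypothetical helper showing a plain percentile bootstrap over per-instance scores.

```python
import random
from statistics import mean


def token_f1(prediction: str, reference: str) -> float:
    """Toy per-instance score (unigram-overlap F1); a stand-in, not Rouge or Meteor."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    remaining = list(ref_tokens)
    overlap = 0
    for token in pred_tokens:
        if token in remaining:
            remaining.remove(token)  # each reference token can match only once
            overlap += 1
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def percentile_ci(instance_scores, n_resamples=1000, confidence=0.95, seed=0):
    """Percentile bootstrap over cached per-instance scores (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(instance_scores)
    # Each resample only re-averages numbers; the metric itself is never re-run.
    means = sorted(
        mean(rng.choices(instance_scores, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    low = means[int(alpha * n_resamples)]
    high = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return low, high


predictions = ["the cat sat on the mat", "a dog barked", "hello world"]
references = ["the cat is on the mat", "the dog barked loudly", "hello there world"]

# The (potentially expensive) scoring happens exactly once per instance.
scores = [token_f1(p, r) for p, r in zip(predictions, references)]
score = mean(scores)
ci_low, ci_high = percentile_ci(scores)
```

With a global metric, each of the `n_resamples` iterations would have to recompute the metric over the resampled predictions and references; with per-instance scores, a resample costs only an average over cached floats.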
added a new metric with interval calculations