
Refactor Rouge and Meteor to InstanceMetric for faster score computation #1011

Conversation

yoavkatz
Member

@yoavkatz yoavkatz commented Jul 9, 2024

added a new metric with interval calculations

@yoavkatz yoavkatz requested a review from eladven July 9, 2024 09:23
@OfirArviv
Collaborator

@yoavkatz maybe change it in the class itself, not only in the catalog? That way, people who fetch the metric directly (as I did) still get the default.

@yoavkatz
Member Author

yoavkatz commented Jul 9, 2024

> @yoavkatz maybe change it in the class itself? Not only in the catalog? That way if people fetch the metric as I did they still have the default

Meteor (unlike Rouge) does not have its own class; it is only an instantiation of the HuggingFace metric.

I think the reason Rouge is slow is that it does bootstrapping internally, in its own code:

https://huggingface.co/spaces/evaluate-metric/rouge/blob/main/rouge.py#L134

I think the solution is to avoid using that aggregation.

@dafnapension
Collaborator

dafnapension commented Jul 10, 2024

@eladven , @yoavkatz , I think the problem with unitxt's use of HF Rouge and Meteor (and others) is making them Global, while also leaving the GlobalMetric default arg process_single_instances = True in place.
In that case HF, which itself invokes heavy language models, is invoked first for each and every instance (in the first part of GlobalMetric.process()), and then for the whole stream (in the second part of that process()). When HF is invoked for the whole stream, it again processes each and every instance.
Now, if this happens with CI, the processing of each and every instance (which involves a heavy language model) repeats 2 * number_of_resamples times.
If these HF metrics are incarnated as InstanceMetric (and there is no reason why not), then each instance is processed just once! The CI computation reuses existing scores rather than computing them again, and with an InstanceMetric those scores will exist.

I think it is better to fix that issue than to flood the catalog with yes/no-CI variants of cards.
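A dependency-free sketch of the cost difference being described (illustrative names only, not the actual unitxt classes): a GlobalMetric-style CI re-invokes the heavy scorer on every resample, while an InstanceMetric-style CI scores each instance once and resamples the cached floats.

```python
import random

predictions = [f"pred {i}" for i in range(10)]
references = [f"ref {i}" for i in range(10)]

calls = {"n": 0}

def instance_score(pred, ref):
    # Stand-in for one heavy HF evaluation of a single instance.
    calls["n"] += 1
    return float(pred.split()[-1] == ref.split()[-1])

def global_metric_ci(preds, refs, n_resamples=5):
    # GlobalMetric style: the metric is recomputed from the raw texts
    # on every bootstrap resample.
    for _ in range(n_resamples):
        idx = random.choices(range(len(preds)), k=len(preds))
        sum(instance_score(preds[i], refs[i]) for i in idx) / len(idx)

def instance_metric_ci(preds, refs, n_resamples=5):
    # InstanceMetric style: per-instance scores are computed once;
    # resampling only reshuffles the cached floats.
    scores = [instance_score(p, r) for p, r in zip(preds, refs)]
    for _ in range(n_resamples):
        idx = random.choices(range(len(scores)), k=len(scores))
        sum(scores[i] for i in idx) / len(idx)

calls["n"] = 0
global_metric_ci(predictions, references)
global_calls = calls["n"]    # n_resamples * n_instances heavy evaluations

calls["n"] = 0
instance_metric_ci(predictions, references)
instance_calls = calls["n"]  # n_instances heavy evaluations, once
```

With 10 instances and 5 resamples the global route makes 50 heavy calls versus 10; with the real resample counts the gap is far larger.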

Collaborator

@dafnapension dafnapension left a comment

Please see my comment before you merge. I think it is not a good idea to run away from CI when it is not necessary.

@yoavkatz
Member Author

> @eladven , @yoavkatz , I think that the problem with the use of unitxt of HF Rouge and Meteor (and others) is in making them Global, and even leaving the default arg of GlobalMetric: process_single_instances = True in place. In such a case, HF, that by itself invokes heavy language models, is invoked first for each and every instance (from the beginning of GlobalMetric.process()), and then for the whole stream (from the second part of that process()). When HF is invoked for the whole stream, it again processes each and every instance. Now if this happens with CI - the processing of each and every instance (which is heavy language model) repeats 2 * number of resamples. If these HF metric are incarnated as InstanceMetric (and there is no reason why not), then each instance would have been processed just once! the CI uses existing scores, if there are, it does not compute them again, and there will be if so incarnated as InstanceMetric.
>
> I think it is better to fix that issue and not flood the catalog with cards of yes/no CI.

I agree. I changed the code of Rouge to be a HuggingfaceBulkMetric, and it solved the issue for Rouge.

Please review and let me know what you think.

Regarding Meteor, can you make the corresponding change?

@yoavkatz yoavkatz changed the title Remove confidence interval calculation for meteor metric by default Move Rouge and Meteor to be Bulk metrics to reduce runtime in confidence interval calculations Jul 10, 2024
@dafnapension
Collaborator

yes, coming up

@dafnapension
Collaborator

Hi @yoavkatz , I added Meteor as a HuggingFaceInstance metric, to see how this works, with CI.
Also, I thought about Rouge: if it is bulk, then each bulk gets the global score as the average of the bulk, not the average of the whole stream. Is that OK?
So I changed Rouge to inherit from this new HuggingFaceInstance. Nothing else changed.
I also fixed a small bug in test_metric_utils, and things seem to be fine now, here, on my little laptop.
I hope that this new incarnation will also survive out in the wild fields of fm_eval.

@yoavkatz
Member Author

> Hi @yoavkatz , I added Meteor as an HuggingFaceInstance, to see how this works. With ci. Also, I thought about rouge: if it is bulk, then each bulk gets the global score as the average of the bulk, not the average of the whole stream. Is that OK? So I changed Rouge to inherit from this new HuggingFaceInstance. Nothing else changed. I also fixed a small bug in test_metric_utils, and things seem to be fine now, here, on my little laptop. I hope that this new incarnation will survive also out in the wild fields of fm_eval..

In general, BulkInstanceMetric does not perform aggregation and assumes the results are returned per instance. It is just a matter of optimization; for example, in LLM-based metrics we send a batch of requests at a time instead of one by one, which is much slower.

In Rouge, I removed use_aggregator=True so the metric returns a rouge score for each instance, which is then averaged by the generic BulkInstanceMetric code.
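A toy sketch of the flow described here (the scorer and field names are cheap stand-ins, not HF's actual Rouge): the metric is called once on the whole bulk and returns one score per instance per field; the generic code then splits the lists back into instance scores and averages them for the global score.

```python
# Hypothetical bulk computation in the style of an HF metric called with
# use_aggregator=False: one score per instance for each score field.
def bulk_compute(predictions, references):
    def f1(pred, ref):
        # Trivial unigram-overlap F1 as a stand-in for real Rouge scoring.
        p, r = set(pred.split()), set(ref.split())
        if not p or not r:
            return 0.0
        overlap = len(p & r)
        prec, rec = overlap / len(p), overlap / len(r)
        return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

    return {"rougeL": [f1(p, r) for p, r in zip(predictions, references)]}

preds = ["the cat sat", "a dog ran"]
refs = ["the cat sat", "a dog walked"]

scores = bulk_compute(preds, refs)
# Generic BulkInstanceMetric step: split the list score back into instances...
instance_scores = [{"rougeL": v} for v in scores["rougeL"]]
# ...and report the global score as their simple average.
global_score = sum(scores["rougeL"]) / len(scores["rougeL"])
```

Because the per-instance scores survive, a later CI computation can resample them without re-running the scorer.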

@yoavkatz
Member Author

@dafnapension - it would be good if we added a unit test that checks that if we wrap a metric (like Rouge) in HuggingfaceInstanceMetric or HuggingfaceBulkInstanceMetric, we get the same results.
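Such a test could look roughly like this (with trivial stand-in scorers in place of the real wrappers, which are not reproduced here): the two wrappings of the same underlying metric must agree on the aggregated score.

```python
import math

# Hypothetical stand-ins for the two wrappings of one underlying metric:
# one scores instance by instance, the other scores the whole bulk at once.
def exact_match(pred, ref):
    return float(pred == ref)

def instance_wrapped(preds, refs):
    # Instance style: score each instance as it streams by, then average.
    return sum(exact_match(p, r) for p, r in zip(preds, refs)) / len(preds)

def bulk_wrapped(preds, refs):
    # Bulk style: score the whole batch in one call, then average.
    scores = [exact_match(p, r) for p, r in zip(preds, refs)]
    return sum(scores) / len(scores)

def test_wrappers_agree():
    preds = ["a", "b", "c", "d"]
    refs = ["a", "x", "c", "y"]
    assert math.isclose(
        instance_wrapped(preds, refs), bulk_wrapped(preds, refs), rel_tol=1e-9
    )

test_wrappers_agree()
```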

@dafnapension dafnapension force-pushed the Remove-confidence-interval-caluculation-for-meteor-metric-by-default branch from bbaf1ca to a63edfc Compare July 11, 2024 10:04
@dafnapension
Collaborator

A good idea, coming up!

@dafnapension dafnapension force-pushed the Remove-confidence-interval-caluculation-for-meteor-metric-by-default branch from a63edfc to 8a3ad1d Compare July 11, 2024 13:22
@dafnapension
Collaborator

Added a comparison of Meteor to its previous global HF implementation, upon preparation of the metric (in Meteor.py).
Now looking into Rouge.

@dafnapension
Collaborator

dafnapension commented Jul 11, 2024

Hi @yoavkatz , I added a test comparing the old and new Rouge, to show they return the same result.
Also, per good advice from @arielge and @assaftibm , in the case of more than 100 instances, I changed the bootstrap method from "BCa" (the default) to "percentile".

Last: please note that for Rouge, if HF's use_aggregator = False, then HF returns vectors of numeric results (rather than simple floats), with which we cannot do CI. So I made the comparison above with use_aggregator = True in order to allow CI. The bootstrapping that HF does on its own initiative (when use_aggregator = True) does not change the result much, compared to a simple average over the individual instance scores.

@dafnapension dafnapension force-pushed the Remove-confidence-interval-caluculation-for-meteor-metric-by-default branch 2 times, most recently from 07f2ffd to 6040057 Compare July 15, 2024 13:20
@@ -327,6 +327,11 @@ def score_based_confidence_interval(
# otherwise, the aggregation_func needs to be applied AFTER resampling the instances;
# that is, re-form the groups, calculate the function, and take the mean of the group scores
aggregation_func = self.average_item_scores

method = "BCa"
if len(instances) > 100:
Member Author

What is the runtime and accuracy difference between BCa and percentile?

Collaborator

I do not really know; I took the advice of @arielge and @assaftibm , here:
#1008 (comment)
and
#1008 (comment)

Collaborator

I found information about percentile and BCa here.
From my understanding (@assaftibm , @arielge , please correct my mistakes):
'percentile' works by the following intuitive notion: make n_resamples resamples of the population [the stream fed to the metric]. Each resample is of the size of the population and is built by repeating the following process: independently and identically distributed, select an item [instance] from the population, with replacement [returning the selected item back to the population].
Then, given one resample, compute the statistic (the metric) on it, the same way the metric is computed on the original population.
Then, given n_resamples scores (one per resample), return the low and high quantiles (e.g. 0.05 and 0.95) as the CI borders.

BCa continues from the above n_resamples scores, correcting for skewness and bias observed when comparing them to the original population.
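The 'percentile' procedure described above can be written out directly (pure Python, with the mean as the statistic; a 95% interval uses the 0.025 and 0.975 quantiles of the resampled statistics):

```python
import random

random.seed(0)  # deterministic resampling for the sketch

def percentile_ci(scores, n_resamples=1000, alpha=0.05):
    # Resample the population with replacement, recompute the statistic
    # (here the mean) on each resample, and take the alpha/2 and
    # 1 - alpha/2 quantiles of the resampled statistics as the CI borders.
    n = len(scores)
    stats = sorted(
        sum(random.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

scores = [0.2, 0.4, 0.5, 0.5, 0.6, 0.7, 0.8, 0.9]
low, high = percentile_ci(scores)
```

BCa then adjusts which quantiles of the same resampled statistics are reported, to correct for bias and skewness; the resampling cost is essentially identical.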

Member Author

I'm not sure it matters, but I saw somewhere that BCa is the one that is commonly used. If the difference in runtime is small (I would think the main cost is recalculating the metric, not BCa vs. percentile), then I would keep it simple and not change behavior at some threshold.

So - can you try running twice (with BCa and with percentile) on, say, 200 instances, and see the difference in runtime?

Member

Hi @dafnapension!

Yes - I think your understanding of Percentile bootstrap is correct.

I would suggest the following:

  1. When it's an InstanceMetric, use BCa bootstrap.
  2. When it's a GlobalMetric, use Basic bootstrap (or Percentile, see discussion below).

In 2, I wouldn't condition the type of bootstrap on the size of the dataset and switch between BCa and another, lighter bootstrap, because it can confuse people who test a full dataset vs. smaller samples of it - they would suddenly get different CIs due to a switch in the bootstrap type. I think it's better to stick to one for consistency.

Regarding Basic vs. Percentile bootstrap.
There is a description here. I have to admit that I don't fully understand the computation of the Basic bootstrap. However, note that the Scipy documentation includes this comment:

> While the 'percentile' method is the most intuitive, it is rarely used in practice. Two more common methods are available, 'basic' (‘reverse percentile’) and 'BCa' (‘bias-corrected and accelerated’); they differ in how step 3 is performed.

So this is the main reason why I would lean towards Basic over Percentile. It would be easier if we could stick to BCa also in GlobalMetrics but unfortunately that doesn't seem practical.

As a side note, I think it makes sense to ask about the slowdown with BCa on StackOverflow - showing a short self-contained sample code that reproduces the problem using Scipy's implementation and asking if this is as expected. It's possible that the implementation in Scipy is not optimal.
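A minimal sketch of such a self-contained comparison, assuming SciPy's scipy.stats.bootstrap (the implementation the diff above relies on); the random scores are stand-ins for 200 per-instance metric scores:

```python
import time
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
scores = rng.random(200)  # stand-in for 200 per-instance metric scores

cis, times = {}, {}
for method in ("percentile", "basic", "BCa"):
    start = time.perf_counter()
    res = stats.bootstrap(
        (scores,),
        np.mean,
        n_resamples=1000,
        confidence_level=0.95,
        method=method,
        random_state=np.random.default_rng(1),
    )
    times[method] = time.perf_counter() - start
    cis[method] = (res.confidence_interval.low, res.confidence_interval.high)
```

Comparing the entries of `times` and `cis` across the three methods gives exactly the runtime-vs.-interval trade-off discussed here; for a cheap statistic like the mean, the differences are expected to be small.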

Collaborator

@dafnapension dafnapension Jul 18, 2024

In all my experiments, for lack of real data, I used prediction = target (the first reference). I think this only changes the constant time of computing each instance score, so I allowed myself.
In my humble opinion, with the numbers of instances and resamples that I see in the cards, we can stick to BCa, said to be the most commonly used. Just remember to check again if we grow dramatically.

Member Author

Thanks for the detailed analysis. BCa seems to increase the time about 10%-20% in instance metrics, but more than doubles the time in global metrics.

Member Author

So do we agree to keep BCa always? I don't think we have many more global metrics that take a long time - and it's the simplest approach.

Collaborator

I, too, think so. Leave the change for future consideration, if and when needed.
@elronbandel also asked that I simply copy the code we find in HF into an in-house instance metric, for a further speedup.

Member

I think the problem will appear with more complex global metrics like CorpusBLEU, but I agree it can be dealt with in a separate PR.

@dafnapension dafnapension force-pushed the Remove-confidence-interval-caluculation-for-meteor-metric-by-default branch from 6040057 to 2381c4e Compare July 15, 2024 14:49
@dafnapension
Copy link
Collaborator

Hi @yoavkatz , looking at these lines in HuggingfaceBulkMetric:

results = [{} for _ in range(len(scores[self.hf_metric_fields[0]]))]

it seems like you assume that every score that is a list has one component per instance. That is true for Rouge, but not for BLEU, for example, where the score named precisions is an array whose length equals the input argument max_order.
I think this breakdown of list scores into instance scores should be done at the individual-metric level, not at the BulkMetric level.

@yoavkatz
Member Author

> Hi @yoavkatz , looking at these lines in HuggingfaceBulkMetric:
>
> results = [{} for _ in range(len(scores[self.hf_metric_fields[0]]))]
>
> it seems like you assume that every score that is a list, means one component per instance. It is true for Rouge, but not so in bleu, for example, where a score named precisions is an array whose length equals the input argument max_order.
> I think that this breakdown of list scores to instance score should be done in the individual metric level, and not in the BulkMetric level.

I suggest we complete this PR on Rouge and Meteor, and then think about BLEU and generalizing this.

@yoavkatz
Member Author

Note that there is HuggingfaceInstanceMetric, which is used by Rouge and Meteor, and HuggingfaceBulkMetric, which is only used by BertScore, as far as I understand.

@dafnapension
Collaborator

Hi @yoavkatz , indeed, I was referring to a generalization (for a future PR), trying to address your comment that, performance-wise, it is better to approach HF in batches when possible.

@dafnapension dafnapension force-pushed the Remove-confidence-interval-caluculation-for-meteor-metric-by-default branch 5 times, most recently from 039f3b4 to 9357b60 Compare July 21, 2024 19:53
Comment on lines 1350 to 1351
nltk.download("wordnet")
nltk.download("omw-1.4")
Member

Suggested change
nltk.download("wordnet")
nltk.download("omw-1.4")
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

Comment on lines 1364 to 1365
from nltk import word_tokenize
from nltk.translate import meteor_score
Member

move to prepare() so the import is performed only once

Collaborator

@dafnapension dafnapension Jul 22, 2024

thanks, @elronbandel , at last I learned (from you, as always) how to avoid those "already up to date" messages running across my screen.

Member

@elronbandel elronbandel left a comment

@dafnapension looks great. Only need to move the imports to prepare() and make the nltk downloads quiet (see my comments).

Also consider making a new PR with only your changes, since there is this .secrets.baseline that seems to be affected by a wrong merge of main.

Comment on lines 1780 to 1784
import nltk
from rouge_score import rouge_scorer

nltk.download("punkt")
self.sent_tokenize = nltk.sent_tokenize
Member

move to prepare

Member

You need to save it under self.scorer and then use it in compute().
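The requested pattern, sketched with a hypothetical class (a cheap stand-in replaces the real rouge_scorer so the sketch has no dependencies): expensive setup runs once in prepare() and is stored on self, and compute() only reuses the cached object.

```python
# Illustrative sketch of the prepare()/compute() split; names are hypothetical.
class SketchRouge:
    def prepare(self):
        # In the real metric this would be roughly:
        #   from rouge_score import rouge_scorer
        #   self.scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
        # Here a trivial stand-in keeps the sketch dependency-free.
        self.scorer = lambda pred, ref: float(pred == ref)
        self.prepare_calls = getattr(self, "prepare_calls", 0) + 1

    def compute(self, prediction, reference):
        # No imports or downloads here: just reuse the prepared scorer.
        return self.scorer(prediction, reference)

metric = SketchRouge()
metric.prepare()  # setup happens once, no matter how many instances follow
results = [
    metric.compute(p, r)
    for p, r in [("a cat", "a cat"), ("a dog", "a cat")]
]
```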

Collaborator

Hi @elronbandel , Done.

Member Author

@elronbandel @dafnapension - can we merge these?

yoavkatz and others added 11 commits July 22, 2024 10:32
added a new metric with interval calculations

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Yoav Katz <katz@il.ibm.com>
when post processors are not  a list

Signed-off-by: Yoav Katz <katz@il.ibm.com>
to avoid recalculation of metric on every resample

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: dafnapension <dafnashein@yahoo.com>
Signed-off-by: dafnapension <dafnashein@yahoo.com>
Signed-off-by: dafnapension <dafnashein@yahoo.com>
Signed-off-by: dafnapension <dafnashein@yahoo.com>
Signed-off-by: dafnapension <dafnashein@yahoo.com>
… advice, changed bootstrap method to percentile in case of 100 or more instances

Signed-off-by: dafnapension <dafnashein@yahoo.com>
Signed-off-by: dafnapension <dafnashein@yahoo.com>
@dafnapension dafnapension force-pushed the Remove-confidence-interval-caluculation-for-meteor-metric-by-default branch from 9357b60 to d8e264c Compare July 22, 2024 07:32
Signed-off-by: dafnapension <dafnashein@yahoo.com>
Signed-off-by: dafnapension <dafnashein@yahoo.com>
@dafnapension dafnapension force-pushed the Remove-confidence-interval-caluculation-for-meteor-metric-by-default branch from 9c0f87b to 551cd76 Compare July 22, 2024 10:56
Signed-off-by: dafnapension <dafnashein@yahoo.com>
@elronbandel elronbandel enabled auto-merge (squash) July 22, 2024 16:15
@elronbandel elronbandel merged commit 94daea3 into main Jul 22, 2024
8 checks passed
@elronbandel elronbandel deleted the Remove-confidence-interval-caluculation-for-meteor-metric-by-default branch July 22, 2024 16:36
csrajmohan pushed a commit that referenced this pull request Jul 29, 2024
…ion (#1011)

* Remove confidence interval calculation for meteor metric by default

added a new metric with interval calculations

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Added error mesage when metrics not  a list

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Added error mesage

when post processors are not  a list

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Changed Rouge to be HuggingfaceBulkMetric

to avoid recalculation of metric on every resample

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* added meteor as an HuggingFaceInstanceMetric

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* removed meteor_with_confidence_intervals.json

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* fixed test_metric_utils.py by better concentrating on rougeL only

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* comment about rounded floats in tested scores

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* while generating metric meteor, compmare against HF implementation

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* added a test comparing new Rouge with HF Rouge, nd per arielge's good advice, changed bootstrap method to percentile in case of 100 or more instances

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* implemented Meteor and Rouge with inhouse code

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* download quietly, and import in prepare

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* trying to avoid .secrets.baseline

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* secret.baseline how do I get rid of it?

Signed-off-by: dafnapension <dafnashein@yahoo.com>

---------

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: dafnapension <dafnashein@yahoo.com>
Co-authored-by: dafnapension <dafnashein@yahoo.com>
Co-authored-by: Elron Bandel <elronbandel@gmail.com>
benjaminsznajder pushed a commit that referenced this pull request Jul 29, 2024
benjaminsznajder added a commit that referenced this pull request Jul 29, 2024
* Fix bug in data classes and add support for field overriding in fields containing types or functions (#1027)

Fix data classes not support field overriding in fields containing types or functions

Signed-off-by: elronbandel <elron.bandel@ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Added seed to LLM as judges for consistent results (#1029)

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* replace type and __type__ in type error (#1035)

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add rag_end_to_end metrics

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add rag_end_to_end metrics

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Add task rag_end_to_end

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add card for clapnq end_to_end

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add sandbox_benjams

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add subset

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add a reduction of clap_nq

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add a reduction of clap_nq

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* remove constants

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* rename sandbox_benjams to sandbox

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* remove sandbox

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Add string to context id in rag (#1036)

* allow strings (hash) as context id

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>

* save to catalog

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>

---------

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Fixed issues with fresh install (#1037)

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add validation to tldr, remove shuffle from billsum (#1038)

* add validation to tldr, remove shuffle from billsum
(shuffled by the SplitRandomMix)

Signed-off-by: ALON HALFON <ALONHAL@il.ibm.com>

* fix formatting

Signed-off-by: ALON HALFON <ALONHAL@il.ibm.com>

---------

Signed-off-by: ALON HALFON <ALONHAL@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Refactor Rouge and Meteor to InstanceMetric for faster score computation (#1011)

* Remove confidence interval calculation for meteor metric by default

added a new metric with interval calculations

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Added error mesage when metrics not  a list

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Added error mesage

when post processors are not  a list

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Changed Rouge to be HuggingfaceBulkMetric

to avoid recalculation of metric on every resample

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* added meteor as an HuggingFaceInstanceMetric

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* removed meteor_with_confidence_intervals.json

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* fixed test_metric_utils.py by better concentrating on rougeL only

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* comment about rounded floats in tested scores

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* while generating metric meteor, compmare against HF implementation

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* added a test comparing new Rouge with HF Rouge, nd per arielge's good advice, changed bootstrap method to percentile in case of 100 or more instances

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* implemented Meteor and Rouge with inhouse code

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* download quietly, and import in prepare

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* trying to avoid .secrets.baseline

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* secret.baseline how do I get rid of it?

Signed-off-by: dafnapension <dafnashein@yahoo.com>

---------

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: dafnapension <dafnashein@yahoo.com>
Co-authored-by: dafnapension <dafnashein@yahoo.com>
Co-authored-by: Elron Bandel <elronbandel@gmail.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Add CloseTextSampler and FixedIndicesSampler (#1034)

* Add CloseTextSampler

That returns demos that are textually close to the current instance.

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Make sampler call pass  current instance

Added end 2 end test of sampler that depends on output

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Added FixedIndicesSampler(Sampler):

Selects a fix set of samples based on a list of indices from the demo pool

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Made splitter currently use random_generators

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Changed all Sample randomization

To use common code to create randomizer per instance

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Updated demos in test

After a non backward compatible change

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Updated demos in test

After a non backward compatible change

Signed-off-by: Yoav Katz <katz@il.ibm.com>

---------

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* changed input and output of templates to "input_fields" and "reference_ fields" - Non backward compatible (#1030)

* changed input and output of templates

to "input_fields" and "reference_ fields" .

This is to continue  the work done on tasks.

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Fixed type hint

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Documentation update

Signed-off-by: Yoav Katz <katz@il.ibm.com>

---------

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* FinQA - filter problematic examples (#1039)

filter problematic examples

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Arena hard elad2 (#1026)

* bug fixes in PairwiseChoiceTemplate

* add arena hard regex parser operator

* update mt bench card common

* update mt bench card common

* add reward bench

* update metric to pairwise comarison task

* arena hard tasks and cards

* update mt bench template

* add duplicate stream operator

* add PairwiseComparativeRatingTemplate

* add card

* add card

* add template

* add winrate metrics

* add comparative rating task

* add ExtractArenaHardNumericalJudgment

* add arena hard cards

* add arena hard template

* add weighted winrate metrics

* delete file

* update PairwiseComparativeRatingTemplate

* add metric

* add metric

* update

* update

* update

* fix template bug

* update

* llama 3 update

* update

* update

* update jsons

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* fix

* fix

* fix

* update

* update

* update

* bluebench related changes

* fix type issue

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>

* update

* update

* update

* prometheus1

* update

* fix

* fix

* merge with arena_branch

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>

* rebuild catalog

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>

* add debugging to clapnq

* Reproduce all artifacts

* Add missing artifacts to catalog

* Add secrets baseline

Signed-off-by: Elad Venezian <eladv@il.ibm.com>

* Fix bugs with catalog creation

* Remove areana hard examples from tests, since they don't pass

* Add missing metadata to test mock

* Add data_classification_policy and recipe_metadata to the steams tests

* Fix test failures

* Update multi_turn_gpt4_judgement.py

* Update multi_turn_with_reference_gpt4_judgement.py

* Update docs/docs/examples.rst

Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>

* revert catalog consistecy and preperation yml files

* revert catalog consistecy and preperation yml files

* revert catalog consistecy and preperation yml files

* revert catalog consistecy and preperation yml files

* Update docs/docs/examples.rst

Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>

* bug fix in LoadFromHFSpace

* revert

* revert

* update examples

* add coment to expain change

* update to new params usage

* pr fixes

* pr fixes

* update

* update

* update

* update

* update

* update

* Update prepare/templates/rag/response_generation.py

Co-authored-by: Yotam Perlitz <perlitz@gmail.com>

* Update prepare/templates/rag/response_generation.py

Co-authored-by: Yotam Perlitz <perlitz@gmail.com>

* update

* cr fixes

* llmaj format fix

* llmaj format fix

---------

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>
Signed-off-by: Elad Venezian <eladv@il.ibm.com>
Co-authored-by: ofirarviv <ofir.arviv@ibm.com>
Co-authored-by: Yotam Perlitz <yotam.perlitz@ibm.com>
Co-authored-by: michal <shmueli@il.ibm.com>
Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>
Co-authored-by: Yotam Perlitz <perlitz@gmail.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* demo's target prefix is now taken from demo instance (#1031)

* demo's target prefix is now taken from demo instance

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* do not pop fields out of demo instances.
Traditionally done for main instance, but not allowed for demo instance that should serve also other main instances in the stream

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* simplified test-case per @yoavkatz's idea. Eager mode still samples different demos than non-eager mode

Signed-off-by: dafnapension <dafnashein@yahoo.com>

---------

Signed-off-by: dafnapension <dafnashein@yahoo.com>
Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* remove the reduced clap_nq

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* define an empty template for rag end_to_end

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Implement metrics ensemble (#1047)

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>
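The metrics-ensemble idea can be sketched as a weighted average of per-instance metric scores. This is a minimal illustration with made-up names and an equal-weight default, not the unitxt API:

```python
def ensemble_score(metric_fns, prediction, references, weights=None):
    """Combine several per-instance metric scores into one weighted score."""
    scores = [fn(prediction, references) for fn in metric_fns]
    if weights is None:
        weights = [1.0 / len(scores)] * len(scores)  # equal weights by default
    return sum(w * s for w, s in zip(weights, scores))

# Two toy metrics for illustration: exact match and a length-ratio score.
def exact(pred, refs):
    return float(pred in refs)

def length_ratio(pred, refs):
    return min(len(pred), len(refs[0])) / max(len(pred), len(refs[0]))

score = ensemble_score([exact, length_ratio], "hello", ["hello"])
```

Any callable returning a float per instance can be plugged in; the actual ensemble in the repository may weight and combine metrics differently.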

* add load_json_predictions as processor in the template

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add the processors/load_json_predictions.json generated to the catalog

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Add flores101 (#1053)

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Added example for selection of demos (#1052)

* Added example for selection of demos

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Added example doc

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Update docs/docs/examples.rst

* Update docs/docs/examples.rst

---------

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* fix - building test is not working. The reason is that opendatasets depends on kaggle without pinning a version, and kaggle-1.6.15 currently fails. We pin kaggle to version 1.6.14 as a fix

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add overwrite

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Update introduction.rst - - copy edits (grammar, consistency, clarity) (#1063)

Signed-off-by: welisheva22 <welisheva22@gmail.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Fix typo in japanese_llama system prompt (issue #964) (#1056)

Signed-off-by: Jonathan Bnayahu <bnayahu@il.ibm.com>
Co-authored-by: Elron Bandel <elronbandel@gmail.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Allow assigning None in overwrites when fetching artifacts with modifications (#1062)

allow =None in overwrites for fetch

Signed-off-by: dafnapension <dafnashein@yahoo.com>
Co-authored-by: Elron Bandel <elronbandel@gmail.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Make sure preparation times printed fully and nicely (#1046)

Signed-off-by: elronbandel <elron.bandel@ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* numeric nlg - template changes (#1041)

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add judge input to the metric (#1064)

* add judge input to the metric

* add judge input to the metric

* fix

* fix test

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Unitxt capitalization adding_dataset.rst (#1057)

making Unitxt capitalization consistent in text

Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* fixed the score_ci inconsistency issue (#1065)

* suggested fix for score_ci inconsistency issue

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* unify with the update, and thus simplified the check

Signed-off-by: dafnapension <dafnashein@yahoo.com>
---------

Signed-off-by: dafnapension <dafnashein@yahoo.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Use of conventional python types in input definition of tasks and metrics (#1045)

* Fix data classes not supporting field overriding in fields containing types or functions

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Make tasks types python types

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Fix errors

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Some fixes

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* More fixes

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Update catalog

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Fix cards

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Revert change

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Fix typing in docs with new convention

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* refactor of new asset to new convention

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Update secrets baseline

Signed-off-by: elronbandel <elron.bandel@ibm.com>

---------

Signed-off-by: elronbandel <elron.bandel@ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Added prediction type to llm as judge to avoid warning (#1072)

* Added prediction type to llm as judge to avoid warning

Clarified the standalone llm as judge example

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Removed accidentally added file

Signed-off-by: Yoav Katz <katz@il.ibm.com>

---------

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Fixed clapnq to check with reasonable error values

Also updated rag tasks to use new typing (instead of string types)

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* fix the type hint

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* update catalog

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Add metric "metrics.rag.retrieval_at_k" to catalog (#1074)

* add metric "metrics.rag.retrieval_at_k" to catalog
this is a wrapper around the retrieval_at_k for the ragas scheme

* add corresponding json file for the new metric

---------

Co-authored-by: Elron Bandel <elronbandel@gmail.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* merge - resolve conflict

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

---------

Signed-off-by: elronbandel <elron.bandel@ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>
Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>
Signed-off-by: ALON HALFON <ALONHAL@il.ibm.com>
Signed-off-by: dafnapension <dafnashein@yahoo.com>
Signed-off-by: Elad Venezian <eladv@il.ibm.com>
Signed-off-by: welisheva22 <welisheva22@gmail.com>
Signed-off-by: Jonathan Bnayahu <bnayahu@il.ibm.com>
Co-authored-by: Elron Bandel <elronbandel@gmail.com>
Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>
Co-authored-by: Yotam Perlitz <perlitz@gmail.com>
Co-authored-by: Benjamin Sznajder <benjams@il.ibm.com>
Co-authored-by: Alon H <alonh@users.noreply.github.com>
Co-authored-by: dafnapension <dafnashein@yahoo.com>
Co-authored-by: ShirApp <58909189+ShirApp@users.noreply.github.com>
Co-authored-by: Elad <eladv@il.ibm.com>
Co-authored-by: ofirarviv <ofir.arviv@ibm.com>
Co-authored-by: Yotam Perlitz <yotam.perlitz@ibm.com>
Co-authored-by: michal <shmueli@il.ibm.com>
Co-authored-by: dafnapension <46454972+dafnapension@users.noreply.github.com>
Co-authored-by: welisheva22 <welisheva22@gmail.com>
Co-authored-by: Jonathan Bnayahu <bnayahu@il.ibm.com>
Co-authored-by: hanansinger <95229126+hanansinger@users.noreply.github.com>
Co-authored-by: Yoav Katz <katz@il.ibm.com>
Co-authored-by: matanor <55045955+matanor@users.noreply.github.com>
csrajmohan pushed a commit that referenced this pull request Aug 29, 2024
Refactor Rouge and Meteor to InstanceMetric for faster score computation (#1011)

* Remove confidence interval calculation for meteor metric by default

added a new metric with interval calculations

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Added error message when metrics are not a list

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Added error message

when post processors are not a list

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Changed Rouge to be HuggingfaceBulkMetric

to avoid recalculation of metric on every resample

Signed-off-by: Yoav Katz <katz@il.ibm.com>
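The speedup this refactor targets can be sketched in a few lines: an instance-level metric computes each per-instance score once, and the confidence interval is then bootstrapped by resampling only those cached scores, never recomputing the metric. A minimal stdlib sketch with illustrative names, not the unitxt API:

```python
import random
import statistics

def percentile_bootstrap_ci(instance_scores, n_resamples=1000, alpha=0.05, seed=7):
    """Percentile-bootstrap CI over cached per-instance scores.

    The expensive metric is evaluated once per instance elsewhere;
    here we only resample the resulting floats.
    """
    rng = random.Random(seed)
    n = len(instance_scores)
    means = sorted(
        statistics.fmean(rng.choices(instance_scores, k=n))
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]          # 2.5th percentile
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]  # 97.5th percentile
    return statistics.fmean(instance_scores), (lo, hi)

mean, (lo, hi) = percentile_bootstrap_ci([0.2, 0.4, 0.5, 0.7, 0.9])
```

Contrast this with a global metric that re-invokes the underlying scorer on every resample: with 1000 resamples, the scorer runs 1000x more often than needed.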

* added meteor as a HuggingFaceInstanceMetric

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* removed meteor_with_confidence_intervals.json

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* fixed test_metric_utils.py by focusing on rougeL only

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* comment about rounded floats in tested scores

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* while generating metric meteor, compare against HF implementation

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* added a test comparing new Rouge with HF Rouge, and per arielge's good advice, changed the bootstrap method to percentile in case of 100 or more instances

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* implemented Meteor and Rouge with inhouse code

Signed-off-by: dafnapension <dafnashein@yahoo.com>
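To illustrate what an in-house instance-level implementation looks like, here is a minimal ROUGE-L F-measure for a single prediction/reference pair. This is a simplified sketch assuming whitespace tokenization; the actual unitxt code differs in tokenization and multi-reference handling:

```python
def lcs_length(a, b):
    """Dynamic-programming longest-common-subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(prediction, reference):
    """ROUGE-L F-measure over whitespace-tokenized strings."""
    pred, ref = prediction.split(), reference.split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Because the score is a pure per-instance function, it slots naturally into an InstanceMetric: compute once per instance, then bootstrap over the cached floats.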

* download quietly, and import in prepare

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* trying to avoid .secrets.baseline

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* secret.baseline how do I get rid of it?

Signed-off-by: dafnapension <dafnashein@yahoo.com>

---------

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: dafnapension <dafnashein@yahoo.com>
Co-authored-by: dafnapension <dafnashein@yahoo.com>
Co-authored-by: Elron Bandel <elronbandel@gmail.com>
csrajmohan pushed a commit that referenced this pull request Aug 29, 2024
* Fix bug in data classes and add support for field overriding in fields containing types or functions (#1027)

Fix data classes not supporting field overriding in fields containing types or functions

Signed-off-by: elronbandel <elron.bandel@ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Added seed to LLM as judges for consistent results (#1029)

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* replace type and __type__ in type error (#1035)

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add rag_end_to_end metrics

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add rag_end_to_end metrics

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Add task rag_end_to_end

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add card for clapnq end_to_end

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add sandbox_benjams

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add subset

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add a reduction of clap_nq

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add a reduction of clap_nq

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* remove constants

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* rename sandbox_benjams to sandbox

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* remove sandbox

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Add string to context id in rag (#1036)

* allow strings (hash) as context id

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>

* save to catalog

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>

---------

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Fixed issues with fresh install (#1037)

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add validation to tldr, remove shuffle from billsum (#1038)

* add validation to tldr, remove shuffle from billsum
(shuffled by the SplitRandomMix)

Signed-off-by: ALON HALFON <ALONHAL@il.ibm.com>

* fix formatting

Signed-off-by: ALON HALFON <ALONHAL@il.ibm.com>

---------

Signed-off-by: ALON HALFON <ALONHAL@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Add CloseTextSampler and FixedIndicesSampler (#1034)

* Add CloseTextSampler

That returns demos that are textually close to the current instance.

Signed-off-by: Yoav Katz <katz@il.ibm.com>
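The sampler's behavior can be sketched as ranking the demo pool by textual similarity to the current instance and keeping the top k. Jaccard word overlap and the field names below are illustrative choices, not necessarily what CloseTextSampler uses:

```python
def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def sample_close_demos(demo_pool, instance_text, k):
    """Pick the k demos whose text is most similar to the current instance."""
    ranked = sorted(
        demo_pool,
        key=lambda d: token_overlap(d["text"], instance_text),
        reverse=True,
    )
    return ranked[:k]

pool = [
    {"text": "translate the sentence"},
    {"text": "sum the numbers"},
    {"text": "translate this sentence now"},
]
demos = sample_close_demos(pool, "translate a sentence", k=2)
```

Note that, unlike a fixed random sampler, the selection depends on the instance itself, which is why the sampler call needs access to the current instance (the change described in the next commit).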

* Make sampler call pass current instance

Added end-to-end test of sampler that depends on output

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Added FixedIndicesSampler(Sampler):

Selects a fixed set of samples based on a list of indices from the demo pool

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Made splitter use random_generators

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Changed all Sample randomization

To use common code to create randomizer per instance

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Updated demos in test

After a non backward compatible change

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Updated demos in test

After a non backward compatible change

Signed-off-by: Yoav Katz <katz@il.ibm.com>

---------

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* changed input and output of templates to "input_fields" and "reference_fields" - Non backward compatible (#1030)

* changed input and output of templates

to "input_fields" and "reference_fields".

This is to continue the work done on tasks.

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Fixed type hint

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Documentation update

Signed-off-by: Yoav Katz <katz@il.ibm.com>

---------

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* FinQA - filter problematic examples (#1039)

filter problematic examples

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Arena hard elad2 (#1026)

* bug fixes in PairwiseChoiceTemplate

* add arena hard regex parser operator

* update mt bench card common

* update mt bench card common

* add reward bench

* update metric to pairwise comarison task

* arena hard tasks and cards

* update mt bench template

* add duplicate stream operator

* add PairwiseComparativeRatingTemplate

* add card

* add card

* add template

* add winrate metrics

* add comparative rating task

* add ExtractArenaHardNumericalJudgment

* add arena hard cards

* add arena hard template

* add weighted winrate metrics

* delete file

* update PairwiseComparativeRatingTemplate
