
Refactor Rouge and Meteor to InstanceMetric for faster score computation #1011

Conversation

yoavkatz
Member

@yoavkatz yoavkatz commented Jul 9, 2024

added a new metric with interval calculations

@yoavkatz yoavkatz requested a review from eladven July 9, 2024 09:23
@OfirArviv
Collaborator

@yoavkatz maybe change it in the class itself, not only in the catalog? That way, people who fetch the metric directly (as I did) still get the default.

@yoavkatz
Member Author

yoavkatz commented Jul 9, 2024

> @yoavkatz maybe change it in the class itself? Not only in the catalog? That way if people fetch the metric as I did they still have the default

Meteor (unlike Rouge) does not have its own class; it is only an instantiation of the HuggingFace metric.

I think the reason Rouge is slow is that it does bootstrapping internally, in its own code:

https://huggingface.co/spaces/evaluate-metric/rouge/blob/main/rouge.py#L134

I think the solution is to avoid using that aggregation.

@dafnapension
Collaborator

dafnapension commented Jul 10, 2024

@eladven , @yoavkatz , I think the problem with unitxt's use of HF Rouge and Meteor (and others) is making them Global, while also leaving the GlobalMetric default arg process_single_instances = True in place.
In that case HF, which itself invokes heavy language models, is invoked first for each and every instance (in the first part of GlobalMetric.process()), and then for the whole stream (in the second part of that process()). When HF is invoked for the whole stream, it again processes each and every instance.
Now, if this happens with CI, the processing of each and every instance (which involves a heavy language model) repeats 2 * number_of_resamples times.
If these HF metrics are incarnated as InstanceMetric (and there is no reason why not), then each instance is processed just once! The CI computation reuses existing scores rather than computing them again, and with an InstanceMetric those scores will exist.

I think it is better to fix that issue than to flood the catalog with yes/no-CI variants of cards.
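A dependency-free sketch of the cost difference being described (illustrative names only, not the actual unitxt classes): a GlobalMetric-style CI re-invokes the heavy scorer on every resample, while an InstanceMetric-style CI scores each instance once and resamples the cached floats.

```python
import random

predictions = [f"pred {i}" for i in range(10)]
references = [f"ref {i}" for i in range(10)]

calls = {"n": 0}

def instance_score(pred, ref):
    # Stand-in for one heavy HF evaluation of a single instance.
    calls["n"] += 1
    return float(pred.split()[-1] == ref.split()[-1])

def global_metric_ci(preds, refs, n_resamples=5):
    # GlobalMetric style: the metric is recomputed from the raw texts
    # on every bootstrap resample.
    for _ in range(n_resamples):
        idx = random.choices(range(len(preds)), k=len(preds))
        sum(instance_score(preds[i], refs[i]) for i in idx) / len(idx)

def instance_metric_ci(preds, refs, n_resamples=5):
    # InstanceMetric style: per-instance scores are computed once;
    # resampling only reshuffles the cached floats.
    scores = [instance_score(p, r) for p, r in zip(preds, refs)]
    for _ in range(n_resamples):
        idx = random.choices(range(len(scores)), k=len(scores))
        sum(scores[i] for i in idx) / len(idx)

calls["n"] = 0
global_metric_ci(predictions, references)
global_calls = calls["n"]    # n_resamples * n_instances heavy evaluations

calls["n"] = 0
instance_metric_ci(predictions, references)
instance_calls = calls["n"]  # n_instances heavy evaluations, once
```

With 10 instances and 5 resamples the global route makes 50 heavy calls versus 10; with the real resample counts the gap is far larger.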

Collaborator

@dafnapension dafnapension left a comment

Please see my comment before you merge. I think it is not a good idea to run away from CI when it is not necessary.

@yoavkatz
Member Author

> @eladven , @yoavkatz , I think that the problem with the use of unitxt of HF Rouge and Meteor (and others) is in making them Global, and even leaving the default arg of GlobalMetric: process_single_instances = True in place. In such a case, HF, that by itself invokes heavy language models, is invoked first for each and every instance (from the beginning of GlobalMetric.process()), and then for the whole stream (from the second part of that process()). When HF is invoked for the whole stream, it again processes each and every instance. Now if this happens with CI - the processing of each and every instance (which is heavy language model) repeats 2 * number of resamples. If these HF metric are incarnated as InstanceMetric (and there is no reason why not), then each instance would have been processed just once! the CI uses existing scores, if there are, it does not compute them again, and there will be if so incarnated as InstanceMetric.
>
> I think it is better to fix that issue and not flood the catalog with cards of yes/no CI.

I agree. I changed the code of Rouge to be a HuggingfaceBulkMetric, and it solved the issue for Rouge.

Please review and let me know what you think.

Regarding Meteor, can you make the corresponding change?

@yoavkatz yoavkatz changed the title Remove confidence interval calculation for meteor metric by default Move Rouge and Meteor to be Bulk metrics to reduce runtime in confidence interval calculations Jul 10, 2024
@dafnapension
Collaborator

yes, coming up

@dafnapension
Collaborator

Hi @yoavkatz , I added Meteor as a HuggingFaceInstance metric, to see how this works, with CI.
Also, I thought about Rouge: if it is bulk, then each bulk gets the global score as the average of the bulk, not the average of the whole stream. Is that OK?
So I changed Rouge to inherit from this new HuggingFaceInstance. Nothing else changed.
I also fixed a small bug in test_metric_utils, and things seem to be fine now, here, on my little laptop.
I hope that this new incarnation will also survive out in the wild fields of fm_eval.

@yoavkatz
Member Author

> Hi @yoavkatz , I added Meteor as an HuggingFaceInstance, to see how this works. With ci. Also, I thought about rouge: if it is bulk, then each bulk gets the global score as the average of the bulk, not the average of the whole stream. Is that OK? So I changed Rouge to inherit from this new HuggingFaceInstance. Nothing else changed. I also fixed a small bug in test_metric_utils, and things seem to be fine now, here, on my little laptop. I hope that this new incarnation will survive also out in the wild fields of fm_eval..

In general, BulkInstanceMetric does not perform aggregation and assumes the results are returned per instance. It is just a matter of optimization; for example, in LLM-based metrics we send a batch of requests at a time instead of one by one, which is much slower.

In Rouge, I removed use_aggregator=True so the metric returns a rouge score for each instance, which is then averaged by the generic BulkInstanceMetric code.
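A toy sketch of the flow described here (the scorer and field names are cheap stand-ins, not HF's actual Rouge): the metric is called once on the whole bulk and returns one score per instance per field; the generic code then splits the lists back into instance scores and averages them for the global score.

```python
# Hypothetical bulk computation in the style of an HF metric called with
# use_aggregator=False: one score per instance for each score field.
def bulk_compute(predictions, references):
    def f1(pred, ref):
        # Trivial unigram-overlap F1 as a stand-in for real Rouge scoring.
        p, r = set(pred.split()), set(ref.split())
        if not p or not r:
            return 0.0
        overlap = len(p & r)
        prec, rec = overlap / len(p), overlap / len(r)
        return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

    return {"rougeL": [f1(p, r) for p, r in zip(predictions, references)]}

preds = ["the cat sat", "a dog ran"]
refs = ["the cat sat", "a dog walked"]

scores = bulk_compute(preds, refs)
# Generic BulkInstanceMetric step: split the list score back into instances...
instance_scores = [{"rougeL": v} for v in scores["rougeL"]]
# ...and report the global score as their simple average.
global_score = sum(scores["rougeL"]) / len(scores["rougeL"])
```

Because the per-instance scores survive, a later CI computation can resample them without re-running the scorer.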

@yoavkatz
Member Author

@dafnapension - it would be good if we added a unit test that checks that if we wrap a metric (like Rouge) in HuggingfaceInstanceMetric or HuggingfaceBulkInstanceMetric, we get the same results.
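Such a test could look roughly like this (with trivial stand-in scorers in place of the real wrappers, which are not reproduced here): the two wrappings of the same underlying metric must agree on the aggregated score.

```python
import math

# Hypothetical stand-ins for the two wrappings of one underlying metric:
# one scores instance by instance, the other scores the whole bulk at once.
def exact_match(pred, ref):
    return float(pred == ref)

def instance_wrapped(preds, refs):
    # Instance style: score each instance as it streams by, then average.
    return sum(exact_match(p, r) for p, r in zip(preds, refs)) / len(preds)

def bulk_wrapped(preds, refs):
    # Bulk style: score the whole batch in one call, then average.
    scores = [exact_match(p, r) for p, r in zip(preds, refs)]
    return sum(scores) / len(scores)

def test_wrappers_agree():
    preds = ["a", "b", "c", "d"]
    refs = ["a", "x", "c", "y"]
    assert math.isclose(
        instance_wrapped(preds, refs), bulk_wrapped(preds, refs), rel_tol=1e-9
    )

test_wrappers_agree()
```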

@dafnapension dafnapension force-pushed the Remove-confidence-interval-caluculation-for-meteor-metric-by-default branch from bbaf1ca to a63edfc Compare July 11, 2024 10:04
@dafnapension
Collaborator

A good idea, coming up!

@dafnapension dafnapension force-pushed the Remove-confidence-interval-caluculation-for-meteor-metric-by-default branch from a63edfc to 8a3ad1d Compare July 11, 2024 13:22
@dafnapension
Collaborator

Added a comparison of Meteor to its previous global HF implementation, upon preparation of the metric (in Meteor.py).
Now looking into Rouge.

@dafnapension
Collaborator

dafnapension commented Jul 11, 2024

Hi @yoavkatz , I added a test comparing the old and new Rouge, to show they return the same result.
Also, per good advice from @arielge and @assaftibm , in the case of more than 100 instances, I changed the bootstrap method from "BCa" (the default) to "percentile".

Last: please note that for Rouge, if HF's use_aggregator = False, then HF returns vectors of numeric results (rather than simple floats), with which we cannot do CI. So I made the comparison above with use_aggregator = True in order to allow CI. The bootstrapping that HF does on its own initiative (when use_aggregator = True) does not change the result much, compared to a simple average over the individual instance scores.

@dafnapension dafnapension force-pushed the Remove-confidence-interval-caluculation-for-meteor-metric-by-default branch 2 times, most recently from 07f2ffd to 6040057 Compare July 15, 2024 13:20
@@ -327,6 +327,11 @@ def score_based_confidence_interval(
# otherwise, the aggregation_func needs to be applied AFTER resampling the instances;
# that is, re-form the groups, calculate the function, and take the mean of the group scores
aggregation_func = self.average_item_scores

method = "BCa"
if len(instances) > 100:
Member Author

What is the runtime and accuracy difference between BCa and percentile?

Collaborator

I do not really know; I took the advice of @arielge and @assaftibm , here:
#1008 (comment)
and
#1008 (comment)

Collaborator

I found information about percentile and BCa here.
From my understanding (@assaftibm , @arielge , please correct my mistakes):
'percentile' works by the following intuitive notion: make n_resamples resamples of the population [the stream fed to the metric]. Each resample is of the size of the population and is built by repeating the following process: independently and identically distributed, select an item [instance] from the population, with replacement [returning the selected item back to the population].
Then, given one resample, compute the statistic (the metric) on it, the same way the metric is computed on the original population.
Then, given n_resamples scores (one per resample), return the low and high quantiles (e.g. 0.05 and 0.95) as the CI borders.

BCa continues from the above n_resamples scores, correcting for skewness and bias observed when comparing them to the original population.
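The 'percentile' procedure described above can be written out directly (pure Python, with the mean as the statistic; a 95% interval uses the 0.025 and 0.975 quantiles of the resampled statistics):

```python
import random

random.seed(0)  # deterministic resampling for the sketch

def percentile_ci(scores, n_resamples=1000, alpha=0.05):
    # Resample the population with replacement, recompute the statistic
    # (here the mean) on each resample, and take the alpha/2 and
    # 1 - alpha/2 quantiles of the resampled statistics as the CI borders.
    n = len(scores)
    stats = sorted(
        sum(random.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

scores = [0.2, 0.4, 0.5, 0.5, 0.6, 0.7, 0.8, 0.9]
low, high = percentile_ci(scores)
```

BCa then adjusts which quantiles of the same resampled statistics are reported, to correct for bias and skewness; the resampling cost is essentially identical.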

Member Author

I'm not sure it matters, but I saw somewhere that BCa is the one that is commonly used. If the difference in runtime is small (I would think the main cost is recalculating the metric, not BCa vs. percentile), then I would keep it simple and not change behavior at some threshold.

So - can you try running twice (with BCa and with percentile) on, say, 200 instances, and see the difference in runtime?

Member

Hi @dafnapension!

Yes - I think your understanding of Percentile bootstrap is correct.

I would suggest the following:

  1. When it's an InstanceMetric, use BCa bootstrap.
  2. When it's a GlobalMetric, use Basic bootstrap (or Percentile, see discussion below).

In 2, I wouldn't condition the type of bootstrap on the size of the dataset and switch between BCa and another, lighter bootstrap, because it can confuse people who test a full dataset vs. smaller samples of it - they would suddenly get different CIs due to a switch in the bootstrap type. I think it's better to stick to one for consistency.

Regarding Basic vs. Percentile bootstrap.
There is a description here. I have to admit that I don't fully understand the computation of the Basic bootstrap. However, note that the Scipy documentation includes this comment:

> While the 'percentile' method is the most intuitive, it is rarely used in practice. Two more common methods are available, 'basic' (‘reverse percentile’) and 'BCa' (‘bias-corrected and accelerated’); they differ in how step 3 is performed.

So this is the main reason why I would lean towards Basic over Percentile. It would be easier if we could stick to BCa also in GlobalMetrics but unfortunately that doesn't seem practical.

As a side note, I think it makes sense to ask about the slowdown with BCa on StackOverflow - showing a short self-contained sample code that reproduces the problem using Scipy's implementation and asking if this is as expected. It's possible that the implementation in Scipy is not optimal.
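A minimal sketch of such a self-contained comparison, assuming SciPy's scipy.stats.bootstrap (the implementation the diff above relies on); the random scores are stand-ins for 200 per-instance metric scores:

```python
import time
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
scores = rng.random(200)  # stand-in for 200 per-instance metric scores

cis, times = {}, {}
for method in ("percentile", "basic", "BCa"):
    start = time.perf_counter()
    res = stats.bootstrap(
        (scores,),
        np.mean,
        n_resamples=1000,
        confidence_level=0.95,
        method=method,
        random_state=np.random.default_rng(1),
    )
    times[method] = time.perf_counter() - start
    cis[method] = (res.confidence_interval.low, res.confidence_interval.high)
```

Comparing the entries of `times` and `cis` across the three methods gives exactly the runtime-vs.-interval trade-off discussed here; for a cheap statistic like the mean, the differences are expected to be small.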

Collaborator

@dafnapension dafnapension Jul 18, 2024

In all my experiments, for lack of real data, I used prediction = target (the first reference). I think this only changes the constant time of computing each instance score, so I allowed myself.
In my humble opinion, with the numbers of instances and resamples that I see in the cards, we can stick to BCa, said to be the most commonly used. Just remember to check again if we grow dramatically.

Member Author

Thanks for the detailed analysis. BCa seems to increase the time about 10%-20% in instance metrics, but more than doubles the time in global metrics.

Member Author

So do we agree to keep BCa always? I don't think we have many more global metrics that take a long time - and it's the simplest approach.

Collaborator

I, too, think so. Leave the change for future consideration, if and when needed.
@elronbandel also asked that I simply copy the code we find in HF into an in-house instance metric, for a further speedup.

Member

I think the problem will appear with more complex global metrics like CorpusBLEU, but I agree it can be dealt with in a separate PR.

@dafnapension dafnapension force-pushed the Remove-confidence-interval-caluculation-for-meteor-metric-by-default branch from 6040057 to 2381c4e Compare July 15, 2024 14:49
@dafnapension
Copy link
Collaborator

Hi @yoavkatz , looking at these lines in HuggingfaceBulkMetric:

results = [{} for _ in range(len(scores[self.hf_metric_fields[0]]))]

it seems like you assume that every score that is a list has one component per instance. That is true for Rouge, but not for BLEU, for example, where the score named precisions is an array whose length equals the input argument max_order.
I think this breakdown of list scores into instance scores should be done at the individual-metric level, not at the BulkMetric level.

@yoavkatz
Member Author

> Hi @yoavkatz , looking at these lines in HuggingfaceBulkMetric:
>
> results = [{} for _ in range(len(scores[self.hf_metric_fields[0]]))]
>
> it seems like you assume that every score that is a list, means one component per instance. It is true for Rouge, but not so in bleu, for example, where a score named precisions is an array whose length equals the input argument max_order.
> I think that this breakdown of list scores to instance score should be done in the individual metric level, and not in the BulkMetric level.

I suggest we complete this PR on Rouge and Meteor, and then think about BLEU and generalizing this.

@yoavkatz
Member Author

Note that there is HuggingfaceInstanceMetric, which is used by Rouge and Meteor, and HuggingfaceBulkMetric, which is only used by BertScore, as far as I understand.

@dafnapension
Collaborator

Hi @yoavkatz , indeed, I was referring to a generalization (for a future PR), trying to address your comment that, performance-wise, it is better to approach HF in batches when possible.

@dafnapension dafnapension force-pushed the Remove-confidence-interval-caluculation-for-meteor-metric-by-default branch 5 times, most recently from 039f3b4 to 9357b60 Compare July 21, 2024 19:53
Comment on lines 1350 to 1351
nltk.download("wordnet")
nltk.download("omw-1.4")
Member

Suggested change
nltk.download("wordnet")
nltk.download("omw-1.4")
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

Comment on lines 1364 to 1365
from nltk import word_tokenize
from nltk.translate import meteor_score
Member

move to prepare() so the import is performed only once

Collaborator

@dafnapension dafnapension Jul 22, 2024

thanks, @elronbandel , at last I learned (from you, as always) how to avoid those "already up to date" messages running across my screen.

Member

@elronbandel elronbandel left a comment

@dafnapension looks great. Only need to move the imports to prepare() and make the nltk downloads quiet (see my comments).

Also consider making a new PR with only your changes, since there is this .secrets.baseline that seems to be affected by a wrong merge of main.

Comment on lines 1780 to 1784
import nltk
from rouge_score import rouge_scorer

nltk.download("punkt")
self.sent_tokenize = nltk.sent_tokenize
Member

move to prepare

Member

You need to save it under self.scorer and then use it in compute().
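The requested pattern, sketched with a hypothetical class (a cheap stand-in replaces the real rouge_scorer so the sketch has no dependencies): expensive setup runs once in prepare() and is stored on self, and compute() only reuses the cached object.

```python
# Illustrative sketch of the prepare()/compute() split; names are hypothetical.
class SketchRouge:
    def prepare(self):
        # In the real metric this would be roughly:
        #   from rouge_score import rouge_scorer
        #   self.scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
        # Here a trivial stand-in keeps the sketch dependency-free.
        self.scorer = lambda pred, ref: float(pred == ref)
        self.prepare_calls = getattr(self, "prepare_calls", 0) + 1

    def compute(self, prediction, reference):
        # No imports or downloads here: just reuse the prepared scorer.
        return self.scorer(prediction, reference)

metric = SketchRouge()
metric.prepare()  # setup happens once, no matter how many instances follow
results = [
    metric.compute(p, r)
    for p, r in [("a cat", "a cat"), ("a dog", "a cat")]
]
```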

Collaborator

Hi @elronbandel , Done.

Member Author

@elronbandel @dafnapension - can we merge these?

yoavkatz and others added 11 commits July 22, 2024 10:32
added a new metric with interval calculations

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Yoav Katz <katz@il.ibm.com>
when post processors are not  a list

Signed-off-by: Yoav Katz <katz@il.ibm.com>
to avoid recalculation of metric on every resample

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: dafnapension <dafnashein@yahoo.com>
Signed-off-by: dafnapension <dafnashein@yahoo.com>
Signed-off-by: dafnapension <dafnashein@yahoo.com>
Signed-off-by: dafnapension <dafnashein@yahoo.com>
Signed-off-by: dafnapension <dafnashein@yahoo.com>
… advice, changed bootstrap method to percentile in case of 100 or more instances

Signed-off-by: dafnapension <dafnashein@yahoo.com>
Signed-off-by: dafnapension <dafnashein@yahoo.com>
@dafnapension dafnapension force-pushed the Remove-confidence-interval-caluculation-for-meteor-metric-by-default branch from 9357b60 to d8e264c Compare July 22, 2024 07:32
Signed-off-by: dafnapension <dafnashein@yahoo.com>
Signed-off-by: dafnapension <dafnashein@yahoo.com>
@dafnapension dafnapension force-pushed the Remove-confidence-interval-caluculation-for-meteor-metric-by-default branch from 9c0f87b to 551cd76 Compare July 22, 2024 10:56
Signed-off-by: dafnapension <dafnashein@yahoo.com>
@elronbandel elronbandel enabled auto-merge (squash) July 22, 2024 16:15
@elronbandel elronbandel merged commit 94daea3 into main Jul 22, 2024
8 checks passed
@elronbandel elronbandel deleted the Remove-confidence-interval-caluculation-for-meteor-metric-by-default branch July 22, 2024 16:36
csrajmohan pushed a commit that referenced this pull request Jul 29, 2024
…ion (#1011)

* Remove confidence interval calculation for meteor metric by default

added a new metric with interval calculations

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Added error mesage when metrics not  a list

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Added error mesage

when post processors are not  a list

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Changed Rouge to be HuggingfaceBulkMetric

to avoid recalculation of metric on every resample

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* added meteor as an HuggingFaceInstanceMetric

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* removed meteor_with_confidence_intervals.json

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* fixed test_metric_utils.py by better concentrating on rougeL only

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* comment about rounded floats in tested scores

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* while generating metric meteor, compmare against HF implementation

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* added a test comparing new Rouge with HF Rouge, nd per arielge's good advice, changed bootstrap method to percentile in case of 100 or more instances

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* implemented Meteor and Rouge with inhouse code

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* download quietly, and import in prepare

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* trying to avoid .secrets.baseline

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* secret.baseline how do I get rid of it?

Signed-off-by: dafnapension <dafnashein@yahoo.com>

---------

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: dafnapension <dafnashein@yahoo.com>
Co-authored-by: dafnapension <dafnashein@yahoo.com>
Co-authored-by: Elron Bandel <elronbandel@gmail.com>
benjaminsznajder pushed a commit that referenced this pull request Jul 29, 2024
benjaminsznajder added a commit that referenced this pull request Jul 29, 2024
* Fix bug in data classes and add support for field overriding in fields containing types or functions (#1027)

Fix data classes not support field overriding in fields containing types or functions

Signed-off-by: elronbandel <elron.bandel@ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Added seed to LLM as judges for consistent results (#1029)

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* replace type and __type__ in type error (#1035)

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add rag_end_to_end metrics

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add rag_end_to_end metrics

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Add task rag_end_to_end

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add card for clapnq end_to_end

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add sandbox_benjams

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add subset

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add a reduction of clap_nq

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add a reduction of clap_nq

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* remove constants

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* rename sandbox_benjams to sandbox

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* remove sandbox

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Add string to context id in rag (#1036)

* allow strings (hash) as context id

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>

* save to catalog

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>

---------

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Fixed issues with fresh install (#1037)

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add validation to tldr, remove shuffle from billsum (#1038)

* add validation to tldr, remove shuffle from billsum
(shuffled by the SplitRandomMix)

Signed-off-by: ALON HALFON <ALONHAL@il.ibm.com>

* fix formatting

Signed-off-by: ALON HALFON <ALONHAL@il.ibm.com>

---------

Signed-off-by: ALON HALFON <ALONHAL@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Refactor Rouge and Meteor to InstanceMetric for faster score computation (#1011)

* Remove confidence interval calculation for meteor metric by default

added a new metric with interval calculations

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Added error mesage when metrics not  a list

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Added error mesage

when post processors are not  a list

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Changed Rouge to be HuggingfaceBulkMetric

to avoid recalculation of metric on every resample

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* added meteor as an HuggingFaceInstanceMetric

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* removed meteor_with_confidence_intervals.json

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* fixed test_metric_utils.py by better concentrating on rougeL only

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* comment about rounded floats in tested scores

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* while generating metric meteor, compmare against HF implementation

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* added a test comparing new Rouge with HF Rouge, nd per arielge's good advice, changed bootstrap method to percentile in case of 100 or more instances

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* implemented Meteor and Rouge with inhouse code

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* download quietly, and import in prepare

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* trying to avoid .secrets.baseline

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* secret.baseline how do I get rid of it?

Signed-off-by: dafnapension <dafnashein@yahoo.com>

---------

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: dafnapension <dafnashein@yahoo.com>
Co-authored-by: dafnapension <dafnashein@yahoo.com>
Co-authored-by: Elron Bandel <elronbandel@gmail.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Add CloseTextSampler and FixedIndicesSampler (#1034)

* Add CloseTextSampler

That returns demos that are textually close to the current instance.

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Make sampler call pass  current instance

Added end 2 end test of sampler that depends on output

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Added FixedIndicesSampler(Sampler):

Selects a fix set of samples based on a list of indices from the demo pool

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Made splitter currently use random_generators

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Changed all Sample randomization

To use common code to create randomizer per instance

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Updated demos in test

After a non backward compatible change

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Updated demos in test

After a non backward compatible change

Signed-off-by: Yoav Katz <katz@il.ibm.com>

---------

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* changed input and output of templates to "input_fields" and "reference_ fields" - Non backward compatible (#1030)

* changed input and output of templates

to "input_fields" and "reference_ fields" .

This is to continue  the work done on tasks.

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Fixed type hint

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Documentation update

Signed-off-by: Yoav Katz <katz@il.ibm.com>

---------

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* FinQA - filter problematic examples (#1039)

filter problematic examples

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Arena hard elad2 (#1026)

* bug fixes in PairwiseChoiceTemplate

* add arena hard regex parser operator

* update mt bench card common

* update mt bench card common

* add reward bench

* update metric to pairwise comarison task

* arena hard tasks and cards

* update mt bench template

* add duplicate stream operator

* add PairwiseComparativeRatingTemplate

* add card

* add card

* add template

* add winrate metrics

* add comparative rating task

* add ExtractArenaHardNumericalJudgment

* add arena hard cards

* add arena hard template

* add weighted winrate metrics

* delete file

* update PairwiseComparativeRatingTemplate

* add metric

* add metric

* update

* update

* update

* fix template bug

* update

* llama 3 update

* update

* update

* update jsons

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* fix

* fix

* fix

* update

* update

* update

* bluebench related changes

* fix type issue

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>

* update

* update

* update

* prometheus1

* update

* fix

* fix

* merge with arena_branch

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>

* rebuild catalog

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>

* add debugging to clapnq

* Reproduce all artifacts

* Add missing artifacts to catalog

* Add secrets baseline

Signed-off-by: Elad Venezian <eladv@il.ibm.com>

* Fix bugs with catalog creation

* Remove areana hard examples from tests, since they don't pass

* Add missing metadata to test mock

* Add data_classification_policy and recipe_metadata to the steams tests

* Fix test failures

* Update multi_turn_gpt4_judgement.py

* Update multi_turn_with_reference_gpt4_judgement.py

* Update docs/docs/examples.rst

Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>

* revert catalog consistecy and preperation yml files

* revert catalog consistecy and preperation yml files

* revert catalog consistecy and preperation yml files

* revert catalog consistecy and preperation yml files

* Update docs/docs/examples.rst

Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>

* bug fix in LoadFromHFSpace

* revert

* revert

* update examples

* add coment to expain change

* update to new params usage

* pr fixes

* pr fixes

* update

* update

* update

* update

* update

* update

* Update prepare/templates/rag/response_generation.py

Co-authored-by: Yotam Perlitz <perlitz@gmail.com>

* Update prepare/templates/rag/response_generation.py

Co-authored-by: Yotam Perlitz <perlitz@gmail.com>

* update

* cr fixes

* llmaj format fix

* llmaj format fix

---------

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>
Signed-off-by: Elad Venezian <eladv@il.ibm.com>
Co-authored-by: ofirarviv <ofir.arviv@ibm.com>
Co-authored-by: Yotam Perlitz <yotam.perlitz@ibm.com>
Co-authored-by: michal <shmueli@il.ibm.com>
Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>
Co-authored-by: Yotam Perlitz <perlitz@gmail.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* demo's target prefix is now taken from demo instance (#1031)

* demo's target prefix is now taken from demo instance

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* do not pop fields out of demo instances.
Traditionally done for main instance, but not allowed for demo instance that should serve also other main instances in the stream

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* simplified test-case per @yoavkatz's idea. Eager mode still samples different demos than non-eager mode

Signed-off-by: dafnapension <dafnashein@yahoo.com>

---------

Signed-off-by: dafnapension <dafnashein@yahoo.com>
Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* remove the reduced clap_nq

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* define an empty template for rag end_to_end

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Implement metrics ensemble (#1047)

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>
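The metrics-ensemble idea can be sketched as a weighted average of per-instance metric scores. This is a minimal illustration with made-up names and an equal-weight default, not the unitxt API:

```python
def ensemble_score(metric_fns, prediction, references, weights=None):
    """Combine several per-instance metric scores into one weighted score."""
    scores = [fn(prediction, references) for fn in metric_fns]
    if weights is None:
        weights = [1.0 / len(scores)] * len(scores)  # equal weights by default
    return sum(w * s for w, s in zip(weights, scores))

# Two toy metrics for illustration: exact match and a length-ratio score.
def exact(pred, refs):
    return float(pred in refs)

def length_ratio(pred, refs):
    return min(len(pred), len(refs[0])) / max(len(pred), len(refs[0]))

score = ensemble_score([exact, length_ratio], "hello", ["hello"])
```

Any callable returning a float per instance can be plugged in; the actual ensemble in the repository may weight and combine metrics differently.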

* add load_json_predictions as processor in the template

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add the processors/load_json_predictions.json generated to the catalog

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Add flores101 (#1053)

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Added example for selection of demos (#1052)

* Added example for selection of demos

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Added example doc

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Update docs/docs/examples.rst

* Update docs/docs/examples.rst

---------

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* fix - building test is not working. The reason is that opendatasets depends on kaggle without pinning a version, and kaggle-1.6.15 currently fails. We pin kaggle to version 1.6.14 as a fix

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add overwrite

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Update introduction.rst - - copy edits (grammar, consistency, clarity) (#1063)

Signed-off-by: welisheva22 <welisheva22@gmail.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Fix typo in japanese_llama system prompt (issue #964) (#1056)

Signed-off-by: Jonathan Bnayahu <bnayahu@il.ibm.com>
Co-authored-by: Elron Bandel <elronbandel@gmail.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Allow assigning None in overwrites when fetching artifacts with modifications (#1062)

allow =None in overwrites for fetch

Signed-off-by: dafnapension <dafnashein@yahoo.com>
Co-authored-by: Elron Bandel <elronbandel@gmail.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Make sure preparation times printed fully and nicely (#1046)

Signed-off-by: elronbandel <elron.bandel@ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* numeric nlg - template changes (#1041)

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add judge input to the metric (#1064)

* add judge input to the metric

* add judge input to the metric

* fix

* fix test

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Unitxt capitalization adding_dataset.rst (#1057)

making Unitxt capitalization consistent in text

Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* fixed the score_ci inconsistency issue (#1065)

* suggested fix for score_ci inconsistency issue

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* unify with the update, and thus simplified the check

Signed-off-by: dafnapension <dafnashein@yahoo.com>
---------

Signed-off-by: dafnapension <dafnashein@yahoo.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Use of conventional python types in input definition of tasks and metrics (#1045)

* Fix data classes not supporting field overriding in fields containing types or functions

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Make tasks types python types

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Fix errors

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Some fixes

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* More fixes

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Update catalog

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Fix cards

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Revert change

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Fix typing in docs with new convention

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* refactor of new asset to new convention

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Update secrets baseline

Signed-off-by: elronbandel <elron.bandel@ibm.com>

---------

Signed-off-by: elronbandel <elron.bandel@ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Added prediction type to llm as judge to avoid warning (#1072)

* Added prediction type to llm as judge to avoid warning

Clarified the standalone llm as judge example

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Removed accidentally added file

Signed-off-by: Yoav Katz <katz@il.ibm.com>

---------

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Fixed clapnq to check with reasonable error values

Also updated rag tasks to use new typing (instead of string types)

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* fix the type hint

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* update catalog

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Add metric "metrics.rag.retrieval_at_k" to catalog (#1074)

* add metric "metrics.rag.retrieval_at_k" to catalog
this is a wrapper around the retrieval_at_k for the ragas scheme

* add corresponding json file for the new metric

---------

Co-authored-by: Elron Bandel <elronbandel@gmail.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* merge - resolve conflict

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

---------

Signed-off-by: elronbandel <elron.bandel@ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>
Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>
Signed-off-by: ALON HALFON <ALONHAL@il.ibm.com>
Signed-off-by: dafnapension <dafnashein@yahoo.com>
Signed-off-by: Elad Venezian <eladv@il.ibm.com>
Signed-off-by: welisheva22 <welisheva22@gmail.com>
Signed-off-by: Jonathan Bnayahu <bnayahu@il.ibm.com>
Co-authored-by: Elron Bandel <elronbandel@gmail.com>
Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>
Co-authored-by: Yotam Perlitz <perlitz@gmail.com>
Co-authored-by: Benjamin Sznajder <benjams@il.ibm.com>
Co-authored-by: Alon H <alonh@users.noreply.github.com>
Co-authored-by: dafnapension <dafnashein@yahoo.com>
Co-authored-by: ShirApp <58909189+ShirApp@users.noreply.github.com>
Co-authored-by: Elad <eladv@il.ibm.com>
Co-authored-by: ofirarviv <ofir.arviv@ibm.com>
Co-authored-by: Yotam Perlitz <yotam.perlitz@ibm.com>
Co-authored-by: michal <shmueli@il.ibm.com>
Co-authored-by: dafnapension <46454972+dafnapension@users.noreply.github.com>
Co-authored-by: welisheva22 <welisheva22@gmail.com>
Co-authored-by: Jonathan Bnayahu <bnayahu@il.ibm.com>
Co-authored-by: hanansinger <95229126+hanansinger@users.noreply.github.com>
Co-authored-by: Yoav Katz <katz@il.ibm.com>
Co-authored-by: matanor <55045955+matanor@users.noreply.github.com>
csrajmohan pushed a commit that referenced this pull request Aug 29, 2024
Refactor Rouge and Meteor to InstanceMetric for faster score computation (#1011)

* Remove confidence interval calculation for meteor metric by default

added a new metric with interval calculations

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Added error message when metrics are not a list

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Added error message

when post processors are not a list

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Changed Rouge to be HuggingfaceBulkMetric

to avoid recalculation of metric on every resample

Signed-off-by: Yoav Katz <katz@il.ibm.com>
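The speedup this refactor targets can be sketched in a few lines: an instance-level metric computes each per-instance score once, and the confidence interval is then bootstrapped by resampling only those cached scores, never recomputing the metric. A minimal stdlib sketch with illustrative names, not the unitxt API:

```python
import random
import statistics

def percentile_bootstrap_ci(instance_scores, n_resamples=1000, alpha=0.05, seed=7):
    """Percentile-bootstrap CI over cached per-instance scores.

    The expensive metric is evaluated once per instance elsewhere;
    here we only resample the resulting floats.
    """
    rng = random.Random(seed)
    n = len(instance_scores)
    means = sorted(
        statistics.fmean(rng.choices(instance_scores, k=n))
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]          # 2.5th percentile
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]  # 97.5th percentile
    return statistics.fmean(instance_scores), (lo, hi)

mean, (lo, hi) = percentile_bootstrap_ci([0.2, 0.4, 0.5, 0.7, 0.9])
```

Contrast this with a global metric that re-invokes the underlying scorer on every resample: with 1000 resamples, the scorer runs 1000x more often than needed.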

* added meteor as a HuggingFaceInstanceMetric

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* removed meteor_with_confidence_intervals.json

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* fixed test_metric_utils.py by focusing on rougeL only

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* comment about rounded floats in tested scores

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* while generating metric meteor, compare against HF implementation

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* added a test comparing new Rouge with HF Rouge, and per arielge's good advice, changed the bootstrap method to percentile in case of 100 or more instances

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* implemented Meteor and Rouge with inhouse code

Signed-off-by: dafnapension <dafnashein@yahoo.com>
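To illustrate what an in-house instance-level implementation looks like, here is a minimal ROUGE-L F-measure for a single prediction/reference pair. This is a simplified sketch assuming whitespace tokenization; the actual unitxt code differs in tokenization and multi-reference handling:

```python
def lcs_length(a, b):
    """Dynamic-programming longest-common-subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(prediction, reference):
    """ROUGE-L F-measure over whitespace-tokenized strings."""
    pred, ref = prediction.split(), reference.split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Because the score is a pure per-instance function, it slots naturally into an InstanceMetric: compute once per instance, then bootstrap over the cached floats.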

* download quietly, and import in prepare

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* trying to avoid .secrets.baseline

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* secret.baseline how do I get rid of it?

Signed-off-by: dafnapension <dafnashein@yahoo.com>

---------

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: dafnapension <dafnashein@yahoo.com>
Co-authored-by: dafnapension <dafnashein@yahoo.com>
Co-authored-by: Elron Bandel <elronbandel@gmail.com>
csrajmohan pushed a commit that referenced this pull request Aug 29, 2024
* Fix bug in data classes and add support for field overriding in fields containing types or functions (#1027)

Fix data classes not supporting field overriding in fields containing types or functions

Signed-off-by: elronbandel <elron.bandel@ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Added seed to LLM as judges for consistent results (#1029)

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* replace type and __type__ in type error (#1035)

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add rag_end_to_end metrics

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add rag_end_to_end metrics

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Add task rag_end_to_end

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add card for clapnq end_to_end

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add sandbox_benjams

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add subset

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add a reduction of clap_nq

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add a reduction of clap_nq

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* remove constants

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* rename sandbox_benjams to sandbox

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* remove sandbox

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Add string to context id in rag (#1036)

* allow strings (hash) as context id

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>

* save to catalog

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>

---------

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Fixed issues with fresh install (#1037)

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add validation to tldr, remove shuffle from billsum (#1038)

* add validation to tldr, remove shuffle from billsum
(shuffled by the SplitRandomMix)

Signed-off-by: ALON HALFON <ALONHAL@il.ibm.com>

* fix formatting

Signed-off-by: ALON HALFON <ALONHAL@il.ibm.com>

---------

Signed-off-by: ALON HALFON <ALONHAL@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Add CloseTextSampler and FixedIndicesSampler (#1034)

* Add CloseTextSampler

That returns demos that are textually close to the current instance.

Signed-off-by: Yoav Katz <katz@il.ibm.com>
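The sampler's behavior can be sketched as ranking the demo pool by textual similarity to the current instance and keeping the top k. Jaccard word overlap and the field names below are illustrative choices, not necessarily what CloseTextSampler uses:

```python
def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def sample_close_demos(demo_pool, instance_text, k):
    """Pick the k demos whose text is most similar to the current instance."""
    ranked = sorted(
        demo_pool,
        key=lambda d: token_overlap(d["text"], instance_text),
        reverse=True,
    )
    return ranked[:k]

pool = [
    {"text": "translate the sentence"},
    {"text": "sum the numbers"},
    {"text": "translate this sentence now"},
]
demos = sample_close_demos(pool, "translate a sentence", k=2)
```

Note that, unlike a fixed random sampler, the selection depends on the instance itself, which is why the sampler call needs access to the current instance (the change described in the next commit).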

* Make sampler call pass current instance

Added end-to-end test of sampler that depends on output

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Added FixedIndicesSampler(Sampler):

Selects a fixed set of samples based on a list of indices from the demo pool

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Made splitter use random_generators

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Changed all Sample randomization

To use common code to create randomizer per instance

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Updated demos in test

After a non backward compatible change

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Updated demos in test

After a non backward compatible change

Signed-off-by: Yoav Katz <katz@il.ibm.com>

---------

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* changed input and output of templates to "input_fields" and "reference_fields" - Non backward compatible (#1030)

* changed input and output of templates

to "input_fields" and "reference_fields".

This is to continue the work done on tasks.

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Fixed type hint

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Documentation update

Signed-off-by: Yoav Katz <katz@il.ibm.com>

---------

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* FinQA - filter problematic examples (#1039)

filter problematic examples

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Arena hard elad2 (#1026)

* bug fixes in PairwiseChoiceTemplate

* add arena hard regex parser operator

* update mt bench card common

* update mt bench card common

* add reward bench

* update metric to pairwise comarison task

* arena hard tasks and cards

* update mt bench template

* add duplicate stream operator

* add PairwiseComparativeRatingTemplate

* add card

* add card

* add template

* add winrate metrics

* add comparative rating task

* add ExtractArenaHardNumericalJudgment

* add arena hard cards

* add arena hard template

* add weighted winrate metrics

* delete file

* update PairwiseComparativeRatingTemplate
