Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix evaluation PR #1711

Merged
merged 22 commits into from
May 30, 2024
Merged

Fix evaluation PR #1711

merged 22 commits into from
May 30, 2024

Conversation

mmabrouk
Copy link
Member

@mmabrouk mmabrouk commented May 28, 2024

I've significantly revised the evaluation PR, especially our handling of the correct_answer_key as a string or list, which was problematic. The following changes have been made:

  • I've changed our approach to the correct answer, offering a universal solution. Now, each evaluator can have advanced settings, such as correct_answer_key. Ground truth keys that should appear in the evaluation result are marked in the configuration.
  • I've enhanced the security of the code to prevent executing a function with a name taken directly from the frontend.
  • Ive replaced langchain with OpenAI for the AI critic. The versions we were using were outdated and incompatible with Ragas and other features. Langchain was not suitable for our needs.

Copy link

vercel bot commented May 28, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
agenta ✅ Ready (Inspect) Visit Preview 💬 Add feedback May 29, 2024 0:08am

Removed langchain and use openai directly
@mmabrouk
Copy link
Member Author

@aakrem can you please review the general logic
@aybruhm can you please review the changes to the tests and see whether they make sense
@bekossy can you please QA the evaluation. I QA'ed myself, but between my changes and Akrem's a lot has been done. Can you please try all the evaluators and see whether they work. Then again with a different correct_answer. Thank you!

@mmabrouk
Copy link
Member Author

@bekossy oss and cloud.beta both point for now to this PR

Copy link
Collaborator

@aakrem aakrem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @mmabrouk

  • I like the idea of introducing the advance key, it frees up a lot of logic in frontend.
  • I also like a lot of the refactoring you did
  • Regarding the refactor in the evaluators.py, tbh, I prefer the initial implementation with correct_answer_keys as it groups all possible values I find it it abstract more things.

I also would love to know how are we planning to define a ragas evaluator. Is it going to be something like :

    "settings_template": {
      >>>  "correct_answer_keyS": {
            "label": "Correct Answers",
            "default": "correct_answers",
        >>>    "type": "array",
            "advanced": True,
          >>>  "ground_truth_keyS": True,
            "description": "The name of the column in the test data that contains the correct answer",
        },
    },
  • We are mixing the usage of terms correct_answer and ground_truths. I tried to do this I gave up as it's populated everywhere. let's either use the same terminology or refactor all to ground truth together with the schema at a later stage.

@mmabrouk
Copy link
Member Author

Regarding the refactor in the evaluators.py, tbh, I prefer the initial implementation with correct_answer_keys as it groups all possible values I find it it abstract more things.
I don't agree at all. 1) I think the previous implementation does not even work in the case of multiple ground truths. 2) This is a bad abstraction.

A good abstraction encapsulates a concept and conceals its details, as a result it ends up simplifying the code. It does this because a good abstraction mirrors the reality and the problem That's not the case here, you can see it by looking at the code and how we handle the correct_answer_keys:

  1. First let's look at the code in the frontend (which does not make any sense to me). We handle the correct_answer_keys as a string in the UI yet an array in the code. We end up having custom code that moves between string and array. The reason is that the abstraction does not aligned with reality. There is nothing in reality that is a list of correct answer_keys, it's just a random encapsulation that has been created to put things together. In reality, the fact of being a ground truth key is just a property of the setting.
CleanShot 2024-05-29 at 08 29 48@2x 2. In the evaluator functions, we take the first value of a list of keys. This isn't clean code—it unnecessarily complicates the code (now we have to get the first element of the list) without providing inherent meaning. For example, if you read the function on its own, you might wonder why the correct_answer_keys are a list? Why are we ignoring the rest? What's special about the first element? These questions indicate a problem. Another red flag is when you find yourself writing a constant in the code, like correct_answers[0]. It's a sign that the abstraction might be flawed. It's a good habit to have to 1) read functions by themselves and see if they make sense 2) check if you have any constants in code or seemingly random instruction that does not fit any business logic. CleanShot 2024-05-29 at 08 31 02@2x

Now, why is the abstraction that you have used is bad. The data structure you used does not mirror reality: A list means that all elements are interchangeable and have the same meaning. This is not the case here, let's assume we have an evaluator that uses two columns, context and correct_answer. Obviously it needs to now which is which to function properly. How do you even define this when you have a list, do you expect the user to provide these in some specific order?

In conclusion, while grouping values together is often a feature of good abstraction, it doesn't necessarily make an abstraction good. A good abstraction reflects the problem at hand and simplifies its complexity. I hope this explanation is helpful.

I hope this helps. I strongly recommend to everyone reading (or even skimming) the book clean code. I think it is very helpful.

I also would love to know how are we planning to define a ragas evaluator.

    "settings_template": {
        "context_key": {
            "label": "Column Containing Contexts",
            "default": "contexts",
            "type": "string"
            "advanced": True,
            "ground_truth_key": True,
            "description": "The name of the column in the test data that contains the contexts",
        },
    },

Not sure why you were under the impression here that we would be saving each of the context in a different column and providing a list of column names in this case. That would not work, since the number of contexts might change from one case to another.

@mmabrouk
Copy link
Member Author

Just a small addition here. The right abstraction, that is the abstraction that mirrors reality here, is that a value in the test set could be not only a string but also a list. Using this abstraction, would fix any future complexity we would add when using ragas (for instance parsing a json).

@mmabrouk
Copy link
Member Author

We are mixing the usage of terms correct_answer and ground_truths. I tried to do this I gave up as it's populated everywhere. let's either use the same terminology or refactor all to ground truth together with the schema at a later stage.
Actually the issue, is that ground truth is not the right world here. For instance the context in the case of ragas is not a ground truth. Maybe a better word is "evaluation context", not sure

@mmabrouk
Copy link
Member Author

@aybruhm @aakrem @bekossy Please review and and approve (or request changes) the changes when you are done, to know that we can merge

@mmabrouk mmabrouk requested a review from aakrem May 29, 2024 10:13
@aakrem
Copy link
Collaborator

aakrem commented May 29, 2024

@mmabrouk

First let's look at the code in the frontend (which does not make any sense to me). We handle the correct_answer_keys as a string in the UI yet an array in the code.

In frontend it's also considered as a list.
https://github.com/Agenta-AI/agenta/pull/1606/files#diff-a5b1a701e0665f6eb124977e2185fc01a12aaf7bf65ddb00e4b61c1e9c41ab41R7

Regarding UI We still don't even have a design for an evaluator containing a list of correct answers.
I don't understand the concern there.

There is nothing in reality that is a list of correct answer_keys

Regarding all your "reality" points, the reality I see it is that evaluators can have a single correct answer or multiple ones.
Some will use only one some will use many.

I think the original implementation focused a lot to make things ready to integrate evaluator with multiple correct answer.

It's still not clear how are you planning to create an evaluator with multiple correct answers with this implementation.

Can you simply give an example.

@mmabrouk
Copy link
Member Author

mmabrouk commented May 30, 2024

@aakrem

@mmabrouk

First let's look at the code in the frontend (which does not make any sense to me). We handle the correct_answer_keys as a string in the UI yet an array in the code.

In frontend it's also considered as a list. https://github.com/Agenta-AI/agenta/pull/1606/files#diff-a5b1a701e0665f6eb124977e2185fc01a12aaf7bf65ddb00e4b61c1e9c41ab41R7

Which we set as a string.

Regarding UI We still don't even have a design for an evaluator containing a list of correct answers. I don't understand the concern there.

I think you don't understand the problem...

There is nothing in reality that is a list of correct answer_keys

Regarding all your "reality" points, the reality I see it is that evaluators can have a single correct answer or multiple ones. Some will use only one some will use many.

No, an evaluator have multiple columns each with a different meanings: one for context, one for correct_answer.. Not multiple columns with the same meaning (correct_answer1, correct_answer_2). Each column contains one type of information.

I think the original implementation focused a lot to make things ready to integrate evaluator with multiple correct answer.

Again, I think you did not understand the problem.

It's still not clear how are you planning to create an evaluator with multiple correct answers with this implementation.

Can you simply give an example.

I gave one my answer. Which is the Ragas example. There the evaluator need access to a list of context. This list can be of different size, so it does not make any sense to have each context in a different column. The list of context should be in one column.

I hope this helps understand the user story. If there is no other comments other than on this. Let's merge this.

Copy link
Collaborator

@aakrem aakrem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the PR @mmabrouk

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label May 30, 2024
@mmabrouk mmabrouk merged commit 2080680 into access-to-all-columns May 30, 2024
9 checks passed
@mmabrouk mmabrouk deleted the fix-all-columns branch May 30, 2024 18:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Backend evaluation Frontend lgtm This PR has been approved by a maintainer size:XL This PR changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants