[feature] Evaluators - Debugging #2071

Merged

193 commits merged on Sep 23, 2024

Changes from 190 commits

Commits (193)
5c5ad15
feat (backend): add utility function to ensure event loop is retrieve…
aybruhm Aug 1, 2024
60064ef
feat (backend): add endpoint to run evaluation on a specific evaluator
aybruhm Aug 1, 2024
0a66e26
refactor (backend): make use of ensure_event_loop utility function
aybruhm Aug 1, 2024
7d534fc
Merge branch 'rag' into feature/age-491-poc-1e-expose-running-evaluat…
aybruhm Aug 2, 2024
9ba496f
docs (backend): improve docstring in ensure_event_loop function
aybruhm Aug 2, 2024
bd9b2d3
minor refactor (build): replace use of 'docker-compose' to 'docker co…
aybruhm Aug 4, 2024
3f6b507
feat (backend): created evaluator mapping and input interfaces
aybruhm Aug 8, 2024
4a1604d
feat (backend): implemented endpoints to map experiment data tree to …
aybruhm Aug 8, 2024
55c727e
refactor (backend): update evaluator handlers to make use of new hand…
aybruhm Aug 8, 2024
165f7a7
Merge branch 'rag' into feature/age-491-poc-1e-expose-running-evaluat…
aybruhm Aug 9, 2024
6fb0b02
chore (backend): remove interfaces that are redundant
aybruhm Aug 9, 2024
0facf64
refactor (backend): convert evaluator functions to asynchronous
aybruhm Aug 12, 2024
b81d01b
refactor (backend): remove event_loop_utils module
aybruhm Aug 12, 2024
53bb2ed
refactor (backend): improve run evaluator endpoint to be asynchronous
aybruhm Aug 12, 2024
6f876f4
refactor (backend): update rag faithfulness and context relevancy eva…
aybruhm Aug 12, 2024
5214108
refactor (backend): ensure rag evaluators is compatible with evaluate…
aybruhm Aug 12, 2024
133fc73
refactor (tests): convert unit tests to async using pytest.mark.asyncio
aybruhm Aug 12, 2024
7e229b3
refactor (backend): resolve pr#1956 comments (r1714066376, r171406851…
aybruhm Aug 12, 2024
bbbb925
minor refactor (backend): remove redundant 'get_nested_value' function
aybruhm Aug 12, 2024
58760ee
Merge branch 'main' into feature/age-491-poc-1e-expose-running-evalua…
aybruhm Aug 13, 2024
dd805a2
refactor (backend): run evaluator_service 'evaluate' asynchronous fro…
aybruhm Aug 13, 2024
d3f7315
refactor (backend): improve error handling for auto_contains_json eva…
aybruhm Aug 13, 2024
08b9e87
feat (tests): add tests for dictionary-based output handling in conta…
aybruhm Aug 13, 2024
28320b8
refactor (backend): add check for OpenAI API key with clear exception…
aybruhm Aug 14, 2024
d8a1bbd
feat (tests): add test case for auto_ai_critique and evaluators requi…
aybruhm Aug 14, 2024
9c10025
feat (tests): added mock trace data for a simple finance assisstant
aybruhm Aug 18, 2024
03e28ee
feat (tests): created fixtures for evaluator experiment tree map and …
aybruhm Aug 18, 2024
6b6b6d6
feat (tests): created tests for evaluator experiment tree map and run…
aybruhm Aug 18, 2024
b808342
Merge branch 'main' into feature/age-491-poc-1e-expose-running-evalua…
aybruhm Aug 18, 2024
e02fefa
refactor (backend): rewrite db function to check if evaluators exist …
aybruhm Aug 19, 2024
4cee49f
chore (backend): remove deprecated function 'check_ai_critique_inputs'
aybruhm Aug 19, 2024
c6ee3c8
feat (backend): implemented helper functions to:
aybruhm Aug 19, 2024
a8c1273
refactor (backend): update evaluator_router to:
aybruhm Aug 19, 2024
f3367ef
feat (tests): added test to create evaluation with no llm keys
aybruhm Aug 20, 2024
c499a19
refactor (backend): added
aybruhm Aug 20, 2024
cc90567
Merge branch 'main' into feature/age-532-poc-1e-add-llm-api-key-check…
aybruhm Aug 20, 2024
f26dd78
Merge branch 'main' into feature/age-491-poc-1e-expose-running-evalua…
aybruhm Aug 20, 2024
7197942
Merge branch 'feature/age-491-poc-1e-expose-running-evaluators-via-ap…
aybruhm Aug 20, 2024
f0cc8c6
Merge branch 'feature/age-491-poc-1e-expose-running-evaluators-via-ap…
aybruhm Aug 20, 2024
d1fe5aa
chore (backend): remove redundant error message
aybruhm Aug 20, 2024
169a994
chore (backend): cleanup in levenshtein distance evaluato
aybruhm Aug 20, 2024
05ae4b5
Merge branch 'feature/age-491-poc-1e-expose-running-evaluators-via-ap…
aybruhm Aug 20, 2024
3fe306f
Merge branch 'main' into feature/age-491-poc-1e-expose-running-evalua…
aybruhm Aug 21, 2024
ac1ac7e
Merge branch 'feature/age-491-poc-1e-expose-running-evaluators-via-ap…
aybruhm Aug 21, 2024
c663fb4
style (website): format cookbooks with black@23.12.0
aybruhm Aug 21, 2024
9309f43
Merge branch 'feature/age-491-poc-1e-expose-running-evaluators-via-ap…
aybruhm Aug 21, 2024
23be8b6
refactor (backend): centralize validation of string and json output a…
aybruhm Aug 21, 2024
b6db4f1
feat (tests): update parameters for BaseResponse compatibility and re…
aybruhm Aug 21, 2024
80f3eff
minor refactor (backend): update 'validate_json_output' function retu…
aybruhm Aug 21, 2024
892a351
chore (style): format evaluators_service with black@23.12.0
aybruhm Aug 21, 2024
a4ecc3c
Merge branch 'feature/age-491-poc-1e-expose-running-evaluators-via-ap…
aybruhm Aug 21, 2024
33e6e17
refactor (backend): clean up LLM key checks in evaluators
aybruhm Aug 22, 2024
7c28f6d
chore (tests): add '@pytest.mark.asyncio' to test cases in test_user_…
aybruhm Aug 22, 2024
3cad5db
Enforce in Union[str, Dict[str, Any]] in BaseResponse in SDK
jp-agenta Aug 23, 2024
91d23d8
fix ai critique
jp-agenta Aug 23, 2024
cd2546a
initial commit: setup configure evaluator modal
bekossy Aug 23, 2024
31d10e1
minor refactor (backend): include ai_critique evaluator settings_valu…
aybruhm Aug 23, 2024
b224f10
chore (style): format evaluators_service with black@23.12.0
aybruhm Aug 23, 2024
ca81cea
minor refactor (backend): resolve ValueError when casting string to f…
aybruhm Aug 23, 2024
6238cd8
Merge branch 'main' of github.com:Agenta-AI/agenta
jp-agenta Aug 23, 2024
f3546ef
Merge branch 'main' into feature/age-573-evaluators-fail-gracefully-w…
jp-agenta Aug 23, 2024
2402f94
fix exception message and bump SDK out of pre-release
jp-agenta Aug 23, 2024
532a4bb
Merge pull request #1987 from Agenta-AI/feature/age-573-evaluators-fa…
jp-agenta Aug 23, 2024
824c96e
Merge branch 'main' into AGE-587/-implement-evaluation-main-page
bekossy Aug 23, 2024
c160b72
improved file structure(frontend)
bekossy Aug 24, 2024
d5eb285
design(frontend): added evaluator modal component steps
bekossy Aug 24, 2024
e8ee411
fix(frontend): passed prop
bekossy Aug 24, 2024
4a32a95
fix(frontend): added create new evaluator section
bekossy Aug 25, 2024
c2813fb
Merge branch 'main' into AGE-587/-implement-evaluation-main-page
bekossy Aug 25, 2024
0ce0022
Merge branch 'feature/age-491-poc-1e-expose-running-evaluators-via-ap…
aybruhm Aug 26, 2024
cc33a66
Update evaluators_service.py
jp-agenta Aug 26, 2024
bfe4cdb
fix(frontend): modified configure evaluator state to use query param,…
bekossy Aug 26, 2024
8ba96f2
fix(frontend): improved config evaluator modal
bekossy Aug 27, 2024
b4ca6fc
fix(frontend): added 600px height to fixed overflow in config evaluat…
bekossy Aug 27, 2024
ce455b1
fix(frontend): added fetch variants state
bekossy Aug 27, 2024
59d64f6
design(frontend): added select variant modal in evaluator config modal
bekossy Aug 27, 2024
a75d801
fix(frontend): displayed evaluator table content
bekossy Aug 27, 2024
ebb5301
design(frontend): ui improvements
bekossy Aug 27, 2024
bf51d9b
fix(frontend): added helper func to get single model and ab testing e…
bekossy Aug 28, 2024
c2580a0
design(frontend): added human evaluation tabs
bekossy Aug 28, 2024
8dd6ab4
fix(frontend): fetched evaluation result list and passed result to ea…
bekossy Aug 28, 2024
9212c7b
Merge pull request #1989 from Agenta-AI/feature/age-532-poc-1e-add-ll…
aybruhm Aug 29, 2024
a338053
fix(frontend): fixed config evaluator modal alignment and updated hum…
bekossy Aug 29, 2024
f563103
Merge pull request #1956 from Agenta-AI/feature/age-491-poc-1e-expose…
bekossy Aug 30, 2024
c55d1e4
Merge branch 'AGE-587/-implement-evaluation-main-page' of https://git…
bekossy Aug 30, 2024
35d5eb3
ui(frontend): implemented human evaluations
ashrafchowdury Aug 30, 2024
d6aadf5
fix(frontend): added evaluator mapping endpoints
bekossy Aug 30, 2024
41745ff
design(frontend): improved config evaluator ui
bekossy Aug 31, 2024
244e8bd
fix(frontend): implemented select test case functionality
bekossy Sep 1, 2024
54a0d49
fix(frontend): added conditional to select test case
bekossy Sep 1, 2024
704ef9b
fix: resolved merge conflict
ashrafchowdury Sep 2, 2024
c8bc24e
refactor: improved structure
ashrafchowdury Sep 2, 2024
7d0acf4
fix(frontend): implemented run variant functionality
bekossy Sep 2, 2024
528d223
refactor(frontend): rename directory to improve clarity and context
bekossy Sep 2, 2024
fe0bc46
refactor(frontend): removed unused codes
ashrafchowdury Sep 3, 2024
0a02a48
fix(frontend): failing cypress tests
ashrafchowdury Sep 3, 2024
90afba5
Merge branch 'main' into AGE-587/-implement-evaluation-main-page
bekossy Sep 3, 2024
e91e37b
fix: invalid import in evaluation router and improved editor
bekossy Sep 3, 2024
4732b7b
Merge branch 'AGE-587/-implement-evaluation-main-page' into feat/impl…
bekossy Sep 3, 2024
959e773
test(frontend): tests for eval tabs
ashrafchowdury Sep 3, 2024
538b852
fix(frontend): prettier error
ashrafchowdury Sep 3, 2024
60d237e
Merge pull request #2047 from Agenta-AI/feat/implement-human-evaluations
bekossy Sep 3, 2024
f9c2c39
refactor(frontend): moved annotations to evaluations dir and cleanup
bekossy Sep 3, 2024
e4c09b9
Merge branch 'main' of https://github.com/Agenta-AI/agenta into enhan…
ashrafchowdury Sep 3, 2024
0a7e551
ui(frontend): automatic eval funcational table
ashrafchowdury Sep 3, 2024
71c36e2
Merge branch 'AGE-587/-implement-evaluation-main-page' of https://git…
ashrafchowdury Sep 3, 2024
1fe2bab
fix(frontend): search issue with numbers
ashrafchowdury Sep 3, 2024
4fc0fbf
feat(frontend): added helper function to transfor trace tree to json
bekossy Sep 3, 2024
0ca0bc1
minor naming improvement
bekossy Sep 3, 2024
32ea255
Merge branch 'AGE-587/-implement-evaluation-main-page' of https://git…
ashrafchowdury Sep 4, 2024
cacad57
ui(frontend): added table results column
ashrafchowdury Sep 4, 2024
a1cbbd6
refactor(frontend): removed unsed code
ashrafchowdury Sep 4, 2024
f3c541b
refactor(frontend): improved component name for clarify
bekossy Sep 4, 2024
dc2067e
fix(frontend): transform trace tree and setup mapping
bekossy Sep 4, 2024
c7f1afd
fix(frontend): failing cypress test due to layout change
ashrafchowdury Sep 5, 2024
827680e
enhance(frontend): improved structure
ashrafchowdury Sep 5, 2024
d8bca04
fix(frontend): prettier format
ashrafchowdury Sep 5, 2024
baf3e68
fix(frontend): lint error
ashrafchowdury Sep 5, 2024
2410b3f
feat(frontend): implemented run evaluator logic and updated Evaluator…
bekossy Sep 5, 2024
c620759
Merge branch 'main' into AGE-587/-implement-evaluation-main-page
bekossy Sep 5, 2024
af8728e
Merge branch 'AGE-587/-implement-evaluation-main-page' into enhance/a…
bekossy Sep 5, 2024
cc8cf94
Merge branch 'AGE-587/-implement-evaluation-main-page' of https://git…
ashrafchowdury Sep 5, 2024
a1b048d
fix(backend): Updated Evaluator model to requires_llm_api_keys field
bekossy Sep 5, 2024
910999d
refactor(frontend): improve handling of testcase mapping and evaluato…
bekossy Sep 5, 2024
d04631c
Merge branch 'enhance/automatic-tab-functionalitis' of https://github…
ashrafchowdury Sep 5, 2024
aecbc5b
fix(backend): fixed rag evaluator inputs and bug in exact match evalu…
bekossy Sep 5, 2024
7e0c469
refactor(frontend): clean up and optimize fetchEvalMapper logic
bekossy Sep 5, 2024
751245d
fix(frontend): bug fixes
bekossy Sep 5, 2024
c43e6c1
fix(frontend): added EvaluationErrorPopover component and improve eva…
bekossy Sep 5, 2024
11bcc90
Merge branch 'enhance/automatic-tab-functionalitis' of https://github…
ashrafchowdury Sep 6, 2024
21f193e
fix(frontend): run variant with chat template
bekossy Sep 6, 2024
9cf40f7
enhance(frontend): edit columns and short columns
ashrafchowdury Sep 6, 2024
bbaf1e9
Merge branch 'AGE-587/-implement-evaluation-main-page' of https://git…
ashrafchowdury Sep 6, 2024
2ab3da3
fix(frontend): fixed status update issue
ashrafchowdury Sep 6, 2024
8c3cadf
fix(frontend): update json variant result, json field title and updat…
bekossy Sep 6, 2024
2985804
refactor(frontend): added hepler to transform trace settings to remov…
bekossy Sep 7, 2024
10db7b7
fix(frontend): filter empty/falsy values from json data output
bekossy Sep 7, 2024
72e8be4
feat(frontend): setup view configuration and deletion
bekossy Sep 7, 2024
84f7cba
feat(frontend): setup open config and clone features
bekossy Sep 8, 2024
55828ae
fix(frontend): removed old evaluators code
bekossy Sep 8, 2024
51ebdfe
minor fix
bekossy Sep 8, 2024
fea95b7
Merge branch 'main' into AGE-587/-implement-evaluation-main-page
bekossy Sep 8, 2024
0e88e11
feat(frontend): generated color for evaluator configs and updated types
bekossy Sep 8, 2024
db8793e
refactor(frontend): set evaluator display default to list and ui impr…
bekossy Sep 8, 2024
2d13427
Merge branch 'AGE-587/-implement-evaluation-main-page' of https://git…
ashrafchowdury Sep 8, 2024
0b8dbe7
Merge branch 'AGE-587/-implement-evaluation-main-page' into enhance/a…
bekossy Sep 8, 2024
10098ab
fix(frontend): removed bad code
bekossy Sep 8, 2024
74c32ea
Merge branch 'enhance/automatic-tab-functionalitis' of https://github…
ashrafchowdury Sep 9, 2024
b8480d3
test(frontend): fixed evaluator tests
ashrafchowdury Sep 9, 2024
27e3f84
fix(backend): updated auto_custom_code_run default code
bekossy Sep 9, 2024
4d1ae19
fix(frontend): added maxWidth to configuration form, disabled editor …
bekossy Sep 9, 2024
3d856ee
Merge branch 'AGE-587/-implement-evaluation-main-page' into enhance/a…
bekossy Sep 9, 2024
44a66ba
fix(frontend): icons placement issue
ashrafchowdury Sep 9, 2024
5ebeb6c
fix(frontend): improved StatusRenderer logic to update status count
bekossy Sep 9, 2024
62b5745
Merge pull request #2058 from Agenta-AI/enhance/automatic-tab-functio…
bekossy Sep 9, 2024
699f6ce
fix(frontend): evaluator category button border left style
bekossy Sep 9, 2024
38c0cb7
fix(frontend): cleanup
bekossy Sep 9, 2024
4fe5e1f
feat(frontend/backend): enabled evaluator filtering by category
bekossy Sep 9, 2024
275df9d
Merge branch 'main' into AGE-587/-implement-evaluation-main-page
bekossy Sep 10, 2024
2517a07
fix(frontend): new cypress tests failures
ashrafchowdury Sep 10, 2024
cf37bb5
fix(frontend): select new evaluator table column issue
ashrafchowdury Sep 10, 2024
0b832ee
fix(frontend): added tags and improved styles
bekossy Sep 10, 2024
5fe9aa6
fix(frontend): replaced version static in evaluator card with created…
bekossy Sep 10, 2024
68414aa
design(frontend): enabled hover effect and displayed arrow-right on h…
bekossy Sep 10, 2024
05c18fb
fix(frontend): displayed updated_at in evaluator card
bekossy Sep 10, 2024
65f0a90
fix(frontend): improved naming, removed evaluator card view and impro…
bekossy Sep 11, 2024
3f6f0ae
fix(frontend): set font family
bekossy Sep 11, 2024
d804edc
fix(frontend): fixed cypress test
bekossy Sep 11, 2024
c5efd0e
fix(frontend): modified font family
bekossy Sep 11, 2024
65ac8a9
fix(frontend): moved debug evaluator feature code to cloud
bekossy Sep 12, 2024
7f737c9
fix(frontend): removed testcase tab
bekossy Sep 12, 2024
c4a3e53
minor refactor (backend): explictly get testcase correct_answer key
aybruhm Sep 13, 2024
381c628
minor refactor (tests): update the run inputs of rag_faithfulness eva…
aybruhm Sep 13, 2024
f785729
Merge branch 'AGE-587/-implement-evaluation-main-page' of https://git…
bekossy Sep 13, 2024
973c256
fix(frontend): removed checkbox and tags columns from evaluator table…
bekossy Sep 14, 2024
6ebab98
refactor(frontend): modified test toggle button icon
bekossy Sep 14, 2024
6c15e42
design(frontend): added transition and conditional to evaluator modal
bekossy Sep 14, 2024
c1f14df
fix(frontend): moved selected testset state to parent level
bekossy Sep 14, 2024
cea7ab9
minor fix
bekossy Sep 14, 2024
d1dc894
Merge branch 'main' into AGE-587/-implement-evaluation-main-page
bekossy Sep 14, 2024
f8ebba1
bug fix(frontend)
bekossy Sep 14, 2024
75ab1ce
code cleanup(frontend)
bekossy Sep 15, 2024
b186838
test(frontend): fixed failed evaluator test
ashrafchowdury Sep 16, 2024
246e467
fix(frontend): fixed edit column functionality to have all cols selec…
bekossy Sep 16, 2024
a425cd5
fix(frontend): changed testcase state to an object
bekossy Sep 17, 2024
2f3a5de
fix(frontend): edit columns and short dates
ashrafchowdury Sep 17, 2024
68e8e9e
test(frontend): fixes failed evaluation test
ashrafchowdury Sep 17, 2024
9b466a0
fix(frontend): edit column checked by default
bekossy Sep 18, 2024
dc4c978
design(frontend): change evaluators view title to 16px and added divi…
bekossy Sep 19, 2024
bac7a51
fix(backend): removed correct_answer_key in settings_template for aut…
bekossy Sep 19, 2024
675169e
cleanup(frontend)
bekossy Sep 20, 2024
e16c31e
fix(frontend): chat inputs not shown in ab testing and single model e…
bekossy Sep 22, 2024
10797f4
test(frontend): fixed failed evaluator test
ashrafchowdury Sep 23, 2024
25 changes: 24 additions & 1 deletion agenta-backend/agenta_backend/models/api/evaluation_model.py
@@ -1,7 +1,9 @@
from enum import Enum
from datetime import datetime
from pydantic import BaseModel
from typing import Optional, List, Dict, Any, Union

from pydantic import BaseModel, Field, model_validator

from agenta_backend.models.api.api_models import Result


@@ -12,6 +14,8 @@ class Evaluator(BaseModel):
settings_template: dict
description: Optional[str] = None
oss: Optional[bool] = False
requires_llm_api_keys: Optional[bool] = False
tags: List[str]


class EvaluatorConfig(BaseModel):
@@ -80,6 +84,25 @@ class Evaluation(BaseModel):
updated_at: datetime


class EvaluatorInputInterface(BaseModel):
inputs: Dict[str, Any] = Field(default_factory=dict)
settings: Optional[Dict[str, Any]] = None
credentials: Optional[Dict[str, Any]] = None


class EvaluatorOutputInterface(BaseModel):
outputs: Dict[str, Any]


class EvaluatorMappingInputInterface(BaseModel):
inputs: Dict[str, Any]
mapping: Dict[str, Any]


class EvaluatorMappingOutputInterface(BaseModel):
outputs: Dict[str, Any]


class SimpleEvaluationOutput(BaseModel):
id: str
variant_ids: List[str]
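
For orientation, here is a minimal sketch of how the new interface models above might be instantiated when debugging an evaluator. The field names used inside inputs and mapping are hypothetical and not taken from this PR.

# Illustrative usage of the new evaluator I/O models (assumed values).
from agenta_backend.models.api.evaluation_model import (
    EvaluatorInputInterface,
    EvaluatorMappingInputInterface,
)

# Payload for running a single evaluator: inputs are required; settings and credentials are optional.
run_input = EvaluatorInputInterface(
    inputs={"prediction": "Paris", "ground_truth": "Paris"},  # hypothetical keys
)

# Payload for mapping an experiment/trace data tree onto an evaluator's expected inputs.
map_input = EvaluatorMappingInputInterface(
    inputs={"trace": {"output": "Paris"}},       # hypothetical trace fragment
    mapping={"prediction": "trace.output"},      # hypothetical dotted-path mapping
)
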
51 changes: 40 additions & 11 deletions agenta-backend/agenta_backend/resources/evaluators/evaluators.py
@@ -29,18 +29,10 @@
"name": "Exact Match",
"key": "auto_exact_match",
"direct_use": True,
"settings_template": {
"correct_answer_key": {
"label": "Expected Answer Column",
"default": "correct_answer",
"type": "string",
"advanced": True, # Tells the frontend that this setting is advanced and should be hidden by default
"ground_truth_key": True, # Tells the frontend that is the name of the column in the test set that should be shown as a ground truth to the user
"description": "The name of the column in the test data that contains the correct answer",
},
},
"settings_template": {},
"description": "Exact Match evaluator determines if the output exactly matches the specified correct answer, ensuring precise alignment with expected results.",
"oss": True,
"tags": ["functional"],
},
{
"name": "Contains JSON",
@@ -49,6 +41,7 @@
"settings_template": {},
"description": "'Contains JSON' evaluator checks if the output contains the a valid JSON.",
"oss": True,
"tags": ["functional", "classifiers"],
},
{
"name": "Similarity Match",
@@ -75,11 +68,13 @@
},
"description": "Similarity Match evaluator checks if the generated answer is similar to the expected answer. You need to provide the similarity threshold. It uses the Jaccard similarity to compare the answers.",
"oss": True,
"tags": ["similarity", "functional"],
},
{
"name": "Semantic Similarity Match",
"key": "auto_semantic_similarity",
"direct_use": False,
"requires_llm_api_keys": True,
"description": "Semantic Similarity Match evaluator measures the similarity between two pieces of text by analyzing their meaning and context. It compares the semantic content, providing a score that reflects how closely the texts match in terms of meaning, rather than just exact word matches.",
"settings_template": {
"correct_answer_key": {
@@ -92,6 +87,7 @@
},
},
"oss": True,
"tags": ["similarity", "ai_llm"],
},
{
"name": "Regex Test",
@@ -114,6 +110,7 @@
},
},
"oss": True,
"tags": ["classifiers", "functional"],
},
{
"name": "JSON Field Match",
@@ -138,6 +135,7 @@
},
"description": "JSON Field Match evaluator compares specific fields within JSON (JavaScript Object Notation) data. This matching can involve finding similarities or correspondences between fields in different JSON objects.",
"oss": True,
"tags": ["functional"],
},
{
"name": "JSON Diff Match",
@@ -176,11 +174,13 @@
},
},
"oss": True,
"tags": ["similarity", "functional"],
},
{
"name": "LLM-as-a-judge",
"key": "auto_ai_critique",
"direct_use": False,
"requires_llm_api_keys": True,
"settings_template": {
"prompt_template": {
"label": "Prompt Template",
@@ -200,16 +200,25 @@
},
"description": "AI Critique evaluator sends the generated answer and the correct_answer to an LLM model and uses it to evaluate the correctness of the answer. You need to provide the evaluation prompt (or use the default prompt).",
"oss": True,
"tags": ["ai_llm", "functional"],
},
{
"name": "Code Evaluation",
"key": "auto_custom_code_run",
"direct_use": False,
"settings_template": {
"requires_llm_api_keys": {
"label": "Requires LLM API Key(s)",
"type": "boolean",
"required": True,
"default": False,
"advanced": True,
"description": "Indicates whether the evaluation requires LLM API key(s) to function.",
},
"code": {
"label": "Evaluation Code",
"type": "code",
"default": "from typing import Dict\n\ndef evaluate(\n app_params: Dict[str, str],\n inputs: Dict[str, str],\n output: Union[str, Dict[str, Any]], # output of the llm app\n datapoint: Dict[str, str] # contains the testset row \n) -> float:\n if output in datapoint.get('correct_answer', None):\n return 1.0\n else:\n return 0.0\n",
"default": "from typing import Dict, Union, Any\n\ndef evaluate(\n app_params: Dict[str, str],\n inputs: Dict[str, str],\n output: Union[str, Dict[str, Any]], # output of the llm app\n correct_answer: str # contains the testset row \n) -> float:\n if output in correct_answer:\n return 1.0\n else:\n return 0.0\n",
"description": "Code for evaluating submissions",
"required": True,
},
@@ -224,12 +233,21 @@
},
"description": "Code Evaluation allows you to write your own evaluator in Python. You need to provide the Python code for the evaluator.",
"oss": True,
"tags": ["functional"],
},
{
"name": "Webhook test",
"key": "auto_webhook_test",
"direct_use": False,
"settings_template": {
"requires_llm_api_keys": {
"label": "Requires LLM API Key(s)",
"type": "boolean",
"required": True,
"default": False,
"advanced": True,
"description": "Indicates whether the evaluation requires LLM API key(s) to function.",
},
"webhook_url": {
"label": "Webhook URL",
"type": "string",
@@ -247,6 +265,7 @@
},
"description": "Webhook test evaluator sends the generated answer and the correct_answer to a webhook and expects a response, in JSON format, indicating the correctness of the answer, along with a 200 HTTP status. You need to provide the URL of the webhook and the response of the webhook must be between 0 and 1.",
"oss": True,
"tags": ["functional"],
},
{
"name": "Starts With",
@@ -268,6 +287,7 @@
},
"description": "Starts With evaluator checks if the output starts with a specified prefix, considering case sensitivity based on the settings.",
"oss": True,
"tags": ["classifiers", "functional"],
},
{
"name": "Ends With",
@@ -289,6 +309,7 @@
},
"description": "Ends With evaluator checks if the output ends with a specified suffix, considering case sensitivity based on the settings.",
"oss": True,
"tags": ["classifiers", "functional"],
},
{
"name": "Contains",
@@ -310,6 +331,7 @@
},
"description": "Contains evaluator checks if the output contains a specified substring, considering case sensitivity based on the settings.",
"oss": True,
"tags": ["classifiers", "functional"],
},
{
"name": "Contains Any",
@@ -331,6 +353,7 @@
},
"description": "Contains Any evaluator checks if the output contains any of the specified substrings from a comma-separated list, considering case sensitivity based on the settings.",
"oss": True,
"tags": ["classifiers", "functional"],
},
{
"name": "Contains All",
@@ -352,6 +375,7 @@
},
"description": "Contains All evaluator checks if the output contains all of the specified substrings from a comma-separated list, considering case sensitivity based on the settings.",
"oss": True,
"tags": ["classifiers", "functional"],
},
{
"name": "Levenshtein Distance",
@@ -375,20 +399,25 @@
},
"description": "This evaluator calculates the Levenshtein distance between the output and the correct answer. If a threshold is provided in the settings, it returns a boolean indicating whether the distance is within the threshold. If no threshold is provided, it returns the actual Levenshtein distance as a numerical value.",
"oss": True,
"tags": ["functional"],
},
{
"name": "RAG Faithfulness",
"key": "rag_faithfulness",
"direct_use": False,
"requires_llm_api_keys": True,
"settings_template": rag_evaluator_settings_template,
"description": "RAG Faithfulness evaluator assesses the accuracy and reliability of responses generated by Retrieval-Augmented Generation (RAG) models. It evaluates how faithfully the responses adhere to the retrieved documents or sources, ensuring that the generated text accurately reflects the information from the original sources.",
"tags": ["rag"],
},
{
"name": "RAG Context Relevancy",
"key": "rag_context_relevancy",
"direct_use": False,
"requires_llm_api_keys": True,
"settings_template": rag_evaluator_settings_template,
"description": "RAG Context Relevancy evaluator measures how relevant the retrieved documents or contexts are to the given question or prompt. It ensures that the selected documents provide the necessary information for generating accurate and meaningful responses, improving the overall quality of the RAG model's output.",
"tags": ["rag"],
},
]

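
For readability, the updated default template for the Code Evaluation evaluator (stored above as a single escaped string) unpacks to roughly the following; indentation is reconstructed, since the stored value carries only newline escapes.

from typing import Dict, Union, Any

def evaluate(
    app_params: Dict[str, str],
    inputs: Dict[str, str],
    output: Union[str, Dict[str, Any]],  # output of the llm app
    correct_answer: str  # contains the testset row
) -> float:
    if output in correct_answer:
        return 1.0
    else:
        return 0.0
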
13 changes: 6 additions & 7 deletions agenta-backend/agenta_backend/routers/evaluation_router.py
@@ -5,6 +5,7 @@
from fastapi.responses import JSONResponse
from fastapi import HTTPException, Request, status, Response, Query

from agenta_backend.services import helpers
from agenta_backend.models import converters
from agenta_backend.tasks.evaluations import evaluate
from agenta_backend.utils.common import APIRouter, isCloudEE
@@ -14,9 +15,6 @@
NewEvaluation,
DeleteEvaluation,
)
from agenta_backend.services.evaluator_manager import (
check_ai_critique_inputs,
)
from agenta_backend.services import evaluation_service, db_manager, app_manager

if isCloudEE():
@@ -112,8 +110,9 @@ async def create_evaluation(
status_code=403,
)

success, response = await check_ai_critique_inputs(
payload.evaluators_configs, payload.lm_providers_keys
llm_provider_keys = helpers.format_llm_provider_keys(payload.lm_providers_keys)
success, response = await helpers.ensure_required_llm_keys_exist(
payload.evaluators_configs, llm_provider_keys
)
if not success:
return response
@@ -134,8 +133,8 @@ async def create_evaluation(
evaluators_config_ids=payload.evaluators_configs,
testset_id=payload.testset_id,
evaluation_id=evaluation.id,
rate_limit_config=payload.rate_limit.dict(),
lm_providers_keys=payload.lm_providers_keys,
rate_limit_config=payload.rate_limit.model_dump(),
lm_providers_keys=llm_provider_keys,
)
evaluations.append(evaluation)

69 changes: 68 additions & 1 deletion agenta-backend/agenta_backend/routers/evaluators_router.py
@@ -1,17 +1,27 @@
import logging
import traceback

from typing import List
from fastapi import HTTPException, Request
from fastapi.responses import JSONResponse

from agenta_backend.utils.common import APIRouter, isCloudEE
from agenta_backend.services import evaluator_manager, db_manager, app_manager
from agenta_backend.services import (
evaluator_manager,
db_manager,
evaluators_service,
app_manager,
)

from agenta_backend.models.api.evaluation_model import (
Evaluator,
EvaluatorConfig,
NewEvaluatorConfig,
UpdateEvaluatorConfig,
EvaluatorInputInterface,
EvaluatorOutputInterface,
EvaluatorMappingInputInterface,
EvaluatorMappingOutputInterface,
)

if isCloudEE():
@@ -47,6 +57,63 @@ async def get_evaluators_endpoint():
raise HTTPException(status_code=500, detail=str(e))


@router.post("/map/", response_model=EvaluatorMappingOutputInterface)
async def evaluator_data_map(request: Request, payload: EvaluatorMappingInputInterface):
"""Endpoint to map the experiment data tree to evaluator interface.
Args:
request (Request): The request object.
payload (EvaluatorMappingInputInterface): The payload containing the request data.
Returns:
EvaluatorMappingOutputInterface: the evaluator mapping output object
"""

try:
mapped_outputs = await evaluators_service.map(mapping_input=payload)
return mapped_outputs
except Exception as e:
logger.error(f"Error mapping data tree: {str(e)}")
raise HTTPException(
status_code=500,
detail={
"message": "Error mapping data tree",
"stacktrace": traceback.format_exc(),
},
)


@router.post("/{evaluator_key}/run/", response_model=EvaluatorOutputInterface)
async def evaluator_run(
request: Request, evaluator_key: str, payload: EvaluatorInputInterface
):
"""Endpoint to evaluate LLM app run
Args:
request (Request): The request object.
evaluator_key (str): The key of the evaluator.
payload (EvaluatorInputInterface): The payload containing the request data.
Returns:
result: EvaluatorOutputInterface object containing the outputs.
"""

try:
result = await evaluators_service.run(
evaluator_key=evaluator_key, evaluator_input=payload
)
return result
except Exception as e:
logger.error(f"Error while running {evaluator_key} evaluator: {str(e)}")
raise HTTPException(
status_code=500,
detail={
"message": f"Error while running {evaluator_key} evaluator",
"stacktrace": traceback.format_exc(),
},
)


@router.get("/configs/", response_model=List[EvaluatorConfig])
async def get_evaluator_configs(app_id: str, request: Request):
"""Endpoint to fetch evaluator configurations for a specific app.
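
Taken together, the new /map/ and /{evaluator_key}/run/ routes can be exercised roughly as follows. This is a sketch only: the base URL, router prefix, and the keys used inside inputs and mapping are assumptions for illustration, not part of this diff; auto_exact_match is one of the evaluator keys defined in evaluators.py above.

# Illustrative calls to the new debugging endpoints (assumed URL prefix and payload keys).
import requests

BASE = "http://localhost/api/evaluators"  # assumed mount point for this router

# 1) Map an experiment data tree onto evaluator inputs.
mapped = requests.post(
    f"{BASE}/map/",
    json={
        "inputs": {"trace": {"output": "Paris"}},   # hypothetical trace fragment
        "mapping": {"prediction": "trace.output"},  # hypothetical mapping
    },
).json()  # expected shape: {"outputs": {...}}

# 2) Run a single evaluator on concrete values.
result = requests.post(
    f"{BASE}/auto_exact_match/run/",
    json={
        "inputs": {"prediction": "Paris", "ground_truth": "Paris"},  # hypothetical keys
    },
).json()  # expected shape: {"outputs": {...}}

print(mapped["outputs"], result["outputs"])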