
[feature] Evaluators - Debugging #2071

Merged · 193 commits · Sep 23, 2024

Conversation

@bekossy (Member) commented on Sep 10, 2024

Description

This PR enhances the Evaluation UI for both auto and human evaluations, improving user experience and workflow efficiency.

Related Issue

Closes AGE-587

Related PR

Commons PR #108

Key Changes

  • Auto-Evaluation: Redesigned interface and added filters for easier navigation.
  • Human-Evaluation: Improved flow, feedback, and layout for better usability.
  • Evaluator Management: Full UI overhaul and a new evaluator debug/test feature.

QA Instructions

  • Evaluators:

    • Create New Evaluators: Ensure new evaluators can be created without issues.
    • Test Evaluator Debug Feature: Thoroughly test the debug feature by selecting a test set and a variant, and confirm the variant runs successfully (a hypothetical request sketch follows this list).
    • Customize Evaluator Output: Verify that the advanced settings allow for proper customization of the evaluator output.
    • Use Evaluator Filters: Apply all available filters to ensure you can accurately find the desired evaluator in the suggestions list.
    • Perform CRUD Operations: Test create, read, update, and delete operations on evaluators to confirm full functionality.
  • Auto-Evaluation:

    • Create New Evaluations: Confirm that new auto-evaluations can be created successfully.
    • Status Update Check: Verify that the evaluation status updates correctly and does not remain stuck at 0s.
    • Filters, Sorting, and Editing: Test all filter, sort, and edit functions to ensure they work as expected.
    • Batch Evaluation Creation: Successfully create multiple evaluations at once and confirm they are processed correctly.
    • UI Interaction: Interact with every visible element in the UI and ensure everything functions properly.
  • Human Evaluations:

    • Create New Human Evaluations: Test the successful creation of new human evaluations.
    • Delete Evaluations: Ensure that both single and multiple evaluations can be deleted without errors.
    • UI Interaction: Engage with every visible UI element and confirm all components are working as intended.
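To make the debug-feature check reproducible outside the UI, here is a minimal sketch of what the underlying request could look like. The endpoint path (`/evaluators/{evaluator_key}/run/`), the payload fields (`inputs`, `output`, `settings`), and the `auto_contains_json` key are placeholders assumed for illustration, not the confirmed backend contract.

```typescript
// Hypothetical smoke test for the evaluator debug endpoint.
// URL, payload shape, and evaluator key are assumptions; substitute
// the actual route and fields exposed by the backend.

interface EvaluatorRunPayload {
  inputs: Record<string, string>;    // test-set row values fed to the variant
  output: string;                    // the variant's generated output
  settings: Record<string, unknown>; // evaluator-specific settings (e.g. regex, threshold)
}

async function runEvaluatorDebug(
  baseUrl: string,
  evaluatorKey: string,
  payload: EvaluatorRunPayload,
): Promise<unknown> {
  // Assumed endpoint path; adjust to the real route.
  const res = await fetch(`${baseUrl}/evaluators/${evaluatorKey}/run/`, {
    method: "POST",
    headers: {"Content-Type": "application/json"},
    body: JSON.stringify(payload),
  });
  if (!res.ok) {
    throw new Error(`Evaluator run failed: ${res.status} ${res.statusText}`);
  }
  return res.json();
}

// Example usage with placeholder values:
runEvaluatorDebug("http://localhost/api", "auto_contains_json", {
  inputs: {country: "France"},
  output: '{"capital": "Paris"}',
  settings: {},
}).then((result) => console.log("Evaluator result:", result));
```

If a request along these lines returns an evaluator result for the supplied output, the debug flow is wired end to end; a non-2xx status should surface as an error state in the UI.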

Labels
Backend · Frontend · UI · UX · enhancement (New feature or request) · lgtm (This PR has been approved by a maintainer) · size:XXL (This PR changes 1000+ lines, ignoring generated files)