Evaluation docs Overhaul #2061

Merged: 24 commits, Sep 10, 2024

Commits
10e4e67
docs(frontend): AGE-517 improve card ui
mmabrouk Sep 5, 2024
ab33ae4
docs(app): AGE-517 add overview page (incomplete)
mmabrouk Sep 5, 2024
066607e
docs(app): AGE-517 update test set page
mmabrouk Sep 5, 2024
a2b9ee8
docs(app): AGE-517 rename doc files (restructuring eval)
mmabrouk Sep 5, 2024
5e3117d
docs(app): AGE-517 update test set docs
mmabrouk Sep 6, 2024
e898294
docs(frontend): AGE-517 fix detail component style
mmabrouk Sep 6, 2024
9c75446
docs(app): AGE-517 add list evaluators to overview
mmabrouk Sep 6, 2024
aa99190
docs(app): AGE-517 added new configure evaluators page
mmabrouk Sep 6, 2024
676894c
docs(app): AGE-517 restructured evals
mmabrouk Sep 6, 2024
80299ab
docs(app): AGE-517 ui-evaluation docs
mmabrouk Sep 7, 2024
71190de
docs(app): AGE-517 minor update eval overview page
mmabrouk Sep 7, 2024
d541f6f
docs(app): AGE-517 simplify configure eval page
mmabrouk Sep 7, 2024
299fe4e
docs(app): AGE-517 renamed ui eval to no-code eval
mmabrouk Sep 7, 2024
58e629d
docs(app): AGE-517 added sdk eval page
mmabrouk Sep 7, 2024
68512c0
docs(app): AGE-517 no code eval image udpate
mmabrouk Sep 7, 2024
50cb9d1
docs(app): AGE-517 added llm as a judge
mmabrouk Sep 7, 2024
18f3612
docs(app): AGE-517 updated custom evaluator
mmabrouk Sep 7, 2024
ee91840
docs(app): AGE-517 updated webhook evals
mmabrouk Sep 7, 2024
ea21147
docs(app): AGE-517 added pattern matching docs
mmabrouk Sep 7, 2024
88c69e3
docs(app): AGE-517 added semantic sim
mmabrouk Sep 8, 2024
49fc4d2
docs(app): AGE-517 restructure
mmabrouk Sep 8, 2024
e49e623
docs(app): AGE-517 classification eval docs
mmabrouk Sep 9, 2024
c25a169
docs(app): AGE-517 update links
mmabrouk Sep 9, 2024
dabd9f8
docs(app): AGE-517 minor fix
mmabrouk Sep 9, 2024
90 changes: 90 additions & 0 deletions docs/docs/evaluation/01-overview.mdx
@@ -0,0 +1,90 @@
---
title: "Overview"
description: Systematically evaluate your LLM applications and compare their performance.
sidebar_position: 1
---

```mdx-code-block
import DocCard from '@theme/DocCard';
import clsx from 'clsx';

```

The key to building production-ready LLM applications is a tight feedback loop of prompt engineering and evaluation. Whether you are optimizing a chatbot, working on Retrieval-Augmented Generation (RAG), or fine-tuning a text generation task, evaluation is a critical step to ensure consistent performance across different inputs, models, and parameters. In this section, we explain how to use Agenta to quickly evaluate and compare the performance of your LLM applications.

### Set up evaluation

<section className='row'>
<article key='1' className="col col--6 margin-bottom--lg">

<DocCard
item={{
type: "link",
href: "/evaluation/configure-evaluators",
label: "Configure Evaluators",
description: "Configure evaluators for your use case",
}}
/>
</article>

<article key='2' className="col col--6 margin-bottom--lg">
<DocCard
item={{
type: "link",
href: "/evaluation/create-test-sets",
label: "Create Test Sets",
description: "Create Test Sets",
}}
/>
</article>
</section>

### Run evaluations

<section className='row'>

<article key="1" className="col col--6 margin-bottom--lg">
<DocCard
item={{
type: "link",
href: "/evaluation/overview",
label: "Run Evaluations from the UI",
description: "Learn about the evaluation process in Agenta",
}}
/>
</article>

<article key='2' className="col col--6 margin-bottom--lg">
<DocCard
item={{
type: "link",
href: "/evaluation/overview",
label: "Run Evaluations with the SDK",
description: "Learn about the evaluation process in Agenta",
}}
/>
</article>
</section>

### Available evaluators

| **Evaluator Name** | **Use Case** | **Type** | **Description** |
| ------------------------------------------------------------------------------------------------- | -------------------------------- | ------------------ | -------------------------------------------------------------------------------- |
| [Exact Match](/evaluation/evaluators/classification-entiry-extraction#exact-match) | Classification/Entity Extraction | Pattern Matching | Checks if the output exactly matches the expected result. |
| [Contains JSON](/evaluation/evaluators/classification-entiry-extraction#contains-json) | Classification/Entity Extraction | Pattern Matching | Ensures the output contains valid JSON. |
| [Regex Test](/evaluation/evaluators/pattern-matching#regular-expression) | Classification/Entity Extraction | Pattern Matching | Checks if the output matches a given regex pattern. |
| [JSON Field Match](/evaluation/evaluators/classification-entiry-extraction#json-field-match) | Classification/Entity Extraction | Pattern Matching | Compares specific fields within JSON data. |
| [JSON Diff Match](/evaluation/evaluators/classification-entiry-extraction#json-diff-match) | Classification/Entity Extraction | Similarity Metrics | Compares generated JSON with a ground truth JSON based on schema or values. |
| [Similarity Match](/evaluation/evaluators/semantic-similarity#similarity-match) | Text Generation / Chatbot | Similarity Metrics | Compares generated output with expected using Jaccard similarity. |
| [Semantic Similarity Match](/evaluation/evaluators/semantic-similarity#semantic-similarity-match) | Text Generation / Chatbot | Semantic Analysis | Compares the meaning of the generated output with the expected result. |
| [Starts With](/evaluation/evaluators/pattern-matching#starts-with) | Text Generation / Chatbot | Pattern Matching | Checks if the output starts with a specified prefix. |
| [Ends With](/evaluation/evaluators/pattern-matching#ends-with) | Text Generation / Chatbot | Pattern Matching | Checks if the output ends with a specified suffix. |
| [Contains](/evaluation/evaluators/pattern-matching#contains) | Text Generation / Chatbot | Pattern Matching | Checks if the output contains a specific substring. |
| [Contains Any](/evaluation/evaluators/pattern-matching#contains-any) | Text Generation / Chatbot | Pattern Matching | Checks if the output contains any of a list of substrings. |
| [Contains All](/evaluation/evaluators/pattern-matching#contains-all) | Text Generation / Chatbot | Pattern Matching | Checks if the output contains all of a list of substrings. |
| [Levenshtein Distance](/evaluation/evaluators/semantic-similarity#levenshtein-distance) | Text Generation / Chatbot | Similarity Metrics | Calculates the Levenshtein distance between output and expected result. |
| [LLM-as-a-judge](/evaluation/evaluators/llm-as-a-judge) | Text Generation / Chatbot | LLM-based | Sends outputs to an LLM model for critique and evaluation. |
| [RAG Faithfulness](/evaluation/evaluators/rag-evaluators) | RAG / Text Generation / Chatbot | LLM-based | Evaluates if the output is faithful to the retrieved documents in RAG workflows. |
| [RAG Context Relevancy](/evaluation/evaluators/rag-evaluators) | RAG / Text Generation / Chatbot | LLM-based | Measures the relevancy of retrieved documents to the given question in RAG. |
| [Custom Code Evaluation](/evaluation/evaluators/custom-evaluator) | Custom Logic | Custom | Allows users to define their own evaluator in Python. |
| [Webhook Evaluator](/evaluation/evaluators/webhook-evaluator) | Custom Logic | Custom | Sends output to a webhook for external evaluation. |
146 changes: 146 additions & 0 deletions docs/docs/evaluation/02-create-test-sets.mdx
@@ -0,0 +1,146 @@
---
title: "Create Test Sets"
---

This guide outlines the various methods for creating test sets in Agenta and provides specifications for the test set schema.

Test sets are used for running automatic or human evaluations. They can also be loaded into the playground, allowing you to experiment with different prompts.

Test sets contain input data for the LLM application. They may also include a reference output (i.e., expected output or ground truth), though this is optional.

You can create a test set in Agenta using the following methods:

- [By uploading a CSV or JSON file](#creating-a-test-set-from-a-csv-or-json)
- [Using the API](#creating-a-test-set-using-the-api)
- [Using the UI](#creatingediting-a-test-set-from-the-ui)
- [From the playground](#creating-a-test-set-from-the-playground)
- [From traces in observability](#adding-data-from-traces)

## Creating a Test Set from a CSV or JSON

To create a test set from a CSV or JSON file:

1. Go to `Test sets`
2. Click `Upload test sets`
3. Select either `CSV` or `JSON`

<img src="/images/test-sets/upload_test_set.png" />

### CSV Format

We use CSV with commas (,) as separators and double quotes (") as quote characters. The first row should contain the header with column names. Each input should have its own column. The column containing the reference answer can have any name, but we use "correct_answer" by default.

:::info
If you choose a different column name for the reference answer, you'll need to configure the evaluator later with that specific name.
:::

Here's an example of a valid CSV:

```csv
text,instruction,correct_answer
Hello,How are you?,I'm good.
"Tell me a joke.",,"Sure, here's one:..."
```
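
If you build the file programmatically, a minimal sketch like the one below (using Python's standard `csv` module; the column names simply mirror the example above) produces a CSV in this format:

```python
import csv

# Minimal sketch: any CSV writer that uses commas as separators and double
# quotes as quote characters works. Column names mirror the example above.
rows = [
    {"text": "Hello", "instruction": "How are you?", "correct_answer": "I'm good."},
    {"text": "Tell me a joke.", "instruction": "", "correct_answer": "Sure, here's one:..."},
]

with open("test_set.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "instruction", "correct_answer"])
    writer.writeheader()
    writer.writerows(rows)
```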

### JSON Format

The test set should be in JSON format with the following structure:

1. A JSON file containing an array of objects.
2. Each object in the array represents a row, with keys as column headers and values as row data. Here's an example of a valid JSON file:

```json
[
{ "recipe_name": "Chicken Parmesan", "correct_answer": "Chicken" },
{ "recipe_name": "a, special, recipe", "correct_answer": "Beef" }
]
```

### Schema for Chat Applications

For chat applications created using the chat template in Agenta, the input should be saved in a column called `chat`, which contains the list of messages:

```json
[
{ "content": "message.", "role": "user" },
{ "content": "message.", "role": "assistant" }
// Add more messages if necessary
]
```

The reference answer column (by default `correct_answer`) should follow the same format:

```json
{ "content": "message.", "role": "assistant" }
```

## Creating a Test Set Using the API

You can upload a test set using our API. Find the [API endpoint reference here](/reference/api/upload-file).

Here's an example of such a call:

**HTTP Request:**

```http
POST /testsets/{app_id}/
```

**Request Body:**

```json
{
"name": "testsetname",
"csvdata": [
{ "column1": "row1col1", "column2": "row1col2" },
{ "column1": "row2col1", "column2": "row2col2" }
]
}
```
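
For illustration, here is a hedged sketch of the same call using Python's `requests` library. The base URL, the authorization header, and how you obtain the app ID are assumptions; refer to the API reference linked above for the exact details.

```python
import requests

BASE_URL = "https://cloud.agenta.ai/api"  # assumption: your Agenta API base URL
API_KEY = "your-api-key"                  # assumption: an Agenta API key
APP_ID = "your-app-id"

payload = {
    "name": "testsetname",
    "csvdata": [
        {"column1": "row1col1", "column2": "row1col2"},
        {"column1": "row2col1", "column2": "row2col2"},
    ],
}

# POST /testsets/{app_id}/ as documented above
response = requests.post(
    f"{BASE_URL}/testsets/{APP_ID}/",
    json=payload,
    headers={"Authorization": API_KEY},
)
response.raise_for_status()
print(response.json())
```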

## Creating/Editing a Test Set from the UI

To create or edit a test set from the UI:

1. Go to `Test sets`.
2. Choose `Create a test set with UI` or select an existing test set.
3. Name your test set and specify the columns for input types.
4. Add your data.

Remember to click `Save test set` when you're done.

<img src="/images/test-sets/add_test_set_ui.png" />

## Creating a Test Set from the Playground

The playground offers a convenient way to create and add data to a test set. This workflow is useful if you want to build your test set ad hoc: each time you find an interesting input for the LLM app, you can immediately add it to the test set and optionally set a reference answer.

To add a data point to a test set from the playground, simply click the `Add to test set` button located near the `Run` button.

A drawer will display the inputs and outputs from the playground. Here, you can modify inputs and correct answers if needed. Select an existing test set to add to, or choose `+Add new` to create a new one. Once you're satisfied, click `Add` to finalize.

:::warning
Currently, when adding a test point from the playground, the correct answer is always added to a column called `correct_answer`.
:::

:::warning
When adding a new data point, ensure that the column names in the test set match those of the LLM application. All columns from the playground (input columns and `correct_answer`) must exist in the test set. They will be created automatically if you're making a new test set. Any additional columns in the test set not available in the playground will be left empty.
:::

<img src="/images/test-sets/add_test_set_playground.png" />

### Adding Chat History from the Playground

When adding chat history, you can choose to include all turns from the conversation. For example:

- User: Hi
- Assistant: Hi, how can I help you?
- User: I would like to book a table
- Assistant: Sure, for how many people?

If you select "Turn by Turn," two rows will be added to the test set: one for "Hi/Hi, how can I help you?" and another for "Hi/Hi, how can I help you?/I would like to book a table/Sure, for how many people?"

## Adding Data From Traces

You can add any data logged to Agenta to a test set. Simply navigate to observability, select the trace (or any span), then click `Add to testset` or the `+` button.
78 changes: 78 additions & 0 deletions docs/docs/evaluation/03-configure-evaluators.mdx
@@ -0,0 +1,78 @@
---
title: "Configure Evaluators"
description: "Set up evaluators for your use case"
---

In this guide, we'll show you how to configure evaluators for your LLM application.

### What are evaluators?

Evaluators are functions that assess the output of an LLM application.

Evaluators typically take as input:

- The output of the LLM application
- (Optional) The reference answer (i.e., expected output or ground truth)
- (Optional) The inputs to the LLM application
- Any other relevant data, such as context

Evaluators return either a float or a boolean value.

<img style={{ width: "70%" }} src="/images/evaluation/evaluators.png" />

### Configuring evaluators

To create a new evaluator, click on the `Configure Evaluators` button in the `Evaluations` view.

![The configure evaluators button in agenta.](/images/evaluation/configure-evaluators-1.png)

### Selecting evaluators

Agenta offers a growing list of pre-built evaluators suitable for most use cases. We also provide options for [creating custom evaluators](/evaluation/evaluators/custom-evaluator) (by writing your own Python function) or [using webhooks](/evaluation/evaluators/webhook-evaluator) for evaluation.

<details id="available-evaluators">
<summary>Available Evaluators</summary>

| **Evaluator Name** | **Use Case** | **Type** | **Description** |
| ------------------------------------------------------------------------------------------------- | -------------------------------- | ------------------ | -------------------------------------------------------------------------------- |
| [Exact Match](/evaluation/evaluators/classification-entiry-extraction#exact-match) | Classification/Entity Extraction | Pattern Matching | Checks if the output exactly matches the expected result. |
| [Contains JSON](/evaluation/evaluators/classification-entiry-extraction#contains-json) | Classification/Entity Extraction | Pattern Matching | Ensures the output contains valid JSON. |
| [Regex Test](/evaluation/evaluators/pattern-matching#regular-expression) | Classification/Entity Extraction | Pattern Matching | Checks if the output matches a given regex pattern. |
| [JSON Field Match](/evaluation/evaluators/classification-entiry-extraction#json-field-match) | Classification/Entity Extraction | Pattern Matching | Compares specific fields within JSON data. |
| [JSON Diff Match](/evaluation/evaluators/classification-entiry-extraction#json-diff-match) | Classification/Entity Extraction | Similarity Metrics | Compares generated JSON with a ground truth JSON based on schema or values. |
| [Similarity Match](/evaluation/evaluators/semantic-similarity#similarity-match) | Text Generation / Chatbot | Similarity Metrics | Compares generated output with expected using Jaccard similarity. |
| [Semantic Similarity Match](/evaluation/evaluators/semantic-similarity#semantic-similarity-match) | Text Generation / Chatbot | Semantic Analysis | Compares the meaning of the generated output with the expected result. |
| [Starts With](/evaluation/evaluators/pattern-matching#starts-with) | Text Generation / Chatbot | Pattern Matching | Checks if the output starts with a specified prefix. |
| [Ends With](/evaluation/evaluators/pattern-matching#ends-with) | Text Generation / Chatbot | Pattern Matching | Checks if the output ends with a specified suffix. |
| [Contains](/evaluation/evaluators/pattern-matching#contains) | Text Generation / Chatbot | Pattern Matching | Checks if the output contains a specific substring. |
| [Contains Any](/evaluation/evaluators/pattern-matching#contains-any) | Text Generation / Chatbot | Pattern Matching | Checks if the output contains any of a list of substrings. |
| [Contains All](/evaluation/evaluators/pattern-matching#contains-all) | Text Generation / Chatbot | Pattern Matching | Checks if the output contains all of a list of substrings. |
| [Levenshtein Distance](/evaluation/evaluators/semantic-similarity#levenshtein-distance) | Text Generation / Chatbot | Similarity Metrics | Calculates the Levenshtein distance between output and expected result. |
| [LLM-as-a-judge](/evaluation/evaluators/llm-as-a-judge) | Text Generation / Chatbot | LLM-based | Sends outputs to an LLM model for critique and evaluation. |
| [RAG Faithfulness](/evaluation/evaluators/rag-evaluators) | RAG / Text Generation / Chatbot | LLM-based | Evaluates if the output is faithful to the retrieved documents in RAG workflows. |
| [RAG Context Relevancy](/evaluation/evaluators/rag-evaluators) | RAG / Text Generation / Chatbot | LLM-based | Measures the relevancy of retrieved documents to the given question in RAG. |
| [Custom Code Evaluation](/evaluation/evaluators/custom-evaluator) | Custom Logic | Custom | Allows users to define their own evaluator in Python. |
| [Webhook Evaluator](/evaluation/evaluators/webhook-evaluator) | Custom Logic | Custom | Sends output to a webhook for external evaluation. |

</details>

![Screen for selecting an evaluator.](/images/evaluation/configure-evaluators-2.png)

## Evaluators' settings

Each evaluator comes with its own settings. For instance, in the screen below, the JSON Field Match evaluator requires you to specify which field in the output JSON to consider for evaluation. You'll find detailed information about these parameters on each evaluator's documentation page.

![Screen for configuring an evaluator.](/images/evaluation/configure-evaluators-3.png)

## Mapping evaluator inputs to the LLM data

Evaluators need to know which parts of the data contain the output and the reference answer. Most evaluators allow you to configure this mapping, typically by specifying the name of the column in the test set that contains the `reference answer`.

For more sophisticated evaluators, such as `RAG evaluators` (_available only in cloud and enterprise versions_), you need to define more complex mappings (see figure below).

![Figure showing how RAGAS faithfulness evaluator maps to an example LLM generation.](/images/evaluation/evaluator_config_mapping.png)

Configuring the evaluator is done by mapping the evaluator inputs to the generation data:

![Figure showing how RAGAS faithfulness evaluator is configured in agenta.](/images/evaluation/configure_mapping.png)