Evaluation docs Overhaul #2061

Merged: 24 commits, Sep 10, 2024

Commits
10e4e67
docs(frontend): AGE-517 improve card ui
mmabrouk Sep 5, 2024
ab33ae4
docs(app): AGE-517 add overview page (incomplete)
mmabrouk Sep 5, 2024
066607e
docs(app): AGE-517 update test set page
mmabrouk Sep 5, 2024
a2b9ee8
docs(app): AGE-517 rename doc files (restructuring eval)
mmabrouk Sep 5, 2024
5e3117d
docs(app): AGE-517 update test set docs
mmabrouk Sep 6, 2024
e898294
docs(frontend): AGE-517 fix detail component style
mmabrouk Sep 6, 2024
9c75446
docs(app): AGE-517 add list evaluators to overview
mmabrouk Sep 6, 2024
aa99190
docs(app): AGE-517 added new configure evaluators page
mmabrouk Sep 6, 2024
676894c
docs(app): AGE-517 restructured evals
mmabrouk Sep 6, 2024
80299ab
docs(app): AGE-517 ui-evaluation docs
mmabrouk Sep 7, 2024
71190de
docs(app): AGE-517 minor update eval overview page
mmabrouk Sep 7, 2024
d541f6f
docs(app): AGE-517 simplify configure eval page
mmabrouk Sep 7, 2024
299fe4e
docs(app): AGE-517 renamed ui eval to no-code eval
mmabrouk Sep 7, 2024
58e629d
docs(app): AGE-517 added sdk eval page
mmabrouk Sep 7, 2024
68512c0
docs(app): AGE-517 no code eval image udpate
mmabrouk Sep 7, 2024
50cb9d1
docs(app): AGE-517 added llm as a judge
mmabrouk Sep 7, 2024
18f3612
docs(app): AGE-517 updated custom evaluator
mmabrouk Sep 7, 2024
ee91840
docs(app): AGE-517 updated webhook evals
mmabrouk Sep 7, 2024
ea21147
docs(app): AGE-517 added pattern matching docs
mmabrouk Sep 7, 2024
88c69e3
docs(app): AGE-517 added semantic sim
mmabrouk Sep 8, 2024
49fc4d2
docs(app): AGE-517 restructure
mmabrouk Sep 8, 2024
e49e623
docs(app): AGE-517 classification eval docs
mmabrouk Sep 9, 2024
c25a169
docs(app): AGE-517 update links
mmabrouk Sep 9, 2024
dabd9f8
docs(app): AGE-517 minor fix
mmabrouk Sep 9, 2024
90 changes: 90 additions & 0 deletions docs/docs/evaluation/01-overview.mdx
@@ -0,0 +1,90 @@
---
title: "Overview"
description: Systematically evaluate your LLM applications and compare their performance.
sidebar_position: 1
---

```mdx-code-block
import DocCard from '@theme/DocCard';
import clsx from 'clsx';

```

The key to building production-ready LLM applications is a tight feedback loop of prompt engineering and evaluation. Whether you are optimizing a chatbot, working on Retrieval-Augmented Generation (RAG), or fine-tuning a text generation task, evaluation is a critical step to ensure consistent performance across different inputs, models, and parameters. In this section, we explain how to use Agenta to quickly evaluate and compare the performance of your LLM applications.

### Set up evaluation

<section className='row'>
<article key='1' className="col col--6 margin-bottom--lg">

<DocCard
item={{
type: "link",
href: "/evaluation/configure-evaluators",
label: "Configure Evaluators",
description: "Configure evaluators for your use case",
}}
/>
</article>

<article key='2' className="col col--6 margin-bottom--lg">
<DocCard
item={{
type: "link",
href: "/evaluation/create-test-sets",
label: "Create Test Sets",
description: "Create Test Sets",
}}
/>
</article>
</section>

### Run evaluations

<section className='row'>

<article key="1" className="col col--6 margin-bottom--lg">
<DocCard
item={{
type: "link",
href: "/evaluation/overview",
label: "Run Evaluations from the UI",
description: "Learn about the evaluation process in Agenta",
}}
/>
</article>

<article key='2' className="col col--6 margin-bottom--lg">
<DocCard
item={{
type: "link",
href: "/evaluation/overview",
label: "Run Evaluations with the SDK",
description: "Learn about the evaluation process in Agenta",
}}
/>
</article>
</section>

### Available evaluators

| **Evaluator Name** | **Use Case** | **Type** | **Description** |
| ------------------------------------------------------------------------------------------------- | -------------------------------- | ------------------ | -------------------------------------------------------------------------------- |
| [Exact Match](/evaluation/evaluators/classification-entiry-extraction#exact-match) | Classification/Entity Extraction | Pattern Matching | Checks if the output exactly matches the expected result. |
| [Contains JSON](/evaluation/evaluators/classification-entiry-extraction#contains-json) | Classification/Entity Extraction | Pattern Matching | Ensures the output contains valid JSON. |
| [Regex Test](/evaluation/evaluators/pattern-matching#regular-expression) | Classification/Entity Extraction | Pattern Matching | Checks if the output matches a given regex pattern. |
| [JSON Field Match](/evaluation/evaluators/classification-entiry-extraction#json-field-match) | Classification/Entity Extraction | Pattern Matching | Compares specific fields within JSON data. |
| [JSON Diff Match](/evaluation/evaluators/classification-entiry-extraction#json-diff-match) | Classification/Entity Extraction | Similarity Metrics | Compares generated JSON with a ground truth JSON based on schema or values. |
| [Similarity Match](/evaluation/evaluators/semantic-similarity#similarity-match) | Text Generation / Chatbot | Similarity Metrics | Compares generated output with expected using Jaccard similarity. |
| [Semantic Similarity Match](/evaluation/evaluators/semantic-similarity#semantic-similarity-match) | Text Generation / Chatbot | Semantic Analysis | Compares the meaning of the generated output with the expected result. |
| [Starts With](/evaluation/evaluators/pattern-matching#starts-with) | Text Generation / Chatbot | Pattern Matching | Checks if the output starts with a specified prefix. |
| [Ends With](/evaluation/evaluators/pattern-matching#ends-with) | Text Generation / Chatbot | Pattern Matching | Checks if the output ends with a specified suffix. |
| [Contains](/evaluation/evaluators/pattern-matching#contains) | Text Generation / Chatbot | Pattern Matching | Checks if the output contains a specific substring. |
| [Contains Any](/evaluation/evaluators/pattern-matching#contains-any) | Text Generation / Chatbot | Pattern Matching | Checks if the output contains any of a list of substrings. |
| [Contains All](/evaluation/evaluators/pattern-matching#contains-all) | Text Generation / Chatbot | Pattern Matching | Checks if the output contains all of a list of substrings. |
| [Levenshtein Distance](/evaluation/evaluators/semantic-similarity#levenshtein-distance) | Text Generation / Chatbot | Similarity Metrics | Calculates the Levenshtein distance between output and expected result. |
| [LLM-as-a-judge](/evaluation/evaluators/llm-as-a-judge) | Text Generation / Chatbot | LLM-based | Sends outputs to an LLM model for critique and evaluation. |
| [RAG Faithfulness](/evaluation/evaluators/rag-evaluators) | RAG / Text Generation / Chatbot | LLM-based | Evaluates if the output is faithful to the retrieved documents in RAG workflows. |
| [RAG Context Relevancy](/evaluation/evaluators/rag-evaluators) | RAG / Text Generation / Chatbot | LLM-based | Measures the relevancy of retrieved documents to the given question in RAG. |
| [Custom Code Evaluation](/evaluation/evaluators/custom-evaluator) | Custom Logic | Custom | Allows users to define their own evaluator in Python. |
| [Webhook Evaluator](/evaluation/evaluators/webhook-evaluator) | Custom Logic | Custom | Sends output to a webhook for external evaluation. |
146 changes: 146 additions & 0 deletions docs/docs/evaluation/02-create-test-sets.mdx
@@ -0,0 +1,146 @@
---
title: "Create Test Sets"
---

This guide outlines the various methods for creating test sets in Agenta and provides specifications for the test set schema.

Test sets are used for running automatic or human evaluations. They can also be loaded into the playground, allowing you to experiment with different prompts.

Test sets contain input data for the LLM application. They may also include a reference output (i.e., expected output or ground truth), though this is optional.

You can create a test set in Agenta using the following methods:

- [By uploading a CSV or JSON file](#creating-a-test-set-from-a-csv-or-json)
- [Using the API](#creating-a-test-set-using-the-api)
- [Using the UI](#creatingediting-a-test-set-from-the-ui)
- [From the playground](#creating-a-test-set-from-the-playground)
- [From traces in observability](#adding-data-from-traces)

## Creating a Test Set from a CSV or JSON

To create a test set from a CSV or JSON file:

1. Go to `Test sets`
2. Click `Upload test sets`
3. Select either `CSV` or `JSON`

<img src="/images/test-sets/upload_test_set.png" />

### CSV Format

We use CSV with commas (,) as separators and double quotes (") as quote characters. The first row should contain the header with column names. Each input should have its own column. The column containing the reference answer can have any name, but we use "correct_answer" by default.

:::info
If you choose a different column name for the reference answer, you'll need to configure the evaluator later with that specific name.
:::

Here's an example of a valid CSV:

```csv
text,instruction,correct_answer
Hello,How are you?,I'm good.
"Tell me a joke.",,"Sure, here's one:..."
```
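
If you build the file programmatically, a minimal sketch like the one below (using Python's standard `csv` module; the column names simply mirror the example above) produces a CSV in this format:

```python
import csv

# Minimal sketch: any CSV writer that uses commas as separators and double
# quotes as quote characters works. Column names mirror the example above.
rows = [
    {"text": "Hello", "instruction": "How are you?", "correct_answer": "I'm good."},
    {"text": "Tell me a joke.", "instruction": "", "correct_answer": "Sure, here's one:..."},
]

with open("test_set.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "instruction", "correct_answer"])
    writer.writeheader()
    writer.writerows(rows)
```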

### JSON Format

The test set should be in JSON format with the following structure:

1. A JSON file containing an array of objects.
2. Each object in the array represents a row, with keys as column headers and values as row data. Here's an example of a valid JSON file:

```json
[
{ "recipe_name": "Chicken Parmesan", "correct_answer": "Chicken" },
{ "recipe_name": "a, special, recipe", "correct_answer": "Beef" }
]
```

### Schema for Chat Applications

For chat applications created using the chat template in Agenta, the input should be saved in a column called `chat`, which contains the list of messages:

```json
[
{ "content": "message.", "role": "user" },
{ "content": "message.", "role": "assistant" }
// Add more messages if necessary
]
```

The reference answer column (by default `correct_answer`) should follow the same format:

```json
{ "content": "message.", "role": "assistant" }
```

## Creating a Test Set Using the API

You can upload a test set using our API. Find the [API endpoint reference here](/reference/api/upload-file).

Here's an example of such a call:

**HTTP Request:**

```http
POST /testsets/{app_id}/
```

**Request Body:**

```json
{
"name": "testsetname",
"csvdata": [
{ "column1": "row1col1", "column2": "row1col2" },
{ "column1": "row2col1", "column2": "row2col2" }
]
}
```
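
For illustration, here is a hedged sketch of the same call using Python's `requests` library. The base URL, the authorization header, and how you obtain the app ID are assumptions; refer to the API reference linked above for the exact details.

```python
import requests

BASE_URL = "https://cloud.agenta.ai/api"  # assumption: your Agenta API base URL
API_KEY = "your-api-key"                  # assumption: an Agenta API key
APP_ID = "your-app-id"

payload = {
    "name": "testsetname",
    "csvdata": [
        {"column1": "row1col1", "column2": "row1col2"},
        {"column1": "row2col1", "column2": "row2col2"},
    ],
}

# POST /testsets/{app_id}/ as documented above
response = requests.post(
    f"{BASE_URL}/testsets/{APP_ID}/",
    json=payload,
    headers={"Authorization": API_KEY},
)
response.raise_for_status()
print(response.json())
```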

## Creating/Editing a Test Set from the UI

To create or edit a test set from the UI:

1. Go to `Test sets`.
2. Choose `Create a test set with UI` or select an existing test set.
3. Name your test set and specify the columns for input types.
4. Add your data.

Remember to click `Save test set` when you're done.

<img src="/images/test-sets/add_test_set_ui.png" />

## Creating a Test Set from the Playground

The playground offers a convenient way to create and add data to a test set. This workflow is useful if you want to build your test set ad hoc: each time you find an interesting input for the LLM app, you can immediately add it to the test set and optionally set a reference answer.

To add a data point to a test set from the playground, simply click the `Add to test set` button located near the `Run` button.

A drawer will display the inputs and outputs from the playground. Here, you can modify inputs and correct answers if needed. Select an existing test set to add to, or choose `+Add new` to create a new one. Once you're satisfied, click `Add` to finalize.

:::warning
Currently, when adding a test point from the playground, the correct answer is always added to a column called `correct_answer`.
:::

:::warning
When adding a new data point, ensure that the column names in the test set match those of the LLM application. All columns from the playground (input columns and `correct_answer`) must exist in the test set. They will be created automatically if you're making a new test set. Any additional columns in the test set not available in the playground will be left empty.
:::

<img src="/images/test-sets/add_test_set_playground.png" />

### Adding Chat History from the Playground

When adding chat history, you can choose to include all turns from the conversation. For example:

- User: Hi
- Assistant: Hi, how can I help you?
- User: I would like to book a table
- Assistant: Sure, for how many people?

If you select "Turn by Turn," two rows will be added to the test set: one for "Hi/Hi, how can I help you?" and another for "Hi/Hi, how can I help you?/I would like to book a table/Sure, for how many people?"

## Adding Data From Traces

You can add any data logged to Agenta to a test set. Simply navigate to observability, select the trace (or any span), then click `Add to testset` or the `+` button.
78 changes: 78 additions & 0 deletions docs/docs/evaluation/03-configure-evaluators.mdx
@@ -0,0 +1,78 @@
---
title: "Configure Evaluators"
description: "Set up evaluators for your use case"
---

In this guide, we'll show you how to configure evaluators for your LLM application.

### What are evaluators?

Evaluators are functions that assess the output of an LLM application.

Evaluators typically take as input:

- The output of the LLM application
- (Optional) The reference answer (i.e., expected output or ground truth)
- (Optional) The inputs to the LLM application
- Any other relevant data, such as context

Evaluators return either a float or a boolean value.

<img style={{ width: "70%" }} src="/images/evaluation/evaluators.png" />

### Configuring evaluators

To create a new evaluator, click on the `Configure Evaluators` button in the `Evaluations` view.

![The configure evaluators button in agenta.](/images/evaluation/configure-evaluators-1.png)

### Selecting evaluators

Agenta offers a growing list of pre-built evaluators suitable for most use cases. We also provide options for [creating custom evaluators](/evaluation/evaluators/custom-evaluator) (by writing your own Python function) or [using webhooks](/evaluation/evaluators/webhook-evaluator) for evaluation.

<details id="available-evaluators">
<summary>Available Evaluators</summary>

| **Evaluator Name** | **Use Case** | **Type** | **Description** |
| ------------------------------------------------------------------------------------------------- | -------------------------------- | ------------------ | -------------------------------------------------------------------------------- |
| [Exact Match](/evaluation/evaluators/classification-entiry-extraction#exact-match) | Classification/Entity Extraction | Pattern Matching | Checks if the output exactly matches the expected result. |
| [Contains JSON](/evaluation/evaluators/classification-entiry-extraction#contains-json) | Classification/Entity Extraction | Pattern Matching | Ensures the output contains valid JSON. |
| [Regex Test](/evaluation/evaluators/pattern-matching#regular-expression) | Classification/Entity Extraction | Pattern Matching | Checks if the output matches a given regex pattern. |
| [JSON Field Match](/evaluation/evaluators/classification-entiry-extraction#json-field-match) | Classification/Entity Extraction | Pattern Matching | Compares specific fields within JSON data. |
| [JSON Diff Match](/evaluation/evaluators/classification-entiry-extraction#json-diff-match) | Classification/Entity Extraction | Similarity Metrics | Compares generated JSON with a ground truth JSON based on schema or values. |
| [Similarity Match](/evaluation/evaluators/semantic-similarity#similarity-match) | Text Generation / Chatbot | Similarity Metrics | Compares generated output with expected using Jaccard similarity. |
| [Semantic Similarity Match](/evaluation/evaluators/semantic-similarity#semantic-similarity-match) | Text Generation / Chatbot | Semantic Analysis | Compares the meaning of the generated output with the expected result. |
| [Starts With](/evaluation/evaluators/pattern-matching#starts-with) | Text Generation / Chatbot | Pattern Matching | Checks if the output starts with a specified prefix. |
| [Ends With](/evaluation/evaluators/pattern-matching#ends-with) | Text Generation / Chatbot | Pattern Matching | Checks if the output ends with a specified suffix. |
| [Contains](/evaluation/evaluators/pattern-matching#contains) | Text Generation / Chatbot | Pattern Matching | Checks if the output contains a specific substring. |
| [Contains Any](/evaluation/evaluators/pattern-matching#contains-any) | Text Generation / Chatbot | Pattern Matching | Checks if the output contains any of a list of substrings. |
| [Contains All](/evaluation/evaluators/pattern-matching#contains-all) | Text Generation / Chatbot | Pattern Matching | Checks if the output contains all of a list of substrings. |
| [Levenshtein Distance](/evaluation/evaluators/semantic-similarity#levenshtein-distance) | Text Generation / Chatbot | Similarity Metrics | Calculates the Levenshtein distance between output and expected result. |
| [LLM-as-a-judge](/evaluation/evaluators/llm-as-a-judge) | Text Generation / Chatbot | LLM-based | Sends outputs to an LLM model for critique and evaluation. |
| [RAG Faithfulness](/evaluation/evaluators/rag-evaluators) | RAG / Text Generation / Chatbot | LLM-based | Evaluates if the output is faithful to the retrieved documents in RAG workflows. |
| [RAG Context Relevancy](/evaluation/evaluators/rag-evaluators) | RAG / Text Generation / Chatbot | LLM-based | Measures the relevancy of retrieved documents to the given question in RAG. |
| [Custom Code Evaluation](/evaluation/evaluators/custom-evaluator) | Custom Logic | Custom | Allows users to define their own evaluator in Python. |
| [Webhook Evaluator](/evaluation/evaluators/webhook-evaluator) | Custom Logic | Custom | Sends output to a webhook for external evaluation. |

</details>

![Screen for selecting an evaluator.](/images/evaluation/configure-evaluators-2.png)

## Evaluators' settings

Each evaluator comes with its own settings. For instance, in the screen below, the JSON Field Match evaluator requires you to specify which field in the output JSON to consider for evaluation. You'll find detailed information about these parameters on each evaluator's documentation page.

![Screen for configuring an evaluator.](/images/evaluation/configure-evaluators-3.png)

## Mapping evaluator inputs to the LLM data

Evaluators need to know which parts of the data contain the output and the reference answer. Most evaluators allow you to configure this mapping, typically by specifying the name of the column in the test set that contains the `reference answer`.

For more sophisticated evaluators, such as `RAG evaluators` (_available only in cloud and enterprise versions_), you need to define more complex mappings (see figure below).

![Figure showing how RAGAS faithfulness evaluator maps to an example LLM generation.](/images/evaluation/evaluator_config_mapping.png)

Configuring the evaluator is done by mapping the evaluator inputs to the generation data:

![Figure showing how RAGAS faithfulness evaluator is configured in agenta.](/images/evaluation/configure_mapping.png)