# Evaluations

{% hint style="info" %} Evaluations are only available for Cloud and Enterprise plans {% endhint %}

Evaluations help you monitor and understand the performance of your Chatflow/Agentflow application. At a high level, an evaluation is a process that takes a set of inputs and the corresponding outputs from your Chatflow/Agentflow and generates scores. These scores can be derived by comparing the outputs against reference results, for example through string matching, numeric comparison, or even by using an LLM as a judge. Evaluations are conducted using Datasets and Evaluators.

## Datasets

Datasets are the inputs that will be used to run your Chatflow/Agentflow, along with the corresponding expected outputs to compare against. You can add each input and its anticipated output manually, or upload a CSV file with two columns: Input and Output.

| Input | Output |
| --- | --- |
| What is the capital of UK | Capital of UK is London |
| How many days are there in a year | There are 365 days in a year |
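
For the file upload option, an equivalent CSV for the rows above might look like the following (assuming the first row holds the two column headers):

```csv
Input,Output
What is the capital of UK,Capital of UK is London
How many days are there in a year,There are 365 days in a year
```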

## Evaluators

Evaluators are like unit tests. During an evaluation, the inputs from the Datasets are run on the selected flows, and the outputs are evaluated using the selected evaluators (see the sketch after this list for how the individual checks work conceptually). There are 3 types of evaluators:

* Text Based: string-based checks on the output:
  * Contains Any
  * Contains All
  * Does Not Contains Any
  * Does Not Contains All
  * Starts With
  * Does Not Starts With

* Numeric Based: checks on numeric metrics:
  * Total Tokens
  * Prompt Tokens
  * Completion Tokens
  * API Latency
  * LLM Latency
  * Chatflow Latency
  * Agentflow Latency (coming soon)
  * Output Characters Length

* LLM Based: uses another LLM to grade the output:
  * Hallucination
  * Correctness
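
Conceptually, a text based or numeric based evaluator is just a simple pass/fail predicate applied to each run's output and metrics. The following TypeScript sketch is illustrative only; the function names and threshold are assumptions, not Flowise's internal evaluator API:

```typescript
// Illustrative only: a simplified view of how text based and numeric based
// checks could be expressed. Not Flowise's actual evaluator implementation.
interface RunResult {
  output: string;
  totalTokens: number;
  apiLatencyMs: number;
}

// Text based: pass if the output contains any of the given terms
const containsAny = (output: string, terms: string[]): boolean =>
  terms.some((t) => output.toLowerCase().includes(t.toLowerCase()));

// Text based: pass if the output starts with the given prefix
const startsWith = (output: string, prefix: string): boolean =>
  output.startsWith(prefix);

// Numeric based: pass if total tokens stay under a chosen budget
const underTokenBudget = (result: RunResult, maxTokens: number): boolean =>
  result.totalTokens <= maxTokens;

// Evaluating a single run against three checks
const run: RunResult = {
  output: "Capital of UK is London",
  totalTokens: 42,
  apiLatencyMs: 850,
};

console.log(containsAny(run.output, ["London", "Paris"])); // true
console.log(startsWith(run.output, "The capital"));        // false
console.log(underTokenBudget(run, 100));                   // true
```

LLM based evaluators (Hallucination, Correctness) work differently: instead of a fixed predicate, another LLM is asked to grade the output.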

## Running an Evaluation

Now that we have Datasets and Evaluators prepared, we can start running an evaluation.

1.) Select the dataset(s) and chatflow(s) to evaluate. You can select multiple datasets and chatflows. In the example below, every input from Dataset1 is run against 2 chatflows. Since Dataset1 has 2 inputs, a total of 4 outputs will be produced and evaluated.
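
As a rough illustration of that arithmetic (the dataset and flow names below are placeholders, not part of the Flowise API):

```typescript
// Placeholder names; this only illustrates the run count, not the Flowise API.
const dataset1Inputs = [
  "What is the capital of UK",
  "How many days are there in a year",
];
const selectedChatflows = ["Chatflow A", "Chatflow B"];

// Every dataset input is run against every selected flow
const totalOutputs = dataset1Inputs.length * selectedChatflows.length;
console.log(totalOutputs); // 4
```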

2.) Select the evaluators. Only text based and numeric based evaluators can be selected at this stage.

3.) (Optional) Select an LLM Based evaluator, then start the evaluation:

4.) Wait for the evaluation to complete:

5.) After the evaluation has completed, click the graph icon on the right side to view the details:

The 3 charts above summarize the evaluation (a rough sketch of how these figures relate to the per-run results follows this list):

* Pass/fail rate
* Average prompt and completion tokens used
* Average latency of the request
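
As a minimal sketch of how those summary figures could be derived from the per-run results (the field names below are assumptions, not Flowise's actual export format):

```typescript
// Assumed shape of one evaluated run; field names are illustrative,
// not Flowise's actual data model.
interface EvaluatedRun {
  passed: boolean;
  promptTokens: number;
  completionTokens: number;
  latencyMs: number;
}

const summarize = (runs: EvaluatedRun[]) => {
  const avg = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  return {
    passRate: runs.filter((r) => r.passed).length / runs.length,
    avgPromptTokens: avg(runs.map((r) => r.promptTokens)),
    avgCompletionTokens: avg(runs.map((r) => r.completionTokens)),
    avgLatencyMs: avg(runs.map((r) => r.latencyMs)),
  };
};

// Example with two evaluated runs
console.log(
  summarize([
    { passed: true, promptTokens: 30, completionTokens: 12, latencyMs: 800 },
    { passed: false, promptTokens: 28, completionTokens: 20, latencyMs: 950 },
  ])
);
// { passRate: 0.5, avgPromptTokens: 29, avgCompletionTokens: 16, avgLatencyMs: 875 }
```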

The table below the charts shows the details of each execution.

## Re-run Evaluation

When the flows used in an evaluation have been updated or modified, a warning message is shown:

You can re-run the same evaluation using the Re-Run Evaluation button at the top-right corner. You will then be able to see the different versions:

You can also view and compare the results from different versions:

## Video Tutorial

{% embed url="https://youtu.be/kgUttHMkGFg?si=3rLplEp_0TI0p6UV&t=486" %}