# Evaluations

{% hint style="info" %} Evaluations are only available for Cloud and Enterprise plans {% endhint %}

Evaluations help you monitor and understand the performance of your Chatflow/Agentflow application. At a high level, an evaluation is a process that takes a set of inputs and the corresponding outputs from your Chatflow/Agentflow and generates scores. These scores can be derived by comparing the outputs against reference results, for example through string matching, numeric comparison, or even by using an LLM as a judge. Evaluations are conducted using Datasets and Evaluators.

## Datasets

Datasets are the inputs that will be used to run your Chatflow/Agentflow, along with the corresponding expected outputs to compare against. You can add each input and its anticipated output manually, or upload a CSV file with two columns: Input and Output.

| Input | Output |
| --- | --- |
| What is the capital of UK | Capital of UK is London |
| How many days are there in a year | There are 365 days in a year |
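
For the file upload option, an equivalent CSV for the rows above might look like the following (assuming the first row holds the two column headers):

```csv
Input,Output
What is the capital of UK,Capital of UK is London
How many days are there in a year,There are 365 days in a year
```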

## Evaluators

Evaluators are like unit tests. During an evaluation, the inputs from the Datasets are run on the selected flows, and the outputs are evaluated using the selected evaluators (see the sketch after this list for how the individual checks work conceptually). There are 3 types of evaluators:

* Text Based: string-based checks on the output:
  * Contains Any
  * Contains All
  * Does Not Contains Any
  * Does Not Contains All
  * Starts With
  * Does Not Starts With

* Numeric Based: checks on numeric metrics:
  * Total Tokens
  * Prompt Tokens
  * Completion Tokens
  * API Latency
  * LLM Latency
  * Chatflow Latency
  * Agentflow Latency (coming soon)
  * Output Characters Length

* LLM Based: uses another LLM to grade the output:
  * Hallucination
  * Correctness
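
Conceptually, a text based or numeric based evaluator is just a simple pass/fail predicate applied to each run's output and metrics. The following TypeScript sketch is illustrative only; the function names and threshold are assumptions, not Flowise's internal evaluator API:

```typescript
// Illustrative only: a simplified view of how text based and numeric based
// checks could be expressed. Not Flowise's actual evaluator implementation.
interface RunResult {
  output: string;
  totalTokens: number;
  apiLatencyMs: number;
}

// Text based: pass if the output contains any of the given terms
const containsAny = (output: string, terms: string[]): boolean =>
  terms.some((t) => output.toLowerCase().includes(t.toLowerCase()));

// Text based: pass if the output starts with the given prefix
const startsWith = (output: string, prefix: string): boolean =>
  output.startsWith(prefix);

// Numeric based: pass if total tokens stay under a chosen budget
const underTokenBudget = (result: RunResult, maxTokens: number): boolean =>
  result.totalTokens <= maxTokens;

// Evaluating a single run against three checks
const run: RunResult = {
  output: "Capital of UK is London",
  totalTokens: 42,
  apiLatencyMs: 850,
};

console.log(containsAny(run.output, ["London", "Paris"])); // true
console.log(startsWith(run.output, "The capital"));        // false
console.log(underTokenBudget(run, 100));                   // true
```

LLM based evaluators (Hallucination, Correctness) work differently: instead of a fixed predicate, another LLM is asked to grade the output.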

## Running an Evaluation

Now that we have Datasets and Evaluators prepared, we can start running an evaluation.

1.) Select the dataset(s) and chatflow(s) to evaluate. You can select multiple datasets and chatflows. In the example below, every input from Dataset1 is run against 2 chatflows. Since Dataset1 has 2 inputs, a total of 4 outputs will be produced and evaluated.
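
As a rough illustration of that arithmetic (the dataset and flow names below are placeholders, not part of the Flowise API):

```typescript
// Placeholder names; this only illustrates the run count, not the Flowise API.
const dataset1Inputs = [
  "What is the capital of UK",
  "How many days are there in a year",
];
const selectedChatflows = ["Chatflow A", "Chatflow B"];

// Every dataset input is run against every selected flow
const totalOutputs = dataset1Inputs.length * selectedChatflows.length;
console.log(totalOutputs); // 4
```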

2.) Select the evaluators. Only text based and numeric based evaluators can be selected at this stage.

3.) (Optional) Select an LLM Based evaluator, then start the evaluation:

4.) Wait for the evaluation to complete:

5.) After the evaluation has completed, click the graph icon on the right side to view the details:

The 3 charts above summarize the evaluation (a rough sketch of how these figures relate to the per-run results follows this list):

* Pass/fail rate
* Average prompt and completion tokens used
* Average latency of the request
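
As a minimal sketch of how those summary figures could be derived from the per-run results (the field names below are assumptions, not Flowise's actual export format):

```typescript
// Assumed shape of one evaluated run; field names are illustrative,
// not Flowise's actual data model.
interface EvaluatedRun {
  passed: boolean;
  promptTokens: number;
  completionTokens: number;
  latencyMs: number;
}

const summarize = (runs: EvaluatedRun[]) => {
  const avg = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  return {
    passRate: runs.filter((r) => r.passed).length / runs.length,
    avgPromptTokens: avg(runs.map((r) => r.promptTokens)),
    avgCompletionTokens: avg(runs.map((r) => r.completionTokens)),
    avgLatencyMs: avg(runs.map((r) => r.latencyMs)),
  };
};

// Example with two evaluated runs
console.log(
  summarize([
    { passed: true, promptTokens: 30, completionTokens: 12, latencyMs: 800 },
    { passed: false, promptTokens: 28, completionTokens: 20, latencyMs: 950 },
  ])
);
// { passRate: 0.5, avgPromptTokens: 29, avgCompletionTokens: 16, avgLatencyMs: 875 }
```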

The table below the charts shows the details of each execution.

## Re-run Evaluation

When the flows used in an evaluation have been updated or modified, a warning message is shown:

You can re-run the same evaluation using the Re-Run Evaluation button at the top-right corner. You will then be able to see the different versions:

You can also view and compare the results from different versions:

## Video Tutorial

{% embed url="https://youtu.be/kgUttHMkGFg?si=3rLplEp_0TI0p6UV&t=486" %}