# Batch Evaluation
Batch evaluation lets you test AI workflow outputs against a set of test cases before deploying to production. Provide inputs, actual AI responses, and optionally expected outputs — the judge LLM scores every case and produces a detailed quality report.
## Overview
While live evaluation scores executions as they happen in production, batch evaluation is designed for pre-deploy validation. You supply a set of test cases with known inputs and outputs, and the judge LLM evaluates each one against your chosen criteria.
This is ideal for:
- Validating prompt changes before promoting a workflow to production.
- Regression testing after modifying AI node configurations or switching models.
- Comparing output quality across different model providers or versions.
- Building a gold standard test suite for continuous quality assurance.
## Test Cases
Each test case consists of three fields:
| Field | Required | Description |
|---|---|---|
| Input | Yes | The input data passed to the workflow, as a JSON object. For example, the user message or structured data the workflow receives. |
| Output | Yes | The actual AI-generated output to evaluate. This is the response your workflow produced for the given input. |
| Expected Output | No | The gold standard (correct) answer. When provided, the judge performs an additional comparison to determine if the actual output matches the expected one. |
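As an illustration, a test case with the three fields above could be represented in memory like this (the dictionary layout is a sketch, not the product's exact wire format):

```python
# Hypothetical in-memory representation of one test case.
# Field names mirror the table above; the exact storage format may differ.
test_case = {
    "input": {"message": "What is 2+2?"},  # JSON object passed to the workflow
    "output": "The answer is 4.",          # actual AI-generated response
    "expected_output": "4",                # optional gold standard answer
}

# expected_output is optional; a case without it simply skips
# the gold standard comparison step.
minimal_case = {
    "input": {"message": "Summarize this article"},
    "output": "The article discusses AI safety...",
}
```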
### CSV Format
You can upload test cases via CSV. The file must include `input` and `output` columns. The `expected_output` column is optional.
```csv
input,output,expected_output
"What is 2+2?","The answer is 4.","4"
"Capital of France?","Paris is the capital of France.","Paris"
"Summarize this article","The article discusses AI safety...",""
```
Alternatively, add test cases manually using the form in the batch evaluation modal.
## Running a Batch Evaluation
### 1. Open the workflow editor
Navigate to the workflow you want to evaluate. The workflow must contain at least one AI node.
### 2. Click the Evaluate button
In the toolbar, click the Evaluate button (cyan). This opens the batch evaluation modal.
### 3. Configure the judge LLM
Select the provider (OpenAI, Anthropic, Gemini, Azure, or Custom), model, and credential. Choose the evaluation dimensions and set a pass threshold (1.0–5.0).
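The configuration constraints above (the provider list and the 1.0–5.0 threshold range) can be checked up front. This validator is purely illustrative — the function and field names are assumptions, not the product's API:

```python
# Provider names and the 1.0-5.0 threshold range come from the docs above;
# this validation helper itself is a hypothetical sketch.
PROVIDERS = {"openai", "anthropic", "gemini", "azure", "custom"}

def validate_judge_config(provider, model, pass_threshold):
    if provider.lower() not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider}")
    if not model:
        raise ValueError("a model name is required")
    if not 1.0 <= pass_threshold <= 5.0:
        raise ValueError("pass threshold must be between 1.0 and 5.0")
    return {
        "provider": provider.lower(),
        "model": model,
        "pass_threshold": pass_threshold,
    }

cfg = validate_judge_config("OpenAI", "gpt-4o", 3.5)
```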
### 4. Add test cases
Upload a CSV file or add test cases manually. Each test case needs an input (JSON) and the actual AI output. Optionally include an expected output for gold standard comparison.
### 5. Start evaluation
Click Start Evaluation. The evaluation runs asynchronously — results stream in as each test case is scored. You can monitor progress in real time via the Results tab.
## Understanding Results
When a batch evaluation completes, you see a summary and detailed per-case results:
| Metric | Description |
|---|---|
| Total Cases | The number of test cases in the batch. |
| Passed | Cases where the composite score met or exceeded the pass threshold. |
| Failed | Cases where the composite score fell below the pass threshold. |
| Average Score | The mean composite score across all completed cases. |
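The summary metrics follow directly from the per-case composite scores and the pass threshold. A minimal sketch of that arithmetic (function and key names are illustrative only):

```python
# Derive the summary metrics in the table above from composite scores.
def summarize(scores, pass_threshold):
    passed = sum(1 for s in scores if s >= pass_threshold)
    return {
        "total_cases": len(scores),
        "passed": passed,
        "failed": len(scores) - passed,
        # mean composite score across all completed cases
        "average_score": sum(scores) / len(scores) if scores else None,
    }

summary = summarize([4.0, 3.0, 5.0, 4.0], pass_threshold=3.5)
# 3 of 4 cases meet the 3.5 threshold; the average score is 4.0
```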
### Per-Case Details
Expand any result row to see:
- Dimension scores — Individual 1–5 scores for each enabled criterion (groundedness, relevance, etc.).
- Composite score — The average of all dimension scores for that case.
- Reasoning — The judge model's explanation for each score, helping you understand why a case passed or failed.
- Gold standard match — If an expected output was provided, whether the actual output semantically matches it, along with the judge's reasoning for the comparison.
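Since the docs state the composite is the average of all enabled dimension scores, the per-case pass/fail verdict reduces to a couple of lines (dimension names here are examples taken from the list above):

```python
# Per-case composite: average of the enabled dimension scores.
def composite_score(dimension_scores):
    return sum(dimension_scores.values()) / len(dimension_scores)

dims = {"groundedness": 4, "relevance": 5, "coherence": 3}
score = composite_score(dims)   # (4 + 5 + 3) / 3 = 4.0
passed = score >= 3.5           # passes a 3.5 threshold
```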
## Gold Standard Comparison
When you provide an expected output for a test case, the judge LLM performs an additional semantic comparison. Rather than doing an exact string match, the judge evaluates whether the actual output conveys the same meaning and correctness as the expected output.
The comparison produces:
- A match/mismatch verdict.
- A brief reasoning explaining why the outputs are considered equivalent or different.
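To make the semantic-comparison step concrete, here is the kind of instruction a judge LLM might be given. The actual prompt used by the product is not documented, so this is strictly a hypothetical sketch:

```python
# Hypothetical judge prompt for the gold standard comparison step.
# The real prompt is internal to the product; this only illustrates
# "semantic equivalence, not exact string match".
def build_comparison_prompt(actual, expected):
    return (
        "Compare the two answers for semantic equivalence, "
        "not exact wording.\n"
        f"Expected answer: {expected}\n"
        f"Actual answer: {actual}\n"
        "Reply with a verdict (match/mismatch) and a brief reasoning."
    )

prompt = build_comparison_prompt(
    actual="Paris is the capital of France.",
    expected="Paris",
)
```

A judge given this prompt should report a match here: the wording differs, but both answers convey the same fact.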
## Building a Test Suite
## Evaluation History
All batch evaluation runs are saved and accessible from the History tab in the batch evaluation modal. Each run records:
- Run name, status, and timestamp.
- Judge provider and model used.
- Summary statistics (total, passed, failed, average score).
- Full per-case results with scores and reasoning.
Click any past run to view its full results. This makes it easy to compare quality across different model versions or prompt iterations.
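For example, with two saved runs loaded from history, picking the stronger prompt iteration is a one-liner (the record fields follow the list above; the storage format is an assumption):

```python
# Illustrative comparison of two saved runs' summary statistics.
# Field names mirror the history record list above.
runs = [
    {"name": "gpt-4o baseline", "passed": 18, "failed": 6, "average_score": 4.1},
    {"name": "prompt v2",       "passed": 21, "failed": 3, "average_score": 4.4},
]

best = max(runs, key=lambda r: r["average_score"])
# "prompt v2" wins on average composite score
```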
## Supported Judge Providers
| Provider | Models |
|---|---|
| OpenAI | GPT-4o, GPT-4o Mini, GPT-4 Turbo |
| Anthropic | Claude Opus 4.6, Claude Sonnet 4.6, Claude Haiku 4.5 |
| Google Gemini | Gemini 2.0 Flash, Gemini 1.5 Pro |
| Azure OpenAI | GPT-4o, GPT-4 |
| Custom | Any OpenAI-compatible endpoint (provide your own model name) |
**Credential required** — a credential for the selected judge provider must be configured before you can run an evaluation; you select it when configuring the judge LLM.