# Batch Evaluation
Batch evaluation lets you test AI workflow outputs against a set of test cases before deploying to production. Provide inputs, actual AI responses, and optionally expected outputs — the judge LLM scores every case and produces a detailed quality report.
## Overview
While live evaluation scores executions as they happen in production, batch evaluation is designed for pre-deploy validation. You supply a set of test cases with known inputs and outputs, and the judge LLM evaluates each one against your chosen criteria.
This is ideal for:
- Validating prompt changes before promoting a workflow to production.
- Regression testing after modifying AI node configurations or switching models.
- Comparing output quality across different model providers or versions.
- Building a gold standard test suite for continuous quality assurance.
## Test Cases
Each test case consists of three fields:
| Field | Required | Description |
|---|---|---|
| Input | Yes | The input data passed to the workflow, as a JSON object. For example, the user message or structured data the workflow receives. |
| Output | Yes | The actual AI-generated output to evaluate. This is the response your workflow produced for the given input. |
| Expected Output | No | The gold standard (correct) answer. When provided, the judge performs an additional comparison to determine if the actual output matches the expected one. |
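As an illustration, a test case with the three fields above could be represented in memory like this (the dictionary layout is a sketch, not the product's exact wire format):

```python
# Hypothetical in-memory representation of one test case.
# Field names mirror the table above; the exact storage format may differ.
test_case = {
    "input": {"message": "What is 2+2?"},  # JSON object passed to the workflow
    "output": "The answer is 4.",          # actual AI-generated response
    "expected_output": "4",                # optional gold standard answer
}

# expected_output is optional; a case without it simply skips
# the gold standard comparison step.
minimal_case = {
    "input": {"message": "Summarize this article"},
    "output": "The article discusses AI safety...",
}
```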
### CSV Format
You can upload test cases via CSV. The file must include `input` and `output` columns. The `expected_output` column is optional.
```csv
input,output,expected_output
"What is 2+2?","The answer is 4.","4"
"Capital of France?","Paris is the capital of France.","Paris"
"Summarize this article","The article discusses AI safety...",""
```
Alternatively, add test cases manually using the form in the batch evaluation modal.
## Running a Batch Evaluation
### 1. Open the workflow editor
Navigate to the workflow you want to evaluate. The workflow must contain at least one AI node.
### 2. Click the Evaluate button
In the toolbar, click the Evaluate button (cyan). This opens the batch evaluation modal.
### 3. Configure the judge LLM
Select the provider (OpenAI, Anthropic, Gemini, Azure, or Custom), model, and credential. Choose the evaluation dimensions and set a pass threshold (1.0–5.0).
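The configuration constraints above (the provider list and the 1.0–5.0 threshold range) can be checked up front. This validator is purely illustrative — the function and field names are assumptions, not the product's API:

```python
# Provider names and the 1.0-5.0 threshold range come from the docs above;
# this validation helper itself is a hypothetical sketch.
PROVIDERS = {"openai", "anthropic", "gemini", "azure", "custom"}

def validate_judge_config(provider, model, pass_threshold):
    if provider.lower() not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider}")
    if not model:
        raise ValueError("a model name is required")
    if not 1.0 <= pass_threshold <= 5.0:
        raise ValueError("pass threshold must be between 1.0 and 5.0")
    return {
        "provider": provider.lower(),
        "model": model,
        "pass_threshold": pass_threshold,
    }

cfg = validate_judge_config("OpenAI", "gpt-4o", 3.5)
```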
### 4. Add test cases
Upload a CSV file or add test cases manually. Each test case needs an input (JSON) and the actual AI output. Optionally include an expected output for gold standard comparison.
### 5. Start evaluation
Click Start Evaluation. The evaluation runs asynchronously — results stream in as each test case is scored. You can monitor progress in real time via the Results tab.
## Understanding Results
When a batch evaluation completes, you see a summary and detailed per-case results:
| Metric | Description |
|---|---|
| Total Cases | The number of test cases in the batch. |
| Passed | Cases where the composite score met or exceeded the pass threshold. |
| Failed | Cases where the composite score fell below the pass threshold. |
| Average Score | The mean composite score across all completed cases. |
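The summary metrics follow directly from the per-case composite scores and the pass threshold. A minimal sketch of that arithmetic (function and key names are illustrative only):

```python
# Derive the summary metrics in the table above from composite scores.
def summarize(scores, pass_threshold):
    passed = sum(1 for s in scores if s >= pass_threshold)
    return {
        "total_cases": len(scores),
        "passed": passed,
        "failed": len(scores) - passed,
        # mean composite score across all completed cases
        "average_score": sum(scores) / len(scores) if scores else None,
    }

summary = summarize([4.0, 3.0, 5.0, 4.0], pass_threshold=3.5)
# 3 of 4 cases meet the 3.5 threshold; the average score is 4.0
```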
### Per-Case Details
Expand any result row to see:
- Dimension scores — Individual 1–5 scores for each enabled criterion (groundedness, relevance, etc.).
- Composite score — The average of all dimension scores for that case.
- Reasoning — The judge model's explanation for each score, helping you understand why a case passed or failed.
- Gold standard match — If an expected output was provided, whether the actual output semantically matches it, along with the judge's reasoning for the comparison.
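Since the docs state the composite is the average of all enabled dimension scores, the per-case pass/fail verdict reduces to a couple of lines (dimension names here are examples taken from the list above):

```python
# Per-case composite: average of the enabled dimension scores.
def composite_score(dimension_scores):
    return sum(dimension_scores.values()) / len(dimension_scores)

dims = {"groundedness": 4, "relevance": 5, "coherence": 3}
score = composite_score(dims)   # (4 + 5 + 3) / 3 = 4.0
passed = score >= 3.5           # passes a 3.5 threshold
```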
## Gold Standard Comparison
When you provide an expected output for a test case, the judge LLM performs an additional semantic comparison. Rather than doing an exact string match, the judge evaluates whether the actual output conveys the same meaning and correctness as the expected output.
The comparison produces:
- A match/mismatch verdict.
- A brief reasoning explaining why the outputs are considered equivalent or different.
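To make the semantic-comparison step concrete, here is the kind of instruction a judge LLM might be given. The actual prompt used by the product is not documented, so this is strictly a hypothetical sketch:

```python
# Hypothetical judge prompt for the gold standard comparison step.
# The real prompt is internal to the product; this only illustrates
# "semantic equivalence, not exact string match".
def build_comparison_prompt(actual, expected):
    return (
        "Compare the two answers for semantic equivalence, "
        "not exact wording.\n"
        f"Expected answer: {expected}\n"
        f"Actual answer: {actual}\n"
        "Reply with a verdict (match/mismatch) and a brief reasoning."
    )

prompt = build_comparison_prompt(
    actual="Paris is the capital of France.",
    expected="Paris",
)
```

A judge given this prompt should report a match here: the wording differs, but both answers convey the same fact.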
## Building a Test Suite
## Evaluation History
All batch evaluation runs are saved and accessible from the History tab in the batch evaluation modal. Each run records:
- Run name, status, and timestamp.
- Judge provider and model used.
- Summary statistics (total, passed, failed, average score).
- Full per-case results with scores and reasoning.
Click any past run to view its full results. This makes it easy to compare quality across different model versions or prompt iterations.
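For example, with two saved runs loaded from history, picking the stronger prompt iteration is a one-liner (the record fields follow the list above; the storage format is an assumption):

```python
# Illustrative comparison of two saved runs' summary statistics.
# Field names mirror the history record list above.
runs = [
    {"name": "gpt-4o baseline", "passed": 18, "failed": 6, "average_score": 4.1},
    {"name": "prompt v2",       "passed": 21, "failed": 3, "average_score": 4.4},
]

best = max(runs, key=lambda r: r["average_score"])
# "prompt v2" wins on average composite score
```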
## Supported Judge Providers
| Provider | Models |
|---|---|
| OpenAI | GPT-4o, GPT-4o Mini, GPT-4 Turbo |
| Anthropic | Claude Opus 4.6, Claude Sonnet 4.6, Claude Haiku 4.5 |
| Google Gemini | Gemini 2.0 Flash, Gemini 1.5 Pro |
| Azure OpenAI | GPT-4o, GPT-4 |
| Custom | Any OpenAI-compatible endpoint (provide your own model name) |
**Credential required** — a credential for the selected judge provider must be configured before you can run an evaluation; you select it when configuring the judge LLM.