LLM-as-Judge Evaluation
LLM-as-Judge evaluation uses a separate language model to automatically score the quality of your AI agents' outputs. Define evaluation criteria, set passing thresholds, and continuously monitor output quality across all your workflows.
Overview
Traditional testing struggles with AI outputs because responses are non-deterministic and quality is subjective. LLM-as-Judge solves this by using a capable language model as an automated evaluator. The judge model receives the agent's input, output, and your evaluation criteria, then scores each response on a defined scale.
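NodeLoom runs this judge loop internally, but the underlying pattern is simple. The following Python sketch (hypothetical function names, not NodeLoom's API) shows the two halves: assembling a judge prompt from the input, output, and criterion, and parsing a 1–5 score back out of the judge's reply:

```python
import re

def build_judge_prompt(criterion: str, rubric: str,
                       user_input: str, agent_output: str) -> str:
    """Assemble the evaluation prompt sent to the judge model."""
    return (
        f"You are an impartial evaluator. Criterion: {criterion}.\n"
        f"Rubric: {rubric}\n\n"
        f"User input:\n{user_input}\n\n"
        f"Agent output:\n{agent_output}\n\n"
        "Respond with 'Score: N' (1-5) followed by one sentence of reasoning."
    )

def parse_judge_score(judge_reply: str) -> int:
    """Extract the 1-5 score from the judge model's free-text reply."""
    match = re.search(r"Score:\s*([1-5])", judge_reply)
    if match is None:
        raise ValueError("Judge reply did not contain a parseable score")
    return int(match.group(1))

# Canned judge reply for illustration (no API call made here):
reply = "Score: 4 - The answer is well supported by the provided context."
print(parse_judge_score(reply))  # → 4
```

In practice the prompt is sent to the configured judge provider and the parsed score is stored alongside the execution.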
NodeLoom supports two evaluation modes:
- Live evaluation — Automatically scores workflow executions as they happen, using sampling to control costs.
- Batch evaluation — Run a set of test cases through the judge LLM on demand, ideal for pre-deploy validation. See the Batch Evaluation page for details.
Evaluation Criteria
NodeLoom provides five built-in evaluation criteria. You can enable any combination of these for each workflow:
| Criterion | Description | Scale |
|---|---|---|
| Groundedness | Measures whether the response is supported by the provided context or source documents. High scores indicate the agent did not fabricate information beyond what was given. | 1 (not grounded) to 5 (fully grounded) |
| Relevance | Evaluates how well the response addresses the user's question or request. Penalizes off-topic, overly broad, or tangential answers. | 1 (irrelevant) to 5 (highly relevant) |
| Factual Accuracy | Checks whether factual claims in the response are correct. The judge model cross-references claims against available context and general knowledge. | 1 (inaccurate) to 5 (fully accurate) |
| Tone Adherence | Assesses whether the response matches the expected tone defined in the system prompt (e.g. professional, friendly, concise). Useful for customer-facing agents. | 1 (mismatched tone) to 5 (perfect tone match) |
| Safety | Evaluates whether the response contains harmful, biased, or inappropriate content. Also checks for leaked credentials or system prompt details. | 1 (unsafe) to 5 (fully safe) |
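Conceptually, an evaluation configuration is just a selection from this criteria set. A minimal sketch, assuming a simple dict-based representation (the rubric strings below are illustrative, not NodeLoom's actual judge prompts):

```python
# Built-in criteria, keyed by name (names taken from the table above);
# the rubric text is a placeholder for the real judge instructions.
CRITERIA = {
    "groundedness": "Is the response supported by the provided context?",
    "relevance": "Does the response address the user's request?",
    "factual_accuracy": "Are the factual claims in the response correct?",
    "tone_adherence": "Does the response match the expected tone?",
    "safety": "Is the response free of harmful or leaked content?",
}

def enabled_criteria(config: dict) -> list[str]:
    """Return the criteria the judge should score, in a stable order."""
    return [name for name in CRITERIA if name in config["enabled"]]

cfg = {"enabled": {"groundedness", "relevance", "safety"}}
print(enabled_criteria(cfg))  # → ['groundedness', 'relevance', 'safety']
```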
Team-Level Configuration
The default evaluation settings apply to all workflows in your team. To configure them:
- Go to Settings → Monitoring and open the Evaluations tab.
- Enable LLM Evaluations and choose a judge provider and model (OpenAI, Anthropic, Gemini, Azure OpenAI, or a custom OpenAI-compatible endpoint).
- Select a credential that the judge model will use for API access.
- Choose which evaluation criteria to enable and set a passing threshold (executions whose composite score falls below this value are flagged as failures).
- Set the sampling rate (0–100%) to control what percentage of executions are evaluated. A hash-based deterministic sampler ensures consistent selection.
- Optionally enable failure notifications to trigger incident playbooks when scores drop below thresholds.
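The deterministic sampler in step 5 can be sketched as follows. This is an illustrative implementation, assuming a SHA-256 hash of the execution ID (NodeLoom's actual hash function is not documented here); the point is that the decision depends only on the ID, so the same execution is always sampled or skipped consistently:

```python
import hashlib

def should_evaluate(execution_id: str, sampling_rate: int) -> bool:
    """Deterministically decide whether an execution is sampled.

    Hashing the execution ID maps it to a stable bucket in [0, 100),
    so every worker reaches the same decision for the same execution.
    """
    digest = hashlib.sha256(execution_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < sampling_rate

print(should_evaluate("exec-42", 100))  # always True at 100%
print(should_evaluate("exec-42", 0))    # always False at 0%
```

Because the bucket is derived from the ID rather than a random draw, re-running the sampler never changes which executions were selected.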
Per-Workflow Configuration
You can override the team-level defaults for any individual workflow. This is useful when different workflows have different quality requirements — for example, a customer-facing chatbot may need stricter tone adherence checks than an internal summarization workflow.
- Open a workflow that contains AI nodes in the workflow editor.
- Click the Eval Config button in the toolbar (only visible for workflows with AI nodes).
- Configure the same settings as the team level: provider, model, credential, sampling rate, dimensions, threshold, and notifications.
- Click Save Configuration. The workflow will now use these settings instead of the team defaults.
To revert a workflow back to team defaults, click Use Team Defaults in the eval config panel. This removes the workflow-level override entirely.
Configuration priority
When a workflow has its own eval config, those settings take precedence; workflows without an override fall back to the team-level defaults.
Results and Reporting
Evaluation results are displayed in the workflow's monitoring dashboard. For each evaluated execution, you can see:
- Individual scores for each enabled criterion, along with the judge model's reasoning.
- Pass/fail status based on your configured thresholds.
- A composite score that averages all enabled dimension scores for quick comparison.
- Trend charts showing how scores change over time, helping you identify quality degradation early.
- Aggregated reports that can be exported for compliance documentation.
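The composite score and pass/fail status described above reduce to an average and a threshold check. A minimal sketch (hypothetical helper names; NodeLoom computes this server-side):

```python
def composite_score(scores: dict[str, int]) -> float:
    """Average all enabled dimension scores into one composite value."""
    return sum(scores.values()) / len(scores)

def passes(scores: dict[str, int], threshold: float) -> bool:
    """An execution passes when its composite meets the threshold."""
    return composite_score(scores) >= threshold

result = {"groundedness": 5, "relevance": 4, "safety": 5}
print(composite_score(result))  # ≈ 4.67
print(passes(result, 4.0))      # → True
```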
Evaluation failures can automatically trigger incident playbooks for automated response. For example, a sustained drop in groundedness scores could trigger a workflow quarantine.
Defense in depth
The mechanisms above work best in combination: sampled live evaluation catches regressions in production, batch evaluation validates changes before deploy, and threshold-triggered playbooks contain failures automatically.