Drift Alerts

Drift detection identifies gradual performance degradation that individual anomaly scores might miss. By comparing recent execution metrics against established baselines, drift alerts surface trends before they become critical incidents.

Drift Types

NodeLoom monitors four categories of drift for each workflow:

  • Duration drift: Average execution time is increasing compared to the baseline. Indicates performance degradation, slow API responses, or resource contention.
  • Token drift: Average token consumption per execution is increasing. May indicate prompt bloat, unnecessary tool calls, or model changes.
  • Output size drift: Average output payload size is growing. Could indicate unbounded data fetches or downstream API changes returning more data.
  • Error rate drift: The percentage of failed executions is increasing relative to the baseline error rate.

How Detection Works

Drift detection compares a recent window of executions (the last 50 executions or last 7 days, whichever is smaller) against the baseline (calculated from the previous 30 days of successful executions). If the recent average exceeds the baseline by more than the configured threshold percentage, a drift alert is triggered.
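The comparison itself reduces to a percentage check. Below is a minimal sketch of that logic, not NodeLoom's implementation; the `isDrifting` helper, the `DriftCheckInput` shape, and the example values are assumptions for illustration.

```typescript
// Minimal sketch of the window-vs-baseline comparison described above.
// All names and shapes here are illustrative, not NodeLoom's actual API.

interface DriftCheckInput {
  baselineAvg: number;   // average over the previous 30 days of successful executions
  recentAvg: number;     // average over the recent window (last 50 runs or last 7 days)
  thresholdPct: number;  // configured threshold, e.g. 25 means "25% above baseline"
}

function isDrifting({ baselineAvg, recentAvg, thresholdPct }: DriftCheckInput): boolean {
  if (baselineAvg <= 0) return false; // no meaningful baseline yet
  const increasePct = ((recentAvg - baselineAvg) / baselineAvg) * 100;
  return increasePct > thresholdPct;
}

// Example: baseline 2.0 s, recent 2.6 s, threshold 25% -> 30% increase -> drift
console.log(isDrifting({ baselineAvg: 2.0, recentAvg: 2.6, thresholdPct: 25 })); // true
```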

Trigger Timing

Drift checks are triggered in two ways (both paths are sketched in code after this list):

  • After execution: Each completed execution triggers a drift check for the workflow. This provides near-real-time detection for high-frequency workflows.
  • Scheduled scan: A periodic background scan checks all active workflows for drift, catching slow-moving trends in low-frequency workflows that do not execute often enough for per-execution detection.
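
One way to picture how the two paths fit together is the sketch below. The function names, the stub implementations, and the hourly interval are all assumptions, not NodeLoom internals.

```typescript
// Illustrative wiring of the two trigger paths; names and interval are assumptions.

async function checkDriftForWorkflow(workflowId: string): Promise<void> {
  // ...run the window-vs-baseline comparison for this workflow (see earlier sketch)
  console.log(`drift check: ${workflowId}`);
}

async function listActiveWorkflowIds(): Promise<string[]> {
  // ...fetch active workflow ids from storage; stubbed for the example
  return ["wf_1", "wf_2"];
}

// 1. Per-execution trigger: near-real-time detection for high-frequency workflows.
async function onExecutionCompleted(workflowId: string): Promise<void> {
  await checkDriftForWorkflow(workflowId);
}

// 2. Scheduled scan: sweeps every active workflow, catching low-frequency
//    workflows that rarely hit the per-execution path.
async function scheduledDriftScan(): Promise<void> {
  for (const id of await listActiveWorkflowIds()) {
    await checkDriftForWorkflow(id);
  }
}

setInterval(() => void scheduledDriftScan(), 60 * 60 * 1000); // e.g. hourly
```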

Minimum data requirement

Drift detection requires a minimum number of baseline and recent executions before it becomes active for a workflow. This prevents false positives from small sample sizes.
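
In code, this is simply a guard evaluated before the comparison runs. The minimum counts below are placeholders, since NodeLoom's actual minimums are not specified here.

```typescript
// Hypothetical guard: skip the drift check until both samples are large enough.
const MIN_BASELINE_EXECUTIONS = 20; // placeholder value
const MIN_RECENT_EXECUTIONS = 10;   // placeholder value

function hasEnoughData(baselineCount: number, recentCount: number): boolean {
  return baselineCount >= MIN_BASELINE_EXECUTIONS && recentCount >= MIN_RECENT_EXECUTIONS;
}
```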

Configurable Thresholds

Each team can customise drift thresholds from the workspace monitoring settings. Thresholds are expressed as a percentage above the baseline (an example configuration follows the list):

  • Duration threshold: How much slower the recent average duration can be before triggering an alert.
  • Token threshold: How much higher the recent average token consumption can be before triggering an alert.
  • Output size threshold: How much larger the recent average output size can be before triggering an alert.
  • Error rate threshold: The absolute increase in error rate (in percentage points) that triggers an alert.
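
As a concrete illustration, a team's threshold settings might look like the object below. The field names and values are hypothetical, not NodeLoom's documented schema; note that the error rate threshold is in absolute percentage points rather than a relative percentage.

```typescript
// Hypothetical shape for workspace-level drift thresholds.
interface DriftThresholds {
  durationPct: number;      // % above baseline average duration
  tokensPct: number;        // % above baseline average token consumption
  outputSizePct: number;    // % above baseline average output size
  errorRatePoints: number;  // absolute increase in error rate, in percentage points
}

const workspaceDefaults: DriftThresholds = {
  durationPct: 25,
  tokensPct: 30,
  outputSizePct: 40,
  errorRatePoints: 5, // e.g. baseline 2% error rate -> alert at 7%
};
```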

Per-workflow overrides

If a specific workflow has naturally variable performance (e.g., a web scraper that varies by page size), you can set per-workflow threshold overrides from the workflow settings panel.
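
Conceptually, an override is a partial set of thresholds merged over the workspace defaults. The sketch below is one way to express that; the `DriftThresholds` shape from the previous example is repeated here so the snippet stands alone.

```typescript
// Hypothetical merge of per-workflow overrides onto workspace defaults.
interface DriftThresholds {
  durationPct: number;
  tokensPct: number;
  outputSizePct: number;
  errorRatePoints: number;
}

function effectiveThresholds(
  defaults: DriftThresholds,
  overrides: Partial<DriftThresholds> = {},
): DriftThresholds {
  return { ...defaults, ...overrides };
}

// A scraper whose output size legitimately varies gets a looser output-size threshold.
const workspaceDefaults: DriftThresholds = { durationPct: 25, tokensPct: 30, outputSizePct: 40, errorRatePoints: 5 };
const scraperThresholds = effectiveThresholds(workspaceDefaults, { outputSizePct: 100 });
console.log(scraperThresholds.outputSizePct); // 100
```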

Alert Lifecycle

Drift alerts follow a simple lifecycle:

  • Active: The drift condition has been detected and the alert is visible in the monitoring dashboard. Notifications are sent.
  • Acknowledged: A team member has reviewed the alert. It remains visible but is marked as acknowledged.
  • Resolved: The recent metrics have returned to within the threshold. The alert is automatically resolved.

Alerts are automatically resolved when the recent window returns below the threshold on subsequent checks. You do not need to manually close resolved alerts.
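
The lifecycle can be pictured as a small state machine. The transition function below is an illustration of the rules above, not NodeLoom's code.

```typescript
// Illustrative state machine for the drift alert lifecycle.
type AlertState = "active" | "acknowledged" | "resolved";
type AlertEvent = "acknowledge" | "metrics_recovered";

function nextState(current: AlertState, event: AlertEvent): AlertState {
  if (event === "metrics_recovered") return "resolved"; // auto-resolve; no manual close needed
  if (event === "acknowledge" && current === "active") return "acknowledged";
  return current; // all other combinations leave the state unchanged
}
```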

Notifications

Drift alerts use the same notification channels as anomaly detection: email and webhook. Configure notification preferences per team from the workspace monitoring settings.
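
For webhook consumers, a drift notification might carry fields like the ones below. This shape is an assumption for illustration only; consult the webhook reference for the actual schema.

```typescript
// Hypothetical drift alert webhook payload; field names are assumptions,
// not NodeLoom's documented schema.
interface DriftAlertWebhookPayload {
  workflowId: string;
  driftType: "duration" | "tokens" | "output_size" | "error_rate";
  baselineAvg: number;
  recentAvg: number;
  percentOverBaseline: number;
  state: "active" | "acknowledged" | "resolved";
  detectedAt: string; // ISO 8601 timestamp
}

const example: DriftAlertWebhookPayload = {
  workflowId: "wf_checkout_sync",
  driftType: "duration",
  baselineAvg: 2.0,
  recentAvg: 2.6,
  percentOverBaseline: 30,
  state: "active",
  detectedAt: "2025-01-15T09:30:00Z",
};
```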

To avoid alert fatigue, NodeLoom deduplicates drift notifications. A new notification is only sent when one of the following occurs (sketched in code after this list):

  • A new drift type is detected for a workflow (e.g., duration drift appears for the first time).
  • A previously resolved drift type re-triggers.
  • The drift severity increases significantly (e.g., from 30% over baseline to 60% over baseline).
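
Those three rules amount to a small decision function. The sketch below is one way to express them; the severity step (re-notify when the overage roughly doubles) is an assumption based on the 30% to 60% example above, not a documented formula.

```typescript
// Illustrative dedup check; the doubling rule is an assumption, not a documented formula.
interface KnownDrift {
  resolved: boolean;
  percentOverBaseline: number;
}

function shouldNotify(
  known: KnownDrift | undefined, // undefined = this drift type was never seen before
  current: { percentOverBaseline: number },
): boolean {
  if (!known) return true;         // new drift type detected for this workflow
  if (known.resolved) return true; // previously resolved drift type re-triggered
  return current.percentOverBaseline >= known.percentOverBaseline * 2; // significant jump
}
```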

Drift vs Anomaly Detection

While both features monitor execution health, they serve different purposes:

  • Scope: anomaly detection evaluates an individual execution; drift alerts track a trend across many executions.
  • Question answered: anomaly detection asks "Was this specific execution unusual?"; drift alerts ask "Is this workflow getting worse over time?"
  • Detection speed: anomaly detection is immediate (per execution); drift detection is gradual (it requires a window of data).
  • Best for: anomaly detection catches one-off spikes and security events; drift alerts identify regressions and performance degradation.

Next Steps