Executive Summary
Autonomous AI agents — systems that plan, execute tools, and iterate toward a goal — have rapidly moved from research into production. Engineering teams now run thousands of agent sessions per day. Yet the vast majority of runtime failures are silent: the agent doesn't crash; it spins, repeating the same actions until it exhausts its budget.
Key finding: Failed agent sessions waste 3.4× more compute than successful ones. 98.7% of that waste comes from a single failure mode: behavioral loops — the agent executing the same or semantically identical actions cyclically without progress.
This report presents the first large-scale, cross-model, cross-framework measurement of AI agent compute waste. All data comes from publicly available trajectory datasets on HuggingFace. All analysis is reproducible.
Dataset & Methodology
We analyzed four trajectory datasets totaling 99,167 sessions.
Detection Method
Waste is measured using the CAUM motor v10.31 — a behavioral analysis engine that computes semantic similarity between consecutive agent steps using SBERT embeddings, then classifies each step into one of four behavioral regimes:
- EXPLORER — varied approaches, trying new strategies. Healthy.
- GRIND — slow progress, limited variety. Marginal.
- STAGNATION — no forward progress. Problematic.
- LOOP — repeating the same actions cyclically. Waste.
Steps classified as LOOP or STAGNATION are counted as wasted compute. The engine reads only tool names and structural metadata — zero semantic content, zero prompt data, zero business logic.
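The classification above can be sketched with consecutive-step similarity. This is a minimal illustration, not the CAUM implementation: a bag-of-tokens cosine stands in for SBERT embeddings, and the thresholds are invented for the example.

```python
# Sketch of regime classification over consecutive agent steps.
# Assumptions: a toy bag-of-tokens vector replaces SBERT embeddings,
# and the similarity thresholds are illustrative, not CAUM's calibration.
import math
from collections import Counter

def embed(step: str) -> Counter:
    """Toy stand-in for a sentence embedding: token counts."""
    return Counter(step.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(prev_step: str, step: str) -> str:
    """Map similarity between consecutive steps to a behavioral regime."""
    sim = cosine(embed(prev_step), embed(step))
    if sim > 0.95:
        return "LOOP"        # near-identical action repeated
    if sim > 0.80:
        return "STAGNATION"  # barely different, no new information
    if sim > 0.50:
        return "GRIND"       # slow progress, limited variety
    return "EXPLORER"        # genuinely new approach

print(classify("grep -r foo src", "grep -r foo src"))  # identical step: LOOP
print(classify("grep -r foo src", "edit main.py"))     # new action: EXPLORER
```

In a production detector the thresholds would be calibrated per model family, which is consistent with the per-model variation reported in Finding 3.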
Core Findings
Finding 1 — Failed sessions waste 3.4× more compute
| Session Type | Sessions | Avg Waste % | Cohen's d |
|---|---|---|---|
| Resolved (successful) | 19,591 | 4.04% | — |
| Unresolved (failed) | 79,576 | 13.83% | — |
| Difference | — | 3.4× higher in failures | +0.548 |
A Cohen's d of +0.548 is a medium-to-large effect size: waste separates failed sessions from successful ones well beyond what noise would produce, making it a reliable predictor of session failure. The effect holds across all four model families tested.
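For reference, the effect size used throughout this report can be computed in a few lines (pooled-standard-deviation form; the sample values are placeholders, not the report's data):

```python
# Cohen's d with pooled standard deviation, as used for the
# effect sizes in this report. Sample inputs are illustrative.
import math
import statistics

def cohens_d(a: list, b: list) -> float:
    """Standardized mean difference between two samples."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a) +
                  (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)

# Illustrative waste fractions: failed vs. resolved sessions
d = cohens_d([0.14, 0.18, 0.10, 0.16], [0.04, 0.06, 0.03, 0.05])
```

A positive d means the first sample (failed sessions) shows higher average waste, in units of the pooled standard deviation.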
Finding 2 — 98.7% of waste is behavioral loops
Of all wasted steps in the dataset, 98.7% were classified as LOOP — the agent repeating semantically identical actions. Only 1.3% were STAGNATION — no forward progress, but without outright repetition. This is a critical finding for system designers: the primary failure mode is not "agent gives up" — it's "agent doesn't know it's stuck."
Why this matters: Loops are detectable. Because CAUM monitors behavioral structure rather than semantic content, it can identify loops in real time — while the agent is running — without reading a single character of prompt or payload data.
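The real-time detection idea can be sketched as a streaming monitor over tool names alone. This is a minimal sketch, not CAUM's detector: the window size and maximum cycle length are invented parameters, and only tool names are inspected, never payloads.

```python
# Minimal streaming loop monitor: flag when the recent window of
# tool calls collapses into a short repeating cycle.
# Window size and cycle bound are illustrative assumptions.
from collections import deque

class LoopMonitor:
    def __init__(self, window: int = 8, max_cycle: int = 3):
        self.recent = deque(maxlen=window)
        self.max_cycle = max_cycle

    def push(self, tool_name: str) -> bool:
        """Record one tool call; return True if the window repeats a cycle."""
        self.recent.append(tool_name)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough history yet
        seq = list(self.recent)
        for cycle in range(1, self.max_cycle + 1):
            if all(seq[i] == seq[i % cycle] for i in range(len(seq))):
                return True
        return False

mon = LoopMonitor()
for tool in ["search", "read"] * 4:  # a two-step cycle, repeated
    if mon.push(tool):
        print("loop detected: interrupt or replan")
```

Because the monitor only sees tool names, it can run as a passive sidecar with no access to prompts or business data, which is the deployment model described later in this report.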
Finding 3 — Cross-model signal is consistent
| Model Family | Framework | Cohen's d | AUC | Signal |
|---|---|---|---|---|
| GPT-4o | SWE-agent | +1.099 | 0.757 | EXCELLENT |
| GPT-4o | OpenHands | +1.131 | 0.852 | EXCELLENT |
| Llama 3.x | SWE-agent | +0.968 | 0.747 | EXCELLENT |
| Gemini 3 Flash | mini-SWE | +0.804 | 0.722 | EXCELLENT |
| Claude 3.7 | SWE-agent | +0.775 | 0.650 | GOOD |
| Claude 3.5 | SWE-agent | +0.175 | 0.554 | WEAK |
The GPT-4o / OpenHands result (d=+1.131, AUC=0.852) is particularly notable — it shows that the behavioral signal is framework-agnostic. The same motor, with no retraining, works across different agent execution environments.
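The AUC column can be read as the probability that a randomly chosen failed session shows more waste than a randomly chosen resolved one. A minimal pairwise sketch (the sample values are illustrative, not the report's data):

```python
# AUC as a pairwise probability: P(failed session's waste score
# exceeds a resolved session's), with ties counted as half.
# Sample waste fractions below are illustrative only.
def auc(pos: list, neg: list) -> float:
    """Probability a positive outranks a negative; ties count 0.5."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

failed_waste = [0.20, 0.15, 0.12]    # waste fractions, failed sessions
resolved_waste = [0.03, 0.13, 0.05]  # waste fractions, resolved sessions
score = auc(failed_waste, resolved_waste)
```

An AUC of 0.5 means the waste signal carries no ranking information; 1.0 means it separates failures from successes perfectly.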
Finding 4 — The $95K/year enterprise cost
For an enterprise running 10,000 agent sessions per day, the measured waste translates directly to compute costs:
- Failed sessions: ~78% of all sessions (based on SWE-bench resolution rates)
- Average waste per failed session: 13.83% of total compute
- At $0.03/session compute cost (conservative LLM API estimate)
- Daily waste cost: ~$261
- Annual waste: ~$95,000
This is a conservative estimate. It only counts token/compute cost. It excludes engineering time spent debugging loops, customer SLA breaches from stuck sessions, and infrastructure costs from processes that don't terminate cleanly.
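The cost model generalizes to any deployment. The sketch below makes the inputs explicit; every value shown is a placeholder to be replaced with your own session volume, failure rate, and billing data, not the report's exact inputs.

```python
# Back-of-envelope model for daily spend on wasted agent compute.
# All parameter values below are placeholder assumptions.
def daily_waste_cost(sessions_per_day: int,
                     failure_rate: float,
                     waste_failed: float,
                     waste_resolved: float,
                     cost_per_session: float) -> float:
    """Expected daily spend on wasted compute (token cost only)."""
    failed = sessions_per_day * failure_rate
    resolved = sessions_per_day * (1 - failure_rate)
    wasted = failed * waste_failed + resolved * waste_resolved
    return wasted * cost_per_session

# Placeholder inputs: substitute your own telemetry and pricing
daily = daily_waste_cost(1_000, 0.5, 0.10, 0.02, 0.50)
annual = daily * 365
```

As in the report's estimate, this counts only compute cost and excludes engineering time, SLA penalties, and infrastructure overhead.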
Enterprise Validation — Real-World Tasks
Beyond code-focused benchmarks, we validated the CAUM motor on the hkust-nlp/Toolathlon-Trajectories dataset — 6,818 sessions of real-world agentic tasks: WooCommerce order management, railway ticketing, form filling, API orchestration.
Signal was confirmed in 10 out of 22 models tested (d > 0.20). Best performer: Gemini 2.5 Flash at d=+0.527. The signal is weaker on enterprise tasks than coding tasks — consistent with the hypothesis that enterprise agents have more diverse tool vocabularies, making loop patterns harder to detect without domain-specific calibration.
Reproducibility
All analysis code and datasets used in this report are publicly available:
- Primary dataset: nebius/SWE-agent-trajectories on HuggingFace
- Toolathlon dataset: hkust-nlp/Toolathlon-Trajectories on HuggingFace
- Motor version: CAUM v10.31.0 — available via API at caum.systems/upload
- All session certificates are Ed25519 cryptographically signed for auditability
How CAUM Detects Waste in Real Time
CAUM integrates as a passive observer alongside any agent framework. It receives each tool call and result as the agent executes, computes a behavioral regime classification, and emits a running health score (UDS 0–1). No prompts, payloads, or business logic are read.
```python
# One-time setup — works with any agent framework
from caum import ZeroTrustAuditor

aud = ZeroTrustAuditor(model_hint="gpt4o")

# Per-step — call this after every tool execution
for tool, result in agent.steps():
    verdict = aud.push(tool, result)
    if verdict["regime"] == "LOOP":
        alert_team("Agent stuck in loop", verdict)

# End of session — get full audit certificate
cert = aud.finalize()  # cert["uds"] is the health score (0–1)
```
Analyze Your Agent's Trajectories
Upload a trajectory file and get a full 10-page forensic PDF report in under 3 minutes. Cryptographically signed. No prompts read.
Questions? contact@caum.systems