Executive Summary
Autonomous AI agents — systems that plan, execute tools, and iterate toward a goal — have rapidly moved from research into production. Engineering teams now run thousands of agent sessions per day. Yet the vast majority of runtime failures are silent: the agent doesn't crash; it spins, repeating the same actions until it exhausts its budget.
Key finding: Failed agent sessions waste 3.4× more compute than successful ones. 98.7% of that waste comes from a single failure mode: behavioral loops — the agent executing the same or semantically identical actions cyclically without progress.
This report presents the first large-scale, cross-model, cross-framework measurement of AI agent compute waste. All data comes from publicly available trajectory datasets on HuggingFace. All analysis is reproducible.
Dataset & Methodology
We analyzed four trajectory datasets totaling 99,167 sessions.
Detection Method
Waste is measured using the CAUM motor v10.31 — a behavioral analysis engine that computes semantic similarity between consecutive agent steps using SBERT embeddings, then classifies each step into one of four behavioral regimes:
- EXPLORER — varied approaches, trying new strategies. Healthy.
- GRIND — slow progress, limited variety. Marginal.
- STAGNATION — no forward progress. Problematic.
- LOOP — repeating the same actions cyclically. Waste.
Steps classified as LOOP or STAGNATION are counted as wasted compute. The engine reads only tool names and structural metadata — zero semantic content, zero prompt data, zero business logic.
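The classification above can be sketched with consecutive-step similarity. This is a minimal illustration, not the CAUM implementation: a bag-of-tokens cosine stands in for SBERT embeddings, and the thresholds are invented for the example.

```python
# Sketch of regime classification over consecutive agent steps.
# Assumptions: a toy bag-of-tokens vector replaces SBERT embeddings,
# and the similarity thresholds are illustrative, not CAUM's calibration.
import math
from collections import Counter

def embed(step: str) -> Counter:
    """Toy stand-in for a sentence embedding: token counts."""
    return Counter(step.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(prev_step: str, step: str) -> str:
    """Map similarity between consecutive steps to a behavioral regime."""
    sim = cosine(embed(prev_step), embed(step))
    if sim > 0.95:
        return "LOOP"        # near-identical action repeated
    if sim > 0.80:
        return "STAGNATION"  # barely different, no new information
    if sim > 0.50:
        return "GRIND"       # slow progress, limited variety
    return "EXPLORER"        # genuinely new approach

print(classify("grep -r foo src", "grep -r foo src"))  # identical step: LOOP
print(classify("grep -r foo src", "edit main.py"))     # new action: EXPLORER
```

In a production detector the thresholds would be calibrated per model family, which is consistent with the per-model variation reported in Finding 3.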
Core Findings
Finding 1 — Failed sessions waste 3.4× more compute
| Session Type | Sessions | Avg Waste % | Cohen's d |
|---|---|---|---|
| Resolved (successful) | 19,591 | 4.04% | — |
| Unresolved (failed) | 79,576 | 13.83% | — |
| Difference | — | 3.4× higher in failures | +0.548 |
A Cohen's d of +0.548 is a medium-to-large effect size: waste separates failed sessions from successful ones well beyond what noise would produce, making it a reliable predictor of session failure. The effect holds across all four model families tested.
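For reference, the effect size used throughout this report can be computed in a few lines (pooled-standard-deviation form; the sample values are placeholders, not the report's data):

```python
# Cohen's d with pooled standard deviation, as used for the
# effect sizes in this report. Sample inputs are illustrative.
import math
import statistics

def cohens_d(a: list, b: list) -> float:
    """Standardized mean difference between two samples."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a) +
                  (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)

# Illustrative waste fractions: failed vs. resolved sessions
d = cohens_d([0.14, 0.18, 0.10, 0.16], [0.04, 0.06, 0.03, 0.05])
```

A positive d means the first sample (failed sessions) shows higher average waste, in units of the pooled standard deviation.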
Finding 2 — 98.7% of waste is behavioral loops
Of all wasted steps in the dataset, 98.7% were classified as LOOP — the agent repeating semantically identical actions. Only 1.3% were STAGNATION — no forward progress, but without outright repetition. This is a critical finding for system designers: the primary failure mode is not "agent gives up" — it's "agent doesn't know it's stuck."
Why this matters: Loops are detectable. Because CAUM monitors behavioral structure rather than semantic content, it can identify loops in real time — while the agent is running — without reading a single character of prompt or payload data.
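The real-time detection idea can be sketched as a streaming monitor over tool names alone. This is a minimal sketch, not CAUM's detector: the window size and maximum cycle length are invented parameters, and only tool names are inspected, never payloads.

```python
# Minimal streaming loop monitor: flag when the recent window of
# tool calls collapses into a short repeating cycle.
# Window size and cycle bound are illustrative assumptions.
from collections import deque

class LoopMonitor:
    def __init__(self, window: int = 8, max_cycle: int = 3):
        self.recent = deque(maxlen=window)
        self.max_cycle = max_cycle

    def push(self, tool_name: str) -> bool:
        """Record one tool call; return True if the window repeats a cycle."""
        self.recent.append(tool_name)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough history yet
        seq = list(self.recent)
        for cycle in range(1, self.max_cycle + 1):
            if all(seq[i] == seq[i % cycle] for i in range(len(seq))):
                return True
        return False

mon = LoopMonitor()
for tool in ["search", "read"] * 4:  # a two-step cycle, repeated
    if mon.push(tool):
        print("loop detected: interrupt or replan")
```

Because the monitor only sees tool names, it can run as a passive sidecar with no access to prompts or business data, which is the deployment model described later in this report.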
Finding 3 — Cross-model signal is consistent
| Model Family | Framework | Cohen's d | AUC | Signal |
|---|---|---|---|---|
| GPT-4o | SWE-agent | +1.099 | 0.757 | EXCELLENT |
| GPT-4o | OpenHands | +1.131 | 0.852 | EXCELLENT |
| Llama 3.x | SWE-agent | +0.968 | 0.747 | EXCELLENT |
| Gemini 3 Flash | mini-SWE | +0.804 | 0.722 | EXCELLENT |
| Claude 3.7 | SWE-agent | +0.775 | 0.650 | GOOD |
| Claude 3.5 | SWE-agent | +0.175 | 0.554 | WEAK |
The GPT-4o / OpenHands result (d=+1.131, AUC=0.852) is particularly notable — it shows that the behavioral signal is framework-agnostic. The same motor, with no retraining, works across different agent execution environments.
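The AUC column can be read as the probability that a randomly chosen failed session shows more waste than a randomly chosen resolved one. A minimal pairwise sketch (the sample values are illustrative, not the report's data):

```python
# AUC as a pairwise probability: P(failed session's waste score
# exceeds a resolved session's), with ties counted as half.
# Sample waste fractions below are illustrative only.
def auc(pos: list, neg: list) -> float:
    """Probability a positive outranks a negative; ties count 0.5."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

failed_waste = [0.20, 0.15, 0.12]    # waste fractions, failed sessions
resolved_waste = [0.03, 0.13, 0.05]  # waste fractions, resolved sessions
score = auc(failed_waste, resolved_waste)
```

An AUC of 0.5 means the waste signal carries no ranking information; 1.0 means it separates failures from successes perfectly.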
Finding 4 — The $95K/year enterprise cost
For an enterprise running 10,000 agent sessions per day, the measured waste translates directly to compute costs:
- Failed sessions: ~78% of all sessions (based on SWE-bench resolution rates)
- Average waste per failed session: 13.83% of total compute
- At $0.03/session compute cost (conservative LLM API estimate)
- Daily waste cost: ~$261
- Annual waste: ~$95,000
This is a conservative estimate. It only counts token/compute cost. It excludes engineering time spent debugging loops, customer SLA breaches from stuck sessions, and infrastructure costs from processes that don't terminate cleanly.
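The cost model generalizes to any deployment. The sketch below makes the inputs explicit; every value shown is a placeholder to be replaced with your own session volume, failure rate, and billing data, not the report's exact inputs.

```python
# Back-of-envelope model for daily spend on wasted agent compute.
# All parameter values below are placeholder assumptions.
def daily_waste_cost(sessions_per_day: int,
                     failure_rate: float,
                     waste_failed: float,
                     waste_resolved: float,
                     cost_per_session: float) -> float:
    """Expected daily spend on wasted compute (token cost only)."""
    failed = sessions_per_day * failure_rate
    resolved = sessions_per_day * (1 - failure_rate)
    wasted = failed * waste_failed + resolved * waste_resolved
    return wasted * cost_per_session

# Placeholder inputs: substitute your own telemetry and pricing
daily = daily_waste_cost(1_000, 0.5, 0.10, 0.02, 0.50)
annual = daily * 365
```

As in the report's estimate, this counts only compute cost and excludes engineering time, SLA penalties, and infrastructure overhead.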
Enterprise Validation — Real-World Tasks
Beyond code-focused benchmarks, we validated the CAUM motor on the hkust-nlp/Toolathlon-Trajectories dataset — 6,818 sessions of real-world agentic tasks: WooCommerce order management, railway ticketing, form filling, API orchestration.
Signal was confirmed in 10 out of 22 models tested (d > 0.20). Best performer: Gemini 2.5 Flash at d=+0.527. The signal is weaker on enterprise tasks than coding tasks — consistent with the hypothesis that enterprise agents have more diverse tool vocabularies, making loop patterns harder to detect without domain-specific calibration.
Reproducibility
All analysis code and datasets used in this report are publicly available:
- Primary dataset: nebius/SWE-agent-trajectories on HuggingFace
- Toolathlon dataset: hkust-nlp/Toolathlon-Trajectories on HuggingFace
- Motor version: CAUM v10.31.0 — available via API at caum.systems/upload
- All session certificates are Ed25519 cryptographically signed for auditability
How CAUM Detects Waste in Real Time
CAUM integrates as a passive observer alongside any agent framework. It receives each tool call and result as the agent executes, computes a behavioral regime classification, and emits a running health score (UDS 0–1). No prompts, payloads, or business logic are read.
```python
# One-time setup — works with any agent framework
from caum import ZeroTrustAuditor

aud = ZeroTrustAuditor(model_hint="gpt4o")

# Per-step — call this after every tool execution
for tool, result in agent.steps():
    verdict = aud.push(tool, result)
    if verdict["regime"] == "LOOP":
        alert_team("Agent stuck in loop", verdict)

# End of session — get full audit certificate
cert = aud.finalize()  # cert["uds"] is the health score (0–1)
```
Analyze Your Agent's Trajectories
Upload a trajectory file and get a full 10-page forensic PDF report in under 3 minutes. Cryptographically signed. No prompts read.
Questions? contact@caum.systems