Executive Summary
Autonomous AI agents — systems that plan, execute tools, and iterate toward a goal — have rapidly moved from research into production. Engineering teams now run thousands of agent sessions per day. Many runtime problems are silent: the agent doesn't crash; it spins, repeating the same actions in a loop until it exhausts its budget.
Key finding: Unresolved sessions in this public research dataset showed 3.4× more loop/stagnation exposure than resolved ones. 98.7% of that exposure came from repeated loop patterns.
This report presents a large-scale, cross-model, cross-framework measurement of AI agent structural loop/stagnation exposure. All data comes from publicly available trajectory datasets on HuggingFace. All analysis is reproducible.
Dataset & Methodology
We analyzed four trajectory datasets totaling 99,167 sessions; the primary sources are listed in the Reproducibility section below.
Detection Method
Structural exposure is measured using the CAUM motor v10.31 — a research engine that compares structural step patterns and classifies each step into one of four behavioral regimes:
- EXPLORER — varied approaches, trying new strategies. Healthy.
- GRIND — slow progress, limited variety. Marginal.
- STAGNATION — no forward progress. Problematic.
- LOOP — repeating the same action patterns cyclically. Review evidence.
Steps classified as LOOP or STAGNATION are treated as structural review evidence, not semantic truth. The engine reads only tool names and structural metadata: zero prompt data, zero business logic.
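To make the mechanism concrete, the sketch below shows what a classifier of this style could look like. It is written for this report: the window size, thresholds, and function names are illustrative assumptions, not the CAUM engine's actual logic.

```python
# Illustrative sketch only: window size, thresholds, and names are
# assumptions for exposition, not CAUM's real parameters.
WINDOW = 8        # recent steps to inspect
MIN_REPEATS = 3   # cycle repetitions required to flag a LOOP

def _is_periodic(seq: list[str], period: int) -> bool:
    """True if seq repeats with the given period (A B A B -> period 2)."""
    return all(seq[i] == seq[i - period] for i in range(period, len(seq)))

def classify_step(tool_history: list[str]) -> str:
    """Classify the latest step from tool names alone; no prompt content."""
    recent = tool_history[-WINDOW:]
    if len(recent) < WINDOW:
        return "EXPLORER"  # too little history to judge

    # LOOP: the window is a short cycle repeated MIN_REPEATS or more times
    for period in range(1, WINDOW // MIN_REPEATS + 1):
        if _is_periodic(recent, period):
            return "LOOP"

    variety = len(set(recent)) / len(recent)
    if variety <= 0.25:
        return "STAGNATION"  # near-zero variation without clean cyclic structure
    if variety <= 0.50:
        return "GRIND"       # limited variety, slow progress
    return "EXPLORER"        # varied approaches
```

Even this toy version sees only the tool-name sequence, which is the property the findings below depend on.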
Core Findings
Finding 1 - Unresolved sessions show more loop/stagnation exposure
| Session Type | Sessions | Avg Structural Exposure | Cohen's d |
|---|---|---|---|
| Resolved (successful) | 19,591 | 4.04% | — |
| Unresolved | 79,576 | 13.83% | — |
| Difference | — | 3.4× higher in unresolved sessions | +0.548 |
A Cohen's d of +0.548 indicates a medium-sized, dataset-level association between structural exposure and unresolved sessions. It is research evidence, not a claim that CAUM predicts failures in production.
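For readers reproducing the table, the effect size here is the standard pooled-variance Cohen's d. A minimal sketch, assuming per-session exposure fractions are available as NumPy arrays:

```python
import numpy as np

def cohens_d(unresolved: np.ndarray, resolved: np.ndarray) -> float:
    """Pooled-variance Cohen's d; positive when unresolved sessions
    show higher mean structural exposure."""
    n1, n2 = len(unresolved), len(resolved)
    v1, v2 = unresolved.var(ddof=1), resolved.var(ddof=1)
    pooled_sd = np.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (unresolved.mean() - resolved.mean()) / pooled_sd
```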
Finding 2 - Loop evidence dominates the structural exposure bucket
Of the loop/stagnation exposure counted in this dataset, 98.7% was classified as LOOP: repeated structural action patterns. Only 1.3% was STAGNATION. This is useful for system designers because loops can be observed without reading private content.
Why this matters: Loops are detectable. Because CAUM monitors behavioral structure rather than semantic content, it can identify loops in real time — while the agent is running — without reading a single character of prompt or payload data.
Finding 3 - Cross-model signal is broadly consistent in research data
| Model Family | Framework | Cohen's d | AUC | Signal |
|---|---|---|---|---|
| GPT-4o | SWE-agent | +1.099 | 0.757 | EXCELLENT |
| GPT-4o | OpenHands | +1.131 | 0.852 | EXCELLENT |
| Llama 3.x | SWE-agent | +0.968 | 0.747 | EXCELLENT |
| Gemini 3 Flash | mini-SWE | +0.804 | 0.722 | EXCELLENT |
| Claude 3.7 | SWE-agent | +0.775 | 0.650 | GOOD |
| Claude 3.5 | SWE-agent | +0.175 | 0.554 | WEAK |
The GPT-4o / OpenHands result (d=+1.131, AUC=0.852) is notable research evidence that similar structural signals can appear across agent execution environments. The weak Claude 3.5 result (d=+0.175) shows the signal is not uniform across model families, and production claims still require environment-specific validation.
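The AUC column can be checked the same way. A minimal sketch, assuming scikit-learn and per-session labels and exposure scores:

```python
from sklearn.metrics import roc_auc_score

def exposure_auc(labels, scores) -> float:
    """AUC of structural exposure as a ranking signal for unresolved sessions.

    labels: 1 if the session was unresolved, 0 if resolved.
    scores: per-session loop/stagnation exposure fraction.
    """
    return roc_auc_score(labels, scores)
```

An AUC of 0.852 means a randomly chosen unresolved session outranks a randomly chosen resolved one about 85% of the time.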
Finding 4 - Example enterprise budget exposure
For commercial planning, structural loop/stagnation exposure should be translated into a percentage of monthly agent spend rather than tiny per-session dollar values:
- Example monthly agent/API spend: $10,000
- Conservative reviewable exposure scenario: 5% of spend
- Monthly reviewable exposure: $500
- Annualized reviewable exposure: $6,000
- Realized savings require customer-owned controls and before/after validation
This is scenario math, not a savings claim. CAUM can report reviewable structural exposure and help teams identify where retries, loops, and stagnation accumulate. The customer must prove any realized savings with their own baseline.
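For teams plugging in their own numbers, the scenario reduces to two multiplications; the 5% rate below is the report's illustrative assumption, not a measured value:

```python
def reviewable_exposure(monthly_spend: float, exposure_rate: float = 0.05) -> dict:
    """Translate an assumed exposure rate into dollar figures.
    The 0.05 default mirrors the report's conservative scenario."""
    monthly = monthly_spend * exposure_rate
    return {"monthly": monthly, "annualized": monthly * 12}

# The report's example: $10,000/month at a 5% scenario rate
print(reviewable_exposure(10_000))  # {'monthly': 500.0, 'annualized': 6000.0}
```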
Enterprise Validation — Real-World Tasks
Beyond code-focused benchmarks, we validated the CAUM motor on the hkust-nlp/Toolathlon-Trajectories dataset — 6,818 sessions of real-world agentic tasks: WooCommerce order management, railway ticketing, form filling, API orchestration.
Signal was confirmed in 10 out of 22 models tested (d > 0.20). Best performer: Gemini 2.5 Flash at d=+0.527. The signal is weaker on enterprise tasks than coding tasks — consistent with the hypothesis that enterprise agents have more diverse tool vocabularies, making loop patterns harder to detect without domain-specific calibration.
Reproducibility
All analysis code and datasets used in this report are publicly available:
- Primary dataset: nebius/SWE-agent-trajectories on HuggingFace
- Toolathlon dataset: hkust-nlp/Toolathlon-Trajectories on HuggingFace
- Motor version: CAUM v10.31.0 — available via API at caum.systems/upload
- Session receipts include hash commitments and claim-boundary metadata for audit review
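A minimal starting point, assuming the HuggingFace datasets library can load the trajectories directly; split and column names should be checked against each dataset card:

```python
from datasets import load_dataset

# Load the primary dataset referenced above; inspect the features
# before relying on any particular column layout.
ds = load_dataset("nebius/SWE-agent-trajectories", split="train")
print(ds)
```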
How CAUM Observes Structural Exposure in Real Time
CAUM integrates as a passive observer alongside any agent framework. It receives structural event metadata as the agent executes, computes a structural health tier, and emits running review evidence. No prompts, payloads, or business logic are read.
```python
# One-time setup — works with any agent framework
from caum import ZeroTrustAuditor

aud = ZeroTrustAuditor(model_hint="gpt4o")

# Per-step — call this after every tool execution
for tool, result in agent.steps():
    signal = aud.push(tool, result)
    if signal["regime"] == "LOOP":
        alert_team("Agent stuck in loop", signal)

# End of session — get structural observation receipt
receipt = aud.finalize()
# receipt includes tier, evidence grade, and hash metadata
```
Analyze Your Agent's Trajectories
Upload a trajectory file and get an observation-only structural PDF report with cryptographic receipts. No prompts read.
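A hypothetical upload call is sketched below; the real endpoint shape, field names, and authentication should be taken from the caum.systems documentation:

```python
import requests

# Hypothetical request shape; confirm against caum.systems before use.
with open("trajectory.jsonl", "rb") as f:
    resp = requests.post("https://caum.systems/upload", files={"file": f})
print(resp.status_code)
```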
Questions? contact@caum.systems