Research Report · March 2026

State of AI Agent
Structural Exposure 2026

A reproducible study of loop and stagnation exposure in autonomous AI agent execution. 99,167 public sessions. 2,615,295 public steps.

Dataset: HuggingFace public trajectories
Models: GPT-4o, Claude, Llama, Gemini
Published: March 24, 2026
By: CAUM Systems Research
99,167 real sessions analyzed
2.6M agent steps measured
3.4× more exposure in unresolved sessions
98.7% of exposure represented by loops
$95K example exposed cost per 10K daily sessions

Executive Summary

Autonomous AI agents — systems that plan, execute tools, and iterate toward a goal — have rapidly moved from research into production. Engineering teams now run thousands of agent sessions per day. Many runtime problems are silent: the agent doesn't crash; it just spins, repeating the same actions in a loop until it exhausts its budget.

Key finding: Unresolved sessions in this public research dataset showed 3.4× more loop/stagnation exposure than resolved ones. 98.7% of that exposure came from repeated loop patterns.

This report presents a large-scale, cross-model, cross-framework measurement of AI agent structural loop/stagnation exposure. All data comes from publicly available trajectory datasets on HuggingFace. All analysis is reproducible.

Dataset & Methodology

We analyzed four trajectory datasets, totaling 99,167 sessions:

nebius/SWE-agent-trajectories: 79,773 sessions. Multi-model. Real GitHub issues as tasks. The primary dataset.
Claude trajectories: 8,588 sessions. Claude 3.5/3.7 on SWE-bench tasks via the SWE-agent framework.
GPT-4o trajectories: 5,816 sessions. GPT-4o on SWE-bench. Used for cross-model validation.
Llama trajectories: 4,990 sessions. Llama 3.x on balanced benchmark tasks.
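
The primary dataset can be pulled directly with the Hugging Face datasets library. A minimal sketch, assuming the standard datasets API and a default "train" split (verify the split and column names for each dataset before use):

# Minimal sketch: load the primary trajectory dataset from HuggingFace.
# Assumes the standard `datasets` library; split and column names may differ.
from datasets import load_dataset

ds = load_dataset("nebius/SWE-agent-trajectories", split="train")
print(ds.column_names)  # inspect the trajectory schema before analysis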

Detection Method

Structural exposure is measured using the CAUM motor v10.31, a research engine that compares structural step patterns and classifies each step into one of four behavioral regimes. Steps classified as LOOP or STAGNATION are treated as structural review evidence, not semantic truth. The engine reads only tool names and structural metadata: zero prompt data, zero business logic.
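
To illustrate the kind of signal involved (a toy sketch, not the CAUM engine), a LOOP regime can be flagged from tool names alone by checking whether the tail of the action sequence is one short pattern repeated back to back. The window size and repeat threshold below are hypothetical:

# Toy sketch only, NOT the CAUM engine: flag a LOOP regime when the same
# n-gram of tool names repeats min_repeats times at the end of the history.
def is_loop(tool_history, max_pattern=4, min_repeats=3):
    for n in range(1, max_pattern + 1):
        window = n * min_repeats
        if len(tool_history) < window:
            break  # longer patterns need even more history
        tail = tool_history[-window:]
        pattern = tail[:n]
        if all(tail[i:i + n] == pattern for i in range(0, window, n)):
            return True
    return False

history = []
for tool in ["grep", "edit", "test", "edit", "test", "edit", "test"]:
    history.append(tool)
    if is_loop(history):
        print(f"LOOP suspected at step {len(history)}: {history[-6:]}")

Nothing in this check touches prompt text or tool arguments, which is the property the findings below rely on.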

Core Findings

Finding 1 - Unresolved sessions show more loop/stagnation exposure

Session Type            Sessions   Avg Structural Exposure      Cohen's d
Resolved (successful)   19,591     4.04%
Unresolved              79,576     13.83%
Difference                         3.4× higher in unresolved    +0.548

A Cohen's d of +0.548 indicates a dataset-level association between structural exposure and unresolved sessions. It is research evidence, not a production claim that CAUM predicts all failures.
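
For reference, the Cohen's d reported here is the standardized mean difference in per-session exposure between the two groups, using a pooled standard deviation. A minimal sketch (the synthetic data at the end is illustrative only):

# Minimal sketch of Cohen's d: standardized mean difference in per-session
# structural exposure between unresolved and resolved sessions.
import numpy as np

def cohens_d(unresolved, resolved):
    n1, n2 = len(unresolved), len(resolved)
    v1, v2 = np.var(unresolved, ddof=1), np.var(resolved, ddof=1)
    pooled_sd = np.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (np.mean(unresolved) - np.mean(resolved)) / pooled_sd

# Synthetic illustration: a 0.10 mean gap at 0.20 spread gives d near 0.5
rng = np.random.default_rng(0)
print(cohens_d(rng.normal(0.14, 0.2, 5000), rng.normal(0.04, 0.2, 5000)))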

Finding 2 - Loop evidence dominates the structural exposure bucket

Of the loop/stagnation exposure counted in this dataset, 98.7% was classified as LOOP: repeated structural action patterns. Only 1.3% was STAGNATION. This is useful for system designers because loops can be observed without reading private content.

Why this matters: Loops are detectable. Because CAUM monitors behavioral structure rather than semantic content, it can identify loops in real time — while the agent is running — without reading a single character of prompt or payload data.

Finding 3 - Cross-model signal is consistent in research data

Model Family     Framework   Cohen's d   AUC     Signal
GPT-4o           SWE-agent   +1.099      0.757   EXCELLENT
GPT-4o           OpenHands   +1.131      0.852   EXCELLENT
Llama 3.x        SWE-agent   +0.968      0.747   EXCELLENT
Gemini 3 Flash   mini-SWE    +0.804      0.722   EXCELLENT
Claude 3.7       SWE-agent   +0.775      0.650   GOOD
Claude 3.5       SWE-agent   +0.175      0.554   WEAK

The GPT-4o / OpenHands result (d=+1.131, AUC=0.852) is notable research evidence that similar structural signals can appear across agent execution environments. Production claims still require environment-specific validation.
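
The AUC column treats per-session structural exposure as a score for ranking sessions by their odds of going unresolved. A minimal sketch of that computation, assuming scikit-learn and hypothetical toy arrays:

# Minimal sketch: AUC of per-session exposure as a predictor of unresolved
# outcomes. Assumes scikit-learn; `exposure` and `unresolved` are toy data.
from sklearn.metrics import roc_auc_score

exposure = [0.02, 0.31, 0.05, 0.18, 0.44, 0.01]  # LOOP/STAGNATION fraction
unresolved = [0, 1, 0, 1, 1, 0]                  # 1 = task not resolved

print(roc_auc_score(unresolved, exposure))  # 1.0 on this separable toy data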

Finding 4 - Example enterprise budget exposure

For commercial planning, structural loop/stagnation exposure should be translated into a percentage of monthly agent spend rather than into tiny per-session dollar values.
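
A minimal sketch of that translation, using the 13.83% exposure figure from Finding 1; the per-session cost is a hypothetical assumption chosen to reproduce the $95K headline, not a measured value:

# Scenario math sketch, not a savings claim. The exposure rate and session
# volume come from this report; cost_per_session is a HYPOTHETICAL input.
sessions_per_day = 10_000
days_per_month = 30
cost_per_session = 2.29   # USD, hypothetical average agent spend per session
exposure_rate = 0.1383    # avg structural exposure in unresolved sessions

monthly_spend = sessions_per_day * days_per_month * cost_per_session
exposed = monthly_spend * exposure_rate
print(f"Monthly spend ${monthly_spend:,.0f}, exposed ${exposed:,.0f}")
# -> Monthly spend $687,000, exposed ~$95,012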

This is scenario math, not a savings claim. CAUM can report reviewable structural exposure and help teams identify where retries, loops, and stagnation accumulate. The customer must prove any realized savings with their own baseline.

Enterprise Validation — Real-World Tasks

Beyond code-focused benchmarks, we validated the CAUM motor on the hkust-nlp/Toolathlon-Trajectories dataset — 6,818 sessions of real-world agentic tasks: WooCommerce order management, railway ticketing, form filling, API orchestration.

Signal was confirmed in 10 out of 22 models tested (d > 0.20). Best performer: Gemini 2.5 Flash at d=+0.527. The signal is weaker on enterprise tasks than coding tasks — consistent with the hypothesis that enterprise agents have more diverse tool vocabularies, making loop patterns harder to detect without domain-specific calibration.

Reproducibility

All analysis code and datasets used in this report are publicly available.

How CAUM Observes Structural Exposure in Real Time

CAUM integrates as a passive observer alongside any agent framework. It receives structural event metadata as the agent executes, computes a structural health tier, and emits running review evidence. No prompts, payloads, or business logic are read.

# One-time setup — works with any agent framework
from caum import ZeroTrustAuditor

aud = ZeroTrustAuditor(model_hint="gpt4o")

# Per-step — call this after every tool execution
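# (`agent` and `alert_team` here stand in for your application's own
#  step loop and alerting hook)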
for tool, result in agent.steps():
    signal = aud.push(tool, result)
    if signal["regime"] == "LOOP":
        alert_team("Agent stuck in loop", signal)

# End of session — get structural observation receipt
receipt = aud.finalize()  # receipt includes tier, evidence grade, and hash metadata

Analyze Your Agent's Trajectories

Upload a trajectory file and get an observation-only structural PDF report with cryptographic receipts. No prompts read.

Questions? contact@caum.systems