Executive Summary
Autonomous AI agents — systems that plan, execute tools, and iterate toward a goal — have rapidly moved from research into production. Engineering teams now run thousands of agent sessions per day. Many runtime problems are silent: the agent doesn't crash; it spins, repeating the same actions in a loop until it exhausts its budget.
Key finding: Unresolved sessions in this public research dataset showed 3.4× more loop/stagnation exposure than resolved ones. 98.7% of that exposure came from repeated loop patterns.
This report presents a large-scale, cross-model, cross-framework measurement of AI agent structural loop/stagnation exposure. All data comes from publicly available trajectory datasets on HuggingFace. All analysis is reproducible.
Dataset & Methodology
We analyzed four trajectory datasets totaling 99,167 sessions; the primary sources are listed in the Reproducibility section below.
Detection Method
Structural exposure is measured using the CAUM motor v10.31 — a research engine that compares structural step patterns and classifies each step into one of four behavioral regimes:
- EXPLORER — varied approaches, trying new strategies. Healthy.
- GRIND — slow progress, limited variety. Marginal.
- STAGNATION — no forward progress. Problematic.
- LOOP — repeating the same action patterns cyclically. Review evidence.
Steps classified as LOOP or STAGNATION are treated as structural review evidence, not semantic truth. The engine reads only tool names and structural metadata: zero prompt data, zero business logic.
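To make the mechanism concrete, the sketch below shows what a classifier of this style could look like. It is written for this report: the window size, thresholds, and function names are illustrative assumptions, not the CAUM engine's actual logic.

```python
# Illustrative sketch only: window size, thresholds, and names are
# assumptions for exposition, not CAUM's real parameters.
WINDOW = 8        # recent steps to inspect
MIN_REPEATS = 3   # cycle repetitions required to flag a LOOP

def _is_periodic(seq: list[str], period: int) -> bool:
    """True if seq repeats with the given period (A B A B -> period 2)."""
    return all(seq[i] == seq[i - period] for i in range(period, len(seq)))

def classify_step(tool_history: list[str]) -> str:
    """Classify the latest step from tool names alone; no prompt content."""
    recent = tool_history[-WINDOW:]
    if len(recent) < WINDOW:
        return "EXPLORER"  # too little history to judge

    # LOOP: the window is a short cycle repeated MIN_REPEATS or more times
    for period in range(1, WINDOW // MIN_REPEATS + 1):
        if _is_periodic(recent, period):
            return "LOOP"

    variety = len(set(recent)) / len(recent)
    if variety <= 0.25:
        return "STAGNATION"  # near-zero variation without clean cyclic structure
    if variety <= 0.50:
        return "GRIND"       # limited variety, slow progress
    return "EXPLORER"        # varied approaches
```

Even this toy version sees only the tool-name sequence, which is the property the findings below depend on.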
Core Findings
Finding 1 - Unresolved sessions show more loop/stagnation exposure
| Session Type | Sessions | Avg Structural Exposure | Cohen's d |
|---|---|---|---|
| Resolved (successful) | 19,591 | 4.04% | — |
| Unresolved | 79,576 | 13.83% | — |
| Difference | — | 3.4× higher in unresolved sessions | +0.548 |
A Cohen's d of +0.548 indicates a medium-sized, dataset-level association between structural exposure and unresolved sessions. It is research evidence, not a claim that CAUM predicts failures in production.
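For readers reproducing the table, the effect size here is the standard pooled-variance Cohen's d. A minimal sketch, assuming per-session exposure fractions are available as NumPy arrays:

```python
import numpy as np

def cohens_d(unresolved: np.ndarray, resolved: np.ndarray) -> float:
    """Pooled-variance Cohen's d; positive when unresolved sessions
    show higher mean structural exposure."""
    n1, n2 = len(unresolved), len(resolved)
    v1, v2 = unresolved.var(ddof=1), resolved.var(ddof=1)
    pooled_sd = np.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (unresolved.mean() - resolved.mean()) / pooled_sd
```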
Finding 2 - Loop evidence dominates the structural exposure bucket
Of the loop/stagnation exposure counted in this dataset, 98.7% was classified as LOOP: repeated structural action patterns. Only 1.3% was STAGNATION. This is useful for system designers because loops can be observed without reading private content.
Why this matters: Loops are detectable. Because CAUM monitors behavioral structure rather than semantic content, it can identify loops in real time — while the agent is running — without reading a single character of prompt or payload data.
Finding 3 - Cross-model signal is broadly consistent in research data
| Model Family | Framework | Cohen's d | AUC | Signal |
|---|---|---|---|---|
| GPT-4o | SWE-agent | +1.099 | 0.757 | EXCELLENT |
| GPT-4o | OpenHands | +1.131 | 0.852 | EXCELLENT |
| Llama 3.x | SWE-agent | +0.968 | 0.747 | EXCELLENT |
| Gemini 3 Flash | mini-SWE | +0.804 | 0.722 | EXCELLENT |
| Claude 3.7 | SWE-agent | +0.775 | 0.650 | GOOD |
| Claude 3.5 | SWE-agent | +0.175 | 0.554 | WEAK |
The GPT-4o / OpenHands result (d=+1.131, AUC=0.852) is notable research evidence that similar structural signals can appear across agent execution environments. The weak Claude 3.5 result (d=+0.175) shows the signal is not uniform across model families, and production claims still require environment-specific validation.
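The AUC column can be checked the same way. A minimal sketch, assuming scikit-learn and per-session labels and exposure scores:

```python
from sklearn.metrics import roc_auc_score

def exposure_auc(labels, scores) -> float:
    """AUC of structural exposure as a ranking signal for unresolved sessions.

    labels: 1 if the session was unresolved, 0 if resolved.
    scores: per-session loop/stagnation exposure fraction.
    """
    return roc_auc_score(labels, scores)
```

An AUC of 0.852 means a randomly chosen unresolved session outranks a randomly chosen resolved one about 85% of the time.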
Finding 4 - Example enterprise budget exposure
For commercial planning, structural loop/stagnation exposure should be translated into a percentage of monthly agent spend rather than tiny per-session dollar values:
- Example monthly agent/API spend: $10,000
- Conservative reviewable exposure scenario: 5% of spend
- Monthly reviewable exposure: $500
- Annualized reviewable exposure: $6,000
- Realized savings require customer-owned controls and before/after validation
This is scenario math, not a savings claim. CAUM can report reviewable structural exposure and help teams identify where retries, loops, and stagnation accumulate. The customer must prove any realized savings with their own baseline.
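For teams plugging in their own numbers, the scenario reduces to two multiplications; the 5% rate below is the report's illustrative assumption, not a measured value:

```python
def reviewable_exposure(monthly_spend: float, exposure_rate: float = 0.05) -> dict:
    """Translate an assumed exposure rate into dollar figures.
    The 0.05 default mirrors the report's conservative scenario."""
    monthly = monthly_spend * exposure_rate
    return {"monthly": monthly, "annualized": monthly * 12}

# The report's example: $10,000/month at a 5% scenario rate
print(reviewable_exposure(10_000))  # {'monthly': 500.0, 'annualized': 6000.0}
```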
Enterprise Validation — Real-World Tasks
Beyond code-focused benchmarks, we validated the CAUM motor on the hkust-nlp/Toolathlon-Trajectories dataset — 6,818 sessions of real-world agentic tasks: WooCommerce order management, railway ticketing, form filling, API orchestration.
Signal was confirmed in 10 out of 22 models tested (d > 0.20). Best performer: Gemini 2.5 Flash at d=+0.527. The signal is weaker on enterprise tasks than coding tasks — consistent with the hypothesis that enterprise agents have more diverse tool vocabularies, making loop patterns harder to detect without domain-specific calibration.
Reproducibility
All analysis code and datasets used in this report are publicly available:
- Primary dataset: nebius/SWE-agent-trajectories on HuggingFace
- Toolathlon dataset: hkust-nlp/Toolathlon-Trajectories on HuggingFace
- Motor version: CAUM v10.31.0 — available via API at caum.systems/upload
- Session receipts include hash commitments and claim-boundary metadata for audit review
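A minimal starting point, assuming the HuggingFace datasets library can load the trajectories directly; split and column names should be checked against each dataset card:

```python
from datasets import load_dataset

# Load the primary dataset referenced above; inspect the features
# before relying on any particular column layout.
ds = load_dataset("nebius/SWE-agent-trajectories", split="train")
print(ds)
```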
How CAUM Observes Structural Exposure in Real Time
CAUM integrates as a passive observer alongside any agent framework. It receives structural event metadata as the agent executes, computes a structural health tier, and emits running review evidence. No prompts, payloads, or business logic are read.
```python
# One-time setup — works with any agent framework
from caum import ZeroTrustAuditor

aud = ZeroTrustAuditor(model_hint="gpt4o")

# Per-step — call this after every tool execution
for tool, result in agent.steps():
    signal = aud.push(tool, result)
    if signal["regime"] == "LOOP":
        alert_team("Agent stuck in loop", signal)

# End of session — get structural observation receipt
receipt = aud.finalize()
# receipt includes tier, evidence grade, and hash metadata
```
Analyze Your Agent's Trajectories
Upload a trajectory file and get an observation-only structural PDF report with cryptographic receipts. No prompts read.
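A hypothetical upload call is sketched below; the real endpoint shape, field names, and authentication should be taken from the caum.systems documentation:

```python
import requests

# Hypothetical request shape; confirm against caum.systems before use.
with open("trajectory.jsonl", "rb") as f:
    resp = requests.post("https://caum.systems/upload", files={"file": f})
print(resp.status_code)
```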
Questions? contact@caum.systems