Why SWE-agent sessions need behavioral monitoring

SWE-agent runs autonomous software engineering sessions that can span hundreds of tool calls. When a session fails, it typically doesn't crash — it loops. The agent keeps trying the same file edits, the same searches, the same commands in a cycle until it hits the step budget.

Key finding from 99,167 real SWE-bench sessions: failed sessions waste 3.4× as much compute as successful ones (13.83% vs 4.04% of steps), and 98.7% of that waste is behavioral loops.

At 1,000 sessions/day with $0.03/session average cost, that's ~$28,000/year in wasted API spend — just from loops. Use the calculator for your exact numbers.

Validated results on SWE-agent trajectories

Cohen's d (GPT-4o + SWE-agent): +1.099
AUC (GPT-4o discrimination): 0.757
Cohen's d (Claude 3.7): +0.775
Cohen's d (Llama): +0.968

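A Cohen's d of +1.099 is a large effect by the conventional threshold of 0.8: the waste fraction of failed sessions sits more than one pooled standard deviation above that of resolved sessions. The effect stays between +0.775 and +1.099 across all model families tested, so waste is a consistent signal of session failure, not noise.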
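For readers who want to run the same kind of check on their own trajectories, the two statistics above can be computed with the standard library alone. This is a minimal sketch, not CAUM's evaluation code: the function names and the toy waste fractions are invented for illustration, and it assumes you already have one waste fraction per session, split into failed and resolved groups.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(failed, resolved):
    """Mean difference divided by the pooled sample standard deviation."""
    n1, n2 = len(failed), len(resolved)
    pooled_var = ((n1 - 1) * variance(failed) + (n2 - 1) * variance(resolved)) / (n1 + n2 - 2)
    return (mean(failed) - mean(resolved)) / sqrt(pooled_var)


def auc(failed, resolved):
    """Chance that a random failed session shows more waste than a random
    resolved one, counting ties as half (the Mann-Whitney formulation)."""
    wins = 0.0
    for f in failed:
        for r in resolved:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(failed) * len(resolved))


# Toy per-session waste fractions, invented for illustration only.
failed_waste = [0.20, 0.10, 0.15, 0.05]
resolved_waste = [0.05, 0.02, 0.08, 0.01]

print(f"d = {cohens_d(failed_waste, resolved_waste):.3f}")
print(f"AUC = {auc(failed_waste, resolved_waste):.3f}")
```

The pairwise AUC loop is quadratic in the number of sessions, which is fine for a spot check; a rank-based version would be the better choice at the scale of the dataset quoted above.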

Integration — 3 lines

CAUM attaches alongside SWE-agent as a passive observer. It receives each tool call and result, classifies the behavioral regime, and emits a running health score. The agent's execution is never modified.

# Works with any SWE-agent version
from caum import ZeroTrustAuditor

aud = ZeroTrustAuditor(model_hint="gpt4o", framework_hint="swe-agent")

# `swe_agent` and `issue` stand in for your own agent loop and task;
# the loop is expected to yield one (tool, result) pair per tool execution.
for tool, result in swe_agent.run(issue):
    verdict = aud.push(tool, result)  # call after every tool execution
    print(verdict["regime"])  # EXPLORER / GRIND / STAGNATION / LOOP

cert = aud.finalize()  # UDS score 0–1, tier T1–T5, Ed25519 signed
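Because CAUM only observes, any reaction to a verdict is left to the caller. One possible policy is to stop a session after several consecutive LOOP verdicts. The sketch below shows that idea and nothing more: `ConsecutiveRegimeGuard` and its `limit` parameter are hypothetical helpers, not part of the caum package, and the only thing assumed about a verdict is the `"regime"` key used in the snippet above.

```python
class ConsecutiveRegimeGuard:
    """Trips after `limit` consecutive verdicts in the watched regime."""

    def __init__(self, regime="LOOP", limit=5):
        self.regime = regime
        self.limit = limit
        self.streak = 0

    def update(self, verdict):
        """Feed one verdict dict; return True once the streak reaches the limit."""
        if verdict.get("regime") == self.regime:
            self.streak += 1
        else:
            self.streak = 0  # any other regime resets the count
        return self.streak >= self.limit


# Stand-in verdicts; in practice these would come from aud.push(tool, result).
guard = ConsecutiveRegimeGuard(limit=3)
verdicts = [{"regime": r} for r in
            ["EXPLORER", "LOOP", "LOOP", "GRIND", "LOOP", "LOOP", "LOOP"]]
tripped_at = next((i for i, v in enumerate(verdicts) if guard.update(v)), None)
print("guard tripped at step index:", tripped_at)
```

Inside a real agent loop, the caller would check `guard.update(verdict)` after each push and break out of the loop when it returns True. A streak-based rule is deliberately conservative: a single LOOP verdict surrounded by healthy steps never stops the session.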

What CAUM detects

Each step is classified into one of four behavioral regimes based on semantic similarity to recent steps:

EXPLORER — diverse approaches, trying new strategies. Healthy. Sessions usually resolve.
GRIND — slow progress, limited variety. Marginal. May resolve but inefficiently.
STAGNATION — no forward progress. Bad. Session unlikely to resolve.
LOOP — repeating same actions cyclically. Waste. 98.7% of wasted compute in our dataset.
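To make the four regimes concrete, here is a toy classifier that works on tool names alone. It is an illustration under loud assumptions, not CAUM's algorithm: exact tool-name matches stand in for semantic similarity, and the window size, cycle periods, and variety thresholds are all invented for this sketch.

```python
def tail_repeats(steps, max_period=3, repeats=3):
    """True if the last `repeats` blocks of some length p <= max_period are identical."""
    n = len(steps)
    for p in range(1, max_period + 1):
        if n < p * repeats:
            continue
        block = steps[-p:]
        if all(steps[n - (i + 1) * p: n - i * p] == block for i in range(repeats)):
            return True
    return False


def classify(steps, window=6):
    """Label the most recent step using only the last `window` tool names."""
    recent = steps[-window:]
    if not recent:
        return "EXPLORER"
    if tail_repeats(recent):
        return "LOOP"
    distinct = len(set(recent))
    # Hypothetical thresholds: at most 1/3 distinct tools means stagnation,
    # at most 2/3 means grind, anything more varied counts as exploration.
    if distinct * 3 <= len(recent):
        return "STAGNATION"
    if distinct * 3 <= 2 * len(recent):
        return "GRIND"
    return "EXPLORER"


print(classify(["search", "open", "edit", "test"]))
print(classify(["edit", "test", "edit", "test", "edit", "test"]))
```

Checking for a repeating tail before measuring variety matters here: an edit/test cycle uses two distinct tools, so a variety measure alone would label it as low-variety work and miss that it is a cycle.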

Privacy guarantee: CAUM reads only tool names and step structure — never code content, file contents, prompts, or business data. A CAUM integration adds zero additional data exposure.

Session certificate

Every session produces a cryptographically signed audit certificate (Ed25519) containing:

UDS — Unified Diagnostic Score (0–1), overall session health
Tier — T1 (OK) through T5 (CRITICAL)
Regime distribution — % of steps in each behavioral state
Waste % — exact fraction of compute wasted
Shields fired — specific anomaly events detected
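The regime distribution field is the simplest of these to reproduce yourself. Assuming you have collected the per-step regime labels during a session, a sketch of the calculation could look like this; `regime_distribution` is a hypothetical helper and the certificate's actual field layout is not shown here.

```python
from collections import Counter

REGIMES = ("EXPLORER", "GRIND", "STAGNATION", "LOOP")


def regime_distribution(labels):
    """Percentage of steps spent in each regime; all zeros for an empty session."""
    counts = Counter(labels)
    total = len(labels)
    return {r: (100.0 * counts[r] / total if total else 0.0) for r in REGIMES}


# Example labels, invented for illustration.
labels = ["EXPLORER"] * 6 + ["GRIND"] * 2 + ["LOOP"] * 2
for regime, pct in regime_distribution(labels).items():
    print(f"{regime:<11} {pct:5.1f}%")
```

Iterating over the fixed `REGIMES` tuple keeps regimes that never occurred in the output with a value of zero, so two sessions can be compared field by field.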

Analyze your SWE-agent trajectories now

Upload a trajectory JSONL and get a full 10-page forensic PDF report in under 3 minutes. First analysis free with code PIONEER.
