Why SWE-agent sessions need behavioral monitoring
SWE-agent runs autonomous software engineering sessions that can span hundreds of tool calls. When a session fails, it typically doesn't crash — it loops. The agent keeps trying the same file edits, the same searches, the same commands in a cycle until it hits the step budget.
Key finding from 99,167 real SWE-bench sessions: failed sessions waste 3.4× as many steps as successful ones (13.83% vs 4.04% of steps). 98.7% of that waste is behavioral loops.
At 1,000 sessions/day and an average cost of $0.03/session, API spend comes to about $10,950/year; at the 13.83% waste rate of failed sessions, roughly $1,500 of that is wasted, almost all of it on loops. Use the calculator for your exact numbers.
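For a rough version of this estimate, the sketch below multiplies yearly spend by a waste rate and by the share of waste attributed to loops. It assumes cost is proportional to step count, which is a simplification. The function name, its parameters, and the example inputs are illustrative and are not part of CAUM or the calculator.

```python
def annual_loop_waste_usd(sessions_per_day, cost_per_session_usd,
                          waste_fraction, loop_share=0.987, days_per_year=365):
    """Estimate yearly spend lost to loops, assuming cost scales with steps."""
    annual_spend = sessions_per_day * cost_per_session_usd * days_per_year
    wasted = annual_spend * waste_fraction  # spend on wasted steps of any kind
    return wasted * loop_share              # portion of that waste due to loops


# Hypothetical inputs: substitute your own volume, cost, and measured waste rate.
estimate = annual_loop_waste_usd(
    sessions_per_day=500,
    cost_per_session_usd=0.10,
    waste_fraction=0.10,
)
print(f"${estimate:,.2f} per year")
```

The default `loop_share` of 0.987 is the 98.7% figure quoted above. `waste_fraction` should come from your own sessions, since it depends heavily on how many of them fail.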
Validated results on SWE-agent trajectories
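A Cohen's d of +1.099 is a large effect (values above 0.8 are conventionally treated as large). It means that mean waste in failed and successful sessions differs by more than one pooled standard deviation, so waste is a strong signal of session failure across all model families tested, not noise.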
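For readers unfamiliar with the statistic, here is a minimal sketch of Cohen's d computed with a pooled standard deviation. The two sample lists are made up for illustration and are not the SWE-bench data. Reading the positive sign as "failed minus successful" is an assumption about how the reported value was computed.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled standard deviation (positive when group_a's mean is higher)."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Made-up per-session waste fractions, for illustration only.
failed_waste = [0.10, 0.20, 0.15, 0.25, 0.05]
resolved_waste = [0.02, 0.06, 0.04, 0.08, 0.00]

d = cohens_d(failed_waste, resolved_waste)
print(f"d = {d:.3f}")
```

Because d is expressed in units of standard deviation, it can be compared across model families whose absolute waste rates differ.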
Integration — 3 lines
CAUM attaches alongside SWE-agent as a passive observer. It receives each tool call and result, classifies the behavioral regime, and emits a running health score. The agent's execution is never modified.
    # Works with any SWE-agent version
    from caum import ZeroTrustAuditor

    aud = ZeroTrustAuditor(model_hint="gpt4o", framework_hint="swe-agent")

    # Call after every tool execution in the agent loop
    for tool, result in swe_agent.run(issue):
        verdict = aud.push(tool, result)
        print(verdict["regime"])  # EXPLORER / GRIND / STAGNATION / LOOP

    cert = aud.finalize()  # UDS score 0–1, tier T1–T5, Ed25519 signed
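The passive-observer idea can be illustrated without CAUM at all. In the sketch below, a pass-through generator reports each step to a monitor callback and then yields the step unchanged, so a failing monitor cannot alter or interrupt the agent loop. The `observe` helper, the `Step` alias, and the stand-in data are hypothetical names for this example. In a real integration the callback would be something like the auditor's `push` method.

```python
from typing import Callable, Iterable, Iterator, Tuple

Step = Tuple[str, str]  # (tool name, tool result)


def observe(steps: Iterable[Step],
            on_step: Callable[[str, str], object]) -> Iterator[Step]:
    """Yield every step unchanged while reporting it to a monitor callback."""
    for tool, result in steps:
        try:
            on_step(tool, result)
        except Exception:
            # A monitor failure must never interrupt or alter the agent's run.
            pass
        yield tool, result


# Stand-in for an agent run and a monitor that only records tool names.
seen = []
fake_run = [("search", "3 matches"), ("edit", "ok"), ("test", "1 failed")]
passed_through = list(observe(fake_run, lambda tool, result: seen.append(tool)))
print(seen)
```

Swallowing every monitor exception is a deliberate choice for this sketch. A production wrapper would more likely log the error than discard it.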
What CAUM detects
Each step is classified into one of four behavioral regimes based on semantic similarity to recent steps:
- EXPLORER — diverse approaches, trying new strategies. Healthy. Sessions usually resolve.
- GRIND — slow progress, limited variety. Marginal. May resolve but inefficiently.
- STAGNATION — no forward progress. Bad. Session unlikely to resolve.
- LOOP — repeating same actions cyclically. Waste. 98.7% of wasted compute in our dataset.
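To make the four regimes concrete, here is a toy classifier over a sliding window of recent tool names. It is not CAUM's algorithm: it uses exact-match repetition and a simple variety ratio in place of semantic similarity. The window size, the thresholds, and the function names are all assumptions chosen for illustration.

```python
from collections import deque


def is_cycle(window):
    """True if the window is a pattern of some period repeated at least twice."""
    n = len(window)
    for period in range(1, n // 2 + 1):
        if all(window[i] == window[i - period] for i in range(period, n)):
            return True
    return False


def classify(window):
    """Label the most recent steps with one of the four regime names."""
    steps = list(window)
    if len(steps) < 4:
        return "EXPLORER"  # too little history to judge
    if is_cycle(steps):
        return "LOOP"
    variety = len(set(steps)) / len(steps)  # share of distinct tools in the window
    if variety >= 0.6:    # hypothetical threshold
        return "EXPLORER"
    if variety >= 0.35:   # hypothetical threshold
        return "GRIND"
    return "STAGNATION"


recent = deque(maxlen=6)  # hypothetical window size
labels = []
for tool in ["search", "open", "edit", "test", "edit", "test", "edit", "test"]:
    recent.append(tool)
    labels.append(classify(recent))
print(labels)
```

In this run the session starts out varied, narrows as the edit and test steps alternate, and is flagged as a loop once the window holds nothing but that two-step cycle.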
Privacy guarantee: CAUM reads only tool names and step structure, never code content, file contents, prompts, or business data. A CAUM integration therefore exposes nothing beyond that metadata.
Session certificate
Every session produces a cryptographically signed audit certificate (Ed25519) containing:
- UDS — Unified Diagnostic Score (0–1), overall session health
- Tier — T1 (OK) through T5 (CRITICAL)
- Regime distribution — % of steps in each behavioral state
- Waste % — exact fraction of compute wasted
- Shields fired — specific anomaly events detected
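The fields listed above could be modeled roughly as follows. This sketch shows a hypothetical certificate record, a helper that computes the regime distribution from per-step labels, and a deterministic JSON encoding of the kind a signature would be computed over. The class, field names, and values are assumptions rather than CAUM's actual schema, and the Ed25519 signing step is left out.

```python
import json
from collections import Counter
from dataclasses import dataclass, asdict


@dataclass
class SessionCertificate:
    uds: float                 # overall session health, 0 to 1
    tier: str                  # "T1" through "T5"
    regime_distribution: dict  # regime name -> percent of steps
    waste_pct: float
    shields_fired: list


def regime_distribution(labels):
    """Percentage of steps spent in each regime."""
    total = len(labels)
    if total == 0:
        return {}
    counts = Counter(labels)
    return {regime: round(100 * n / total, 2) for regime, n in counts.items()}


def canonical_bytes(cert):
    """Deterministic JSON encoding, so equal certificates yield equal bytes to sign."""
    return json.dumps(asdict(cert), sort_keys=True, separators=(",", ":")).encode("utf-8")


labels = ["EXPLORER"] * 6 + ["GRIND"] * 2 + ["LOOP"] * 2
cert = SessionCertificate(
    uds=0.72,        # placeholder value
    tier="T2",       # placeholder value
    regime_distribution=regime_distribution(labels),
    waste_pct=20.0,  # placeholder value
    shields_fired=[],
)
payload = canonical_bytes(cert)
print(payload.decode("utf-8"))
```

Sorting keys and fixing separators matters because a signature covers exact bytes: two encodings of the same certificate that differ only in whitespace or key order would not verify against each other.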
Analyze your SWE-agent trajectories now
Upload a trajectory JSONL and get a full 10-page forensic PDF report in under 3 minutes. First analysis free with code PIONEER.