Why SWE-agent sessions need behavioral monitoring
SWE-agent runs autonomous software engineering sessions that can span hundreds of tool calls. When a session stalls, it often does not crash; it repeats structural patterns. The agent may keep trying the same file edits, searches, or commands until it hits the step budget.
Research finding from 99,167 SWE-bench sessions: unresolved sessions showed 3.4x more loop/stagnation exposure than resolved ones (13.83% vs. 4.04% of steps). Of that exposure, 98.7% was loop evidence.
Cost exposure is best framed as a percentage of agent budget rather than per-session dollars, which look negligible in isolation. For example, at $25,000/month in agent/API spend, a 5% reviewable-exposure scenario comes to $1,250/month, or $15,000/year. Use the planner to plug in your own assumptions. This is scenario math, not guaranteed savings.
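The scenario arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `exposure_scenario` and its parameters are illustrative and are not part of the planner or of CAUM.

```python
def exposure_scenario(monthly_spend: float, exposure_share: float) -> dict:
    """Return monthly and annual reviewable exposure for one scenario.

    Both inputs are assumptions supplied by the reader, not measurements.
    """
    if not 0.0 <= exposure_share <= 1.0:
        raise ValueError("exposure_share must be between 0 and 1")
    monthly = monthly_spend * exposure_share
    return {"monthly": monthly, "annual": monthly * 12}


# The example from the text: $25,000/month spend, 5% reviewable exposure
scenario = exposure_scenario(monthly_spend=25_000, exposure_share=0.05)
print(f"${scenario['monthly']:,.0f}/month, ${scenario['annual']:,.0f}/year")
# prints: $1,250/month, $15,000/year
```

Changing either input shows how sensitive the figure is to the assumed exposure share, which is why the result should be read as a scenario rather than a savings estimate.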
Validated results on SWE-agent trajectories
A Cohen's d of +1.099 is a large effect in this research slice (by the usual convention, a d of 0.8 or more counts as large). It supports structural review; it is not a production claim that CAUM predicts all session failures.
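For readers unfamiliar with the statistic, Cohen's d is the difference between two group means divided by their pooled standard deviation. The sketch below shows the standard formula only. The sample values are placeholders chosen for illustration, not data from the study, and the text does not state which metric or grouping produced the +1.099 figure.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d for two samples, using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


# Placeholder per-session exposure shares (NOT study data)
unresolved = [0.10, 0.14, 0.18, 0.12, 0.16]
resolved = [0.02, 0.06, 0.04, 0.08, 0.05]
print(round(cohens_d(unresolved, resolved), 3))
```

A positive d means the first group has the higher mean. The function assumes each sample has at least two values and that the pooled variance is nonzero.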
Integration: 3 lines
CAUM attaches alongside SWE-agent as a passive observer. It receives each tool call and result, classifies the behavioral regime, and emits a running health score. The agent's execution is never modified.
```python
# Works with any SWE-agent version
from caum import ZeroTrustAuditor

aud = ZeroTrustAuditor(model_hint="gpt4o", framework_hint="swe-agent")

# Call after every tool execution in the agent loop
for tool, result in swe_agent.run(issue):
    signal = aud.push(tool, result)
    print(signal["regime"])  # EXPLORER / GRIND / STAGNATION / LOOP

receipt = aud.finalize()  # structural tier, evidence grade, hash-linked receipt
```
What CAUM detects
Each step is assigned a structural health signal from recent action patterns and event shape:
- EXPLORER - diverse structural action patterns.
- GRIND - slow structural progress and limited variety.
- STAGNATION - weak structural movement; review recommended.
- LOOP - repeated action cycles; cost/token exposure should be reviewed.
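To make the four regimes concrete, here is a toy heuristic that labels a window of recent tool names. It is not CAUM's classifier: the function `toy_regime`, the "three repetitions" rule, and the variety thresholds are all assumptions invented for this illustration.

```python
def toy_regime(recent_tools: list[str], max_cycle: int = 3) -> str:
    """Label a window of recent tool names with a toy structural regime."""
    n = len(recent_tools)
    if n == 0:
        return "EXPLORER"

    # LOOP: the window ends with the same cycle (length 1..max_cycle)
    # repeated three times in a row.
    for k in range(1, max_cycle + 1):
        if n >= 3 * k:
            tail = recent_tools[-3 * k:]
            if tail[:k] == tail[k:2 * k] == tail[2 * k:]:
                return "LOOP"

    # Otherwise grade by variety: distinct tools / window length.
    variety = len(set(recent_tools)) / n
    if variety <= 0.25:
        return "STAGNATION"
    if variety <= 0.5:
        return "GRIND"
    return "EXPLORER"


print(toy_regime(["find", "open", "edit", "test", "search", "submit"]))  # EXPLORER
print(toy_regime(["open", "edit", "test", "edit", "open", "test"]))      # GRIND
print(toy_regime(["edit", "test", "edit", "test", "edit", "test"]))      # LOOP
```

Note that the heuristic looks only at tool names and their order, which matches the structural, content-free framing of the regimes above.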
Privacy boundary: CAUM is designed to ingest tool names, step structure, counters, and hashes rather than code content, file contents, prompts, or business data. Teams should validate their integration before production use.
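One way a team might check this boundary on its own side is to reduce each tool event to structure-only fields before anything leaves the agent process. The sketch below assumes a hypothetical helper, `to_structural_event`, with illustrative field names. It is not CAUM's ingestion format.

```python
import hashlib


def to_structural_event(step: int, tool: str, result: str) -> dict:
    """Reduce a tool call to structure-only fields; the raw result is not kept."""
    payload = result.encode("utf-8")
    return {
        "step": step,                  # position in the session
        "tool": tool,                  # tool name only, no arguments
        "result_bytes": len(payload),  # size counter, not content
        "result_sha256": hashlib.sha256(payload).hexdigest(),
    }


event = to_structural_event(7, "edit", "def secret(): return 42")
print(sorted(event))  # field names only; the source text is absent
```

A plain hash of short or predictable content can be guessed by brute force, so teams with stricter requirements may prefer a keyed hash (for example, Python's `hmac` module) or may drop the digest entirely.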
Session receipt
Each session can produce a structural observation receipt containing:
- Structural health score - overall trace health
- Tier - T1 through T5 structural health band
- Regime distribution - share of steps in each structural state
- Exposure evidence - loop/stagnation/cycle evidence and token/cost exposure
- Review triggers - specific structural anomaly events detected
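As an illustration of how two of these fields could be derived, the sketch below computes a regime distribution from per-step labels and links each receipt to the previous one by hash. The field names, the `build_receipt` helper, and the definition of exposure as the share of LOOP plus STAGNATION steps are assumptions made for this example, not CAUM's receipt schema.

```python
import hashlib
import json
from collections import Counter


def regime_distribution(regimes: list[str]) -> dict:
    """Share of steps spent in each regime (empty dict for an empty session)."""
    counts = Counter(regimes)
    return {name: count / len(regimes) for name, count in counts.items()}


def build_receipt(regimes: list[str], prev_hash: str = "0" * 64) -> dict:
    """Build a toy receipt whose hash covers its body and the previous hash."""
    dist = regime_distribution(regimes)
    body = {
        "steps": len(regimes),
        "regime_distribution": dist,
        # Assumed definition: exposure = LOOP share + STAGNATION share
        "exposure_share": dist.get("LOOP", 0.0) + dist.get("STAGNATION", 0.0),
        "prev_hash": prev_hash,
    }
    # Canonical JSON (sorted keys) so the same body always hashes the same way
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    return {**body, "receipt_hash": digest}


first = build_receipt(["EXPLORER", "EXPLORER", "GRIND", "LOOP"])
second = build_receipt(["LOOP", "LOOP"], prev_hash=first["receipt_hash"])
print(first["regime_distribution"], first["exposure_share"])
```

Because each receipt's hash covers the previous receipt's hash, altering an earlier receipt changes every hash after it, which is the usual motivation for hash-linking.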
Analyze your SWE-agent trajectories now
Upload a trajectory JSONL and get an observation-only structural PDF report. First analysis free with code PIONEER.