Why SWE-agent sessions need behavioral monitoring

SWE-agent runs autonomous software engineering sessions that can span hundreds of tool calls. When a session stalls, it often does not crash; it repeats structural patterns. The agent may keep trying the same file edits, searches, or commands until it hits the step budget.

Research finding from 99,167 SWE-bench sessions: Unresolved sessions showed 3.4x more loop/stagnation exposure than resolved ones (13.83% vs 4.04% of steps). 98.7% of that exposure was loop evidence.

The cost case is best framed as a percentage of total agent budget rather than as per-session dollars, which are individually small. For example, at $25,000/month in agent/API spend, a 5% reviewable exposure scenario is $1,250/month or $15,000/year. Use the planner to substitute your own assumptions. This is scenario math, not guaranteed savings.
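The scenario arithmetic above can be reproduced in a few lines. This is a minimal sketch; the function name `exposure_scenario` is made up for illustration and is not part of the planner or of CAUM.

```python
def exposure_scenario(monthly_spend: float, exposure_share: float) -> tuple[float, float]:
    """Return (monthly, annual) dollars tied to a reviewable-exposure share.

    This is scenario math only: it multiplies spend by an assumed share
    and says nothing about how much of that amount is recoverable.
    """
    monthly = monthly_spend * exposure_share
    return monthly, monthly * 12


# The example from the text: $25,000/month in spend, 5% reviewable exposure.
monthly, annual = exposure_scenario(25_000, 0.05)
print(f"${monthly:,.0f}/month, ${annual:,.0f}/year")
```

Swapping in a different spend or share shows how sensitive the figure is to the assumed exposure percentage.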

Validated results on SWE-agent trajectories

Cohen's d (GPT-4o + SWE-agent): +1.099
AUC (GPT-4o discrimination): 0.757
Cohen's d (Claude 3.7): +0.775
Cohen's d (Llama): +0.968

A Cohen's d of +1.099 is a large effect in this research slice. It supports structural review, not a production claim that CAUM predicts all session failures.
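For readers less familiar with these two statistics, the sketch below shows the standard textbook definitions: Cohen's d with a pooled standard deviation, and AUC as a pairwise ranking probability. It assumes each session has been reduced to a single numeric score. The function names and sample values are illustrative, and this is not the evaluation code behind the figures above.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample variance."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def auc(positives, negatives):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This is the O(n*m) pairwise form, which is
    fine for small samples.
    """
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Made-up per-session scores, for illustration only.
group_a = [0.30, 0.22, 0.18, 0.25]
group_b = [0.05, 0.10, 0.20, 0.08]
print(round(cohens_d(group_a, group_b), 3), round(auc(group_a, group_b), 3))
```

An AUC of 0.5 means the score does not separate the groups at all, and 1.0 means perfect separation, which is why a value such as 0.757 is read as useful but imperfect discrimination.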

Integration in 3 lines

CAUM attaches alongside SWE-agent as a passive observer. It receives each tool call and result, classifies the behavioral regime, and emits a running health score. The agent's execution is never modified.

# Works with any SWE-agent version
from caum import ZeroTrustAuditor

aud = ZeroTrustAuditor(model_hint="gpt4o", framework_hint="swe-agent")

# Call after every tool execution in the agent loop.
# `swe_agent` and `issue` are placeholders for your own agent runner and task.
for tool, result in swe_agent.run(issue):
    signal = aud.push(tool, result)
    print(signal["regime"])  # EXPLORER / GRIND / STAGNATION / LOOP

receipt = aud.finalize()  # structural tier, evidence grade, hash-linked receipt

What CAUM detects

Each step is assigned a structural health signal from recent action patterns and event shape:

EXPLORER - diverse structural action patterns.
GRIND - slow structural progress and limited variety.
STAGNATION - weak structural movement; review recommended.
LOOP - repeated action cycles; cost/token exposure should be reviewed.
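To make "repeated action cycles" concrete, here is a toy detector that checks whether the tail of a tool-name sequence is a short block repeated several times. It is a hypothetical illustration, not CAUM's classifier; the function name, `max_period`, and `min_repeats` are assumptions chosen for the example.

```python
def repeated_cycle_length(actions, max_period=4, min_repeats=3):
    """Return the shortest period p such that the last p * min_repeats
    actions are one p-length block repeated, or 0 if no cycle is found."""
    for period in range(1, max_period + 1):
        window = period * min_repeats
        if len(actions) < window:
            break  # longer periods need even more history
        tail = actions[-window:]
        block = tail[:period]
        if all(tail[i] == block[i % period] for i in range(window)):
            return period
    return 0


# Hypothetical tool-name trace: an edit/test pair repeated three times.
trace = ["search", "edit", "test", "edit", "test", "edit", "test"]
print(repeated_cycle_length(trace))
```

Only tool names are inspected here, which matches the idea of judging structure rather than content; a real classifier would likely weigh more signals than exact repetition.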

Privacy boundary: CAUM is designed to ingest tool names, step structure, counters, and hashes rather than code content, file contents, prompts, or business data. Teams should validate their integration before production use.
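One way a team might enforce this boundary on its own side, before anything reaches an observer, is to reduce each tool call to structure-only fields. The sketch below is an assumption about how that could look; the function and field names (`to_structural_event`, `args_hash`, `output_chars`) are hypothetical and not CAUM's schema.

```python
import hashlib
import json


def to_structural_event(step_index, tool_name, arguments, output):
    """Reduce one tool call to structure-only fields.

    Raw arguments and output are never stored: arguments become a
    SHA-256 digest and output becomes a character count.
    """
    arg_bytes = json.dumps(arguments, sort_keys=True).encode("utf-8")
    return {
        "step": step_index,
        "tool": tool_name,
        "args_hash": hashlib.sha256(arg_bytes).hexdigest(),
        "output_chars": len(output),
    }


# Hypothetical tool call; the path and text below never leave this function.
event = to_structural_event(
    3, "edit_file", {"path": "src/app.py", "text": "secret"}, "ok"
)
print(event["tool"], event["output_chars"])
```

Identical arguments produce identical hashes, so repetition remains detectable without the content itself. Note that an unsalted hash of a short or guessable input can be brute-forced, so this belongs in the validation teams do before production use.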

Session receipt

Each session can produce a structural observation receipt containing:

Structural health score - overall trace health
Tier - T1 through T5 structural health band
Regime distribution - share of steps in each structural state
Exposure evidence - loop/stagnation/cycle evidence and token/cost exposure
Review triggers - specific structural anomaly events detected
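As a rough illustration of two of these fields, the sketch below computes a regime distribution from per-step labels and links a receipt to its predecessor by hashing. The receipt layout, the helper names, and the all-zero starting hash are assumptions made for the example; the actual receipt format is not specified here.

```python
import hashlib
import json
from collections import Counter


def regime_distribution(regimes):
    """Share of steps spent in each regime; empty input gives an empty dict."""
    total = len(regimes)
    if total == 0:
        return {}
    return {name: count / total for name, count in Counter(regimes).items()}


def link_receipt(body, previous_hash):
    """Attach the previous receipt's hash, then hash the combined payload."""
    payload = dict(body, previous_hash=previous_hash)
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return dict(payload, receipt_hash=hashlib.sha256(encoded).hexdigest())


# Hypothetical per-step labels from one session.
steps = ["EXPLORER", "LOOP", "LOOP", "GRIND"]
first = link_receipt({"regimes": regime_distribution(steps)}, "0" * 64)
second = link_receipt({"regimes": regime_distribution(steps)}, first["receipt_hash"])
print(second["previous_hash"] == first["receipt_hash"])
```

Because each receipt's hash covers the previous receipt's hash, altering an earlier receipt changes every hash after it, which is the general property a hash-linked chain provides.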

Analyze your SWE-agent trajectories now

Upload a trajectory JSONL and get an observation-only structural PDF report. First analysis free with code PIONEER.

