SWE-agent Integration

Monitoring for SWE-agent sessions

CAUM was evaluated on SWE-bench trajectories. Use it to observe loops, stagnation, and compute-cost exposure without reading code or prompts.

99,167 SWE sessions analyzed
+1.099 Cohen's d on GPT-4o
3.4× more exposure in unresolved runs
0 bytes of code/prompt read

Why SWE-agent sessions need behavioral monitoring

SWE-agent runs autonomous software engineering sessions that can span hundreds of tool calls. When a session stalls, it often does not crash; it repeats structural patterns. The agent may keep trying the same file edits, searches, or commands until it hits the step budget.

Research finding from 99,167 SWE-bench sessions: unresolved sessions showed 3.4× more loop/stagnation exposure than resolved ones (13.83% vs 4.04% of steps). 98.7% of that exposure was loop evidence.
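The 3.4× figure follows directly from the two per-step rates quoted above. A minimal check, using only the numbers in the finding (the variable names are illustrative):

```python
# Per-step exposure rates quoted in the research finding.
unresolved_rate = 13.83  # % of steps with loop/stagnation exposure, unresolved sessions
resolved_rate = 4.04     # % of steps with loop/stagnation exposure, resolved sessions

# Relative exposure of unresolved sessions versus resolved ones.
ratio = unresolved_rate / resolved_rate
print(f"unresolved vs resolved exposure: {ratio:.1f}x")
```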

Cost exposure is best expressed as a percentage of agent budget, because per-session dollar amounts are individually small. For example, at $25,000/month in agent/API spend, a 5% reviewable exposure scenario is $1,250/month or $15,000/year. Use the planner for your exact assumptions. This is scenario math, not guaranteed savings.
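The scenario arithmetic can be reproduced in a few lines. This is a sketch of the calculation only, not the planner itself; the function name `exposure_scenario` and its parameters are hypothetical, and the inputs are the example values from the paragraph above.

```python
def exposure_scenario(monthly_spend: float, exposure_share: float) -> dict:
    """Return monthly and annual reviewable exposure for a spend scenario.

    monthly_spend:  agent/API spend per month, in dollars.
    exposure_share: assumed fraction of spend that is reviewable (0.0 to 1.0).
    """
    if monthly_spend < 0:
        raise ValueError("monthly_spend must be non-negative")
    if not 0.0 <= exposure_share <= 1.0:
        raise ValueError("exposure_share must be between 0 and 1")
    monthly = monthly_spend * exposure_share
    return {"monthly": monthly, "annual": monthly * 12}


# Example values from the text: $25,000/month at a 5% exposure scenario.
scenario = exposure_scenario(25_000, 0.05)
print(f"${scenario['monthly']:,.0f}/month, ${scenario['annual']:,.0f}/year")
```

Changing `exposure_share` is the quickest way to see how sensitive the result is to the assumption; the output is a scenario, not a savings estimate.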

Validated results on SWE-agent trajectories

+1.099 Cohen's d (GPT-4o + SWE-agent)
0.757 AUC (GPT-4o discrimination)
+0.775 Cohen's d (Claude 3.7)
+0.968 Cohen's d (Llama)

A Cohen's d of +1.099 is a large effect in this research slice. It supports structural review, not a production claim that CAUM predicts all session failures.
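For readers unfamiliar with the statistic, Cohen's d is the difference between two group means divided by their pooled standard deviation. The sketch below shows the standard pooled-variance form on made-up scores. The data, the function name, and the direction of the comparison (unresolved minus resolved) are assumptions for illustration; the page does not state how the reported values were computed.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; d is undefined")
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Made-up per-session exposure scores, for illustration only.
unresolved = [0.18, 0.11, 0.15, 0.09, 0.16]
resolved = [0.05, 0.08, 0.03, 0.10, 0.04]
print(round(cohens_d(unresolved, resolved), 2))
```

A positive d here means the first group scores higher on average; by common convention, values around 0.8 and above are described as large.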

Integration — 3 lines

CAUM attaches alongside SWE-agent as a passive observer. It receives each tool call and result, classifies the behavioral regime, and emits a running health score. The agent's execution is never modified.

# Works with any SWE-agent version
from caum import ZeroTrustAuditor

aud = ZeroTrustAuditor(model_hint="gpt4o", framework_hint="swe-agent")

# Call after every tool execution in the agent loop.
# `swe_agent` and `issue` are placeholders for your own agent runner and task.
for tool, result in swe_agent.run(issue):
    signal = aud.push(tool, result)
    print(signal["regime"])  # EXPLORER / GRIND / STAGNATION / LOOP

receipt = aud.finalize()  # structural tier, evidence grade, hash-linked receipt
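One way to act on the per-step regime is to flag a session for human review once it has spent several consecutive steps in a looping or stagnating regime. The sketch below works on plain regime strings, so it runs without CAUM installed. Only the four regime names come from the example above; the helper `first_review_step`, the `REVIEW_REGIMES` set, and the `patience` threshold are hypothetical choices, not part of the CAUM API.

```python
from typing import Iterable, Optional

# Assumption: these two regimes are the ones worth escalating for review.
REVIEW_REGIMES = {"LOOP", "STAGNATION"}


def first_review_step(regimes: Iterable[str], patience: int = 5) -> Optional[int]:
    """Return the 0-based step index at which `patience` consecutive
    review-worthy regimes have been observed, or None if that never happens."""
    if patience < 1:
        raise ValueError("patience must be at least 1")
    streak = 0
    for step, regime in enumerate(regimes):
        # Any healthy step resets the streak.
        streak = streak + 1 if regime in REVIEW_REGIMES else 0
        if streak >= patience:
            return step
    return None


# Hypothetical sequence of regimes, as they might be read from signal["regime"].
observed = ["EXPLORER", "GRIND", "LOOP", "LOOP", "LOOP", "EXPLORER", "LOOP"]
flag_at = first_review_step(observed, patience=3)
print(flag_at)
```

The helper only reports a step index; it never interrupts the agent, which keeps it consistent with the observation-only design described above.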

What CAUM detects

Each step is assigned a structural health signal from recent action patterns and event shape. The integration example above names four regimes: EXPLORER, GRIND, STAGNATION, and LOOP.

Privacy boundary: CAUM is designed to ingest tool names, step structure, counters, and hashes rather than code content, file contents, prompts, or business data. Teams should validate their integration before production use.
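To make the privacy boundary concrete, the sketch below reduces a tool call to structure only (tool name, lengths, and a digest of the arguments) and then measures repetition over a sliding window. It is an illustration of content-free loop evidence under stated assumptions, not CAUM's actual detection method; the function names, event fields, and window size are all hypothetical.

```python
import hashlib
from collections import Counter


def structural_event(tool_name: str, raw_args: str, raw_result: str) -> dict:
    """Reduce a tool call to structure only: name, sizes, and a digest.

    The raw argument and result strings are not stored in the returned event.
    """
    digest = hashlib.sha256(raw_args.encode("utf-8")).hexdigest()
    return {
        "tool": tool_name,
        "args_sha256": digest,
        "args_len": len(raw_args),
        "result_len": len(raw_result),
    }


def repeat_ratio(events, window: int = 10) -> float:
    """Fraction of the last `window` events that duplicate an earlier
    (tool, args digest) pair within that same window."""
    recent = list(events)[-window:]
    if not recent:
        return 0.0
    counts = Counter((e["tool"], e["args_sha256"]) for e in recent)
    duplicates = sum(count - 1 for count in counts.values())
    return duplicates / len(recent)


# Hypothetical session: the same edit is attempted three times.
events = [
    structural_event("edit", "foo.py:12", "ok"),
    structural_event("edit", "foo.py:12", "ok"),
    structural_event("search", "def parse", "3 matches"),
    structural_event("edit", "foo.py:12", "ok"),
]
print(repeat_ratio(events))
```

Note that a plain SHA-256 of short, guessable arguments can be reversed by brute force, so a keyed hash (for example via Python's `hmac` module) may be a better fit where that matters.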

Session receipt

Each session can produce a hash-linked structural observation receipt that includes a structural tier and an evidence grade.
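The receipt format itself is not documented on this page. As a generic illustration of what hash linking provides, the sketch below chains per-step records so that each entry's hash covers the previous entry's hash, which makes later tampering detectable. The record fields, the function names, and the all-zero genesis value are assumptions, not the CAUM receipt schema.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumption: placeholder "previous hash" for the first entry


def _entry_hash(prev_hash: str, record: dict) -> str:
    # Canonical JSON so the same record always hashes to the same value.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def link_records(records):
    """Return entries where each hash covers the record and the previous hash."""
    chained, prev = [], GENESIS
    for record in records:
        current = _entry_hash(prev, record)
        chained.append({"record": record, "prev_hash": prev, "hash": current})
        prev = current
    return chained


def verify_chain(chained) -> bool:
    """Recompute every hash and confirm each entry links to its predecessor."""
    prev = GENESIS
    for entry in chained:
        if entry["prev_hash"] != prev:
            return False
        if entry["hash"] != _entry_hash(prev, entry["record"]):
            return False
        prev = entry["hash"]
    return True


# Hypothetical structural records: step number and regime only, no content.
chain = link_records([
    {"step": 0, "regime": "EXPLORER"},
    {"step": 1, "regime": "LOOP"},
])
print(verify_chain(chain))
```

Editing any earlier record changes its hash and breaks every later link, which is the property that makes such a receipt auditable after the fact.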

Analyze your SWE-agent trajectories now

Upload a trajectory JSONL and get an observation-only structural PDF report. First analysis free with code PIONEER.