SWE-agent Integration

Monitoring for SWE-agent sessions

CAUM was validated directly on SWE-bench trajectories. Detect loops, stagnation, and compute waste in real time — without reading a single line of code or prompt.

SWE sessions analyzed: 99,167
Cohen's d on GPT-4o: +1.099
More waste in failed runs: 3.4×
Bytes of code/prompt read: 0

Why SWE-agent sessions need behavioral monitoring

SWE-agent runs autonomous software engineering sessions that can span hundreds of tool calls. When a session fails, it typically doesn't crash — it loops. The agent keeps trying the same file edits, the same searches, the same commands in a cycle until it hits the step budget.

Key finding from 99,167 real SWE-bench sessions: Failed sessions waste 3.4× as much compute as successful ones (13.83% vs 4.04% of steps). Behavioral loops account for 98.7% of that waste.

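At 1,000 sessions/day and an average cost of $0.03 per session, total API spend is about $10,950/year, so even at the failed-session waste rate of 13.83% the wasted spend is about $1,500/year. The figure scales linearly with session volume and per-session cost. Use the calculator for your exact numbers.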
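The arithmetic behind a waste estimate can be sketched as follows. This is a minimal illustration, not the calculator's actual logic: the function name, the `failure_rate` parameter, and the assumption that cost is spread evenly across steps are all hypothetical. Only the two default waste rates (13.83% and 4.04%) come from the figures above.

```python
def annual_waste_cost(sessions_per_day, cost_per_session, failure_rate,
                      waste_failed=0.1383, waste_ok=0.0404, days=365):
    """Estimate yearly API spend lost to wasted steps.

    Assumes cost is spread evenly across steps, so the share of wasted
    steps equals the share of wasted spend (a simplification).
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    # Blend the two waste rates by how often sessions fail.
    blended_waste = failure_rate * waste_failed + (1.0 - failure_rate) * waste_ok
    annual_spend = sessions_per_day * cost_per_session * days
    return annual_spend * blended_waste


# Hypothetical inputs: 500 sessions/day at $0.50 each, 60% of sessions failing.
print(f"${annual_waste_cost(500, 0.50, 0.60):,.0f} per year")
```

Because the estimate is linear in both volume and per-session cost, doubling either input doubles the result; the failure rate only shifts the blended waste rate between 4.04% and 13.83%.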

Validated results on SWE-agent trajectories

Cohen's d (GPT-4o + SWE-agent): +1.099
AUC (GPT-4o discrimination): 0.757
Cohen's d (Claude 3.7): +0.775
Cohen's d (Llama): +0.968

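By Cohen's conventional thresholds (0.8 and above is large), the GPT-4o result of +1.099 is a large effect, the Llama result of +0.968 is also large, and the Claude 3.7 result of +0.775 sits just below the threshold. In all three model families, failed sessions show clearly more waste than successful ones, so waste is a meaningful signal of session failure rather than noise.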
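For readers who want to reproduce this kind of statistic on their own trajectories, the sketch below computes Cohen's d and AUC from per-session waste fractions. It assumes the pooled-standard-deviation form of Cohen's d and the pairwise-comparison definition of AUC; the page does not say which variants were used, and the sample numbers are made up for illustration.

```python
from statistics import mean, variance


def cohens_d(failed, succeeded):
    """Cohen's d with a pooled standard deviation (failed minus succeeded)."""
    n1, n2 = len(failed), len(succeeded)
    pooled_var = ((n1 - 1) * variance(failed)
                  + (n2 - 1) * variance(succeeded)) / (n1 + n2 - 2)
    return (mean(failed) - mean(succeeded)) / pooled_var ** 0.5


def auc(failed, succeeded):
    """Probability that a random failed session shows more waste than a
    random successful one (ties count as half)."""
    wins = sum((f > s) + 0.5 * (f == s) for f in failed for s in succeeded)
    return wins / (len(failed) * len(succeeded))


# Made-up waste fractions per session, for illustration only.
failed = [0.20, 0.10, 0.15, 0.05, 0.18]
succeeded = [0.04, 0.02, 0.08, 0.03, 0.06]
print(cohens_d(failed, succeeded), auc(failed, succeeded))
```

A positive d means failed sessions waste more on average; an AUC of 0.5 would mean waste carries no information about the outcome, and 1.0 would mean it separates the two groups perfectly.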

Integration — 3 lines

CAUM attaches alongside SWE-agent as a passive observer. It receives each tool call and result, classifies the behavioral regime, and emits a running health score. The agent's execution is never modified.

# Works with any SWE-agent version
from caum import ZeroTrustAuditor

aud = ZeroTrustAuditor(model_hint="gpt4o", framework_hint="swe-agent")

# Call after every tool execution in the agent loop.
# `swe_agent.run(issue)` is a placeholder for your own loop yielding (tool, result) pairs.
for tool, result in swe_agent.run(issue):
    verdict = aud.push(tool, result)
    print(verdict["regime"])  # EXPLORER / GRIND / STAGNATION / LOOP

cert = aud.finalize()  # UDS score 0–1, tier T1–T5, Ed25519 signed

What CAUM detects

Each step is classified into one of four behavioral regimes (EXPLORER, GRIND, STAGNATION, or LOOP) based on its semantic similarity to recent steps.

Privacy guarantee: CAUM reads only tool names and step structure — never code content, file contents, prompts, or business data. A CAUM integration adds zero additional data exposure.
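To show that loops are visible from tool names alone, here is a toy detector that checks whether a session's most recent steps are one short cycle repeated back to back. This is an illustrative sketch only, not CAUM's classification algorithm; the function name, thresholds, and tool names are hypothetical.

```python
def find_loop(tool_names, max_period=4, min_repeats=3):
    """Return the shortest cycle repeated at the end of the sequence, or None.

    Looks only at tool names, never at arguments or outputs.
    """
    for period in range(1, max_period + 1):
        needed = period * min_repeats
        if len(tool_names) < needed:
            break
        tail = tool_names[-needed:]
        cycle = tail[:period]
        # The tail is a loop if it is the same cycle repeated back to back.
        if all(tail[i] == cycle[i % period] for i in range(needed)):
            return cycle
    return None


# Hypothetical tool-name trace: the agent ends up alternating edit and test runs.
steps = ["search", "open", "edit", "run_tests",
         "edit", "run_tests", "edit", "run_tests"]
print(find_loop(steps))
```

The detector never inspects what was edited or what the tests printed, which is the property the privacy guarantee describes; a production classifier would need to be more tolerant of near-repeats than this exact-match check.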

Session certificate

Every session produces a cryptographically signed audit certificate (Ed25519) that includes the session's UDS score (0–1) and tier (T1–T5).

Analyze your SWE-agent trajectories now

Upload a trajectory JSONL and get a full 10-page forensic PDF report in under 3 minutes. First analysis free with code PIONEER.