Framework-agnostic validation
The key test for any behavioral monitor is whether it works across frameworks, not just the one it was trained on. CAUM was validated on both SWE-agent and OpenHands sessions using the same engine, with no retraining.
The framework-agnostic claim holds: with GPT-4o, the same engine scores d=+1.099, AUC=0.757 on SWE-agent and d=+1.131, AUC=0.852 on OpenHands. The signal is actually stronger on OpenHands, which indicates the engine generalizes across execution environments rather than overfitting to one.
Performance across all tested configurations
| Model | Framework | Cohen's d | AUC | Status |
|---|---|---|---|---|
| GPT-4o | OpenHands | +1.131 | 0.852 | BEST RESULT |
| GPT-4o | SWE-agent | +1.099 | 0.757 | EXCELLENT |
| Llama 3.x | SWE-agent | +0.968 | 0.747 | EXCELLENT |
| Gemini Flash | mini-SWE | +0.804 | 0.722 | EXCELLENT |
| Claude 3.7 | SWE-agent | +0.775 | 0.650 | GOOD |
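Both headline metrics in the table can be computed directly from per-session scores: Cohen's d is the mean difference between the two groups over their pooled standard deviation, and AUC is the probability that a randomly chosen positive session outranks a randomly chosen negative one. A minimal stdlib sketch (the function names and score arrays are illustrative, not CAUM internals):

```python
from statistics import mean, variance

def cohens_d(a, b):
    """Effect size: mean difference over the pooled sample standard deviation."""
    pooled_var = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) \
        / (len(a) + len(b) - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5

def auc(pos, neg):
    """ROC AUC via pairwise ranking: P(pos score > neg score), ties count half."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.852, for example, means that in 85.2% of positive/negative session pairs the monitor ranks the positive one higher.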
Integration with OpenHands
```python
# Works with OpenHands cloud and self-hosted
from caum import ZeroTrustAuditor

aud = ZeroTrustAuditor(
    model_hint="gpt4o",
    framework_hint="openhands",
)

# Wrap the OpenHands event stream
for event in openhands_runtime.events():
    verdict = aud.push(event.action, event.observation)
    if verdict["regime"] == "LOOP":
        notify_team("Loop detected", verdict["severity"])

cert = aud.finalize()
# cert["uds"] health score · cert["tier"] T1–T5 · Ed25519 signed
```
Analyze your OpenHands trajectories
Upload a trajectory JSONL and get a 10-page forensic PDF report. First analysis free with code PIONEER.
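A recorded trajectory can also be replayed locally through the auditor before uploading. A minimal sketch, assuming each JSONL line carries `action` and `observation` fields (the field names and the `stream_trajectory` helper are illustrative; adapt them to your export format):

```python
import json

def stream_trajectory(path, auditor):
    """Replay a trajectory JSONL through an auditor, one event per line.

    The auditor is expected to expose push(action, observation) and
    finalize(), matching the integration snippet above.
    """
    verdicts = []
    with open(path) as fh:
        for line in fh:
            event = json.loads(line)
            verdicts.append(auditor.push(event["action"], event["observation"]))
    return verdicts, auditor.finalize()
```

Replaying locally gives per-event verdicts up front, so only sessions that actually trip a regime need the full forensic report.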