Framework-agnostic validation
The key test for any behavioral monitor is whether it works across frameworks, not just the one it was trained on. CAUM was validated on both SWE-agent and OpenHands sessions using the same monitor, with no retraining.
The framework-agnostic result holds: GPT-4o scores d=+1.099 (AUC=0.757) on SWE-agent and d=+1.131 (AUC=0.852) on OpenHands. The signal is actually stronger on OpenHands, which suggests the monitor generalizes across execution environments.
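Both metrics are standard and easy to recompute from raw per-session scores. A minimal sketch, assuming you have one array of scores for flagged sessions and one for clean sessions (the function names and sample data are illustrative, not part of the CAUM API):

```python
import numpy as np

def cohens_d(pos, neg):
    """Effect size: difference of means over the pooled standard deviation."""
    pos, neg = np.asarray(pos, float), np.asarray(neg, float)
    n1, n2 = len(pos), len(neg)
    pooled_sd = np.sqrt(
        ((n1 - 1) * pos.var(ddof=1) + (n2 - 1) * neg.var(ddof=1)) / (n1 + n2 - 2)
    )
    return (pos.mean() - neg.mean()) / pooled_sd

def auc(pos, neg):
    """Mann-Whitney AUC: probability a positive score outranks a negative one."""
    pos, neg = np.asarray(pos, float), np.asarray(neg, float)
    wins = sum((p > neg).sum() + 0.5 * (p == neg).sum() for p in pos)
    return wins / (len(pos) * len(neg))
```

A Cohen's d above 1.0 with an AUC above 0.75, as in the table below, means the two score distributions are separated by more than one pooled standard deviation.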
Performance across all tested configurations
| Model | Framework | Cohen's d | AUC | Status |
|---|---|---|---|---|
| GPT-4o | OpenHands | +1.131 | 0.852 | BEST RESULT |
| GPT-4o | SWE-agent | +1.099 | 0.757 | EXCELLENT |
| Llama 3.x | SWE-agent | +0.968 | 0.747 | EXCELLENT |
| Gemini Flash | mini-SWE | +0.804 | 0.722 | EXCELLENT |
| Claude 3.7 | SWE-agent | +0.775 | 0.650 | GOOD |
Integration with OpenHands
```python
# Works with OpenHands cloud and self-hosted
from caum import ZeroTrustAuditor

aud = ZeroTrustAuditor(
    model_hint="gpt4o",
    framework_hint="openhands",
)

# Wrap the OpenHands event stream
for event in openhands_runtime.events():
    signal = aud.push(event.action, event.observation)
    if signal["regime"] == "LOOP":
        notify_team("Loop detected", signal["severity"])

receipt = aud.finalize()
# receipt["tier"]: T1-T5 · evidence grade · hash-linked metadata
```
Analyze your OpenHands trajectories
Upload a trajectory JSONL and get an observation-only structural PDF report. First analysis free with code PIONEER.
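Trajectories already on disk can also be replayed through the auditor offline. A minimal sketch, assuming each JSONL line carries `action` and `observation` fields (the exact keys depend on how your framework exports trajectories; `replay_trajectory` is an illustrative helper, not part of the CAUM API):

```python
import json

def replay_trajectory(path, auditor):
    """Push each recorded step through an auditor and return its final receipt."""
    with open(path) as f:
        for line in f:
            step = json.loads(line)
            # Assumed schema: one JSON object per line with these two keys.
            auditor.push(step.get("action"), step.get("observation"))
    return auditor.finalize()
```

The same `push`/`finalize` loop from the live integration applies unchanged, so an offline batch run and a live session produce receipts of the same shape.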