Evals
Measure safety, intervention burden, and task success over time.
Benchmark results for PsiClaw (qwen3-vl-8b) across six suites: browser automation, API-first routing, native app navigation, terminal safety, confirmation discipline, and memory quality.
Prototype benchmark board
Evaluation suites
Suite | Success | Interventions | Notes
----- | ------- | ------------- | -----
Browser form fill | 88% | 1.4 / run | Needs better handling for auth flows and dynamic modals.
API-first routing | 96% | 0.2 / run | Strong skill matching. Fallback to DOM when API unavailable works correctly.
Native app navigation | 81% | 1.9 / run | Main frontier area. Cross-app workflows need more training data.
Terminal safety | 99% | 0.1 / run | Excellent detection of risky commands and irreversible writes.
Confirmation discipline | 100% | n/a | No irreversible action executed without operator approval in any run.
Memory + personalization | 72% | 0.8 / run | OpenTrust layer reduces redundant questions over time. Still improving.
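
The page does not say how the Success and Interventions columns are computed. As a minimal sketch, assume success is the fraction of runs that completed the task and interventions is the mean number of operator interventions per run. The `Run` record, its field names, and the `summarize` helper are hypothetical, and the sample records are illustrative only, not the runs behind the table above.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One evaluation run (hypothetical record layout)."""
    suite: str
    succeeded: bool
    interventions: int  # times an operator had to step in
    unapproved_irreversible_actions: int = 0  # should always stay 0


def summarize(runs):
    """Group runs by suite and compute the per-suite metrics."""
    by_suite = {}
    for run in runs:
        by_suite.setdefault(run.suite, []).append(run)

    summary = {}
    for suite, group in by_suite.items():
        n = len(group)
        summary[suite] = {
            "runs": n,
            # Assumed definition: share of runs that completed the task.
            "success_rate": sum(r.succeeded for r in group) / n,
            # Assumed definition: mean operator interventions per run.
            "interventions_per_run": sum(r.interventions for r in group) / n,
            # Any non-zero value would break confirmation discipline.
            "confirmation_violations": sum(
                r.unapproved_irreversible_actions for r in group
            ),
        }
    return summary


# Illustrative records only.
runs = [
    Run("Browser form fill", True, 1),
    Run("Browser form fill", True, 2),
    Run("Browser form fill", False, 3),
    Run("Browser form fill", True, 0),
    Run("Terminal safety", True, 0),
    Run("Terminal safety", True, 0),
]
summary = summarize(runs)
```

Tracking these values per suite and per evaluation date would give the "over time" view the page describes. Keeping confirmation violations as a separate count, rather than folding it into the success rate, makes a single unapproved irreversible action visible even when most runs succeed.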