Evals

Measure safety, intervention burden, and task success over time.

Benchmark results for PsiClaw (qwen3-vl-8b) across six suites: browser automation, API-first routing, native app navigation, terminal safety, confirmation discipline, and memory quality.

Prototype benchmark board

Evaluation suites

| Suite | Success | Interventions | Notes |
| --- | --- | --- | --- |
| Browser form fill | 88% | 1.4 / run | Needs better handling for auth flows and dynamic modals. |
| API-first routing | 96% | 0.2 / run | Strong skill matching. Fallback to DOM when API unavailable works correctly. |
| Native app navigation | 81% | 1.9 / run | Main frontier area. Cross-app workflows need more training data. |
| Terminal safety | 99% | 0.1 / run | Excellent detection of risky commands and irreversible writes. |
| Confirmation discipline | 100% | — / run | No irreversible action executed without operator approval in any run. |
| Memory + personalization | 72% | 0.8 / run | OpenTrust layer reduces redundant questions over time. Still improving. |
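
The Success and Interventions columns above are per-suite aggregates over individual runs. The page does not say how they are computed, so the following is only a minimal sketch of one plausible way to derive them. The `RunRecord` type, its field names, and the `summarize` function are all hypothetical, and the sample data is invented for illustration rather than taken from the board.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One eval run. Every field name here is an assumption, not PsiClaw's schema."""
    suite: str
    succeeded: bool
    interventions: int                 # operator interventions during the run
    unapproved_irreversible: int = 0   # irreversible actions executed without approval


def summarize(runs):
    """Group runs by suite and compute per-suite averages."""
    by_suite = {}
    for run in runs:
        by_suite.setdefault(run.suite, []).append(run)

    summary = {}
    for suite, group in by_suite.items():
        n = len(group)
        summary[suite] = {
            "runs": n,
            # Fraction of runs that completed the task.
            "success_rate": sum(r.succeeded for r in group) / n,
            # Mean number of operator interventions per run.
            "interventions_per_run": sum(r.interventions for r in group) / n,
            # Fraction of runs with no unapproved irreversible action.
            "confirmation_rate": sum(r.unapproved_irreversible == 0 for r in group) / n,
        }
    return summary


if __name__ == "__main__":
    # Made-up sample data, not real benchmark results.
    sample = [
        RunRecord("Browser form fill", True, 1),
        RunRecord("Browser form fill", True, 2),
        RunRecord("Browser form fill", False, 3),
        RunRecord("Browser form fill", True, 0),
    ]
    for suite, stats in summarize(sample).items():
        print(suite, stats)
```

Tracking these numbers over time would then amount to calling `summarize` on the runs from each evaluation round and comparing the per-suite values between rounds.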