Multi-model dashboard

Agent Peak Bench compares models on the dimensions that drive deployment decisions: business fit, tool reliability, context stability, schema reliability, and harness dependency.

Current public data includes measured pilot output for MiniMax M2.7 High r7. The other rows in the sample contract are fixtures and are explicitly marked as not measured.

- Measured models: 1
- Fixture rows: 2
- Sample trials: 133
- Status: Pilot
(Screenshot: multi-model dashboard sample)

Benchmark output sample

The public sample shows the shape of aggregate measured output without raw traces, API keys, or private tool results.
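As a rough illustration, the aggregate output shape could look like the sketch below. The field names, score values, and the `is_publishable` check are hypothetical assumptions for this example, not the actual published schema.

```python
# Hypothetical sketch of an aggregate sample row. Field names and score
# values are illustrative assumptions, not the real benchmark schema.
sample_row = {
    "model": "MiniMax M2.7 High r7",
    "measured": True,        # fixture rows would carry measured: False
    "trials": 133,
    "scores": {              # one score per benchmark dimension
        "business_fit": 0.81,
        "tool_reliability": 0.77,
        "context_stability": 0.74,
        "schema_reliability": 0.79,
        "harness_dependency": 0.42,
    },
}

def is_publishable(row: dict) -> bool:
    """Aggregate rows only: no raw traces, API keys, or private tool results."""
    forbidden = {"raw_traces", "api_keys", "private_tool_results"}
    return forbidden.isdisjoint(row)

assert is_publishable(sample_row)
```

The point of the shape is that everything in a public row is an aggregate; anything trace-level or credential-like stays out of the sample entirely.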

(Screenshot: benchmark sample output)

Readiness labels

| Label | Meaning |
| --- | --- |
| `assistant_only` | Useful for drafting or advisory flows; no autonomous side effects. |
| `human_in_loop` | Can prepare decisions, but approval is required. |
| `guarded_autonomy` | Can execute low-risk steps with verifier, router, and audit controls. |
| `not_ready` | Failure clusters are too broad or unstable for the target workflow. |
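A consumer of these labels would typically gate what an agent may do. The sketch below is a minimal, assumed mapping from readiness label to permitted actions; the permission sets and action names are illustrative, not part of the benchmark.

```python
# Minimal sketch of gating agent actions by readiness label.
# Labels come from the table above; the permission sets are
# illustrative assumptions, not defined by the benchmark.
PERMISSIONS = {
    "assistant_only": {"draft"},
    "human_in_loop": {"draft", "prepare_decision"},
    "guarded_autonomy": {"draft", "prepare_decision", "execute_low_risk"},
    "not_ready": set(),
}

def allowed(label: str, action: str) -> bool:
    """Unknown labels fail closed: no actions are permitted."""
    return action in PERMISSIONS.get(label, set())

assert allowed("guarded_autonomy", "execute_low_risk")
assert not allowed("human_in_loop", "execute_low_risk")
assert not allowed("not_ready", "draft")
```

Failing closed on unknown labels mirrors the intent of `not_ready`: absence of evidence of readiness is treated as not ready.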