Mentor Worker Benchmark Leaderboard
What This Measures
This benchmark asks one question: does mentor guidance help a worker model solve objective coding tasks scored by tests?
Baseline and Mentored are mean scores across replicates, each with a task-family bootstrap confidence interval (CI).
Lift is Mentored minus Baseline, with a paired task-family bootstrap CI and a sig marker when the CI excludes 0.
Errors and Timeouts count model-call failures; sanity runs focus on harness health and are not headline performance claims.
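To make the Lift column concrete, here is a minimal sketch of a paired task-family bootstrap. It is an illustration under stated assumptions, not the benchmark's actual implementation: the function name `paired_family_bootstrap_ci`, the dict-of-lists data layout, the percentile interval, the 2000 resamples, and the choice to average per-family lifts are all hypothetical.

```python
import random
from statistics import mean


def paired_family_bootstrap_ci(baseline, mentored, n_resamples=2000, alpha=0.05, seed=0):
    """Estimate lift (mentored minus baseline) with a percentile bootstrap CI.

    baseline, mentored: dicts mapping a task-family name to a list of
    per-task scores. Both dicts must share the same keys. (Assumed layout.)
    Returns (point_estimate, ci_low, ci_high, significant).
    """
    families = sorted(baseline)
    # Pairing: compute the lift inside each family first, so a family's
    # baseline and mentored scores always travel together when resampled.
    lifts = [mean(mentored[f]) - mean(baseline[f]) for f in families]

    rng = random.Random(seed)
    boot = []
    for _ in range(n_resamples):
        # Resample whole families with replacement, not individual tasks.
        sample = rng.choices(lifts, k=len(lifts))
        boot.append(mean(sample))
    boot.sort()

    # Percentile interval; indices are clamped to stay inside the list.
    lo_idx = min(n_resamples - 1, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    ci_low, ci_high = boot[lo_idx], boot[hi_idx]

    # "sig" marker: the interval excludes 0.
    significant = ci_low > 0 or ci_high < 0
    return mean(lifts), ci_low, ci_high, significant


if __name__ == "__main__":
    base = {"parsing": [0, 1], "graphs": [0, 0], "strings": [1, 0]}
    ment = {"parsing": [1, 1], "graphs": [1, 0], "strings": [1, 1]}
    print(paired_family_bootstrap_ci(base, ment))
```

Resampling families rather than individual tasks treats tasks within a family as correlated, which generally gives wider and more honest intervals when a family contains many near-duplicate tasks.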
Leaderboard
| Submission | Role | Pack | Suite | Top Worker | Baseline | Mentored | Lift | Errors | Timeouts | Commit |
|---|---|---|---|---|---|---|---|---|---|---|
Headline rows are baseline performance numbers. Sanity rows are harness-health checks.