Mentor Worker Benchmark Leaderboard

Generated: 2026-05-10T06:52:00+00:00 | Submissions: 9 | Official: 4 | Community: 5

What This Measures

This benchmark asks one question: does mentor guidance help a worker model solve objective coding tasks scored by tests?

Baseline and Mentored are means across replicates with task-family bootstrap confidence intervals.

Lift is Mentored minus Baseline, reported with a paired task-family bootstrap CI and a sig marker when the CI excludes 0.
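As a rough illustration of a paired task-family bootstrap, the sketch below resamples whole task families with replacement, keeps each task's baseline and mentored scores paired, and reads a percentile interval off the resampled mean lifts. It is not the benchmark's actual implementation. The function name, the `scores` layout, the equal weighting of families, the percentile method, and the defaults (`n_boot`, `alpha`, `seed`) are all assumptions made for this example.

```python
import random
from statistics import mean


def paired_family_bootstrap_lift(scores, n_boot=2000, alpha=0.05, seed=0):
    """Estimate lift with a paired, family-level percentile bootstrap.

    scores: {family_name: [(baseline_score, mentored_score), ...]}
    This layout is hypothetical, not the benchmark's real schema.
    Returns (point_estimate, ci_low, ci_high, sig).
    """
    rng = random.Random(seed)
    families = sorted(scores)

    # Pair by task first: lift is mentored minus baseline on the same task,
    # then averaged within each family (assumption: families weigh equally).
    family_lift = {f: mean([m - b for b, m in scores[f]]) for f in families}
    point = mean([family_lift[f] for f in families])

    # Resample families, not individual tasks, so tasks that share a family
    # move together and within-family correlation is respected.
    draws = []
    for _ in range(n_boot):
        sample = rng.choices(families, k=len(families))
        draws.append(mean([family_lift[f] for f in sample]))
    draws.sort()

    # Percentile interval over the sorted bootstrap draws.
    low = draws[int((alpha / 2) * n_boot)]
    high = draws[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]

    # The sig marker applies only when the interval excludes 0.
    sig = low > 0 or high < 0
    return point, low, high, sig
```

Resampling at the family level generally gives wider, more conservative intervals than resampling individual tasks, because tasks within one family tend to succeed or fail together.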

Errors and Timeouts count model-call failures; sanity runs focus on harness health and are not headline performance claims.

Leaderboard

Submission	Role	Pack	Suite	Top Worker	Baseline	Mentored	Lift	Errors	Timeouts	Commit

Headline rows are baseline performance numbers. Sanity rows are harness-health checks.

Raw normalized summaries: leaderboard/summary.json | Markdown: docs/leaderboard.md