Mentor Worker Benchmark Leaderboard
What This Measures
This benchmark asks one question: does mentor guidance help a worker model solve objective coding tasks scored by tests?
Baseline and Mentored are means across replicates with task-family bootstrap confidence intervals.
Lift is mentored minus baseline, with a paired task-family bootstrap CI and a sig marker when CI excludes 0.
Errors and Timeouts count model-call failures; sanity runs focus on harness health and are not headline performance claims.
Hover glossary chips for plain-English definitions.
Baseline
Mentored
Lift
Model errors
Timeouts
Pack
Suite
Leaderboard
| Submission | Role | Pack | Suite | Top Worker | Baseline | Mentored | Lift | Errors | Timeouts | Commit |
|---|
Headline rows are baseline performance numbers. Sanity rows are harness-health checks.