Mentor Worker Benchmark Leaderboard

Generated: 2026-03-07T16:23:17+00:00 | Submissions: 9 | Official: 4 | Community: 5

What This Measures

This benchmark asks one question: does mentor guidance help a worker model solve objective coding tasks scored by tests?

Baseline and Mentored are means across replicates with task-family bootstrap confidence intervals.

Lift is Mentored minus Baseline, reported with a paired task-family bootstrap CI and a significance ("sig") marker when that CI excludes 0.
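The lift interval can be sketched as follows. This is a minimal illustration rather than the harness's actual code: it assumes a percentile bootstrap at the 95% level, per-task scores in [0, 1], and a hypothetical function name (`paired_family_bootstrap_lift`) and family names (`strings`, `graphs`, `parsing`). Whole task families are resampled with replacement, and the same resampled families feed both arms, which is what makes the comparison paired.

```python
import random
from statistics import mean


def paired_family_bootstrap_lift(scores, n_boot=2000, alpha=0.05, seed=0):
    """Estimate lift (mentored - baseline) with a paired family-level bootstrap.

    scores maps a task-family name to a list of (baseline, mentored) score
    pairs, one pair per task. This layout is an assumption for illustration.
    """
    rng = random.Random(seed)
    families = sorted(scores)

    def lift_of(sample):
        # Pool every task from the sampled families; both arms use the
        # same sample, so the difference is paired.
        pairs = [pair for fam in sample for pair in scores[fam]]
        return mean(m for _, m in pairs) - mean(b for b, _ in pairs)

    point = lift_of(families)

    # Resample families (not individual tasks) with replacement.
    draws = sorted(
        lift_of([rng.choice(families) for _ in families])
        for _ in range(n_boot)
    )

    # Percentile interval (an assumed choice of bootstrap CI method).
    lo = draws[int((alpha / 2) * n_boot)]
    hi = draws[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]

    # "sig" when the interval excludes 0.
    sig = lo > 0 or hi < 0
    return point, lo, hi, sig


# Hypothetical example data: three families, a few tasks each.
example = {
    "strings": [(0.50, 0.70), (0.25, 0.45)],
    "graphs": [(0.75, 0.95)],
    "parsing": [(0.40, 0.60), (0.10, 0.30)],
}
point, lo, hi, sig = paired_family_bootstrap_lift(example)
print(f"lift={point:.3f} CI=[{lo:.3f}, {hi:.3f}] sig={sig}")
```

Resampling at the family level keeps correlated tasks together, so the interval tends to be wider (and more honest) than one built by resampling individual tasks.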

Errors and Timeouts count model-call failures; sanity runs focus on harness health and are not headline performance claims.


Leaderboard

Submission | Role | Pack | Suite | Top Worker | Baseline | Mentored | Lift | Errors | Timeouts | Commit

Headline rows are baseline performance numbers. Sanity rows are harness-health checks.

Raw normalized summaries: leaderboard/summary.json | Markdown: docs/leaderboard.md