
ScaffBench: measuring coding agents on real fullstack scaffolding

102 runs across Claude Code, Codex CLI, Gemini CLI, Kilo, and opencode, measuring speed, cost, tokens, and whether generated projects install and build.

June 12, 2026 · Ibrahim Elkamali

Tags: benchmark, mcp, claude-code, agents

Coding agents are very good at writing code and surprisingly bad at starting projects. Ask one to scaffold a production-grade fullstack monorepo and it will happily spend ten minutes hand-writing manifests, lockfiles, and config — and the result frequently doesn't build.

ScaffBench measures exactly that: an agent in an empty workspace, a project spec, and a hard question at the end — does the generated project install and build? We ran the same specs through three creation paths to isolate how much tooling helps:

  • MCP — the agent uses the Better-Fullstack MCP server to plan and scaffold the project.
  • BF mention — no MCP; the prompt points the agent at the Better-Fullstack CLI and docs, and it composes the create command itself.
  • Prompt — no Better-Fullstack at all; the agent hand-writes every file from scratch.

We ran the full suite twice, on two different agent CLIs: OpenAI Codex CLI with three GPT models (June 10), and Claude Code with three Claude models (June 12) — 72 runs in total. Same specs, same prompts, same validation. We then added a deliberately light third sweep — nine more models across Gemini CLI, Kilo, and opencode, on the lightest spec only — bringing the total to 102 runs and fifteen models.

Headline results

The headline numbers come from the Claude Code sweep — 36 runs: three models (Fable 5, Opus 4.8, Sonnet 4.6) × three creation paths × four project specs. Same machine, same harness, same prompts apart from the creation-mode instructions. The GPT sweep shows the same structure and is reported separately below.

| Creation path | Avg time | Median time | Avg output tokens | Builds passing |
|---|---|---|---|---|
| MCP | 113.4s | 85.9s | 5,553 | 12/12 (100%) |
| BF mention | 219.6s | 172.1s | 11,060 | 9/9 (100%)* |
| Prompt | 516.2s | 424.0s | 25,859 | 9/12 (75%) |

* Three BF-mention runs are excluded from the denominator because they failed on a template generator bug on our side, not on anything the agent did — see the scoring policy.

Three things stood out:

  1. MCP is ~4.6× faster than prompt-only on average (and ~4.9× on medians), with ~4.7× fewer output tokens. The agent ships configuration, not code.
  2. Every MCP run passed validation, for every model. Prompt-only runs passed 75% — two runs hit the 15-minute timeout and one produced a monorepo that didn't build.
  3. Tooling compresses model differences. On the MCP path even the smallest model matched the frontier models at 100% pass rate — it just got there faster and cheaper.
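The ratios quoted in the list above follow directly from the headline table. A minimal Python sketch of that arithmetic, with the table values copied in by hand (the `HEADLINE` dictionary and the `ratio` helper are illustrative names, not part of the ScaffBench harness):

```python
# Headline figures from the Claude Code sweep, copied from the table above.
HEADLINE = {
    "MCP": {"avg_s": 113.4, "median_s": 85.9, "avg_out_tokens": 5553},
    "Prompt": {"avg_s": 516.2, "median_s": 424.0, "avg_out_tokens": 25859},
}


def ratio(metric: str) -> float:
    """How many times larger the prompt-only value is than the MCP value."""
    return HEADLINE["Prompt"][metric] / HEADLINE["MCP"][metric]


for metric in ("avg_s", "median_s", "avg_out_tokens"):
    print(f"{metric}: {ratio(metric):.1f}x")
# avg_s: 4.6x
# median_s: 4.9x
# avg_out_tokens: 4.7x
```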

Methodology

The three creation paths

Every run starts in an empty directory with the same base prompt:

```text
You are running in an empty benchmark workspace:
{run_dir}

Create exactly one project directory named `{project_name}`.
Do not ask questions. Do not start a dev server. Do not write outside the
current working directory.
At the end, report the commands you ran and any errors you hit.

Benchmark target: {spec_title}
Requirements:
{spec_requirements}
```

The only difference between paths is the creation-mode instruction appended to that prompt. Prompt-only runs are explicitly forbidden from touching anything of ours:

```text
Creation mode: prompt-only.
Do not use the Better-Fullstack MCP server, Better-Fullstack CLI,
Better-Fullstack website, or files from the Better-Fullstack repository.
Create the project from scratch by writing the files and manifests needed
for a runnable starter.
```

MCP runs get the opposite:

```text
Creation mode: Better-Fullstack MCP.
Use the Better-Fullstack MCP tools, starting with bfs_get_guidance. Then use
schema/compatibility/plan as needed and call bfs_create_project to create
the project.
```

And BF-mention runs sit in between — no MCP tools, but the prompt names the CLI and a README the agent may read before composing a non-interactive create command itself.
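One way to picture the "same base prompt, different creation-mode instruction" design is as plain string assembly. The sketch below is an assumption about how such a harness could compose prompts, not the actual ScaffBench code: `build_prompt` and `MODE_INSTRUCTIONS` are hypothetical names, and the BF-mention instruction is left out because the post does not quote it.

```python
# Base prompt and mode instructions are copied from the post; everything else is illustrative.
BASE_PROMPT = """\
You are running in an empty benchmark workspace:
{run_dir}

Create exactly one project directory named `{project_name}`.
Do not ask questions. Do not start a dev server. Do not write outside the
current working directory.
At the end, report the commands you ran and any errors you hit.

Benchmark target: {spec_title}
Requirements:
{spec_requirements}
"""

MODE_INSTRUCTIONS = {
    "prompt": (
        "Creation mode: prompt-only.\n"
        "Do not use the Better-Fullstack MCP server, Better-Fullstack CLI,\n"
        "Better-Fullstack website, or files from the Better-Fullstack repository.\n"
        "Create the project from scratch by writing the files and manifests needed\n"
        "for a runnable starter.\n"
    ),
    "mcp": (
        "Creation mode: Better-Fullstack MCP.\n"
        "Use the Better-Fullstack MCP tools, starting with bfs_get_guidance. Then use\n"
        "schema/compatibility/plan as needed and call bfs_create_project to create\n"
        "the project.\n"
    ),
}


def build_prompt(mode: str, *, run_dir: str, project_name: str,
                 spec_title: str, spec_requirements: str) -> str:
    """Fill in the shared base prompt, then append the mode-specific instruction."""
    base = BASE_PROMPT.format(
        run_dir=run_dir,
        project_name=project_name,
        spec_title=spec_title,
        spec_requirements=spec_requirements,
    )
    return base + "\n" + MODE_INSTRUCTIONS[mode]
```

Keeping the base text in one template means any difference between two runs of the same spec is confined to the appended instruction, which is what makes the paths comparable.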

The four project specs

The specs are deliberately spread from "weekend project" to "the stack a real team would argue about for a week":

| Spec | What it asks for |
|---|---|
| light-ts | React + Vite, Hono on Bun, tRPC, SQLite + Drizzle, Tailwind + DaisyUI, Pino, Vitest, Biome; no auth, no payments |
| heavy-ts | Next.js and an Expo/React Native app, Hono on Bun, oRPC, PostgreSQL + Drizzle, Better Auth, Stripe, Resend, UploadThing, Effect, TanStack Store/Form/Query, Valibot, Vitest + Playwright, Vercel AI SDK, Socket.IO, Inngest, Framer Motion, OpenTelemetry, PostHog, Umami, Sanity, Upstash Redis, Algolia, S3, shadcn/ui, Turborepo, MSW, Storybook, PWA, Docker compose |
| python-ai | FastAPI, SQLModel, PostgreSQL, Pydantic, JWT auth, Celery, Strawberry GraphQL, LangChain + OpenAI SDK + LangGraph + CrewAI, Ruff |
| multi-ecosystem | TypeScript Next.js frontend, Python FastAPI backend (SQLModel/Pydantic/LangChain/Celery), Go Gin service (GORM, gRPC, Cobra, Zap), shared PostgreSQL |

heavy-ts is the stress test. It is also where every interesting failure in this benchmark happened.

The harness

Two sweeps, one harness design:

  • Claude sweep: Claude Code in headless mode (claude -p), JSON event stream captured per run. Models: claude-fable-5, claude-opus-4-8, claude-sonnet-4-6, default reasoning effort.
  • GPT sweep: OpenAI Codex CLI (codex exec), JSON event stream captured per run. Models: gpt-5.3-codex-spark (high effort), gpt-5.4 (medium), gpt-5.5 (medium).
  • Isolation: every run gets a fresh empty workspace; MCP runs get a strict MCP config with only the Better-Fullstack server attached.
  • Timeout: 900 seconds per run. A timeout counts as a failure for the agent.
  • Metrics: wall-clock time, token usage from the agent's own usage report, MCP tool calls from the event stream. Dollar cost is captured on the Claude sweep only — our Codex harness doesn't report metered cost.

One important asymmetry: the GPT sweep ran two days earlier, against the generator before we fixed the template bugs it exposed. That's why the generator-bug exclusions below hit the GPT results harder — and why the two sweeps shouldn't be compared head-to-head on pass rates.
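The isolation and timeout rules from the harness list can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the real harness: `run_agent` and `RunResult` are hypothetical names, and the command list is whatever the agent CLI needs (the post names `claude -p` and `codex exec`; any further flags are not given here).

```python
import subprocess
import tempfile
import time
from dataclasses import dataclass
from typing import List, Optional

TIMEOUT_S = 900  # per-run budget stated in the post


@dataclass
class RunResult:
    workspace: str            # kept on disk so validation can inspect it afterwards
    wall_s: float             # wall-clock duration of the run
    timed_out: bool
    exit_code: Optional[int]  # None when the run was killed at the timeout
    stdout: str


def run_agent(cmd: List[str], timeout_s: float = TIMEOUT_S) -> RunResult:
    """Run one agent command in a fresh, empty workspace with a hard timeout."""
    workspace = tempfile.mkdtemp(prefix="scaffbench-")
    start = time.monotonic()
    try:
        proc = subprocess.run(
            cmd, cwd=workspace, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so the run is over here.
        return RunResult(workspace, time.monotonic() - start, True, None, "")
    return RunResult(workspace, time.monotonic() - start, False, proc.returncode, proc.stdout)
```

A real harness would additionally parse the captured event stream for token usage and tool calls; that part depends on each CLI's output format and is omitted.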

What counts as passing

Validation runs after the agent finishes, against whatever it left on disk:

  • TypeScript: bun install, then the project's own build (or check/lint/test) script.
  • Python: compileall across the project, excluding virtualenvs and vendored packages.
  • Go: go mod download and the module must compile.

A run passes only if every applicable check exits zero. "It looks like a project" doesn't count; it has to build.
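A rough sketch of the "every applicable check exits zero" rule follows. Several details are assumptions made for illustration: ecosystems are detected by marker files in the project root only (a monorepo would need a deeper search), the TypeScript build script is assumed to be named `build`, the exclusion regex is a guess at "virtualenvs and vendored packages", and `go build ./...` stands in for "the module must compile".

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def plan_checks(project: Path) -> List[List[str]]:
    """List the validation commands that apply to a generated project."""
    checks: List[List[str]] = []
    if (project / "package.json").exists():
        checks.append(["bun", "install"])
        checks.append(["bun", "run", "build"])  # assumed script name
    if any((project / name).exists() for name in ("pyproject.toml", "requirements.txt")):
        # compileall's -x flag skips paths matching the regex; the pattern is illustrative.
        checks.append([sys.executable, "-m", "compileall", "-q",
                       "-x", r"(\.venv|venv|site-packages|node_modules)", "."])
    if (project / "go.mod").exists():
        checks.append(["go", "mod", "download"])
        checks.append(["go", "build", "./..."])
    return checks


def validate(project: Path) -> bool:
    """Pass only if at least one check applies and every check exits zero."""
    checks = plan_checks(project)
    return bool(checks) and all(
        subprocess.run(cmd, cwd=project).returncode == 0 for cmd in checks
    )
```

Requiring at least one applicable check keeps an empty or unrecognised directory from passing by default.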

What counts as a failure

A failure counts against the agent only if the agent caused it. Runs that fail because of a bug in our own template generator are excluded from the pass-rate denominator.

On the Claude sweep this mattered exactly three times: all three models, on the BF-mention path, scaffolded heavy-ts correctly via the CLI, and all three hit the same generator bug in our native (Expo) template, where app.json sets a web output mode that only works with expo-router. The agents did everything right; the template couldn't build. Those three runs are excluded. Prompt-path failures (two timeouts, one broken hand-written build) are entirely agent-authored and count in full.

On the GPT sweep — which ran against the pre-fix generator — it mattered seven times: heavy-ts through MCP and through the CLI failed for all three models on the since-fixed Storybook/Expo bug chain, plus one multi-ecosystem scaffold whose Go service shipped without a generated go.sum. All seven are excluded for the same reason; the GPT agents' own failures (three broken hand-written builds on the prompt path) count in full.
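The exclusion rule changes the denominator, not the numerator. A small sketch of that scoring, assuming each run carries a flag saying whether its failure was traced to the generator (`Run`, `generator_bug`, and `pass_rate` are hypothetical names):

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Run:
    passed: bool
    generator_bug: bool = False  # failure traced to the scaffolder's templates, not the agent


def pass_rate(runs: Iterable[Run]) -> Tuple[int, int]:
    """Return (passed, scored), leaving generator-bug runs out of the denominator."""
    scored = [r for r in runs if not r.generator_bug]
    return sum(r.passed for r in scored), len(scored)


# Claude sweep, BF-mention path: 9 passes plus 3 excluded heavy-ts runs.
bf_mention = [Run(True)] * 9 + [Run(False, generator_bug=True)] * 3
# Claude sweep, prompt path: 9 passes plus 3 agent-authored failures.
prompt = [Run(True)] * 9 + [Run(False)] * 3

print(pass_rate(bf_mention))  # (9, 9)
print(pass_rate(prompt))      # (9, 12)
```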

Results

Per model and path

Times and tokens are averages across the four specs; cost is the metered API total for those four runs.

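| Model | Path | Avg time | Avg output tokens | Cost (4 specs) | Builds passing |
|---|---|---|---|---|---|
| Fable 5 | MCP | 172.6s | 7,590 | $9.34 | 4/4 |
| Fable 5 | BF mention | 405.7s | 17,748 | $15.66 | 3/3* |
| Fable 5 | Prompt | 572.8s | 24,905 | $11.26† | 3/4 |
| Opus 4.8 | MCP | 97.1s | 5,206 | $3.43 | 4/4 |
| Opus 4.8 | BF mention | 154.7s | 10,596 | $4.12 | 3/3* |
| Opus 4.8 | Prompt | 510.8s | 21,485 | $6.89† | 3/4 |
| Sonnet 4.6 | MCP | 70.3s | 3,863 | $1.47 | 4/4 |
| Sonnet 4.6 | BF mention | 98.3s | 4,834 | $1.31 | 3/3* |
| Sonnet 4.6 | Prompt | 464.9s | 31,188 | $7.11 | 3/4 |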

* heavy-ts excluded (generator bug on our side). † Timed-out runs report no cost, so these totals under-count the true spend.
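Each row of the table is a roll-up of four appendix rows. The sketch below reproduces two cells from the per-run figures; the `summarize` helper is illustrative. It also assumes a timed-out run contributes zero tokens to the average, which is an inference from the 24,905 figure rather than something the post states. Appendix times are rounded to the second, so averages can differ from the table by a tenth.

```python
from statistics import mean

# (time_s, output_tokens, cost_usd) per spec, copied from the appendix; None = not reported.
SONNET_MCP = [(90, 5094, 0.41), (47, 2530, 0.33), (101, 5904, 0.42), (43, 1923, 0.31)]
FABLE_PROMPT = [(900, None, None), (413, 32080, 3.50), (570, 37436, 4.21), (408, 30103, 3.55)]


def summarize(rows):
    """Average time and tokens across specs; total only the costs that were reported."""
    costs = [cost for _, _, cost in rows if cost is not None]
    return {
        "avg_time_s": mean(t for t, _, _ in rows),
        "avg_out_tokens": mean((tok or 0) for _, tok, _ in rows),  # assumption: missing = 0
        "cost_usd": round(sum(costs), 2),
        "cost_under_reported": len(costs) < len(rows),
    }


print(summarize(SONNET_MCP))
print(summarize(FABLE_PROMPT))
```

The `cost_under_reported` flag plays the role of the † marker: the total is a floor whenever a run in the cell reported no cost.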

Cost

The Claude sweep cost $60.60 in metered API usage (less than the true number — the two timed-out runs report $0; the Codex harness doesn't report cost at all). Reading the spread is more interesting than the total:

  • Cheapest passing run: $0.12 — Sonnet 4.6 scaffolding light-ts through the CLI in 21 seconds.
  • Most expensive passing runs: $3.50–$4.21 — Fable 5 hand-writing projects on the prompt path.
  • For Sonnet, the prompt path cost ~5× its MCP path ($7.11 vs $1.47) and was ~6.6× slower, for a lower pass rate.

The pattern holds across all three models: the more of the project the agent has to author itself, the more you pay for a less reliable result.

GPT models on Codex CLI

The GPT sweep ran the identical specs and prompts through OpenAI's Codex CLI two days before the Claude sweep: gpt-5.3-codex-spark at high reasoning effort, gpt-5.4 and gpt-5.5 at medium. Generator-bug failures are excluded exactly as above (seven runs here — this sweep predates the template fixes).

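| Model | Path | Avg time | Avg output tokens | Builds passing |
|---|---|---|---|---|
| GPT-5.3 Codex Spark | MCP | 32.4s | 5,798 | 3/3* |
| GPT-5.3 Codex Spark | BF mention | 65.6s | 9,894 | 3/3* |
| GPT-5.3 Codex Spark | Prompt | 44.8s | 31,418 | 2/4 |
| GPT-5.4 | MCP | 92.0s | 5,172 | 3/3* |
| GPT-5.4 | BF mention | 156.0s | 7,084 | 3/3* |
| GPT-5.4 | Prompt | 203.1s | 13,328 | 3/4 |
| GPT-5.5 | MCP | 76.5s | 3,838 | 2/2* |
| GPT-5.5 | BF mention | 74.1s | 4,548 | 3/3* |
| GPT-5.5 | Prompt | 264.2s | 15,650 | 4/4 |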

* heavy-ts excluded on MCP and BF mention for all three models, plus one GPT-5.5 multi-ecosystem MCP run — all on since-fixed generator bugs (this sweep ran before the fixes shipped).

What the GPT sweep adds to the picture:

  • The structure is vendor-independent. For every GPT model, prompt-only is the most expensive path in output tokens, and MCP is the fastest path for two of the three (for GPT-5.5, BF mention edged it out, 74.1s vs 76.5s). That is the same shape as the Claude results, on a different vendor's models and a different agent CLI.
  • Spark is built for speed, and it shows. GPT-5.3 Codex Spark averaged 32 seconds per MCP scaffold, the fastest path average in either full sweep. On the prompt path that speed came at the price of reliability: it hand-wrote projects in under a minute on average, and only half of them built.
  • GPT-5.5 is the prompt-path outlier. It was the only model across both sweeps to pass every prompt-only spec, including heavy-ts — at 542 seconds and 28k output tokens for that one run, versus 108 seconds through MCP.
  • GPT models lean harder on the tools. Codex runs made 10–34 MCP calls per scaffold versus Claude's 3–10, re-checking compatibility and re-planning more often — with the same end result.

A caveat worth repeating: the two sweeps ran on different agent harnesses, different days, and different generator versions. Comparing paths within a sweep is the experiment; comparing vendors across sweeps is not.

The light sweep: Gemini, Kilo, and opencode

The full suite costs real money and real rate-limit budget, so for the next batch of models we ran a deliberately light version: one spec (light-ts) × three creation paths × one run per cell. Nine models across three more agent CLIs, all on June 12:

  • Gemini CLI — Gemini 3.1 Pro (Google's latest, and the CLI's default).
  • opencode (Go subscription) — Kimi K2.6, GLM-5.1, MiniMax M3, Qwen3.7 Max, DeepSeek-V4 Pro: the current open-weights flagship tier.
  • Kilo CLI (free model tier) — Step-3.7 Flash, Laguna m.1, Nex N2-Pro.

Same prompts, same 900-second budget, same validation as the main sweeps. Scoring matches the headline criterion — does the project install and build — so lint-only failures don't count against a run (the strict logs are in the run artifacts). Gemini got a second, accidental repeat of all three paths, which we kept as a free variance sample: 30 runs total. (A tenth model, Nemotron-3 Ultra on Kilo's free tier, is excluded — the endpoint was too slow to drive an agent at all; see the appendix note.)

| Model | MCP | BF mention | Prompt-only |
|---|---|---|---|
| Gemini 3.1 Pro | ✅ 60s · 1.6k | ✅ 27s · 0.9k | ✅ 124s · 13.2k |
| Kimi K2.6 | ✅ 311s · 2.8k | ✅ 24s · 1.6k | ✅ 372s · 13.5k |
| GLM-5.1 | ✅ 46s · 1.2k | ✅ 67s · 2.2k | ✅ 694s · 29.5k |
| MiniMax M3 | ✅ 69s · 2.5k | ✅ 32s · 0.8k | ❌ build break |
| Qwen3.7 Max | ✅ 80s · 2.7k | ✅ 35s · 0.8k | ✅ 379s · 13.3k |
| DeepSeek-V4 Pro | ✅ 43s · 1.4k | ✅ 40s · 1.4k | ✅ 357s · 11.7k |
| Step-3.7 Flash | ✅ 45s · 2.0k | ✅ 16s · 0.6k | ❌ broken install |
| Laguna m.1 | ✅ 225s · 1.6k | ✅ 592s · 3.7k | ✅ 900s (timeout) · 9.4k |
| Nex N2-Pro | ✅ 227s · 2.4k | ✅ 146s · 1.3k | ❌ build break |

What the light sweep adds:

  • The assisted paths went 18-for-18. Every model passed both MCP and BF mention. That includes Step-3.7 Flash — a small, fast, free model — scaffolding a correct fullstack monorepo in 16 seconds and 640 output tokens via the BF mention path. The structure we saw on Claude and GPT isn't a frontier-model phenomenon; the tooling levels the playing field all the way down.
  • Prompt-only is where the field splits. Three of nine models shipped broken projects, each in an instructive way: Step-3.7 Flash pinned pino-http to a version that doesn't exist (install failed); MiniMax M3 named a tRPC procedure useContext, colliding with tRPC's built-in client method (type check failed); Nex N2-Pro wrote an import that doesn't resolve (Rollup failed). None of these failure modes exist on the assisted paths, because the agent never hand-writes a manifest.
  • The economics hold on a third, fourth, and fifth harness. Every assisted run finished under 4k output tokens, and all but one under 3k. Every prompt-only run cost roughly 8–36k, with MiniMax M3's 36k the most expensive run of the light sweep, for a project that didn't build.
  • Gemini 3.1 Pro is excellent at this task. It posted one of the fastest BF mention cells of the light sweep (26.7 seconds, 863 tokens), a textbook four-call MCP flow, and a prompt-only pass. Its repeat sample is also a useful honesty check: the first prompt run failed Biome lint and the second passed cleanly, so single-run cells are directional, not definitive.

Light-sweep caveats: one run per cell, and 25 of the runs executed concurrently, so wall-clock times carry contention noise (Gemini's BF mention run measured 27s solo and 50s in the parallel batch — same outcome, different traffic). Treat path structure and pass/fail as the signal, exact seconds as approximate.

Qualitative analysis

MCP keeps agents on rails

MCP runs converged on the same tool sequence: bfs_get_guidance → compatibility check → plan → bfs_create_project. Claude runs made between 3 and 10 calls (the multi-ecosystem spec needed the most iterations to settle a valid stack). Output stays small because the agent's job collapses to choosing a configuration; the generator does the file-writing, deterministically. That's also why pass rates don't degrade as the spec grows: on the Claude sweep, heavy-ts through MCP passed for every model.
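A check like the following could confirm that a run stayed on that sequence. It is a sketch under assumptions: the input is a plain list of tool names already extracted from the event stream, and only `bfs_get_guidance` and `bfs_create_project` are real names from the post. The middle entries in the example are placeholders.

```python
from typing import Dict, List

FIRST_TOOL = "bfs_get_guidance"   # named in the MCP prompt
LAST_TOOL = "bfs_create_project"  # named in the MCP prompt


def check_tool_flow(calls: List[str]) -> Dict[str, object]:
    """Summarise whether a run's MCP calls start and end where the prompt asks."""
    return {
        "n_calls": len(calls),
        "starts_with_guidance": bool(calls) and calls[0] == FIRST_TOOL,
        "ends_with_create": bool(calls) and calls[-1] == LAST_TOOL,
    }


# Placeholder names for the compatibility and plan steps; the post does not give them.
example = ["bfs_get_guidance", "compatibility_step", "plan_step", "bfs_create_project"]
print(check_tool_flow(example))
```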

Prompt-only agents drown in the heavy spec

On the prompt path, heavy-ts defeated almost everyone. Fable 5 and Opus 4.8 were still writing files at the 15-minute timeout. Sonnet 4.6 finished a 90-file monorepo in 767 seconds — and it didn't build. GPT-5.3 Codex Spark sprinted to a hand-written project in 68 seconds — it didn't build either. The single exception across both sweeps was GPT-5.5, which ground through heavy-ts prompt-only in 542 seconds and passed. The lighter specs mostly passed prompt-only, but at minutes of wall-clock and 10k–37k output tokens each. Hand-writing a starter is something frontier models can do; it's just the slowest, most expensive, least reliable way to get one.

The benchmark audited our own templates

The most useful failures were ours. Validating every generated project surfaced six layered template bugs in our generator, most hiding behind a single "Storybook build fails" symptom on heavy-ts. The GPT sweep hit them first — eight of its nine heavy-ts builds failed, six of them on this generator chain — and the Claude sweep two days later confirmed which fixes held:

  1. Storybook templates tested the frontend with an equality check, but frontend is an array.
  2. A database package path mismatch left the DB package without its dependencies in graph-part mode.
  3. expo-network was missing from a native template variant.
  4. Storybook 8 framework packages don't re-export Meta/StoryObj types — imports had to move to the renderer packages.
  5. @better-auth/expo needed four Expo peer dependencies installed explicitly.
  6. Native app.json sets a static web output mode that only works with expo-router — this one is still open, and is the bug behind the three excluded BF-mention runs.

Five of the six were fixed and shipped in create-better-fullstack 2.0.2 before this post went out. Running your own product through an agent benchmark turns out to be a brutally effective QA pass.
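The first bug in the list is a classic. As a Python analogy only (the post does not say which template language the generator uses, and the values here are made up), comparing a list to a single value is always false, while a membership test does what was intended:

```python
# Hypothetical config: "frontend" holds a list of selected frontends.
frontend = ["next", "native"]

# Buggy: compares the whole list to one string, so it is never true.
uses_next_buggy = frontend == "next"

# Fixed: test whether the value is one of the list's entries.
uses_next_fixed = "next" in frontend

print(uses_next_buggy, uses_next_fixed)  # False True
```

A condition like the buggy one fails silently: the Storybook branch of a template would simply never render, which fits the post's description of bugs hiding behind a single symptom.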

Model character

Sonnet 4.6 was the fastest on every path and the cheapest on both assisted paths, and with tooling it gave up nothing in reliability. (Its prompt-path total of $7.11 sits above Opus 4.8's $6.89, but the Opus figure omits a timed-out run that reported no cost.) Fable 5 was the most deliberate: highest token counts on the assisted paths, longest runs, and on the prompt path that thoroughness still wasn't enough to beat the timeout on heavy-ts. Opus 4.8 sat reliably in between. The speed ranking never changed across paths. On MCP, the spread between the fastest and slowest model averages was 102 seconds, with every run passing; on prompt-only the averages sat 108 seconds apart, a figure that understates the gap because two runs were cut off at the 900-second ceiling.

Limitations

  • One run per cell. 102 runs is enough to see the structure, not to put confidence intervals on it. Scaffold times also vary with API load — and the light sweep's parallel batch adds contention noise on top (Gemini's own repeat showed both a 2× time spread and a lint-fail/pass flip between identical runs).
  • The sweeps aren't head-to-head. The GPT runs used a different agent CLI (Codex), different reasoning-effort settings, an earlier generator version, and report no dollar cost; the light sweep covers one spec on three further CLIs. Within-sweep path comparisons are sound; cross-vendor model rankings are not what this benchmark measures.
  • We benchmarked our own tool. The harness, prompts, and validation are public in spirit — prompts and policy are quoted above — and the full per-run table is in the appendix. Treat the comparison between paths (tooling vs no tooling) as the finding, not a claim about other scaffolders.
  • Builds, not features. Validation proves the project installs and builds — not that the auth flow works or the Stripe webhook is wired correctly. A passing prompt-only project may still be missing more of the spec than a generated one.
  • Timeout cost under-reporting. Runs killed at 900s report $0 cost, which flatters the prompt path's totals.

Appendix: all 102 runs

Claude sweep (Claude Code, June 12)

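| Model | Path | Spec | Time | Output tokens | Cost | Result |
|---|---|---|---|---|---|---|
| Fable 5 | MCP | heavy-ts | 290s | 11,891 | $2.97 | pass |
| Fable 5 | MCP | light-ts | 82s | 4,166 | $1.62 | pass |
| Fable 5 | MCP | multi-ecosystem | 240s | 10,895 | $3.22 | pass |
| Fable 5 | MCP | python-ai | 78s | 3,406 | $1.54 | pass |
| Fable 5 | BF mention | heavy-ts | 613s | 30,259 | $6.98 | build failed* |
| Fable 5 | BF mention | light-ts | 314s | 16,132 | $3.74 | pass |
| Fable 5 | BF mention | multi-ecosystem | 470s | 15,196 | $2.98 | pass |
| Fable 5 | BF mention | python-ai | 226s | 9,404 | $1.95 | pass |
| Fable 5 | Prompt | heavy-ts | 900s | | | timed out |
| Fable 5 | Prompt | light-ts | 413s | 32,080 | $3.50 | pass |
| Fable 5 | Prompt | multi-ecosystem | 570s | 37,436 | $4.21 | pass |
| Fable 5 | Prompt | python-ai | 408s | 30,103 | $3.55 | pass |
| Opus 4.8 | MCP | heavy-ts | 178s | 5,805 | $0.88 | pass |
| Opus 4.8 | MCP | light-ts | 46s | 2,894 | $0.72 | pass |
| Opus 4.8 | MCP | multi-ecosystem | 118s | 8,903 | $1.10 | pass |
| Opus 4.8 | MCP | python-ai | 47s | 3,223 | $0.74 | pass |
| Opus 4.8 | BF mention | heavy-ts | 345s | 24,383 | $2.29 | build failed* |
| Opus 4.8 | BF mention | light-ts | 39s | 2,754 | $0.32 | pass |
| Opus 4.8 | BF mention | multi-ecosystem | 116s | 8,038 | $0.70 | pass |
| Opus 4.8 | BF mention | python-ai | 118s | 7,211 | $0.81 | pass |
| Opus 4.8 | Prompt | heavy-ts | 900s | | | timed out |
| Opus 4.8 | Prompt | light-ts | 435s | 33,390 | $2.50 | pass |
| Opus 4.8 | Prompt | multi-ecosystem | 395s | 28,598 | $2.43 | pass |
| Opus 4.8 | Prompt | python-ai | 313s | 23,952 | $1.96 | pass |
| Sonnet 4.6 | MCP | heavy-ts | 90s | 5,094 | $0.41 | pass |
| Sonnet 4.6 | MCP | light-ts | 47s | 2,530 | $0.33 | pass |
| Sonnet 4.6 | MCP | multi-ecosystem | 101s | 5,904 | $0.42 | pass |
| Sonnet 4.6 | MCP | python-ai | 43s | 1,923 | $0.31 | pass |
| Sonnet 4.6 | BF mention | heavy-ts | 98s | 5,527 | $0.28 | build failed* |
| Sonnet 4.6 | BF mention | light-ts | 21s | 795 | $0.12 | pass |
| Sonnet 4.6 | BF mention | multi-ecosystem | 237s | 11,333 | $0.73 | pass |
| Sonnet 4.6 | BF mention | python-ai | 38s | 1,682 | $0.18 | pass |
| Sonnet 4.6 | Prompt | heavy-ts | 767s | 52,926 | $3.18 | build failed |
| Sonnet 4.6 | Prompt | light-ts | 520s | 32,767 | $1.88 | pass |
| Sonnet 4.6 | Prompt | multi-ecosystem | 347s | 22,642 | $1.20 | pass |
| Sonnet 4.6 | Prompt | python-ai | 226s | 16,418 | $0.85 | pass |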

* Failed on a since-identified template generator bug on our side (excluded from agent pass rates). The Sonnet 4.6 prompt-path heavy-ts failure is agent-authored and counts.

GPT sweep (Codex CLI, June 10 — pre-fix generator)

| Model | Path | Spec | Time | Output tokens | Result |
|---|---|---|---|---|---|
| GPT-5.3 Codex Spark | MCP | heavy-ts | 26s | 5,300 | build failed* |
| GPT-5.3 Codex Spark | MCP | light-ts | 18s | 2,870 | pass |
| GPT-5.3 Codex Spark | MCP | multi-ecosystem | 66s | 10,988 | pass |
| GPT-5.3 Codex Spark | MCP | python-ai | 19s | 4,035 | pass |
| GPT-5.3 Codex Spark | BF mention | heavy-ts | 52s | 10,742 | build failed* |
| GPT-5.3 Codex Spark | BF mention | light-ts | 11s | 1,221 | pass |
| GPT-5.3 Codex Spark | BF mention | multi-ecosystem | 147s | 18,676 | pass |
| GPT-5.3 Codex Spark | BF mention | python-ai | 53s | 8,936 | pass |
| GPT-5.3 Codex Spark | Prompt | heavy-ts | 68s | 54,325 | build failed |
| GPT-5.3 Codex Spark | Prompt | light-ts | 51s | 27,446 | build failed |
| GPT-5.3 Codex Spark | Prompt | multi-ecosystem | 38s | 21,353 | pass |
| GPT-5.3 Codex Spark | Prompt | python-ai | 23s | 22,549 | pass |
| GPT-5.4 | MCP | heavy-ts | 51s | 3,016 | build failed* |
| GPT-5.4 | MCP | light-ts | 44s | 2,153 | pass |
| GPT-5.4 | MCP | multi-ecosystem | 217s | 13,235 | pass |
| GPT-5.4 | MCP | python-ai | 55s | 2,284 | pass |
| GPT-5.4 | BF mention | heavy-ts | 322s | 13,834 | build failed* |
| GPT-5.4 | BF mention | light-ts | 30s | 753 | pass |
| GPT-5.4 | BF mention | multi-ecosystem | 144s | 7,870 | pass |
| GPT-5.4 | BF mention | python-ai | 128s | 5,881 | pass |
| GPT-5.4 | Prompt | heavy-ts | 236s | 15,502 | build failed |
| GPT-5.4 | Prompt | light-ts | 251s | 15,271 | pass |
| GPT-5.4 | Prompt | multi-ecosystem | 170s | 11,795 | pass |
| GPT-5.4 | Prompt | python-ai | 155s | 10,745 | pass |
| GPT-5.5 | MCP | heavy-ts | 108s | 6,704 | build failed* |
| GPT-5.5 | MCP | light-ts | 58s | 2,544 | pass |
| GPT-5.5 | MCP | multi-ecosystem | 97s | 4,101 | build failed* |
| GPT-5.5 | MCP | python-ai | 43s | 2,003 | pass |
| GPT-5.5 | BF mention | heavy-ts | 120s | 6,851 | build failed* |
| GPT-5.5 | BF mention | light-ts | 26s | 1,513 | pass |
| GPT-5.5 | BF mention | multi-ecosystem | 84s | 5,480 | pass |
| GPT-5.5 | BF mention | python-ai | 66s | 4,347 | pass |
| GPT-5.5 | Prompt | heavy-ts | 542s | 28,342 | pass |
| GPT-5.5 | Prompt | light-ts | 163s | 10,601 | pass |
| GPT-5.5 | Prompt | multi-ecosystem | 242s | 15,787 | pass |
| GPT-5.5 | Prompt | python-ai | 110s | 7,869 | pass |

* Failed on a since-fixed template generator bug on our side (excluded from agent pass rates) — this sweep ran before the fixes shipped. GPT prompt-path failures are agent-authored and count.

Light sweep (Gemini CLI / Kilo / opencode, June 12 — light-ts only)

| Model | Agent CLI | Path | Time | Output tokens | Result |
|---|---|---|---|---|---|
| Gemini 3.1 Pro | Gemini CLI | MCP | 60.2s | 1,584 | pass |
| Gemini 3.1 Pro | Gemini CLI | BF mention | 26.7s | 863 | pass |
| Gemini 3.1 Pro | Gemini CLI | Prompt | 123.6s | 13,188 | pass** |
| Gemini 3.1 Pro (repeat) | Gemini CLI | MCP | 69.6s | 1,280 | pass |
| Gemini 3.1 Pro (repeat) | Gemini CLI | BF mention | 50.5s | 873 | pass |
| Gemini 3.1 Pro (repeat) | Gemini CLI | Prompt | 254.8s | 15,752 | pass |
| Laguna m.1 | Kilo | MCP | 224.8s | 1,560 | pass |
| Laguna m.1 | Kilo | BF mention | 592.5s | 3,653 | pass |
| Laguna m.1 | Kilo | Prompt | 900.0s† | 9,439 | pass |
| Nex N2-Pro | Kilo | MCP | 226.6s | 2,362 | pass |
| Nex N2-Pro | Kilo | BF mention | 146.2s | 1,294 | pass |
| Nex N2-Pro | Kilo | Prompt | 426.4s | 7,806 | fail (build) |
| Step-3.7 Flash | Kilo | MCP | 45.5s | 1,985 | pass |
| Step-3.7 Flash | Kilo | BF mention | 15.6s | 640 | pass |
| Step-3.7 Flash | Kilo | Prompt | 166.0s | 12,738 | fail (install) |
| DeepSeek-V4 Pro | opencode | MCP | 42.9s | 1,369 | pass |
| DeepSeek-V4 Pro | opencode | BF mention | 40.0s | 1,391 | pass |
| DeepSeek-V4 Pro | opencode | Prompt | 357.4s | 11,740 | pass |
| GLM-5.1 | opencode | MCP | 45.9s | 1,172 | pass |
| GLM-5.1 | opencode | BF mention | 67.4s | 2,197 | pass |
| GLM-5.1 | opencode | Prompt | 693.9s | 29,527 | pass |
| Kimi K2.6 | opencode | MCP | 311.2s | 2,816 | pass |
| Kimi K2.6 | opencode | BF mention | 24.3s | 1,618 | pass |
| Kimi K2.6 | opencode | Prompt | 371.9s | 13,527 | pass |
| MiniMax M3 | opencode | MCP | 68.7s | 2,473 | pass |
| MiniMax M3 | opencode | BF mention | 31.9s | 846 | pass |
| MiniMax M3 | opencode | Prompt | 712.9s | 36,034 | fail (build) |
| Qwen3.7 Max | opencode | MCP | 80.2s | 2,669 | pass |
| Qwen3.7 Max | opencode | BF mention | 35.2s | 843 | pass |
| Qwen3.7 Max | opencode | Prompt | 378.9s | 13,313 | pass |

Notes:

  • † Laguna m.1 prompt-only hit the 900-second budget, but the project written before the cutoff installs and builds.
  • ** Gemini 3.1 Pro's first prompt-only run passed install + build; it failed the project's own Biome lint script, which the headline criterion doesn't count.

We also attempted Nemotron-3 Ultra 550B on Kilo's free tier, but its runs are excluded from the results: the endpoint was effectively unusable for agentic work in our harness (minutes per turn, zero or one tool execution per 15-minute budget on every path). That's an infrastructure failure — we couldn't meaningfully reach the model — not a benchmark result about the model.


Want to see the fast path yourself? Point your agent at the Better-Fullstack MCP server and ask it to scaffold something heavy.