Bench essay · 2026-05-26

How I learned to stop worrying
and route everything through Mistral.

$20 unlocked Tier 2. $1.84 paid for a month of benchmarks. Five surprising findings from the Set-Bench matrix.

I burned $20 on Mistral to unlock Tier 2 rate limits.

That's not the story. The story is that running every other benchmark — the full Set-Bench matrix, 8 axes × multiple model snapshots × n=10, plus a month of ad-hoc cell-runs across the entire Mistral roster — cost $1.84 cumulative. The $20 was a one-time threshold-crossing fee to lift the 1 RPS Tier-1 cap. The actual benching was rounding error.

I'm publishing the numbers because they were surprising to me, and because the "agents are expensive" narrative deserves a pushback that doesn't depend on you trusting me. The raw JSON for every cell is in the lynox repo — pull it, re-aggregate it, run the matrix yourself. The harness is npx tsx scripts/set-bench/run.ts.

What follows are five findings I didn't expect, each with one paragraph of context. If you came for headline tables, sorry: this is the essay version. Tables are in the JSON.


Finding 1 — A 3B model orchestrates as well as Haiku, for an order of magnitude less money.

ministral-3b-2512 (pinned) on the sub-agent orchestration axis (parent fans out, 3 children search, parent merges, 100% pass): $0.00007 warm per task. Haiku 4.5 on the same scenario from the May-16 matrix: ~$0.001. Same scenarios, same pass criteria, n=10. Roughly fourteen times cheaper.

I expected the small Mistral models to be useful as routers — cheap classifier on the front, big model behind. I did not expect them to be Haiku-level at running an agent loop: deciding which sub-agent to spawn, merging their outputs, deciding when to stop. They are. Every haiku-tier axis I tested has a 100%-pass ministral-3b or ministral-8b cell that's at least an order of magnitude cheaper than the Anthropic baseline. This breaks the "small models are for trivial tasks" intuition I came in with.

Finding 2 — mistral-large-2512 matches Sonnet on tool-chains at ~80% lower cost. But only if you pin the snapshot.

mistral-large-2512 (pinned) on tool-chain-with-backtrack: 100% pass, $0.00035 warm. Sonnet 4.6 on the same: ~$0.002, 100% pass. Same task, ~5× cheaper.

Here's the trap I walked into: mistral-large-latest on the same axis scored 80% pass at a similar cost. The pinned snapshot is better than latest. I ran it three times to make sure. This pattern repeats across six of the eight axes — pinned beats latest by 10 to 30 percentage points. Mistral ships fixes via -latest first and stabilizes them into the next pin, which means the floating tag is, counterintuitively, the less stable one for production. Pin your model versions. (I now do.)

Finding 3 — Every Anthropic tier has a 100%-pass Mistral replacement.

I went into this expecting Mistral to lose on at least one axis — long-context reasoning, multi-step planning, code review, something. It didn't. Across eight production-shaped axes (multi-turn loops, sub-agent orchestration, memory-grounded reasoning, workflow composition, long-context with tools, tool-chain with back-track, cron cold-start, real-world strategy), there is a Mistral cell on the Pareto frontier at 100% pass on every single one.

That doesn't mean Mistral is better than Anthropic. It means the "Anthropic-quality, Mistral-cost" trade is structural now, not aspirational. For a self-hosted EU-resident agent runtime (which is what lynox ships), that's the whole game.

Finding 4 — magistral is opus-tier on quality and disqualifying on latency.

magistral-medium-2509 ties Opus 4.7 at the multi-step reasoning tier (100% pass, ~5.9× cheaper). Beautiful number. Then I looked at the wall-clock: p50 16–30s across most tasks, but on workflow composition specifically p50 climbs to ~48s and p95 to ~66s.

mistral-large-2512 on the same axes: p50 7–12s. Magistral pays a ~6× latency tax for the reasoning chain — those are reasoning tokens the model is forced to emit, and you wait for them whether they help or not. For batch and analysis paths where the user accepts higher latency, magistral is the right pick. For a chat-UX agent firing N sequential tool calls, it's a non-starter. Cost-per-task and latency-per-task are not the same axis.

Finding 5 — Mistral shipped native prompt caching mid-bench. Cache-hit jumped from 0% to 50–94% the same week we wired it.

When I started this matrix in early May, Mistral cache-hit rates in our adapter were stuck at 0%. Anthropic was reporting 79–96% on the same prompt shape. I had a Slack-shaped dread about that being "the structural Mistral disadvantage" the cost-numbers would never overcome.

Mid-May, Mistral shipped prompt_cache_key on the chat-completion endpoint — 90% discount on cached input tokens. We wired it up PR #565 the same week, re-ran the matrix on 2026-05-24, and cache-hit jumped to 50–94% across axes (96% peak on workflow composition with ministral-3b). The cost numbers above are after that fix. Before it, Finding 2 would have read "Sonnet costs 2× less than Mistral after caching." Provider features move fast enough that any bench more than two months old is fiction.


What this means if you're running an agent in production

The cost narrative has flipped. For most agent-shaped workloads — sub-agents, tool-chains, workflows, cron tasks — Mistral with pinned snapshots and prompt_cache_key set is structurally cheaper than Anthropic at the same pass rate. Anthropic still wins on prompt-caching maturity (cache_control markers, deeper docs) and on the latency tail on reasoning-heavy paths. Both providers have a place. The interesting layer is the routing decision, not the provider choice.

For lynox specifically: this is why we ship Mistral as a first-class native provider, not a fallback. EU residency + cost floor + Anthropic-level pass rates is a real combination, not a marketing line. The matrix is in the open-source repo. If a Mistral snapshot regresses, you'll see it in the next run before we do.

Caveats I owe you

  • Anthropic baselines in this essay are from the May-16 matrix run on main@390a625. The May-24 re-run (which I'm publishing the JSON for) is Mistral-only, because that was the surface I was hardening for HN. Anthropic numbers are stable enough cross-run that I'm comfortable citing them — but you'll see "Mistral-only" if you open the latest JSON, that's why.
  • magistral-small-2509 (pinned) is 0% pass on KG-extraction and memory-extraction. -latest is 100% on both. I have no satisfying explanation. Pin-with-care.
  • codestral is 0% pass on code-review. The "code" in the name is misleading for bug-finding workloads.
  • Tokenizer ratio is ~6 chars/token for Mistral on English technical prose, not the 4 you've internalized from GPT/Claude. If you're sizing prompts with OpenAI math, you're under-using Mistral's context window by a third.

Raw data

If you find a number I got wrong, file an issue. If you re-run with different scenarios, send me the JSON.

If this is your kind of nerdy, lynox is here — the agent runtime these benchmarks describe.