Set-Bench · agent benchmark

Cost, pass and latency, measured in the open.

Anthropic and Mistral across eight agent axes, n=10 per cell. The harness and every result JSON are public. Run it yourself, or pick it apart.

Read this first

The caveats come before the numbers.

We host both Anthropic and Mistral, and this benchmark favours the cheaper of the two, which we also sell, so treat it with appropriate suspicion. The axes run against deterministic mock tools, so this measures loop mechanics and cost, not real-world tool quality. Pass is regex-pinned: a floor, not a quality grade. There is no quality column, because a cross-vendor quality score can't be made bias-free with available judges (the findings explain why). And n=10 means point estimates with wide intervals: “100%” is 10/10, not a law.
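To put a number on "wide intervals", here is a minimal sketch that assumes a Wilson score interval is a reasonable way to size the uncertainty on a pass rate. The function name and the choice of z = 1.96 (roughly 95%) are illustrative assumptions and are not taken from the harness.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate.

    Illustrative helper, not part of the Set-Bench harness.
    z = 1.96 corresponds to roughly 95% coverage.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= passes <= n:
        raise ValueError("passes must be between 0 and n")
    p = passes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the edges.
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    for passes in (10, 9, 8, 7):
        low, high = wilson_interval(passes, 10)
        print(f"{passes}/10 -> {low:.2f} to {high:.2f}")
```

Under these assumptions a 10/10 cell is compatible with a true pass rate as low as roughly 72%, and an 8/10 cell spans roughly 49% to 94%, so the intervals for 10/10 and 8/10 overlap.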

The eight axes: multi-turn loops · sub-agent orchestration · memory-grounded reasoning · workflow composition · long-context tool use · back-tracking tool-chains · cron cold-starts · real-world strategy.

Read the findings → · Harness source · Raw JSON

Matrix run 2026-05-30 · 8 axes × 9 models × n=10 (point estimates). Warm = with prompt-cache discount, cold = without. Deterministic mock tools — measures loop mechanics + cost, not real-world tool quality.
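The p50 and p95 columns are also built from ten runs per cell, which matters when reading the tail. The sketch below assumes a nearest-rank percentile; the harness may interpolate instead, and the latency samples are made up for illustration rather than taken from the result JSON.

```python
import math


def nearest_rank(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it.

    Illustrative helper; the real harness may use a different method.
    """
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical latencies in seconds for one cell (n=10), with one slow run.
latencies_s = [1.2, 1.3, 1.3, 1.4, 1.4, 1.5, 1.5, 1.6, 1.7, 16.4]

p50 = nearest_rank(latencies_s, 50)
p95 = nearest_rank(latencies_s, 95)

if __name__ == "__main__":
    print(f"p50={p50}s p95={p95}s")
```

With ten samples and this method, p95 is simply the slowest run, so a single outlier sets the whole tail figure. Read a large gap between p50 and p95 as "at least one slow run in ten", not as a stable tail estimate.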

Multi-turn loop completion

Model	Tag	Pass	Cost (warm)	Cost (cold)	Cache hit	p50	p95
`mistral-ministral-3b-2512`	pinned	100%	$0.00009	$0.00044	94%	1.6s	2.3s
`mistral-ministral-8b-2512`	pinned	100%	$0.00013	$0.00066	94%	3.0s	3.7s
`mistral-ministral-14b-2512`	pinned	100%	$0.00022	$0.00087	88%	5.3s	5.7s
`mistral-large-2512`	pinned	100%	$0.00096	$0.00239	77%	12.0s	14.2s
`anthropic-haiku-4-5`	pinned	100%	$0.00793	$0.00793	0%	3.4s	4.5s
`mistral-medium-2604`	pinned	100%	$0.01281	$0.01686	30%	3.0s	3.3s
`anthropic-sonnet-4-6`	pinned	100%	$0.01370	$0.02468	82%	8.3s	12.1s
`anthropic-opus-4-7`	pinned	100%	$0.02486	$0.05377	90%	7.8s	12.0s
`mistral-large-latest`	latest	80%	$0.00107	$0.00191	57%	7.7s	18.6s

Sub-agent orchestration

Model	Tag	Pass	Cost (warm)	Cost (cold)	Cache hit	p50	p95
`mistral-ministral-3b-2512`	pinned	100%	$0.00007	$0.00029	91%	1.5s	1.7s
`mistral-ministral-8b-2512`	pinned	100%	$0.00011	$0.00043	89%	2.7s	3.8s
`mistral-ministral-14b-2512`	pinned	100%	$0.00016	$0.00057	87%	3.4s	4.7s
`mistral-large-2512`	pinned	100%	$0.00101	$0.00172	59%	7.7s	10.9s
`mistral-medium-2604`	pinned	100%	$0.00471	$0.00582	31%	1.6s	2.3s
`anthropic-haiku-4-5`	pinned	100%	$0.00588	$0.00588	0%	3.5s	6.2s
`anthropic-sonnet-4-6`	pinned	100%	$0.01014	$0.01824	83%	8.0s	10.8s
`anthropic-opus-4-7`	pinned	100%	$0.02519	$0.03994	77%	7.3s	8.2s
`mistral-large-latest`	latest	90%	$0.00090	$0.00163	64%	4.5s	19.1s

Memory-grounded reasoning

Model	Tag	Pass	Cost (warm)	Cost (cold)	Cache hit	p50	p95
`mistral-ministral-3b-2512`	pinned	100%	$0.00006	$0.00034	94%	0.6s	0.8s
`mistral-ministral-8b-2512`	pinned	100%	$0.00006	$0.00034	93%	0.5s	0.8s
`mistral-ministral-14b-2512`	pinned	100%	$0.00008	$0.00045	93%	0.5s	0.6s
`mistral-large-2512`	pinned	100%	$0.00048	$0.00118	68%	8.5s	9.8s
`mistral-medium-2604`	pinned	100%	$0.00299	$0.00367	22%	0.8s	1.0s
`anthropic-haiku-4-5`	pinned	100%	$0.00409	$0.00409	0%	1.9s	3.9s
`anthropic-sonnet-4-6`	pinned	100%	$0.00443	$0.01189	95%	3.6s	4.9s
`anthropic-opus-4-7`	pinned	100%	$0.01077	$0.02645	88%	3.4s	8.2s
`mistral-large-latest`	latest	70%	$0.00049	$0.00088	52%	1.5s	16.3s

Workflow composition

Model	Tag	Pass	Cost (warm)	Cost (cold)	Cache hit	p50	p95
`mistral-ministral-3b-2512`	pinned	100%	$0.00013	$0.00074	95%	1.5s	3.6s
`mistral-ministral-14b-2512`	pinned	100%	$0.00014	$0.00071	93%	1.6s	2.0s
`mistral-ministral-8b-2512`	pinned	100%	$0.00014	$0.00084	95%	1.8s	2.2s
`mistral-large-2512`	pinned	100%	$0.00065	$0.00145	70%	8.3s	10.5s
`mistral-medium-2604`	pinned	100%	$0.00359	$0.00475	34%	1.1s	1.2s
`anthropic-haiku-4-5`	pinned	100%	$0.00609	$0.00609	0%	3.1s	5.1s
`anthropic-sonnet-4-6`	pinned	100%	$0.00787	$0.02088	94%	6.7s	8.6s
`anthropic-opus-4-7`	pinned	100%	$0.02887	$0.04472	87%	7.0s	10.9s
`mistral-large-latest`	latest	80%	$0.00043	$0.00124	84%	2.9s	17.1s

Long-context with tools

Model	Tag	Pass	Cost (warm)	Cost (cold)	Cache hit	p50	p95
`mistral-ministral-3b-2512`	pinned	100%	$0.00013	$0.00066	89%	0.5s	0.7s
`mistral-ministral-8b-2512`	pinned	100%	$0.00020	$0.00100	89%	0.6s	0.7s
`mistral-ministral-14b-2512`	pinned	100%	$0.00027	$0.00133	89%	1.0s	1.7s
`mistral-large-latest`	latest	100%	$0.00219	$0.00337	40%	1.3s	1.7s
`mistral-medium-2604`	pinned	100%	$0.00324	$0.01033	80%	0.7s	2.8s
`anthropic-haiku-4-5`	pinned	100%	$0.00444	$0.01713	96%	2.6s	4.9s
`anthropic-sonnet-4-6`	pinned	100%	$0.01290	$0.05132	98%	5.0s	9.5s
`anthropic-opus-4-7`	pinned	100%	$0.01725	$0.05638	99%	2.5s	19.6s
`mistral-large-2512`	pinned	80%	$0.00092	$0.00269	75%	1.6s	16.8s

Tool-chain with back-track

Model	Tag	Pass	Cost (warm)	Cost (cold)	Cache hit	p50	p95
`mistral-ministral-3b-2512`	pinned	100%	$0.00008	$0.00047	95%	0.9s	1.3s
`mistral-ministral-8b-2512`	pinned	100%	$0.00011	$0.00071	95%	1.1s	2.0s
`mistral-ministral-14b-2512`	pinned	100%	$0.00015	$0.00095	95%	1.2s	1.8s
`mistral-large-2512`	pinned	100%	$0.00060	$0.00127	63%	7.7s	9.0s
`mistral-medium-2604`	pinned	100%	$0.00640	$0.00760	19%	1.4s	1.5s
`anthropic-sonnet-4-6`	pinned	100%	$0.00725	$0.02536	97%	7.9s	9.3s
`anthropic-haiku-4-5`	pinned	100%	$0.00825	$0.00825	0%	3.7s	5.1s
`anthropic-opus-4-7`	pinned	100%	$0.01611	$0.05407	94%	6.8s	9.0s
`mistral-large-latest`	latest	80%	$0.00036	$0.00108	80%	2.3s	17.2s

Cron task / cold-start

Model	Tag	Pass	Cost (warm)	Cost (cold)	Cache hit	p50	p95
`mistral-ministral-3b-2512`	pinned	100%	$0.00004	$0.00023	91%	0.4s	0.7s
`mistral-ministral-8b-2512`	pinned	100%	$0.00007	$0.00035	91%	0.5s	0.6s
`mistral-ministral-14b-2512`	pinned	100%	$0.00013	$0.00046	81%	0.7s	1.3s
`mistral-large-2512`	pinned	100%	$0.00053	$0.00118	63%	8.4s	8.6s
`mistral-medium-2604`	pinned	100%	$0.00271	$0.00368	31%	0.7s	0.8s
`anthropic-sonnet-4-6`	pinned	100%	$0.00381	$0.01163	96%	3.6s	4.4s
`anthropic-haiku-4-5`	pinned	100%	$0.00417	$0.00417	0%	2.2s	3.9s
`anthropic-opus-4-7`	pinned	100%	$0.00956	$0.02526	89%	3.4s	5.7s
`mistral-large-latest`	latest	70%	$0.00024	$0.00089	85%	1.4s	16.4s

Real-world grounded strategy

Model	Tag	Pass	Cost (warm)	Cost (cold)	Cache hit	p50	p95
`mistral-ministral-3b-2512`	pinned	100%	$0.00006	$0.00026	92%	1.0s	1.8s
`mistral-ministral-8b-2512`	pinned	100%	$0.00010	$0.00040	92%	2.0s	2.8s
`mistral-ministral-14b-2512`	pinned	100%	$0.00013	$0.00053	92%	2.4s	2.8s
`mistral-large-2512`	pinned	100%	$0.00081	$0.00148	62%	7.1s	11.4s
`mistral-medium-2604`	pinned	100%	$0.00353	$0.00477	37%	1.3s	1.5s
`anthropic-haiku-4-5`	pinned	100%	$0.00490	$0.00490	0%	3.7s	6.5s
`anthropic-sonnet-4-6`	pinned	100%	$0.00832	$0.01615	93%	9.5s	11.0s
`anthropic-opus-4-7`	pinned	100%	$0.01974	$0.03545	88%	8.6s	9.8s
`mistral-large-latest`	latest	90%	$0.00077	$0.00140	60%	4.4s	19.7s

The conflict of interest

A benchmark is only worth what its author is willing to lose by publishing it.

The cheaper model wins most of these cells, and we sell it. That is exactly why the harness, the raw JSON and the losing runs are all public: so you don't have to take our word for any of it. The full write-up — conflict of interest and all — is in the findings.

Notes on the numbers

Cost is per task. Warm bills cache-read tokens at the published cache-read rate; cold applies no cache discount. Pinned snapshots (e.g. `mistral-large-2512`) are recommended over `-latest` tags, which Mistral updates without notice. Graded answer quality is measured in the harness but not published here: a cross-vendor quality score can't be made bias-free with available judges; see the findings for that discussion.
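The warm and cold columns can be read as the same token counts priced two ways. A minimal sketch under stated assumptions: the `Rates` type, the `task_cost` function, the per-million-token prices and the token counts are all hypothetical, and cache-write surcharges are ignored. The harness source is the reference for the actual calculation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """USD per million tokens. Hypothetical structure, not a vendor price sheet."""
    input_rate: float
    cache_read_rate: float
    output_rate: float


def task_cost(uncached_in: int, cached_in: int, out: int,
              rates: Rates, warm: bool) -> float:
    """Cost of one task in USD.

    warm=True bills cached input tokens at the cache-read rate;
    warm=False bills them at the full input rate (no cache discount).
    """
    cached_rate = rates.cache_read_rate if warm else rates.input_rate
    total = (uncached_in * rates.input_rate
             + cached_in * cached_rate
             + out * rates.output_rate)
    return total / 1_000_000


# Placeholder prices and token counts, chosen only to make the arithmetic visible.
rates = Rates(input_rate=1.00, cache_read_rate=0.10, output_rate=5.00)
warm_cost = task_cost(2_000, 18_000, 500, rates, warm=True)
cold_cost = task_cost(2_000, 18_000, 500, rates, warm=False)

if __name__ == "__main__":
    print(f"warm=${warm_cost:.5f} cold=${cold_cost:.5f}")
```

When no input tokens are served from cache, the two prices coincide, which is why a row with a 0% cache hit shows identical warm and cold costs.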

Set-Bench is part of lynox, a source-available professional agent (Elastic License v2). We built the benchmark to inform lynox's model routing. Source on GitHub.