Cross-Provider LLM Quality Dataset

Production-traced benchmark from a real Dutch MCP aggregator · Q2 2026 snapshot

CLAUDE HAIKU 4.5 GPT-4o-mini GEMINI 2.5 FLASH CC-BY 4.0

813 ground-truth-verified runs. Real production traffic. No cherry-picking. Same input distribution across all three providers.

If you build agents or evaluate model selection, this dataset gives you what static benchmarks cannot: real cost, real latency, real accuracy from live dispatch traffic.

Download runs.csv (181 KB) Dataset card Live manifest

Headline accuracy results

Task class	Claude Haiku 4.5	GPT-4o-mini	Gemini 2.5 Flash	n (per provider)
classify_sentiment	98.8%	100.0%	95.7%	161
detect_language	100.0%	100.0%	100.0%	113
extract_emails	100.0%	100.0%	100.0%	30–150
summarize_to_one_sentence	—	100.0%	95.0%	20

Honest read: On well-defined extraction tasks, all three providers are essentially tied on accuracy. Differentiation lies in cost and latency, not output quality. Routing logic, not picking a single winner, is the right answer for production.

Cost-quality-latency triangle

Provider	Mean accuracy	Mean latency (ms)	Mean cost (μEUR)
Anthropic Claude Haiku 4.5	99.6%	823 (fastest)	7.04
OpenAI GPT-4o-mini	100.0%	1046	10.51
Google Gemini 2.5 Flash	96.7%	1806 (slowest)	4.80 (cheapest)

Three independent winners. Build dispatch logic accordingly.

What this dataset is for

Model selection decisions for production systems
Routing logic in agents and aggregators
Comparison baseline for new model releases
Academic research on multi-provider arbitrage and quality trade-offs

What it's not for

Safety / jailbreak evaluation (benign traffic only)
Frontier-model comparisons (Haiku/Mini/Flash tier, not Opus/4o/Pro)
Long-context reasoning (single-turn extraction-style tasks)
Adversarial robustness

Methodology in one paragraph

Every row is a real call routed through the live MCP aggregator at /capabilities/dispatch. Inputs come from real third-party agents during April–May 2026. Ground truth for detect_language and extract_emails is deterministic. For classify_sentiment, ground truth is majority-vote across all three providers; runs with disagreement (correct=-1, 133 rows) are excluded from accuracy stats. Cost is per-call billing-API truth converted to EUR at daily ECB rate. Latency is server-side end-to-end including ~50 ms of AstraNL routing overhead.

How to cite

@dataset{astranl2026crossprovider,
  title  = {Cross-Provider Task-Class Quality Dataset (Q2 2026)},
  author = {AstraNL ZZP},
  year   = {2026},
  month  = {May},
  url    = {https://astranl.com/research/cross-provider-q2-2026/},
  license = {CC-BY-4.0}
}

Use it. Cite it. Improve it.

This dataset exists because we run the receipts. Pull requests adding more task classes, more providers, or better ground truth are welcome at research@astranl.com.

AstraNL ZZP · KvK 88449335 · BTW NL004604224B69 · Netherlands · GDPR-compliant production system · No PII included in this dataset — input texts are screened for personal data before publication