Production-traced benchmark from a real Dutch MCP aggregator · Q2 2026 snapshot
Claude Haiku 4.5 · GPT-4o-mini · Gemini 2.5 Flash · CC-BY 4.0
813 ground-truth-verified runs. Real production traffic. No cherry-picking. Same input distribution across all three providers.
If you build agents or evaluate model selection, this dataset gives you what static benchmarks cannot: real cost, real latency, real accuracy from live dispatch traffic.
| Task class | Claude Haiku 4.5 | GPT-4o-mini | Gemini 2.5 Flash | n (per provider) |
|---|---|---|---|---|
| classify_sentiment | 98.8% | 100.0% | 95.7% | 161 |
| detect_language | 100.0% | 100.0% | 100.0% | 113 |
| extract_emails | 100.0% | 100.0% | 100.0% | 30–150 |
| summarize_to_one_sentence | — | 100.0% | 95.0% | 20 |
| Provider | Mean accuracy | Mean latency (ms) | Mean cost (μEUR) |
|---|---|---|---|
| Anthropic Claude Haiku 4.5 | 99.6% | 823 (fastest) | 7.04 |
| OpenAI GPT-4o-mini | 100.0% | 1046 | 10.51 |
| Google Gemini 2.5 Flash | 96.7% | 1806 (slowest) | 4.80 (cheapest) |
Three independent winners: GPT-4o-mini on accuracy, Claude Haiku 4.5 on latency, Gemini 2.5 Flash on cost. Build dispatch logic accordingly.
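As a rough illustration of what such dispatch logic could look like, the sketch below picks a provider by a single priority, with an optional accuracy floor. The figures are copied from the summary table above. The provider identifiers, the `ProviderStats` type, and the `pick_provider` function are hypothetical names chosen for this example and are not part of the AstraNL API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderStats:
    name: str              # hypothetical identifier, not an official model ID
    accuracy: float        # mean accuracy, percent
    latency_ms: float      # mean latency, milliseconds
    cost_micro_eur: float  # mean cost per call, micro-EUR


# Values copied from the summary table in this document.
PROVIDERS = [
    ProviderStats("claude-haiku-4.5", 99.6, 823, 7.04),
    ProviderStats("gpt-4o-mini", 100.0, 1046, 10.51),
    ProviderStats("gemini-2.5-flash", 96.7, 1806, 4.80),
]


def pick_provider(priority: str, min_accuracy: float = 0.0) -> ProviderStats:
    """Return the best provider for one priority among those meeting an accuracy floor."""
    candidates = [p for p in PROVIDERS if p.accuracy >= min_accuracy]
    if not candidates:
        raise ValueError(f"no provider reaches {min_accuracy}% accuracy")
    if priority == "accuracy":
        return max(candidates, key=lambda p: p.accuracy)
    if priority == "latency":
        return min(candidates, key=lambda p: p.latency_ms)
    if priority == "cost":
        return min(candidates, key=lambda p: p.cost_micro_eur)
    raise ValueError(f"unknown priority: {priority!r}")


if __name__ == "__main__":
    for priority in ("accuracy", "latency", "cost"):
        print(priority, "->", pick_provider(priority).name)
```

A production router would more likely key on the per-task-class figures than on the overall means, since the gaps between providers differ by task class.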
Every row is a real call routed through the live MCP aggregator at /capabilities/dispatch. Inputs come from real third-party agents during April–May 2026. Ground truth for detect_language and extract_emails is deterministic. For classify_sentiment, ground truth is the majority vote across all three providers; runs with disagreement (correct=-1, 133 rows) are excluded from accuracy stats. Cost is the per-call figure reported by each provider's billing API, converted to EUR at the daily ECB rate. Latency is measured server-side, end to end, and includes ~50 ms of AstraNL routing overhead.
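The majority-vote scoring for classify_sentiment can be sketched as follows. This is a minimal illustration, not the pipeline's actual code. The exact disagreement rule is not spelled out above, so the sketch assumes a run is marked -1 when no label reaches a strict majority. The function names and the provider keys in the usage example are hypothetical.

```python
from collections import Counter


def majority_label(labels):
    """Return the label a strict majority agrees on, or None if there is none."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


def score_run(labels_by_provider):
    """Map each provider to 1 (matches majority), 0 (does not), or -1 (no majority)."""
    truth = majority_label(list(labels_by_provider.values()))
    if truth is None:
        # Assumption: no strict majority means the run has no usable ground truth.
        return {provider: -1 for provider in labels_by_provider}
    return {provider: int(label == truth)
            for provider, label in labels_by_provider.items()}


def accuracy(scores):
    """Percent correct over scored runs, skipping rows marked -1."""
    kept = [s for s in scores if s != -1]
    return 100.0 * sum(kept) / len(kept) if kept else float("nan")


if __name__ == "__main__":
    run = {"claude": "positive", "gpt": "positive", "gemini": "neutral"}
    print(score_run(run))
    print(accuracy([1, 1, 0, -1]))
```

Under this assumption, a provider can only be scored wrong when the other two agree against it, which is why excluded rows are dropped before computing accuracy rather than counted as errors.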
@dataset{astranl2026crossprovider,
title = {Cross-Provider Task-Class Quality Dataset (Q2 2026)},
author = {AstraNL ZZP},
year = {2026},
month = {May},
url = {https://astranl.com/research/cross-provider-q2-2026/},
license = {CC-BY-4.0}
}
This dataset exists because we keep the receipts. Contributions adding more task classes, more providers, or better ground truth are welcome at research@astranl.com.
AstraNL ZZP · KvK 88449335 · BTW NL004604224B69 · Netherlands · GDPR-compliant production system · No PII included in this dataset — input texts are screened for personal data before publication