← AstraNL

Cross-Provider LLM Quality Dataset

Production-traced benchmark from a real Dutch MCP aggregator · Q2 2026 snapshot

CLAUDE HAIKU 4.5 GPT-4o-mini GEMINI 2.5 FLASH CC-BY 4.0

813 ground-truth-verified runs. Real production traffic. No cherry-picking. Same input distribution across all three providers.

If you build agents or evaluate model selection, this dataset gives you what static benchmarks cannot: real cost, real latency, real accuracy from live dispatch traffic.

Download runs.csv (181 KB) Dataset card Live manifest

Headline accuracy results

Task classClaude Haiku 4.5GPT-4o-miniGemini 2.5 Flashn (per provider)
classify_sentiment98.8%100.0%95.7%161
detect_language100.0%100.0%100.0%113
extract_emails100.0%100.0%100.0%30–150
summarize_to_one_sentence100.0%95.0%20
Honest read: On well-defined extraction tasks, all three providers are essentially tied on accuracy. Differentiation lies in cost and latency, not output quality. Routing logic, not picking a single winner, is the right answer for production.

Cost-quality-latency triangle

ProviderMean accuracyMean latency (ms)Mean cost (μEUR)
Anthropic Claude Haiku 4.599.6%823 (fastest)7.04
OpenAI GPT-4o-mini100.0%104610.51
Google Gemini 2.5 Flash96.7%1806 (slowest)4.80 (cheapest)

Three independent winners. Build dispatch logic accordingly.

What this dataset is for

What it's not for

Methodology in one paragraph

Every row is a real call routed through the live MCP aggregator at /capabilities/dispatch. Inputs come from real third-party agents during April–May 2026. Ground truth for detect_language and extract_emails is deterministic. For classify_sentiment, ground truth is majority-vote across all three providers; runs with disagreement (correct=-1, 133 rows) are excluded from accuracy stats. Cost is per-call billing-API truth converted to EUR at daily ECB rate. Latency is server-side end-to-end including ~50 ms of AstraNL routing overhead.

How to cite

@dataset{astranl2026crossprovider,
  title  = {Cross-Provider Task-Class Quality Dataset (Q2 2026)},
  author = {AstraNL ZZP},
  year   = {2026},
  month  = {May},
  url    = {https://astranl.com/research/cross-provider-q2-2026/},
  license = {CC-BY-4.0}
}

Use it. Cite it. Improve it.

This dataset exists because we run the receipts. Pull requests adding more task classes, more providers, or better ground truth are welcome at research@astranl.com.

AstraNL ZZP · KvK 88449335 · BTW NL004604224B69 · Netherlands · GDPR-compliant production system · No PII included in this dataset — input texts are screened for personal data before publication