---
license: cc-by-4.0
task_categories:
  - text-classification
  - token-classification
  - summarization
  - translation
language:
  - en
  - uk
  - nl
  - multilingual
tags:
  - llm-evaluation
  - cross-provider-benchmark
  - production-traces
  - mcp
  - claude
  - gpt-4o
  - gemini
  - quality-metrics
  - cost-latency
pretty_name: "AstraNL Cross-Provider Task-Class Quality Dataset (Q2 2026)"
size_categories:
  - 1K<n<10K
---

# AstraNL Cross-Provider Task-Class Quality Dataset (Q2 2026)

**A production-traced benchmark comparing Anthropic Claude, OpenAI GPT-4o-mini, and Google Gemini Flash across 5 task classes, measured in real dispatch traffic from a Dutch MCP aggregator.**

Released under **CC-BY 4.0** by AstraNL ZZP (KvK 88449335, NL).

## Why this dataset matters

Most LLM benchmarks are static (MMLU, HumanEval, MT-Bench). They tell you how a model performs on synthetic test suites curated by researchers. This dataset is different: **every row is a real production task** routed through a working MCP aggregator with real users, real inputs, and a real cost meter ticking.

If you build agents, ship an LLM product, or evaluate model selection trade-offs, this gives you something static benchmarks cannot: **production-grade quality, cost, and latency triples**, sampled from the same input distribution across providers.

## What's inside

- **813 ground-truth-verified runs** across 5 task classes
- **3 providers, side-by-side**: Anthropic Claude Haiku 4.5, OpenAI GPT-4o-mini, Google Gemini 2.5 Flash
- **Per-row metrics**: accuracy (correct: 0/1), latency (ms), input/output tokens, cost (EUR micro)
- **Reproducible**: input/output preserved, ground truth expected value preserved where available
- **CC-BY 4.0**: free to cite, redistribute, build on

## Headline results

| Task class | Claude Haiku | GPT-4o-mini | Gemini Flash | Sample size (per provider) |
|---|---|---|---|---|
| classify_sentiment | **98.8%** | **100.0%** | 95.7% | 161 |
| detect_language | **100.0%** | **100.0%** | **100.0%** | 113 |
| extract_emails | **100.0%** | **100.0%** | **100.0%** | 30-150 |
| summarize_to_one_sentence | — | **100.0%** | 95.0% | 20 |

**Key observations** (honest, not marketing):

1. **All three providers are essentially tied on accuracy** for well-defined extraction tasks (detect_language, extract_emails). Differentiation is in cost and latency.
2. **Claude Haiku 4.5 leads on latency** (mean 761–858ms vs Gemini 832–1533ms vs GPT-4o-mini 832–1183ms).
3. **Gemini Flash is cheapest** by ~3-4× on micro-EUR, but slowest on classify_sentiment by ~50%.
4. **GPT-4o-mini wins on classify_sentiment accuracy** (100% vs Claude's 98.8% on 161 runs).
5. **No model dominated all dimensions**. Real-world dispatch needs routing logic, not picking a single winner.

## Cost-Quality-Latency Pareto (EUR per call)

| Provider | Avg accuracy (all classes) | Avg latency (ms) | Avg cost (μEUR) |
|---|---|---|---|
| Anthropic Claude Haiku 4.5 | 99.6% | 823 | 7.04 |
| OpenAI GPT-4o-mini | 100.0% | 1046 | 10.51 |
| Google Gemini 2.5 Flash | 96.7% | 1806 | 4.80 |

## Methodology

- **Production source**: Live MCP aggregator at `https://astranl.com/capabilities/dispatch`. All traces are from real third-party agent calls between April and May 2026.
- **Ground truth**: For `detect_language` and `extract_emails`, expected outputs are deterministic (language codes, regex-matchable emails). For `classify_sentiment`, ground truth is annotated by majority-vote between 3 providers on the same input — runs where ≥2 providers disagree are marked `correct=-1` and excluded from accuracy stats.
- **Cost calculation**: Real per-call cost from each provider's billing API, converted to EUR at daily ECB rate.
- **Latency**: End-to-end including AstraNL routing overhead (~50ms), measured server-side.
- **No cherry-picking**: All runs with `correct IN (0, 1)` are included. The 133 runs with `correct=-1` (annotator disagreement) are excluded transparently.

## File structure

```
runs.csv  — full dataset, 1741 rows + header
  columns: task_class, provider, model_name, input_text, output_text,
           expected, correct, latency_ms, input_tokens, output_tokens,
           cost_eur, created_at
README.md — this file
LICENSE   — CC-BY 4.0
```

## How to cite

```bibtex
@dataset{astranl2026crossprovider,
  title  = {AstraNL Cross-Provider Task-Class Quality Dataset (Q2 2026)},
  author = {AstraNL ZZP},
  year   = {2026},
  month  = {May},
  url    = {https://huggingface.co/datasets/astranl/cross-provider-task-class-q2-2026},
  license = {CC-BY-4.0},
  publisher = {Hugging Face},
  note = {Production traces from a Dutch MCP aggregator}
}
```

## Limitations

- **Small sample on some classes**: `summarize_to_one_sentence` only has 20 runs per provider. Use with care.
- **No adversarial testing**: This is benign production traffic. Don't extrapolate to safety/jailbreak scenarios.
- **Single model per provider**: Claude Haiku 4.5, not Opus or Sonnet. GPT-4o-mini, not GPT-4o. Gemini Flash, not Pro. Frontier models would shift the cost calculus dramatically.
- **Snapshot in time**: May 2026 model versions. Model behavior changes between releases — re-benchmark before relying on these numbers in production.
- **Western-language bias**: Inputs are predominantly English, with some Dutch, Ukrainian, and Russian.

## Provenance and contact

- **Source**: AstraNL ZZP, Netherlands (KvK 88449335, BTW NL004604224B69)
- **Production system**: https://astranl.com — live MCP aggregator listed in MCP Registry as `com.astranl/mcp`
- **Manifest**: https://astranl.com/capabilities/dispatch/manifest
- **Contact**: research@astranl.com
- **Built on**: Anthropic API, OpenAI API, Google Generative AI SDK, plus our own decomposer-brain routing

This dataset exists because we route real traffic and we keep the receipts. We hope it helps the community make better model-selection decisions.
