# AstraNL Cross-Provider Quality Dataset v1

**License:** CC-BY-4.0
**Source:** AstraNL coordination protocol production traces, 2026-Q2
**Total graded runs:** 972
**Task classes:** classify_sentiment, detect_language, extract_emails
**Providers:** anthropic (Claude Haiku 4.5), openai (GPT-4o-mini), gemini (Flash), grok

## What this is

A vendor-neutral cross-provider quality benchmark derived from AstraNL's production
decomposer-brain pipeline. Every task is dispatched to multiple providers in
parallel; outputs are graded against ground truth (`correct: 1` for match, `0`
for mismatch). This dataset is the result.
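
As a rough illustration of that loop (the provider clients and their async
`complete(task)` method are assumptions for the sketch, not AstraNL's actual
API):

```
import asyncio

# Hypothetical sketch of the dispatch-and-grade step. The provider
# clients and their `complete(task)` coroutine are illustrative only.
async def dispatch_and_grade(task, expected, providers):
    # Fan the same subtask out to every provider concurrently.
    outputs = await asyncio.gather(
        *(client.complete(task) for client in providers.values())
    )
    # Grade each output against ground truth: 1 = match, 0 = mismatch.
    # (Since the 2026-05-15 fix described below, outputs are normalized
    # before comparison; this sketch shows the naive pre-fix comparison.)
    return [
        {"provider": name, "output": out,
         "correct": int(out.strip() == expected.strip())}
        for name, out in zip(providers, outputs)
    ]
```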

Unlike synthetic eval benchmarks, every row is a real production run with
real cost, latency, and outcome. AstraNL operates as a vendor-neutral broker
(Dutch KvK 88449335, 1% coordination fee), so per-provider accuracy figures
reflect actual delivered quality, not vendor self-reporting.

## Honest eval-bug correction

A prior version of this dataset incorrectly reported 0% anthropic accuracy on
`extract_emails` (150 runs). An investigation on 2026-05-15 found that AstraNL's
own eval harness was at fault: it compared raw strings without stripping
markdown code fences. Claude returned `` ```json\n["..."]\n``` `` while the
expected value was the bare JSON array. Same content, different formatting.

Fix:
- `_normalize_output()` helper added to `decomposer_brain.py` (strips fences,
  canonicalizes JSON arrays); see the sketch after this list.
- 150 historical runs flipped from `correct=0` to `correct=1` with marker
  `[reeval_2026_05_15: markdown-stripped match]`.
- Constitutional rule recorded: *Before attributing failure to an external
  party, the brain MUST inspect its own measurement harness*.
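
For reference, a minimal sketch of what such a normalization helper might look
like. This illustrates the behavior described above; the actual code in
`decomposer_brain.py` may handle more cases:

```
import json
import re

def _normalize_output(text: str) -> str:
    """Strip markdown code fences and canonicalize JSON arrays (sketch)."""
    # Strip a surrounding ```...``` fence, with or without a language tag.
    fence = re.match(r"^```[a-zA-Z]*\n(.*?)\n?```$", text.strip(), re.DOTALL)
    if fence:
        text = fence.group(1)
    text = text.strip()
    # Canonicalize JSON arrays so formatting differences don't matter.
    try:
        parsed = json.loads(text)
        if isinstance(parsed, list):
            return json.dumps(parsed, sort_keys=True, separators=(",", ":"))
    except ValueError:
        pass
    return text
```

With this in place, `` ```json\n["a@b.com"]\n``` `` and `["a@b.com"]` (the
address is just an example) normalize to the same string, so the grader scores
them as a match.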

## Per-class accuracy (post-correction)

```
{
  "detect_language": {
    "anthropic": {
      "sample_size": 113,
      "accuracy_pct": 100.0,
      "avg_latency_ms": 761
    },
    "gemini": {
      "sample_size": 113,
      "accuracy_pct": 100.0,
      "avg_latency_ms": 832
    },
    "openai": {
      "sample_size": 113,
      "accuracy_pct": 100.0,
      "avg_latency_ms": 832
    }
  },
  "classify_sentiment": {
    "anthropic": {
      "sample_size": 161,
      "accuracy_pct": 98.76,
      "avg_latency_ms": 850
    },
    "gemini": {
      "sample_size": 161,
      "accuracy_pct": 95.65,
      "avg_latency_ms": 1533
    },
    "openai": {
      "sample_size": 161,
      "accuracy_pct": 100.0,
      "avg_latency_ms": 1024
    }
  },
  "extract_emails": {
    "anthropic": {
      "sample_size": 150,
      "accuracy_pct": 100.0,
      "avg_latency_ms": 858
    }
  }
}
```

## Schema (JSON)

```
id                int       primary key in source DB (preserved)
parent_run_id     str       sha256(orig)[:16], one per parent task
task_class        str       one of: classify_sentiment, detect_language,
                            extract_emails, summarize_to_one_sentence, ...
subtask_idx       int       index within parent task
provider          str       anthropic | openai | gemini | grok
model_name        str       same as provider for canonical labelling
input_text        str       first 60 chars + sha256[:12] (anonymized)
output_text       str       provider output, first 500 chars
expected          str       ground truth, first 500 chars
correct           int       1 = match, 0 = mismatch (after normalization)
latency_ms        int       end-to-end provider call time
input_tokens      int       tokenization estimate
output_tokens     int       tokenization estimate
cost_eur          float     EUR cost per provider price table
error             str|null  error message if call failed
created_at        str       ISO 8601 UTC
```
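
Rows can be consumed with standard tooling. A sketch that recomputes the
accuracy table from a line-delimited JSON export; the filename and JSONL
layout are assumptions, not a published artifact:

```
import json
from collections import defaultdict

# Recompute per-(task_class, provider) accuracy and mean latency from rows.
# "cross_provider_quality_v1.jsonl" is an assumed local export filename.
stats = defaultdict(lambda: {"n": 0, "correct": 0, "latency_ms": 0})
with open("cross_provider_quality_v1.jsonl") as f:
    for line in f:
        row = json.loads(line)
        key = (row["task_class"], row["provider"])
        stats[key]["n"] += 1
        stats[key]["correct"] += row["correct"]
        stats[key]["latency_ms"] += row["latency_ms"]

for (task_class, provider), s in sorted(stats.items()):
    print(
        f"{task_class}/{provider}: n={s['n']}, "
        f"accuracy_pct={100 * s['correct'] / s['n']:.2f}, "
        f"avg_latency_ms={s['latency_ms'] / s['n']:.0f}"
    )
```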

## Citing

```
AstraNL Coordination Protocol. Cross-Provider Quality Dataset v1, 2026-Q2.
CC-BY-4.0. https://astranl.com/datasets/cross-provider-quality-v1/
```

## Limitations

- Sample sizes vary per (task_class, provider): see accuracy table for n.
- All runs are real production traces. No synthetic data, no filler.
- Anonymization: input text truncated to 60 chars + content hash. No PII.
- Cost figures reflect AstraNL's provider contracts, not list prices.
- AstraNL is a vendor-neutral broker; this dataset contains no marketing claims
  about which provider is "best." Read the numbers.

## Issues

File a GitHub issue or email truth@astranl.com.
