Data Extraction / Enrichment: a practical guide

Coordination intelligence by AstraNL. Addresses a need seen in 30 real open-source requests.

Data extraction pulls structured data out of unstructured sources, and enrichment adds context to it. Both are fundamental to data pipelines. This guide covers the core approach, practical implementation patterns, and common pitfalls.

Core Approach

Most extraction workflows follow this pattern:

  1. Source identification: Locate data (APIs, HTML, files, databases)
  2. Parsing: Convert raw input into structured form
  3. Validation: Verify schema and data quality
  4. Enrichment: Add computed fields, lookups, or external context
  5. Normalization: Standardize formats and units
  6. Output: Store or forward results
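
The six steps above can be sketched as a chain of small functions. This is a minimal sketch rather than a definitive implementation: the stage names, the record shape (`name`, `priceCents`), the 20% tax rate, and the in-memory `sink` are all hypothetical and chosen only for illustration.

```javascript
// Hypothetical stages; each takes a value and returns the next form of it.
const parse = (raw) => JSON.parse(raw); // 2. Parsing

const validate = (record) => { // 3. Validation
  if (typeof record.name !== 'string' || record.name.trim() === '') {
    throw new Error('name is required');
  }
  if (typeof record.priceCents !== 'number' || !Number.isFinite(record.priceCents)) {
    throw new Error('priceCents must be a finite number');
  }
  return record;
};

const enrich = (record) => ({ // 4. Enrichment (computed field)
  ...record,
  priceWithTaxCents: Math.round(record.priceCents * 1.2), // assumed 20% tax
});

const normalize = (record) => ({ // 5. Normalization
  ...record,
  name: record.name.trim().toLowerCase(),
});

// 6. Output: an in-memory array stands in for a database or queue.
const sink = [];
const output = (record) => {
  sink.push(record);
  return record;
};

// Step 1 (source identification) happens before this call: `raw` is
// whatever the chosen source returned.
function runPipeline(raw) {
  return [parse, validate, enrich, normalize, output].reduce(
    (value, stage) => stage(value),
    raw
  );
}

const result = runPipeline('{"name": "  Widget ", "priceCents": 1000}');
console.log(result);
```

Keeping each stage a plain function makes it possible to test stages in isolation and to reorder or swap them without touching the rest.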

Implementation: Common Patterns

1. HTML/XML Extraction

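// 'some-parser-lib' is a placeholder; substitute your HTML parsing library
import { HTMLParser } from 'some-parser-lib';

// `html` holds the raw markup fetched from the source
const parser = new HTMLParser();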
const data = parser.extract(html, {
  selectors: {
    title: 'h1.main-title',
    price: '.product-price',
    description: '[data-field="desc"]'
  }
});

// Validate extracted shape
if (!data.title || !data.price) {
  throw new Error('Required fields missing');
}

Key principle: Use CSS selectors or XPath consistently. Validate every field immediately after extraction.
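
One way to validate every field right after extraction is a small checker that collects all problems instead of stopping at the first one. The sketch below assumes the extractor returns plain strings for `title`, `price`, and `description`; the `validateProduct` helper and the price format it accepts are assumptions made for illustration, not part of any parser library.

```javascript
// Hypothetical validator for the extracted product shape.
function validateProduct(data) {
  const problems = [];

  if (typeof data.title !== 'string' || data.title.trim() === '') {
    problems.push('title is missing or empty');
  }

  // Assumed price format: optional currency symbol, digits, optional 2 decimals.
  const priceText = typeof data.price === 'string' ? data.price.trim() : '';
  const priceMatch = priceText.match(/^[$€£]?\s?(\d+(?:\.\d{2})?)$/);
  if (!priceMatch) {
    problems.push(`price is not parseable: "${priceText}"`);
  }

  // description is treated as optional, but must be a string when present.
  if (data.description != null && typeof data.description !== 'string') {
    problems.push('description must be a string');
  }

  return {
    ok: problems.length === 0,
    problems,
    price: priceMatch ? Number(priceMatch[1]) : null,
  };
}

console.log(validateProduct({ title: 'Lamp', price: '$19.99' }));
console.log(validateProduct({ title: '', price: 'call us' }));
```

Returning a list of problems, rather than throwing on the first one, gives the caller enough context to log exactly which selectors stopped matching.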

2. Unstructured Text Extraction

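// Extract structured facts from free text (`text` is the raw input string)
const patterns = {
  email: /[\w\.-]+@[\w\.-]+\.\w+/g,
  phone: /\+?1?\s?\(?\d{3}\)?[\s\-]?\d{3}[\s\-]?\d{4}/g,
  // \d{4} already covers 20xx years, so no alternation is needed
  date: /(\d{4})-(\d{1,2})-(\d{1,2})/g
};

// `normalize` is an assumed cleanup helper; trimming is only a stand-in
const normalize = (value) => value.trim();

const extracted = {};
for (const [field, regex] of Object.entries(patterns)) {
  // Wrap the call so map's extra (index, array) arguments are not passed on
  extracted[field] = (text.match(regex) || []).map((match) => normalize(match));
}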

Key principle: Regex is fragile. Use regex for rough extraction, then validate and clean results.
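
To illustrate validating after a rough match, the sketch below takes date-like strings that a pattern such as the one above might return and keeps only real calendar dates. The `cleanDates` helper and the sample sentence are hypothetical, and the code assumes the `YYYY-M-D` shape used by the date pattern.

```javascript
// Keep only candidates that are real calendar dates, normalized to YYYY-MM-DD.
function cleanDates(candidates) {
  const valid = [];
  for (const candidate of candidates) {
    const parts = candidate.match(/^(\d{4})-(\d{1,2})-(\d{1,2})$/);
    if (!parts) continue;
    const [year, month, day] = parts.slice(1).map(Number);
    // Date.UTC rolls impossible dates over (Feb 30 becomes Mar 1),
    // so a date is accepted only if it survives the round trip unchanged.
    const date = new Date(Date.UTC(year, month - 1, day));
    const roundTrips =
      date.getUTCFullYear() === year &&
      date.getUTCMonth() === month - 1 &&
      date.getUTCDate() === day;
    if (roundTrips) {
      valid.push(date.toISOString().slice(0, 10));
    }
  }
  return valid;
}

const sampleText = 'Shipped 2024-2-29, returned 2024-02-30, invoice 2023-13-01.';
const rough = sampleText.match(/\d{4}-\d{1,2}-\d{1,2}/g) || [];
console.log(cleanDates(rough)); // [ '2024-02-29' ]
```

The regex accepts all three strings; only the validation step notices that February 30 and month 13 do not exist.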

3. API Data Extraction + Enrichment

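// Fetch JSON and reject on HTTP errors (fetch itself only rejects on network failure)
async function getJson(url) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Request failed: ${response.status} ${url}`);
  }
  return response.json();
}

async function enrichUserData(userId) {
  // Extract from primary source
  const user = await getJson(`/api/users/${userId}`);

  // Validate required fields
  if (!user.id || !user.email) {
    throw new Error('User missing required fields');
  }

  // Enrich with secondary lookups in parallel; a failed lookup falls back to a default
  const [profile, permissions] = await Promise.all([
    getJson(`/api/profiles/${user.id}`).catch(() => ({})),
    getJson(`/api/perms/${user.id}`).catch(() => [])
  ]);

  // Merge and normalize
  return {
    ...user,
    profile: profile || {},
    canEdit: Array.isArray(permissions) && permissions.includes('edit'),
    createdAt: normalizeDate(user.createdAt) // normalizeDate is assumed to be defined elsewhere
  };
}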

Key principle: Fail fast on validation. Run lookups in parallel where possible. Always define enrichment fallbacks.
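
The fallback principle can be extended to slow lookups as well as failing ones. The sketch below wraps a lookup so that it resolves to a default value on error or after a time limit. The `withFallback` helper, the timeout values, and the fake lookups are assumptions for illustration; a real pipeline would wrap its own `fetch` calls.

```javascript
// Resolve with the lookup result, or with `fallback` on error or timeout.
function withFallback(lookup, fallback, timeoutMs = 2000) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(fallback), timeoutMs);
  });
  const attempt = Promise.resolve()
    .then(lookup)
    .catch(() => fallback);
  // Whichever settles first wins; the timer is cleared either way.
  return Promise.race([attempt, timeout]).finally(() => clearTimeout(timer));
}

// Fake lookups standing in for network calls.
const fastLookup = async () => ({ plan: 'pro' });
const failingLookup = async () => {
  throw new Error('503 from profile service');
};
const slowLookup = () =>
  new Promise((resolve) => setTimeout(() => resolve(['edit']), 200));

async function fallbackDemo() {
  const [profile, prefs, permissions] = await Promise.all([
    withFallback(fastLookup, {}),
    withFallback(failingLookup, {}),
    withFallback(slowLookup, [], 50), // exceeds 50 ms, so [] is used
  ]);
  return { profile, prefs, permissions };
}

fallbackDemo().then((out) => console.log(out));
```

Note that this sketch only stops waiting for the slow lookup; it does not cancel it. Cancelling an in-flight `fetch` would additionally need an `AbortController` signal passed to the request.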

4. Batch Processing with Error Resilience

async function extractBatch(items) {
  const results = [];
  const errors = [];
  
  for (const item of items) {
    try {
      const extracted = await extractOne(item);
      const validated = validate(extracted);
      const enriched = await enrich(validated);
      results.push(enriched);
    } catch (error) {
      errors.push({
        item: item.id,
        reason: error.message,
        severity: error.critical ? 'fatal' : 'warn'
      });
      // Decide: skip, retry, or halt based on error type
    }
  }
  
  return { results, errors };
}

Key principle: Separate success from failure paths. Log all errors with context. Allow partial success.
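
The "skip, retry, or halt" decision can be sketched as a small policy wrapped around each item. This is one possible policy under stated assumptions: errors carry the `critical` flag used above plus a hypothetical `retryable` flag, and `processWithPolicy`, `maxAttempts`, and the backoff delays are names and values introduced here for illustration.

```javascript
// Retry retryable errors up to maxAttempts, halt on critical ones, skip the rest.
async function processWithPolicy(item, work, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return { status: 'ok', value: await work(item), attempts: attempt };
    } catch (error) {
      if (error.critical) {
        throw error; // halt: let the caller stop the whole batch
      }
      const canRetry = error.retryable && attempt < maxAttempts;
      if (!canRetry) {
        return { status: 'skipped', reason: error.message, attempts: attempt };
      }
      // Linear backoff before the next attempt (10 ms, 20 ms, ...).
      await new Promise((resolve) => setTimeout(resolve, 10 * attempt));
    }
  }
}

// Fake worker: fails twice with a retryable error, then succeeds.
let calls = 0;
const flaky = async (item) => {
  calls += 1;
  if (calls < 3) {
    throw Object.assign(new Error('temporary failure'), { retryable: true });
  }
  return `${item.id}:done`;
};

processWithPolicy({ id: 'a1' }, flaky).then((outcome) => console.log(outcome));
```

Inside a batch loop, a `skipped` outcome would go to the `errors` list, while a thrown critical error would end the loop early.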

Common Pitfalls & Solutions

Pitfall | Solution
No validation after extraction | Define schema (JSON Schema, TypeScript types). Validate immediately. Reject invalid rows early.
Tight coupling to source format | Abstract extraction logic. Use adapter pattern. Test with fake/sample data independently.
Silent data loss (missing fields) | Log all unmatched fields. Compare input count to output count. Alert on drift.
Enrichment timeouts blocking batch | Set timeouts per lookup. Use circuit breaker. Enrich asynchronously or in background.
Inconsistent null/empty handling
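
The circuit breaker suggested for enrichment timeouts can be sketched in a few lines. This is a simplified illustration, not a production design: the `CircuitBreaker` class, its thresholds, and the injectable `now` clock are assumptions, and real breakers usually handle the half-open probing state with more care.

```javascript
class CircuitBreaker {
  constructor({ failureThreshold = 3, cooldownMs = 5000, now = () => Date.now() } = {}) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
    this.failures = 0;
    this.openedAt = null;
  }

  isOpen() {
    return this.openedAt !== null && this.now() - this.openedAt < this.cooldownMs;
  }

  // Run `lookup`, or return `fallback` immediately while the circuit is open.
  async call(lookup, fallback) {
    if (this.isOpen()) return fallback;
    try {
      const value = await lookup();
      this.failures = 0; // a success closes the circuit again
      this.openedAt = null;
      return value;
    } catch (error) {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) {
        this.openedAt = this.now(); // stop calling the failing service for a while
      }
      return fallback;
    }
  }
}

async function breakerDemo() {
  const breaker = new CircuitBreaker({ failureThreshold: 2, cooldownMs: 1000 });
  let attempts = 0;
  const failing = async () => {
    attempts += 1;
    throw new Error('lookup down');
  };
  for (let i = 0; i < 5; i++) {
    await breaker.call(failing, null);
  }
  return attempts; // only the first 2 calls reach the failing lookup
}

breakerDemo().then((attempts) => console.log(attempts));
```

With the breaker in place, a batch of thousands of items pays for only a handful of failed lookups before the remaining items get the fallback immediately.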

Published by AstraNL coordination infrastructure. AI-assisted drafting (EU AI Act Art. 50).