Coordination intelligence by AstraNL. Addresses a need seen in 30 real open-source requests.
Data extraction and enrichment—pulling structured data from unstructured sources and adding context—is fundamental to data pipelines. This guide covers the core approach and practical implementation patterns.
Most extraction workflows follow this pattern:
// HTMLParser and 'some-parser-lib' are placeholders for your parsing library
import { HTMLParser } from 'some-parser-lib';
const parser = new HTMLParser();
// html holds the raw markup fetched earlier
const data = parser.extract(html, {
  selectors: {
    title: 'h1.main-title',
    price: '.product-price',
    description: '[data-field="desc"]'
  }
});
// Validate extracted shape
if (!data.title || !data.price) {
  throw new Error('Required fields missing');
}
Key principle: Use CSS selectors or XPath consistently. Validate every field immediately after extraction.
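One way to validate every field right after extraction is a small table of per-field rules. The sketch below is illustrative only: `productSchema`, `validateRecord`, and the price format are hypothetical names and rules chosen for this example, not part of any parser library.

```javascript
// Hypothetical per-field rules: each returns true when the value is acceptable.
const productSchema = {
  title: (v) => typeof v === 'string' && v.trim().length > 0,
  // Assumed price format: optional "$", digits, optional two decimals
  price: (v) => typeof v === 'string' && /^\$?\d+(\.\d{2})?$/.test(v.trim()),
  // description is optional, but must be a string when present
  description: (v) => v === undefined || typeof v === 'string',
};

// Returns the names of the fields that failed validation (empty when valid).
function validateRecord(record, schema) {
  return Object.entries(schema)
    .filter(([field, isValid]) => !isValid(record[field]))
    .map(([field]) => field);
}

const failures = validateRecord({ title: 'Widget', price: '19.99' }, productSchema);
if (failures.length > 0) {
  throw new Error(`Invalid fields: ${failures.join(', ')}`);
}
```

Returning the list of failing fields, rather than a single boolean, makes the error message specific enough to act on.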
// Extract structured facts from free text
const patterns = {
  email: /[\w.-]+@[\w.-]+\.\w+/g,
  phone: /\+?1?\s?\(?\d{3}\)?[\s-]?\d{3}[\s-]?\d{4}/g,
  date: /(\d{4})-(\d{1,2})-(\d{1,2})/g
};
const extracted = {};
for (const [field, regex] of Object.entries(patterns)) {
  // text is the input string; normalize is assumed to clean a single match
  extracted[field] = (text.match(regex) || []).map((m) => normalize(m));
}
Key principle: Regex is fragile. Use regex for rough extraction, then validate and clean results.
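The "validate and clean" step could look something like the following. This is a minimal sketch under stated assumptions: `cleaners` and `cleanMatches` are hypothetical helpers, phone numbers are assumed to be US-style, and dates are assumed to be in year-month-day order.

```javascript
// Hypothetical cleanup helpers applied after the rough regex pass.
// Each returns a normalized string, or null when the candidate is rejected.
const cleaners = {
  email: (raw) => {
    // Lowercase and strip trailing sentence punctuation picked up by the regex
    const value = raw.trim().toLowerCase().replace(/[.,;]+$/, '');
    return /^[^@\s]+@[^@\s]+\.[a-z]{2,}$/.test(value) ? value : null;
  },
  phone: (raw) => {
    const digits = raw.replace(/\D/g, '');
    // Accept 10 digits, or 11 digits with a leading US country code
    if (digits.length === 10) return digits;
    if (digits.length === 11 && digits.startsWith('1')) return digits.slice(1);
    return null;
  },
  date: (raw) => {
    const [year, month, day] = raw.split('-').map(Number);
    const d = new Date(Date.UTC(year, month - 1, day));
    // Reject impossible dates such as 2024-02-31, which Date would roll over
    const valid =
      d.getUTCFullYear() === year &&
      d.getUTCMonth() === month - 1 &&
      d.getUTCDate() === day;
    return valid ? d.toISOString().slice(0, 10) : null;
  },
};

// Clean the raw matches for one field, dropping rejects and duplicates.
function cleanMatches(field, matches) {
  const cleaned = matches
    .map((raw) => cleaners[field](raw))
    .filter((value) => value !== null);
  return [...new Set(cleaned)];
}
```

For example, `cleanMatches('date', ['2024-02-31', '2024-3-5'])` drops the impossible first date and zero-pads the second.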
async function enrichUserData(userId) {
  // Extract from primary source; fail fast on a non-2xx response
  const res = await fetch(`/api/users/${userId}`);
  if (!res.ok) {
    throw new Error(`User lookup failed: HTTP ${res.status}`);
  }
  const user = await res.json();
  // Validate required fields
  if (!user.id || !user.email) {
    throw new Error('User missing required fields');
  }
  // Enrich with secondary lookups in parallel; each falls back instead of rejecting
  const [profile, permissions] = await Promise.all([
    fetch(`/api/profiles/${user.id}`).then(r => r.json()).catch(() => ({})),
    fetch(`/api/perms/${user.id}`).then(r => r.json()).catch(() => [])
  ]);
  // Merge and normalize (normalizeDate is assumed to be defined elsewhere)
  return {
    ...user,
    profile: profile || {},
    canEdit: Array.isArray(permissions) && permissions.includes('edit'),
    createdAt: normalizeDate(user.createdAt)
  };
}
Key principle: Fail fast on validation. Run lookups in parallel where possible. Always define a fallback for every enrichment step.
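An enrichment fallback can also cover slow lookups, not only failed ones. The helper below is a sketch, not a library API: `withFallback` is a hypothetical name, and the 2000 ms default is an arbitrary assumption. Note that it only stops waiting for a slow lookup. It does not cancel the underlying request, which would need something like an `AbortController`.

```javascript
// Resolve with `fallback` if `lookup()` rejects, throws, or takes longer than `ms`.
// `withFallback` is a hypothetical helper written for this example.
function withFallback(lookup, fallback, ms = 2000) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  // Promise.resolve().then(...) also converts synchronous throws into rejections
  const attempt = Promise.resolve()
    .then(lookup)
    .catch(() => fallback);
  // Whichever settles first wins; the timer is cleared either way
  return Promise.race([attempt, timeout]).finally(() => clearTimeout(timer));
}

// Usage sketch with stand-in lookups (real code would call fetch here)
async function demo() {
  const [profile, permissions] = await Promise.all([
    withFallback(() => Promise.resolve({ name: 'Ada' }), {}),
    withFallback(() => Promise.reject(new Error('perms service down')), []),
  ]);
  return { profile, canEdit: permissions.includes('edit') };
}
```

Because every lookup resolves to either real data or its fallback, `Promise.all` never rejects on a secondary source, and the merge step can rely on the expected types.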
// extractOne, validate, and enrich are assumed to be defined elsewhere
async function extractBatch(items) {
  const results = [];
  const errors = [];
  for (const item of items) {
    try {
      const extracted = await extractOne(item);
      const validated = validate(extracted);
      const enriched = await enrich(validated);
      results.push(enriched);
    } catch (error) {
      errors.push({
        item: item.id,
        reason: error.message,
        severity: error.critical ? 'fatal' : 'warn'
      });
      // Decide: skip, retry, or halt based on error type.
      // Here: halt on fatal errors, skip the item otherwise.
      if (error.critical) break;
    }
  }
  return { results, errors };
}
Key principle: Separate success from failure paths. Log all errors with context. Allow partial success.
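The "skip, retry, or halt" decision can be made explicit with a small policy function. The following is a sketch under assumptions: `decide` and `runBatch` are hypothetical names, `processOne` stands in for the extract, validate, and enrich chain, and `error.transient` is an assumed flag that the caller's code would set on retryable failures.

```javascript
// Hypothetical policy: map an error to 'retry', 'skip', or 'halt'.
function decide(error, attempt, maxAttempts) {
  if (error.critical) return 'halt';
  if (error.transient && attempt < maxAttempts) return 'retry';
  return 'skip';
}

// Process items one by one, applying the policy above to each failure.
async function runBatch(items, processOne, maxAttempts = 3) {
  const results = [];
  const errors = [];
  outer: for (const item of items) {
    for (let attempt = 1; ; attempt++) {
      try {
        results.push(await processOne(item));
        break; // success: move on to the next item
      } catch (error) {
        const action = decide(error, attempt, maxAttempts);
        if (action === 'retry') continue;
        // Record the failure with enough context to debug it later
        errors.push({ item: item.id, reason: error.message, attempts: attempt, action });
        if (action === 'halt') break outer; // stop the whole batch
        break; // skip: give up on this item only
      }
    }
  }
  // `complete` is false when a halt left some items unprocessed
  const complete = results.length + errors.length === items.length;
  return { results, errors, complete };
}
```

Comparing the combined result and error counts against the input count gives a cheap check that no item was dropped silently.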
| Pitfall | Solution |
|---|---|
| No validation after extraction | Define schema (JSON Schema, TypeScript types). Validate immediately. Reject invalid rows early. |
| Tight coupling to source format | Abstract extraction logic. Use adapter pattern. Test with fake/sample data independently. |
| Silent data loss (missing fields) | Log all unmatched fields. Compare input count to output count. Alert on drift. |
| Enrichment timeouts blocking batch | Set timeouts per lookup. Use circuit breaker. Enrich asynchronously or in background. |
| Inconsistent null/empty handling | |

Published by AstraNL coordination infrastructure. AI-assisted drafting (EU AI Act Art. 50).