Coordination intelligence by AstraNL. Addresses a need seen in 42 real open-source requests.
This guide covers the core approaches to building text classification and sentiment systems. Most real-world implementations use one of three strategies: rule-based, traditional ML, or neural networks. Your choice depends on data availability, latency requirements, and accuracy targets.
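Rule-based: assign sentiment scores using word dictionaries (lexicons). Fastest and requires no training data, but brittle with context, negation, and sarcasm.
Traditional ML: train a classifier on labeled examples using TF-IDF or word-count features. A balance of speed, accuracy, and interpretability.
Neural networks: use pre-trained models (BERT, DistilBERT) fine-tuned on your data. Usually the most accurate option, but fine-tuning is far more practical on a GPU and, as a rule of thumb, needs 500+ labeled examples.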
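To make the rule-based strategy concrete, here is a minimal sketch of a dictionary scorer. The `LEXICON` contents and the `lexicon_score` helper are hypothetical and chosen only for illustration; real lexicons contain thousands of scored words.

```python
# Hypothetical toy lexicon, for illustration only.
LEXICON = {"love": 1.0, "great": 0.8, "good": 0.5, "bad": -0.5, "terrible": -1.0}

def lexicon_score(text):
    """Average the scores of known words; return 0.0 if none are found."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(lexicon_score("Great phone, I love it!"))  # positive score
print(lexicon_score("This is not good."))        # still positive: "not" is ignored
```

The second call shows the brittleness in practice: a plain word lookup has no notion of negation, so "not good" scores the same as "good".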
Decide on your taxonomy: binary (positive/negative), ternary (positive/neutral/negative), or multi-class (intent, emotion, topic).
# Example: Sentiment
classes = ["positive", "neutral", "negative"]
# Example: Customer intent
classes = ["bug_report", "feature_request", "support_question", "spam"]
Collect representative examples and label them consistently. Use guidelines to ensure agreement.
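One common way to check labeling agreement is Cohen's kappa, which corrects raw agreement between two annotators for agreement expected by chance. The sketch below is a plain-Python version under the assumption that both annotators labeled the same items in the same order; the annotator lists are made-up sample data.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Made-up sample annotations.
annotator_1 = ["positive", "negative", "neutral", "positive", "negative", "positive"]
annotator_2 = ["positive", "negative", "positive", "positive", "negative", "neutral"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # 0.455
```

A value near 1.0 indicates strong agreement, while a value near 0 suggests the guidelines may need tightening. scikit-learn provides the same metric as `sklearn.metrics.cohen_kappa_score`.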
Normalize input to reduce noise without losing meaning.
import re

def preprocess(text):
    """Normalize raw text before feature extraction."""
    # Lowercase
    text = text.lower()
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove @mentions and #hashtags
    text = re.sub(r'[@#]\w+', '', text)
    # Collapse extra whitespace
    text = ' '.join(text.split())
    return text
Option A: Lexicon-based (no training)
from textblob import TextBlob
text = "I really loved this product!"
polarity = TextBlob(text).sentiment.polarity
# polarity is a float in the range [-1.0, 1.0]
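A polarity score still has to be mapped onto the chosen taxonomy. The sketch below shows one possible mapping to ternary labels; the `polarity_to_label` name and the 0.1 threshold are assumptions for illustration, and the threshold should be tuned on your own labeled data.

```python
def polarity_to_label(polarity, threshold=0.1):
    """Map a polarity in [-1.0, 1.0] to a ternary sentiment label."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"  # scores close to zero are treated as neutral

for score in (0.7, 0.05, -0.4):
    print(score, polarity_to_label(score))
```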
Option B: Traditional ML
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
model = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
    ('clf', MultinomialNB())
])
# Fit on labeled data: X_train and X_test are lists of text strings,
# y_train holds the label for each training text
model.fit(X_train, y_train)
predictions = model.predict(X_test)
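The snippet above leaves `X_train`, `y_train`, and `X_test` undefined. A minimal end-to-end sketch, assuming a small in-memory dataset, might look like the following. The texts and labels are invented for illustration, and eight examples are far too few for a real model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy dataset; a real project needs far more labeled examples.
texts = [
    "I love this product", "Absolutely fantastic service",
    "Works great, very happy", "Best purchase I have made",
    "Terrible quality, broke quickly", "I hate the new update",
    "Very disappointed with support", "Worst experience ever",
]
labels = ["positive"] * 4 + ["negative"] * 4

# Stratify so both classes appear in the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", MultinomialNB()),
])
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(list(zip(X_test, predictions)))
```

Because the pipeline bundles the vectorizer with the classifier, the test texts are transformed with the vocabulary learned from the training split only, which avoids leaking test data into the features.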
Option C: Fine-tuned Transformer
from transformers import pipeline

# Use a pre-trained sentiment model (weights are downloaded on first use)
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
result = classifier("This movie was amazing!")
# Returns a list of dicts, e.g. [{'label': 'POSITIVE', 'score': 0.99}]
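Since the classifier returns a confidence score with each label, one option is to route low-confidence predictions to a human instead of trusting them blindly. The sketch below assumes the result shape shown above (a list with one dict containing `label` and `score`); the `route_prediction` helper, the `needs_review` label, and the 0.8 cutoff are hypothetical.

```python
def route_prediction(result, min_score=0.8):
    """Return the predicted label, or 'needs_review' when confidence is low."""
    top = result[0]  # the pipeline returns one dict per input text
    return top["label"] if top["score"] >= min_score else "needs_review"

# Hand-written results in the pipeline's output shape, so no model download is needed.
print(route_prediction([{"label": "POSITIVE", "score": 0.99}]))  # POSITIVE
print(route_prediction([{"label": "NEGATIVE", "score": 0.55}]))  # needs_review
```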
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))
Focus on precision (lowered by false positives), recall (lowered by false negatives), and, for imbalanced data, the F1-score, since accuracy alone can look high when one class dominates.
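To show how these metrics relate to the error types, here is a small plain-Python sketch that computes them for a single class. The `precision_recall_f1` helper and the sample labels are illustrative; `classification_report` reports the same quantities for every class.

```python
def precision_recall_f1(y_true, y_pred, positive_label):
    """Compute precision, recall, and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive_label and p == positive_label for t, p in pairs)
    fp = sum(t != positive_label and p == positive_label for t, p in pairs)
    fn = sum(t == positive_label and p != positive_label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up imbalanced sample: 2 "spam" items out of 8.
y_true = ["spam", "spam", "ham", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "ham", "ham", "ham", "spam"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                                      # 0.75
print(precision_recall_f1(y_true, y_pred, "spam"))   # (0.5, 0.5, 0.5)
```

Accuracy is 0.75 here even though only half of the actual spam is caught, which is why per-class precision, recall, and F1 are more informative on imbalanced data.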
Published by AstraNL coordination infrastructure. AI-assisted drafting (EU AI Act Art. 50).