Classification / Sentiment: a practical guide

Coordination intelligence by AstraNL. Addresses a need seen in 42 real open-source requests.

This guide covers the core approaches to building text classification and sentiment systems. Most real-world implementations use one of three strategies: rule-based, traditional ML, or neural networks. Your choice depends on data availability, latency requirements, and accuracy targets.

Core Approaches

1. Rule-Based (Lexicon)

Assign sentiment scores using word dictionaries. Fastest, requires no training data, but brittle with context and sarcasm.
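
To make the idea concrete, here is a minimal sketch of a lexicon scorer. The word list, the scores, and the negation rule are all hypothetical toy values chosen for illustration; a real system would use a curated lexicon with thousands of entries.

```python
# Hypothetical toy lexicon: word -> sentiment score (illustrative values only).
LEXICON = {"love": 2.0, "great": 1.5, "good": 1.0,
           "bad": -1.0, "terrible": -2.0, "hate": -2.0}
# Words that flip the sign of the next scored word (a simplifying assumption).
NEGATORS = {"not", "never", "no"}

def lexicon_score(text):
    """Sum word scores, flipping the word that directly follows a negator."""
    score = 0.0
    negate = False
    for token in text.lower().split():
        word = token.strip(".,!?;:")
        if word in NEGATORS:
            negate = True
            continue
        value = LEXICON.get(word, 0.0)
        score += -value if negate else value
        negate = False
    return score

print(lexicon_score("I love this great phone"))    # clearly positive
print(lexicon_score("not good"))                   # negation flips the sign
print(lexicon_score("Oh great, it broke again"))   # sarcasm still scores positive
```

The last call shows the brittleness mentioned above: the scorer sees "great" and has no way to notice the sarcasm.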

2. Traditional ML (Naive Bayes, SVM, Logistic Regression)

Train on labeled examples using TF-IDF or word-count features. Balanced speed, accuracy, and interpretability.
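
For readers who have not met TF-IDF before, the sketch below computes one common textbook variant (term frequency times log(N / document frequency)) using only the standard library. The function name `tfidf` is hypothetical, and library implementations differ in detail: scikit-learn's TfidfVectorizer, for example, applies a smoothed IDF and L2 normalization by default, so its numbers will not match these.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (unsmoothed textbook TF-IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

docs = ["good phone", "bad phone"]
for doc, w in zip(docs, tfidf(docs)):
    print(doc, w)
```

A term that appears in every document ("phone" here) gets weight zero, while terms that distinguish documents ("good", "bad") get positive weight. That is the property the classifier relies on.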

3. Neural Networks (Fine-tuned Transformers)

Start from a pre-trained model (BERT, DistilBERT) and fine-tune it on your data. Usually the most accurate option, but training is slow without a GPU, and you should plan for 500+ labeled examples.

Implementation Steps

Step 1: Define Classes Clearly

Decide on your taxonomy: binary (positive/negative), ternary (positive/neutral/negative), or multi-class (intent, emotion, topic).

# Example: Sentiment
classes = ["positive", "neutral", "negative"]

# Example: Customer intent
classes = ["bug_report", "feature_request", "support_question", "spam"]
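
Most models work with integer class ids rather than strings, so it helps to fix the mapping once and validate labels against it. The sketch below assumes the sentiment taxonomy from the first example; the names `label2id`, `id2label`, and `encode_labels` are illustrative choices, not part of any library.

```python
classes = ["positive", "neutral", "negative"]

# Fix the label <-> id mapping once so training and inference agree.
label2id = {label: i for i, label in enumerate(classes)}
id2label = {i: label for label, i in label2id.items()}

def encode_labels(labels):
    """Convert string labels to ids, rejecting anything outside the taxonomy."""
    unknown = sorted(set(labels) - set(label2id))
    if unknown:
        raise ValueError(f"labels outside the taxonomy: {unknown}")
    return [label2id[label] for label in labels]

print(encode_labels(["negative", "positive", "neutral"]))
```

Rejecting unknown labels early catches typos in the labeled data (for example "postive") before they silently become an extra class.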

Step 2: Gather and Label Data

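Collect representative examples and label them consistently. Write labeling guidelines that include borderline cases, and have at least two annotators label a shared sample so you can check how often they agree.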
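
One common way to quantify labeling agreement is Cohen's kappa, which corrects raw agreement for the agreement two annotators would reach by chance. The sketch below is a plain-Python version for two annotators; the function name and the sample labels are made up for illustration.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's own label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
print(cohen_kappa(annotator_1, annotator_2))
```

Here the annotators agree on 3 of 4 items (0.75), but chance agreement is 0.5, so kappa is 0.5. A low kappa is usually a sign that the guidelines need sharper definitions, not that the annotators are careless.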

Step 3: Preprocess Text

Normalize input to reduce noise without losing meaning.

import re

def preprocess(text):
    # Lowercase
    text = text.lower()
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove @mentions, #hashtags
    text = re.sub(r'[@#]\w+', '', text)
    # Remove extra whitespace
    text = ' '.join(text.split())
    return text
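
Dropping hashtags entirely can discard sentiment-bearing words such as "#awesome". As one possible alternative, the sketch below keeps the hashtag word and removes only the "#" symbol, while still dropping URLs and @mentions. The function name `preprocess_keep_hashtags` is hypothetical, and whether hashtag text helps depends on your data.

```python
import re

def preprocess_keep_hashtags(text):
    """Like a standard cleanup pass, but keeps the word inside each hashtag."""
    text = text.lower()
    text = re.sub(r'http\S+', ' ', text)    # drop URLs
    text = re.sub(r'@\w+', ' ', text)       # drop @mentions
    text = re.sub(r'#(\w+)', r'\1', text)   # "#awesome" -> "awesome"
    return ' '.join(text.split())           # collapse extra whitespace

print(preprocess_keep_hashtags(
    "Loving the new release! #Awesome @acme https://example.com/x"))
```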

Step 4: Choose and Train Model

Option A: Lexicon-based (no training)

from textblob import TextBlob

text = "I really loved this product!"
polarity = TextBlob(text).sentiment.polarity
# polarity is a float between -1.0 (most negative) and 1.0 (most positive)
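
A polarity score still has to be mapped onto your class labels. A minimal sketch, assuming a ternary taxonomy and a symmetric neutral band around zero; the 0.1 threshold is an arbitrary starting point that should be tuned on labeled examples.

```python
def polarity_to_label(polarity, neutral_band=0.1):
    """Map a polarity in [-1.0, 1.0] to positive / neutral / negative."""
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must be between -1.0 and 1.0")
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"

for score in (0.5, 0.05, -0.3):
    print(score, polarity_to_label(score))
```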

Option B: Traditional ML

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

model = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000, ngram_range=(1,2))),
    ('clf', MultinomialNB())
])

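# Fit on labeled data. X_train / X_test are lists of raw text strings and
# y_train / y_test are the matching class labels from your own train/test split.
model.fit(X_train, y_train)
predictions = model.predict(X_test)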
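
The training and test sets used above have to come from a split of your labeled data. In practice scikit-learn's train_test_split with the stratify argument does this; the standard-library sketch below only illustrates the idea of a stratified split, where each class keeps roughly the same share in both sets. The function name and defaults are illustrative assumptions.

```python
import random
from collections import defaultdict

def stratified_split(texts, labels, test_fraction=0.2, seed=0):
    """Split per class so train and test keep similar class proportions."""
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        # Classes with a single example stay in the training set.
        n_test = max(1, round(len(indices) * test_fraction)) if len(indices) > 1 else 0
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    X_train = [texts[i] for i in train_idx]
    X_test = [texts[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

texts = [f"example {i}" for i in range(10)]
labels = ["pos"] * 5 + ["neg"] * 5
X_train, X_test, y_train, y_test = stratified_split(texts, labels)
print(len(X_train), len(X_test), sorted(y_test))
```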

Option C: Fine-tuned Transformer

from transformers import pipeline

# Load a DistilBERT checkpoint that is already fine-tuned on SST-2
# (binary sentiment). Fine-tune on your own data if your classes differ.
classifier = pipeline("sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english")

result = classifier("This movie was amazing!")
# Returns a list with one dict per input,
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]

Step 5: Evaluate

from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))

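Focus on precision (the share of predictions for a class that are correct, lowered by false positives), recall (the share of true members of a class that are found, lowered by false negatives), and F1-score (their harmonic mean). On imbalanced data, read the per-class and macro-averaged F1 rather than accuracy alone, since accuracy can look high while a rare class is mostly missed.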
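
To see what these numbers mean, the sketch below computes precision, recall, and F1 for one class by hand, then averages F1 over classes (macro F1). The helper names and the spam/ham sample are made up for illustration; classification_report gives the same per-class quantities.

```python
def per_class_metrics(y_true, y_pred, target):
    """Precision, recall, and F1 for one class, treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == target and p == target for t, p in pairs)
    fp = sum(t != target and p == target for t, p in pairs)
    fn = sum(t == target and p != target for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(per_class_metrics(y_true, y_pred, c)[2] for c in classes) / len(classes)

y_true = ["spam", "spam", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "ham", "spam"]
print(per_class_metrics(y_true, y_pred, "spam"))
print(macro_f1(y_true, y_pred))
```

In this sample, accuracy is 4/6 (about 0.67), yet only half of the spam is caught and only half of the spam predictions are right. The per-class numbers expose that gap.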

Step 6: Handle Edge Cases