TIL: Handling Imbalanced Datasets
Today I learned effective strategies for dealing with imbalanced datasets, where one class significantly outnumbers others. This is a common challenge in fraud detection, medical diagnosis, and anomaly detection tasks.
The Imbalanced Data Problem
Machine learning algorithms tend to favor the majority class when trained on imbalanced data, often resulting in models that:
- Achieve high accuracy but miss the minority class
- Lack practical utility for real-world applications
- Fail to identify the important but rare events we’re trying to detect
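This failure mode is easy to demonstrate. A minimal sketch (synthetic labels, not real data): a trivial "model" that always predicts the majority class looks excellent on accuracy while catching none of the rare events.

```python
# A always-predict-majority "model" on a 99:1 imbalanced dataset:
# high accuracy, zero minority-class recall.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(42)
y_true = np.zeros(1000, dtype=int)
y_true[rng.choice(1000, size=10, replace=False)] = 1  # 1% positive class

y_pred = np.zeros_like(y_true)  # always predict the majority class

print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")        # 0.99
print(f"Minority recall: {recall_score(y_true, y_pred):.2f}")   # 0.00
```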
Effective Solutions
Data-level Approaches
Resampling Techniques
- Random Oversampling: Duplicating minority class instances

```python
from imblearn.over_sampling import RandomOverSampler

oversampler = RandomOverSampler(random_state=42)
X_resampled, y_resampled = oversampler.fit_resample(X, y)
```
- SMOTE (Synthetic Minority Over-sampling Technique): Generating synthetic minority class examples by interpolating between neighboring minority instances

```python
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
```
- Random Undersampling: Removing majority class instances

```python
from imblearn.under_sampling import RandomUnderSampler

undersampler = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = undersampler.fit_resample(X, y)
```
- Hybrid Methods: Combining oversampling and undersampling (here, SMOTE followed by Edited Nearest Neighbours cleaning)

```python
from imblearn.combine import SMOTEENN

smote_enn = SMOTEENN(random_state=42)
X_resampled, y_resampled = smote_enn.fit_resample(X, y)
```
Algorithm-level Approaches
- Class Weighting: Penalizing misclassification of the minority class more heavily

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(class_weight='balanced', random_state=42)
# Or explicitly: class_weight={0: 1, 1: 10}
model.fit(X, y)
```
- Threshold Moving: Adjusting the classification threshold instead of changing the data or the model

```python
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Instead of the default 0.5 threshold
optimal_threshold = 0.3  # determined via a validation set
y_pred = (y_pred_proba >= optimal_threshold).astype(int)
```
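One way to actually determine that threshold on a validation set (a sketch on synthetic data, not a prescription) is to sweep the candidate thresholds from the precision-recall curve and keep the one that maximizes F1:

```python
# Sketch: pick a decision threshold on a held-out validation set by
# maximizing F1 over the thresholds produced by the PR curve.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
val_proba = model.predict_proba(X_val)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_val, val_proba)
# precision/recall have one more entry than thresholds; drop the last point.
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_threshold = thresholds[np.argmax(f1)]
print(f"Best F1 threshold: {best_threshold:.2f}")
```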
- Ensemble Methods: Combining multiple models, each trained on a rebalanced sample

```python
from imblearn.ensemble import BalancedRandomForestClassifier

model = BalancedRandomForestClassifier(random_state=42)
model.fit(X, y)
```
Evaluation Metrics
With imbalanced data, accuracy is misleading. Better metrics include:
- Precision and Recall: Focus on minority class performance
- F1-Score: Harmonic mean of precision and recall
- AUC-ROC: Area under the Receiver Operating Characteristic curve
- AUC-PR: Area under the Precision-Recall curve (especially useful for imbalanced data)
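All of these are available in scikit-learn; a quick sketch on synthetic data (names and dataset are illustrative):

```python
# Computing the minority-focused metrics listed above with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (average_precision_score, classification_report,
                             roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

# Per-class precision, recall, and F1
print(classification_report(y_test, model.predict(X_test)))
print("AUC-ROC:", roc_auc_score(y_test, proba))
# average_precision_score summarizes the PR curve (AUC-PR)
print("AUC-PR: ", average_precision_score(y_test, proba))
```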
Real-world Results
In my fraud detection project, combining SMOTE with class weighting improved the F1-score on the fraud class from 0.67 to 0.82, without significant loss in overall accuracy. The key was selecting the right combination of techniques through cross-validation rather than applying a single method.