Thomaub's Blog

TIL: Handling Imbalanced Datasets


Today I learned effective strategies for dealing with imbalanced datasets, where one class significantly outnumbers others. This is a common challenge in fraud detection, medical diagnosis, and anomaly detection tasks.

The Imbalanced Data Problem

Machine learning algorithms tend to favor the majority class when trained on imbalanced data, often resulting in models that:

  1. Achieve high accuracy but miss the minority class
  2. Lack practical utility for real-world applications
  3. Fail to identify the important but rare events we’re trying to detect
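The first point, the "accuracy paradox," is easy to demonstrate. This sketch uses a hypothetical 95/5 class split and scikit-learn's `DummyClassifier` to show how a model that never predicts the minority class still looks impressive on accuracy:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# 95% negatives, 5% positives: always predicting the majority
# class yields 95% accuracy while finding zero positives
y = np.array([0] * 950 + [1] * 50)
X = np.zeros((1000, 1))

clf = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = clf.predict(X)
print("Accuracy:", accuracy_score(y, pred))          # 0.95
print("Minority recall:", recall_score(y, pred))     # 0.0
```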

Effective Solutions

Data-level Approaches

Resampling Techniques

  1. Random Oversampling: Duplicating minority class instances

```python
from imblearn.over_sampling import RandomOverSampler

oversampler = RandomOverSampler(random_state=42)
X_resampled, y_resampled = oversampler.fit_resample(X, y)
```

  2. SMOTE (Synthetic Minority Over-sampling Technique): Generating synthetic minority class examples

```python
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
```

  3. Random Undersampling: Removing majority class instances

```python
from imblearn.under_sampling import RandomUnderSampler

undersampler = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = undersampler.fit_resample(X, y)
```

  4. Hybrid Methods: Combining oversampling and undersampling

```python
from imblearn.combine import SMOTEENN

smote_enn = SMOTEENN(random_state=42)
X_resampled, y_resampled = smote_enn.fit_resample(X, y)
```

Algorithm-level Approaches

  1. Class Weighting: Penalizing misclassification of the minority class more heavily

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(class_weight='balanced', random_state=42)
# Or explicitly: class_weight={0: 1, 1: 10}
model.fit(X, y)
```

  2. Threshold Moving: Adjusting the classification threshold

```python
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Instead of the default 0.5 threshold
optimal_threshold = 0.3  # Determined via validation set
y_pred = (y_pred_proba >= optimal_threshold).astype(int)
```

  3. Ensemble Methods: Combining multiple models

```python
from imblearn.ensemble import BalancedRandomForestClassifier

model = BalancedRandomForestClassifier(random_state=42)
model.fit(X, y)
```
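A common way to find that "optimal" threshold, rather than hard-coding 0.3, is to sweep thresholds on a validation set and pick the one maximizing F1. This is a sketch on synthetic data (the dataset and the F1 criterion are assumptions; `precision_recall_curve` is the real scikit-learn API):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Toy imbalanced data; in practice use your own held-out validation split
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=42)

model = RandomForestClassifier(class_weight="balanced", random_state=42)
model.fit(X_train, y_train)
proba = model.predict_proba(X_val)[:, 1]

# One candidate threshold per (precision, recall) pair;
# the final curve point has no associated threshold
precision, recall, thresholds = precision_recall_curve(y_val, proba)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = thresholds[np.argmax(f1[:-1])]
print(f"Best threshold by validation F1: {best:.2f}")
```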

Evaluation Metrics

With imbalanced data, accuracy is misleading. Better metrics include:

  1. Precision and Recall: Focus on minority class performance
  2. F1-Score: Harmonic mean of precision and recall
  3. AUC-ROC: Area under the Receiver Operating Characteristic curve
  4. AUC-PR: Area under the Precision-Recall curve (especially useful for imbalanced data)
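The metrics above are all available in scikit-learn. A minimal sketch, again on an assumed synthetic 90/10 dataset, showing per-class precision/recall/F1 alongside the two AUC summaries:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    average_precision_score,
    classification_report,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

# Per-class precision, recall, and F1 (accuracy alone would hide class 1)
print(classification_report(y_test, model.predict(X_test)))
print("AUC-ROC:", roc_auc_score(y_test, proba))
# average_precision_score summarizes the precision-recall curve
print("AUC-PR: ", average_precision_score(y_test, proba))
```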

Real-world Results

In my fraud detection project, combining SMOTE with class weighting improved the F1-score on the fraud class from 0.67 to 0.82, without significant loss in overall accuracy. The key was selecting the right combination of techniques through cross-validation rather than applying a single method.