TIL: Handling Imbalanced Datasets
Today I learned effective strategies for dealing with imbalanced datasets, where one class significantly outnumbers others. This is a common challenge in fraud detection, medical diagnosis, and anomaly detection tasks.
The Imbalanced Data Problem
Machine learning algorithms tend to favor the majority class when trained on imbalanced data, often resulting in models that:
- Achieve high accuracy but miss the minority class
- Lack practical utility for real-world applications
- Fail to identify the important but rare events we’re trying to detect
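This failure mode is easy to demonstrate. A minimal sketch (synthetic labels, not real data): a trivial "model" that always predicts the majority class looks excellent on accuracy while catching none of the rare events.

```python
# A always-predict-majority "model" on a 99:1 imbalanced dataset:
# high accuracy, zero minority-class recall.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(42)
y_true = np.zeros(1000, dtype=int)
y_true[rng.choice(1000, size=10, replace=False)] = 1  # 1% positive class

y_pred = np.zeros_like(y_true)  # always predict the majority class

print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")        # 0.99
print(f"Minority recall: {recall_score(y_true, y_pred):.2f}")   # 0.00
```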
Effective Solutions
Data-level Approaches
Resampling Techniques
- Random Oversampling: Duplicating minority class instances

```python
from imblearn.over_sampling import RandomOverSampler

oversampler = RandomOverSampler(random_state=42)
X_resampled, y_resampled = oversampler.fit_resample(X, y)
```
- SMOTE (Synthetic Minority Over-sampling Technique): Generating synthetic minority class examples by interpolating between neighboring minority instances

```python
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
```
- Random Undersampling: Removing majority class instances

```python
from imblearn.under_sampling import RandomUnderSampler

undersampler = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = undersampler.fit_resample(X, y)
```
- Hybrid Methods: Combining oversampling and undersampling (here, SMOTE followed by Edited Nearest Neighbours cleaning)

```python
from imblearn.combine import SMOTEENN

smote_enn = SMOTEENN(random_state=42)
X_resampled, y_resampled = smote_enn.fit_resample(X, y)
```
Algorithm-level Approaches
- Class Weighting: Penalizing misclassification of the minority class more heavily

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(class_weight='balanced', random_state=42)
# Or explicitly: class_weight={0: 1, 1: 10}
model.fit(X, y)
```
- Threshold Moving: Adjusting the classification threshold instead of changing the data or the model

```python
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Instead of the default 0.5 threshold
optimal_threshold = 0.3  # determined via a validation set
y_pred = (y_pred_proba >= optimal_threshold).astype(int)
```
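One way to actually determine that threshold on a validation set (a sketch on synthetic data, not a prescription) is to sweep the candidate thresholds from the precision-recall curve and keep the one that maximizes F1:

```python
# Sketch: pick a decision threshold on a held-out validation set by
# maximizing F1 over the thresholds produced by the PR curve.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
val_proba = model.predict_proba(X_val)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_val, val_proba)
# precision/recall have one more entry than thresholds; drop the last point.
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_threshold = thresholds[np.argmax(f1)]
print(f"Best F1 threshold: {best_threshold:.2f}")
```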
- Ensemble Methods: Combining multiple models, each trained on a rebalanced sample

```python
from imblearn.ensemble import BalancedRandomForestClassifier

model = BalancedRandomForestClassifier(random_state=42)
model.fit(X, y)
```
Evaluation Metrics
With imbalanced data, accuracy is misleading. Better metrics include:
- Precision and Recall: Focus on minority class performance
- F1-Score: Harmonic mean of precision and recall
- AUC-ROC: Area under the Receiver Operating Characteristic curve
- AUC-PR: Area under the Precision-Recall curve (especially useful for imbalanced data)
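All of these are available in scikit-learn; a quick sketch on synthetic data (names and dataset are illustrative):

```python
# Computing the minority-focused metrics listed above with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (average_precision_score, classification_report,
                             roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

# Per-class precision, recall, and F1
print(classification_report(y_test, model.predict(X_test)))
print("AUC-ROC:", roc_auc_score(y_test, proba))
# average_precision_score summarizes the PR curve (AUC-PR)
print("AUC-PR: ", average_precision_score(y_test, proba))
```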
Real-world Results
In my fraud detection project, combining SMOTE with class weighting improved the F1-score on the fraud class from 0.67 to 0.82, without significant loss in overall accuracy. The key was selecting the right combination of techniques through cross-validation rather than applying a single method.