Machine Learning · Finance

Credit Risk Model Comparison

A comparative study of five classification algorithms on the German Credit Dataset — evaluating predictive power, calibration quality, and business-relevant risk metrics across 1,000 loan applicants.

Dataset: German Credit (UCI)
Records: 1,000
Features: 20
Test Split: 20%
CV Folds: 5-Fold Stratified
🏆 Random Forest: Best Performer

Highest AUC-ROC on both the hold-out test set and 5-fold cross-validation

AUC-ROC: 0.703
KS Stat: 0.378
CV AUC: 0.697
Accuracy: 65.0%
[Figure: AUC-ROC by Model]

[Table columns: Model, Accuracy, AUC-ROC, F1 Score, Precision, Recall, KS Stat, CV AUC]

[Figure: Top Feature Importances (Random Forest)]

Methodology

Data

German Credit Dataset — 1,000 applicants, 20 features covering checking account status, loan duration, credit history, purpose, savings, employment tenure, personal status, and more. Target: binary good/bad credit risk.

Preprocessing

Categorical variables are ordinal-encoded. StandardScaler is applied inside the pipeline for LR, SVM, GBM, and KNN; of these, LR, SVM, and KNN are scale-sensitive, while tree-based GBM is unaffected by feature scaling. Class imbalance is handled in RF via class_weight='balanced'.
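The preprocessing described above could be wired together roughly as follows. This is a minimal sketch, not the study's actual code: the helper make_pipeline_for, the column names, and the toy rows are all hypothetical, and treating unseen categories as -1 is an assumption the write-up does not mention.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler


def make_pipeline_for(model, categorical_cols, scale):
    """Ordinal-encode categorical columns, optionally scale, then apply `model`.

    Hypothetical helper: the name and signature are illustrative only.
    """
    encoder = ColumnTransformer(
        [
            (
                "cat",
                # Assumption: categories unseen at fit time are mapped to -1.
                OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
                categorical_cols,
            )
        ],
        remainder="passthrough",  # numeric columns pass through unchanged
    )
    steps = [("encode", encoder)]
    if scale:
        steps.append(("scale", StandardScaler()))
    steps.append(("model", model))
    return Pipeline(steps)


# Tiny made-up frame, only to show the pipeline running end to end.
toy = pd.DataFrame(
    {
        "checking_status": ["A11", "A12", "A14", "A11", "A13", "A14", "A12", "A11"],
        "purpose": ["car", "radio", "car", "education", "car", "radio", "car", "education"],
        "duration": [6, 48, 12, 42, 24, 36, 24, 36],
        "credit_amount": [1169, 5951, 2096, 7882, 4870, 9055, 2835, 6948],
    }
)
target = np.array([0, 1, 0, 0, 1, 0, 0, 1])  # assumed coding: 1 = bad credit risk
cat_cols = ["checking_status", "purpose"]

# Scale-sensitive model: encoder -> StandardScaler -> LR
lr_pipe = make_pipeline_for(LogisticRegression(C=0.1, max_iter=1000), cat_cols, scale=True)
# Random Forest: no scaler, class imbalance handled via class_weight
rf_pipe = make_pipeline_for(
    RandomForestClassifier(
        n_estimators=200, max_depth=8, class_weight="balanced", random_state=0
    ),
    cat_cols,
    scale=False,
)

lr_pipe.fit(toy, target)
rf_pipe.fit(toy, target)
lr_scores = lr_pipe.predict_proba(toy)[:, 1]  # predicted probability of class 1
rf_preds = rf_pipe.predict(toy)
```

Keeping the scaler inside the pipeline means it is refit on each training fold during cross-validation, so no statistics from held-out rows leak into training.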

Evaluation

80/20 stratified train-test split. 5-fold stratified cross-validation for AUC. KS statistic computed as max(TPR - FPR) on ROC curve — a key metric in credit scoring contexts.
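To make the KS definition concrete, here is a small dependency-free sketch that sweeps score thresholds and tracks max(TPR - FPR). The function name ks_statistic and the toy labels and scores are illustrative, and coding the positive class as 1 is an assumption.

```python
def ks_statistic(y_true, y_score):
    """Return max(TPR - FPR) over all score thresholds.

    y_true holds 0/1 labels (1 = positive class, an assumed coding);
    y_score holds model scores where higher means more likely positive.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    # Walk thresholds from the highest score downwards.
    pairs = sorted(zip(y_score, y_true), key=lambda p: -p[0])
    tp = fp = 0
    best = 0.0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume every example tied at this score before measuring.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        best = max(best, tp / n_pos - fp / n_neg)
    return best


labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
ks = ks_statistic(labels, scores)
print(f"KS = {ks:.3f}")  # prints "KS = 0.500"
```

With scikit-learn available, the same value can be read off the arrays returned by sklearn.metrics.roc_curve as the maximum of tpr - fpr.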

Models

LR (C=0.1), RF (200 trees, max_depth=8), Gradient Boosting (200 estimators, lr=0.05), SVM (RBF kernel, calibrated), KNN (k=11). All wrapped in sklearn Pipelines.
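The five configurations and the evaluation protocol might look something like the sketch below. Several details are assumptions rather than facts from the study: the random seed, the use of CalibratedClassifierCV for the calibrated SVM, running cross-validation on the training portion only, and the synthetic data from make_classification, which stands in for the encoded German Credit features.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

SEED = 42  # assumed; the write-up does not state a random seed


def scaled(model):
    """Wrap a model in a Pipeline with StandardScaler in front (hypothetical helper)."""
    return Pipeline([("scale", StandardScaler()), ("model", model)])


models = {
    "LR": scaled(LogisticRegression(C=0.1, max_iter=1000)),
    "RF": Pipeline([("model", RandomForestClassifier(
        n_estimators=200, max_depth=8, class_weight="balanced", random_state=SEED))]),
    "GBM": scaled(GradientBoostingClassifier(
        n_estimators=200, learning_rate=0.05, random_state=SEED)),
    # Assumption: "calibrated" is implemented here with CalibratedClassifierCV.
    "SVM": scaled(CalibratedClassifierCV(SVC(kernel="rbf"), cv=3)),
    "KNN": scaled(KNeighborsClassifier(n_neighbors=11)),
}

# Synthetic stand-in data, NOT the German Credit Dataset.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=6, weights=[0.7, 0.3], random_state=SEED
)

# 80/20 stratified hold-out split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=SEED
)
# 5-fold stratified CV (run on the training portion here, which is an assumption)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)

results = {}
for name, pipe in models.items():
    cv_auc = cross_val_score(pipe, X_train, y_train, cv=cv, scoring="roc_auc").mean()
    pipe.fit(X_train, y_train)
    test_auc = roc_auc_score(y_test, pipe.predict_proba(X_test)[:, 1])
    results[name] = {"cv_auc": cv_auc, "test_auc": test_auc}
    print(f"{name}: test AUC={test_auc:.3f}, CV AUC={cv_auc:.3f}")
```

The numbers this prints come from synthetic data and are not comparable to the results reported on this page.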

Key Findings

🌲 Random Forest Wins

Best AUC (0.703) and KS statistic (0.378). Ensemble tree methods handle the non-linear interactions in credit features well without heavy tuning.

📈 Gradient Boosting Close Second

F1 is identical to three decimals (0.664 vs 0.664), but AUC is slightly lower (0.697 vs 0.703). With hyperparameter tuning, GBM could close the gap or surpass RF.

🔵 Logistic Regression as Baseline

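Solid, interpretable baseline with AUC 0.681: competitive with the tree ensembles, easy to explain to business stakeholders, and well suited to GDPR transparency expectations for credit decisions, since each feature's contribution can be read from its coefficient.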

⚠️ KNN Underperforms

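A KS of 0.155 is well below what is generally considered acceptable in credit scoring, and far behind RF's 0.378. Distance-based methods struggle with the mixed-type features in this dataset: ordinal codes assigned to nominal categories impose arbitrary distances between them.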