Utilizing Generative Models to Address Imbalanced Data Classification in the Context of Credit Card Fraud Detection
Developed as part of my dissertation submitted to the University of Manchester for the degree of “M.Sc. Business Analytics: Operations Research and Risk Analysis” in the Faculty of Humanities.
This study explores the effectiveness of data augmentation with generative models for addressing class imbalance in credit card fraud datasets.
The two generative models tested are a Generative Adversarial Network (GAN) and a Variational Autoencoder (VAE). They are compared with two traditional oversampling techniques: Synthetic Minority Oversampling Technique (SMOTE) and Adaptive Synthetic Sampling Approach for Imbalanced Learning (ADASYN).
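To illustrate the interpolation idea behind SMOTE, here is a minimal sketch that uses only the standard library. It is not the imblearn implementation listed under Sampling and not the code used in the dissertation. The function name `smote_like_samples`, the parameter `k`, and the toy points are assumptions made for illustration.

```python
import math
import random


def smote_like_samples(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class (hypothetical values, not taken from the dataset)
fraud = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.4), (0.9, 2.1)]
new_points = smote_like_samples(fraud, n_new=6)
```

Each synthetic point lies on a line segment between two existing minority points, so it stays inside the region the minority class already occupies. ADASYN follows a similar interpolation scheme but generates more samples near minority points that are surrounded by majority-class neighbours.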
- Editor Used: Visual Studio
- Python Version: Python 3.10.12
- General Purpose: copy, collections
- Data Manipulation: pandas, numpy
- Data Visualization: seaborn, matplotlib
- Machine Learning: scikit-learn, tensorflow, keras
- Sampling: imblearn
- visualisations.ipynb: contains initial data exploration, including a statistical summary table, correlation matrix, distribution graphs and boxplots.
- CV.py: helper functions for implementing cross validation and printing results.
- GAN.py: GAN functions for training the model, generating synthetic samples, and concatenating with training data.
- VAE.py: VAE functions for training the model, generating synthetic samples, and concatenating with training data.
- classifiers.ipynb: training and evaluation of LR, RF, KNN, and XGB with the original distribution of data, SMOTE, ADASYN, VAE, and GAN.
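The file descriptions mention cross validation and concatenating synthetic samples with the training data. A common way to combine the two is to augment only the training part of each fold, so the held-out part keeps the original class distribution. The sketch below shows that pattern with the standard library only. It is an assumption about a typical setup, not a description of what CV.py does. Both `stratified_folds` and `oversample_minority` are hypothetical helpers, and the latter simply duplicates minority labels as a stand-in for SMOTE, ADASYN, VAE, or GAN sampling.

```python
import random
from collections import Counter


def stratified_folds(labels, k, seed=0):
    """Yield (train_idx, test_idx) pairs that keep class ratios similar per fold."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's indices round-robin so every fold gets its share
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


def oversample_minority(train_labels, minority=1):
    """Placeholder augmentation: add minority labels until the classes match."""
    counts = Counter(train_labels)
    deficit = max(counts.values()) - counts[minority]
    return train_labels + [minority] * deficit


labels = [0] * 45 + [1] * 5  # toy labels with a 10% minority class
fold_summaries = []
for train_idx, test_idx in stratified_folds(labels, k=5):
    # Augment the training labels only; the test labels are left untouched
    train_labels = oversample_minority([labels[i] for i in train_idx])
    test_labels = [labels[i] for i in test_idx]
    fold_summaries.append((Counter(train_labels), Counter(test_labels)))
```

Keeping synthetic samples out of the held-out fold means the evaluation metrics reflect performance on the real, imbalanced distribution rather than on generated data.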
The dataset used is sourced from the Machine Learning Group at ULB (Université Libre de Bruxelles) and contains 284,807 credit card transactions made by European cardholders over two days in September 2013.