Skip to content

YasminFathy/HandleImbalancedDatasets

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

27 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Handle Imbalanced Datasets

Accompanying code for the paper:

Y. Fathy, M. Jaber and A. Brintrup, "Learning With Imbalanced Data in Smart Manufacturing: A Comparative Analysis," in IEEE Access, vol. 9, pp. 2734-2757, 2021, doi: 10.1109/ACCESS.2020.3047838.

  • 1_DataExploration.ipynb: is to explore Scania Dataset, impute the missing values and apply dimensionality reduction

Case I: Predictoin on original data

  • 2_PredictiveModels_NoPCA.ipynb: performance of classifiers (logistic regression, XGBoost, Random Forest) on original Scania data (without augmenting synthetic faulty data samples). This includes imputing the missing values and no dimensionality reduction.

  • 2_PredictiveModels_PCA.ipynb (reproduce results of Table 2 in the paper): performance of classifiers (logistic regression, XGBoost, Random Forest) on original Scania data (without augmenting synthetic faulty data samples). This includes imputing the missing values and dimensionality reduction on original using PCA (PCs = 11).


Case II: Predictoin with augmenting artificial faulty data samples using SMOTE

  • 3_PredictiveModels_NoPCA-SMOTE.ipynb:: performance of classifiers (logistic regression, XGBoost, Random Forest) with augmenting artificial faulty data samples obtained using SMOTE (with different Imbalance Ratio) to original data (this includes data imputation of original data to fill missing values).

  • 3_PredictiveModels_PCA-SMOTE.ipynb (reproduce results of Table 3 in the paper): performance of classifiers (logistic regression, XGBoost, Random Forest) with augmenting artificial faulty data samples obtained using SMOTE (with different Imbalance Ratio) to original data (this includes data imputation of original data to fill missing values and dimensionality reduction on original using PCA (PCs = 11)).


Case IV: Predictoin with augmenting artificial faulty data samples using GAN/CGAN

  • 4_GAN_CGAN_Scania_PCA.ipynb: to generate data using GAN/CAGN. This includes imputing the missing values and dimensionality reduction on original using PCA (PCs = 11).

  • 5_Optimise_PredictiveModels_PCA-GAN.ipynb(reproduce results of Table 4 in the paper): performance of classifiers (logistic regression, XGBoost, Random Forest) with augmenting artificial faulty data samples obtained using GAN (with different Imbalance Ratio using 4_GAN_CGAN_Scania_PCA.ipynb) to original data (this includes data imputation of original data to fill missing values and dimensionality reduction on original using PCA (PCs = 11)).

  • 5_Optimise_PredictiveModels_PCA-CGAN.ipynb(reproduce results of Table 6 in the paper): performance of classifiers (logistic regression, XGBoost, Random Forest) with augmenting artificial faulty data samples obtained using CGAN (with different Imbalance Ratio using 4_GAN_CGAN_Scania_PCA.ipynb) to original data (this includes data imputation of original data to fill missing values and dimensionality reduction on original using PCA (PCs = 11)).


Case III: Predictoin with augmenting artificial faulty data samples using WGAN/WCGAN

  • 4_WGAN_CWGAN_Scania_PCA.ipynb: to generate data using WGAN/WCAGN . This includes imputing the missing values and dimensionality reduction on original using PCA (PCs = 11).

  • 5_Optimise_PredictiveModels_PCA-WGAN.ipynbreproduce results of Table 5 in the paper): performance of classifiers (logistic regression, XGBoost, Random Forest) with augmenting artificial faulty data samples obtained using WGAN (with different Imbalance Ratio using 4_WGAN_CWGAN_Scania_PCA.ipynb) to original data (this includes data imputation of original data to fill missing values and dimensionality reduction on original using PCA (PCs = 11)).

  • 5_Optimise_PredictiveModels_PCA-CWGAN.ipynb (reproduce results of Table 7 in the paper): performance of classifiers (logistic regression, XGBoost, Random Forest) with augmenting artificial faulty data samples obtained using CWGAN (with different Imbalance Ratio using 4_WGAN_CWGAN_Scania_PCA.ipynb) to original data (this includes data imputation of original data to fill missing values and dimensionality reduction on original using PCA (PCs = 11)).


Note: this repo provides some depth behind the paper and I am not actively maintaining it.

About

A placeholder repo for handling imbalanced dataset studies

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published