A collection of general-purpose tools for binary classification tasks using Spark.
Structure of this Spark pipeline tool, under lib/:
- categorical_handler.py : provides common methods to encode categorical columns.
- feature_selection.py : provides 4 types of feature selection, including hard_code_remover, chi-square test, and model-based selection.
- imbalance_handler.py : contains 3 functions: a random down-sampler, a SMOTE over-sampler, and an overall driver for whichever sampling is desired.
- data_explore.py : computes basic statistics on numerical and categorical columns.
- util.py : contains utility functions for both Spark and pandas DataFrames.
- plot_metrics.py : plots modelling results in multiple setups (training, validation, ...).
- modelling.py : contains the data science part, using the Spark ML and sklearn toolkits.
- logger.py : for logging purposes.
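
To illustrate the kind of helper imbalance_handler.py describes, here is a minimal sketch of random down-sampling for a binary label. It is not the code from this repository: the function names `downsample_fractions` and `random_down_sample`, the `label_col` and `ratio` parameters, and the default column name "label" are all hypothetical. The sketch assumes a PySpark DataFrame and relies only on `groupBy().count()` and `DataFrame.sampleBy`.

```python
def downsample_fractions(class_counts, ratio=1.0):
    """Return per-class sampling fractions that shrink the majority class.

    class_counts: dict mapping each of the two label values to its row count.
    ratio: desired majority/minority size ratio after sampling (hypothetical knob).
    """
    if len(class_counts) != 2:
        raise ValueError("expected exactly two classes")
    # Sort by count so the first item is the minority class.
    (minority, n_min), (majority, n_maj) = sorted(
        class_counts.items(), key=lambda kv: kv[1]
    )
    # Keep every minority row; keep just enough majority rows to hit the ratio.
    frac = min(1.0, ratio * n_min / n_maj) if n_maj else 1.0
    return {minority: 1.0, majority: frac}


def random_down_sample(df, label_col="label", ratio=1.0, seed=42):
    """Down-sample the majority class of a PySpark DataFrame (illustrative only)."""
    # Count rows per label value on the cluster, then collect the two counts.
    counts = {
        row[label_col]: row["count"]
        for row in df.groupBy(label_col).count().collect()
    }
    fractions = downsample_fractions(counts, ratio)
    # sampleBy draws a stratified sample without replacement, per label value.
    return df.sampleBy(label_col, fractions=fractions, seed=seed)
```

Because `sampleBy` samples each row independently, the resulting class sizes are approximate rather than exact. Keeping the fraction arithmetic in a separate pure function makes it easy to check without a Spark session.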