spark_pipelines

A collection of general-purpose tools for binary classification tasks using Spark.

Structure of this Spark pipeline tool (modules under lib/):

categorical_handler.py : provides common methods to encode categorical columns.
feature_selection.py : provides 4 types of feature selection, including hard_code_remover, chi-square test, and model-based selection.
imbalance_handler.py : contains 3 functions: a random down-sampler, a SMOTE over-sampler, and an overall driver for any desired sampling.
data_explore.py : computes basic statistics on numerical and categorical columns.
util.py : contains utility functions for both Spark and pandas DataFrames.
plot_metrics.py : plots modelling results in multiple setups (training, validation, ...).
modelling.py : contains the data science part, using Spark ML and scikit-learn toolkits.
logger.py : for logging purposes.
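
As an illustration of the random down-sampling idea mentioned for imbalance_handler.py, here is a minimal sketch. It is not the code from this repository: the function names `downsample_fractions` and `random_downsample`, the `label_col` and `ratio` parameters, and the default seed are all hypothetical. The sketch assumes `df` is a PySpark DataFrame and relies on `DataFrame.sampleBy`, which performs stratified sampling without replacement.

```python
from typing import Dict, Hashable


def downsample_fractions(class_counts: Dict[Hashable, int],
                         ratio: float = 1.0) -> Dict[Hashable, float]:
    """Compute a keep-fraction per class (hypothetical helper).

    Each class is reduced so that it has at most `ratio` times as many
    rows as the smallest class; classes already below that size are
    kept in full (fraction 1.0).
    """
    if not class_counts or any(c <= 0 for c in class_counts.values()):
        raise ValueError("class_counts must be non-empty with positive counts")
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    minority = min(class_counts.values())
    return {
        label: min(1.0, ratio * minority / count)
        for label, count in class_counts.items()
    }


def random_downsample(df, label_col: str = "label",
                      ratio: float = 1.0, seed: int = 42):
    """Randomly down-sample the larger classes of a PySpark DataFrame.

    Assumes `df` is a pyspark.sql.DataFrame. Counts rows per label,
    then applies stratified sampling with the computed fractions.
    """
    counts = {
        row[label_col]: row["count"]
        for row in df.groupBy(label_col).count().collect()
    }
    fractions = downsample_fractions(counts, ratio)
    return df.sampleBy(label_col, fractions=fractions, seed=seed)
```

Because `sampleBy` samples each row independently, the resulting class sizes are approximate rather than exact. A call such as `random_downsample(df, label_col="label", ratio=2.0)` would keep roughly two majority rows per minority row.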

