Automatically transform all categorical, date-time and NLP variables in your dataset to numeric in a single line of code, for any dataset of any size.
- lazytransform is very easy to install on Kaggle and Colab notebooks using this command:
!pip install lazytransform --ignore-installed --no-cache --no-deps
- lazytransform as of version 0.91 has two Super Learning Optimized (SULO) ensembles named "SuloClassifier" and "SuloRegressor". The estimators are "super-optimized" in the sense that they perform automatic GridSearchCV, so you can use them for all kinds of multi-label, multi-class and imbalanced dataset problems with just the default parameters and get great results in Kaggle competitions. Take a look at the amazing benchmarking results notebook here for SuloClassifier:
- What is lazytransform
- How to use lazytransform
- How to install lazytransform
- Usage
- Tips
- API
- Maintainers
- Contributing
- License
- All sklearn models
- All MultiOutput models from sklearn.multioutput library
- XGBoost models
- LightGBM models
- lazytransform is built using the pandas, numpy, scikit-learn, category_encoders and imbalanced-learn libraries. It should run on most Python 3 Anaconda installations without additional installs; you won't need any special libraries other than "imbalanced-learn" and "category_encoders".
- First, try it with base_estimator set to None and all other params left as None or False.
- Compare it against a competitor model such as XGBoost or RandomForest and see whether it beats them.
- If not, set weights=True for the Sulo models, then imbalanced=True, and see whether that works.
- If a competitor is still beating Sulo, then use that model as base_estimator while leaving all other params above untouched.
- Next, change n_estimators from its default of None to 5.
- Finally, increase n_estimators to 7 and then 10 and compare. By now, Sulo should be beating all other models.
- The more you increase the number of estimators, the bigger the performance boost you will get, until at some point it drops off. Keep increasing until then. A sketch of this progression follows below.
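Here is that progression as a minimal sketch. The import path and the parameter names (base_estimator, n_estimators, weights, imbalanced) are taken from the tips above and may differ slightly in your installed version, so treat this as illustrative rather than definitive:

```python
from lazytransform import SuloClassifier
from lightgbm import LGBMClassifier

# Step 1: all defaults - no base estimator, no class weights, no imbalanced handling
clf = SuloClassifier(base_estimator=None, n_estimators=None,
                     weights=False, imbalanced=False)
clf.fit(X_train, y_train)

# Later steps: if a competitor model still wins, turn on weights/imbalanced,
# make the competitor the base estimator, and raise n_estimators step by step
clf = SuloClassifier(base_estimator=LGBMClassifier(), n_estimators=5,
                     weights=True, imbalanced=True)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
```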
- `model`: default is None. It can also be any scikit-learn model (including multioutput models) as well as a model from the popular XGBoost and LightGBM libraries. You need to install those libraries if you want to use them.
- `encoders`: one or more encoders, given as a string or a list of strings (see the example after the encoder descriptions below). Each encoder string can be any one of the 10+ encoders from the `category_encoders` library. The available encoders are listed here as strings so that you can input them in lazytransform:
  - `auto` - uses `onehot` encoding for low-cardinality variables and `label` encoding for high-cardinality variables
  - `onehot` - One Hot Encoding - will be used for all categorical features irrespective of cardinality
  - `label` - Label Encoding - will be used for all categorical features irrespective of cardinality
  - `hashing` or `hash` - Hashing (or Hash) Encoding - will be used for all categorical variables
  - `helmert` - Helmert Encoding - will be used for all categorical variables
  - `bdc` - Backward Difference (BDC) Encoding - will be used for all categorical variables
  - `sum` - Sum Encoding - will be used for all categorical variables
  - `loo` - Leave One Out Encoding - will be used for all categorical variables
  - `base` - Base Encoding - will be used for all categorical variables
  - `woe` - Weight of Evidence Encoding - will be used for all categorical variables
  - `james` - James-Stein Encoding - will be used for all categorical variables
  - `target` - Target Encoding - will be used for all categorical variables
  - `count` - Count Encoding - will be used for all categorical variables
  - `glm`, `glmm` - Generalized Linear Model Encoding
Here is a description of various encoders and their uses from the excellent `category_encoders` python library:
- `HashingEncoder`: a multivariate hashing implementation with configurable dimensionality/precision. The advantage of this encoder is that it does not maintain a dictionary of observed categories. Consequently, the encoder does not grow in size and accepts new values during data scoring by design.
- `SumEncoder`: Sum contrast coding for the encoding of categorical features.
- `PolynomialEncoder`: Polynomial contrast coding for the encoding of categorical features.
- `BackwardDifferenceEncoder`: Backward difference contrast coding for encoding categorical variables.
- `OneHotEncoder`: the traditional one-hot (or dummy) coding for categorical features. It produces one feature per category, each being a binary.
- `HelmertEncoder`: uses Helmert contrast coding for encoding categorical features.
- `OrdinalEncoder`: uses Ordinal encoding to designate a single column of integers to represent the categories in your data. The integers follow the order in which the categories are found in your dataset. If you want to change the order, just sort the column and send it in for encoding.
- `FrequencyEncoder`: a count encoding technique for categorical features. For a given categorical feature, it replaces the names of the categories with the group counts of each category.
- `BaseNEncoder`: encodes the categories into arrays of their base-N representation. A base of 1 is equivalent to one-hot encoding (not really base-1, but useful), a base of 2 is equivalent to binary encoding. N equal to the number of actual categories is equivalent to vanilla ordinal encoding.
- `TargetEncoder`: performs Target encoding for categorical features. It supports binary and continuous targets. For multi-class targets it uses a PolynomialWrapper.
- `CatBoostEncoder`: performs CatBoost coding for categorical features. It supports binary and continuous targets. For polynomial target support, it uses a PolynomialWrapper. This is very similar to leave-one-out encoding, but calculates the values "on-the-fly". Consequently, the values naturally vary during the training phase and it is not necessary to add random noise.
- `WOEEncoder`: uses the Weight of Evidence technique for categorical features. It supports only one kind of target: binary. For polynomial target support, it uses a PolynomialWrapper. It cannot be used for Regression.
- `JamesSteinEncoder`: uses the James-Stein estimator. It supports binary and continuous targets. For polynomial target support, it uses a PolynomialWrapper. For feature value i, the James-Stein estimator returns a weighted average of the mean target value for the observed feature value i and the mean target value regardless of the feature value.
- `QuantileEncoder`: a very good encoder for Regression tasks. See paper and article: https://towardsdatascience.com/quantile-encoder-eb33c272411d
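For example, here is how you could pass these encoder strings to LazyTransformer, either as a single string or as a list (a minimal sketch based on the constructor shown in the Usage section; X_train, y_train and X_test are assumed to be pandas objects you have already split):

```python
from lazytransform import LazyTransformer

# One encoder as a string ...
lazy = LazyTransformer(model=None, encoders='onehot', scalers=None)

# ... or several encoders as a list of strings
lazy = LazyTransformer(model=None, encoders=['onehot', 'target'], scalers=None)

X_trainm, y_trainm = lazy.fit_transform(X_train, y_train)
X_testm = lazy.transform(X_test)
```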
- `scalers`: one of the main scalers used in scikit-learn models to transform numeric features. Default is None. Scalers are applied in the last step of the pipeline to scale all the features that have been transformed. However, you might want to avoid scaling NLP datasets, since scaling TF-IDF-vectorized features may not make sense. But it is up to you. The options are (the sketch below shows how to select one):
  - `None` - no scaler. Great for almost all datasets. Test it first and then try one of the scalers below.
  - `std` - Standard scaler. Great for almost all datasets.
  - `minmax` - MinMax scaler. Great for datasets where you need the distribution scaled between 0 and 1.
  - `robust` - Robust scaler. Great for datasets where you have outliers.
  - `maxabs` - MaxAbs scaler. Great for scaling but leaves negative values as they are (negative).
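For example, to add robust scaling as the final step of the transformer pipeline (a minimal sketch; all other arguments are left at their defaults):

```python
from lazytransform import LazyTransformer

# 'robust' scales the transformed numeric features while dampening outliers
lazy = LazyTransformer(model=None, encoders='auto', scalers='robust')
X_trainm, y_trainm = lazy.fit_transform(X_train, y_train)
```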
- `date_to_string`: default is False. If you want to use date variables as strings (categorical), set it to True. You can use this option when there are very few dates in your dataset. If you leave it as False, lazytransform will convert the column to date-time format and extract up to 20 features from it. This is the default and best option.
- `transform_target`: default is False. If you want to transform your target variable(s), set it to True and lazytransform will transform your target(s) to numeric using Label Encoding, and will also handle multi-label binary classes. This is a great option when you have categorical target variables.
- `imbalanced`: default is False. If you have an imbalanced dataset, set it to True and lazytransform will transform your train data using BorderlineSMOTE or SMOTENC, both of which are great options. The right SMOTE function is selected automatically.
- `combine_rare`: default is False. This is a great option if you have too many rare categories in your categorical variables. It will automatically combine the rare categories that make up less than 1% of the dataset into one combined category called "rare_categories". If you leave it as False, none of the categories in your categorical variables are combined.
- `verbose`: this has 3 possible states:
  - `0` - silent output. Great for running this silently and getting fast results.
  - `1` - more verbose. Great for knowing how results turned out and making changes to the input flags.
  - `2` - highly verbose output. Great for finding out what happens under the hood in lazytransform pipelines.
- Category Encoders library: Fantastic library https://contrib.scikit-learn.org/category_encoders/index.html
- Imbalanced Learn library: Another fantastic library https://imbalanced-learn.org/stable/index.html
- The amazing `lazypredict` was an inspiration for `lazytransform`. You can check out the library here: https://github.com/shankarpandala/lazypredict
- The amazing Kevin Markham was another inspiration for lazytransform. You can check out his classes here: https://www.dataschool.io/about/
We ran a similar benchmark of SuloRegressor against the XGBoost and LightGBM regressors, and it held its own against them. Take a look at the benchmarking result:
`lazytransform` is a new Python library for automatically transforming your entire dataset to numeric format using category encoders, NLP text vectorizers and pandas date-time processing functions. All in a single line of code!
`lazytransform` has two important uses in the Data Science process. It can be used in feature engineering to transform features or add features (see API below). It can also be used in MLOps to train and evaluate models in data pipelines, with multiple models being trained simultaneously using the same train/test split and the same feature engineering steps. This ensures that there is absolutely zero or minimal data leakage in your MLOps pipelines.
The first method is probably the most popular way to use lazytransform. The transformer within lazytransform can be used to transform and create new features from categorical, date-time and NLP (text) features in your dataset. This transformer pipeline is fully scikit-learn Pipeline compatible and can be used to build even more complex pipelines using the `make_pipeline` function from the `sklearn.pipeline` library. Let us see an example:
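Here is a minimal sketch of the first method (X_train, y_train and X_test are assumed to be pandas DataFrames/Series you have already split; `lazy.xformer` is the transformer pipeline described in the API section below):

```python
from lazytransform import LazyTransformer
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier

# Transform the features only - no model inside LazyTransformer
lazy = LazyTransformer(model=None, encoders='auto', scalers=None)
X_trainm, y_trainm = lazy.fit_transform(X_train, y_train)
X_testm = lazy.transform(X_test)

# Because lazy.xformer is a scikit-learn Pipeline, it can also be composed
# into a larger pipeline with make_pipeline if you want to go further
pipe = make_pipeline(lazy.xformer, RandomForestClassifier())
```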
The second method is a great way to create an entire data transform and model training pipeline with absolutely no data leakage. `lazytransform` allows you to send in a model object (only the models listed below are supported) and it will automatically transform the data, create new features and train the model using sklearn pipelines. This method can be seen as follows:
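Here is a minimal sketch of the second method, using RandomForestClassifier as the model object (any of the supported models listed below can be substituted):

```python
from lazytransform import LazyTransformer
from sklearn.ensemble import RandomForestClassifier

# The model goes inside LazyTransformer, so feature transformation and model
# training happen in one leakage-free sklearn pipeline
lazy = LazyTransformer(model=RandomForestClassifier(), encoders='auto',
                       scalers=None, transform_target=True, verbose=1)
lazy.fit(X_train, y_train)
preds = lazy.predict(X_test)
```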
The third method is a great way to find the best data transformation and model training pipeline using GridSearchCV or RandomizedSearchCV along with a LightGBM, XGBoost or scikit-learn model. This is explained very clearly in the LazyTransformer_with_GridSearch_Pipeline.ipynb notebook in this same GitHub repo. Make sure you check it out!
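The notebook linked above shows the full pattern. As a simplified sketch (not the notebook's exact code), you can transform the data once with LazyTransformer and then run GridSearchCV over a model's hyper-parameters on the transformed features:

```python
from lazytransform import LazyTransformer
from sklearn.model_selection import GridSearchCV
from lightgbm import LGBMClassifier

# Transform the data once with the leakage-free pipeline ...
lazy = LazyTransformer(model=None, encoders='auto', scalers=None)
X_trainm, y_trainm = lazy.fit_transform(X_train, y_train)

# ... then search over the model's hyper-parameters on the transformed data
params = {'n_estimators': [100, 200], 'num_leaves': [31, 63]}
search = GridSearchCV(LGBMClassifier(), params, cv=5)
search.fit(X_trainm, y_trainm)
print(search.best_params_, search.best_score_)
```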
The following models are currently supported:
Prerequisites:
conda install -c conda-forge lazytransform
The second best installation method is to use "pip install".
pip install lazytransform
Alert! When using Colab or Kaggle Notebooks, you must use the slightly modified installation process below. If you don't, you will get weird errors on those platforms!
pip install lazytransform --ignore-installed --no-deps
pip install category-encoders --ignore-installed --no-deps
To install from source:
cd <lazytransform_Destination>
git clone git@github.com:AutoViML/lazytransform.git
or download and unzip https://github.com/AutoViML/lazytransform/archive/master.zip
conda create -n <your_env_name> python=3.7 anaconda
conda activate <your_env_name> # ON WINDOWS: `source activate <your_env_name>`
cd lazytransform
pip install -r requirements.txt
You can invoke `lazytransform` as a scikit-learn compatible fit and transform or a fit and predict pipeline. See syntax below.
from lazytransform import LazyTransformer
lazy = LazyTransformer(model=None, encoders='auto', scalers=None,
date_to_string=False, transform_target=False, imbalanced=False,
combine_rare=False, verbose=0)
X_trainm, y_trainm = lazy.fit_transform(X_train, y_train)
X_testm = lazy.transform(X_test)
lazy = LazyTransformer(model=RandomForestClassifier(), encoders='auto', scalers=None,
date_to_string=False, transform_target=False, imbalanced=False,
combine_rare=False, verbose=0)
lazy.fit(X_train, y_train)
lazy.predict(X_test)
Tips for using SuloClassifier and SuloRegressor for High Performance:
lazytransform has a very simple API with the following inputs. You need to create an sklearn-compatible transformer pipeline object by importing LazyTransformer from the lazytransform library.
Once you import it, you can define the object by giving several options such as:
Arguments
Caution: X_train and y_train must be pandas DataFrames or pandas Series. DO NOT send in numpy arrays. They won't work.
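If your data starts out as numpy arrays, convert it first; a minimal sketch (the column names here are placeholders, not required names):

```python
import pandas as pd

# Wrap numpy arrays in pandas objects before passing them to lazytransform
X_train = pd.DataFrame(X_train_np, columns=['feature_1', 'feature_2', 'feature_3'])
y_train = pd.Series(y_train_np, name='target')
```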
To view the pipeline as text (the default display is 'text'), do:
from sklearn import set_config
set_config(display="text")
lazy.xformer
To view the pipeline in a diagram (visual format), do:
from sklearn import set_config
set_config(display="diagram")
lazy.xformer
# If you have a model in the pipeline, do:
lazy.modelformer
To view the feature importances of the model in the pipeline, you can do:
lazy.plot_importance()
PRs accepted.
Apache License 2.0 © 2020 Ram Seshadri
This library would not have been possible without the following great libraries:
This project is not an official Google project. It is not supported by Google and Google specifically disclaims all warranties as to its quality, merchantability, or fitness for a particular purpose.