Data preparation is the process of transforming raw data into a form that is more appropriate for modeling.
- Removal of column variables that have only a single value.
- Consideration of column variables with very few unique values.
- Removal of rows that contain duplicate values.
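A minimal pandas sketch of these basic cleaning steps; the DataFrame and its column names are made up for illustration:

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({
    "constant": [1, 1, 1, 1],          # single unique value
    "quasi_constant": [0, 0, 0, 1],    # very few unique values
    "feature": [10, 20, 20, 30],
})

# Drop columns that contain a single unique value.
single_value_cols = [col for col in df.columns if df[col].nunique() == 1]
df = df.drop(columns=single_value_cols)

# Flag columns with very few unique values for closer inspection.
low_cardinality_cols = [col for col in df.columns if df[col].nunique() <= 2]

# Drop duplicate rows.
df = df.drop_duplicates()
```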
- Performing mean or median imputation (use the mean if the variable is normally distributed, otherwise the median; may distort the distribution if a high percentage of the data is missing).
- Implementing mode or frequent-category imputation (may distort the original distribution of categories if the percentage of missing values is high).
- Replacing missing values with an arbitrary number (suited to data that is not missing at random and to non-linear models; may distort the distribution if a high percentage of the data is missing).
- Capturing missing values in a bespoke category.
- Replacing missing values with a value at the end of the distribution (end-of-tail imputation may distort the distribution of the original variable and may not be suitable for linear models).
- Implementing random sample imputation (preserves the original distribution).
- Adding a missing value indicator variable.
- Performing multivariate imputation by chained equations.
- Assembling an imputation pipeline with scikit-learn.
- Assembling an imputation pipeline with Feature-engine.
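A hedged sketch of an imputation pipeline along the lines of the scikit-learn recipe above; the dataset and column names are hypothetical, and scikit-learn's IterativeImputer offers a MICE-style alternative for the multivariate case:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Hypothetical dataset with missing values.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [50_000, 62_000, np.nan, 48_000],
    "city": ["NY", np.nan, "SF", "NY"],
})

numeric_cols = ["age", "income"]
categorical_cols = ["city"]

# Median imputation (plus missing-value indicators) for numeric columns,
# frequent-category imputation for categorical columns.
imputer = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="median", add_indicator=True), numeric_cols),
    ("cat", SimpleImputer(strategy="most_frequent"), categorical_cols),
])

pipe = Pipeline(steps=[("imputation", imputer)])
imputed = pipe.fit_transform(df)
```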
- Creating binary variables through one-hot encoding (a categorical variable with k unique categories can be encoded in k-1 binary variables; encode into k binary variables when determining the importance of each category within a variable or when training decision trees).
- Performing one-hot encoding of frequent categories. In addition to being slightly less redundant, a dummy-variable (k-1) representation is required for some models. For example, in a linear regression model (and other regression models that have a bias term), a full one-hot encoding makes the matrix of input data singular, meaning it cannot be inverted and the regression coefficients cannot be calculated using linear algebra. For these models, a dummy-variable encoding must be used instead.
- Replacing categories with ordinal numbers (better suited for nonlinear machine learning models).
- Replacing categories with counts or frequency of observations.
- Encoding with integers in an ordered manner.
- Encoding with the mean of the target.
- Encoding with the Weight of Evidence.
- Grouping rare or infrequent categories.
- Performing binary encoding.
- Performing feature hashing.
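A small sketch of the k versus k-1 encodings described above, using pandas.get_dummies; the colour column is invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "blue", "green", "blue"]})

# Full one-hot encoding: k binary variables (useful for tree-based models
# and for inspecting the importance of each category).
one_hot = pd.get_dummies(df["colour"], prefix="colour")

# Dummy-variable encoding: k-1 binary variables (avoids the singular
# design matrix that breaks linear models with an intercept).
dummies = pd.get_dummies(df["colour"], prefix="colour", drop_first=True)
```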
- Trimming outliers from the dataset
- Performing winsorization
- Capping the variable at arbitrary maximum and minimum values
- Performing zero-coding – capping the variable values at zero
Outlier capping methods: Winsorizer, ArbitraryOutlierCapper
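A possible sketch of winsorization-style capping with Feature-engine's Winsorizer; the import path assumes feature-engine 1.x, and the income data is made up:

```python
import pandas as pd
from feature_engine.outliers import Winsorizer  # feature-engine >= 1.0

# Hypothetical skewed variable with an extreme value.
df = pd.DataFrame({"income": [30_000, 35_000, 40_000, 42_000, 1_000_000]})

# Cap values beyond 1.5 * IQR on both tails.
capper = Winsorizer(capping_method="iqr", tail="both", fold=1.5,
                    variables=["income"])
df_capped = capper.fit_transform(df)
```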
Discretisation methods: EqualFrequencyDiscretiser, EqualWidthDiscretiser, DecisionTreeDiscretiser, ArbitraryDiscretiser
- Transforming variables with the logarithm.
- Transforming variables with the reciprocal function.
- Using square and cube root to transform variables.
- Using power transformations on numerical variables.
- Performing Box-Cox transformation on numerical variables.
- Performing Yeo-Johnson transformation on numerical variables.
Transformation methods: LogTransformer, LogCpTransformer, ReciprocalTransformer, PowerTransformer, BoxCoxTransformer, YeoJohnsonTransformer
Scikit-learn wrapper: SklearnTransformerWrapper
Variable creation: MathematicalCombination, CombineWithReferenceFeature, CyclicalTransformer
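A brief sketch of the log, Box-Cox, and Yeo-Johnson transformations with NumPy and scikit-learn's PowerTransformer; the variable x is hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Hypothetical right-skewed variable.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 10.0, 100.0]})

# Simple log transformation (requires strictly positive values).
df["x_log"] = np.log(df["x"])

# Box-Cox (positive values only) and Yeo-Johnson (any real values).
boxcox = PowerTransformer(method="box-cox")
yeojohnson = PowerTransformer(method="yeo-johnson")
df["x_boxcox"] = boxcox.fit_transform(df[["x"]]).ravel()
df["x_yeojohnson"] = yeojohnson.fit_transform(df[["x"]]).ravel()
```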
- Dividing the variable into intervals of equal width
- Sorting the variable values in intervals of equal frequency
- Performing discretization followed by categorical encoding
- Allocating the variable values in arbitrary intervals
- Performing discretization with k-means clustering
- Using decision trees for discretization
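One way to sketch equal-width, equal-frequency, and k-means discretisation is scikit-learn's KBinsDiscretizer; decision-tree discretisation would instead use Feature-engine's DecisionTreeDiscretiser. The data and bin counts below are illustrative:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical continuous variable.
X = np.array([[1.0], [2.0], [5.0], [7.0], [20.0], [21.0]])

# Equal-width bins, equal-frequency bins, and k-means bins.
equal_width = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
equal_freq = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
kmeans_bins = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans")

print(equal_width.fit_transform(X).ravel())
print(equal_freq.fit_transform(X).ravel())
print(kmeans_bins.fit_transform(X).ravel())
```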
Feature selection methods are intended to reduce the number of input variables to those that are believed to be most useful to a model in order to predict the target variable.
- Categorical input data with a categorical target variable (chi-squared statistic and the mutual information statistic). Mutual information is calculated between two variables and measures the reduction in uncertainty for one variable given a known value of the other.
- Numerical input data and a categorical (class) target variable (ANOVA F-statistic, mutual information).
- Numerical input data and a numerical target variable (correlation statistics, mutual information).
- Using RFE (recursive feature elimination) for feature selection.
- Feature importance: techniques that assign a score to input features based on how useful they are at predicting the target variable.
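A rough sketch of a filter method and RFE on a built-in scikit-learn dataset; the choice of 10 features and the decision-tree estimator are arbitrary:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Filter method: keep the 10 features with the highest mutual information
# with the (categorical) target.
X_filtered = SelectKBest(score_func=mutual_info_classif, k=10).fit_transform(X, y)

# Wrapper method: recursive feature elimination, dropping the weakest
# feature at each step according to the tree's feature importances.
rfe = RFE(estimator=DecisionTreeClassifier(random_state=0),
          n_features_to_select=10)
X_rfe = rfe.fit_transform(X, y)
```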
- Extracting date and time parts from a datetime variable
- Deriving representations of the year and month
- Creating representations of day and week
- Extracting time parts from a time variable
- Capturing the elapsed time between datetime variables
- Working with time in different time zones
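A short pandas sketch of these datetime recipes; the signup and last_login columns are invented:

```python
import pandas as pd

df = pd.DataFrame({
    "signup": pd.to_datetime(["2021-01-15 08:30", "2021-06-01 17:45"]),
    "last_login": pd.to_datetime(["2021-02-01 09:00", "2021-06-03 11:15"]),
})

# Extract date and time parts.
df["signup_year"] = df["signup"].dt.year
df["signup_month"] = df["signup"].dt.month
df["signup_dayofweek"] = df["signup"].dt.dayofweek
df["signup_hour"] = df["signup"].dt.hour

# Elapsed time between two datetime variables, in days.
df["days_to_login"] = (df["last_login"] - df["signup"]).dt.days

# Localise to UTC and convert to another time zone.
df["signup_utc"] = df["signup"].dt.tz_localize("UTC")
df["signup_nyc"] = df["signup_utc"].dt.tz_convert("America/New_York")
```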
- Standardizing the features
- Performing mean normalization
- Scaling to the maximum and minimum values
- Implementing maximum absolute scaling
- Scaling with the median and quantiles
- Scaling to vector unit length
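A compact sketch of the scaling recipes with scikit-learn; mean normalization is done by hand, since scikit-learn has no dedicated scaler for it, and the data is illustrative:

```python
import numpy as np
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler, Normalizer,
                                   RobustScaler, StandardScaler)

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 1000.0]])

X_std = StandardScaler().fit_transform(X)      # zero mean, unit variance
X_minmax = MinMaxScaler().fit_transform(X)     # scale to [0, 1]
X_maxabs = MaxAbsScaler().fit_transform(X)     # scale by maximum absolute value
X_robust = RobustScaler().fit_transform(X)     # scale with median and IQR
X_unit = Normalizer().fit_transform(X)         # scale each row to unit norm

# Mean normalization: centre on the mean, scale by the value range.
X_meannorm = (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```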
- Performing SVD dimensionality reduction.
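A minimal sketch of SVD-based dimensionality reduction with scikit-learn's TruncatedSVD on a built-in dataset; the choice of 10 components is arbitrary:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import TruncatedSVD

X, _ = load_digits(return_X_y=True)    # 64 input features

# Project the data onto the top 10 singular vectors.
svd = TruncatedSVD(n_components=10, random_state=0)
X_reduced = svd.fit_transform(X)

print(X_reduced.shape)                 # (n_samples, 10)
print(svd.explained_variance_ratio_.sum())
```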