From 8626295e1638b21e62bca6865df78cf217fefb69 Mon Sep 17 00:00:00 2001 From: rebo16v Date: Sun, 13 Oct 2024 21:55:09 +0200 Subject: [PATCH] [SPARK-37178][ML] fix doc 2 --- docs/ml-features.md | 61 +++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 56 insertions(+), 5 deletions(-) diff --git a/docs/ml-features.md b/docs/ml-features.md index a6456d7c4d89d..27af178366e51 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -857,7 +857,7 @@ for more details on the API. ## TargetEncoder -Target Encoding is a data-preprocessing technique that transforms high-cardinality categorical features into quasi-continuous scalar attributes suited for use in regression-type models. This paradigm maps individual values of an independent feature to a scalar, representing some estimate of the dependent attribute (meaning categorical values that exhibit similar statistics with respect to the target will have a similar representation). +[Target Encoding](https://www.researchgate.net/publication/220520258_A_Preprocessing_Scheme_for_High-Cardinality_Categorical_Attributes_in_Classification_and_Prediction_Problems) is a data-preprocessing technique that transforms high-cardinality categorical features into quasi-continuous scalar attributes suited for use in regression-type models. This paradigm maps individual values of an independent feature to a scalar, representing some estimate of the dependent attribute (meaning categorical values that exhibit similar statistics with respect to the target will have a similar representation). By leveraging the relationship between categorical features and the target variable, Target Encoding usually performs better than One-Hot and does not require a final binary $1-to-K$ vector encoding. `TargetEncoder` can transform multiple columns, returning a single target-encoded output column for each input column. @@ -868,19 +868,70 @@ User can specify the target column name by setting `label`. 
This column is expected to contain the target values. `TargetEncoder` supports the `handleInvalid` parameter to choose how to handle invalid input, meaning categories not seen at training, when encoding new data. Available options include 'keep' (any invalid inputs are assigned to an extra categorical index) and 'error' (throw an exception).

`TargetEncoder` supports the `targetType` parameter to choose the label type when fitting data, affecting how estimates are calculated. Available options include 'binary' and 'continuous'.

-- When set to 'binary', the target attribute $Y$ is expected to be binary, $Y\in{0,1}$. The transformation maps individual values $X_{i}$ to the conditional probability of $Y$ given that $X=X_{i}$. $S_{i}=P(Y|X=X_{i})$. This approach is also known as $\it{bin-counting}$.
-- When set to 'continuous', the target attribute $Y$ is expected to be continuous $Y\in\mathbb{Q}$. The transformation maps individual values $X_{i}$ to the average of $Y$ given that $X=X_{i}$. $S_{i}=E[Y|X=X_{i}$. This approach is also known as $\it{mean-encoding}$.
+When set to 'binary', the target attribute $Y$ is expected to be binary, $Y\in\{0,1\}$. The transformation maps individual values $X_{i}$ to the conditional probability of $Y$ given that $X=X_{i}$, $S_{i}=P(Y|X=X_{i})$. This approach is also known as $\it{bin-counting}$.
+When set to 'continuous', the target attribute $Y$ is expected to be continuous, $Y\in\mathbb{R}$. The transformation maps individual values $X_{i}$ to the average of $Y$ given that $X=X_{i}$, $S_{i}=E[Y|X=X_{i}]$. This approach is also known as $\it{mean-encoding}$.

`TargetEncoder` supports the `smoothing` parameter to tune how in-category stats and overall stats are weighted. High-cardinality categorical features are usually unevenly distributed across all possible values of $X$. 
When calculating encodings $S_{i}$ according only to in-class statistics, rarely seen categories are very likely to cause overfitting when used in learning, as the direct estimate of $S_{i}$ is not reliable. Smoothing prevents this behaviour by blending in-class estimates with overall estimates according to the relative size of this class on the whole dataset.

-- $S_{i}=\lambda(n_{i})P(Y|X=X_{i}) + (1-\lambda(n_{i}))P(Y)$ for the binary case
-- $S_{i}=\lambda(n_{i})E[Y|X=X_{i}] + (1-\lambda(n_{i}))E[Y]$ for the continuous case
+$S_{i}=\lambda(n_{i})P(Y|X=X_{i})+(1-\lambda(n_{i}))P(Y)$ for the binary case
+$S_{i}=\lambda(n_{i})E[Y|X=X_{i}]+(1-\lambda(n_{i}))E[Y]$ for the continuous case

where the weighting factor $\lambda(n_{i})$ is a monotonically increasing function of $n_{i}$, bounded between 0 and 1. Typically $\lambda(n_{i})$ is chosen as the parametric function $\lambda(n_{i})=\frac{n_{i}}{n_{i}+m}$, where $m$ is the smoothing factor.

**Examples**

+Let's assume we have the following DataFrame with a categorical column
+`feature` and two target columns, `target (bin)` (binary) and
+`target (cont)` (continuous):
+
+~~~~
+ feature | target | target
+         | (bin)  | (cont)
+---------|--------|--------
+    1    |   0    |  1.3
+    1    |   1    |  2.5
+    1    |   0    |  1.6
+    2    |   1    |  1.8
+    2    |   0    |  2.4
+    3    |   1    |  3.2
+~~~~
+
+Applying `TargetEncoder` with 'binary' target type,
+`feature` as the input column, `target (bin)` as the label column
+and `encoded` as the output column, we can fit a model
+on the data to learn encodings and transform the data according
+to these mappings:
+
+~~~~
+ feature | target | encoded
+         | (bin)  |
+---------|--------|---------
+    1    |   0    |  0.333
+    1    |   1    |  0.333
+    1    |   0    |  0.333
+    2    |   1    |  0.5
+    2    |   0    |  0.5
+    3    |   1    |  1.0
+~~~~
+
+Applying `TargetEncoder` with 'continuous' target type,
+`feature` as the input column, `target (cont)` as the label column
+and `encoded` as the output column, we can fit a model
+on the data to learn encodings and transform the data according
+to these mappings:
+
+~~~~
+ feature | target | encoded
+         | (cont) |
+---------|--------|---------
+    1    |  1.3   |   1.8
+    1    |  2.5   |   1.8
+    1    |  1.6   |   1.8
+    2    |  1.8   |   2.1
+    2    |  2.4   |   2.1
+    3    |  3.2   |   3.2
+~~~~
+
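As a worked check of the smoothing formula, here is a plain-Python sketch (not the Spark API) that blends per-category and global means with $\lambda(n_{i})=\frac{n_{i}}{n_{i}+m}$; with $m=0$ it reproduces the unsmoothed continuous encodings from the last table (1.8, 2.1, 3.2):

```python
# Smoothed target encoding for the continuous case:
#   S_i = lambda(n_i) * E[Y|X=X_i] + (1 - lambda(n_i)) * E[Y]
# with lambda(n_i) = n_i / (n_i + m), m being the smoothing factor.
def smoothed_encodings(features, targets, m):
    global_mean = sum(targets) / len(targets)
    stats = {}  # category -> (sum of targets, count)
    for x, y in zip(features, targets):
        s, c = stats.get(x, (0.0, 0))
        stats[x] = (s + y, c + 1)
    encodings = {}
    for x, (s, c) in stats.items():
        lam = c / (c + m)  # weight of the in-category estimate
        encodings[x] = lam * (s / c) + (1 - lam) * global_mean
    return encodings

features = [1, 1, 1, 2, 2, 3]
targets = [1.3, 2.5, 1.6, 1.8, 2.4, 3.2]
# m = 0 gives the plain per-category means from the table above
print(smoothed_encodings(features, targets, m=0))
# m > 0 pulls rare categories (like feature 3, seen once) toward the global mean
print(smoothed_encodings(features, targets, m=1))
```

Note how larger $m$ shrinks the encoding of the singleton category 3 away from its raw mean 3.2 toward the overall mean, which is exactly the overfitting protection described above.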