[SPARK-37178][ML] fix doc 2
docs/ml-features.md

## TargetEncoder

[Target Encoding](https://www.researchgate.net/publication/220520258_A_Preprocessing_Scheme_for_High-Cardinality_Categorical_Attributes_in_Classification_and_Prediction_Problems) is a data-preprocessing technique that transforms high-cardinality categorical features into quasi-continuous scalar attributes suited for use in regression-type models. This paradigm maps individual values of an independent feature to a scalar, representing some estimate of the dependent attribute (meaning categorical values that exhibit similar statistics with respect to the target will have a similar representation).

By leveraging the relationship between categorical features and the target variable, Target Encoding usually performs better than One-Hot Encoding and does not require a final binary one-of-$K$ vector encoding. `TargetEncoder` can transform multiple columns, returning a single target-encoded output column for each input column.

User can specify the target column name by setting `label`.
`TargetEncoder` supports the `handleInvalid` parameter to choose how to handle invalid input, meaning categories not seen at training, when encoding new data. Available options include 'keep' (any invalid inputs are assigned to an extra categorical index) and 'error' (throw an exception).
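For instance, encoding a row with `feature` value 4 after fitting on the example data below (which only contains categories 1, 2 and 3) would assign it to an extra index under 'keep', or raise an exception under 'error'.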

`TargetEncoder` supports the `targetType` parameter to choose the label type when fitting data, affecting how estimates are calculated. Available options include 'binary' and 'continuous'.
When set to 'binary', the target attribute $Y$ is expected to be binary, $Y\in\{0,1\}$. The transformation maps individual values $X_{i}$ to the conditional probability of $Y$ given that $X=X_{i}$: $S_{i}=P(Y|X=X_{i})$. This approach is also known as $\it{bin-counting}$.
When set to 'continuous', the target attribute $Y$ is expected to be continuous, $Y\in\mathbb{R}$. The transformation maps individual values $X_{i}$ to the average of $Y$ given that $X=X_{i}$: $S_{i}=E[Y|X=X_{i}]$. This approach is also known as $\it{mean-encoding}$.

`TargetEncoder` supports the `smoothing` parameter to tune how in-category stats and overall stats are weighted.
High-cardinality categorical features are usually unevenly distributed across all possible values of $X$.
When calculating encodings $S_{i}$ according only to in-category statistics, rarely seen categories are very likely to cause overfitting when used in learning, as the direct estimate of $S_{i}$ is not reliable.
Smoothing prevents this behaviour by blending in-category estimates with overall estimates according to the relative size of the category within the whole dataset.
$S_{i}=\lambda(n_{i})P(Y|X=X_{i})+(1-\lambda(n_{i}))P(Y)$ for the binary case
$S_{i}=\lambda(n_{i})E[Y|X=X_{i}]+(1-\lambda(n_{i}))E[Y]$ for the continuous case
where the weighting factor $\lambda(n_{i})$ is a monotonically increasing function of $n_{i}$, bounded between 0 and 1. Typically $\lambda(n_{i})$ is chosen as the parametric function $\lambda(n_{i})=\frac{n_{i}}{n_{i}+m}$, where $m$ is the smoothing factor.
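
As a hypothetical worked case: with smoothing factor $m=1$, a category seen $n_{i}=3$ times gets weight $\lambda(3)=\frac{3}{3+1}=0.75$, so its encoding blends 75% of the in-category statistic with 25% of the overall statistic. With $m=0$, $\lambda(n_{i})=1$ and the in-category statistic is used directly, as in the examples below.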

**Examples**

To illustrate how `TargetEncoder` works, let's assume we have the following
DataFrame with a `feature` column and two target columns, `target (bin)` and `target (cont)`:

~~~~
feature | target | target
        | (bin)  | (cont)
--------|--------|--------
   1    |   0    |  1.3
   1    |   1    |  2.5
   1    |   0    |  1.6
   2    |   1    |  1.8
   2    |   0    |  2.4
   3    |   1    |  3.2
~~~~

Applying `TargetEncoder` with 'binary' target type,
`feature` as the input column, `target (bin)` as the label column
and `encoded` as the output column, we can fit a model
on the data to learn encodings and transform the data according
to these mappings:

~~~~
feature | target | encoded
        | (bin)  |
--------|--------|--------
   1    |   0    |  0.333
   1    |   1    |  0.333
   1    |   0    |  0.333
   2    |   1    |  0.5
   2    |   0    |  0.5
   3    |   1    |  1.0
~~~~
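
These encodings follow directly from the table: category 1 occurs three times with binary targets 0, 1 and 0, so its encoding is $P(Y=1|X=1)=\frac{1}{3}\approx0.333$; category 3 occurs once with target 1, giving 1.0. (This assumes no smoothing, i.e. $\lambda(n_{i})=1$.)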

Applying `TargetEncoder` with 'continuous' target type,
`feature` as the input column, `target (cont)` as the label column
and `encoded` as the output column, we can fit a model
on the data to learn encodings and transform the data according
to these mappings:

~~~~
feature | target | encoded
        | (cont) |
--------|--------|--------
   1    |  1.3   |  1.8
   1    |  2.5   |  1.8
   1    |  1.6   |  1.8
   2    |  1.8   |  2.1
   2    |  2.4   |  2.1
   3    |  3.2   |  3.2
~~~~
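
Likewise, category 1 is encoded as the in-category mean $\frac{1.3+2.5+1.6}{3}=1.8$ and category 2 as $\frac{1.8+2.4}{2}=2.1$, again assuming no smoothing.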

<div class="codetabs">

<div data-lang="python" markdown="1">
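
The snippet below is a minimal PySpark sketch (not the official Spark example) reproducing the binary case above. It assumes a `TargetEncoder` estimator in `pyspark.ml.feature` exposing the parameters described in this section (`inputCols`/`outputCols`, `labelCol`, `targetType`, `smoothing`):

~~~~
# Minimal sketch, assumed API: fit a TargetEncoder on the toy data above
# and apply the learned per-category encodings.
from pyspark.sql import SparkSession
from pyspark.ml.feature import TargetEncoder

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 0.0), (1, 1.0), (1, 0.0), (2, 1.0), (2, 0.0), (3, 1.0)],
    ["feature", "target"])

encoder = TargetEncoder(inputCols=["feature"], outputCols=["encoded"],
                        labelCol="target", targetType="binary",
                        smoothing=0.0)  # no smoothing: pure in-category stats

model = encoder.fit(df)
# With no smoothing, 'encoded' should match the binary table above
# (0.333 for category 1, 0.5 for category 2, 1.0 for category 3).
model.transform(df).show()
~~~~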
