`OneHotEncoder` can accidentally create columns with same name #201

lars-reimann · 2023-04-17T21:44:45Z

Describe the bug

The OneHotEncoder names the columns it creates using the schema <old_column_name>_<value>. This can lead to naming conflicts when a column name or a value itself contains an underscore.

To Reproduce

Run this program:

from safeds.data.tabular.containers import Table
from safeds.data.tabular.transformation import OneHotEncoder

if __name__ == '__main__':
    table = Table.from_dict({"a_b": ["c"], "a": ["b_c"]})
    transformed_table = OneHotEncoder().fit_and_transform(table)

    print(transformed_table)

It raises an exception:

ValueError: Length mismatch: Expected axis has 2 elements, new values have 1 elements

The issue is that two columns with the same name (a_b_c) get created.
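
The collision can be reproduced without the library. The following is a minimal sketch, assuming the name is built by joining the column name and the value with a single underscore, as the schema above describes. The helper `naive_name` is hypothetical and is not the library's actual code.

```python
def naive_name(column_name: str, value: str) -> str:
    """Hypothetical helper mirroring the <old_column_name>_<value> schema."""
    return f"{column_name}_{value}"


# Same data as in the reproduction program above.
data = {"a_b": ["c"], "a": ["b_c"]}

# Build one name per (column, value) pair.
names = [naive_name(col, val) for col, values in data.items() for val in values]

print(names)  # ['a_b_c', 'a_b_c'] -> both pairs map to the same name
```

Since the single underscore also appears inside the column name and the value, the two different pairs cannot be told apart once they are joined.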

Expected behavior

No exception. The names of all created columns should be unique. They should also not conflict with existing columns in the Table. This can be done by detecting conflicts between two created columns or between a created column and an existing, unchanged column and appending a suffix _<counter> to the names of the created columns (e.g. a_b_c_1 vs. a_b_c_2).

Screenshots (optional)

No response

Additional Context (optional)

No response


lars-reimann · 2023-04-28T10:43:22Z

For better readability, I'd suggest using the schema <column_name>__<value>(#<counter>)? (e.g. color__blue or color__red#2) for the names of the columns created by the OneHotEncoder. Double underscores should be rare, so in many cases we won't even need a counter. It also makes it easier for users to figure out which part is the column name and which is the value if either contains single underscores.

Only the names of duplicates need a counter. The first occurrence doesn't need to be changed. Counting should start at two. Example:

color__red
color__red#2
color__red#3
...
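
The proposed scheme could be sketched roughly as follows. This is only an illustration under assumptions, not the implementation that was merged: the function `unique_column_names` and its parameters `pairs` and `existing` are hypothetical names, and the `existing` argument stands for the names of unchanged columns that must not be reused.

```python
from collections.abc import Iterable


def unique_column_names(
    pairs: Iterable[tuple[str, str]],
    existing: Iterable[str] = (),
) -> list[str]:
    """Name columns as <column_name>__<value>, adding #<counter> on conflicts.

    Hypothetical sketch: the first occurrence keeps the plain name,
    later conflicting names get #2, #3, ... appended.
    """
    taken = set(existing)  # names that are already in use
    result = []
    for column_name, value in pairs:
        base = f"{column_name}__{value}"
        name = base
        counter = 1
        # Increase the counter until the candidate name is free.
        while name in taken:
            counter += 1
            name = f"{base}#{counter}"
        taken.add(name)
        result.append(name)
    return result


# The data from the bug report no longer collides.
print(unique_column_names([("a_b", "c"), ("a", "b_c")]))

# Real duplicates get a counter that starts at two.
print(unique_column_names([("color", "red")] * 3))
```

Checking each candidate against a set of taken names also covers the case where a name with a counter, such as color__red#2, already exists as a column in the table.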

lars-reimann · 2023-04-28T10:49:48Z

Also useful: https://docs.python.org/3/library/collections.html#collections.Counter

zzril · 2023-04-28T14:17:14Z

We decided that we will implement the OneHotEncoder ourselves, instead of using the one from scikit-learn.

We should also add performance tests to verify that our implementation is as efficient as the one in scikit-learn. The tests should be performed on several large datasets.
(These tests do not need to be run by pytest automatically.)

Closes #201.

### Summary of Changes

Changed OneHotEncoder to manually implement the encoding.

(Breaking) Changed the format of newly generated columns to use two underscores as separator. In case of naming conflicts, a hash and a unique ID will be appended to the column name.

---------

Co-authored-by: zzril <>
Co-authored-by: ilkajw <123072184+ilkajw@users.noreply.github.com>
Co-authored-by: megalinter-bot <129584137+megalinter-bot@users.noreply.github.com>

## [0.12.0](v0.11.0...v0.12.0) (2023-05-11)

### Features

* add `learning_rate` to AdaBoost classifier and regressor. ([#251](#251)) ([7f74440](7f74440)), closes [#167](#167)
* add alpha parameter to `lasso_regression` ([#232](#232)) ([b5050b9](b5050b9)), closes [#163](#163)
* add parameter `lasso_ratio` to `ElasticNetRegression` ([#237](#237)) ([4a1a736](4a1a736)), closes [#166](#166)
* Add parameter `number_of_tree` to `RandomForest` classifier and regressor ([#230](#230)) ([414336a](414336a)), closes [#161](#161)
* Added `Table.plot_boxplots` to plot a boxplot for each numerical column in the table ([#254](#254)) ([0203a0c](0203a0c)), closes [#156](#156) [#239](#239)
* Added `Table.plot_histograms` to plot a histogram for each column in the table ([#252](#252)) ([e27d410](e27d410)), closes [#157](#157)
* Added `Table.transform_table` method which returns the transformed Table ([#229](#229)) ([0a9ce72](0a9ce72)), closes [#110](#110)
* Added alpha parameter to `RidgeRegression` ([#231](#231)) ([1ddc948](1ddc948)), closes [#164](#164)
* Added Column#transform ([#270](#270)) ([40fb756](40fb756)), closes [#255](#255)
* Added method `Table.inverse_transform_table` which returns the original table ([#227](#227)) ([846bf23](846bf23)), closes [#111](#111)
* Added parameter `c` to `SupportVectorMachines` ([#267](#267)) ([a88eb8b](a88eb8b)), closes [#169](#169)
* Added parameter `maximum_number_of_learner` and `learner` to `AdaBoost` ([#269](#269)) ([bb5a07e](bb5a07e)), closes [#171](#171) [#173](#173)
* Added parameter `number_of_trees` to `GradientBoosting` ([#268](#268)) ([766f2ff](766f2ff)), closes [#170](#170)
* Allow arguments of type pathlib.Path for file I/O methods ([#228](#228)) ([2b58c82](2b58c82)), closes [#146](#146)
* convert `Schema` to `dict` and format it nicely in a notebook ([#244](#244)) ([ad1cac5](ad1cac5)), closes [#151](#151)
* Convert between Excel file and `Table` ([#233](#233)) ([0d7a998](0d7a998)), closes [#138](#138) [#139](#139)
* convert containers for tabular data to HTML ([#243](#243)) ([683c279](683c279)), closes [#140](#140)
* make `Column` a subclass of `Sequence` ([#245](#245)) ([a35b943](a35b943))
* mark optional hyperparameters as keyword only ([#296](#296)) ([44a41eb](44a41eb)), closes [#278](#278)
* move exceptions back to common package ([#295](#295)) ([a91172c](a91172c)), closes [#177](#177) [#262](#262)
* precision metric for classification ([#272](#272)) ([5adadad](5adadad)), closes [#185](#185)
* Raise error if an untagged table is used instead of a `TaggedTable` ([#234](#234)) ([8eea3dd](8eea3dd)), closes [#192](#192)
* recall and F1-score metrics for classification ([#277](#277)) ([2cf93cc](2cf93cc)), closes [#187](#187) [#186](#186)
* replace prefix `n` with `number_of` ([#250](#250)) ([f4f44a6](f4f44a6)), closes [#171](#171)
* set `alpha` parameter for regularization of `ElasticNetRegression` ([#238](#238)) ([e642d1d](e642d1d)), closes [#165](#165)
* Set `column_names` in `fit` methods of table transformers to be required ([#225](#225)) ([2856296](2856296)), closes [#179](#179)
* set learning rate of Gradient Boosting models ([#253](#253)) ([9ffaf55](9ffaf55)), closes [#168](#168)
* Support vector machine for regression and for classification ([#236](#236)) ([7f6c3bd](7f6c3bd)), closes [#154](#154)
* usable constructor for `Table` ([#294](#294)) ([56a1fc4](56a1fc4)), closes [#266](#266)
* usable constructor for `TaggedTable` ([#299](#299)) ([01c3ad9](01c3ad9)), closes [#293](#293)

### Bug Fixes

* OneHotEncoder no longer creates duplicate column names ([#271](#271)) ([f604666](f604666)), closes [#201](#201)
* selectively ignore one warning instead of all warnings ([#235](#235)) ([3aad07d](3aad07d))

lars-reimann · 2023-05-11T20:06:04Z

🎉 This issue has been resolved in version 0.12.0 🎉

The release is available on:

v0.12.0
GitHub release

Your semantic-release bot 📦🚀

lars-reimann added the bug 🪲 label Apr 17, 2023

lars-reimann changed the title ~~OneHotEncoder can create columns with same name~~ OneHotEncoder can accidentally create columns with same name Apr 17, 2023

lars-reimann added this to Library Apr 17, 2023

github-project-automation bot moved this to Backlog in Library Apr 17, 2023

lars-reimann mentioned this issue Apr 17, 2023

feat: OneHotEncoder.inverse_transform now maintains the column order from the original table #195

Merged

zzril assigned zzril and ilkajw Apr 28, 2023

alex-senger moved this from Backlog to Todo in Library Apr 28, 2023

alex-senger moved this from Todo to In Progress in Library Apr 28, 2023

zzril linked a pull request May 5, 2023 that will close this issue

fix: OneHotEncoder no longer creates duplicate column names #271

Merged

zzril mentioned this issue May 8, 2023

fix: OneHotEncoder no longer creates duplicate column names #271

Merged

zzril moved this from In Progress to Ready for Review in Library May 9, 2023

lars-reimann closed this as completed in #271 May 10, 2023

github-project-automation bot moved this from Ready for Review to ✔️ Done in Library May 10, 2023

lars-reimann added the released Included in a release label May 11, 2023
