`OneHotEncoder` can accidentally create columns with same name #201
For better readability, I'd suggest a different schema. Only the names of duplicates need a counter. The first occurrence needn't be changed. Counting should start at two.
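A minimal sketch of this suggestion, not the commenter's own example: the helper name `deduplicate_names` and the `_` separator before the counter are assumptions made for illustration. The first occurrence of a name is kept, and later occurrences get a counter that starts at two.

```python
def deduplicate_names(names: list[str]) -> list[str]:
    """Append a counter to repeated names; the first occurrence stays unchanged."""
    seen: dict[str, int] = {}
    result = []
    for name in names:
        seen[name] = seen.get(name, 0) + 1
        # Counting starts at two, so only the second and later occurrences change.
        # The "_" separator is an assumption, not taken from the suggestion.
        result.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
    return result


print(deduplicate_names(["a_b_c", "a_b_c", "a_d"]))  # ['a_b_c', 'a_b_c_2', 'a_d']
```

Note that this sketch does not check whether a generated name such as `a_b_c_2` collides with another name that is already present.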
We decided that we will implement the `OneHotEncoder` ourselves instead of using the one from scikit-learn. We should also add performance tests to verify that our implementation is as efficient as the one in scikit-learn. The tests should be performed on several large datasets.
Closes #201.

### Summary of Changes

Changed OneHotEncoder to manually implement the encoding. (Breaking) Changed the format of newly generated columns to use two underscores as separator. In case of naming conflicts, a hash and a unique ID will be appended to the column name.

---------

Co-authored-by: zzril <>
Co-authored-by: ilkajw <123072184+ilkajw@users.noreply.github.com>
Co-authored-by: megalinter-bot <129584137+megalinter-bot@users.noreply.github.com>
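The naming rule from this summary can be sketched in plain Python. This is not the code from the pull request: the function `one_hot_encode`, the dict-of-lists table representation, and the reading of "a hash and a unique ID" as `#` followed by a running number (with the first occurrence left unchanged) are all assumptions made for illustration.

```python
def one_hot_encode(columns: dict[str, list[str]]) -> dict[str, list[int]]:
    """One-hot encode every column, naming new columns <column>__<value>."""
    encoded: dict[str, list[int]] = {}
    name_counts: dict[str, int] = {}
    for column, values in columns.items():
        # dict.fromkeys keeps the distinct values in first-seen order.
        for value in dict.fromkeys(values):
            base_name = f"{column}__{value}"
            name_counts[base_name] = name_counts.get(base_name, 0) + 1
            count = name_counts[base_name]
            # Assumed conflict format: "#" plus a running ID for repeated names.
            name = base_name if count == 1 else f"{base_name}#{count}"
            encoded[name] = [int(v == value) for v in values]
    return encoded


table = {"a": ["x__y", "z"], "a__x": ["y", "y"]}
print(list(one_hot_encode(table)))  # ['a__x__y', 'a__z', 'a__x__y#2']
```

The double underscore makes accidental conflicts less likely than a single underscore, but it cannot rule them out, which is why a conflict suffix is still needed.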
## [0.12.0](v0.11.0...v0.12.0) (2023-05-11)

### Features

* add `learning_rate` to AdaBoost classifier and regressor. ([#251](#251)) ([7f74440](7f74440)), closes [#167](#167)
* add alpha parameter to `lasso_regression` ([#232](#232)) ([b5050b9](b5050b9)), closes [#163](#163)
* add parameter `lasso_ratio` to `ElasticNetRegression` ([#237](#237)) ([4a1a736](4a1a736)), closes [#166](#166)
* Add parameter `number_of_tree` to `RandomForest` classifier and regressor ([#230](#230)) ([414336a](414336a)), closes [#161](#161)
* Added `Table.plot_boxplots` to plot a boxplot for each numerical column in the table ([#254](#254)) ([0203a0c](0203a0c)), closes [#156](#156) [#239](#239)
* Added `Table.plot_histograms` to plot a histogram for each column in the table ([#252](#252)) ([e27d410](e27d410)), closes [#157](#157)
* Added `Table.transform_table` method which returns the transformed Table ([#229](#229)) ([0a9ce72](0a9ce72)), closes [#110](#110)
* Added alpha parameter to `RidgeRegression` ([#231](#231)) ([1ddc948](1ddc948)), closes [#164](#164)
* Added Column#transform ([#270](#270)) ([40fb756](40fb756)), closes [#255](#255)
* Added method `Table.inverse_transform_table` which returns the original table ([#227](#227)) ([846bf23](846bf23)), closes [#111](#111)
* Added parameter `c` to `SupportVectorMachines` ([#267](#267)) ([a88eb8b](a88eb8b)), closes [#169](#169)
* Added parameter `maximum_number_of_learner` and `learner` to `AdaBoost` ([#269](#269)) ([bb5a07e](bb5a07e)), closes [#171](#171) [#173](#173)
* Added parameter `number_of_trees` to `GradientBoosting` ([#268](#268)) ([766f2ff](766f2ff)), closes [#170](#170)
* Allow arguments of type pathlib.Path for file I/O methods ([#228](#228)) ([2b58c82](2b58c82)), closes [#146](#146)
* convert `Schema` to `dict` and format it nicely in a notebook ([#244](#244)) ([ad1cac5](ad1cac5)), closes [#151](#151)
* Convert between Excel file and `Table` ([#233](#233)) ([0d7a998](0d7a998)), closes [#138](#138) [#139](#139)
* convert containers for tabular data to HTML ([#243](#243)) ([683c279](683c279)), closes [#140](#140)
* make `Column` a subclass of `Sequence` ([#245](#245)) ([a35b943](a35b943))
* mark optional hyperparameters as keyword only ([#296](#296)) ([44a41eb](44a41eb)), closes [#278](#278)
* move exceptions back to common package ([#295](#295)) ([a91172c](a91172c)), closes [#177](#177) [#262](#262)
* precision metric for classification ([#272](#272)) ([5adadad](5adadad)), closes [#185](#185)
* Raise error if an untagged table is used instead of a `TaggedTable` ([#234](#234)) ([8eea3dd](8eea3dd)), closes [#192](#192)
* recall and F1-score metrics for classification ([#277](#277)) ([2cf93cc](2cf93cc)), closes [#187](#187) [#186](#186)
* replace prefix `n` with `number_of` ([#250](#250)) ([f4f44a6](f4f44a6)), closes [#171](#171)
* set `alpha` parameter for regularization of `ElasticNetRegression` ([#238](#238)) ([e642d1d](e642d1d)), closes [#165](#165)
* Set `column_names` in `fit` methods of table transformers to be required ([#225](#225)) ([2856296](2856296)), closes [#179](#179)
* set learning rate of Gradient Boosting models ([#253](#253)) ([9ffaf55](9ffaf55)), closes [#168](#168)
* Support vector machine for regression and for classification ([#236](#236)) ([7f6c3bd](7f6c3bd)), closes [#154](#154)
* usable constructor for `Table` ([#294](#294)) ([56a1fc4](56a1fc4)), closes [#266](#266)
* usable constructor for `TaggedTable` ([#299](#299)) ([01c3ad9](01c3ad9)), closes [#293](#293)

### Bug Fixes

* OneHotEncoder no longer creates duplicate column names ([#271](#271)) ([f604666](f604666)), closes [#201](#201)
* selectively ignore one warning instead of all warnings ([#235](#235)) ([3aad07d](3aad07d))
🎉 This issue has been resolved in version 0.12.0 🎉

Your semantic-release bot 📦🚀
### Describe the bug

The `OneHotEncoder` uses the schema `<old_column_name>_<value>` to name the created columns. This can lead to conflicts, however.

### To Reproduce

Run this program:

It raises an exception:

The issue is that two columns with the same name (`a_b_c`) get created.

### Expected behavior

No exception. The names of all created columns should be unique. They should also not conflict with existing columns in the `Table`. This can be done by detecting conflicts between two created columns or between a created column and an existing, unchanged column and appending a suffix `_<counter>` to the names of the created columns (e.g. `a_b_c_1` vs. `a_b_c_2`).

### Screenshots (optional)

No response

### Additional Context (optional)
No response
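The conflict from the bug description, and the detection step proposed under the expected behavior, can be illustrated without the library. The sketch below is not the program from the report. The data (a column `a` holding `b_c` and a column `a_b` holding `c`) and the helpers `old_schema_names` and `conflicting_names` are hypothetical and only show how the schema `<old_column_name>_<value>` can yield `a_b_c` twice.

```python
from collections import Counter


def old_schema_names(columns: dict[str, list[str]]) -> list[str]:
    """Name one new column per distinct value using <old_column_name>_<value>."""
    return [
        f"{column}_{value}"
        for column, values in columns.items()
        for value in dict.fromkeys(values)  # distinct values, first-seen order
    ]


def conflicting_names(created: list[str], existing: list[str]) -> set[str]:
    """Return names that occur more than once among created and existing columns."""
    counts = Counter(created + existing)
    return {name for name, count in counts.items() if count > 1}


# Hypothetical data: both columns produce the name "a_b_c".
created = old_schema_names({"a": ["b_c"], "a_b": ["c"]})
print(created)  # ['a_b_c', 'a_b_c']
print(conflicting_names(created, existing=[]))  # {'a_b_c'}
```

Passing the unchanged column names as `existing` covers the second case from the expected behavior, a created column clashing with a column that was not encoded.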