[BUG] OutlierRemover doesn't work with supervised learning #342

MBrouns · 2020-05-08T13:57:02Z

X and y will have different sizes after the outlier removal, because a transformer in a scikit-learn pipeline can only filter X and has no way to filter y.
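A minimal sketch of the failure mode, assuming a toy row-dropping transformer. `DropLargeRows` is a hypothetical stand-in for illustration, not scikit-lego's `OutlierRemover`. The transformer shrinks X inside the pipeline while y keeps its original length, so the final estimator rejects the mismatched inputs.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


class DropLargeRows(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that drops rows with any large absolute value."""

    def __init__(self, threshold=3.0):
        self.threshold = threshold

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        keep = np.all(np.abs(X) <= self.threshold, axis=1)
        # Only X is filtered here; transform() never sees y.
        return X[keep]


X = np.array([[0.1], [0.2], [10.0], [0.3]])
y = np.array([1.0, 2.0, 30.0, 3.0])

pipe = Pipeline([("drop", DropLargeRows()), ("model", LinearRegression())])

try:
    pipe.fit(X, y)
    error = None
except ValueError as exc:
    # The model receives 3 rows of X but 4 values of y.
    error = exc

print(type(error).__name__)
```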

Matgrb · 2020-05-22T09:37:24Z

I suppose this could be fixed by using the imbalanced-learn Pipeline, or by implementing our own pipeline that allows this kind of action.
What are your thoughts on it?

koaning · 2020-05-22T19:36:39Z

I like the idea and it's certainly a valid statement, but we'd like this library to remain compatible primarily with scikit-learn. I can imagine that it gets complicated for users too if they need to figure out which components require which pipeline backend.

MBrouns · 2020-05-22T19:48:41Z

I've given it some thought and I do think that a custom pipeline will be the only way to properly support this. That said, there might be something to say for reusing the imblearn pipeline here. If I recall correctly, they use a resample method rather than a transform. That allows us to raise a warning in the transform of the outlier remover, stating that if it's used with supervised learning it's probably better to use the imblearn pipelines.
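A rough sketch of the resample idea, under stated assumptions: `ResamplingOutlierRemover` and `fit_with_resampling` are hypothetical names for illustration, not part of scikit-lego or imbalanced-learn, and the median-distance rule is just a placeholder detector. The point is that a `fit_resample(X, y)` method can filter X and y together, while `transform` only warns. imbalanced-learn's `Pipeline` applies sampler steps through `fit_resample` during fitting, which is the behaviour the helper below imitates.

```python
import warnings

import numpy as np
from sklearn.base import BaseEstimator
from sklearn.linear_model import LinearRegression


class ResamplingOutlierRemover(BaseEstimator):
    """Hypothetical remover that filters X and y together via fit_resample."""

    def __init__(self, threshold=3.0):
        self.threshold = threshold

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.median_ = np.median(X, axis=0)
        return self

    def _keep_mask(self, X):
        # Placeholder rule: keep rows close to the per-feature median.
        return np.all(np.abs(X - self.median_) <= self.threshold, axis=1)

    def fit_resample(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        mask = self.fit(X)._keep_mask(X)
        # The same mask is applied to both, so their lengths stay aligned.
        return X[mask], y[mask]

    def transform(self, X):
        warnings.warn(
            "transform() cannot filter y; for supervised learning, "
            "consider a pipeline that supports fit_resample.",
            UserWarning,
        )
        X = np.asarray(X, dtype=float)
        return X[self._keep_mask(X)]


def fit_with_resampling(sampler, model, X, y):
    """Hypothetical helper: resample at fit time only, then fit the model."""
    X_res, y_res = sampler.fit_resample(X, y)
    return model.fit(X_res, y_res)


X = np.array([[0.1], [0.2], [10.0], [0.3]])
y = np.array([1.0, 2.0, 30.0, 3.0])

sampler = ResamplingOutlierRemover(threshold=3.0)
X_res, y_res = sampler.fit_resample(X, y)
model = fit_with_resampling(sampler, LinearRegression(), X, y)
predictions = model.predict(X)  # prediction does not drop any rows
```

Because rows are only removed while fitting, prediction still returns one value per input row, which keeps the fitted model usable on unseen data.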

@Matgrb were you thinking about picking this issue up?

sephib · 2020-12-14T14:19:21Z

Hi @MBrouns,
Thx for all the work on the package.
Until this issue is resolved, what is your strategy for implementing outlier detection within a pipeline? Do you just perform this step during pre-processing, rather than as part of the data splitting within the pipeline? (I've shamelessly added a link to my blog post related to splitting the data with a dataclass. Is this a relevant feature for scikit-lego?)

MBrouns · 2020-12-18T10:00:29Z

Hm I kind of like being able to do as much as possible inside a pipeline. It makes persistence much simpler and there's less chance stuff will go wrong with data splits. I'll take a look at your approach though and see whether that fits!

sephib · 2020-12-20T21:21:16Z

> I kind of like being able to do as much as possible inside a pipeline. It makes persistence much simpler and there's less chance stuff will go wrong with data split

Totally agree. That's why I was wondering what your current solution for outliers is.

MBrouns added the bug (Something isn't working) label on May 8, 2020

FBruzzesi mentioned this issue Mar 23, 2024

[BUG] Rename transform_train to resample. #643

Open
