[BUG] OutlierRemover doesn't work with supervised learning #342

Open
MBrouns opened this issue May 8, 2020 · 6 comments
Labels
bug Something isn't working

Comments

@MBrouns
Collaborator

MBrouns commented May 8, 2020

`X` and `y` will have different sizes after the outlier removal, because a transformer in the pipeline only changes `X`; we can't filter `y` there.
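A minimal sketch of the mismatch described above. `ToyOutlierDropper` is a hypothetical stand-in written for this illustration, not scikit-lego's `OutlierRemover`, and the data is synthetic; the point is only that a transformer which drops rows from `X` leaves `y` at its original length.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


class ToyOutlierDropper(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: drops the rows of X that a detector flags."""

    def fit(self, X, y=None):
        # contamination=0.05 is an arbitrary choice for this toy data
        self.detector_ = IsolationForest(contamination=0.05, random_state=0).fit(X)
        return self

    def transform(self, X):
        keep = self.detector_.predict(X) == 1  # 1 = inlier, -1 = outlier
        return X[keep]  # fewer rows than the input; y is never touched


rng = np.random.RandomState(0)
X = rng.normal(size=(100, 2))
X[:5] += 10  # a few obvious outliers
y = X[:, 0] + rng.normal(scale=0.1, size=100)

pipe = Pipeline([("drop", ToyOutlierDropper()), ("model", LinearRegression())])

try:
    pipe.fit(X, y)
    error = None
except ValueError as exc:
    # The final estimator receives a shortened X together with the full-length y
    error = str(exc)

print(error)
```

The transformer itself works fine in isolation; the failure only shows up once a supervised estimator sits downstream of it.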

MBrouns added the bug label May 8, 2020
@Matgrb

Matgrb commented May 22, 2020

I suppose this could be fixed by using the imbalanced-learn Pipeline, or by implementing our own pipeline that allows this kind of action.
What are your thoughts on it?

@koaning
Owner

koaning commented May 22, 2020

I like the idea and it's certainly a valid point, but we'd like this library to remain compatible with scikit-learn first and foremost. I can imagine that it gets complicated for users too if they need to figure out which components require which pipeline backend.

@MBrouns
Collaborator Author

MBrouns commented May 22, 2020

I've given it some thought and I do think that a custom pipeline will be the only way to properly support this. That said, there might be something to be said for reusing the imblearn pipeline here. If I recall correctly, they use a resample method rather than a transform. That would allow us to raise a warning in the transform of the outlier remover stating that, if it's used with supervised learning, it's probably better to use the imblearn pipelines.

@Matgrb were you thinking about picking this issue up?

@sephib

sephib commented Dec 14, 2020

Hi @MBrouns,
Thx for all the work on the package.
Until this issue is resolved, what is your strategy for implementing outlier detection within a pipeline? Do you just perform this step during pre-processing, rather than as part of the data splitting within the pipeline? (I've shamelessly added a link to my blog post related to splitting the data with a dataclass. Is this a relevant feature for scikit-lego?)
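As an illustration of the pre-processing option raised in the question, here is a minimal sketch that filters outliers outside the pipeline using plain scikit-learn. The detector choice, the `contamination` value, and the synthetic data are all assumptions made for this example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
X[:10] += 10  # a few obvious outliers
y = X[:, 0] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the detector on the training split only, then mask X and y together
detector = IsolationForest(contamination=0.05, random_state=0).fit(X_train)
keep = detector.predict(X_train) == 1  # 1 = inlier, -1 = outlier
X_clean, y_clean = X_train[keep], y_train[keep]

# The pipeline itself only contains steps that leave the row count unchanged
model = make_pipeline(StandardScaler(), LinearRegression()).fit(X_clean, y_clean)
test_predictions = model.predict(X_test)
```

The trade-off is that the detector lives outside the pipeline, so it has to be persisted and re-applied separately on every data split.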

@MBrouns
Collaborator Author

MBrouns commented Dec 18, 2020

Hm, I kind of like being able to do as much as possible inside a pipeline. It makes persistence much simpler and there's less chance that stuff will go wrong with data splits. I'll take a look at your approach though and see whether that fits!

@sephib

sephib commented Dec 20, 2020

> I kind of like being able to do as much as possible inside a pipeline. It makes persistence much simpler and there's less chance stuff will go wrong with data split

Totally agree. That's why I was wondering what your current solution for outliers is.
