Outlier detection as a preprocessing step #135

Closed
Doppe1g4nger opened this issue Nov 20, 2018 · 1 comment

Comments

@Doppe1g4nger
Contributor

I think it would be worth adding an option to run an outlier detection algorithm (scikit-learn has some good ones) during the preprocessing stage. Based on the results, we could either throw out outliers that might hurt performance or dynamically switch the metric TPOT optimizes to one that is more outlier-resistant.
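A minimal sketch of what such a preprocessing step could look like, assuming scikit-learn's `IsolationForest` as the detector. The helper name `flag_outliers`, the `contamination` value, and the synthetic data are illustrative assumptions, not part of this project. It returns a mask instead of dropping rows, so the caller decides what to do with the flagged points.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def flag_outliers(X, contamination=0.02, random_state=0):
    """Return a boolean mask that is True for rows flagged as outliers.

    Hypothetical helper: the name and defaults are assumptions for this sketch.
    """
    detector = IsolationForest(contamination=contamination, random_state=random_state)
    # fit_predict returns -1 for outliers and 1 for inliers.
    labels = detector.fit_predict(X)
    return labels == -1


# Synthetic data, purely for illustration: 200 ordinary points plus 3 extreme ones.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(200, 2)),
    np.array([[8.0, 8.0], [9.0, -9.0], [-10.0, 10.0]]),
])

is_outlier = flag_outliers(X)
X_inliers = X[~is_outlier]  # what a "drop outliers" option would pass downstream
print(f"flagged {is_outlier.sum()} of {len(X)} rows")
```

Other scikit-learn detectors such as `LocalOutlierFactor` or `EllipticEnvelope` could be swapped in; which one suits a given dataset is an open question here.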

I thought of this because one of the datasets I'm working with has a few outliers, and I think they are causing TPOT to try really hard to find a model that drastically improves performance on those few points when it should instead be finding a marginally better fit for the vast majority of the data.
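To illustrate the effect with made-up numbers (not taken from the dataset mentioned above): when a few targets are extreme, a squared-error metric is dominated by those rows, while a median-based metric barely notices them. The function names below are local to this sketch.

```python
from statistics import mean, median


def mean_squared_error(y_true, y_pred):
    """Average squared error; large individual errors dominate the result."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def median_absolute_error(y_true, y_pred):
    """Median of absolute errors; insensitive to a small number of extreme rows."""
    return median(abs(t - p) for t, p in zip(y_true, y_pred))


# Invented example: 97 ordinary targets and 3 extreme ones, with a model
# that fits the ordinary rows well and misses the extreme rows badly.
y_true = [1.0] * 97 + [50.0] * 3
y_pred = [1.1] * 100

mse = mean_squared_error(y_true, y_pred)
medae = median_absolute_error(y_true, y_pred)

# Fraction of the total squared error that comes from the 3 extreme rows.
squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
outlier_share = sum(squared[-3:]) / sum(squared)

print(f"MSE={mse:.3f}  median abs error={medae:.3f}  outlier share={outlier_share:.4f}")
```

An optimizer scored on the squared-error metric here gains almost nothing from fitting the 97 ordinary rows better, which matches the behavior described above.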

@ardunn
Contributor

ardunn commented Nov 20, 2018

I like this idea, but we need to be careful not to discard outliers that people want. In matsci predictions we are often looking for outliers (which material is the hardest, most conductive, etc.), and being able to predict them is important.

For the Analytics part, the outlier analysis should mainly look at which predictions were farthest from their true values, and possibly why. We can also look at outliers based on the actual value.
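A rough sketch of both views, using invented values and hypothetical helper names (`largest_errors`, `extreme_targets`); it only ranks rows and does not attempt the "why".

```python
def largest_errors(y_true, y_pred, n=3):
    """Return the n rows with the largest absolute prediction error.

    Each row is (index, true value, predicted value, absolute error).
    """
    rows = [(i, t, p, abs(t - p)) for i, (t, p) in enumerate(zip(y_true, y_pred))]
    return sorted(rows, key=lambda row: row[3], reverse=True)[:n]


def extreme_targets(y_true, n=3):
    """Return the n (index, true value) pairs with the highest true values."""
    return sorted(enumerate(y_true), key=lambda pair: pair[1], reverse=True)[:n]


# Invented values for illustration only.
y_true = [2.0, 3.5, 40.0, 2.8, 3.1]
y_pred = [2.2, 3.0, 12.0, 2.9, 6.1]

for index, true, pred, err in largest_errors(y_true, y_pred, n=2):
    print(f"row {index}: true={true} predicted={pred} abs error={err:.2f}")

print(extreme_targets(y_true, n=2))
```

Rows that appear in both lists are extreme targets the model failed to predict, which is the case that matters most when the goal is to find outliers.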

ardunn closed this as completed Feb 8, 2019