[FEATURE] CV solution for anomaly detection without outliers during training #307
Just to confirm: the proposal is to create a new CV method that accepts an outlier detector as part of its initialisation? I certainly see some merit to this idea. Got an example of what the API might look like?
I think I mean something slightly different. There are outlier detection methods that only work when there are no outliers in the training data; in that sense, they are more like novelty detection. Of course, these outliers are very important for properly evaluating the hyperparameters and performance in general. So let's say we have a small dataset in which one sample is a known outlier.

If the first four samples are in the training split, we want to remove the outlier sample from it. With regard to a possible implementation, I'm not super familiar with the types of arguments available. I know that CV iterators return indices, so filtering the known outlier indices out of each training fold might be enough.
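To make that index-filtering idea concrete, here is a minimal sketch; the toy data, the `y_outlier` label array, and the filtering line are all hypothetical, not an existing scikit-lego or scikit-learn API:

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: five samples, one known outlier (marked with 1).
X = np.array([[1.0], [1.1], [0.9], [10.0], [1.05]])
y_outlier = np.array([0, 0, 0, 1, 0])

cv = KFold(n_splits=5)  # any scikit-learn CV iterator yields index pairs like this
for train_idx, test_idx in cv.split(X):
    # Drop the known outliers from the training fold only;
    # the validation fold keeps them so the model is still scored on them.
    train_idx = train_idx[y_outlier[train_idx] == 0]
    print(train_idx, test_idx)
```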
Just for confirmation. This is the situation?
If you want to throw out novelty before passing it to another pipeline ... you're gonna need an outlier detector first, no? When you say:

> if we have a label for being an outlier

... that's sometimes called classification. Do you have a use case in mind here? There might certainly be something interesting here, but this discussion feels just a tad bit theoretical. What problem will this solve in real life? It deserves mentioning that our implementation of OutlierRemover seems relevant here.
I think it's the other way around @koaning. The train set should not contain the outliers, so if A is the training set in step 2, we remove the observation with X=1. There's not necessarily a link with other pipelines or models; the idea here, I think, is that if your outlier detector is a GMM, you don't want the known outliers in there, as they might skew your fit on 'normal' data. In validation you do want them, to evaluate your method.
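To see why the skew matters, here is a minimal sketch (the data and numbers are hypothetical) comparing a one-component Gaussian mixture fitted on inliers only against one fitted with a single extreme outlier included:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(100, 1))
X_outlier = np.array([[25.0]])  # one known, extreme outlier
X_all = np.vstack([X_normal, X_outlier])

gmm_clean = GaussianMixture(n_components=1, random_state=0).fit(X_normal)
gmm_skewed = GaussianMixture(n_components=1, random_state=0).fit(X_all)

# The outlier drags the estimated mean and inflates the variance,
# which distorts the density used to recognise 'normal' data.
print(gmm_clean.means_.ravel(), gmm_clean.covariances_.ravel())
print(gmm_skewed.means_.ravel(), gmm_skewed.covariances_.ravel())
```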
Then it should not be too difficult. One issue with this approach, however, is that it seems like you would have to implement it for every different CV strategy. Is there a way around this if you take that approach? It might be possible to extend the current CV strategies by inheritance and add an additional filter to the training fold, but this would require an additional pass over the data.
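A way around per-strategy implementations could be a thin wrapper that delegates splitting to any existing CV splitter and only filters the training indices. A minimal sketch; the class name `OutlierFilteredCV` and the `y_outlier` argument are hypothetical:

```python
import numpy as np

class OutlierFilteredCV:
    """Wrap any scikit-learn CV splitter, dropping known outliers
    from the training folds while leaving validation folds intact."""

    def __init__(self, cv, y_outlier):
        self.cv = cv
        self.y_outlier = np.asarray(y_outlier)

    def split(self, X, y=None, groups=None):
        for train_idx, test_idx in self.cv.split(X, y, groups):
            # Keep only the non-outlier samples in the training fold.
            yield train_idx[self.y_outlier[train_idx] == 0], test_idx

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.cv.get_n_splits(X, y, groups)
```

Since scikit-learn's model-selection utilities only require `split` and `get_n_splits`, an object like this can be passed as `cv=` to `cross_val_score` or `GridSearchCV` without touching the wrapped strategy.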
If there is still interest in this feature, I would be happy to give it a try; this looks like a nice feature to have. However, I have a couple of questions:
Pun intended ... maybe this class name? It sure sounds better than the alternative.
With anomaly detection, if you have labeled outliers, there are two types of models. The first type requires the outliers in the training data, although they are often unlabeled; isolation forests fall in this category. The second type, such as the one-class SVM, works better without the outliers in the training data, yet properly evaluating such a model does still require the outliers. The current scikit-learn setup does not allow for this case (I believe), so it would be nice to have a way to do this easily.

One possible approach would be a different type of validation iterator that returns only the negative (non-outlier) sample indices in the training fold, but both kinds in the validation fold.
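Such an iterator could look like the following minimal sketch; the function name `outlier_aware_split` and the convention that `y_outlier == 1` marks an outlier are hypothetical:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def outlier_aware_split(X, y_outlier, n_splits=5):
    """Yield (train, test) index pairs where the training fold contains
    only non-outlier samples and the validation fold keeps both."""
    y_outlier = np.asarray(y_outlier)
    # Stratify on the outlier label so every validation fold sees some outliers.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in cv.split(X, y_outlier):
        yield train_idx[y_outlier[train_idx] == 0], test_idx
```

Because scikit-learn's `cv` parameter also accepts an iterable of (train, test) index pairs, `list(outlier_aware_split(X, y_outlier))` can be handed directly to `cross_val_score` or `GridSearchCV`.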