
[FEATURE] CV solution for anomaly detection without outliers during training #307

Open
janvdvegt opened this issue Mar 16, 2020 · 8 comments · May be fixed by #595
Labels
enhancement New feature or request

Comments

@janvdvegt
Contributor

With anomaly detection, if you have labeled outliers there are two types of models. The first type requires the outliers during training, although they are usually unlabeled; isolation forests fall in this category. The second type, such as the one-class SVM, specifically works better without the outliers in the training data. Properly evaluating such a model does require the outliers, though. The current scikit-learn setup does not allow for this case (I believe). It would be nice to have a way to do this easily.

One possible approach would be to use a different type of validation iterator, that returns only negative sample indices in the training fold but both in the validation fold.

@janvdvegt janvdvegt added the enhancement New feature or request label Mar 16, 2020
@koaning
Owner

koaning commented Mar 16, 2020

Just to confirm: the proposal is to create a new CV method that accepts an outlier detector as part of its initialisation?

I certainly see some merit to this idea. Got an example of what the API might look like?

@janvdvegt
Contributor Author

I think I mean something slightly different. There are outlier detection methods that only work when there are no outliers in the training data; in that sense, they are more like novelty detection. Of course, these outliers are very important for properly evaluating the hyperparameters and performance in general. So let's say we have X, which contains our features, and y, which contains whether each sample is considered an outlier or not. y is important for evaluation, but we don't use it during training. However, for these novelty algorithms we want to throw out the positive samples in the training set, i.e. inside our CV loop.

Let's say we have the following dataset:

X   y (anomalous)
0   0
1   1
2   0
3   0
4   0
5   1
6   0

If the first four samples are in the training split we want to remove sample 1 because we don't want outliers in our training set but in the validation split we do not want to remove 5 because we need it for proper evaluation.

With regard to a possible implementation, I'm not super familiar with the types of arguments available. I know that CV iterators return indices, so if y is available it could just be a CV iterator that filters out the positive indices in the training set but keeps them in the evaluation set.
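The filtering described here can be shown on the toy dataset above. This is a minimal sketch of the idea only; the fold assignments are hypothetical, and the convention that y == 1 marks an anomaly comes from the example table:

```python
import numpy as np

# Toy data from the table above: y == 1 marks an anomaly.
y = np.array([0, 1, 0, 0, 0, 1, 0])

train_idx = np.array([0, 1, 2, 3])  # hypothetical training fold
test_idx = np.array([4, 5, 6])      # hypothetical validation fold

# Drop anomalous samples from the training fold only;
# the validation fold keeps its anomaly for evaluation.
train_idx = train_idx[y[train_idx] == 0]

print(train_idx.tolist())  # [0, 2, 3]
print(test_idx.tolist())   # [4, 5, 6]
```

Sample 1 is removed from the training fold, while sample 5 stays in the validation fold for evaluation.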

@koaning
Owner

koaning commented Mar 17, 2020

Just for confirmation. This is the situation?

X   y (anomalous)   y (to predict)  split
0   0               0               A
1   1               1000            A
2   0               2               A
3   0               3               B
4   0               4               B
5   1               10000           B
6   0               6               B
  1. We first generate a split, this gives us A, B.
  2. First A is the training set, we keep the outlier? We remove outliers in B before it is passed to another pipeline?
  3. Then B is the training set, we keep the outlier? We remove outliers in A before it is passed to another pipeline?

If you want to throw out novelty before passing it to another pipeline ... you're gonna need an outlier detector first, no? When you say:

So let's say we have X which contains our features and we have y that contains whether they are considered to be an outlier or not.

If we have a label for being an outlier ... that's sometimes called classification. Do you have a use case in mind here? There might certainly be something interesting here, but this discussion feels just a tad theoretical. What problem will this solve in real life?

It deserves mentioning: our implementation of OutlierRemover seems relevant here.

@MBrouns
Collaborator

MBrouns commented Mar 17, 2020

I think it's the other way around @koaning. The train set should not contain the outliers, so if A is the training set in step 2, we remove observation with X=1.

There's not necessarily a link with other pipelines or models. The idea here, I think, is that if your outlier detector is a GMM, you don't want the known outliers in the training data, as they might skew your fit on 'normal' data. In validation you do want them, to evaluate your method.

@MBrouns
Collaborator

MBrouns commented Mar 17, 2020

With regard to a possible implementation, I'm not super familiar with the types of arguments available. I know that CV iterators return indices, so if y is available it could just be a CV iterator that filters out the positive indices in the training set but keeps them in the evaluation set.

y is definitely available in the CV's split method; StratifiedKFold relies on this, for example.
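This can be seen with StratifiedKFold directly: its split method takes y and uses it to keep the label ratio similar across folds. A small sketch on the toy data from this thread:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(7).reshape(-1, 1)
y = np.array([0, 1, 0, 0, 0, 1, 0])  # 1 marks an anomaly

# split(X, y) uses y for stratification, so each validation fold
# here ends up with one of the two anomalous samples.
cv = StratifiedKFold(n_splits=2)
for train_idx, test_idx in cv.split(X, y):
    print(int(y[test_idx].sum()))  # anomalies per validation fold
```

The same signature means a custom CV iterator can inspect y inside split to filter training indices.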

@janvdvegt
Contributor Author

Then it should not be too difficult. One issue with this approach, however, is that it seems you would have to implement it for every different CV strategy. Is there a way around this? It might be possible to extend current CV strategies by inheritance and add an additional filter to the training fold, but this would require an additional pass over the data.
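One way around implementing this per strategy is composition instead of inheritance: wrap any existing CV object and filter its training indices. A minimal sketch, with the caveat that the class name AnomalyFreeTrainCV is purely illustrative (it is not part of any library) and y == 1 is assumed to mark an anomaly:

```python
import numpy as np
from sklearn.model_selection import KFold


class AnomalyFreeTrainCV:
    """Illustrative wrapper: delegates splitting to any scikit-learn CV
    object, then drops anomalous samples (y == 1) from each training
    fold while leaving the validation fold untouched."""

    def __init__(self, cv):
        self.cv = cv

    def split(self, X, y, groups=None):
        y = np.asarray(y)
        for train_idx, test_idx in self.cv.split(X, y, groups):
            yield train_idx[y[train_idx] == 0], test_idx

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.cv.get_n_splits(X, y, groups)


X = np.arange(7).reshape(-1, 1)
y = np.array([0, 1, 0, 0, 0, 1, 0])

cv = AnomalyFreeTrainCV(KFold(n_splits=2))
for train_idx, test_idx in cv.split(X, y):
    assert not y[train_idx].any()  # no outliers ever reach training
```

The extra filtering pass is a boolean mask over one fold of indices per split, so the overhead should be negligible compared to fitting the model.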

@FBruzzesi
Collaborator

If there is still interest in this feature, I would be happy to give it a try; it looks like a nice feature to have. However, I have a couple of questions:

  • Any candidate for a name 😂?
  • Should the test contain all the anomalous samples? Or should it follow a given CV strategy and drop anomalous points from training?

@koaning
Owner

koaning commented Oct 30, 2023

Pun intended ... maybe this classname: WithoutlierCV?

It sure sounds better than WithoutOutlierCV, but then again the latter more literally explains what it does without trying to be clever, so that's probably better.

@FBruzzesi FBruzzesi self-assigned this Nov 7, 2023
@FBruzzesi FBruzzesi linked a pull request Nov 7, 2023 that will close this issue