
[FEATURE] CV solution for anomaly detection without outliers during training #307

Open
janvdvegt opened this issue Mar 16, 2020 · 8 comments · May be fixed by #595
Labels
enhancement New feature or request

Comments

@janvdvegt
Contributor

With anomaly detection, if you have labeled outliers there are two types of models. The first type requires the outliers during training, although they are usually unlabeled; isolation forests fall in this category. The second type, such as the one-class SVM, specifically works better without the outliers in the training data. Properly evaluating such a model does require the outliers, though. The current scikit-learn setup does not allow for this case (I believe). It would be nice to have a way to do this easily.

One possible approach would be to use a different type of validation iterator, that returns only negative sample indices in the training fold but both in the validation fold.

@janvdvegt janvdvegt added the enhancement New feature or request label Mar 16, 2020
@koaning
Owner

koaning commented Mar 16, 2020

Just to confirm: the proposal is to create a new CV method that accepts an outlier detector as part of its initialisation?

I certainly see some merit to this idea. Got an example of what the API might look like?

@janvdvegt
Contributor Author

I think I mean something slightly different. There are outlier detection methods that only work when there are no outliers in the training data; in that sense, they are more like novelty detection. Of course, these outliers are very important for properly evaluating the hyperparameters and performance in general. So let's say we have X, which contains our features, and y, which contains whether each sample is considered an outlier or not. y is important for evaluation, but we don't use it during training. However, for these novelty algorithms we want to throw out the positive samples in the training set, i.e. inside our CV loop.

Let's say we have the following dataset:

X   y (anomalous)
0   0
1   1
2   0
3   0
4   0
5   1
6   0

If the first four samples are in the training split we want to remove sample 1 because we don't want outliers in our training set but in the validation split we do not want to remove 5 because we need it for proper evaluation.

With regard to a possible implementation, I'm not super familiar with the types of arguments available. I know that CV iterators return indices, so if y is available it could just be a CV iterator that filters out the positive indices in the training set but keeps them in the evaluation set.
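The filtering described here can be shown on the toy dataset above. This is a minimal sketch of the idea only; the fold assignments are hypothetical, and the convention that y == 1 marks an anomaly comes from the example table:

```python
import numpy as np

# Toy data from the table above: y == 1 marks an anomaly.
y = np.array([0, 1, 0, 0, 0, 1, 0])

train_idx = np.array([0, 1, 2, 3])  # hypothetical training fold
test_idx = np.array([4, 5, 6])      # hypothetical validation fold

# Drop anomalous samples from the training fold only;
# the validation fold keeps its anomaly for evaluation.
train_idx = train_idx[y[train_idx] == 0]

print(train_idx.tolist())  # [0, 2, 3]
print(test_idx.tolist())   # [4, 5, 6]
```

Sample 1 is removed from the training fold, while sample 5 stays in the validation fold for evaluation.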

@koaning
Owner

koaning commented Mar 17, 2020

Just for confirmation. This is the situation?

X   y (anomalous)   y (to predict)  split
0   0               0               A
1   1               1000            A
2   0               2               A
3   0               3               B
4   0               4               B
5   1               10000           B
6   0               6               B
  1. We first generate a split, this gives us A, B.
  2. First A is the training set, we keep the outlier? We remove outliers in B before it is passed to another pipeline?
  3. Then B is the training set, we keep the outlier? We remove outliers in A before it is passed to another pipeline?

If you want to throw out novelty before passing it to another pipeline ... you're gonna need an outlier detector first, no? When you say:

So let's say we have X which contains our features and we have y that contains whether they are considered to be an outlier or not.

If we have a label for being an outlier ... that's sometimes called classification. Do you have a use case in mind here? There might certainly be something interesting here, but this discussion feels just a tad theoretical. What problem will this solve in real life?

It deserves mentioning: our implementation of OutlierRemover seems relevant here.

@MBrouns
Collaborator

MBrouns commented Mar 17, 2020

I think it's the other way around @koaning. The train set should not contain the outliers, so if A is the training set in step 2, we remove observation with X=1.

There's not necessarily a link with other pipelines or models. The idea here, I think, is that if your outlier detector is a GMM, you don't want the known outliers in the training data, as they might skew your fit on 'normal' data. In validation you do want them, to evaluate your method.

@MBrouns
Collaborator

MBrouns commented Mar 17, 2020

With regard to a possible implementation, I'm not super familiar with the types of arguments available. I know that CV iterators return indices, so if y is available it could just be a CV iterator that filters out the positive indices in the training set but keeps them in the evaluation set.

y is definitely available in the CV's split method; StratifiedKFold relies on this, for example.
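This can be seen with StratifiedKFold directly: its split method takes y and uses it to keep the label ratio similar across folds. A small sketch on the toy data from this thread:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(7).reshape(-1, 1)
y = np.array([0, 1, 0, 0, 0, 1, 0])  # 1 marks an anomaly

# split(X, y) uses y for stratification, so each validation fold
# here ends up with one of the two anomalous samples.
cv = StratifiedKFold(n_splits=2)
for train_idx, test_idx in cv.split(X, y):
    print(int(y[test_idx].sum()))  # anomalies per validation fold
```

The same signature means a custom CV iterator can inspect y inside split to filter training indices.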

@janvdvegt
Contributor Author

Then it should not be too difficult. One issue with this approach, however, is that it seems you would have to implement it for every different CV strategy. Is there a way around this? It might be possible to extend current CV strategies by inheritance and add an additional filter to the training fold, but this would require an additional pass over the data.
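One way around implementing this per strategy is composition instead of inheritance: wrap any existing CV object and filter its training indices. A minimal sketch, with the caveat that the class name AnomalyFreeTrainCV is purely illustrative (it is not part of any library) and y == 1 is assumed to mark an anomaly:

```python
import numpy as np
from sklearn.model_selection import KFold


class AnomalyFreeTrainCV:
    """Illustrative wrapper: delegates splitting to any scikit-learn CV
    object, then drops anomalous samples (y == 1) from each training
    fold while leaving the validation fold untouched."""

    def __init__(self, cv):
        self.cv = cv

    def split(self, X, y, groups=None):
        y = np.asarray(y)
        for train_idx, test_idx in self.cv.split(X, y, groups):
            yield train_idx[y[train_idx] == 0], test_idx

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.cv.get_n_splits(X, y, groups)


X = np.arange(7).reshape(-1, 1)
y = np.array([0, 1, 0, 0, 0, 1, 0])

cv = AnomalyFreeTrainCV(KFold(n_splits=2))
for train_idx, test_idx in cv.split(X, y):
    assert not y[train_idx].any()  # no outliers ever reach training
```

The extra filtering pass is a boolean mask over one fold of indices per split, so the overhead should be negligible compared to fitting the model.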

@FBruzzesi
Collaborator

If there is still interest in this feature, I would be happy to give it a try; it looks like a nice feature to have. However, I have a couple of questions:

  • Any candidate for a name 😂?
  • Should the test contain all the anomalous samples? Or should it follow a given CV strategy and drop anomalous points from training?

@koaning
Owner

koaning commented Oct 30, 2023

Pun intended ... maybe this classname: WithoutlierCV?

It sure sounds better than WithoutOutlierCV, but then again the latter more literally explains what it does without trying to be clever, so that's probably better.

@FBruzzesi FBruzzesi self-assigned this Nov 7, 2023
@FBruzzesi FBruzzesi linked a pull request Nov 7, 2023 that will close this issue