This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Refine Feature Selector (#1778)
xuehui1991 authored and chicm-ms committed Nov 26, 2019
1 parent fbe2586 commit 13d6b15
Showing 3 changed files with 363 additions and 2 deletions.
256 changes: 255 additions & 1 deletion docs/en_US/FeatureEngineering/Overview.md
@@ -1,3 +1,257 @@
# FeatureEngineering

We are glad to announce the alpha release of the Feature Engineering toolkit on top of NNI. It is still experimental and might evolve based on user feedback. We invite you to use it, give feedback, and even contribute.

For now, we support the following feature selectors:
- [GradientFeatureSelector](./GradientFeatureSelector.md)
- [GBDTSelector](./GBDTSelector.md)


# How to use?

```python
from sklearn.model_selection import train_test_split

from nni.feature_engineering.gradient_selector import FeatureGradientSelector
# from nni.feature_engineering.gbdt_selector import GBDTSelector

# load data
...
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# initialize a selector
fgs = FeatureGradientSelector(...)
# fit data
fgs.fit(X_train, y_train)
# get important features
# this returns the indices of the important features
print(fgs.get_selected_features(...))

...
```

When using a built-in selector, first `import` it and initialize it. Then call the selector's `fit` function to pass in the data, and call `get_selected_features` to get the important features. Function parameters may differ between selectors, so check each selector's documentation before using it.

# How to customize?

NNI provides _state-of-the-art_ feature selection algorithms as built-in selectors. NNI also lets you build a feature selector yourself.

If you want to implement a customized feature selector, you need to:

1. Inherit the base FeatureSelector class
1. Implement the _fit_ and _get_selected_features_ functions
1. Integrate with sklearn (Optional)

Here is an example:

**1. Inherit the base FeatureSelector Class**

```python
from nni.feature_engineering.feature_selector import FeatureSelector

class CustomizedSelector(FeatureSelector):
    def __init__(self, ...):
        ...
```

**2. Implement the _fit_ and _get_selected_features_ Functions**

```python
from nni.feature_engineering.feature_selector import FeatureSelector

class CustomizedSelector(FeatureSelector):
    def __init__(self, ...):
        ...

    def fit(self, X, y, **kwargs):
        """
        Fit the training data to FeatureSelector
        Parameters
        ------------
        X : array-like numpy matrix
            The training input samples, with shape [n_samples, n_features].
        y : array-like numpy matrix
            The target values (class labels in classification, real numbers in regression), with shape [n_samples].
        """
        self.X = X
        self.y = y
        ...

    def get_selected_features(self):
        """
        Get important features
        Returns
        -------
        list :
            The indices of the important features.
        """
        ...
        return self.selected_features_

    ...
```
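
As a concrete illustration of the two functions above, here is a minimal sketch of a customized selector that keeps the features most correlated with the target. Everything in it is an assumption made for illustration: `TopCorrelationSelector` and its `n_features` parameter are hypothetical, and the small `FeatureSelector` class is only a stand-in so that the snippet runs without NNI installed. In real code you would import the base class from `nni.feature_engineering.feature_selector` instead.

```python
from statistics import mean


class FeatureSelector:
    """Stand-in for NNI's base class (assumption, so the sketch runs without NNI)."""

    def fit(self, X, y, **kwargs):
        raise NotImplementedError

    def get_selected_features(self):
        raise NotImplementedError


class TopCorrelationSelector(FeatureSelector):
    """Hypothetical selector: keep the k features most correlated with y."""

    def __init__(self, n_features=2):
        self.n_features = n_features
        self.selected_features_ = None

    @staticmethod
    def _abs_corr(col, y):
        # Absolute Pearson correlation; constant columns score 0.
        mc, my = mean(col), mean(y)
        cov = sum((c - mc) * (t - my) for c, t in zip(col, y))
        var_c = sum((c - mc) ** 2 for c in col)
        var_y = sum((t - my) ** 2 for t in y)
        if var_c == 0 or var_y == 0:
            return 0.0
        return abs(cov) / (var_c * var_y) ** 0.5

    def fit(self, X, y, **kwargs):
        n_cols = len(X[0])
        scores = [self._abs_corr([row[j] for row in X], y) for j in range(n_cols)]
        # Rank columns by score and remember the indices of the best ones.
        ranked = sorted(range(n_cols), key=lambda j: scores[j], reverse=True)
        self.selected_features_ = sorted(ranked[:self.n_features])
        return self

    def get_selected_features(self):
        return self.selected_features_


X = [[1.0, 5.0, 0.0], [2.0, 5.0, 1.0], [3.0, 5.0, 0.0], [4.0, 5.0, 1.0]]
y = [1.0, 2.0, 3.0, 4.0]
selector = TopCorrelationSelector(n_features=1).fit(X, y)
print(selector.get_selected_features())
```

The fitted state is stored in an attribute with a trailing underscore (`selected_features_`), matching the name returned by `get_selected_features` in the template above.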

**3. Integrate with Sklearn**

`sklearn.pipeline.Pipeline` can connect models in series, such as a feature selector, normalization, and classification/regression, to form a typical machine learning workflow.
The following steps help integrate with sklearn, so that the customized feature selector can be treated as a module of the pipeline.

1. Inherit the class _sklearn.base.BaseEstimator_
1. Implement the _get_params_ and _set_params_ functions in _BaseEstimator_
1. Inherit the class _sklearn.feature_selection.base.SelectorMixin_
1. Implement the _get_support_, _transform_ and _inverse_transform_ functions in _SelectorMixin_

Here is an example:

**1. Inherit the BaseEstimator Class and Implement Its Functions**

```python
from sklearn.base import BaseEstimator
from nni.feature_engineering.feature_selector import FeatureSelector

class CustomizedSelector(FeatureSelector, BaseEstimator):
    def __init__(self, ...):
        ...

    def get_params(self, deep=True):
        """
        Get parameters for this estimator.
        """
        params = self.__dict__
        # attributes ending with '_' hold fitted state, not parameters
        params = {key: val for (key, val) in params.items()
                  if not key.endswith('_')}
        return params

    def set_params(self, **params):
        """
        Set the parameters of this estimator.
        """
        for param in params:
            # only attributes that already exist are updated
            if hasattr(self, param):
                setattr(self, param, params[param])
        return self

```
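
The effect of these two functions can be checked on a small stand-alone class. The sketch below reuses the same logic; `ParamsDemo` and its `n_features` and `learning_rate` parameters are hypothetical names chosen for illustration, not part of NNI or sklearn.

```python
class ParamsDemo:
    """Hypothetical class reusing the get_params/set_params logic shown above."""

    def __init__(self, n_features=10, learning_rate=0.1):
        self.n_features = n_features
        self.learning_rate = learning_rate
        self.selected_features_ = None  # fitted state, normally set by fit()

    def get_params(self, deep=True):
        # Attributes ending with '_' are treated as fitted state and skipped.
        return {k: v for k, v in self.__dict__.items() if not k.endswith('_')}

    def set_params(self, **params):
        for name, value in params.items():
            # Names that are not existing attributes are silently ignored.
            if hasattr(self, name):
                setattr(self, name, value)
        return self


demo = ParamsDemo()
demo.selected_features_ = [0, 3]
print(demo.get_params())   # fitted state is not reported as a parameter
demo.set_params(n_features=5, unknown_option=1)
print(demo.get_params())   # n_features updated, unknown_option ignored
```

One consequence of this design is that a misspelled parameter name passed to `set_params` is dropped without an error, so it is worth double-checking names when tuning a selector this way.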

**2. Inherit the SelectorMixin Class and Implement Its Functions**
```python
from sklearn.base import BaseEstimator
# newer scikit-learn releases expose SelectorMixin from sklearn.feature_selection instead
from sklearn.feature_selection.base import SelectorMixin

from nni.feature_engineering.feature_selector import FeatureSelector

class CustomizedSelector(FeatureSelector, BaseEstimator, SelectorMixin):
    def __init__(self, ...):
        ...

    def get_params(self, deep=True):
        """
        Get parameters for this estimator.
        """
        params = self.__dict__
        params = {key: val for (key, val) in params.items()
                  if not key.endswith('_')}
        return params

    def set_params(self, **params):
        """
        Set the parameters of this estimator.
        """
        for param in params:
            if hasattr(self, param):
                setattr(self, param, params[param])
        return self

    def get_support(self, indices=False):
        """
        Get a mask, or integer index, of the features selected.
        Parameters
        ----------
        indices : bool
            Default False. If True, the return value will be an array of integers, rather than a boolean mask.
        Returns
        -------
        list :
            An index that selects the retained features from a feature vector.
            If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention.
            If indices is True, this is an integer array of shape [# output features] whose values
            are indices into the input feature vector.
        """
        ...
        return mask


    def transform(self, X):
        """Reduce X to the selected features.
        Parameters
        ----------
        X : array
            with shape [n_samples, n_features]
        Returns
        -------
        X_r : array
            with shape [n_samples, n_selected_features]
            The input samples with only the selected features.
        """
        ...
        return X_r


    def inverse_transform(self, X):
        """
        Reverse the transformation operation
        Parameters
        ----------
        X : array
            with shape [n_samples, n_selected_features]
        Returns
        -------
        X_r : array
            with shape [n_samples, n_original_features]
        """
        ...
        return X_r
```
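
To make the relationship between the mask, the indices and the two transforms concrete, here is a minimal sketch on plain Python lists. The free functions and the `fill` parameter are assumptions for illustration only; in a real selector they would be methods that read the fitted state from `self`, and sklearn's own `SelectorMixin` fills removed columns with zeros.

```python
def get_support(selected, n_features, indices=False):
    """Return the selected feature indices, or a boolean mask over all features."""
    if indices:
        return sorted(selected)
    chosen = set(selected)
    return [j in chosen for j in range(n_features)]


def transform(X, mask):
    """Keep only the columns whose mask entry is True."""
    return [[v for v, keep in zip(row, mask) if keep] for row in X]


def inverse_transform(X_r, mask, fill=0.0):
    """Restore the original width, filling dropped columns with `fill`."""
    restored = []
    for row in X_r:
        values = iter(row)
        restored.append([next(values) if keep else fill for keep in mask])
    return restored


X = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
mask = get_support([1, 3], n_features=4)
X_r = transform(X, mask)               # shape [n_samples, n_selected_features]
X_back = inverse_transform(X_r, mask)  # shape [n_samples, n_original_features]
print(mask, X_r, X_back)
```

Note that `inverse_transform` cannot recover the dropped values; it only restores the original shape.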

After integrating with Sklearn, we could use the feature selector as follows:
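```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# load data
...
X_train, y_train = ...

# build a pipeline with the customized selector
pipeline = make_pipeline(XXXSelector(...), LogisticRegression())
# alternatively, use a selector that ships with sklearn (this assignment replaces the one above)
pipeline = make_pipeline(SelectFromModel(ExtraTreesClassifier(n_estimators=50)), LogisticRegression())
pipeline.fit(X_train, y_train)

# score
print("Pipeline Score: ", pipeline.score(X_train, y_train))

```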

# Benchmark

`Baseline` means that no feature selection is applied and the data is passed directly to LogisticRegression. For this benchmark, only 10% of the training data is held out as test data.

| Dataset | Baseline | GradientFeatureSelector | TreeBasedClassifier | #Train | #Feature |
| ----------- | ------ | ------ | ------- | ------- | -------- |
| colon-cancer | 0.7547 | 0.7368 | 0.7223 | 62 | 2,000 |
| gisette | 0.9725 | 0.89416 | 0.9792 | 6,000 | 5,000 |
| avazu | 0.8834 | N/A | N/A | 40,428,967 | 1,000,000 |
| rcv1 | 0.9644 | 0.7333 | 0.9615 | 20,242 | 47,236 |
| news20.binary | 0.9208 | 0.6870 | 0.9070 | 19,996 | 1,355,191 |
| real-sim | 0.9681 | 0.7969 | 0.9591 | 72,309 | 20,958 |

The benchmark datasets can be downloaded [here](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/).
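
The datasets on that page are distributed in the sparse LIBSVM text format, where each line holds a label followed by `index:value` pairs. The benchmark script reads them with `sklearn.datasets.load_svmlight_file`; the sketch below only illustrates the line format, and `parse_libsvm_line` is a hypothetical helper, not part of NNI or sklearn.

```python
def parse_libsvm_line(line):
    """Parse '<label> <index>:<value> ...' into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        # Features that are absent from the line are implicitly zero.
        features[int(index)] = float(value)
    return label, features


label, features = parse_libsvm_line("-1 3:0.5 10:1 42:-2.25")
print(label, features)
```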

@@ -0,0 +1,107 @@
# Copyright (c) Microsoft Corporation
# All rights reserved.
#
# MIT License
#
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and
# to permit persons to whom the Software is furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING
# BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

import bz2
import urllib.request
import numpy as np

import os

from sklearn.datasets import load_svmlight_file
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

from nni.feature_engineering.gradient_selector import FeatureGradientSelector


class Benchmark():

    def __init__(self, files, test_size = 0.2):
        self.files = files
        self.test_size = test_size


    def run_all_test(self, pipeline):
        for file_name in self.files:
            file_path = self.files[file_name]

            self.run_test(pipeline, file_name, file_path)


    def run_test(self, pipeline, name, path):
        print("download " + name)
        update_name = self.download(name, path)
        X, y = load_svmlight_file(update_name)
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.test_size, random_state=42)

        pipeline.fit(X_train, y_train)
        print("[Benchmark "+ name + " Score]: ", pipeline.score(X_test, y_test))


    def download(self, name, path):
        old_name = name + '_train.bz2'
        update_name = name + '_train.svm'

        if os.path.exists(old_name) and os.path.exists(update_name):
            return update_name

        urllib.request.urlretrieve(path, filename=old_name)

        f_svm = open(update_name, 'wt')
        with bz2.open(old_name, 'rb') as f_zip:
            data = f_zip.read()
            f_svm.write(data.decode('utf-8'))
        f_svm.close()

        return update_name


if __name__ == "__main__":
    LIBSVM_DATA = {
        "rcv1" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/rcv1_train.binary.bz2",
        # "avazu" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/avazu-app.bz2",
        # NOTE: the "colon-cancer" entry below points at the covtype archive, not a colon-cancer file
        "colon-cancer" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/covtype.libsvm.binary.bz2",
        "gisette" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/gisette_scale.bz2",
        # "kdd2010" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/kdda.bz2",
        # "kdd2012" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/kdd12.bz2",
        "news20.binary" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/news20.binary.bz2",
        "real-sim" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/real-sim.bz2",
        "webspam" : "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/webspam_wc_normalized_trigram.svm.bz2"
    }

    test_benchmark = Benchmark(LIBSVM_DATA)

    pipeline1 = make_pipeline(LogisticRegression())
    print("Test all data in LogisticRegression.")
    print()
    test_benchmark.run_all_test(pipeline1)

    pipeline2 = make_pipeline(FeatureGradientSelector(n_features=20), LogisticRegression())
    print("Test data selected by FeatureGradientSelector in LogisticRegression.")
    print()
    test_benchmark.run_all_test(pipeline2)

    pipeline3 = make_pipeline(SelectFromModel(ExtraTreesClassifier(n_estimators=50)), LogisticRegression())
    print("Test data selected by TreeClassifier in LogisticRegression.")
    print()
    test_benchmark.run_all_test(pipeline3)

    print("Done.")
@@ -49,7 +49,7 @@
print("selected features\t", fgs.get_selected_features())

pipeline = make_pipeline(FeatureGradientSelector(n_epochs=1, n_features=10), LogisticRegression())
pipeline = make_pipeline(SelectFromModel(ExtraTreesClassifier(n_estimators=50)), LogisticRegression())
# pipeline = make_pipeline(SelectFromModel(ExtraTreesClassifier(n_estimators=50)), LogisticRegression())
pipeline.fit(X_train, y_train)

print("Pipeline Score: ", pipeline.score(X_train, y_train))
