This repository has been archived by the owner on Sep 18, 2024. It is now read-only.
Refine Feature Selector #1778
Merged
chicm-ms
merged 39 commits into
microsoft:master
from
xuehui1991:diff_feature_selection
Nov 26, 2019
185f698
first update
xuehui1991 306ecf1
update by folder naming
xuehui1991 61b4719
add gradient feature selection example
xuehui1991 5371111
add examples
xuehui1991 405ba9c
delete unused example
xuehui1991 024e9c7
update by pylint
xuehui1991 0488949
update by pylint
xuehui1991 b5c865c
update learnability by info from pylint
xuehui1991 9d9d118
fix pylint in fgtrain
xuehui1991 15d416a
update fginitlize and learnability by pylint
xuehui1991 39c99b5
update by evan's response
xuehui1991 4364f8a
add gbdt_selector
xuehui1991 d2d8328
update gbdt_selector
xuehui1991 5420202
refine the example folder structure
xuehui1991 635f0d9
update feature engineering doc
xuehui1991 11290dc
update docs of feature selector
xuehui1991 0b11826
update doc of gradientfeature selector
xuehui1991 319abe5
update docs of GBDTSelector
xuehui1991 4a3338c
update examples of gradientfeature selector
xuehui1991 ef0899f
update folder structure
xuehui1991 e43cfef
update docs by folder structure
xuehui1991 565c211
test pylint
xuehui1991 d710d8f
test
xuehui1991 1497999
Merge remote-tracking branch 'upstream/master' into diff_feature_sele…
xuehui1991 9c509a6
update by pylint
xuehui1991 7050556
update by pylint
xuehui1991 63ce6a0
update docs and remove some dependency
xuehui1991 cee67af
remove unused code
xuehui1991 0845ce9
update by comments
xuehui1991 d1c6ac0
update by comments
xuehui1991 4ef2bb7
move the feature selection example path
xuehui1991 f86342b
delete unused dependency
xuehui1991 d251063
update the doc of overview
xuehui1991 4d31c4d
Merge remote-tracking branch 'upstream/master' into diff_feature_sele…
xuehui1991 5d60191
add benchmark
xuehui1991 34495d1
update by comments
xuehui1991 05afff6
update minus issue
xuehui1991 7ac0303
fix minus issue
xuehui1991 258451a
update by minus change
xuehui1991
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,257 @@ | ||
# FeatureEngineering | ||
|
||
We are glad to announce the alpha release of the Feature Engineering toolkit on top of NNI. It is still experimental and may evolve based on user feedback. We invite you to try it, give feedback, and contribute. | ||
|
||
For now, we support the following feature selectors: | ||
- [GradientFeatureSelector](./GradientFeatureSelector.md) | ||
- [GBDTSelector](./GBDTSelector.md) | ||
|
||
|
||
# How to use? | ||
|
||
```python | ||
from sklearn.model_selection import train_test_split | ||
from nni.feature_engineering.gradient_selector import FeatureGradientSelector | ||
# from nni.feature_engineering.gbdt_selector import GBDTSelector | ||
|
||
# load data | ||
... | ||
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42) | ||
|
||
# initialize a selector | ||
fgs = FeatureGradientSelector(...) | ||
# fit data | ||
fgs.fit(X_train, y_train) | ||
# get important features | ||
# returns the indices of the important features | ||
print(fgs.get_selected_features(...)) | ||
|
||
... | ||
``` | ||
|
||
When using a built-in selector, first `import` it and initialize it. Then call `fit` to pass the training data to the selector, and call `get_selected_features` to get the important features. Parameters differ between selectors, so check each selector's documentation before using it. | ||
|
||
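Because `get_selected_features` returns column indices rather than data, the usual next step is to keep only those columns. The following is a minimal sketch in plain Python lists so that it runs without NNI installed; `select_columns` is a hypothetical helper, not part of NNI, and the `selected` list stands in for the output of a fitted selector.

```python
def select_columns(rows, selected):
    """Keep only the columns whose indices appear in `selected`."""
    return [[row[i] for i in selected] for row in rows]


# Toy data: 3 samples, 4 features.
X_train = [
    [0.1, 5.0, 0.0, 2.0],
    [0.2, 4.0, 0.0, 1.0],
    [0.3, 6.0, 0.0, 3.0],
]

# Stand-in for the indices a fitted selector would return (assumption).
selected = [1, 3]

X_reduced = select_columns(X_train, selected)
print(X_reduced)  # [[5.0, 2.0], [4.0, 1.0], [6.0, 3.0]]
```

With NumPy arrays the same reduction is usually written as `X_train[:, selected]`.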
# How to customize? | ||
|
||
NNI provides _state-of-the-art_ feature selection algorithms as built-in selectors. NNI also lets you build a feature selector yourself. | ||
|
||
If you want to implement a customized feature selector, you need to: | ||
|
||
1. Inherit the base FeatureSelector class | ||
1. Implement the _fit_ and _get_selected_features_ functions | ||
1. Integrate with sklearn (Optional) | ||
|
||
Here is an example: | ||
|
||
**1. Inherit the base FeatureSelector Class** | ||
|
||
```python | ||
from nni.feature_engineering.feature_selector import FeatureSelector | ||
|
||
class CustomizedSelector(FeatureSelector): | ||
    def __init__(self, ...): | ||
        ... | ||
``` | ||
|
||
**2. Implement the _fit_ and _get_selected_features_ Functions** | ||
|
||
```python | ||
from nni.feature_engineering.feature_selector import FeatureSelector | ||
|
||
class CustomizedSelector(FeatureSelector): | ||
    def __init__(self, ...): | ||
        ... | ||
|
||
    def fit(self, X, y, **kwargs): | ||
        """ | ||
        Fit the training data to FeatureSelector | ||
|
||
        Parameters | ||
        ---------- | ||
        X : array-like numpy matrix | ||
            The training input samples, of shape [n_samples, n_features]. | ||
        y : array-like numpy matrix | ||
            The target values (class labels in classification, real numbers in regression), of shape [n_samples]. | ||
        """ | ||
        self.X = X | ||
        self.y = y | ||
        ... | ||
|
||
    def get_selected_features(self): | ||
        """ | ||
        Get the important features | ||
|
||
        Returns | ||
        ------- | ||
        list : | ||
            The indices of the important features. | ||
        """ | ||
        ... | ||
        return self.selected_features_ | ||
|
||
    ... | ||
``` | ||
|
||
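To make the two required methods concrete, here is a small self-contained sketch. It assumes a simple "keep the highest-variance columns" rule, which is not one of NNI's algorithms. The `FeatureSelector` class below is a stand-in for `nni.feature_engineering.feature_selector.FeatureSelector` so the example runs without NNI, and `TopVarianceSelector` is a hypothetical name.

```python
from statistics import pvariance


class FeatureSelector:
    """Stand-in for NNI's base class (assumption, so the sketch runs without NNI)."""

    def fit(self, X, y, **kwargs):
        raise NotImplementedError

    def get_selected_features(self):
        raise NotImplementedError


class TopVarianceSelector(FeatureSelector):
    """Hypothetical selector that keeps the n_features highest-variance columns."""

    def __init__(self, n_features=2):
        self.n_features = n_features
        self.selected_features_ = None  # set by fit()

    def fit(self, X, y=None, **kwargs):
        # Population variance of each column; y is unused by this toy rule.
        variances = [pvariance(column) for column in zip(*X)]
        ranked = sorted(range(len(variances)),
                        key=lambda i: variances[i], reverse=True)
        self.selected_features_ = sorted(ranked[:self.n_features])
        return self

    def get_selected_features(self):
        return self.selected_features_


X = [
    [1.0, 10.0, 0.5, 3.0],
    [1.0, 20.0, 0.4, 1.0],
    [1.0, 30.0, 0.6, 2.0],
]
selector = TopVarianceSelector(n_features=2).fit(X, y=[0, 1, 0])
print(selector.get_selected_features())  # [1, 3]
```

Storing the result in an attribute with a trailing underscore follows the convention used in the template above, where learned state lives in `selected_features_`.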
**3. Integrate with Sklearn** | ||
|
||
`sklearn.pipeline.Pipeline` chains steps in series, such as feature selection, normalization, and classification/regression, to form a typical machine learning workflow. | ||
The following steps help integrate with sklearn, so that the customized feature selector can be used as a module of the pipeline. | ||
|
||
1. Inherit the class _sklearn.base.BaseEstimator_ | ||
1. Implement the _get_params_ and _set_params_ functions in _BaseEstimator_ | ||
1. Inherit the class _sklearn.feature_selection.base.SelectorMixin_ | ||
1. Implement the _get_support_, _transform_ and _inverse_transform_ functions in _SelectorMixin_ | ||
|
||
Here is an example: | ||
|
||
**1. Inherit the BaseEstimator Class and its Function** | ||
|
||
```python | ||
from sklearn.base import BaseEstimator | ||
from nni.feature_engineering.feature_selector import FeatureSelector | ||
|
||
class CustomizedSelector(FeatureSelector, BaseEstimator): | ||
    def __init__(self, ...): | ||
        ... | ||
|
||
    def get_params(self, ...): | ||
        """ | ||
        Get parameters for this estimator. | ||
        """ | ||
        params = self.__dict__ | ||
        params = {key: val for (key, val) in params.items() | ||
                  if not key.endswith('_')} | ||
        return params | ||
|
||
    def set_params(self, **params): | ||
        """ | ||
        Set the parameters of this estimator. | ||
        """ | ||
        for param in params: | ||
            if hasattr(self, param): | ||
                setattr(self, param, params[param]) | ||
        return self | ||
|
||
``` | ||
|
||
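The `get_params` / `set_params` pair above relies on a naming convention: attributes ending in an underscore are treated as learned state and are excluded from the parameters. A minimal sketch of that behavior follows; `ToySelector` and its parameter names are hypothetical, and the class deliberately does not inherit from sklearn so that it runs on its own.

```python
class ToySelector:
    """Hypothetical selector used only to illustrate the parameter convention."""

    def __init__(self, n_features=10, threshold=0.5):
        self.n_features = n_features      # hyperparameter
        self.threshold = threshold        # hyperparameter
        self.selected_features_ = None    # learned state (trailing underscore)

    def get_params(self, deep=True):
        # Report every attribute except learned state.
        return {key: val for key, val in self.__dict__.items()
                if not key.endswith('_')}

    def set_params(self, **params):
        # Update only attributes that already exist; unknown names are ignored.
        for name, value in params.items():
            if hasattr(self, name):
                setattr(self, name, value)
        return self


selector = ToySelector()
selector.selected_features_ = [0, 2]          # pretend fit() has run
selector.set_params(n_features=5, unknown=1)  # 'unknown' is silently skipped
print(selector.get_params())  # {'n_features': 5, 'threshold': 0.5}
```

Note that scikit-learn's own `BaseEstimator` already supplies `get_params` and `set_params` based on the `__init__` signature, and its `set_params` raises an error for unknown names instead of ignoring them, so overriding them as above is a design choice rather than a requirement.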
**2. Inherit the SelectorMixin Class and its Function** | ||
```python | ||
from sklearn.base import BaseEstimator | ||
from sklearn.feature_selection.base import SelectorMixin | ||
|
||
from nni.feature_engineering.feature_selector import FeatureSelector | ||
|
||
class CustomizedSelector(FeatureSelector, BaseEstimator, SelectorMixin): | ||
    def __init__(self, ...): | ||
        ... | ||
|
||
    def get_params(self, ...): | ||
        """ | ||
        Get parameters for this estimator. | ||
        """ | ||
        params = self.__dict__ | ||
        params = {key: val for (key, val) in params.items() | ||
                  if not key.endswith('_')} | ||
        return params | ||
|
||
    def set_params(self, **params): | ||
        """ | ||
        Set the parameters of this estimator. | ||
        """ | ||
        for param in params: | ||
            if hasattr(self, param): | ||
                setattr(self, param, params[param]) | ||
        return self | ||
|
||
    def get_support(self, indices=False): | ||
        """ | ||
        Get a mask, or integer index, of the features selected. | ||
|
||
        Parameters | ||
        ---------- | ||
        indices : bool | ||
            Default False. If True, the return value will be an array of integers, rather than a boolean mask. | ||
|
||
        Returns | ||
        ------- | ||
        list : | ||
            An index that selects the retained features from a feature vector. | ||
            If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. | ||
            If indices is True, this is an integer array of shape [# output features] whose values | ||
            are indices into the input feature vector. | ||
        """ | ||
        ... | ||
        return mask | ||
|
||
    def transform(self, X): | ||
        """Reduce X to the selected features. | ||
|
||
        Parameters | ||
        ---------- | ||
        X : array | ||
            of shape [n_samples, n_features] | ||
|
||
        Returns | ||
        ------- | ||
        X_r : array | ||
            of shape [n_samples, n_selected_features] | ||
            The input samples with only the selected features. | ||
        """ | ||
        ... | ||
        return X_r | ||
|
||
    def inverse_transform(self, X): | ||
        """ | ||
        Reverse the transformation operation | ||
|
||
        Parameters | ||
        ---------- | ||
        X : array | ||
            of shape [n_samples, n_selected_features] | ||
|
||
        Returns | ||
        ------- | ||
        X_r : array | ||
            of shape [n_samples, n_original_features] | ||
        """ | ||
        ... | ||
        return X_r | ||
``` | ||
|
||
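The three `SelectorMixin`-style methods are closely related: `get_support` turns selected indices into a mask, `transform` applies the mask, and `inverse_transform` restores the original width. The sketch below shows one way that logic could look, using plain lists so it runs without any dependencies; the free functions and their names are illustrative assumptions, and filling dropped columns with zeros mirrors what scikit-learn's mixin does.

```python
def support_mask(selected, n_features):
    """Boolean mask with True at every selected column index."""
    chosen = set(selected)
    return [i in chosen for i in range(n_features)]


def transform(rows, mask):
    """Keep only the columns where the mask is True."""
    return [[value for value, keep in zip(row, mask) if keep] for row in rows]


def inverse_transform(rows, mask):
    """Restore the original width, writing 0 into the dropped columns."""
    restored = []
    for row in rows:
        values = iter(row)
        restored.append([next(values) if keep else 0 for keep in mask])
    return restored


mask = support_mask([1, 3], n_features=4)
X = [[1, 2, 3, 4], [5, 6, 7, 8]]
X_r = transform(X, mask)

print(mask)                          # [False, True, False, True]
print(X_r)                           # [[2, 4], [6, 8]]
print(inverse_transform(X_r, mask))  # [[0, 2, 0, 4], [0, 6, 0, 8]]
```

In scikit-learn itself, a subclass of `SelectorMixin` normally only implements `_get_support_mask`; `get_support`, `transform` and `inverse_transform` are then inherited.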
After integrating with Sklearn, we could use the feature selector as follows: | ||
```python | ||
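from sklearn.ensemble import ExtraTreesClassifier | ||
from sklearn.feature_selection import SelectFromModel | ||
from sklearn.linear_model import LogisticRegression | ||
from sklearn.pipeline import make_pipeline | ||
|
||
# load data | ||
... | ||
X_train, y_train = ... | ||
|
||
# build a pipeline with a customized selector (XXXSelector is a placeholder) | ||
pipeline = make_pipeline(XXXSelector(...), LogisticRegression()) | ||
# or build it with a selector that ships with sklearn | ||
pipeline = make_pipeline(SelectFromModel(ExtraTreesClassifier(n_estimators=50)), LogisticRegression()) | ||
pipeline.fit(X_train, y_train) | ||
|
||
# score | ||
print("Pipeline Score: ", pipeline.score(X_train, y_train)) | ||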
|
||
``` | ||
|
||
# Benchmark | ||
|
||
`Baseline` means that no feature selection is applied: the data is passed directly to LogisticRegression. For this benchmark, only 10% of the training data is held out as test data. | ||
|
||
| Dataset | Baseline | GradientFeatureSelector | TreeBasedClassifier | #Train | #Feature | | ||
| ----------- | ------ | ------ | ------- | ------- | -------- | | ||
| colon-cancer | 0.7547 | 0.7368 | 0.7223 | 62 | 2,000 | | ||
| gisette | 0.9725 | 0.89416 | 0.9792 | 6,000 | 5,000 | | ||
| avazu | 0.8834 | N/A | N/A | 40,428,967 | 1,000,000 | | ||
| rcv1 | 0.9644 | 0.7333 | 0.9615 | 20,242 | 47,236 | | ||
| news20.binary | 0.9208 | 0.6870 | 0.9070 | 19,996 | 1,355,191 | | ||
| real-sim | 0.9681 | 0.7969 | 0.9591 | 72,309 | 20,958 | | ||
Review comment: no number of gbdtselector? |
||
|
||
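The hold-out evaluation described above can be sketched as follows. This is an illustration under stated assumptions, not the benchmark script itself: it assumes the scores in the table are plain accuracies (which is what `pipeline.score` returns for a classifier such as LogisticRegression), and `holdout_split` and `accuracy` are hypothetical helpers.

```python
import random


def holdout_split(n_samples, test_fraction=0.1, seed=42):
    """Return (train_indices, test_indices) for a seeded random hold-out split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# colon-cancer has 62 samples according to the table.
train_idx, test_idx = holdout_split(62, test_fraction=0.1)
print(len(train_idx), len(test_idx))         # 56 6
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```

With only a handful of test samples, as for colon-cancer here, a single split gives a noisy estimate, so small differences between columns on that row should be read with caution.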
The benchmark datasets can be downloaded from [here](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). | ||
|
107 changes: 107 additions & 0 deletions
examples/feature_engineering/gradient_feature_selector/benchmark_test.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,107 @@ | ||
# Copyright (c) Microsoft Corporation | ||
# All rights reserved. | ||
# | ||
# MIT License | ||
# | ||
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated | ||
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation | ||
# the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and | ||
# to permit persons to whom the Software is furnished to do so, subject to the following conditions: | ||
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. | ||
# | ||
# THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING | ||
# BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND | ||
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, | ||
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. | ||
|
||
import bz2 | ||
import os | ||
import urllib.request | ||
|
||
from sklearn.datasets import load_svmlight_file | ||
from sklearn.ensemble import ExtraTreesClassifier | ||
from sklearn.feature_selection import SelectFromModel | ||
from sklearn.linear_model import LogisticRegression | ||
from sklearn.model_selection import train_test_split | ||
from sklearn.pipeline import make_pipeline | ||
|
||
from nni.feature_engineering.gradient_selector import FeatureGradientSelector | ||
|
||
|
||
class Benchmark(): | ||
|
||
    def __init__(self, files, test_size=0.2): | ||
        self.files = files | ||
        self.test_size = test_size | ||
|
||
    def run_all_test(self, pipeline): | ||
        for file_name in self.files: | ||
            file_path = self.files[file_name] | ||
|
||
            self.run_test(pipeline, file_name, file_path) | ||
|
||
    def run_test(self, pipeline, name, path): | ||
        print("download " + name) | ||
        update_name = self.download(name, path) | ||
        X, y = load_svmlight_file(update_name) | ||
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.test_size, random_state=42) | ||
|
||
        pipeline.fit(X_train, y_train) | ||
        print("[Benchmark " + name + " Score]: ", pipeline.score(X_test, y_test)) | ||
|
||
    def download(self, name, path): | ||
        old_name = name + '_train.bz2' | ||
        update_name = name + '_train.svm' | ||
|
||
        # skip the download if both the archive and the extracted file exist | ||
        if os.path.exists(old_name) and os.path.exists(update_name): | ||
            return update_name | ||
|
||
        urllib.request.urlretrieve(path, filename=old_name) | ||
|
||
        # decompress the archive into a plain-text svmlight file | ||
        with bz2.open(old_name, 'rb') as f_zip, open(update_name, 'wt') as f_svm: | ||
            f_svm.write(f_zip.read().decode('utf-8')) | ||
|
||
        return update_name | ||
|
||
|
||
if __name__ == "__main__": | ||
    LIBSVM_DATA = { | ||
        "rcv1": "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/rcv1_train.binary.bz2", | ||
        # "avazu": "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/avazu-app.bz2", | ||
        # the original entry pointed at covtype.libsvm.binary.bz2, which does not match the dataset name | ||
        "colon-cancer": "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/colon-cancer.bz2", | ||
        "gisette": "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/gisette_scale.bz2", | ||
        # "kdd2010": "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/kdda.bz2", | ||
        # "kdd2012": "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/kdd12.bz2", | ||
        "news20.binary": "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/news20.binary.bz2", | ||
        "real-sim": "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/real-sim.bz2", | ||
        "webspam": "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/webspam_wc_normalized_trigram.svm.bz2" | ||
    } | ||
|
||
    test_benchmark = Benchmark(LIBSVM_DATA) | ||
|
||
    pipeline1 = make_pipeline(LogisticRegression()) | ||
    print("Test all data in LogisticRegression.") | ||
    print() | ||
    test_benchmark.run_all_test(pipeline1) | ||
|
||
    pipeline2 = make_pipeline(FeatureGradientSelector(n_features=20), LogisticRegression()) | ||
    print("Test data selected by FeatureGradientSelector in LogisticRegression.") | ||
    print() | ||
    test_benchmark.run_all_test(pipeline2) | ||
|
||
    pipeline3 = make_pipeline(SelectFromModel(ExtraTreesClassifier(n_estimators=50)), LogisticRegression()) | ||
    print("Test data selected by TreeClassifier in LogisticRegression.") | ||
    print() | ||
    test_benchmark.run_all_test(pipeline3) | ||
|
||
    print("Done.") |
Review comment: gfs?