TPOT uses FeatureUnion to combine the outputs of multiple operators. However, TPOT can place two StackingEstimators inside a single FeatureUnion block. Because each StackingEstimator outputs its predictions together with its input, the FeatureUnion then passes two identical copies of the dataset into the next operator.
Context of the issue
This increases computational load and complexity, especially for large datasets, with no benefit. It may also have a performance impact on certain models.
Process to reproduce the issue
User creates TPOT instance
User calls TPOT fit() function with training data
TPOT will generate a pipeline as described.
To demonstrate the issue, the code below uses a pipeline that was found by TPOT.
from sklearn.pipeline import FeatureUnion, Pipeline
from tpot.builtins import StackingEstimator, ZeroCount
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.linear_model import SGDRegressor
from sklearn.decomposition import PCA
import numpy as np

# Pipeline found by TPOT: the FeatureUnion holds two StackingEstimators.
p = Pipeline([
    ('featureunion', FeatureUnion(transformer_list=[
        ('stackingestimator-1', StackingEstimator(
            estimator=RandomForestRegressor(max_features=0.45,
                                            min_samples_leaf=9,
                                            min_samples_split=4))),
        ('stackingestimator-2', StackingEstimator(
            estimator=ExtraTreesRegressor(max_features=0.7500000000000001,
                                          min_samples_leaf=20,
                                          min_samples_split=18))),
    ])),
    ('stackingestimator-1', StackingEstimator(
        estimator=SGDRegressor(alpha=0.01, eta0=1.0,
                               fit_intercept=False, l1_ratio=0.0,
                               loss='epsilon_insensitive',
                               penalty='elasticnet', power_t=1.0))),
    ('pca', PCA(iterated_power=3, svd_solver='randomized')),
    ('stackingestimator-2', StackingEstimator(
        estimator=SGDRegressor(alpha=0.001, fit_intercept=False,
                               l1_ratio=0.0,
                               loss='epsilon_insensitive',
                               penalty='elasticnet', power_t=1.0))),
    ('zerocount', ZeroCount()),
    ('sgdregressor', SGDRegressor(alpha=0.001, fit_intercept=False, l1_ratio=0.5,
                                  learning_rate='constant', loss='huber',
                                  penalty='elasticnet', power_t=0.1)),
])

# Fit on small random data so the FeatureUnion can be inspected.
X = np.random.rand(5, 10)
y = np.random.rand(5)
p.fit(X, y)

# One sample with the feature values 0..9, so copies of X are easy to spot.
xx = np.arange(10).reshape(1, -1)
print("Input data ", xx)
# Transform with the first step only (the fitted FeatureUnion).
print("After featureUnion ", p.steps[0][1].transform(xx))
FunctionTransformer can also be set to copy the input unchanged into the next layer. I have generated another pipeline in which several FeatureUnions are stacked with multiple FunctionTransformers, which essentially just produces multiple copies of the data.
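A minimal sketch of the FunctionTransformer case, assuming the identity step is built as `FunctionTransformer(copy)`; the toy array and the step names `copy-1` and `copy-2` are made up for illustration and are not taken from the pipeline TPOT generated:

```python
from copy import copy

import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

# Toy input: 3 samples, 4 features (illustrative values only).
X = np.arange(12, dtype=float).reshape(3, 4)

# Two identity transformers side by side: each branch returns X unchanged.
union = FeatureUnion(transformer_list=[
    ("copy-1", FunctionTransformer(copy)),
    ("copy-2", FunctionTransformer(copy)),
])

out = union.fit_transform(X)
print(out.shape)  # (3, 8): the 4 input columns appear twice
```

Stacking several such unions multiplies the width again at every level, which is where the extra computational load comes from.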
Expected result
The data should not be copied over twice.
[Estimator 1 predictions, Estimator 2 predictions, X]
[0.44, .45, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Current result
[Estimator 1 predictions, X, Estimator 2 predictions, X]
[0.44, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, .45, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
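The two layouts can be reproduced with plain NumPy. This is only an illustration of the column arithmetic, assuming each stacking step prepends its prediction to its input as shown in the current result; the prediction values 0.44 and 0.45 are the example numbers from above, not real model output:

```python
import numpy as np

x = np.arange(10, dtype=float).reshape(1, -1)  # one sample, features 0..9
pred_1 = np.array([[0.44]])  # example prediction of estimator 1
pred_2 = np.array([[0.45]])  # example prediction of estimator 2

# Assumed behaviour of each stacking branch: [prediction, X].
branch_1 = np.hstack([pred_1, x])
branch_2 = np.hstack([pred_2, x])

# The union concatenates both branches, so X appears twice.
current = np.hstack([branch_1, branch_2])

# Desired layout: both predictions, then a single copy of X.
expected = np.hstack([pred_1, pred_2, x])

print(current.shape, expected.shape)  # (1, 22) (1, 12)
```

With k stacking branches and d input features the union emits k * (d + 1) columns, where k + d would carry the same information.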
Possible fix
Here is my idea off the top of my head:
Limit FeatureUnion to selectors, transformers, and at most one classifier or regressor. That way only one copy of the data exists. When more than one classifier or regressor is used, replace the FeatureUnion with sklearn's StackingClassifier or StackingRegressor. These similarly allow multiple models to pass along their predictions, but (with passthrough=True) pass forward only one copy of the dataset.
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingRegressor.html
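A rough sketch of what the StackingRegressor variant could look like, not a proposal for how TPOT should implement it. The data, the estimator names `rf` and `et`, and the settings `n_estimators=10` and `cv=3` are assumptions chosen to keep the example small:

```python
import numpy as np
from sklearn.ensemble import (ExtraTreesRegressor, RandomForestRegressor,
                              StackingRegressor)

# Small random regression problem (illustrative only).
rng = np.random.RandomState(0)
X = rng.rand(40, 10)
y = rng.rand(40)

stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=10, random_state=0)),
        ("et", ExtraTreesRegressor(n_estimators=10, random_state=0)),
    ],
    passthrough=True,  # append the original features once
    cv=3,
)
stack.fit(X, y)

# transform() returns one prediction column per estimator, followed by X.
stacked = stack.transform(X)
print(stacked.shape)  # (40, 12): 2 predictions + 10 original features
```

Note that StackingRegressor also fits a final estimator (RidgeCV by default), so using it purely as an intermediate transformer does some work that a FeatureUnion would not.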