serialize backend and test improvements #246

Merged
merged 63 commits on Oct 15, 2019

Commits (63)

2851151
clean up #243
ardunn Oct 11, 2019
f8e6ca4
add serialize as method to DFMLAdaptor Base
ardunn Oct 11, 2019
e0e2ae2
wip better serialization
ardunn Oct 11, 2019
0617c5e
working serialization, no tests
ardunn Oct 11, 2019
b628b59
fixing #241
ardunn Oct 11, 2019
2b9ef9b
fix #234
ardunn Oct 11, 2019
b58ae35
fixes #230
ardunn Oct 11, 2019
ce6ba6d
add a from_preset method, closes #232
ardunn Oct 11, 2019
449dc11
add StructuralComplexity
ardunn Oct 11, 2019
50db0b4
update readme
ardunn Oct 11, 2019
c3fdf92
update workflows
ardunn Oct 11, 2019
84adc9f
fix #226
ardunn Oct 11, 2019
1ae52b7
wip serialization tests with new requirements
ardunn Oct 12, 2019
cc13cb1
try use global variable for temp backend
ardunn Oct 12, 2019
8bae684
add test for instantiating via presets, and separation into multiple …
ardunn Oct 12, 2019
dfa88f2
version test does not need external file.
ardunn Oct 12, 2019
d51ed32
add skip for intensive in debug tpot pipeline tests
ardunn Oct 12, 2019
bdcdc1a
removing pesky prints
ardunn Oct 12, 2019
a196ce2
wip fixing double saves of matpipes
ardunn Oct 14, 2019
d281392
fix tests
ardunn Oct 14, 2019
adf0287
update docs with new logo
ardunn Oct 14, 2019
fa00aab
make sure version pipe gets cleaned up in teardown
ardunn Oct 14, 2019
25737f3
wip improved digest
ardunn Oct 14, 2019
fef4146
wip two digest methods 2
ardunn Oct 14, 2019
1d6d348
working test for save_dict_to_file
ardunn Oct 14, 2019
9d14b4d
working and pretty test for save_dict_to_file
ardunn Oct 14, 2019
60e1c39
update ci configuration
ardunn Oct 14, 2019
5c194a4
working summary and details with tests
ardunn Oct 14, 2019
c5aab51
tmp - use git source for matminer until 0.6.1 is released
ardunn Oct 14, 2019
6394824
remove xlrd from requirements
ardunn Oct 14, 2019
f3722d8
change matminer back to 0.6.1
ardunn Oct 14, 2019
7c1d114
reenable all tests
ardunn Oct 14, 2019
0f4bbda
reenable single pipeline tests
ardunn Oct 14, 2019
d473ac7
wip docs
ardunn Oct 14, 2019
990b455
refactor summary --> summarize and details --> inspect
ardunn Oct 14, 2019
06c9b2b
wip docs 2
ardunn Oct 14, 2019
6885440
finish basic documentation
ardunn Oct 14, 2019
0d9968e
fix code-block of log in basic docs not looking right
ardunn Oct 14, 2019
16ada54
updates to basic docs
ardunn Oct 14, 2019
f6105fe
matminer version upgrade
ardunn Oct 14, 2019
add18da
fix logging, fixes #204
ardunn Oct 14, 2019
ffee0d4
add teardown to log test
ardunn Oct 14, 2019
a75f1c2
change all is_fit declarations in init to super calls to DFTransformer
ardunn Oct 14, 2019
4e04e75
wip adding warning for large numbers of handled nans
ardunn Oct 14, 2019
8309eda
fix #199
ardunn Oct 14, 2019
b82849f
wip working on ignored columns
ardunn Oct 14, 2019
1e77201
working ignore on predict no tests
ardunn Oct 14, 2019
ffb2cb9
passing tests and better logging for ignoring columns, fixes #228
ardunn Oct 14, 2019
231c494
reenable tests
ardunn Oct 14, 2019
d1fb510
fixed tests for datacleaner
ardunn Oct 14, 2019
f828935
update code docs, fixes #244
ardunn Oct 14, 2019
61e46fa
fix dumb pipeline ignore default
ardunn Oct 14, 2019
084f2d0
add ignore to benchmark
ardunn Oct 14, 2019
51e0c27
wip docs
ardunn Oct 14, 2019
69f1515
wip docs 2
ardunn Oct 15, 2019
6b5fa05
advanced usage is done
ardunn Oct 15, 2019
57982af
add matbench documentation
ardunn Oct 15, 2019
18f7194
add tutorials and clean up support
ardunn Oct 15, 2019
c584ade
updates to docs
ardunn Oct 15, 2019
745587e
close to getting final docs
ardunn Oct 15, 2019
6401a68
add rst files
ardunn Oct 15, 2019
0e575bd
docs in good shape
ardunn Oct 15, 2019
20b0721
update docs one last time [skip ci]
ardunn Oct 15, 2019
Files changed

2 changes: 1 addition & 1 deletion .circleci/config.yml
@@ -45,7 +45,7 @@ jobs:
coverage run setup.py test
coverage xml
python-codacy-coverage -r coverage.xml
no_output_timeout: 120m
no_output_timeout: 10m

- save_cache:
paths:
109 changes: 0 additions & 109 deletions .circleci/config_old.yml

This file was deleted.

2 changes: 1 addition & 1 deletion MANIFEST.in
@@ -1,5 +1,5 @@
include LICENSE
include CHANGELOG.md
include CONTRIBUTING.md
recursive-include automatminer *.txt *.py *.yaml *.json *.csv
recursive-include automatminer *.txt *.py *.yaml *.json *.csv *.p *.pickle
recursive-exclude benchdev *
8 changes: 3 additions & 5 deletions README.md
@@ -7,11 +7,9 @@ automatminer is an automatic prediction engine for materials properties.
|:----------:|:-------------:|:------:|:------:|
| [![CircleCI](https://img.shields.io/circleci/project/github/hackingmaterials/automatminer/master.svg)](https://circleci.com/gh/hackingmaterials/automatminer) | [![Codacy Badge](https://img.shields.io/codacy/coverage/aa63dd7aa85e480bbe0e924a02ad1540.svg?colorB=brightgreen)](https://www.codacy.com/app/ardunn/automatminer) | [![Codacy Badge](https://img.shields.io/codacy/grade/aa63dd7aa85e480bbe0e924a02ad1540.svg)](https://www.codacy.com/app/ardunn/automatminer) | [![PyPI version](https://img.shields.io/pypi/v/automatminer.svg?colorB=blue)](https://pypi.org/project/automatminer/) |

### Warning: Automatminer is currently at an experimental stage of development.
#### Please use in production at your own risk!

#### Automatminer requires the newest version of [matminer](https://github.com/hackingmaterials/matminer) (from git) to work properly!

- **Website (including work-in-progress documentation):** <http://hackingmaterials.lbl.gov/automatminer/>
- **Help/Support:** https://hackingmaterials.discourse.group/c/matminer/automatminer
- **Source:** <https://github.com/hackingmaterials/automatminer>

You may also be interested in the parent code of automatminer, matminer:
- **Matminer**: <https://github.com/hackingmaterials/matminer>
4 changes: 2 additions & 2 deletions automatminer/__init__.py
@@ -1,10 +1,10 @@
from automatminer.preprocessing import DataCleaner, FeatureReducer
from automatminer.automl import TPOTAdaptor
from automatminer.automl import TPOTAdaptor, SinglePipelineAdaptor
from automatminer.featurization import AutoFeaturizer
from automatminer.pipeline import MatPipe
from automatminer.presets import get_preset_config

__author__ = 'Alex Dunn, Qi Wang, Alex Ganose, Alireza Faghaninia, Anubhav Jain'
__author_email__ = 'ardunn@lbl.gov'
__license__ = 'Modified BSD'
__version__ = "2019.9.12"
__version__ = "2019.10.11"
151 changes: 97 additions & 54 deletions automatminer/automl/adaptors.py
@@ -8,9 +8,7 @@
"""
from collections import OrderedDict

from sklearn.pipeline import Pipeline
from tpot import TPOTClassifier, TPOTRegressor
from tpot.base import TPOTBase

from automatminer.automl.config.tpot_configs import TPOT_CLASSIFIER_CONFIG, \
TPOT_REGRESSOR_CONFIG
@@ -27,6 +25,8 @@
'Qi Wang <wqthu11@gmail.com>',
'Daniel Dopp <dbdopp@lbl.gov>']

_adaptor_tmp_backend = None


class TPOTAdaptor(DFMLAdaptor, LoggableMixin):
"""
@@ -61,6 +61,10 @@ class TPOTAdaptor(DFMLAdaptor, LoggableMixin):
best_models (OrderedDict): The best model names and their scores.
backend (TPOTBase): The TPOT object interface used for ML training.
models (OrderedDict): The raw sklearn-style models output by TPOT.

from_serialized (bool): Whether the backend is loaded from a serialized
instance. If True, the previous full TPOT data will not be available
due to pickling problems.
"""

def __init__(self, logger=True, **tpot_kwargs):
@@ -80,6 +84,10 @@ def __init__(self, logger=True, **tpot_kwargs):
self._features = None
self.logger = logger

self.from_serialized = False
self._best_models = None
super(DFMLAdaptor, self).__init__()

@log_progress(AMM_LOG_FIT_STR)
@set_fitted
def fit(self, df, target, **fit_kwargs):
@@ -148,69 +156,110 @@ def best_models(self):
best hyperparameter combination found.

"""
self.greater_score_is_better = is_greater_better(
self.backend.scoring_function)

# Get list of evaluated model names, cast to set and back
# to get unique model names, instantiate ordered model dictionary
evaluated_models = []
for key in self.backend.evaluated_individuals_.keys():
evaluated_models.append(key.split('(')[0])

model_names = list(set(evaluated_models))
models = OrderedDict({model: [] for model in model_names})

# This makes a dict of model names mapped to all runs of that model
for key, val in self.backend.evaluated_individuals_.items():
models[key.split('(')[0]].append(val)

# For each base model type sort the runs by best score
for model_name in model_names:
models[model_name].sort(
key=lambda x: x['internal_cv_score'],
reverse=self.greater_score_is_better
)

# Gets a simplified dict of the model to only its best run
# Sort the best individual models by type to best models overall
best_models = OrderedDict(
sorted({model: models[model][0] for model in models}.items(),
key=lambda x: x[1]['internal_cv_score'],
reverse=self.greater_score_is_better))

# Mapping of top models to just their score
scores = {model: best_models[model]['internal_cv_score']
for model in best_models}

# Sorted dict of top models just mapped to their top scores
best_models_and_scores = OrderedDict(
sorted(scores.items(),
key=lambda x: x[1],
reverse=self.greater_score_is_better))
self.models = models
return best_models_and_scores

if self.from_serialized:
return self._best_models
else:
self.greater_score_is_better = is_greater_better(
self.backend.scoring_function)

# Get list of evaluated model names, cast to set and back
# to get unique model names, instantiate ordered model dictionary
evaluated_models = []
for key in self.backend.evaluated_individuals_.keys():
evaluated_models.append(key.split('(')[0])
# evaluated_models.append(key)

model_names = list(set(evaluated_models))
models = OrderedDict({model: [] for model in model_names})

# This makes a dict of model names mapped to all runs of that model
for key, val in self.backend.evaluated_individuals_.items():
models[key.split('(')[0]].append(val)

# For each base model type sort the runs by best score
for model_name in model_names:
models[model_name].sort(
key=lambda x: x['internal_cv_score'],
reverse=self.greater_score_is_better
)

# Gets a simplified dict of the model to only its best run
# Sort the best individual models by type to best models overall
best_models = OrderedDict(
sorted({model: models[model][0] for model in models}.items(),
key=lambda x: x[1]['internal_cv_score'],
reverse=self.greater_score_is_better))

# Mapping of top models to just their score
scores = {model: best_models[model]['internal_cv_score']
for model in best_models}

# Sorted dict of top models just mapped to their top scores
best_models_and_scores = OrderedDict(
sorted(scores.items(),
key=lambda x: x[1],
reverse=self.greater_score_is_better))
self.models = models
return best_models_and_scores

@property
@check_fitted
def backend(self):
return self._backend

@property
@check_fitted
def best_pipeline(self):
if isinstance(self._backend, TPOTBase):
return self._backend.fitted_pipeline_
elif isinstance(self._backend, Pipeline):
if self.from_serialized:
# The TPOT backend is replaced by the best pipeline.
return self._backend
else:
raise TypeError("Backend type not recognized as TPOT or Pipeline")
return self._backend.fitted_pipeline_

@property
@check_fitted
def features(self):
return self._features

@property
@check_fitted
def fitted_target(self):
return self._fitted_target

@check_fitted
def serialize(self) -> None:
"""
Avoid TPOT pickling issues. Used by MatPipe during save.

Returns:
    None

"""
if not self.from_serialized:
global _adaptor_tmp_backend
_adaptor_tmp_backend = self._backend
# Necessary for getting best models post serialization
self._best_models = self.best_models
self._backend = self.best_pipeline
self.from_serialized = True

@check_fitted
def deserialize(self) -> None:
"""
Get the original TPOTAdaptor image back after serializing, with
(relatively) contained scope.

Returns:
None
"""
if not self.from_serialized:
global _adaptor_tmp_backend
self._backend = _adaptor_tmp_backend
_adaptor_tmp_backend = None
self.from_serialized = False


class SinglePipelineAdaptor(DFMLAdaptor, LoggableMixin):
"""
Expand All @@ -236,11 +285,6 @@ class SinglePipelineAdaptor(DFMLAdaptor, LoggableMixin):

mode (str): Either AMM_REG_NAME (regression) or AMM_CLF_NAME
(classification)
_regressor (BaseEstimator): The single pipeline to be used for
regression
_classifier (BaseEstimator): The single pipeline to be used for
classification

"""

def __init__(self, regressor, classifier, logger=True):
@@ -278,7 +322,7 @@ def fit(self, df, target, **fit_kwargs):
@property
@check_fitted
def backend(self):
return None
return self.best_pipeline

@property
@check_fitted
@@ -294,4 +338,3 @@ def features(self):
@check_fitted
def fitted_target(self):
return self._fitted_target
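
The serialize()/deserialize() pair added in this diff is a common workaround for objects whose backend cannot be pickled: before saving, the heavy TPOT object is stashed in a module-level temporary and replaced by its picklable fitted pipeline, and afterwards the original backend is restored in memory. Below is a minimal, self-contained sketch of that pattern; it is not the automatminer code — the class, names, toy data, and guard logic are illustrative only (see the diff above for the exact automatminer implementation).

import pickle
import threading

# Module-level stash, mirroring the role of _adaptor_tmp_backend above.
_tmp_backend = None


class ToyAdaptor:
    """Stand-in for TPOTAdaptor, only to illustrate the swap-and-restore idea."""

    def __init__(self, backend, best_pipeline):
        self.backend = backend              # stands in for the unpicklable TPOT object
        self.best_pipeline = best_pipeline  # a plain, picklable fitted pipeline
        self.from_serialized = False

    def serialize(self):
        """Swap the unpicklable backend for the picklable fitted pipeline."""
        global _tmp_backend
        if not self.from_serialized:
            _tmp_backend = self.backend
            self.backend = self.best_pipeline
            self.from_serialized = True

    def deserialize(self):
        """Restore the original backend stashed by serialize()."""
        global _tmp_backend
        if self.from_serialized:
            self.backend = _tmp_backend
            _tmp_backend = None
            self.from_serialized = False


# A thread lock cannot be pickled, standing in for the TPOT backend.
adaptor = ToyAdaptor(backend=threading.Lock(), best_pipeline={"model": "fitted"})

adaptor.serialize()
blob = pickle.dumps(adaptor)  # succeeds: only the picklable pipeline is stored
adaptor.deserialize()         # the in-memory object gets its full backend back

restored = pickle.loads(blob)
print(restored.from_serialized)  # True: loaded copies keep only the best pipeline
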
