Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ability to use built-in pickle for saving AutoMLSearch #2463

Merged
merged 10 commits into from
Jul 7, 2021

Conversation

christopherbunn
Copy link
Contributor

Resolves #2174

@codecov
Copy link

codecov bot commented Jun 30, 2021

Codecov Report

Merging #2463 (833e034) into main (143c83a) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@           Coverage Diff           @@
##            main   #2463     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        283     283             
  Lines      25555   25568     +13     
=======================================
+ Hits       25453   25466     +13     
  Misses       102     102             
Impacted Files Coverage Δ
evalml/automl/automl_search.py 99.4% <100.0%> (+0.1%) ⬆️
evalml/automl/pipeline_search_plots.py 100.0% <100.0%> (ø)
evalml/tests/automl_tests/test_automl.py 99.8% <100.0%> (+0.1%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 143c83a...833e034. Read the comment docs.

@christopherbunn christopherbunn marked this pull request as ready for review June 30, 2021 22:30
Copy link
Contributor

@freddyaboulton freddyaboulton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@christopherbunn Looks good! The original issue also mentions doing something similar for pipelines. Is the plan is to do that in this PR or a follow up?

@@ -1331,7 +1347,7 @@ def load(file_path):
AutoSearchBase object
"""
with open(file_path, "rb") as f:
return cloudpickle.load(f)
return pickle.load(f)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we can save with cloudpickle and read with pickle?! Wow, I had no idea that works hehe.

I feel like we should accept an argument here for the "pickle type"? Feels weird to offer a choice of library for save but not respect that in load. Not blocking though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm yeah I agree, it does feel symmetrical to do so. I think it would be a no-op though since it looks like the doc for cloudpickle just recommends using the standard python pickler for loading.

Copy link
Contributor

@angela97lin angela97lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to pickle using a jupyter notebook and I'm getting the following, am I doing something wrong?:

automl_ = AutoMLSearch(X, y, problem_type="regression")
automl_.search()
automl_.save("test.pkl", pickle_type="pickle")

I get:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-12-e1254c6fb959> in <module>
----> 1 automl_.save("test.pkl", pickle_type="pickle")

~/Desktop/evalml/evalml/automl/automl_search.py in save(self, file_path, pickle_type, pickle_protocol)
   1335 
   1336         with open(file_path, "wb") as f:
-> 1337             pkl_lib.dump(self, f, protocol=pickle_protocol)
   1338 
   1339     @staticmethod

TypeError: can't pickle module objects

It works if I use cloudpickle 🤔

with open(file_path, "wb") as f:
cloudpickle.dump(self, f, protocol=pickle_protocol)
pkl_lib.dump(self, f, protocol=pickle_protocol)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the user uses a cloudpickle protocol while trying to use pickle? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently cloudpickles can be opened by the regular pickling library (according to their example on their README.md and the doc string for cloudpickle.py in the attached screenshot)! I didn't know this before so that's pretty neat.

image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So cool 🤩

@freddyaboulton
Copy link
Contributor

@angela97lin Are you running in a jupyter notebook?

The following works for me when running from a script:

from evalml.demos import load_diabetes
from evalml.automl import AutoMLSearch

X, y = load_diabetes()

automl = AutoMLSearch(X, y, "regression")
automl.search()
automl.save("automl.pkl", pickle_type="pickle")
automl2 = automl.load("automl.pkl")

@christopherbunn
Copy link
Contributor Author

I'm trying to pickle using a jupyter notebook and I'm getting the following, am I doing something wrong?:

automl_ = AutoMLSearch(X, y, problem_type="regression")
automl_.search()
automl_.save("test.pkl", pickle_type="pickle")

I get:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-12-e1254c6fb959> in <module>
----> 1 automl_.save("test.pkl", pickle_type="pickle")

~/Desktop/evalml/evalml/automl/automl_search.py in save(self, file_path, pickle_type, pickle_protocol)
   1335 
   1336         with open(file_path, "wb") as f:
-> 1337             pkl_lib.dump(self, f, protocol=pickle_protocol)
   1338 
   1339     @staticmethod

TypeError: can't pickle module objects

It works if I use cloudpickle 🤔

Hmm, that's so odd! I am able to generate a pickle with the test and through a script version of my test Jupyter notebook. At the same time, I am also running into the same error as you when I try to run it interactively in a Jupyter notebook. Let me look into this...

@angela97lin
Copy link
Contributor

@freddyaboulton @christopherbunn Yup, running via jupyter notebook. What I was curious about was if it was possible to save in a file and load in completely different file so I wanted to have two separate notebooks to test but ran into this, heh.

@christopherbunn
Copy link
Contributor Author

christopherbunn commented Jul 1, 2021

@freddyaboulton @angela97lin I think I figured out the issue: in the notebook environment, we have the search iteration plot appear. However, in the test and script environment, we aren't showing a search iteration plot. As a result, the jupyter notebook version has the plot stored in automl.search_iteration_plot whereas the script version has None in this variable. Edit: this was incorrect, see below comment.

I am able to pickle the search in Jupyter when i set automl.search_iteration_plot=None.

I'm guessing that for some reason the plot doesn't support pickling? This is a bit odd considering that plotly has support for pickling per this PR on their repo. I'll see if I can find a workaround or a deeper root cause in Plotly.

@angela97lin
Copy link
Contributor

@christopherbunn Wow, good find! If you can't find a good workaround, I think it could also be fine to say that the plots will not be pickled and set automl.search_iteration_plot=None before saving, not sure if others have different opinions on that 🤷‍♀️

@freddyaboulton
Copy link
Contributor

I think the issue is that the PipelineSearchPlot has a reference to plotly.graph_objects, which is a module. Maybe we can refactor that? But I think not saving the plot is also a valid approach.

image

We might want to update the AutoMLSearch User Guide to talk about also using pickle to save automl and the limitations. Would also give us "coverage" for pickling in a jupyter notebook env which our unit tests don't cover.

@christopherbunn
Copy link
Contributor Author

Turns out I was wrong, the Jupyter notebook version uses a SearchIterationPlot object whereas the script/test version stores a plotly figure (which is pickleable). The error is raised because the _go class variable in SearchIterationPlot stores the module representing the plotly.graph_objects class.

I think the best path is to actually clear out the _go class variable after the init since this is the only place where this is used. Initial testing show that pickling works once this is done. I'll test a few edge cases but if it looks good I'll push it up.

@christopherbunn christopherbunn merged commit d5b8602 into main Jul 7, 2021
@chukarsten chukarsten mentioned this pull request Jul 22, 2021
@freddyaboulton freddyaboulton deleted the 2174_pickling_search branch May 13, 2022 15:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ability to pickle AutoMLSearch
3 participants