
Make name of new column with predictions appended to dataframe configurable #267

Merged · 2 commits into hackingmaterials:master · May 3, 2020

Conversation

@janosh (Member) commented Nov 18, 2019

Closes #266.

@ardunn We might want to add a test that specifically calls mat_pipe.predict() with a custom value for pred_col_suffix. Right now, all I added was

# automl/tests/test_base.py
self.assertTrue(hasattr(tag, "pred_col_suffix"))

That's because the way TestAdaptorGood inherits from DFMLAdaptor and then overrides predict makes adding a test there awkward. Should we instead have one on both TPOTAdaptor and SinglePipelineAdaptor?

Also, let me know if you'd like a different name for pred_col_suffix.
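For concreteness, such a test might look roughly like this (fixture names and how pred_col_suffix is passed are assumptions, not the actual implementation):

    # hypothetical sketch: assumes pred_col_suffix is accepted by predict()
    def test_custom_pred_col_suffix(self):
        df = self.mat_pipe.predict(self.df_test, self.target, pred_col_suffix=" pred")
        # the appended prediction column should carry the custom suffix
        self.assertIn(self.target + " pred", df.columns)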

@ardunn (Contributor) commented Nov 18, 2019

Hey @janosh thanks for the PR!

One quick thought: if we are going to accept changing column output names, it would probably be worth replacing the entire output column name rather than making it a suffix. I could imagine scenarios where people specify the output column name as "zT_predicted" and the appended name then comes out as "zTzT_predicted" or something, because the person was expecting the name to be replaced, not appended.
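To illustrate the pitfall (hypothetical values):

    target = "zT"
    pred_col_suffix = "zT_predicted"   # user expects this to BE the column name
    col_name = target + pred_col_suffix
    print(col_name)  # "zTzT_predicted" -- appended, not replaced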

@ardunn (Contributor) commented Nov 18, 2019

And yes, we should definitely add a global MatPipe test for it, since it is both kind of a global option (i.e., it's the output of the entire pipeline) and an Adaptor option.

@janosh (Member, Author) commented Nov 18, 2019

Turned the target column suffix into a full name and added a preset test for target_col_name as a powerup.

Where should the MatPipe test live? I don't quite understand the flow in tests/test_pipeline.py.

@ardunn (Contributor) left a comment

Regarding the MatPipe test, it should definitely go in test_pipeline. To explain: the TestMatPipe class just defines common tests for any MatPipe configuration. The test classes that follow then run the same tests for different adaptors (Single, TPOT, etc.). One way to integrate the tests for this is to change the TestMatPipe tests to just use DFMLAdaptor.target_output_name instead of target + " predicted" in the existing tests. You can add an extra test if needed.
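Roughly, the change in the existing tests would look like this (treating pipe.learner as the adaptor handle, which is an assumption here):

    df = pipe.predict(self.df_test, self.target)
    # before: self.assertIn(self.target + " predicted", df.columns)
    self.assertIn(pipe.learner.target_output_name, df.columns)  # pipe.learner is assumed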

@@ -83,7 +86,7 @@ def __init__(self, **tpot_kwargs):
 
         self.from_serialized = False
         self._best_models = None
-        super(DFMLAdaptor, self).__init__()
+        super(TPOTAdaptor, self).__init__(**tpot_kwargs)
@ardunn (Contributor)

Feel free to change TestAdaptorGood to not override DFMLAdaptor, since the DFMLAdaptor usage in the test adaptors should remain the same. I.e., we shouldn't be testing TPOTAdaptor where previously the base class was tested.

@janosh (Member, Author)

Not quite sure what you mean... Is your comment related to this line?

Also, I was wondering: what's the purpose of using Python 2 super() syntax in automatminer if all your CI is Python 3.7? I.e., why not simply use super() without arguments everywhere?

- super(TPOTAdaptor, self).__init__(**tpot_kwargs)
+ super().__init__(**tpot_kwargs)

@ardunn (Contributor) commented Nov 19, 2019

Using Py3 super syntax is fine by me!

Edit: never mind that comment, I was confused about which file this was in. Seems fine by me.

@janosh (Member, Author)

> Using Py3 super syntax is fine by me!

Cool. Would you accept a separate PR converting all super() calls to Py3?

@janosh (Member, Author)

I can also make it part of this PR if you prefer.

@ardunn (Contributor)

Yes, let's make that a separate PR though.

@@ -146,9 +153,11 @@ def predict(self, df: pd.DataFrame, target: str) -> pd.DataFrame:
                 "".format(not_in_df, not_in_model)
             )
         else:
+            if target_output_col is not None:
@ardunn (Contributor)

Can we just have the target output col argument be the "{target} predicted" string? Having it be None makes it a bit harder to see what is going on.

@ardunn (Contributor)

Oh, never mind, I see it is set in the base class constructor.
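For reference, the pattern being described is roughly this (signatures are assumptions based on the diff above, not the merged code):

    # sketch only -- constructor signature and default handling are assumptions
    class DFMLAdaptor:
        def __init__(self, target_output_col=None):
            # None means "derive the column name from the target at predict time"
            self.target_output_col = target_output_col

        def predict(self, df, target):
            output_col = self.target_output_col or target + " predicted"
            ...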

@@ -145,7 +145,7 @@ def run_task(self, fw_spec):
 
         # Evaluate model
         true = result_df[target]
-        test = result_df[target + " predicted"]
+        test = result_df[learner.target_output_col]
@ardunn (Contributor)

Since you are making changes that affect _dev, this will take me some time to merge; the entire infrastructure I use to run AMM benchmarks depends on dev.

@janosh (Member, Author)

No problem.

@janosh (Member, Author) commented Nov 19, 2019

> One way to integrate the tests for this is to change the TestMatPipe tests to just use DFMLAdaptor.target_output_name instead of target + " predicted" in the existing tests.

That part is already implemented.

@janosh (Member, Author) commented Nov 19, 2019

Added a test for target_output_col in tests/test_pipeline.py.

@ardunn (Contributor) commented Nov 19, 2019

> Added a test for target_output_col in tests/test_pipeline.py.

Is there any way we could test this without re-fitting? Reason being, running the full tests already takes a long time. If not, it's OK; just trying to brainstorm here a bit.

@ardunn (Contributor) commented Nov 19, 2019

@janosh, after thinking on this some more, you may be right that adding this as a powerup is not the right implementation. Just having it alongside "ignore_cols" in the MatPipe predict args (and additionally in the adaptors, etc.) might be the way to go.

@janosh (Member, Author) commented Nov 19, 2019

> Is there any way we could test this without re-fitting? Reason being, running the full tests already takes a long time.

I thought about that as well. We could combine test_target_output_col and test_ignore and call it test_predict_kwargs. That way we would only need to fit once, at the expense of having to work a little harder, if that test fails, to find out which kwarg is the culprit.
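Sketched out, the combined test might look like this (fixture names and the spelling of the ignore kwarg are assumptions):

    # sketch: self.fitted_pipe and the kwarg names are assumptions
    def test_predict_kwargs(self):
        pipe = self.fitted_pipe  # fit once, shared by both checks

        # custom prediction column name
        df = pipe.predict(self.df_test, self.target, target_output_col="zT_pred")
        self.assertIn("zT_pred", df.columns)

        # ignored columns should pass through untouched
        df = pipe.predict(self.df_test, self.target, ignore=["composition"])
        self.assertIn("composition", df.columns)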

@janosh (Member, Author) commented Nov 19, 2019

> @janosh, after thinking on this some more, you may be right that adding this as a powerup is not the right implementation. Just having it alongside "ignore_cols" in the MatPipe predict args (and additionally in the adaptors, etc.) might be the way to go.

As you like. I'm fine with either. Let me know what you think about the combined test and then I can revert the powerup commit alongside changing the pipeline test.

@ardunn (Contributor) commented Nov 19, 2019

> > Is there any way we could test this without re-fitting? Reason being, running the full tests already takes a long time.

> I thought about that as well. We could combine test_target_output_col and test_ignore and call it test_predict_kwargs. That way we would only need to fit once, at the expense of having to work a little harder, if that test fails, to find out which kwarg is the culprit.

I think this is perfect! The implementation is rather simple so I'd imagine that won't be a major problem, and is preferred to testing them individually.

@ardunn (Contributor) commented Nov 19, 2019

> > @janosh, after thinking on this some more, you may be right that adding this as a powerup is not the right implementation. Just having it alongside "ignore_cols" in the MatPipe predict args (and additionally in the adaptors, etc.) might be the way to go.

> As you like. I'm fine with either. Let me know what you think about the combined test and then I can revert the powerup commit alongside changing the pipeline test.

Yeah, I think not having it in the powerups is preferred, mainly because you are right: it doesn't really fit the paradigm. The powerups are for applying common operations which affect the pipeline configuration, not things which can be changed after the pipeline has been fit.

@ardunn (Contributor) commented Nov 19, 2019

It will also be nice not having target_output_col be an attribute of the adaptors. If we don't make it a powerup, it would be easy to just have it be an arg of DFMLAdaptor.predict (there are already too many damn attributes on each pipeline op). It would be quite a simple implementation in this case.
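I.e., roughly this (a sketch, not the merged code):

    import pandas as pd

    # sketch of a DFMLAdaptor method; default value and types are assumptions
    def predict(self, df: pd.DataFrame, target: str,
                target_output_col: str = None) -> pd.DataFrame:
        ...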

@janosh (Member, Author) commented Nov 20, 2019

Alrighty. I reset all changes to upstream master, kept only the predict kwarg (no added attribute), refactored test_ignore into test_predict_kwargs, and force-pushed the new changes. All that's missing now is documentation for target_output_col, I think.

@janosh (Member, Author) commented Nov 25, 2019

I renamed target_output_col to output_col for brevity since I think that name is clear enough in the context of pipe.predict(). Hope you don't mind.
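Usage now looks like this (a sketch; the DataFrame and target here are placeholders):

    # placeholder data; only the output_col kwarg is the point
    df_pred = pipe.predict(df_test, "zT", output_col="zT_pred")
    assert "zT_pred" in df_pred.columns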

Anything left to do here?

@ardunn (Contributor) commented Nov 27, 2019

@janosh sounds good. I am a bit behind, so I haven't gotten the chance to see if this will change things in -dev, but I'll be doing that this week. Hang tight in the meantime!

Edit: turns out I didn't do this that week.

@janosh (Member, Author) commented Mar 27, 2020

@ardunn With the quarantine in place, perhaps there's time now to revisit this and get it merged?

@ardunn (Contributor) commented Mar 28, 2020

Haha, @janosh, yes, I am starting work on this again. We are preparing a paper in parallel with this development, so that has been taking considerable time as well. You can expect some automatminer progress over the next few weeks (including this PR).

@janosh (Member, Author) commented Mar 28, 2020

Sounds good, looking forward to it! :)

@ardunn ardunn merged commit c7b0251 into hackingmaterials:master May 3, 2020
Linked issue: Keyword arg for predicted column name (#266)