Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make metaselection part of AutoFeaturizer #120

Merged
merged 19 commits into from
Nov 13, 2018

Conversation

Qi-max
Copy link
Contributor

@Qi-max Qi-max commented Nov 2, 2018

  • define _customize_featurizers to customize self.featurizers and add metaselection to be part of it. Users can use self.metaselector to get the FeaturizerMetaSelector class, containing self.metaselector.dataset_mfs (dataset_metafeatures calculated for this dataset), self.metaselector.excludes (auto excluded featurizers that do not work) etc.

  • make self.featurizers only contain featurizers for featurizer_types that have corresponding column in the dataframe, or say the featurizers that we are really going to use. User-specificed featurizers does not have this restriction. _add_composition_from_structure is called before _customize_featurizers to use metaselection for the added composition column

  • rename the default column name to be "composition" instead of the original "formula"

  • make dataset_metafeatures, composition_statistics etc as def instead of class

  • fix some minor issues of the original glass_binary.csv and update the tests related to it

  • add tests for the metaselection part in TestAutoFeaturizer, fix copy.copy and add docs etc

@ardunn
Copy link
Contributor

ardunn commented Nov 2, 2018

Hey Qi, I have a few questions

  • Why did OFM get removed from default featurizers? I had previously added because it now has the ability to given flattened (ie ML-friendly) features, but hadn't rigorously tested it
  • If I'm not mistaken, all the dataset loading stuff is going to be moved to matminer very soon (I think @Doppe1g4nger has a PR in the works for it). So I don't think any tests should depend on matbench loading functions since they'll soon be removed completely...

@Qi-max
Copy link
Contributor Author

Qi-max commented Nov 2, 2018

Thanks for the comments, Alex!@ardunn

  • For the OFM, I am getting errors from this featurizer when running the tests and it seems to be related to flattening (I suspect but hadn't thoroughly tested yet). I am planning to take some time to look closer to it. Did the tests pass before with this featurizer in the list?

  • Yeah, I saw the dataset pr in matminer. Currently, this pr is not merged, so I used the load function for now. Maybe we can leave this pr open, so I can wait until that pr is merged and then update all the tests? Or I can choose another dataset for these tests. Btw, I have fixed some issues on the previous glass_binary dataset and I will send the revised dataset to Daniel.

@ardunn
Copy link
Contributor

ardunn commented Nov 2, 2018

yes tests work for me right now with OFM enabled...

@ardunn
Copy link
Contributor

ardunn commented Nov 2, 2018

re: datasets, ok i think now it is merged though

@ardunn
Copy link
Contributor

ardunn commented Nov 6, 2018

@Qi-max sorry the refactor is causing some conflicts here. You can probably just pull most recent changes and update and everything will be ok

I'll go ahead and merge once the following items are taken care of:

  • _customize_featurizers needs documentation
  • the new attrs added (e.g., exclude, use_metaselector, need documentation
  • max_na_percent ---> max_na_frac (0.05% is not the same as 0.05 fraction, which is used in DataCleaner. having both is confusing? or i might be misinterpreting here...)
  • reinstate OFM for the time being (what is the error being thrown? I might be able to help debug
  • remove any tests which depend on old matbench dataset loading
  • since this PR is changing the featurization procedure (albeit slightly), there should be a separate test making sure some of the new args work as intended. ie a TestAutoFeaturizer method called test_metaselector which compares featurized dataframes with the metaselector on and off. And/or a TestAutoFeaturizer test method which makes sure when a user passes in featurizers, those featurizers are actually executed, and the autofeaturization is not used.

@Qi-max
Copy link
Contributor Author

Qi-max commented Nov 12, 2018

I think the PR is good now. @ardunn

@ardunn
Copy link
Contributor

ardunn commented Nov 13, 2018

Great, thanks for doing this @Qi-max

@ardunn ardunn merged commit a8b6268 into hackingmaterials:master Nov 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants