
Tabular: Added user-specified groups parameter, fixed HPO for RF and KNN #1224

Merged 3 commits into master on Jul 21, 2021

Conversation

@Innixma (Contributor) commented Jul 5, 2021

Issue #, if available:

#1174, #1120, #617, #901

Description of changes:

  • Added user-specified groups parameter to enable custom fold-splits.
  • Fixed HPO for RF and KNN (previously it would crash and not train the models if HPO was specified).

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@Innixma requested review from jwmueller and gradientsky, July 5, 2021 22:07
@@ -91,6 +91,13 @@ class TabularPredictor:
If True, then weighted metrics will be reported based on the sample weights provided in the specified `sample_weight` (in which case `sample_weight` column must also be present in test data).
In this case, the 'best' model used by default for prediction will also be decided based on a weighted version of evaluation metric.
Note: we do not recommend specifying `weight_evaluation` when `sample_weight` is 'auto_weight' or 'balance_weight', instead specify appropriate `eval_metric`.
groups : str, default = None
[Experimental] If specified, AutoGluon will use the column of `train_data` whose name is the value of `groups` as the data splitting indices during `.fit` for the purposes of bagging.
Contributor

May be better left as a kwarg while it is experimental?

Contributor Author (@Innixma)

Since this feature is asked about so much, I'd rather it be highly visible so it isn't missed. Despite being experimental, it fully works without issue; it's just that if the user specifies an invalid or weird grouping, edge cases can occur that may result in strange errors (it is hard to detect whether groups are valid prior to trying). This is mostly related to cases with very small amounts of data.

[Experimental] If specified, AutoGluon will use the column of `train_data` whose name is the value of `groups` as the data splitting indices during `.fit` for the purposes of bagging.
This column will not be used as a feature during model training.
The data will be split via `sklearn.model_selection.LeaveOneGroupOut`.
Use this option to control the exact split indices AutoGluon uses.
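
As an illustration of the split behavior this docstring describes, here is a minimal sketch using sklearn's LeaveOneGroupOut directly (the data and column names are hypothetical, not from the PR; in the predictor this would presumably correspond to passing groups='fold' to the TabularPredictor constructor):

# Minimal sketch (hypothetical data): how a groups column drives fold
# splitting via sklearn.model_selection.LeaveOneGroupOut.
import pandas as pd
from sklearn.model_selection import LeaveOneGroupOut

train_data = pd.DataFrame({
    'feature': [3, 1, 4, 1, 5, 9],
    'label':   [0, 1, 0, 1, 0, 1],
    'fold':    [0, 0, 0, 1, 1, 1],  # adjacent rows share a group, so folds preserve row order
})

logo = LeaveOneGroupOut()
for train_idx, val_idx in logo.split(train_data, groups=train_data['fold']):
    print(f'train rows={list(train_idx)}, val rows={list(val_idx)}')
# train rows=[3, 4, 5], val rows=[0, 1, 2]
# train rows=[0, 1, 2], val rows=[3, 4, 5]

Each unique group value yields exactly one fold, with that group held out as the validation set.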
@jwmueller (Contributor) commented Jul 6, 2021

Probably need to clarify what the groups column values should look like, how it works with bagging vs. no bagging, and what happens if tuning_data was also specified.

Contributor

could be helpful to provide a motivating example: "For example, if you want your data folds to preserve adjacent rows in the table (without shuffling), then the groups column should look like ..."

Contributor Author (@Innixma)

Added an example and more clarifications.

@szha commented Jul 11, 2021

Job PR-1224-3 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-1224/3/index.html

@szha commented Jul 12, 2021

Job PR-1224-4 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-1224/4/index.html

@szha commented Jul 15, 2021

Job PR-1224-5 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-1224/5/index.html

return

if k_fold_end is None:
    k_fold_end = k_fold
Contributor

Is it expected that the caller function should get this override? Currently it'll be replaced just in this function scope.

Contributor Author (@Innixma)

This occurs elsewhere. At the point where this validation is called, groups can also be present, meaning k_fold itself can still change, so we don't want to set k_fold_end to the old k_fold value.
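
For illustration only (hypothetical function name, not the PR's actual code), the reassignment is indeed local to the function, so the caller's value is untouched:

def _validate(k_fold, k_fold_end=None):
    # Rebinding a parameter affects only this function's scope.
    if k_fold_end is None:
        k_fold_end = k_fold
    return k_fold_end

k_fold_end = None
print(_validate(k_fold=8, k_fold_end=k_fold_end))  # 8
print(k_fold_end)  # still None in the caller's scope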

Comment on lines +101 to +102
Bugs may arise from edge cases if the provided groups are not valid to properly train models, such as if not all classes are present during training in multiclass classification. It is up to the user to sanitize their groups.

Contributor

There are a couple of cases to worry about here:

  1. Heavy class imbalance between groups would result in incorrect time-limit estimates and potentially memory errors on 'heavy' folds. A sanity check for such imbalance might be helpful.
  2. Sub-folds might have just a subset of the training labels, and bags need to be aligned on label space if some form of label encoding is used. Consider rows of the form `..., group, label`:

...,0,A
...,0,B
...,0,C
...,1,C
...,1,D
...,1,D

The first fold would know only about A, B, C; the second only about C, D. Is there handling which would align the fold predictions into an aggregate prediction?

@Innixma (Contributor Author) commented Jul 20, 2021

  1. This is very hard to fix; even if we knew there was imbalance, it wouldn't necessarily help us that much.

  2. The worst-case scenario, which causes a crash, is if a class is present in only one fold, meaning a model that uses that fold as validation won't see that class in training. This is what I'd consider an invalid grouping, and it probably isn't worth trying to support (unless we later determine that there are valid use-cases for it). In your example, classes A, B, and D fit that description and will cause a crash.
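
A small sketch (not from the PR) reproduces this example and shows which classes each model's training split sees under LeaveOneGroupOut:

import pandas as pd
from sklearn.model_selection import LeaveOneGroupOut

df = pd.DataFrame({
    'group': [0, 0, 0, 1, 1, 1],
    'label': ['A', 'B', 'C', 'C', 'D', 'D'],
})

for train_idx, val_idx in LeaveOneGroupOut().split(df, groups=df['group']):
    # Classes A, B, and D each occur in a single group, so the model whose
    # validation fold holds that group never sees them during training.
    print(sorted(set(df['label'].iloc[train_idx])), '->',
          sorted(set(df['label'].iloc[val_idx])))
# ['C', 'D'] -> ['A', 'B', 'C']
# ['A', 'B', 'C'] -> ['C', 'D']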

if self.groups is not None:
    num_groups = len(self.groups.unique())
    if self.n_repeats != 1:
        raise AssertionError(f'n_repeats must be 1 when split groups are specified. (n_repeats={self.n_repeats})')
@gradientsky (Contributor) commented Jul 20, 2021

Do we really need to raise an exception? Alternatively we could log a warning and reset it to 1 automatically.

Contributor Author (@Innixma)

I prefer an exception; n_repeats is simply not valid for this situation and should never be passed in this fashion.

if self.n_repeats != 1:
    raise AssertionError(f'n_repeats must be 1 when split groups are specified. (n_repeats={self.n_repeats})')
self.n_splits = num_groups
splitter_cls = LeaveOneGroupOut
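
For context, LeaveOneGroupOut deterministically yields exactly one split per unique group, which is why n_splits is set to num_groups and repeating (n_repeats > 1) would only duplicate identical folds. A quick check outside the PR code:

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

groups = np.array([0, 0, 1, 1, 2, 2])
X = np.zeros((len(groups), 1))

# The number of splits equals the number of unique groups.
print(LeaveOneGroupOut().get_n_splits(X, groups=groups))  # 3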
Contributor

log a warning that the feature is experimental

Contributor Author (@Innixma)

This would cause log spam: this code is repeated for every bag.

@jwmueller jwmueller self-requested a review July 21, 2021 01:28
@jwmueller (Contributor) left a comment

LGTM

@Innixma merged commit 20f558c into master Jul 21, 2021
@Innixma deleted the tabular_groups branch November 22, 2021 19:45