Tabular: Added user-specified groups parameter, fixed HPO for RF and KNN #1224
Conversation
@@ -91,6 +91,13 @@ class TabularPredictor:
If True, then weighted metrics will be reported based on the sample weights provided in the specified `sample_weight` (in which case `sample_weight` column must also be present in test data).
In this case, the 'best' model used by default for prediction will also be decided based on a weighted version of evaluation metric.
Note: we do not recommend specifying `weight_evaluation` when `sample_weight` is 'auto_weight' or 'balance_weight', instead specify appropriate `eval_metric`.
groups : str, default = None
[Experimental] If specified, AutoGluon will use the column named the value of groups in `train_data` during `.fit` as the data splitting indices for the purposes of bagging.
may be better left as kwarg while it is experimental?
Since this feature is asked about so much, I'd rather it be highly visible so it isn't missed. Despite being experimental, it does fully work without issue; it's just that if the user specifies an invalid or weird grouping, edge cases can occur that may result in strange errors (it's hard to detect whether groups are invalid prior to trying). This is mostly related to cases with very small amounts of data.
[Experimental] If specified, AutoGluon will use the column named the value of groups in `train_data` during `.fit` as the data splitting indices for the purposes of bagging.
This column will not be used as a feature during model training.
The data will be split via `sklearn.model_selection.LeaveOneGroupOut`.
Use this option to control the exact split indices AutoGluon uses.
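To make the documented behavior concrete, here is a minimal standalone sketch (not AutoGluon internal code) of how `sklearn.model_selection.LeaveOneGroupOut` turns a groups column into bagging splits: each unique group value becomes the validation fold of exactly one split. The arrays are illustrative.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

X = np.arange(12).reshape(6, 2)          # 6 rows of feature data
y = np.array([0, 1, 0, 1, 0, 1])         # labels
groups = np.array([0, 0, 1, 1, 2, 2])    # hypothetical `groups` column values

logo = LeaveOneGroupOut()
print(logo.get_n_splits(groups=groups))  # 3 folds: one per unique group value
for train_idx, val_idx in logo.split(X, y, groups):
    # each fold holds out exactly one group as validation data
    print('train:', train_idx, 'val:', val_idx)
```

Note that the number of bagged folds is therefore fully determined by the number of unique values in the groups column, not by `k_fold`.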
probably need to clarify what the `groups` column values should look like. And how it works with bagging vs no-bagging. And what happens if `tuning_data` was also specified.
could be helpful to provide a motivating example: "For example, if you want your data folds to preserve adjacent rows in the table (without shuffling), then the `groups` column should look like ..."
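As a hedged sketch of that motivating example (the column name `fold_group` is illustrative, not part of the API): one way to build a groups column that keeps adjacent rows together, without shuffling, is to assign each contiguous block of rows the same group id.

```python
import pandas as pd

df = pd.DataFrame({'feature': range(10), 'label': [0, 1] * 5})
k = 5  # desired number of folds

# Each contiguous block of len(df) // k rows gets the same group id,
# so folds preserve adjacent rows in the original table order.
df['fold_group'] = df.index // (len(df) // k)
print(df['fold_group'].tolist())  # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
```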
added example and more clarifications
Job PR-1224-3 is done.
Job PR-1224-4 is done.
Job PR-1224-5 is done.
return

if k_fold_end is None:
    k_fold_end = k_fold
Is it expected that the caller function should get this override? Currently it'll be replaced just in this function scope.
This occurs elsewhere. At the point where this validation is called, groups can also be present, meaning k_fold itself can still change, so we don't want to set k_fold_end to the old k_fold value.
Bugs may arise from edge cases if the provided groups are not valid to properly train models, such as if not all classes are present during training in multiclass classification. It is up to the user to sanitize their groups.
There are a couple of cases to worry about here:
- Heavy class imbalance between groups would result in incorrect time limit estimates and potentially memory errors on 'heavy' folds. A sanity check for such imbalance might be helpful.
- Sub-folds might see just a subset of the training labels, and bags need to be aligned on label space if some form of label encoding is used.
...,0,A
...,0,B
...,0,C
...,1,C
...,1,D
...,1,D
First fold would know only about A,B,C; 2nd about C,D. Is there handling which would align the fold predictions into an aggregate prediction?
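The label-space mismatch can be reproduced directly with `LeaveOneGroupOut` on the 6-row example above (a sketch, not AutoGluon code; note sklearn iterates held-out groups in sorted order, so the split trained on group 1 comes first):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

y = np.array(['A', 'B', 'C', 'C', 'D', 'D'])   # labels from the example rows
groups = np.array([0, 0, 0, 1, 1, 1])          # the groups column
X = np.zeros((6, 1))                           # placeholder features

for train_idx, val_idx in LeaveOneGroupOut().split(X, y, groups):
    # each split's training data misses some classes entirely
    print('train labels:', sorted(set(y[train_idx])),
          'val labels:', sorted(set(y[val_idx])))
# train labels: ['C', 'D'] val labels: ['A', 'B', 'C']
# train labels: ['A', 'B', 'C'] val labels: ['C', 'D']
```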
- This is very hard to fix; even if we knew there was imbalance, it wouldn't necessarily help us that much.
- The worst-case scenario, which causes a crash, is if a class is only present in 1 fold, meaning that a model that uses that fold as validation won't see that class in training. This is what I'd consider an invalid grouping, and probably isn't worth trying to work with (unless we later determine that there are valid use-cases for this being the case). In your example, classes A, B, and D fit that description, and will cause a crash.
if self.groups is not None:
    num_groups = len(self.groups.unique())
    if self.n_repeats != 1:
        raise AssertionError(f'n_repeats must be 1 when split groups are specified. (n_repeats={self.n_repeats})')
Do we really need to raise exception? Alternatively we can log a warning and reset it to 1 automatically.
I prefer an exception; n_repeats is simply not valid for this situation and should never be passed in this fashion.
    if self.n_repeats != 1:
        raise AssertionError(f'n_repeats must be 1 when split groups are specified. (n_repeats={self.n_repeats})')
    self.n_splits = num_groups
    splitter_cls = LeaveOneGroupOut
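A self-contained sketch of the validation logic quoted above (names simplified and hypothetical, not the actual AutoGluon method): the fold count is derived from the number of unique group values, and any `n_repeats` other than 1 is rejected outright rather than silently reset.

```python
import pandas as pd
from sklearn.model_selection import LeaveOneGroupOut

def resolve_splitter(groups: pd.Series, n_repeats: int = 1):
    """Return the splitter class and fold count implied by a groups column."""
    if n_repeats != 1:
        # Repeated bagging is meaningless with fixed groups: every repeat
        # would produce the identical splits, so fail loudly instead.
        raise AssertionError(
            f'n_repeats must be 1 when split groups are specified. (n_repeats={n_repeats})')
    n_splits = len(groups.unique())  # one fold per unique group value
    return LeaveOneGroupOut, n_splits

splitter_cls, n_splits = resolve_splitter(pd.Series(['a', 'a', 'b', 'c']))
print(n_splits)  # 3
```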
log a warning that the feature is experimental
This would spam the log: this code path is repeated for every bag.
LGTM
Issue #, if available:
#1174, #1120, #617, #901
Description of changes:
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.