
Tabular: Added user-specified groups parameter, fixed HPO for RF and KNN #1224

Merged 3 commits into master on Jul 21, 2021

Conversation

@Innixma (Contributor) commented Jul 5, 2021

Issue #, if available:

#1174, #1120, #617, #901

Description of changes:

  • Added user-specified groups parameter to enable custom fold-splits.
  • Fixed HPO for RF and KNN (previously it would crash and not train the models if HPO was specified).

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@Innixma requested review from jwmueller and gradientsky, July 5, 2021 22:07
@@ -91,6 +91,13 @@ class TabularPredictor:
If True, then weighted metrics will be reported based on the sample weights provided in the specified `sample_weight` (in which case `sample_weight` column must also be present in test data).
In this case, the 'best' model used by default for prediction will also be decided based on a weighted version of evaluation metric.
Note: we do not recommend specifying `weight_evaluation` when `sample_weight` is 'auto_weight' or 'balance_weight', instead specify appropriate `eval_metric`.
groups : str, default = None
[Experimental] If specified, AutoGluon will use the column of `train_data` whose name is the value of `groups` as the data splitting indices during `.fit` for the purposes of bagging.
Contributor

May be better left as a kwarg while it is experimental?

Contributor Author (@Innixma)

Since this feature is asked about so much, I'd rather it be highly visible so it isn't missed. Despite being experimental, it fully works without issue; it's just that if the user specifies an invalid or weird grouping, edge cases can occur that may result in strange errors (it is hard to detect whether groups are valid prior to trying). This is mostly related to cases with very small amounts of data.

[Experimental] If specified, AutoGluon will use the column of `train_data` whose name is the value of `groups` as the data splitting indices during `.fit` for the purposes of bagging.
This column will not be used as a feature during model training.
The data will be split via `sklearn.model_selection.LeaveOneGroupOut`.
Use this option to control the exact split indices AutoGluon uses.
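
As an illustration of the split behavior this docstring describes, here is a minimal sketch using sklearn's LeaveOneGroupOut directly (the data and column names are hypothetical, not from the PR; in the predictor this would presumably correspond to passing groups='fold' to the TabularPredictor constructor):

# Minimal sketch (hypothetical data): how a groups column drives fold
# splitting via sklearn.model_selection.LeaveOneGroupOut.
import pandas as pd
from sklearn.model_selection import LeaveOneGroupOut

train_data = pd.DataFrame({
    'feature': [3, 1, 4, 1, 5, 9],
    'label':   [0, 1, 0, 1, 0, 1],
    'fold':    [0, 0, 0, 1, 1, 1],  # adjacent rows share a group, so folds preserve row order
})

logo = LeaveOneGroupOut()
for train_idx, val_idx in logo.split(train_data, groups=train_data['fold']):
    print(f'train rows={list(train_idx)}, val rows={list(val_idx)}')
# train rows=[3, 4, 5], val rows=[0, 1, 2]
# train rows=[0, 1, 2], val rows=[3, 4, 5]

Each unique group value yields exactly one fold, with that group held out as the validation set.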
@jwmueller (Contributor) commented Jul 6, 2021

Probably need to clarify what the groups column values should look like, how it works with bagging vs. no bagging, and what happens if tuning_data was also specified.

Contributor

could be helpful to provide a motivating example: "For example, if you want your data folds to preserve adjacent rows in the table (without shuffling), then the groups column should look like ..."

Contributor Author (@Innixma)

Added an example and more clarifications.

@szha commented Jul 11, 2021

Job PR-1224-3 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-1224/3/index.html

@szha commented Jul 12, 2021

Job PR-1224-4 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-1224/4/index.html

@szha commented Jul 15, 2021

Job PR-1224-5 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-1224/5/index.html

return

if k_fold_end is None:
    k_fold_end = k_fold
Contributor

Is it expected that the caller function should get this override? Currently it'll be replaced just in this function scope.

Contributor Author (@Innixma)

This occurs elsewhere. At the point where this validation is called, groups can also be present, meaning k_fold itself can still change, so we don't want to set k_fold_end to the old k_fold value.
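
For illustration only (hypothetical function name, not the PR's actual code), the reassignment is indeed local to the function, so the caller's value is untouched:

def _validate(k_fold, k_fold_end=None):
    # Rebinding a parameter affects only this function's scope.
    if k_fold_end is None:
        k_fold_end = k_fold
    return k_fold_end

k_fold_end = None
print(_validate(k_fold=8, k_fold_end=k_fold_end))  # 8
print(k_fold_end)  # still None in the caller's scope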

Comment on lines +101 to +102
Bugs may arise from edge cases if the provided groups are not valid to properly train models, such as if not all classes are present during training in multiclass classification. It is up to the user to sanitize their groups.

Contributor

There are a couple of cases to worry about here:

  1. Heavy class imbalance between groups would result in incorrect time-limit estimates and potentially memory errors on 'heavy' folds. A sanity check for such imbalance might be helpful.
  2. Sub-folds might have just a subset of the training labels, and bags need to be aligned on label space if some form of label encoding is used. Consider rows of the form `..., group, label`:

...,0,A
...,0,B
...,0,C
...,1,C
...,1,D
...,1,D

The first fold would know only about A, B, C; the second only about C, D. Is there handling which would align the fold predictions into an aggregate prediction?

@Innixma (Contributor Author) commented Jul 20, 2021

  1. This is very hard to fix; even if we knew there was imbalance, it wouldn't necessarily help us that much.

  2. The worst-case scenario, which causes a crash, is if a class is present in only one fold, meaning a model that uses that fold as validation won't see that class in training. This is what I'd consider an invalid grouping, and it probably isn't worth trying to support (unless we later determine that there are valid use-cases for it). In your example, classes A, B, and D fit that description and will cause a crash.
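
A small sketch (not from the PR) reproduces this example and shows which classes each model's training split sees under LeaveOneGroupOut:

import pandas as pd
from sklearn.model_selection import LeaveOneGroupOut

df = pd.DataFrame({
    'group': [0, 0, 0, 1, 1, 1],
    'label': ['A', 'B', 'C', 'C', 'D', 'D'],
})

for train_idx, val_idx in LeaveOneGroupOut().split(df, groups=df['group']):
    # Classes A, B, and D each occur in a single group, so the model whose
    # validation fold holds that group never sees them during training.
    print(sorted(set(df['label'].iloc[train_idx])), '->',
          sorted(set(df['label'].iloc[val_idx])))
# ['C', 'D'] -> ['A', 'B', 'C']
# ['A', 'B', 'C'] -> ['C', 'D']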

if self.groups is not None:
    num_groups = len(self.groups.unique())
    if self.n_repeats != 1:
        raise AssertionError(f'n_repeats must be 1 when split groups are specified. (n_repeats={self.n_repeats})')
@gradientsky (Contributor) commented Jul 20, 2021

Do we really need to raise an exception? Alternatively we could log a warning and reset it to 1 automatically.

Contributor Author (@Innixma)

I prefer an exception; n_repeats is simply not valid for this situation and should never be passed in this fashion.

if self.n_repeats != 1:
    raise AssertionError(f'n_repeats must be 1 when split groups are specified. (n_repeats={self.n_repeats})')
self.n_splits = num_groups
splitter_cls = LeaveOneGroupOut
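
For context, LeaveOneGroupOut deterministically yields exactly one split per unique group, which is why n_splits is set to num_groups and repeating (n_repeats > 1) would only duplicate identical folds. A quick check outside the PR code:

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

groups = np.array([0, 0, 1, 1, 2, 2])
X = np.zeros((len(groups), 1))

# The number of splits equals the number of unique groups.
print(LeaveOneGroupOut().get_n_splits(X, groups=groups))  # 3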
Contributor

log a warning that the feature is experimental

Contributor Author (@Innixma)

This would cause log spam: this code is repeated for every bag.

@jwmueller jwmueller self-requested a review July 21, 2021 01:28
@jwmueller (Contributor) left a comment

LGTM

@Innixma merged commit 20f558c into master Jul 21, 2021
@Innixma deleted the tabular_groups branch November 22, 2021 19:45