min_child_samples plays bad with weights #5236
Comments
@memeplex Thanks. I do think it is necessary to provide an exact control over
Closed in favor of tracking this in #2302. We decided to keep all feature requests in one place. You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.
I've been trying to fix this issue. I tried a number of approaches:
Thanks for working on this! Re-opening the discussion since you're working on it, and tagging @guolinke and @shiyu1994 to help respond to you when they have time. Please be patient, as the project's small maintainer team is very focused right now on issues critical to the next release (#5153).
Description
I'm revisiting a previous report of mine: #3634
I've been having a lot of trouble trying to fit models with heterogeneous sample weights, for example:
- Stratified sampling on the response (Horvitz-Thompson): when there are too many zero responses, I sample them down with some inclusion probability pi and compensate using weight 1 / pi (see the sketch after this list).
- In logistic regression, I sometimes collapse a run of zeros or a run of ones sharing the same X into a single observation, so that the weight of the observation is the size of the group.
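To make the first setup concrete, here is a minimal sketch of how such weights arise; the data and all variable names are synthetic illustrations, not taken from the report:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Synthetic data with a large majority of zero responses.
X = rng.normal(size=(n, 5))
y = (X[:, 0] + rng.normal(size=n) > 2.0).astype(int)

# Keep every positive; keep each zero with inclusion probability pi,
# then compensate with the Horvitz-Thompson weight 1 / pi.
pi = 0.05
keep = (y == 1) | (rng.uniform(size=n) < pi)
X_s, y_s = X[keep], y[keep]
w_s = np.where(y_s == 1, 1.0, 1.0 / pi)  # weights of 1 and 20 mixed
```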
The heuristic implemented for min_child_samples works poorly in cases like these: for example, requiring min_child_samples = 200 still yields many nodes with fewer than 10 observations. In many cases this is unacceptable; the selected models are extremely noisy, and model selection itself is dominated by the variance of the CV error estimate.
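One way to observe the symptom is to train on the weighted sample above and count the raw (unweighted) samples per leaf. This sketch assumes LightGBM's standard Python API; whether the counts actually fall below the threshold will depend on the weight distribution and the LightGBM version:

```python
import lightgbm as lgb

train = lgb.Dataset(X_s, label=y_s, weight=w_s)
params = {
    "objective": "binary",
    "min_child_samples": 200,  # alias of min_data_in_leaf
    "verbose": -1,
}
booster = lgb.train(params, train, num_boost_round=10)

# pred_leaf=True returns the leaf index of every sample in every tree;
# counting them gives the raw leaf sizes of the first tree.
leaf_idx = booster.predict(X_s, pred_leaf=True)[:, 0]
leaf_sizes = np.bincount(leaf_idx.astype(int))
print("smallest raw leaf sizes:", np.sort(leaf_sizes)[:5])
```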
Some alternatives are ridge regularization, which changes the prior (as if there were a number of extra zero observations inside the leaf), and min_child_weight, which bounds the hessian. The first introduces shrinkage, which is not always desirable. The second is difficult to tune: although the hessian is related to the sum of weights, the exact relationship depends on the type of regression, the current estimate, and the value of the independent variable.
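As an illustration of that last point (my derivation, not from the report): for weighted binary log-loss with current predicted probability p_i and weight w_i, the per-observation hessian is

```latex
h_i = w_i \, p_i (1 - p_i), \qquad
\sum_{i \in \mathrm{leaf}} h_i \;\le\; \frac{1}{4} \sum_{i \in \mathrm{leaf}} w_i .
```

The bound is attained only when every p_i = 1/2; as the boosted predictions drift toward 0 or 1, the hessian sum shrinks, so a fixed min_child_weight threshold corresponds to a moving, data-dependent total weight.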
I believe a simple and reliable control over the number of samples in a leaf is an important hyperparameter, and the current heuristic doesn't provide one.