min_child_samples plays bad with weights #5236

Closed
memeplex opened this issue May 23, 2022 · 4 comments

memeplex commented May 23, 2022

Description

I'm revisiting a previous report of mine: #3634

I've been having lots of trouble trying to fit models with heterogeneous sample weights, for example:

  • Stratified sampling on the response (Horvitz-Thompson): when there are too many zero responses, I sample them down with some inclusion probability pi and then compensate using weight 1 / pi (a minimal sketch follows this list).

  • In logistic regression, I sometimes collapse a run of zeros or a run of ones sharing the same X into a single observation whose weight is the size of the group.
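To make the first scenario concrete, here is a minimal sketch (the data and constants are synthetic placeholders, not from any real model of mine):

```python
# Minimal sketch of scenario 1: downsample zero responses with inclusion
# probability pi, then compensate with weight 1 / pi (Horvitz-Thompson style).
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(42)
n = 100_000
X = rng.normal(size=(n, 5))
y = np.maximum(0.0, X[:, 0] + rng.normal(size=n) - 2.0)  # mostly zeros

pi = 0.05                                  # inclusion probability for zeros
keep = (y > 0) | (rng.random(n) < pi)      # keep all positives, sample zeros
w = np.where(y[keep] > 0, 1.0, 1.0 / pi)   # sampled-down zeros get weight 1/pi

train = lgb.Dataset(X[keep], label=y[keep], weight=w)
```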

The heuristic implemented for min_child_samples works poorly in cases like these: for example, requiring min_child_samples = 200 still yields many nodes with fewer than 10 observations. In many cases this is unacceptable; the selected models are extremely noisy and model selection itself is dominated by the variance of the CV error estimate.
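One way to check this directly is to count raw rows per leaf. The sketch below (again synthetic data and arbitrary parameter choices) trains with skewed weights and min_child_samples = 200 and prints the smallest raw leaf size of the first tree:

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
n = 10_000
X = rng.normal(size=(n, 5))
y = (X[:, 0] + rng.normal(size=n) > 1.5).astype(int)
w = np.where(rng.random(n) < 0.01, 100.0, 1.0)  # a few rows carry most weight

booster = lgb.train(
    {"objective": "binary", "min_child_samples": 200, "verbose": -1},
    lgb.Dataset(X, label=y, weight=w),
    num_boost_round=10,
)

# Count raw observations per leaf of the first tree.
leaf_idx = booster.predict(X, pred_leaf=True)[:, 0].astype(int)
counts = np.bincount(leaf_idx)
print("smallest leaf (raw rows):", counts[counts > 0].min())
```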

Some alternatives are using ridge regularization to change the prior (as if there were a number of extra zero observations inside the leaf), or using min_child_weight to bound the hessian. The first introduces shrinkage, which is not always the best option; the second is difficult to tune because, although the hessian is related to the sum of weights, the exact relationship depends on the type of regression, the current estimate, and the value of the independent variable.
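For reference (these are the textbook second derivatives of the losses with respect to the raw score, not anything specific to this thread): with L2 loss the per-row hessian is just h_i = w_i, so a hessian bound is exactly a weighted count, but with binary log loss it is h_i = w_i * p_i * (1 - p_i), which depends on the current prediction p_i and shrinks toward zero as p_i approaches 0 or 1. That is why mapping min_child_weight back to a number of samples is objective- and iteration-dependent.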

I believe a simple and reliable control over the number of samples in a leaf is an important hyperparameter to have, but the current heuristic doesn't provide that reliability.

@shiyu1994
Collaborator

@memeplex Thanks. I do think it is necessary to provide an exact control over min_child_samples. Will include this as a feature request.

@StrikerRUS
Collaborator

Closing in favor of tracking this in #2302. We decided to keep all feature requests in one place.

You're welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.

@memeplex
Author

memeplex commented Aug 30, 2022

I've been trying to fix this issue and have explored a few approaches:

  1. Make hist_t a struct with grad, hess and cnt fields. But the current representation as a series of doubles alternating grad and hess leaks everywhere; it's very difficult to get this right, not because of any conceptual complexity but simply because of the many places that access the internal representation of a histogram, with lots of << 1, * 2 and pointer arithmetic here and there. A lot of this code I'm not even able to test (e.g. CUDA).
  2. When building a histogram, don't add buckets with cnt < min_child_samples at the beginning or end of the hist array; instead accumulate until cumcnt >= min_child_samples, so splits at the leftmost and rightmost buckets of the hist are not considered unless enough samples have been accumulated (a toy sketch follows this list). But since buckets are filled in a random order, I still need to keep all counts somewhere while creating the histogram and then accumulate and remove the leftmost and rightmost buckets at the end of histogram construction; only after that can I get rid of the temporary counts array.
  3. This is the quick & dirty approach: use the lower bits of the mantissa of the hessian to keep the count (also sketched after this list). Since the hessian is represented as a double, there are 52 bits of mantissa. For normal floating-point numbers, which are always of the 1.dddd... form, using about 12-13 least significant bits won't affect precision in any appreciable way (more than 10 decimal digits of separation from the most significant digit), and it also works with subnormal numbers. Moreover, hessians are only added, so there is no cumulative relative error. This is the easiest solution I can think of: most things keep working with a tiny error, no extra memory is used, and counts are represented exactly up to some number of bits.
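For the record, a toy sketch of the constraint behind approach 2 (plain NumPy, not the actual histogram code): from per-bin raw counts, find the first and last bin indices after which a split still leaves at least min_child_samples rows on each side.

```python
# Toy sketch of approach 2: which bin boundaries are eligible split points
# once boundary buckets must accumulate at least `min_child_samples` rows.
import numpy as np

def valid_split_range(bin_counts, min_child_samples):
    cum = np.cumsum(bin_counts)
    total = cum[-1]
    left_ok = cum >= min_child_samples              # rows going to the left child
    right_ok = (total - cum) >= min_child_samples   # rows going to the right child
    ok = np.flatnonzero(left_ok & right_ok)
    return (int(ok[0]), int(ok[-1])) if ok.size else None  # None: no valid split

# Splitting right after bin 0 (or near the right edge) would strand too few rows.
print(valid_split_range([5, 30, 400, 50, 3, 2], min_child_samples=20))  # (1, 2)
```

And a standalone sketch of the bit trick in approach 3 (pure Python, the constants are illustrative, not LightGBM code): the count lives in the lowest mantissa bits of the hessian, perturbing its value only around the 12th significant digit.

```python
# Sketch of approach 3: keep an exact sample count in the low mantissa bits
# of the accumulated hessian double.
import struct

COUNT_BITS = 13                       # illustrative; enough for counts < 8192
COUNT_MASK = (1 << COUNT_BITS) - 1

def pack(hess: float, count: int) -> float:
    """Overwrite the low mantissa bits of `hess` with `count`."""
    bits = struct.unpack("<Q", struct.pack("<d", hess))[0]
    bits = (bits & ~COUNT_MASK) | (count & COUNT_MASK)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def unpack_count(hess: float) -> int:
    return struct.unpack("<Q", struct.pack("<d", hess))[0] & COUNT_MASK

h = pack(12.3456789, 200)
print(h, unpack_count(h))  # hessian changed only ~1e-12 relative; count is exact
```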

@jameslamb
Collaborator

Thanks for working on this!

Re-opening the discussion since you're working on it, and tagging @guolinke and @shiyu1994 to help respond to you when they have time. Please be patient, as the project's small maintainer team is very focused right now on issues critical to the next release (#5153).

@jameslamb jameslamb reopened this Sep 5, 2022