min_child_samples plays bad with weights #5236

Closed
memeplex opened this issue May 23, 2022 · 4 comments

memeplex commented May 23, 2022

Description

I'm revisiting a previous report of mine: #3634

I've been having lots of trouble trying to fit models with heterogeneous sample weights, for example:

  • Stratified sampling on the response (Horvitz-Thompson): when there are too many zero responses, I sample them down with some inclusion probability pi and then compensate using weight 1 / pi (a minimal sketch follows this list).

  • In logistic regression, I sometimes collapse a run of zeros or a run of ones sharing the same X into a single observation whose weight is the size of the group.
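To make the first scenario concrete, here is a minimal sketch (the data and constants are synthetic placeholders, not from any real model of mine):

```python
# Minimal sketch of scenario 1: downsample zero responses with inclusion
# probability pi, then compensate with weight 1 / pi (Horvitz-Thompson style).
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(42)
n = 100_000
X = rng.normal(size=(n, 5))
y = np.maximum(0.0, X[:, 0] + rng.normal(size=n) - 2.0)  # mostly zeros

pi = 0.05                                  # inclusion probability for zeros
keep = (y > 0) | (rng.random(n) < pi)      # keep all positives, sample zeros
w = np.where(y[keep] > 0, 1.0, 1.0 / pi)   # sampled-down zeros get weight 1/pi

train = lgb.Dataset(X[keep], label=y[keep], weight=w)
```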

The heuristic implemented for min_child_samples works poorly in cases like these: for example, requiring min_child_samples = 200 still yields many nodes with fewer than 10 observations. In many cases this is unacceptable; the selected models are extremely noisy and model selection itself is dominated by the variance of the CV error estimate.
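One way to check this directly is to count raw rows per leaf. The sketch below (again synthetic data and arbitrary parameter choices) trains with skewed weights and min_child_samples = 200 and prints the smallest raw leaf size of the first tree:

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
n = 10_000
X = rng.normal(size=(n, 5))
y = (X[:, 0] + rng.normal(size=n) > 1.5).astype(int)
w = np.where(rng.random(n) < 0.01, 100.0, 1.0)  # a few rows carry most weight

booster = lgb.train(
    {"objective": "binary", "min_child_samples": 200, "verbose": -1},
    lgb.Dataset(X, label=y, weight=w),
    num_boost_round=10,
)

# Count raw observations per leaf of the first tree.
leaf_idx = booster.predict(X, pred_leaf=True)[:, 0].astype(int)
counts = np.bincount(leaf_idx)
print("smallest leaf (raw rows):", counts[counts > 0].min())
```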

Some alternatives are using ridge regularization to change the prior (as if there were a number of extra zero observations inside the leaf), or using min_child_weight to bound the hessian. The first introduces shrinkage, which is not always the best option; the second is difficult to tune because, although the hessian is related to the sum of weights, the exact relationship depends on the type of regression, the current estimate, and the value of the independent variable.
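For reference (these are the textbook second derivatives of the losses with respect to the raw score, not anything specific to this thread): with L2 loss the per-row hessian is just h_i = w_i, so a hessian bound is exactly a weighted count, but with binary log loss it is h_i = w_i * p_i * (1 - p_i), which depends on the current prediction p_i and shrinks toward zero as p_i approaches 0 or 1. That is why mapping min_child_weight back to a number of samples is objective- and iteration-dependent.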

I believe a simple and reliable control over the number of samples in a leaf is an important hyperparameter to have, but the current heuristic doesn't provide that reliability.

@shiyu1994
Collaborator

@memeplex Thanks. I do think it is necessary to provide an exact control over min_child_samples. Will include this as a feature request.

@StrikerRUS
Collaborator

Closing in favor of tracking this in #2302. We decided to keep all feature requests in one place.

You're welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.

@memeplex
Author

memeplex commented Aug 30, 2022

I've been trying to fix this issue and have explored a few approaches:

  1. Make hist_t a struct with grad, hess and cnt fields. But the current representation as a series of doubles alternating grad and hess leaks everywhere; it's very difficult to get this right, not because of any conceptual complexity but simply because of the many places that access the internal representation of a histogram, with lots of << 1, * 2 and pointer arithmetic here and there. A lot of this code I'm not even able to test (e.g. CUDA).
  2. When building a histogram, don't add buckets with cnt < min_child_samples at the beginning or end of the hist array; instead accumulate until cumcnt >= min_child_samples, so splits at the leftmost and rightmost buckets of the hist are not considered unless enough samples have been accumulated (a toy sketch follows this list). But since buckets are filled in a random order, I still need to keep all counts somewhere while creating the histogram and then accumulate and remove the leftmost and rightmost buckets at the end of histogram construction; only after that can I get rid of the temporary counts array.
  3. This is the quick & dirty approach: use the lower bits of the mantissa of the hessian to keep the count (also sketched after this list). Since the hessian is represented as a double, there are 52 bits of mantissa. For normal floating-point numbers, which are always of the 1.dddd... form, using about 12-13 least significant bits won't affect precision in any appreciable way (more than 10 decimal digits of separation from the most significant digit), and it also works with subnormal numbers. Moreover, hessians are only added, so there is no cumulative relative error. This is the easiest solution I can think of: most things keep working with a tiny error, no extra memory is used, and counts are represented exactly up to some number of bits.
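For the record, a toy sketch of the constraint behind approach 2 (plain NumPy, not the actual histogram code): from per-bin raw counts, find the first and last bin indices after which a split still leaves at least min_child_samples rows on each side.

```python
# Toy sketch of approach 2: which bin boundaries are eligible split points
# once boundary buckets must accumulate at least `min_child_samples` rows.
import numpy as np

def valid_split_range(bin_counts, min_child_samples):
    cum = np.cumsum(bin_counts)
    total = cum[-1]
    left_ok = cum >= min_child_samples              # rows going to the left child
    right_ok = (total - cum) >= min_child_samples   # rows going to the right child
    ok = np.flatnonzero(left_ok & right_ok)
    return (int(ok[0]), int(ok[-1])) if ok.size else None  # None: no valid split

# Splitting right after bin 0 (or near the right edge) would strand too few rows.
print(valid_split_range([5, 30, 400, 50, 3, 2], min_child_samples=20))  # (1, 2)
```

And a standalone sketch of the bit trick in approach 3 (pure Python, the constants are illustrative, not LightGBM code): the count lives in the lowest mantissa bits of the hessian, perturbing its value only around the 12th significant digit.

```python
# Sketch of approach 3: keep an exact sample count in the low mantissa bits
# of the accumulated hessian double.
import struct

COUNT_BITS = 13                       # illustrative; enough for counts < 8192
COUNT_MASK = (1 << COUNT_BITS) - 1

def pack(hess: float, count: int) -> float:
    """Overwrite the low mantissa bits of `hess` with `count`."""
    bits = struct.unpack("<Q", struct.pack("<d", hess))[0]
    bits = (bits & ~COUNT_MASK) | (count & COUNT_MASK)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def unpack_count(hess: float) -> int:
    return struct.unpack("<Q", struct.pack("<d", hess))[0] & COUNT_MASK

h = pack(12.3456789, 200)
print(h, unpack_count(h))  # hessian changed only ~1e-12 relative; count is exact
```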

@jameslamb
Collaborator

Thanks for working on this!

Re-opening the discussion since you're working on it, and tagging @guolinke and @shiyu1994 to help respond to you when they have time. Please be patient, as the project's small maintainer team is very focused right now on issues critical to the next release (#5153).

@jameslamb jameslamb reopened this Sep 5, 2022