One Tree doesn't produce mean of labels #4511

Closed
yechiav opened this issue Aug 9, 2021 · 4 comments

@yechiav

yechiav commented Aug 9, 2021

Hey, one thing I can't get straight:

  1. I have a classification problem with two features.
  2. I created a RandomForest with 1 tree.
  3. The tree branches to all possible feature combinations.
  4. I would expect each leaf's output value to be the mean of the labels in that leaf, but it is not.

[image: original regression problem]

[image: tree visual (one leaf per combination in the data)]

[image: actual output]

I would expect the label mean to be the same as the predicted probability mean, but that is not the case.

I truly wonder why.

training parameters:

parameters = {
    'application': 'binary',
    'metric': 'binary_logloss',
    'num_iterations': 1,
    'is_unbalance': 'false',
    'boosting': 'rf',
    'num_leaves': 16,
    'bagging_freq': 1,
    'bagging_fraction': 0.999,
    'verbose': 5,
    'min_split_gain': 1,
    'min_child_samples': 1,
}
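
A minimal sketch of how the leaf outputs can be inspected (it assumes a Booster trained with these parameters is available as model, and the training frame as X_train):

import numpy as np

# Which leaf each row lands in, and the model's raw (pre-sigmoid) output.
leaf_index = model.predict(X_train, pred_leaf=True)
raw_score = model.predict(X_train, raw_score=True)
prob = model.predict(X_train)

# For the binary objective the reported probability is sigmoid(raw_score),
# not the mean of the labels that fell into the leaf.
assert np.allclose(prob, 1.0 / (1.0 + np.exp(-raw_score)))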

link to binder
https://mybinder.org/v2/gh/yechiav/lightgbm_question/9712383e09df6ae39771aba15f9d606dafb96691

@jameslamb
Collaborator

Thanks for using LightGBM.

We appreciate the effort made to create a reproducible example, but in the future please include your code directly in the issue instead of linking to external, temporary resources such as notebooks running on mybinder. That helps maintainers and other users looking at this issue (including those finding it from search engines in the future) use your code without relying on that external resource (in this case, your Binder environment) continuing to exist. If/when you delete that Binder environment, this issue will be difficult to understand for someone finding it from a search engine.

I've copied the code from that notebook here.

provided sample code:
import lightgbm
import pandas as pd
import numpy as np

def make_class(length, name, threshold, classes):
    # Randomly assign one of two class labels based on a uniform draw.
    a = np.random.rand(length, 1)
    b = np.where(a > threshold, classes[0], classes[1])
    return pd.DataFrame(b, columns=[name])

a = make_class(100000, "State", 0.7, ["NY", "TX"])
b = make_class(100000, "Phone", 0.3, ["Ios", "Android"])

df = pd.concat([a, b], ignore_index=True, axis=1)
df.columns = ["State", "Phone"]

df.head(2)

mapper = {
    ("TX", "Ios"): 0.3,
    ("NY", "Ios"): 0.85,
    ("TX", "Android"): 1.0,
    ("NY", "Android"): 0.0,
}
mapper

df["odds"] = df.agg(tuple, 1).map(mapper)
df["label"] = np.where(df["odds"] > np.random.rand(len(df["odds"]), 1).flatten(), 1, 0)
df.groupby(["State", "Phone"])["label"].mean()
df.groupby(["State", "Phone"])["label"].agg(["mean", "count"])
df = df.drop("odds", axis=1)

y_train = df["label"].astype("bool")
X_train = df[["State", "Phone"]].copy()

X_train = X_train.astype("category")

train_data = lightgbm.Dataset(X_train, label=y_train, free_raw_data=False)

best_parameters = {
    "application": "binary",
    "metric": "binary_logloss",
    "num_iterations": 1,
    "is_unbalance": "false",
    "boosting": "rf",
    "num_leaves": 16,
    "bagging_freq": 1,
    "bagging_fraction": 0.999,
    "verbose": 5,
    "min_split_gain": 1,  # minimum loss reduction required to make further partition on a leaf node of the tree
    "min_child_samples": 1,  # minimum number of data needed in a leaf
}
best_parameters

evals_result = {}
model = lightgbm.train(
    best_parameters,
    train_data,
    evals_result=evals_result,
)

X_train["prob"] = model.predict(X_train[["State", "Phone"]])
X_train["label"] = df["label"]
X_train.groupby(["Phone", "State"]).agg(["mean", "count"])

I or another maintainer will answer here as soon as we have time to examine that code more closely.

@jmoralez
Collaborator

I believe #4118 (comment) can help:

As for probability, currently rf follows the calculation of gbdt. But a better way to calculate the probability for rf should be calculating the class proportions in the leaves and then average them.
But currently trees trained in LightGBM don't have the class proportion information in leaves, since we are actually training regression trees. So to implement this, we need to renew the tree leaves after a tree is trained, to calculate the class proportion for rf.
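
To make the distinction concrete, here is a minimal sketch of the two calculations (the leaf scores and class proportions below are hypothetical numbers, purely for illustration):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# gbdt-style calculation, which rf currently follows: average the raw
# leaf scores across trees, then map the result through the sigmoid.
leaf_raw_scores = np.array([0.4, -0.2, 0.1])         # hypothetical raw outputs
prob_gbdt_style = sigmoid(leaf_raw_scores.mean())

# Classic random-forest style: average the class proportions observed
# in each tree's leaf.
leaf_label_fractions = np.array([0.60, 0.45, 0.52])  # hypothetical proportions
prob_rf_style = leaf_label_fractions.mean()

print(prob_gbdt_style, prob_rf_style)                # generally not equal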

@no-response

no-response bot commented Apr 1, 2022

This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!

@no-response no-response bot closed this as completed Apr 1, 2022
@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023