One Tree doesn't produce mean of labels #4511

Closed
yechiav opened this issue Aug 9, 2021 · 4 comments

@yechiav

yechiav commented Aug 9, 2021

Hey, one thing I can't get straight:

  1. I have a classification problem with two features.
  2. I created a RandomForest with 1 tree.
  3. The tree branches to all possible feature combinations.
  4. I would expect each leaf's output value to be the mean of the labels in that leaf, but it is not.

[image: original regression problem]

[image: tree visual (one leaf per combination in the data)]

[image: actual output]

I would expect the label mean to be the same as the predicted probability mean, but that is not the case.

I truly wonder why.

training parameters:

parameters = {
    'application': 'binary',
    'metric': 'binary_logloss',
    'num_iterations': 1,
    'is_unbalance': 'false',
    'boosting': 'rf',
    'num_leaves': 16,
    'bagging_freq': 1,
    'bagging_fraction': 0.999,
    'verbose': 5,
    'min_split_gain': 1,
    'min_child_samples': 1,
}
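
A minimal sketch of how the leaf outputs can be inspected (it assumes a Booster trained with these parameters is available as model, and the training frame as X_train):

import numpy as np

# Which leaf each row lands in, and the model's raw (pre-sigmoid) output.
leaf_index = model.predict(X_train, pred_leaf=True)
raw_score = model.predict(X_train, raw_score=True)
prob = model.predict(X_train)

# For the binary objective the reported probability is sigmoid(raw_score),
# not the mean of the labels that fell into the leaf.
assert np.allclose(prob, 1.0 / (1.0 + np.exp(-raw_score)))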

link to binder
https://mybinder.org/v2/gh/yechiav/lightgbm_question/9712383e09df6ae39771aba15f9d606dafb96691

@jameslamb
Collaborator

Thanks for using LightGBM.

We appreciate the effort made to create a reproducible example, but in the future please include your code directly in the issue instead of linking to external, temporary resources such as notebooks running on mybinder. That helps maintainers and other users looking at this issue (including those finding it from search engines in the future) use your code without relying on that external resource (in this case, your Binder environment) continuing to exist. If/when you delete that Binder environment, this issue will be difficult to understand for someone finding it from a search engine.

I've copied the code from that notebook here.

provided sample code:
import lightgbm
import pandas as pd
import numpy as np

def make_class(length, name, threshold, classes):
    # Randomly assign one of two class labels based on a uniform draw.
    a = np.random.rand(length, 1)
    b = np.where(a > threshold, classes[0], classes[1])
    return pd.DataFrame(b, columns=[name])

a = make_class(100000, "State", 0.7, ["NY", "TX"])
b = make_class(100000, "Phone", 0.3, ["Ios", "Android"])

df = pd.concat([a, b], ignore_index=True, axis=1)
df.columns = ["State", "Phone"]

df.head(2)

mapper = {
    ("TX", "Ios"): 0.3,
    ("NY", "Ios"): 0.85,
    ("TX", "Android"): 1.0,
    ("NY", "Android"): 0.0,
}
mapper

df["odds"] = df.agg(tuple, 1).map(mapper)
df["label"] = np.where(df["odds"] > np.random.rand(len(df["odds"]), 1).flatten(), 1, 0)
df.groupby(["State", "Phone"])["label"].mean()
df.groupby(["State", "Phone"])["label"].agg(["mean", "count"])
df = df.drop("odds", axis=1)

y_train = df["label"].astype("bool")
X_train = df[["State", "Phone"]].copy()

X_train = X_train.astype("category")

train_data = lightgbm.Dataset(X_train, label=y_train, free_raw_data=False)

best_parameters = {
    "application": "binary",
    "metric": "binary_logloss",
    "num_iterations": 1,
    "is_unbalance": "false",
    "boosting": "rf",
    "num_leaves": 16,
    "bagging_freq": 1,
    "bagging_fraction": 0.999,
    "verbose": 5,
    "min_split_gain": 1,  # minimum loss reduction required to make further partition on a leaf node of the tree
    "min_child_samples": 1,  # minimum number of data needed in a leaf
}
best_parameters

evals_result = {}
model = lightgbm.train(
    best_parameters,
    train_data,
    evals_result=evals_result,
)

X_train["prob"] = model.predict(X_train[["State", "Phone"]])
X_train["label"] = df["label"]
X_train.groupby(["Phone", "State"]).agg(["mean", "count"])

I or another maintainer will answer here as soon as we have time to examine that code more closely.

@jmoralez
Collaborator

I believe #4118 (comment) can help:

As for probability, currently rf follows the calculation of gbdt. But a better way to calculate the probability for rf should be calculating the class proportions in the leaves and then average them.
But currently trees trained in LightGBM don't have the class proportion information in leaves, since we are actually training regression trees. So to implement this, we need to renew the tree leaves after a tree is trained, to calculate the class proportion for rf.
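
To make the distinction concrete, here is a minimal sketch of the two calculations (the leaf scores and class proportions below are hypothetical numbers, purely for illustration):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# gbdt-style calculation, which rf currently follows: average the raw
# leaf scores across trees, then map the result through the sigmoid.
leaf_raw_scores = np.array([0.4, -0.2, 0.1])         # hypothetical raw outputs
prob_gbdt_style = sigmoid(leaf_raw_scores.mean())

# Classic random-forest style: average the class proportions observed
# in each tree's leaf.
leaf_label_fractions = np.array([0.60, 0.45, 0.52])  # hypothetical proportions
prob_rf_style = leaf_label_fractions.mean()

print(prob_gbdt_style, prob_rf_style)                # generally not equal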

@no-response

no-response bot commented Apr 1, 2022

This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!

@no-response no-response bot closed this as completed Apr 1, 2022
@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023