One Tree doesn't produce mean of labels #4511
Thanks for using LightGBM. We appreciate the effort made to create a reproducible example, but in the future please include your code directly in the issue instead of linking to external, temporary resources such as notebooks running on mybinder. That helps maintainers and other users looking at this issue (including those finding it from search engines in the future) to use your code without relying on that external resource (in this case, your Binder environment) existing. If/when you delete that Binder environment in the future, this issue would be difficult to understand for someone finding it from a search engine. I've copied the code from that notebook here.

Provided sample code:

```python
import lightgbm
import numpy as np
import pandas as pd

def make_class(length, name, threshold, classes):
    a = np.random.rand(length, 1)
    b = np.where(a > threshold, classes[0], classes[1])
    return pd.DataFrame(b, columns=[name])

a = make_class(100000, "State", 0.7, ["NY", "TX"])
b = make_class(100000, "Phone", 0.3, ["Ios", "Android"])
df = pd.concat([a, b], ignore_index=True, axis=1)
df.columns = ["State", "Phone"]
df.head(2)

mapper = {
    ("TX", "Ios"): 0.3,
    ("NY", "Ios"): 0.85,
    ("TX", "Android"): 1.0,
    ("NY", "Android"): 0.0,
}

df["odds"] = df.agg(tuple, axis=1).map(mapper)
df["label"] = np.where(df["odds"] > np.random.rand(len(df["odds"]), 1).flatten(), 1, 0)
df.groupby(["State", "Phone"])["label"].mean()
df.groupby(["State", "Phone"])["label"].agg(["mean", "count"])
df = df.drop("odds", axis=1)

y_train = df["label"].astype("bool")
X_train = df[["State", "Phone"]].copy()
X_train = X_train.astype("category")
train_data = lightgbm.Dataset(X_train, label=y_train, free_raw_data=False)

best_parameters = {
    "application": "binary",
    "metric": "binary_logloss",
    "num_iterations": 1,
    "is_unbalance": "false",
    "boosting": "rf",
    "num_leaves": 16,
    "bagging_freq": 1,
    "bagging_fraction": 0.999,
    "verbose": 5,
    "min_split_gain": 1,  # minimum loss reduction required to make a further partition on a leaf node
    "min_child_samples": 1,  # minimum number of data points needed in a leaf
}

evals_result = {}
model = lightgbm.train(
    best_parameters,
    train_data,
    evals_result=evals_result,
)

X_train["prob"] = model.predict(X_train[["State", "Phone"]])
X_train["label"] = df["label"]
X_train.groupby(["Phone", "State"]).agg(["mean", "count"])
```

I or another maintainer will answer here as soon as we have time to examine that code more closely.
I believe #4118 (comment) can help:
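For context, here is one plausible mechanism (my own hedged sketch, not necessarily what the linked comment says): with `objective=binary`, LightGBM stores leaf values in log-odds space, computing each one roughly as a single Newton step from the init score, and only applies the sigmoid at prediction time. The sigmoid of that Newton step is an approximation of the leaf's label mean, not the mean itself. A simplified numpy illustration, assuming no regularization, a learning rate of 1, and a sigmoid scaling factor of 1:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# boost_from_average init: the raw score starts at the log-odds of the
# overall label rate (here 0.5 for simplicity, so z0 = 0).
overall_mean = 0.5
z0 = np.log(overall_mean / (1 - overall_mean))
p0 = sigmoid(z0)

# Suppose a leaf contains only rows whose true label mean is 0.85,
# like the (NY, Ios) group in the example above.
leaf_label_mean = 0.85

# Simplified Newton step: leaf_value = -sum(grad) / sum(hess), with
# grad_i = p0 - y_i and hess_i = p0 * (1 - p0) for binary log loss.
leaf_value = (leaf_label_mean - p0) / (p0 * (1 - p0))

predicted_prob = sigmoid(z0 + leaf_value)
print(predicted_prob)  # close to, but not equal to, 0.85
```

Under these assumptions the predicted probability lands near 0.80 rather than 0.85: one Newton step in log-odds space does not generally reach the exact logit of the leaf's label mean.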
This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
Hey,
one thing I can't get straight about the original regression problem:
![image](https://user-images.githubusercontent.com/24636519/128779655-b4578b3b-040b-4a38-bc1c-8fad892a20cc.png)
tree visual (a leaf for each combination in the data):
![image](https://user-images.githubusercontent.com/24636519/128779732-64ca6beb-36ed-4037-a8ee-dd03cad82110.png)
actual output:
![image](https://user-images.githubusercontent.com/24636519/128779815-15f95b82-1125-47d4-a4d6-e14852c63a5e.png)
I would expect the label mean to be the same as the prob mean, but that's not the case, and I truly wonder why.
training parameters:
```python
parameters = {
    "application": "binary",
    "metric": "binary_logloss",
    "num_iterations": 1,
    "is_unbalance": "false",
    "boosting": "rf",
    "num_leaves": 16,
    "bagging_freq": 1,
    "bagging_fraction": 0.999,
    "verbose": 5,
    "min_split_gain": 1,
    "min_child_samples": 1,
}
```
Link to Binder:
https://mybinder.org/v2/gh/yechiav/lightgbm_question/9712383e09df6ae39771aba15f9d606dafb96691