Support multi-output regression/classification #524
I think it would require rewriting the whole algorithm from scratch for LightGBM, as it was optimized for one specific case. The same applies to xgboost: it would require rewriting the whole algorithm from scratch, which is not possible in the current state unless someone is ready to work on it.
Is there any introductory website or paper about multi-output tasks, other than splitting them into multiple binary/regression tasks?
@wxchan multi-output tasks require an objective that handles multiple outputs. They also require multi-split support for decision trees (multiple cutting points instead of one). I think you can check this as a starting point; it's explained very simply: http://users.math.yale.edu/users/gw289/CpSc-445-545/Slides/CPSC445%20-%20Topic%2005%20-%20Classification%20&%20Decision%20Trees.pdf
@wxchan I believe GBDT can adapt from multi-class to multi-label classification (where the labels aren't mutually exclusive) without too much additional computational cost. In multi-label classification, the target y is an n_samples * n_labels matrix, and each column is a binary vector. Traditionally, at the leaf node of a classification tree, the prediction is generated by averaging the one-hot class probability vectors of the samples belonging to the leaf. Ex. mean([0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 1, 0, 0], ...). When we use it for multi-label classification, the vectors are no longer one-hot. Ex. mean([0, 1, 0, 1], [0, 1, 1, 0], [0, 0, 0, 1], [0, 1, 0, 1], ...). (I know gradient boosting involves more complex maths, but that's the basic idea.) For GBDT, some other modifications may be required.
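A tiny numpy sketch of the averaging idea described above (purely illustrative, not LightGBM internals):

```python
import numpy as np

# One-hot label vectors of samples falling into a leaf (multi-class):
# each row is a sample, each column a class, exactly one 1 per row.
leaf_multiclass = np.array([
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 1, 0, 0],
])
print(leaf_multiclass.mean(axis=0))  # per-class probabilities, sums to 1

# Multi-label vectors: rows may contain several 1s, so the columns are
# averaged independently and the result no longer sums to 1.
leaf_multilabel = np.array([
    [0, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 1, 0, 1],
])
print(leaf_multilabel.mean(axis=0))  # per-label probabilities
```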
@Laurae2 I don't see the necessity of multi-split support. Theoretically, any split of one parent node into more than two child nodes can be equivalently represented by a sequence of binary splits. (Did I understand you correctly?) The class_weight parameter would still be useful: a label with a larger weight would be considered more important during the evaluation of each split. I suppose that, for multi-label classification, an implementation within a single LightGBM model could train more efficiently and consume less memory. Relatively speaking, multi-output regression might not be as useful.
@TennielMiao multi-output classification is doable in xgboost/LightGBM; it is actually what is being done in multiclass problems, just not in an efficient manner. Also, it returns everything, while you might be interested in only a specific number of outputs (especially for classification).
It requires modification of the objective/loss function like you described. For instance, if you were to optimize the F1 or F2 score, you would have to put an optimizer in the metric part which finds the best threshold for each class at each iteration (see the sketch below). For the loss function, you would have to find a proxy which is continuous and a local statistic (unlike the F1/F2 score, which requires discrete inputs over a global statistic).
For proper multi-output classification, if you can have multi-splits instead of binary splits, you need a lower tree depth, which also requires fewer splits. Since the sum of losses from binary splits is only an approximation of the sum of losses from multi-splits (mathematically, if you consider a graph with chained losses), the representation of a multi-split is not always identical to the representation of multiple binary splits (the more splits you make, the higher the odds you end up with something different).
As for the speed, there are two major cases:
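A rough illustration of the per-class threshold search mentioned above, written as a standalone helper (the function name, threshold grid, and random data are assumptions for the sketch, not part of LightGBM):

```python
import numpy as np
from sklearn.metrics import f1_score

def best_f1_per_class(y_true, y_prob, thresholds=np.linspace(0.05, 0.95, 19)):
    """For each class (column), scan a grid of thresholds and keep the one
    that maximizes F1. Returns (per-class thresholds, per-class F1 scores)."""
    n_classes = y_true.shape[1]
    best_thr = np.zeros(n_classes)
    best_f1 = np.zeros(n_classes)
    for c in range(n_classes):
        for thr in thresholds:
            f1 = f1_score(y_true[:, c], (y_prob[:, c] >= thr).astype(int), zero_division=0)
            if f1 > best_f1[c]:
                best_f1[c], best_thr[c] = f1, thr
    return best_thr, best_f1

# y_prob would come from per-class predicted probabilities
# (e.g. one binary model per label); the arrays here are random stand-ins.
rng = np.random.default_rng(0)
y_true = (rng.random((100, 4)) > 0.7).astype(int)
y_prob = rng.random((100, 4))
print(best_f1_per_class(y_true, y_prob))
```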
@TennielMiao The main bottleneck for the implementation of multi-output classification (regression) is memory. It would be highly inefficient, since we need to maintain all the residual values of all samples over all features.
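A back-of-envelope illustration of that memory concern, assuming one gradient and one hessian per sample per output stored as 4-byte floats (the actual internals may differ):

```python
n_samples, n_outputs = 10_000_000, 100
bytes_per_value = 4      # float32
values_per_sample = 2    # gradient + hessian
total_bytes = n_samples * n_outputs * values_per_sample * bytes_per_value
print(f"{total_bytes / 2**30:.1f} GiB")  # ~7.5 GiB just for gradient statistics
```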
Just found a related paper from ICML 2017: "Gradient Boosted Decision Trees for High Dimensional Sparse Output". @huanzhang12 is one of the authors.
That method seems way faster than
@marugari Thanks for posting the link here!
@huanzhang12 Trees for top-k labels have the same
If this discussion is not suitable for this issue, I will send an e-mail.
@marugari Yes, please send me an email if you have specific questions on our paper.
@marugari Sorry for my late reply. I will look into this.
Hi all, I have a question and I am not sure if it is related to this discussion. When I train a multi-class model, from the log I can see that the number of trained trees is usually equal to the number of classes times the number of iterations. Does that mean LightGBM is training something like a one-vs-all classifier? If so, can we not take the output for each class and somehow achieve multi-label classification? Please correct me if I am wrong. Thanks.
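A quick way to confirm this with the Python package (a small sketch on synthetic data; the sizes are only illustrative):

```python
import lightgbm as lgb
import numpy as np

# Synthetic 3-class problem.
rng = np.random.default_rng(42)
X = rng.random((500, 10))
y = rng.integers(0, 3, size=500)

num_class, num_rounds = 3, 20
booster = lgb.train(
    {"objective": "multiclass", "num_class": num_class, "verbose": -1},
    lgb.Dataset(X, label=y),
    num_boost_round=num_rounds,
)

# One tree is grown per class per boosting round, so:
print(booster.num_trees())            # 60 == num_class * num_rounds
print(booster.predict(X[:5]).shape)   # (5, 3): one probability column per class
```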
Multiclass classification in TensorFlow Boosted Trees.
Any update on multi-label classification?
Temporary solution: use the sklearn multi-output wrapper https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputRegressor.html#sklearn.multioutput.MultiOutputRegressor and
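A minimal usage sketch of that wrapper around LGBMRegressor; note that it trains one independent LightGBM model per target, so dependencies between outputs are not modelled:

```python
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.multioutput import MultiOutputRegressor

# Synthetic data with 3 targets.
rng = np.random.default_rng(0)
X = rng.random((1000, 8))
Y = np.column_stack(
    [X @ rng.random(8) + rng.normal(scale=0.1, size=1000) for _ in range(3)]
)

# Fits one LGBMRegressor per output column.
model = MultiOutputRegressor(LGBMRegressor(n_estimators=100))
model.fit(X, Y)
print(model.predict(X[:5]).shape)  # (5, 3)
```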
Closed in favor of #2302. We decided to keep all feature requests in one place. You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing this feature.
There is a new feature in XGBoost that allows modelling of multiple outputs: https://xgboost.readthedocs.io/en/stable/tutorials/multioutput.html Are there any plans to also include this in LightGBM? It would be great, since then I could also implement a multivariate probabilistic framework, similar to Multi-Target XGBoostLSS Regression, that models multiple targets and their dependencies in a probabilistic regression setting.
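For reference, a minimal sketch of the XGBoost multi-output API linked above (assuming a recent xgboost release, roughly 1.6+, where the scikit-learn wrapper accepts a 2-D target):

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.random((1000, 8))
Y = np.column_stack([np.sin(X[:, 0]), np.cos(X[:, 1])])  # two targets per sample

# With a 2-D label matrix, XGBoost fits a multi-output regressor
# (one output per column of Y).
model = xgb.XGBRegressor(tree_method="hist", n_estimators=100)
model.fit(X, Y)
print(model.predict(X[:5]).shape)  # (5, 2)
```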
Thanks!
Also linking this related conversation from XGBoost that you've been contributing to: dmlc/xgboost#2087. And these related LightGBM conversations:
To answer your question directly: I'm not aware of anyone currently working on adding this support to LightGBM. It has been almost a year since LightGBM's last substantive release, so the small team of mostly-volunteer maintainers here is currently focused on trying to get a year of other improvements and bugfixes out in a new major release (#5153).
If you're interested in attempting to add multi-output support to LightGBM, we can try to support you with reviews and advice, but at this point we can't commit to more than that.
@jameslamb Kindly asking if there is an update on this?
Please don't post "is there an update" types of comments here. We'd welcome your help if you'd like to try to contribute this. Otherwise, you can subscribe to feature requests here to be notified of activity on them.
Currently, LightGBM only supports 1-output problems. It would be interesting if LightGBM could support multi-output tasks (multi-output regression, multi-label classification, etc.) like those in multitask lasso.
I've seen a similar request on xgboost, but it hasn't been implemented yet.
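For comparison, a minimal example of the kind of interface this request has in mind, using scikit-learn's MultiTaskLasso, which fits a 2-D target matrix with a single model:

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(0)
X = rng.random((200, 10))
W = rng.random((10, 3))
Y = X @ W + rng.normal(scale=0.05, size=(200, 3))  # three targets per sample

model = MultiTaskLasso(alpha=0.01).fit(X, Y)
print(model.predict(X[:5]).shape)  # (5, 3): all targets predicted jointly
```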