In this post I will share my experience implementing gradient boosted trees as they appear in XGBoost, in order to get a better understanding of the method. As in previous posts, the code won't be as efficient as the original. Furthermore, I will use Scala again.
The basic idea behind boosting is to construct an additive model from a collection of simpler models (base models) for classification or regression. If you are unfamiliar with the general boosting idea, please consider (Hastie: The Elements of Statistical Learning) for a great introduction. This post is mainly based on this blog post
and these slides. In boosting, the model is constructed iteratively, that is, in iteration $t$ the ensemble prediction is the sum of all base models learned so far:

$$\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i)$$
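To make this concrete, here is a minimal Scala sketch of such an additive ensemble. The names `BaseModel` and `Ensemble` are illustrative choices for this post, not part of any library; the base models will later be decision trees.

```scala
// A base model maps a feature vector to a raw score.
trait BaseModel {
  def predict(x: Vector[Double]): Double
}

// The ensemble prediction is simply the sum of all base model outputs.
case class Ensemble(models: List[BaseModel]) {
  def predict(x: Vector[Double]): Double =
    models.map(_.predict(x)).sum

  // One boosting iteration appends the newly learned base model.
  def boost(next: BaseModel): Ensemble = Ensemble(models :+ next)
}
```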
The idea of gradient boosting is that, in each iteration, the next model is determined so that it minimizes a given loss function $L$ for the current ensemble:

$$f_t = \arg\min_{f}\; \sum_{i} L\big(y_i, \hat{y}_i^{(t-1)} + f(x_i)\big)$$
In the next part, I will derive the optimization step used for decision trees with a cross entropy loss and then "massage" in the regularization.
For binary classification, the go-to output function is the sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Prominent usage examples include neural networks and logistic regression.
Given that our output is $\hat{y} = \sigma(x)$, with its first derivative

$$\frac{\partial \sigma(x)}{\partial x} = \sigma(x)\,\big(1 - \sigma(x)\big),$$

the cross entropy loss

$$L(y, \hat{y}) = -\big(y \log \hat{y} + (1 - y) \log(1 - \hat{y})\big)$$

has the first and second order derivatives (gradient and Hessian) with respect to the raw score

$$g = \hat{y} - y, \qquad h = \hat{y}\,(1 - \hat{y}).$$
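In Scala, the output function and the resulting per-example gradient statistics can be sketched as follows; the helper names `sigmoid`, `gradient` and `hessian` are my own, not taken from XGBoost:

```scala
// Sigmoid output for a raw score x.
def sigmoid(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

// First derivative of the cross entropy loss w.r.t. the raw score: g = p - y.
def gradient(score: Double, label: Double): Double = sigmoid(score) - label

// Second derivative of the cross entropy loss w.r.t. the raw score: h = p * (1 - p).
def hessian(score: Double): Double = {
  val p = sigmoid(score)
  p * (1.0 - p)
}
```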
One of the ideas implemented in XGBoost is regularization. Intuitively, we want to penalize large parameters that dominate the result, as well as overall model complexity.
Given a parameter set $\Theta = \{w_1, \dots, w_T\}$, for example the weights of the $T$ leaves of a tree, an example regularization term is

$$\Omega = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2.$$
The combined objective using regularization and some loss function is:

$$\text{obj} = \sum_{i} L\big(y_i, \hat{y}_i\big) + \sum_{k} \Omega(f_k)$$
Since we are boosting, we define the objective in terms of the predictions in iteration $t$:

$$\text{obj}^{(t)} = \sum_{i} L\big(y_i, \hat{y}_i^{(t)}\big) + \Omega(f_t)$$

while

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i).$$
Assuming a decision tree base classifier, we can use the example regularization above and the objective becomes:

$$\text{obj}^{(t)} = \sum_{i} L\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$$
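As a sanity check, this regularized objective can be evaluated directly. The sketch below uses illustrative names (`crossEntropy`, `objective`) and assumes labels in {0, 1} and raw ensemble scores that still need to be passed through the sigmoid:

```scala
// Cross entropy loss for a single example, given the predicted probability p.
def crossEntropy(label: Double, p: Double): Double =
  -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))

// obj = sum_i L(y_i, yhat_i) + gamma * T + 0.5 * lambda * sum_j w_j^2
def objective(labels: Seq[Double],
              scores: Seq[Double],       // raw scores F_{t-1}(x_i) + f_t(x_i)
              leafWeights: Seq[Double],  // leaf weights w_j of the current tree
              gamma: Double,
              lambda: Double): Double = {
  val loss = labels.zip(scores).map { case (y, s) =>
    crossEntropy(y, 1.0 / (1.0 + math.exp(-s)))  // sigmoid applied to the raw score
  }.sum
  val regularization =
    gamma * leafWeights.size + 0.5 * lambda * leafWeights.map(w => w * w).sum
  loss + regularization
}
```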
With a few transformations, we can redefine the loss in a way that lets us directly optimize it with the standard CART learning algorithm.
First we define $I_j = \{\, i \mid q(x_i) = j \,\}$ as the set of instances that fall into leaf $j$, where $q(x)$ maps an input to its leaf, so the tree predicts $f_t(x_i) = w_{q(x_i)}$. With the second order Taylor approximation of the loss, the objective of one iteration can be grouped by leaf:

$$\text{obj}^{(t)} \approx \sum_{j=1}^{T} \left[ \Big( \sum_{i \in I_j} g_i \Big) w_j + \frac{1}{2} \Big( \sum_{i \in I_j} h_i + \lambda \Big) w_j^2 \right] + \gamma T$$

Redefining the sums of the gradients to

$$G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i$$

the objective becomes

$$\text{obj}^{(t)} \approx \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2 \right] + \gamma T.$$
Now we can update a leaf node using the gradient statistics by setting the derivative of the objective with respect to $w_j$ to zero:

$$\frac{\partial \, \text{obj}^{(t)}}{\partial w_j} = G_j + (H_j + \lambda)\, w_j = 0$$

and then

$$w_j^* = -\frac{G_j}{H_j + \lambda}.$$
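In code, the leaf update is just the ratio of the summed gradient statistics; `leafWeight` is an illustrative helper name:

```scala
// Optimal leaf weight from the gradient statistics of the examples in this leaf.
def leafWeight(gradients: Seq[Double], hessians: Seq[Double], lambda: Double): Double = {
  val G = gradients.sum  // G_j: sum of first derivatives g_i in the leaf
  val H = hessians.sum   // H_j: sum of second derivatives h_i in the leaf
  -G / (H + lambda)      // w_j* = -G_j / (H_j + lambda)
}
```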
Intuitively, each leaf node models the gradient step to take for the inputs that fall into it, and these gradients are normalized by their second order statistics. Each leaf prediction therefore pushes the total output of the ensemble further towards the wanted result.
In order to construct a tree, we can use the same gradient statistics to define a gain function and then learn the tree like a normal decision tree. The formula compares the gradient "amplitude" of the left and right child after the split with the amplitude before the split, reduced by the cost $\gamma$ of the additional leaf:

$$\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$
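A corresponding Scala sketch of this gain, with the "amplitude" factored out as a small helper (again, the names `amplitude` and `gain` are my own):

```scala
// "Amplitude" of a node's gradient statistics: G^2 / (H + lambda).
def amplitude(g: Double, h: Double, lambda: Double): Double = (g * g) / (h + lambda)

// Split gain: amplitude of the left and right child minus the amplitude of the
// unsplit node, minus the cost gamma of adding a leaf. Splits with a negative
// gain are not worth keeping.
def gain(gLeft: Double, hLeft: Double,
         gRight: Double, hRight: Double,
         lambda: Double, gamma: Double): Double =
  0.5 * (amplitude(gLeft, hLeft, lambda) +
         amplitude(gRight, hRight, lambda) -
         amplitude(gLeft + gRight, hLeft + hRight, lambda)) - gamma
```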