Tree is not input scale invariant for simple X transformation? #4017
I will try to reproduce it in Python when time allows.
I tested in Python 3.5.2 with xgboost 0.81.
@joegaotao Thanks! Will look into it this weekend. :)
@joegaotao You need to apply the same scaling to the prediction dataset.
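To illustrate the point above, here is a minimal sketch (variable names and parameters are mine, not from the issue): whatever transformation is applied to the training features must also be applied to the features passed to predict().

```python
import numpy as np
import xgboost as xgb

rng = np.random.RandomState(0)
X_train = rng.uniform(0, 1, size=(1000, 5)).astype(np.float32)
y_train = X_train.sum(axis=1)
X_test = rng.uniform(0, 1, size=(100, 5)).astype(np.float32)

# Train on shifted features ...
booster = xgb.train({"max_depth": 3},
                    xgb.DMatrix(X_train - 8, label=y_train),
                    num_boost_round=10)
# ... so the test features must be shifted the same way before predicting.
preds = booster.predict(xgb.DMatrix(X_test - 8))
```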
@joegaotao
@trivialfis Sorry, I made a mistake. I did some tests again.
@joegaotao Beats me. I tested with multiple configurations, including different numbers of rows and different transformations. The situation with multiplication is much better than with addition, and the problem on the GPU side seems even worse. My guess is the usual funny floating point issue, but I will look more closely. Debugging an issue that only occurs with more than 10000 lines of data is quite messy...
@trivialfis I also wondered whether it's a floating-point issue in the boosting iterations, because when setting a smaller max_depth and nrounds it's hard to reproduce the difference. But I still feel it's weird that
While theoretically there must be invariance, the scaling is not precise because of finite float precision. As for the shifts, keep in mind that the number of unique representable float numbers within a unit interval decreases significantly when shifting away from [0, 1). E.g., few float numbers survive the following roundtrip:

```python
In [51]: np.random.seed(111)
    ...: X = np.random.uniform(0, 1, 1000000).astype('float32')
    ...: X_10 = X - 10
    ...: X_ = X_10 + 10
    ...: (X == X_).sum()
Out[51]: 41725
```

With a sizeable data sample and such deep trees with hundreds of splits, the chances of hitting some of these imprecision points after the transformation in one or a few splits get higher. While I'm not ruling out potential causes within xgboost that might also contribute to what we see here, it very much looks to me like a floating point issue.
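To make the granularity argument concrete, here is a small illustration (my own, not from the thread): np.spacing gives the gap to the next representable float, and for float32 that gap is about 16x larger near 10 than near 0.5, so values shifted out of [0, 1) have far fewer representable points per unit interval.

```python
import numpy as np

# Gap to the next representable float32 value at two magnitudes.
print(np.spacing(np.float32(0.5)))    # ~5.96e-08 (2**-24)
print(np.spacing(np.float32(10.5)))   # ~9.54e-07 (2**-20), ~16x coarser
```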
@khotilov I think maybe you are right: float32 can slightly change the ordering because of its limited precision, especially over small intervals.
@joegaotao I looked into gpu_hist and gpu_exact, and I still think floating point is the culprit here.
Theoretically, a tree is invariant under a simple transformation of X such as a * X - b. However, I did some simple tests and was surprised to find that different xgboost versions show different odd behavior: a simple transformation leads to different results. Here is the R code:

- xgboost 0.71.2: change X to X - 8
- R sessionInfo()
- xgboost from master compilation, 0.81.0.1: change X - 8 to X - 1 or X / 10
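Since the original R code is not reproduced here, below is a rough Python sketch of the kind of invariance check being described. It is my own illustration, not the issue author's script; the synthetic data, parameters, and objective name are assumptions. The same transformation is applied to the data used for prediction, as noted earlier in the thread.

```python
import numpy as np
import xgboost as xgb

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(100000, 10)).astype(np.float32)
y = 3 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=X.shape[0])

params = {"max_depth": 10, "eta": 0.3, "objective": "reg:squarederror"}

def fit_predict(features):
    # Train and predict on the same (possibly transformed) feature matrix,
    # so the transformation is applied consistently.
    dtrain = xgb.DMatrix(features, label=y)
    booster = xgb.train(params, dtrain, num_boost_round=50)
    return booster.predict(dtrain)

pred_raw = fit_predict(X)
pred_shift = fit_predict(X - 8)    # shift every feature by a constant
pred_scale = fit_predict(X / 10)   # rescale every feature

# In exact arithmetic the learned trees (and hence predictions) would be
# identical; with float32 inputs the chosen split points can differ.
print(np.abs(pred_raw - pred_shift).max())
print(np.abs(pred_raw - pred_scale).max())
```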