
XGBoost has trouble modeling multiplication/division #4069

Closed
ledmaster opened this issue Jan 20, 2019 · 8 comments · Fixed by #4233

@ledmaster

Hi

I am using Python 3.6 and XGBoost 0.81. In a simple experiment, I create a matrix X of numbers between -1 and 1 and set Y = X1 * X2 or Y = X1 / X2. XGBoost can't learn the function and just predicts a constant value.

[Figure: multiplication_noise0]

Now, if I add Gaussian noise, it can model the function:

[Figure: multiplication_noise05]

I tried changing the range, tuning hyperparameters and base_score, and using the native xgb.train instead of XGBRegressor, but I couldn't make it learn.

Is this a known issue? Do you know why it happens?

Thanks

@trivialfis
Member

@ledmaster I tried generating the following dataset; is it the right one?

import numpy as np

x = np.random.rand(64, 2)
x = x * 2 - 1.0  # rescale from [0, 1) to [-1, 1)
y_true = x[:, 0] * x[:, 1]

Following the above script:

import xgboost as xgb
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 -- registers the 3d projection

dtrain = xgb.DMatrix(x, label=y_true)

params = {
    'tree_method': 'gpu_hist'
}

bst = xgb.train(params, dtrain, evals=[(dtrain, "train")], num_boost_round=10)

y_pred = bst.predict(dtrain)

X = x[:, 0]
Y = x[:, 1]

# Predicted surface on top, true surface below.
fig = plt.figure(figsize=plt.figaspect(2.))
ax = fig.add_subplot(2, 1, 1, projection='3d')
ax.plot_trisurf(X, Y, y_pred, cmap='viridis')

ax = fig.add_subplot(2, 1, 2, projection='3d')
ax.plot_trisurf(X, Y, y_true, cmap='viridis')

plt.show()

I got:

[Figure: predicted surface (top) vs. true surface (bottom)]

Seems pretty reasonable. Did I generate the wrong dataset?

@ledmaster
Author

ledmaster commented Jan 21, 2019

@trivialfis
This is the code I used:

import numpy as np
import xgboost as xgb
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 -- registers the 3d projection

size = 10000
X = np.zeros((size, 2))

# 100 x 100 grid over [-1, 1] x [-1, 1]
Z = np.meshgrid(np.linspace(-1, 1, 100), np.linspace(-1, 1, 100))

X[:, 0] = Z[0].flatten()
X[:, 1] = Z[1].flatten()

y_mul = X[:, 0] * X[:, 1]
y_div = X[:, 0] / X[:, 1]

# Plot the target surfaces.
ops = [('MULTIPLICATION', y_mul), ('DIVISION', y_div)]
for name, op in ops:
    fig = plt.figure(figsize=(15, 10))
    ax = fig.add_subplot(projection='3d')
    ax.set_title(name)
    ax.plot_trisurf(X[:, 0], X[:, 1], op, cmap=plt.cm.viridis, linewidth=0.2)
    plt.savefig("{}.jpg".format(name))

# Fit a default XGBRegressor on each target and plot its predictions.
for name, op in ops:
    mdl = xgb.XGBRegressor()
    mdl.fit(X, op)

    fig = plt.figure(figsize=(15, 10))
    ax = fig.add_subplot(projection='3d')
    ax.set_title("{} - NOISE = 0".format(name))
    ax.plot_trisurf(X[:, 0], X[:, 1], mdl.predict(X), cmap=plt.cm.viridis, linewidth=0.2)
    plt.savefig("{}_noise0.jpg".format(name))

Figure for the plot of the target itself (not the predictions):

[Figure: MULTIPLICATION target surface]

@trivialfis
Member

trivialfis commented Jan 21, 2019

I give up. Adding small normal noise (np.random.randn()) to the label makes everything work, but without such noise XGBoost just jumps into some sort of local minimum.

noise = np.random.randn(size)
noise = noise / 1000
y = y + noise  # somehow this helps XGBoost leave the local minimum
dtrain = xgb.DMatrix(X, label=y)

I'm not sure this is a bug. XGBoost relies on a greedy algorithm, after all. Would love to hear some other opinions. @RAMitchell

@khotilov
Member

I agree with @trivialfis that it's not a bug. But it's a nice example of data with "perfect symmetry" in an unstable balance. With such a perfect dataset, when the algorithm looks for a split in, say, variable X1, the sums of residuals at each X1 location (taken over X2) are always zero, so it cannot find any split and can only approximate the overall average. Any random disturbance, e.g., some noise or subsample=0.8, helps to kick it off the equilibrium and start learning.
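
A minimal sketch of that subsample suggestion, assuming the 100x100 grid dataset from the script above (subsample=0.8 is the only non-default setting):

import numpy as np
import xgboost as xgb

# Rebuild the symmetric grid from the original script.
g = np.meshgrid(np.linspace(-1, 1, 100), np.linspace(-1, 1, 100))
X = np.stack([g[0].ravel(), g[1].ravel()], axis=1)
y = X[:, 0] * X[:, 1]

# Row subsampling perturbs the per-split residual sums, breaking the
# perfect cancellation that otherwise rules out every candidate split.
mdl = xgb.XGBRegressor(subsample=0.8)
mdl.fit(X, y)

print("prediction std:", mdl.predict(X).std())  # nonzero once splits are found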

@trivialfis
Member

@khotilov That's a very interesting example. I wouldn't have come up with it myself. Maybe we can document it in a tutorial?

@trivialfis trivialfis reopened this Jan 22, 2019
@hcho3
Collaborator

hcho3 commented Jan 22, 2019

@ledmaster
Author

Thanks @khotilov and @trivialfis for the answers and investigation.

As @hcho3 cited above, I wrote an article about GBMs and arithmetic operations, which is how I ended up finding this issue. I added your answer there; feel free to link to it.

@RAMitchell
Member

Decision trees can only look at one feature at a time. In your example, if I take any 1D feature range such as 0.25 < f0 < 0.5 and average all the samples in that slice, I suspect you will always get exactly 0 due to the symmetry of your problem. So XGBoost can't find anything when it looks from the perspective of a single feature.
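
A quick check of that claim, assuming the grid dataset from the original script:

import numpy as np

# Rebuild the symmetric grid from the original script.
g = np.meshgrid(np.linspace(-1, 1, 100), np.linspace(-1, 1, 100))
X = np.stack([g[0].ravel(), g[1].ravel()], axis=1)
y = X[:, 0] * X[:, 1]

# Average the target over all samples falling in a 1D slice of the first feature.
mask = (X[:, 0] > 0.25) & (X[:, 0] < 0.5)
print(y[mask].mean())  # ~0: each x1 in the slice pairs with a symmetric range of x2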

@lock lock bot locked as resolved and limited conversation to collaborators Jun 6, 2019