
Numerical Instability with histogram methods for training on large data sets #4204

Closed · jeffdk opened this issue Mar 4, 2019 · 15 comments

@jeffdk

jeffdk commented Mar 4, 2019

TLDR and ask
I've got a tricky issue that I only see when training on data sets above a certain size. I'm currently unable to provide a clean repro because I only encounter the problem when my data exceeds a certain threshold (around 10 billion entries in the feature matrix). I'm asking for help on how to debug this further; for example, since I don't know the code nearly as well as a contributor, perhaps you could suggest a watchpoint I could set to break on and gather more data about the program state at the onset of the instability.

Symptoms:
See my comment here for an example: #4095 (comment)

  • After a few iterations training "blows up": logloss and error increase greatly, then remain at a constant value while the model "trains" through the remaining rounds (this proceeds abnormally quickly)
  • Leaves of resulting model have absurdly large values e.g. in the first tree: 7:leaf=43151.1016 8:leaf=-0.147368416
    (Though it is certainly unrelated, the behavior feels similar to what was seen in XGBoost Regression Accuracies Worst with gpu_hist, Slower with gpu_exact #2754)

Data details

  • My data set contains 11,068,163 rows and 21,692 columns. I've subsetted the rows of the set for tests as I scale training to larger and larger multi-GPU workloads.
  • I've found that training on a subset of 5,750,000 samples works fine (no instability); however, once I increase to 6,000,000 rows, training blows up every time (see the sketch below).
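
A minimal sketch of that row-subsetting experiment (the file name, subset sizes, and the DMatrix.slice approach are placeholders of mine, not the exact workflow):

import xgboost as xgb

full = xgb.DMatrix("train.libsvm")  # placeholder path for the libsvm training file
params = {"n_gpus": -1, "num_class": 3, "eta": 0.1, "max_bin": 31,
          "objective": "multi:softprob", "tree_method": "gpu_hist",
          "max_depth": 10, "eval_metric": "mlogloss"}

for n_rows in (5_750_000, 6_000_000):
    sub = full.slice(list(range(n_rows)))  # take the first n_rows rows
    evals_result = {}
    xgb.train(params, sub, num_boost_round=20,
              evals=[(sub, "dtrain")], evals_result=evals_result)
    # watch the tail of the training mlogloss for the blow-up described above
    print(n_rows, evals_result["dtrain"]["mlogloss"][-3:])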

Experiments tried

  • I've tried training on a different set of data at the 6,000,000-sample scale and see the same behavior (i.e. the problem is not caused by one bad data set)
  • Changing tree depth or learning rate does not fix the instability.
  • Setting alpha/lambda regularization does not fix the instability, though it does reduce the magnitude of the training log-loss.
  • Increasing the number of hist bins does not fix the instability either, though it likewise reduces the magnitude of the training log-loss.
  • I have not used evaluation sets for these runs. I've tried using both CPU and GPU predictor settings (though I primarily use CPU predictor to save GPU memory)
  • I find that I can train on data bigger than 6M samples if I set a value for max_delta_step, however learning is significantly impaired. Setting max_delta_step really just caps the potential impact of the instability, and it is still impossible to learn a model competitive with what I've seen when training on the same data set with the single-node CPU exact method.

Environment and build info
Sample hyperparameters (I've tried many experiments but all are generally similar to these):

    "n_gpus": -1,
    "num_class": 3,
    "nthread": "96",
    "eta": 0.1,
    "max_bin": 31,
    "objective": "multi:softprob",
    "num_boost_round": 100,
    "tree_method": "gpu_hist",
    "max_depth": 10,
    "predictor": "cpu_predictor"
  • Ubuntu 16.04 based (in a container using the nvidia runtime). CUDA and NCCL were installed via these deb packages from nvidia’s repository: cuda-toolkit-10-0 libnccl2=2.4.2-1+cuda10.0 libnccl-dev=2.4.2-1+cuda10.0
  • I've been running a build very close to current master: (489e499) from the multi-node multi-gpu PR.
  • I've been running on 8 GPUs on an AWS p3dn.24xlarge instance. A smaller number of GPUs is unable to fit my data in memory.
  • For my build I've confirmed that the testxgboost suite and other tests for tree updaters with GPU pass.

Please let me know what other information I can provide that would be useful; any help is appreciated! Thanks all!
-Jeff

@RAMitchell
Member

Seems like it could be an overflow issue? Can you please post the text dump of the first couple of trees using some small depth?
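
(For reference, a text dump of the first few trees can be produced along these lines; the model file name below is just a placeholder:)

import xgboost as xgb

bst = xgb.Booster(model_file="model.bin")  # placeholder model file
for i, tree in enumerate(bst.get_dump(with_stats=True)[:4]):
    print("booster[%d]:\n%s" % (i, tree))  # first couple of trees, with stats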

@jeffdk
Author

jeffdk commented Mar 4, 2019

The smallest depth I have on hand right now is 5. This model's instability was less pronounced but you can see a leaf weight of over 20 present early on (search the file for "leaf=2" from the start):

12th tree from the top (for a 3-class model, this is the 4th round): bzktf.model.txt

				28:[f19939<0.000236599997] yes=57,no=58,missing=58
					57:leaf=0.100771725
					58:leaf=20.1690426

@mtjrider
Contributor

mtjrider commented Mar 5, 2019

Hey @jeffdk

I’m wondering if you observed this with other objective functions, as well. Did you try reg:linear?

How many workers do you launch? I’m guessing just one from the parameters.

Have you tried not setting the predictor to CPU?

I believe setting nthread is also unnecessary as the code will use the system maximum if not set.

@jeffdk
Author

jeffdk commented Mar 5, 2019

Hey @mt-jones thanks for checking in,

  • I have not tried reg:linear, but the example in the multi-node multi-GPU PR thread (#4095 (comment)) shows I get the issue with binary:logistic in addition to multi:softprob.
  • This is just one worker
  • I have tried setting the predictor to GPU (that is, leaving it unset) and see the same problem.
  • Good to know on nthread! I'm always just setting it to the number of system cores anyway.

@mtjrider
Contributor

mtjrider commented Mar 5, 2019

It would be helpful to know if the issue persists with all objective functions, etc. Could you try reg:linear and let me know?

It would also be helpful if we had a small reproducer. Is there a way for you to mock up some kind of data?

@jeffdk
Author

jeffdk commented Mar 5, 2019

@RAMitchell

I have a repro of a run using a depth of 2 where you can see extremely large leaf values in the first few trees (magnitude > 100):

booster[0]:
0:[f8023<24.9200001] yes=1,no=2,missing=1,gain=187926.812,cover=2666666.75
	1:[f3945<1.5] yes=3,no=4,missing=3,gain=53575.875,cover=2511679.75
		3:leaf=-0.675734818,cover=0
		4:leaf=0.620877981,cover=0
	2:[f879<-9.99999975e-06] yes=5,no=6,missing=5,gain=94983.2188,cover=154987.109
		5:leaf=-0.65048486,cover=0
		6:leaf=1.01979089,cover=0
booster[1]:
0:[f12895<0.770099998] yes=1,no=2,missing=1,gain=917343.75,cover=2666666.75
	1:[f10619<18] yes=3,no=4,missing=4,gain=198044.25,cover=2339213
		3:leaf=0.195236623,cover=0
		4:leaf=1.37889802,cover=0
	2:[f19999<68.5] yes=5,no=6,missing=5,gain=234804.484,cover=327453.75
		5:leaf=-318.269745,cover=0
		6:leaf=-0.482374966,cover=0
booster[2]:
0:[f8023<25.2299995] yes=1,no=2,missing=2,gain=492543.938,cover=2666666.75
	1:[f8327<-9.99999975e-06] yes=3,no=4,missing=3,gain=51327.4375,cover=198128.438
		3:leaf=1.21191311,cover=0
		4:leaf=-0.374824882,cover=0
	2:[f10619<2] yes=5,no=6,missing=5,gain=207495.062,cover=2468538.25
		5:leaf=-0.698607564,cover=0
		6:leaf=0.354673982,cover=0
booster[3]:
0:[f879<-9.99999975e-06] yes=1,no=2,missing=1,gain=97287.6719,cover=1241213.88
	1:[f2815<0.0188999996] yes=3,no=4,missing=3,gain=32373.4688,cover=1021803.25
		3:leaf=-122.004761,cover=0
		4:leaf=-0.437717617,cover=0
	2:[f19990<6.9829998] yes=5,no=6,missing=6,gain=63918.6016,cover=219410.609
		5:leaf=2.32276249,cover=0
		6:leaf=0.152193889,cover=0

...
booster[105]:
0:leaf=-nan,cover=232730.797
booster[106]:
0:leaf=-nan,cover=337007.531
booster[107]:
0:leaf=-nan,cover=236074.344
booster[108]:
0:leaf=0,cover=5.99999994e-10
...

Full dump here: vqqhu.model.txt
Edit: Added stats vqqhu.model.wstats.txt

Hyperparameters used for this run:

 {
    "n_gpus": -1,
    "num_class": 3,
    "nthread": "96",
    "eta": 1.0,
    "max_bin": 31,
    "objective": "multi:softprob",
    "num_boost_round": 100,
    "tree_method": "gpu_hist",
    "max_depth": 2
  }

@jeffdk
Author

jeffdk commented Mar 6, 2019

@mt-jones

I also see the issue with the reg:linear objective (ran on the same data where labels are [0,1,2]).

booster[0]:
0:[f8023<24.9200001] yes=1,no=2,missing=2,gain=208415.25,cover=6000000
	1:[f12917<-9.99999975e-06] yes=3,no=4,missing=3,gain=43086.1875,cover=404180
		3:leaf=1.32514524,cover=0
		4:leaf=-0.00346823013,cover=0
	2:[f7703<-9.99999975e-06] yes=5,no=6,missing=5,gain=85603.375,cover=5595820
		5:leaf=0.520056725,cover=0
		6:leaf=-0.134979904,cover=0
booster[1]:
0:[f10619<4] yes=1,no=2,missing=1,gain=69084.2578,cover=6000000
	1:[f19981<-9.99999975e-06] yes=3,no=4,missing=3,gain=62528.5664,cover=5490956
		3:leaf=-176.849716,cover=0
		4:leaf=-0.0325979777,cover=0
	2:[f3<-9.99999975e-06] yes=5,no=6,missing=5,gain=40972.1836,cover=509044
		5:leaf=117.21711,cover=0
		6:leaf=0.351738989,cover=0
booster[2]:
0:[f19981<-9.99999975e-06] yes=1,no=2,missing=1,gain=69850,cover=6000000
	1:[f84<-9.99999975e-06] yes=3,no=4,missing=3,gain=75682.5234,cover=3
		3:leaf=215.036545,cover=0
		4:leaf=-58.2910652,cover=0
	2:[f3<-9.99999975e-06] yes=5,no=6,missing=5,gain=198506.359,cover=5999997
		5:leaf=134.321472,cover=0
		6:leaf=-0.0141801583,cover=0
...
booster[7]:
0:[f13606<0.00220249989] yes=1,no=2,missing=2,gain=1427965.62,cover=6000000
	1:leaf=-844.975464,cover=0
	2:[f13606<0.00220249989] yes=3,no=4,missing=4,gain=1427966.62,cover=5999999
		3:leaf=-844.975464,cover=0
		4:leaf=-6.83024211e-07,cover=0
... <Remainder of trees nearly identical from here out> ...

Full dump: zgdjw.model.txt
Edit: Added stats zgdjw.model.wstats.txt

Hyperparameters:

{
    "n_gpus": -1,
    "nthread": "96",
    "eta": 1.0,
    "max_bin": 31,
    "objective": "reg:linear",
    "num_boost_round": 100,
    "tree_method": "gpu_hist",
    "max_depth": 2
  }

It would also be helpful if we had a small reproducer. Is there a way for you to mock up some kind of data?

I took a stab at mocking up some data, trying the simplest thing first: random values in a dense NumPy array loaded into a DMatrix. However, this runs out of memory on the p3dn.24xlarge node (~750 GB of RAM). I'm sure it's possible with some finesse or workarounds (e.g. writing a libsvm-format file to disk and reading it back in, as I currently read directly from a libsvm file), and I'll keep working at it. I agree a small repro is king, but I thought I'd get the ball rolling in case contributors had any other ideas. :) tyty
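
One possible way around the dense-array OOM, sketched under my own assumptions (shape, density, label range, and file names are placeholders): build the random data as a sparse matrix directly and stream it to a libsvm file, then read it back the same way the real data is read.

import numpy as np
import scipy.sparse as sp
import xgboost as xgb
from sklearn.datasets import dump_svmlight_file

rng = np.random.RandomState(0)
n_rows, n_cols, density = 6_000_000, 21_692, 1e-3   # placeholder shape and sparsity
X = sp.random(n_rows, n_cols, density=density, format="csr", random_state=rng)
y = rng.randint(0, 3, size=n_rows)                  # labels in {0, 1, 2}

dump_svmlight_file(X, y, "mock.libsvm")             # never materializes a dense array
dtrain = xgb.DMatrix("mock.libsvm")                 # read back from disk as usual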

@RAMitchell
Member

Thanks @jeffdk. When you dump the tree, can you add the option 'with_stats' for a bit more info? It seems like you are using a lot of features. Does the problem occur if you reduce the number of features?

I have two theories:

  • Integer overflow in an index variable somewhere. Generally we try to use 64-bit for indices, but it's possible something is wrong at this very large size.
  • Instability in double-precision summation at large sizes, causing gain calculations to be incorrect. This can happen when the left sum, right sum, and missing sum do not quite add up to the sum of gradients for a split (a small demonstration follows below).
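
A tiny illustration of the second theory (my own example, in float32 so the effect is easy to see; the histogram sums here are double precision, where the same mechanism is far weaker): summing identical gradients in different groupings need not give bit-identical results, so left + right can drift away from the parent total.

import numpy as np

rng = np.random.RandomState(0)
grads = rng.normal(size=10_000_000).astype(np.float32)

parent = grads.sum(dtype=np.float32)                  # one summation order
mask = rng.rand(grads.size) < 0.5
left = grads[mask].sum(dtype=np.float32)              # the same values, grouped differently
right = grads[~mask].sum(dtype=np.float32)

print(parent, left + right, parent - (left + right))  # usually a small nonzero gap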

@jeffdk
Author

jeffdk commented Mar 8, 2019

@RAMitchell
I've edited my above comments to show the splitting statistics. I also did a trial where I truncated the feature space at 1/2 the features, though I increased the number of rows to have a comparable number of entries in the feature matrix. For this case, the instability builds less quickly, but it is still present; see less_feats.model.txt

@mt-jones
I did investigate mocking data using random values in a sparse matrix with columns, rows, and sparsity comparable to the data set on which I see the instability. Unfortunately, I did not see the issue there. My theory is that if a mistake is made due to numerical issues and an incorrect split/weight is computed, it doesn't matter on random data because there is nothing to learn. I could work on some 'learnable' mock data; that will take some more time.

@RAMitchell
I don't think we can rule out either an integer overflow (one guess: something is off in the indexing of the list of potential splits by gain) or a classic double-precision summation instability. I did explicitly check your theory about the left/right gradient/hessian sums not adding up to the parent sums by debugging in gdb; that doesn't appear to be the case. Here are examples from the model I posted above (less_feats.model.txt) where the statistics do add up:

Thread 1 "python" hit Breakpoint 3, xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal<double> >::ApplySplit (this=0x227f3c0, candidate=..., p_tree=0x62c0b10) at /home/dsci/xgboost/src/tree/updater_gpu_hist.cu:1231candidate.nid, candidate.split.findex,
1: left_stats = {sum_grad = -425.618408203125, sum_hess = 5.188568115234375}
2: left_weight = 68.774940490722656
3: right_stats = {sum_grad = 25.152973175048828, sum_hess = 59.612892150878906}
4: right_weight = -0.41497728228569031
5: parent_sum = {sum_grad = -400.46543502807617, sum_hess = 64.801460266113281}
6: base_weight = 6.085965633392334
(gdb)
Continuing.

Thread 1 "python" hit Breakpoint 3, xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal<double> >::ApplySplit (this=0x227f3c0, candidate=..., p_tree=0x62c0d50) at /home/dsci/xgboost/src/tree/updater_gpu_hist.cu:1231candidate.nid, candidate.split.findex,
1: left_stats = {sum_grad = -2169.78759765625, sum_hess = 452151.53125}
2: left_weight = 0.0047987955622375011
3: right_stats = {sum_grad = 2455.43359375, sum_hess = 55.732139587402344}
4: right_weight = -43.281173706054688
5: parent_sum = {sum_grad = 285.64599609375, sum_hess = 452207.2633895874}
6: base_weight = -0.00063166912877932191
(gdb)
Continuing.

Thread 1 "python" hit Breakpoint 3, xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal<double> >::ApplySplit (this=0x227f3c0, candidate=..., p_tree=0x62c0d50) at /home/dsci/xgboost/src/tree/updater_gpu_hist.cu:1231candidate.nid, candidate.split.findex,
1: left_stats = {sum_grad = 4986.55810546875, sum_hess = 14944.9482421875}
2: left_weight = -0.3336394727230072
3: right_stats = {sum_grad = -7156.345703125, sum_hess = 437206.59375}
4: right_weight = 0.016368301585316658
5: parent_sum = {sum_grad = -2169.78759765625, sum_hess = 452151.5419921875}
Continuing.

Thread 1 "python" hit Breakpoint 3, xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal<double> >::ApplySplit (this=0x227f3c0, candidate=..., p_tree=0x62c0d50) at /home/dsci/xgboost/src/tree/updater_gpu_hist.cu:1231
1231        tree.ExpandNode(candidate.nid, candidate.split.findex,
1: left_stats = {sum_grad = 2407.80615234375, sum_hess = 32.859725952148438}
2: left_weight = -71.111213684082031
3: right_stats = {sum_grad = 47.62744140625, sum_hess = 22.872413635253906}
4: right_weight = -1.9950827360153198
5: parent_sum = {sum_grad = 2455.43359375, sum_hess = 55.732139587402344}
6: base_weight = -43.281173706054688

I also have an example from debugging (original 21k feature set) where I see the instability very quickly, in the second tree; perhaps you can pull a clue out of the state I've captured here?

Thread 1 "python" hit Breakpoint 4, xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal<double> >::ApplySplit (this=0x393e2f0, candidate=..., p_tree=0x831430f0) at /home/dsci/xgboost/src/tree/updater_gpu_hist.cu:1231andidate.nid, candidate.split.findex,
1: left_weight = -122.0047607421875
2: right_weight = -0.43771761655807495
(gdb) p candidate
$17 = (const xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal<double> >::ExpandEntry &) @0x7fffffffcbe0: {nid = 1, depth = 1, split = {loss_chg = 32373.4688, dir = xgboost::tree::kLeftDir, fvalue = 0.0188999996, findex = 2815, left_sum = {grad_ = 267.262177, hess_ = 1.19058812}, right_sum = {grad_ = 447261.188, hess_ = 1021802.06}}, timestamp = 1}
(gdb) p p_tree
$18 = (xgboost::RegTree *) 0x831430f0
(gdb) p *p_tree
$19 = {param = {<dmlc::Parameter<xgboost::TreeParam>> = {<No data fields>}, num_roots = 1, num_nodes = 3, num_deleted = 0, max_depth = 0, num_feature = 21692, size_leaf_vector = 0, reserved = {0 <repeats 31 times>}}, es_ = std::vector of length 3, capacity 4 = {{parent_ = -1, cleft_ = 1, cright_ = 2, sindex_ = 2147484527, info_ = {leaf_value = -9.99999975e-06, split_cond = -9.99999975e-06}}, {parent_ = -2147483648, cleft_ = -1, cright_ = 0, sindex_ = 0, 978655, split_cond = -0.437978655}}, {parent_ = 0, cleft_ = -1, cright_ = 0, sindex_ = 0, info_ = {leaf_value = 0.29592514, split_cond = 0.29592514}}}, deleted_nodes_ = std::vector of length 0, capacity 0, ngth 3, capacity 4 = {{loss_chg = 97287.6719, sum_hess = 1241213.88, base_weight = -0.308245599, leaf_child_cnt = 0}, {loss_chg = 0, sum_hess = 0, base_weight = 0, leaf_child_cnt = 0}, {loss_chg = 0, sum_hess = 0, ild_cnt = 0}}, node_mean_values_ = std::vector of length 0, capacity 0}

The breakpoint is here. Looks like some of the indexes in the data structure I printed are close to min/max int32.
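
As a quick cross-check on the numbers printed above, the left/right statistics from the second breakpoint really do reproduce the parent sum (values copied verbatim from the gdb session):

left = {"sum_grad": -2169.78759765625, "sum_hess": 452151.53125}
right = {"sum_grad": 2455.43359375, "sum_hess": 55.732139587402344}
parent = {"sum_grad": 285.64599609375, "sum_hess": 452207.2633895874}

for k in ("sum_grad", "sum_hess"):
    print(k, left[k] + right[k] - parent[k])  # both differences come out to zero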

@RAMitchell
Member

I can't see anything definitive from these dumps. The indices that are close to int32 in this case are not a problem.

We can definitely see the gradients increasing over boosting iterations. This should basically never happen in gradient boosting. The problem will be pinpointing exactly where the incorrect calculation is occurring. Using the 'reg:linear' objective helps because the hessian is always 1.0 and therefore we can see how many training instances are in each node from 'cover' in the text dumps.

One thing I notice is that in some of the trees the child nodes have higher gain than the root node. This almost never happens in gradient boosting because we tend to greedily select the best splits early in the tree.

booster[2]:
0:[f19981<-9.99999975e-06] yes=1,no=2,missing=1,gain=69850,cover=6000000
	1:[f84<-9.99999975e-06] yes=3,no=4,missing=3,gain=75682.5234,cover=3
		3:leaf=215.036545,cover=0
		4:leaf=-58.2910652,cover=0
	2:[f3<-9.99999975e-06] yes=5,no=6,missing=5,gain=198506.359,cover=5999997
		5:leaf=134.321472,cover=0
		6:leaf=-0.0141801583,cover=0

Whatever statistics were used to calculate the split of node 2 in the above were almost certainly incorrect. It's also very unusual to split with only three instances on the left-hand side. You might be able to find more inconsistencies by drilling down into this particular example.

@jeffdk
Author

jeffdk commented Mar 9, 2019

Thanks for taking a look and giving me some direction, @RAMitchell!

I've dug into a very similar example (details below), but I'm at a loss as to where to point the debugger next; do you have a suggestion?
I'm also open to any other thoughts on tools or routines you generally use to track down these sorts of issues (additional compiler settings, xgboost logging modes, or just links to guides you find helpful). And let me know if at any point you'd rather connect over a higher-bandwidth channel, if that would take less of your time than back-and-forth here (regardless, I'm extremely grateful for all your help).

Script & Parameters

import sys
import xgboost as xgb

dtrain = xgb.DMatrix(sys.argv[1])

params =   {
    "max_depth": 2,
    "eta": 1.0,
    "verbosity": 3,
    "objective": "reg:linear",
    "tree_method": "gpu_hist",
    "n_gpus": 8,
    "max_bin": 31,
    "eval_metric": ['mae', 'rmse']
}
eval_result = {}

bst = xgb.train(params, dtrain, num_boost_round=100, evals_result=eval_result, evals=[(dtrain, 'dtrain')])

print(eval_result)

bst.save_model("repro.model")
bst.dump_model("repro.model.txt", with_stats=True)

I ran with reg:linear as you suggested (though on my multiclass labeled dataset in [0, 1, 2]).

Debugging state
The first tree seemed to build fine, but the second one was immediately unstable; here is some debugging output (second and third tree); more detailed output here:
debug_info_reg-linear.txt

[0]	dtrain-mae:0.148228	dtrain-rmse:0.35644

2: parent_sum = {sum_grad = -54.15625, sum_hess = 6000000}
3: base_weight = 9.0260400611441582e-06
4: left_stats = {sum_grad = 179347.765625, sum_hess = 5490956}
5: left_weight = -0.032662387937307358
6: right_stats = {sum_grad = -179401.921875, sum_hess = 509044}
7: right_weight = 0.35242840647697449
8: candidate = (const xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal<double> >::ExpandEntry &) @0x7fffffffcbb0: {nid = 0,
  depth = 0, split = {loss_chg = 69084.2578, dir = xgboost::tree::kLeftDir, fvalue = 4, findex = 10619, left_sum = {grad_ = 179347.766, hess_ = 5490956},
    right_sum = {grad_ = -179401.922, hess_ = 509044}}, timestamp = 0}
(gdb)
(gdb) c
Continuing.
2: parent_sum = {sum_grad = 179347.76193237305, sum_hess = 5490956}
3: base_weight = -0.032662387937307358
4: left_stats = {sum_grad = 353.69943237304688, sum_hess = 1}
5: left_weight = -176.84971618652344
6: right_stats = {sum_grad = 178994.0625, sum_hess = 5490955}
7: right_weight = -0.032597977668046951
8: candidate = (const xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal<double> >::ExpandEntry &) @0x7fffffffcbb0: {nid = 1,
  depth = 1, split = {loss_chg = 62528.5664, dir = xgboost::tree::kLeftDir, fvalue = -9.99999975e-06, findex = 19981, left_sum = {grad_ = 353.699432,
      hess_ = 1}, right_sum = {grad_ = 178994.062, hess_ = 5490955}}, timestamp = 1}
(gdb)
(gdb) c
Continuing.
2: parent_sum = {sum_grad = -179401.91696166992, sum_hess = 509044}
3: base_weight = 0.35242840647697449
4: left_stats = {sum_grad = -351.65133666992188, sum_hess = 2}
5: left_weight = 117.21710968017578
6: right_stats = {sum_grad = -179050.265625, sum_hess = 509042}
7: right_weight = 0.35173898935317993
8: candidate = (const xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal<double> >::ExpandEntry &) @0x7fffffffcbb0: {nid = 2,
  depth = 1, split = {loss_chg = 40972.1836, dir = xgboost::tree::kLeftDir, fvalue = -9.99999975e-06, findex = 3, left_sum = {grad_ = -351.651337,
      hess_ = 2}, right_sum = {grad_ = -179050.266, hess_ = 509042}}, timestamp = 2}
(gdb) c
Continuing.
[1]	dtrain-mae:0.155413	dtrain-rmse:1.35858

2: parent_sum = {sum_grad = 83074.714721679688, sum_hess = 6000000}
3: base_weight = -0.013845783658325672
4: left_stats = {sum_grad = -528.5274658203125, sum_hess = 3}
5: left_weight = 132.13186645507812
6: right_stats = {sum_grad = 83603.2421875, sum_hess = 5999997}
7: right_weight = -0.013933878391981125
8: candidate = (const xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal<double> >::ExpandEntry &) @0x7fffffffcbb0: {nid = 0,
  depth = 0, split = {loss_chg = 69850, dir = xgboost::tree::kLeftDir, fvalue = -9.99999975e-06, findex = 19981, left_sum = {grad_ = -528.527466,
      hess_ = 3}, right_sum = {grad_ = 83603.2422, hess_ = 5999997}}, timestamp = 0}
(gdb) c
Continuing.
2: parent_sum = {sum_grad = -528.52748870849609, sum_hess = 3}
3: base_weight = 132.13186645507812
4: left_stats = {sum_grad = -645.109619140625, sum_hess = 2}
5: left_weight = 215.03654479980469
6: right_stats = {sum_grad = 116.58213043212891, sum_hess = 1}
7: right_weight = -58.291065216064453
8: candidate = (const xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal<double> >::ExpandEntry &) @0x7fffffffcbb0: {nid = 1,
  depth = 1, split = {loss_chg = 75682.5234, dir = xgboost::tree::kLeftDir, fvalue = -9.99999975e-06, findex = 84, left_sum = {grad_ = -645.109619,
      hess_ = 2}, right_sum = {grad_ = 116.58213, hess_ = 1}}, timestamp = 1}
(gdb) c
Continuing.
2: parent_sum = {sum_grad = 83603.2451171875, sum_hess = 5999997}
3: base_weight = -0.013933878391981125
4: left_stats = {sum_grad = -1477.5361328125, sum_hess = 10}
5: left_weight = 134.32147216796875
6: right_stats = {sum_grad = 85080.78125, sum_hess = 5999987}
7: right_weight = -0.014180158264935017
8: candidate = (const xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal<double> >::ExpandEntry &) @0x7fffffffcbb0: {nid = 2,
  depth = 1, split = {loss_chg = 198506.359, dir = xgboost::tree::kLeftDir, fvalue = -9.99999975e-06, findex = 3, left_sum = {grad_ = -1477.53613,
      hess_ = 10}, right_sum = {grad_ = 85080.7812, hess_ = 5999987}}, timestamp = 2}
(gdb) c
Continuing.
[2]	dtrain-mae:0.357891	dtrain-rmse:5.60447

The model is identical to the one posted above (i.e. the results here are consistent and deterministic).
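
For what it's worth, plugging the sums from the gdb output above into the standard gain/weight expressions (leaf weight -G/(H+lambda); loss_chg as the children's gain minus the parent's, no gamma), with the default lambda = 1 since these runs don't set it, reproduces the printed weights and loss_chg. So the per-node arithmetic looks consistent with whatever gradient sums reach it; the suspicious part is the sums themselves (e.g. a node holding only 3 instances carrying a gradient sum of about -528). A small check:

lam = 1.0  # xgboost's default lambda; not set explicitly in these runs

def weight(g, h):
    return -g / (h + lam)

def gain(gl, hl, gr, hr):
    return gl**2 / (hl + lam) + gr**2 / (hr + lam) - (gl + gr)**2 / (hl + hr + lam)

# Root split of booster[2] above (the nid = 0 candidate printed after the [1] eval line):
GL, HL = -528.5274658203125, 3.0
GR, HR = 83603.2421875, 5999997.0

print(weight(GL + GR, HL + HR))  # ~ -0.0138458, matches base_weight
print(weight(GL, HL))            # ~  132.1319,  matches left_weight
print(weight(GR, HR))            # ~ -0.0139339, matches right_weight
print(gain(GL, HL, GR, HR))      # ~  69850,     matches loss_chg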

@jeffdk changed the title from "Numerical Instability in multi GPU for training on large data sets" to "Numerical Instability with histogram methods for training on large data sets" on Mar 14, 2019
@jeffdk
Author

jeffdk commented Mar 14, 2019

I have managed to reproduce the instability with a relatively small data set which I can share. Here is an archive of the data, a repro script, and the model output: xgboost-issue-4204.tar.gz

Updates from above

  • The instability can be seen using the cpu hist method (no GPU required)
  • I was able to reproduce on a data set with 80k rows, and only 5 features (included)
  • The presence of the instability is very sensitive to the row count and the max_bin value

Symptoms of the instability
Loss and error increasing over iterations:

[6]	dtrain-error:0.140089
[7]	dtrain-error:0.140464
[8]	dtrain-error:0.141727
[9]	dtrain-error:0.136664
[10]	dtrain-error:0.138864
[11]	dtrain-error:0.143077
[12]	dtrain-error:0.138251

Unreasonably large leaf weights (only starting in the second iteration):

booster[0]:
0:[f1<-9.99999975e-06] yes=1,no=2,missing=1,gain=24160.4473,cover=19998
        1:[f4<0.958100021] yes=3,no=4,missing=3,gain=548.710938,cover=6389
                3:leaf=1.93459618,cover=5305.5
                4:leaf=1.15366864,cover=1083.5
        2:[f2<24.2000008] yes=5,no=6,missing=5,gain=16520.0605,cover=13609
                5:leaf=0.636635423,cover=6271.25
                6:leaf=-1.57371128,cover=7337.75
booster[1]:
0:[f1<-9.99999975e-06] yes=1,no=2,missing=1,gain=2211.51538,cover=12981.4521
        1:[f3<30.7999992] yes=3,no=4,missing=3,gain=635.600464,cover=3131.0188
                3:leaf=0.852509975,cover=3060.12256
                4:leaf=-2.17617011,cover=70.896431
        2:[f2<24.9500008] yes=5,no=6,missing=5,gain=2562.04663,cover=9850.43359
                5:leaf=0.179850996,cover=6565.78125
                6:leaf=-0.901914954,cover=3284.6521
booster[2]:
0:[f2<27.9200001] yes=1,no=2,missing=1,gain=1546.1908,cover=9669.37793
        1:[f3<30.3500004] yes=3,no=4,missing=3,gain=1597.83191,cover=9472.5332
                3:leaf=-0.190983862,cover=8581.21387
                4:leaf=1.21573615,cover=891.319031
        2:[f1<-9.99999975e-06] yes=5,no=6,missing=5,gain=2776.03711,cover=196.845535
                5:leaf=1.06381559,cover=163.066742
                6:leaf=11.0240746,cover=33.7788048

Hyperparameters for repro
While with the gpu_hist method I was able to see the instability with the reg:linear objective, with the CPU-based hist method I could only reproduce it for classification (binary:logistic).

{
    "nthread": 1,
    "max_depth": 2,
    "eta": 1.0,
    "objective": "binary:logistic",
    "num_boost_round": 125,
    "tree_method": "hist",
    "max_bin": 31,
    "lambda": 0.0,
    "eval_metrics": ["logloss", "error"]
}

Note: the high learning rate and lack of L2 regularization in these parameters are there to surface the instability as quickly as possible for the repro. Setting a small learning rate or adding L2 regularization slows, but does not stop, the instability.
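
For reference, a driver along these lines exercises those parameters (the data file name below is a placeholder; the actual script and data are in the archive):

import xgboost as xgb

dtrain = xgb.DMatrix("repro.libsvm")  # placeholder for the file shipped in the archive
params = {
    "nthread": 1,
    "max_depth": 2,
    "eta": 1.0,
    "objective": "binary:logistic",
    "tree_method": "hist",
    "max_bin": 31,
    "lambda": 0.0,
    "eval_metric": ["logloss", "error"],
}
evals_result = {}
bst = xgb.train(params, dtrain, num_boost_round=125,
                evals=[(dtrain, "dtrain")], evals_result=evals_result)
print(evals_result["dtrain"]["error"])  # watch for the increase described above
bst.dump_model("repro.model.txt", with_stats=True)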

Best guess
My best guess is that there is a pathological interaction between the data distribution and the histogram method. It seems to me that I'm hitting some edge case in histogram binning, histogram gradient statistics calculation/updating, histogram split finding, or histogram prediction caching that is resulting in abnormally large leaf weights. I've tried sifting through the statistics, splits, and predictions in gdb, but I'm simply not familiar enough with the code to narrow things any further. @hcho3 @RAMitchell I hope you can use the data I've provided to shed some light on the issue. Many thanks!

@RAMitchell
Member

I had a look today. There is nothing wrong with prediction or prediction caching as far as I can tell.

The problem is identical in 'gpu_hist' and 'hist' but doesn't occur in 'exact'. Reducing the learning rate to 0.5 resolves the problem in this case. @jeffdk can you confirm what learning rates you are using on your larger dataset?

One theory: xgboost uses a Newton-style boosting algorithm, so it is not guaranteed to converge at a learning rate of 1.0 because of the second-order approximation of the loss function. But you said above that you also tried the squared-error objective, which would not fail if this were the case.

The dataset looks very sparse so maybe there is a problem with missing values?

@jeffdk
Author

jeffdk commented Mar 22, 2019

Hi @RAMitchell, thanks for taking a look!

The problem is identical in 'gpu_hist' and 'hist' but doesn't occur in 'exact'.

That is correct

Reducing the learning rate to 0.5 resolves the problem in this case.

Actually, I claim there is still an issue for learning rate of 0.5; the dtrain error increases on iteration 10. Here is the corresponding tree:

booster[10]:
0:[f2<25.0499992] yes=1,no=2,missing=1,gain=158.67746,cover=9094.04004
        1:[f3<30.3500004] yes=3,no=4,missing=3,gain=352.901093,cover=7533.38428
                3:leaf=-0.00136313098,cover=6965.0625
                4:leaf=0.408399105,cover=568.321716
        2:[f4<0.557699978] yes=5,no=6,missing=5,gain=233.026367,cover=1560.6554
                5:leaf=0.464131415,cover=142.393295
                6:leaf=-0.206838489,cover=1418.26208

@jeffdk can you confirm what learning rates you are using on your larger dataset?

I use rates in the range of [0.09-0.2] usually. I've only been cranking the rate up to make the instability easier to see.

One theory is that xgboost uses a Newton style gradient descent algorithm. Therefore it is not guaranteed to converge for learning rate 1.0 due to approximation of the loss function.

I still see a divergence using the attached example and a learning rate of 0.9. Regardless, my impression is that for the binomial log-loss there should be no mathematical problem with convergence given full Newton steps (well, at least, the solution shouldn't diverge, right?).
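
As context for why bad splits turn into such large leaves (my own illustration, not a claim about the root cause): with binary:logistic the per-example gradient is p - y and the hessian is p(1 - p), so a leaf that ends up holding only a few confidently predicted examples has a tiny hessian sum, and with lambda = 0 and eta = 1.0 the full Newton leaf weight -sum(g) / (sum(h) + lambda) becomes huge.

import numpy as np

def grad_hess_logistic(p, y):
    return p - y, p * (1.0 - p)       # per-example gradient and hessian

p = np.array([0.999, 0.999, 0.001])   # hypothetical, very confident predictions
y = np.array([1.0, 1.0, 1.0])
g, h = grad_hess_logistic(p, y)

lam = 0.0                             # as in the repro parameters above
print(-g.sum() / (h.sum() + lam))     # ~ 334: a huge single-leaf Newton step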

But you said above that you also tried the squared error objective, which would not fail if this were the case.

I do see the instability on the full data set with gpu_hist using squared error. I'm not able to repro an instability using squared error loss with the cpu hist method.

The dataset looks very sparse so maybe there is a problem with missing values?

Sparsity may be related in some way. What sort of problems with missing values do you think could manifest like this?

@trivialfis
Member

Closing in favour of #5023

@lock lock bot locked as resolved and limited conversation to collaborators May 5, 2020