Numerical Instability with histogram methods for training on large data sets #4204
Comments
Seems like it could be an overflow issue? Can you please post the text dump of the first couple of trees using some small depth?
The smallest depth I have on hand right now is 5. This model's instability was less pronounced, but you can see a leaf weight of over 20 present early on (search the file for "leaf=2" from the start): 12th tree from the top (considering a 3-class model, this is the 4th round): bzktf.model.txt
Hey @jeffdk I’m wondering if you observed this with other objective functions, as well. Did you try How many workers do you launch? I’m guessing just one from the parameters. Have you tried not setting the predictor to CPU? I believe setting
Hey @mt-jones thanks for checking in,
It would be helpful to know if the issue persists with all objective functions, etc. Could you try It would also be helpful if we had a small reproducer. Is there a way for you to mock up some kind of data?
I have a repro of a run using a depth of 2 where you can see extremely large leaf values in the first few trees (magnitude > 100):
Full dump here: vqqhu.model.txt Hyperparameters used for this run:
@mt-jones I also see the issue with the
Full dump: zgdjw.model.txt Hyperparameters:
I took a stab at mocking some data, starting with the simplest thing first: random values in a dense np array, loaded into a DMatrix. However, this runs OOM on the
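A minimal sketch of that mock-data approach (the sizes, density, and the use of a sparse matrix are assumptions here, not the original setup; the sparse format is mainly there to avoid the dense-array OOM described above):

import numpy as np
import scipy.sparse as sp
import xgboost as xgb

# Placeholder sizes -- the real feature matrix is far larger (~10B entries).
n_rows, n_cols = 1_000_000, 1_000

# A dense random array at the real scale will not fit in memory; a sparse
# random matrix keeps the mock data's footprint manageable.
X = sp.random(n_rows, n_cols, density=0.01, format="csr", dtype=np.float32)
y = np.random.randint(0, 2, size=n_rows)

dtrain = xgb.DMatrix(X, label=y)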
Thanks @jeffdk. When you dump the tree, can you add the option 'with_stats' for a bit more info? It seems like you are using a lot of features. Does the problem occur if you reduce the number of features? I have two theories:
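A quick sketch of the with_stats dump being asked for above (assuming bst is the trained Booster from one of these runs; the output file name is a placeholder):

# Write the full dump, including gain and cover for each node, to a file.
bst.dump_model("dump.with_stats.txt", with_stats=True)

# Or inspect the first few trees in-process; each list element is one tree.
for i, tree in enumerate(bst.get_dump(with_stats=True)[:3]):
    print("booster[%d]:\n%s" % (i, tree))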
@RAMitchell @mt-jones
I also have an example from debugging (original 21k feature set) where I see the instability very quickly, in the second tree; perhaps you can pull a clue out of the state I've captured here?
The breakpoint is here. It looks like some of the indices in the data structure I printed are close to min/max int32.
I can't see anything definitive from these dumps. The indices that are close to the int32 limits in this case are not a problem. We can definitely see the gradients increasing over boosting iterations; this should basically never happen in gradient boosting. The problem will be pinpointing exactly where the incorrect calculation is occurring. Using the 'reg:linear' objective helps because the hessian is always 1.0, so we can see how many training instances are in each node from 'cover' in the text dumps. One thing I notice is that in some of the trees the child nodes have higher gain than the root node. This almost never happens in gradient boosting, because we tend to greedily select the best splits early in the tree.
Whatever statistics were used to calculate the split of node 2 in the above were almost certainly incorrect. It's also very unusual to split with only three instances on the left-hand side. You might be able to find more inconsistencies by drilling down into this particular example.
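Along those lines, a rough post-processing sketch for the text dumps (the file name and threshold are placeholders): it flags trees with suspiciously large leaf weights and reports each tree's largest split gain, which makes the child-gain-exceeds-root pattern easier to spot.

import re

NUM = r"(-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)"
LEAF_RE = re.compile("leaf=" + NUM)
GAIN_RE = re.compile("gain=" + NUM)

def scan_dump(path, leaf_threshold=10.0):
    # Walk a text dump (ideally produced with with_stats=True) and collect,
    # per tree, the largest absolute leaf weight and the largest split gain.
    stats = []  # list of [tree_id, max_abs_leaf, max_gain]
    for line in open(path):
        if line.startswith("booster["):
            stats.append([line.strip().rstrip(":"), 0.0, 0.0])
            continue
        if not stats:
            continue
        for m in LEAF_RE.finditer(line):
            stats[-1][1] = max(stats[-1][1], abs(float(m.group(1))))
        for m in GAIN_RE.finditer(line):
            stats[-1][2] = max(stats[-1][2], float(m.group(1)))
    for tree_id, max_leaf, max_gain in stats:
        if max_leaf > leaf_threshold:
            print("%s: max |leaf|=%.3f, max gain=%.3f" % (tree_id, max_leaf, max_gain))

scan_dump("vqqhu.model.txt")  # placeholder file name from the dumps above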
Thanks for taking a look and giving me some direction, @RAMitchell! I dug into a very similar example (details below), but I'm at a loss as to where to point the debugger next; do you have a suggestion?
Script & Parameters
I ran with
Debugging state
The model is identical to the one posted above (i.e. the results here are consistent and deterministic).
I have managed to reproduce the instability with a relatively small data set which I can share. Here is an archive of the data, a repro script, and the model output: xgboost-issue-4204.tar.gz
Updates from above
Symptoms of the instability
Unreasonably large leaf weights (only starting in the second iteration):
Hyperparameters for repro
Note: The high learning rate and lack of L2 regularization in these parameters are there to surface the instability as quickly as possible for the repro. Setting a small learning rate or adding L2 regularization slows, but does not stop, the instability.
Best guess
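To make the shape of the repro concrete, here is a minimal sketch in the spirit of the description above; the data, parameter values, and round budget are assumptions rather than the actual contents of the attached archive.

import numpy as np
import xgboost as xgb

# Synthetic stand-in data; the real repro uses the attached archive instead.
rng = np.random.RandomState(0)
X = rng.rand(10_000, 50).astype(np.float32)
y = rng.randint(0, 2, size=10_000)
dtrain = xgb.DMatrix(X, label=y)

# Illustrative parameters (values are assumptions, chosen to mirror the
# description above: high learning rate, no L2 regularization, gpu_hist).
params = {
    "tree_method": "gpu_hist",
    "objective": "binary:logistic",
    "eval_metric": "error",
    "eta": 1.0,     # deliberately high to surface the instability quickly
    "lambda": 0.0,  # no L2 regularization, for the same reason
    "max_depth": 2,
}

evals_result = {}
bst = xgb.train(
    params,
    dtrain,
    num_boost_round=50,
    evals=[(dtrain, "train")],
    evals_result=evals_result,
)

# A healthy run shows (near-)monotonically decreasing training error; the
# instability shows up as train error that climbs back up after a few rounds.
print(evals_result["train"]["error"])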
I had a look today. There is nothing wrong with prediction or prediction caching as far as I can tell. The problem is identical in 'gpu_hist' and 'hist' but doesn't occur in 'exact'. Reducing the learning rate to 0.5 resolves the problem in this case. @jeffdk can you confirm what learning rates you are using on your larger dataset? One theory is that xgboost uses a Newton-style gradient descent algorithm, so it is not guaranteed to converge for a learning rate of 1.0 due to the approximation of the loss function. But you said above that you also tried the squared error objective, which would not fail if this were the case. The dataset looks very sparse, so maybe there is a problem with missing values?
Hi @RAMitchell, thanks for taking a look!
That is correct
Actually, I claim there is still an issue for a learning rate of 0.5; the dtrain error increases on iteration 10. Here is the corresponding tree:
I usually use rates in the range [0.09-0.2]. I've only been cranking the rate up to make the instability easier to see.
I still see a divergence using the attached example and a learning rate of 0.9. Regardless, my impression is that for the binomial logloss there should be no problems with convergence mathematically given full Newton steps (well, at least, I claim the solution shouldn't diverge, right?).
I do see the instability on the full data set with
Sparsity may be related in some way. What sort of problems with missing values do you think could manifest in this way?
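A small follow-on check, reusing evals_result from a run like the sketch earlier in the thread (the variable name is assumed), to pin down the first round where the training error turns back up:

# `evals_result` is assumed to come from xgb.train(..., evals_result=evals_result).
err = evals_result["train"]["error"]
first_bad_round = next((i for i in range(1, len(err)) if err[i] > err[i - 1]), None)
print("first round with increasing train error:", first_bad_round)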
Closing in favour of #5023
TLDR and ask
I've got a tricky issue which I only see when training on data sets above a certain size. I am currently unable to provide a clean repro, as I only encounter the problem when my data is above a certain threshold (around 10 billion entries in the feature matrix). I'm asking for some help on how to further debug this issue; for example, since I don't know the details of the code nearly as well as a contributor, perhaps you could suggest a watchpoint I can set to break on and gather more data about the program state at the onset of the instability.
Symptoms:
See my comment here for an example: #4095 (comment)
7:leaf=43151.1016 8:leaf=-0.147368416
(Though it is certainly unrelated, the behavior feels similar to what was seen in XGBoost Regression Accuracies Worst with gpu_hist, Slower with gpu_exact #2754)
Data details
Experiments tried
I tried setting max_delta_step; however, learning is significantly impaired. Setting max_delta_step really just caps the potential impact of the instability, and I find that it is still impossible to learn a model that is competitive with what I've seen training the same data set with the single-node CPU exact method.
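For reference, a one-line sketch of the cap described above (the params dict and the value 1.0 are assumptions, not the settings actually used):

# Bounds each tree's leaf output, which limits -- but does not fix -- the
# damage a bad split can do.
params["max_delta_step"] = 1.0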
Environment and build info
Sample hyperparameters (I've tried many experiments, but all are generally similar to these):
(nvidia runtime). CUDA and NCCL were installed via these deb packages from nvidia's repository: cuda-toolkit-10-0 libnccl2=2.4.2-1+cuda10.0 libnccl-dev=2.4.2-1+cuda10.0
p3dn.24xlarge instance. A smaller number of GPUs is unable to fit the size of my data.
The testxgboost suite and other tests for tree updaters with GPU pass.
Please let me know what other information I can provide that would be useful; any help is appreciated! Thanks all!
-Jeff