[doc] Draft for language binding consistency. [skip ci] #9755
Conversation
Hi all, this is a draft documenting some consistency issues that hopefully we can address in the future. Please help review and make suggestions.
@trivialfis Thanks for the info. Also wondering about the recommended way of handling the built-in CV function.
@david-cortes IMHO, XGBoost doesn't need a CV function in its interface; the existing implementation pre-dates my involvement in XGBoost. To me, a CV function belongs in a more general hyper-parameter optimization or model evaluation library rather than in XGBoost. My opinion is based on the fact that cross-validation is a general statistical technique, not a gradient-boosting-specific one. The ideal case aside, if we were to have a version specialized for XGBoost features (like working with survival data, learning to rank, custom early stopping strategies, etc.) as a temporary solution, I think we can work directly with the input data.
That said, it's totally reasonable for this to live in a project separate from XGBoost, or to upstream some changes to existing projects like scikit-learn.
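As a concrete illustration of the "work directly with the input data" approach described above, here is a minimal sketch of fold-level cross-validation driven by scikit-learn rather than a built-in CV function. This is not code from the PR; the dataset, parameters, and metric are placeholders.

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import KFold

# Placeholder data; any NumPy arrays (or DataFrames) work the same way.
X = np.random.rand(1000, 10)
y = np.random.rand(1000)

rmse = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    dtrain = xgb.DMatrix(X[train_idx], label=y[train_idx])
    dtest = xgb.DMatrix(X[test_idx], label=y[test_idx])
    booster = xgb.train({"objective": "reg:squarederror"}, dtrain, num_boost_round=100)
    pred = booster.predict(dtest)
    rmse.append(float(np.sqrt(np.mean((pred - y[test_idx]) ** 2))))
print(sum(rmse) / len(rmse))
```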
(Note: this comment is unrelated to the PR under discussion.) Generally, it makes sense to leave cross-validation to a dedicated framework, but there are cases where a dedicated CV function inside the library can be substantially more efficient. I'm not familiar with what xgboost does internally in each situation, but for example in catboost, binning the data in the histogram method (especially when there are categorical features) can take a lot of time, and an efficient CV implementation could perform that binning once and reuse it across folds.
That's a very good point! We don't have this sort of optimization yet, but it can be done if necessary. Basically, one would obtain a reference `QuantileDMatrix` for the whole dataset first, then build the per-split matrices against it:

```python
from sklearn.model_selection import train_test_split
from xgboost import QuantileDMatrix

Xy = QuantileDMatrix(X, y)  # <- contains the binning for the whole dataset
X_train, X_test, y_train, y_test = train_test_split(X, y)
Xy_train = QuantileDMatrix(X_train, y_train, ref=Xy)
Xy_test = QuantileDMatrix(X_test, y_test, ref=Xy)
```
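Extending that snippet to full k-fold cross-validation gives a sketch of what a binning-reusing CV function could look like. This assumes `X` and `y` are NumPy arrays as in the earlier sketch, and accepts the small cross-fold leakage from shared quantization discussed below.

```python
from sklearn.model_selection import KFold
from xgboost import QuantileDMatrix

Xy = QuantileDMatrix(X, y)  # quantize/bin the whole dataset once
for train_idx, test_idx in KFold(n_splits=5).split(X):
    # Each fold's matrix reuses the binning from the reference Xy
    # instead of recomputing the quantile sketch per fold.
    Xy_train = QuantileDMatrix(X[train_idx], y[train_idx], ref=Xy)
    Xy_test = QuantileDMatrix(X[test_idx], y[test_idx], ref=Xy)
    # ...then train with xgb.train(params, Xy_train, evals=[(Xy_test, "test")])
```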
Yes, there is some small data leakage when doing cross-validation with the same quantization for all folds, and it gives slightly different results than doing it the "correct" way, but the difference shrinks as sample sizes increase. I see that, for example, the lightgbm authors want to move away from this CV behavior, which is why I was asking whether there would be a recommendation for xgboost's CV function. That said, xgboost does less pre-processing than e.g. catboost, so the time and memory gains would be much smaller.
The only part of the draft that I have significant concerns about is the default parameter values. Some of the defaults are problematic; in particular, see this issue in which the default value for the number of threads can cause programs to silently hang if they exceed the number of Julia process threads. I would not be at all surprised if that were not the only such issue (are there any parameters that reference an index? I can't think of any off the top of my head, but if so they will differ between 0- and 1-based languages). I think it's reasonable to require that language bindings can always accept a null argument to reproduce the default behavior.

On a somewhat less important note, the iteration number seems like it could present some concerns. Currently, in the julia wrapper, it looks like we start with round 1 rather than 0.

Finally, I agree that the wrapper packages should be separate from additional general machine learning functionality like CV. I've encountered a few people who claimed we should add it, but there is a handful of packages with CV functionality in Julia, and I'd imagine the number of such packages in Python must be astronomical. Early stopping, on the other hand, I don't see the same way, as it's much harder to provide a general interface for it that works across different ML algorithms (arguably such a thing could not even be logically consistent). I do think it would be a good idea for the core library to add early stopping if the developers are so inclined.
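One possible way to honor the "null argument reproduces the default" requirement, sketched here with a hypothetical helper (the name and signature are illustrative, not an actual XGBoost API): unset parameters are omitted from the parameter dict entirely, so the core library's own default applies rather than a binding-specific hard-coded value.

```python
def make_core_params(nthread=None, **explicit):
    """Hypothetical binding-level helper: forward only parameters the
    user actually set. Passing None (the "null argument") means the key
    is omitted entirely, so the core library's own default applies,
    rather than a binding hard-coding, e.g., a thread count that may
    exceed the host process's thread budget."""
    params = dict(explicit)
    if nthread is not None:
        params["nthread"] = nthread
    return params

# make_core_params()          -> {}               (core defaults everywhere)
# make_core_params(nthread=4) -> {"nthread": 4}   (explicit override)
```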
We can always make exceptions. This is a guideline, not a policy, as shared in the first paragraph.
Is it possible to translate it to 0-based under the hood?
This might not happen in the near future. Having early stopping inside the core would mean moving all the logic for callbacks, custom objectives, and metrics into the C++ core, which is quite challenging, to say the least.
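For the 0-based translation, a 1-based binding could simply shift indices at the language boundary. A tiny hypothetical sketch (the helper name and constants are illustrative, not an actual XGBoost API):

```python
CORE_BASE = 0  # the XGBoost C API counts boosting rounds from 0

def to_core_round(user_round, binding_base=1):
    """Hypothetical helper for a 1-based binding (e.g. Julia or R):
    shift the user-facing round number to the core's 0-based index
    right before crossing the C boundary, and shift back (+1) when
    reporting round numbers to the user."""
    if user_round < binding_base:
        raise ValueError(f"round numbers start at {binding_base}")
    return user_round - binding_base + CORE_BASE
```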
Yes, definitely. I was just a bit unsure whether that would comply with your proposed guidelines because of reporting and all. If we can do the same thing we do with array indices, I have no concern here.
Oh well. I had a feeling it might be quite involved, but thought I'd mention it just in case.