
[doc] Draft for language binding consistency. [skip ci] #9755

Merged
merged 3 commits into dmlc:master from doc-consistency
Nov 29, 2023

Conversation

@trivialfis (Member)

No description provided.

@trivialfis (Member, Author)

Hi all, this is a draft documenting some consistency issues that we can hopefully address in the future. Please help review and make suggestions.

cc @ExpandingMan @david-cortes

@david-cortes (Contributor)

@trivialfis Thanks for the info. I'm also wondering about the recommended way of handling the .cv functions, and about details like creating one big DMatrix and slicing it vs. creating separate ones out of sliced data. I see, for example, that the R interface does a lot of things outside of the core C++ library.
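For reference, here's roughly what the two approaches look like in the Python binding (DMatrix.slice exists there; whether and how other bindings expose row slicing varies):

import numpy as np
import xgboost as xgb

# Synthetic data for illustration.
X = np.random.default_rng(0).normal(size=(500, 4))
y = np.random.default_rng(1).normal(size=500)

# Approach 1: one big DMatrix, sliced per fold.
dall = xgb.DMatrix(X, label=y)
dfold_a = dall.slice(np.arange(100))  # rows 0..99 as a new DMatrix

# Approach 2: slice the raw data first, then build a separate DMatrix.
dfold_b = xgb.DMatrix(X[:100], label=y[:100])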

@trivialfis (Member, Author) commented Nov 3, 2023

@david-cortes IMHO, XGBoost doesn't need to have a CV function in its interface. The existing implementation pre-dates my involvement in XGBoost. To me, the CV function should be part of a more general hyper-parameter optimization library or a model evaluation library instead of XGBoost. My opinion is based on the fact that cross-validation is a general statistical technique rather than a gradient-boosting-specific one.

Ideal case aside, if we were to have a specialized version for XGBoost-specific features (like working with survival data, learning to rank, custom early-stopping strategies, etc.) as a temporary solution, I think we can work directly with input data like data.table and np.ndarray instead of DMatrix. This way, it can work with the high-level interfaces and hopefully be useful for other projects as well (like LightGBM).
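For concreteness, a minimal sketch of such a CV helper operating on raw arrays (the cv signature here is hypothetical, not the existing xgboost.cv):

import xgboost as xgb
from sklearn.model_selection import KFold


def cv(params, X, y, nfold=5, num_boost_round=100):
    # Illustrative only: slice the raw arrays per fold and build the
    # DMatrix internally, so callers never touch DMatrix themselves.
    fold_results = []
    for train_idx, test_idx in KFold(n_splits=nfold).split(X):
        dtrain = xgb.DMatrix(X[train_idx], label=y[train_idx])
        dtest = xgb.DMatrix(X[test_idx], label=y[test_idx])
        evals_result = {}
        xgb.train(
            params,
            dtrain,
            num_boost_round=num_boost_round,
            evals=[(dtest, "test")],
            evals_result=evals_result,
            verbose_eval=False,
        )
        fold_results.append(evals_result["test"])
    return fold_results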

@trivialfis (Member, Author)

> I think we can directly work with input data like data.table and np.ndarray instead of DMatrix.

That said, it's totally reasonable to have a different project than XGBoost, or upstream some changes to existing projects like scikit-learn.

@david-cortes (Contributor)

(Note: this comment is unrelated to the PR under discussion.)

Generally, it makes sense to leave cross-validation to a dedicated framework, but there are cases where a dedicated cv function can exploit optimizations specific to the algorithm that a higher-level interface wouldn't.

I'm not familiar with what xgboost does internally in each situation, but, for example, in catboost, binning the data for the histogram method (especially when there are categorical features) can take a lot of time, and an efficient CV routine can arrange to do this only once instead of once per data fold.

@trivialfis (Member, Author) commented Nov 3, 2023

> Generally, it makes sense to leave cross-validation to a dedicated framework, but there are cases where a dedicated cv function can exploit optimizations specific to the algorithm that a higher-level interface wouldn't.

That's a very good point! We don't have this sort of optimization yet, but it can be added if necessary. Basically, one would obtain a QuantileDMatrix for the full dataset and use it as a reference for the others. However, I'm a bit confused: wouldn't this leak information about the validation data into training? It might not matter in practice, but it seems like an odd choice.

from sklearn.model_selection import train_test_split
from xgboost import QuantileDMatrix

Xy = QuantileDMatrix(X, y)  # <- contains the binning for the whole dataset.
X_train, X_test, y_train, y_test = train_test_split(X, y)
Xy_train = QuantileDMatrix(X_train, y_train, ref=Xy)
Xy_test = QuantileDMatrix(X_test, y_test, ref=Xy)

@david-cortes
Copy link
Contributor

> However, I'm a bit confused: wouldn't this leak information about the validation data into training?

Yes, there is some small data leakage when doing cross-validation with the same quantization for all folds, and it gives slightly different results than doing it the "correct" way, but as sample sizes increase, this difference shrinks.

That said, I see that, for example, the lightgbm authors want to move away from this CV behavior, which is why I was asking whether there would be a recommendation for xgboost's CV function. I also think that in xgboost's case there is less pre-processing than in e.g. catboost, so the time and memory gains would be much smaller.
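For comparison, a rough sketch of the "correct" per-fold variant, where each training fold is quantized independently and its validation fold reuses only that fold's bins (hypothetical helper code, not an existing xgboost API):

from sklearn.model_selection import KFold
from xgboost import QuantileDMatrix


def per_fold_dmatrices(X, y, nfold=5):
    # Quantize from the training fold only, so no validation
    # information enters the sketch used for binning.
    for train_idx, test_idx in KFold(n_splits=nfold).split(X):
        Xy_train = QuantileDMatrix(X[train_idx], y[train_idx])
        # Reference the training fold's quantiles, not the full data's.
        Xy_test = QuantileDMatrix(X[test_idx], y[test_idx], ref=Xy_train)
        yield Xy_train, Xy_test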

@ExpandingMan

The only part of the draft that I have significant concerns about is the default parameter values. Some of the default values are problematic; in particular, see this issue, in which the default value for the number of threads can cause programs to silently hang if it exceeds the number of Julia process threads. I would not be at all surprised if that were not the only such issue (are there any parameters that reference an index? I can't think of any off the top of my head, but if so, they will differ between 0-based and 1-based languages). I think it's reasonable to require that language bindings can always accept a null argument to reproduce the libxgboost defaults. I also think it's reasonable to have a policy of preferring libxgboost defaults where they do not cause undesired side effects, but declaring that as a strict policy seems risky.

On a somewhat less important note, the iteration number seems like it could present some concerns. Currently, in the Julia wrapper, it looks like we start with round 1 and don't translate it, which, if I'm reading your documentation correctly, sounds like it might actually be a bug. This can of course be easily fixed, but I think on the Julia side we would prefer to keep iteration numbers 1-based for consistency with indexing.

Finally, I agree that the wrapper packages should stay separate from additional general machine learning functionality like CV. I've encountered a few people who claimed we should add it, but there's a handful of packages with CV functionality in Julia, and I'd imagine the number of such packages in Python must be astronomical. Early stopping, on the other hand, I don't see the same way, as it's much harder to provide a general early-stopping interface that works across different ML algorithms (arguably such a thing could not even be logically consistent). I do think it would be a good idea for the core library to add early stopping if the developers are so inclined.

@trivialfis (Member, Author) commented Nov 8, 2023

> Some of the default values are problematic

We can always make exceptions. This is a guideline, not a policy, as stated in the first paragraph.

> On a somewhat less important note, the iteration number seems like it could present some concerns.

Is it possible to translate it to 0-based under the hood?

> I do think it would be a good idea for the core library to add early stopping if the developers are so inclined.

This might not happen in the near future. Having ES inside the core would mean moving all the logic of callbacks, custom objectives, and metrics into the C++ core, which is quite challenging, to say the least.

@ExpandingMan

> Is it possible to translate it to 0-based under the hood?

Yes, definitely; I was just a bit unsure whether that would comply with your proposed guidelines because of reporting and all. If we can do the same thing we do with array indices, I have no concern here.
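Something like the following boundary translation, sketched in Python for illustration (the Julia wrapper would do the equivalent at its own surface): a 1-based round number at the user-facing side maps directly onto libxgboost's 0-based, half-open iteration_range.

def predict_at_round(booster, dmatrix, round_1based):
    # iteration_range is 0-based and half-open, so 1-based round N
    # corresponds to iterations [0, N).
    return booster.predict(dmatrix, iteration_range=(0, round_1based))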

> This might not happen in the near future. Having ES inside the core would mean moving all the logic of callbacks, custom objectives, and metrics into the C++ core, which is quite challenging, to say the least.

Oh well. I had a feeling it might be quite involved, but thought I'd mention it just in case.

trivialfis merged commit 59684b2 into dmlc:master on Nov 29, 2023
1 check passed
trivialfis deleted the doc-consistency branch on November 29, 2023 05:13