
[doc] Draft for language binding consistency. [skip ci] #9755

Merged
merged 3 commits into dmlc:master from doc-consistency
Nov 29, 2023

Conversation

@trivialfis (Member)

No description provided.

@trivialfis (Member, Author)

Hi all, this is a draft documenting some consistency issues that we can hopefully address in the future. Please help review and make suggestions.

cc @ExpandingMan @david-cortes

@david-cortes (Contributor)

@trivialfis Thanks for the info. I'm also wondering about the recommended way of handling the .cv functions, and about details like creating one big DMatrix and slicing it vs. creating separate ones out of sliced data. I see, for example, that the R interface does a lot of things outside of the core C++ library.
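For reference, here's roughly what the two approaches look like in the Python binding (DMatrix.slice exists there; whether and how other bindings expose row slicing varies):

import numpy as np
import xgboost as xgb

# Synthetic data for illustration.
X = np.random.default_rng(0).normal(size=(500, 4))
y = np.random.default_rng(1).normal(size=500)

# Approach 1: one big DMatrix, sliced per fold.
dall = xgb.DMatrix(X, label=y)
dfold_a = dall.slice(np.arange(100))  # rows 0..99 as a new DMatrix

# Approach 2: slice the raw data first, then build a separate DMatrix.
dfold_b = xgb.DMatrix(X[:100], label=y[:100])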

@trivialfis (Member, Author) commented Nov 3, 2023

@david-cortes IMHO, XGBoost doesn't need to have a CV function in its interface. The existing implementation pre-dates my involvement in XGBoost. To me, the CV function should be part of a more general hyper-parameter optimization library or a model evaluation library instead of XGBoost. My opinion is based on the fact that cross-validation is a general statistical technique rather than a gradient-boosting-specific one.

Ideal case aside, if we were to have a specialized version for XGBoost-specific features (like working with survival data, learning to rank, custom early-stopping strategies, etc.) as a temporary solution, I think we can work directly with input data like data.table and np.ndarray instead of DMatrix. This way, it can work with the high-level interfaces and hopefully be useful for other projects as well (like LightGBM).
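For concreteness, a minimal sketch of such a CV helper operating on raw arrays (the cv signature here is hypothetical, not the existing xgboost.cv):

import xgboost as xgb
from sklearn.model_selection import KFold


def cv(params, X, y, nfold=5, num_boost_round=100):
    # Illustrative only: slice the raw arrays per fold and build the
    # DMatrix internally, so callers never touch DMatrix themselves.
    fold_results = []
    for train_idx, test_idx in KFold(n_splits=nfold).split(X):
        dtrain = xgb.DMatrix(X[train_idx], label=y[train_idx])
        dtest = xgb.DMatrix(X[test_idx], label=y[test_idx])
        evals_result = {}
        xgb.train(
            params,
            dtrain,
            num_boost_round=num_boost_round,
            evals=[(dtest, "test")],
            evals_result=evals_result,
            verbose_eval=False,
        )
        fold_results.append(evals_result["test"])
    return fold_results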

@trivialfis (Member, Author)

> I think we can directly work with input data like data.table and np.ndarray instead of DMatrix.

That said, it's totally reasonable to have a different project than XGBoost, or upstream some changes to existing projects like scikit-learn.

@david-cortes (Contributor)

(Note: this comment is unrelated to the PR under discussion.)

Generally, it makes sense to leave cross-validation to a dedicated framework, but there are cases where a dedicated cv function can exploit optimizations specific to the algorithm that a higher-level interface wouldn't.

I'm not familiar with what xgboost does internally in each situation, but, for example, in catboost, binning the data for the histogram method (especially when there are categorical features) can take a lot of time, and an efficient CV routine can arrange to do this only once instead of once per data fold.

@trivialfis (Member, Author) commented Nov 3, 2023

> Generally, it makes sense to leave cross-validation to a dedicated framework, but there are cases where a dedicated cv function can exploit optimizations specific to the algorithm that a higher-level interface wouldn't.

That's a very good point! We don't have this sort of optimization yet, but it can be added if necessary. Basically, one would obtain a QuantileDMatrix for the full dataset and use it as a reference for the others. However, I'm a bit confused: wouldn't this leak information about the validation data into training? It might not matter in practice, but it seems like an odd choice.

from sklearn.model_selection import train_test_split
from xgboost import QuantileDMatrix

Xy = QuantileDMatrix(X, y)  # <- contains the binning for the whole dataset.
X_train, X_test, y_train, y_test = train_test_split(X, y)
Xy_train = QuantileDMatrix(X_train, y_train, ref=Xy)
Xy_test = QuantileDMatrix(X_test, y_test, ref=Xy)

@david-cortes
Copy link
Contributor

> However, I'm a bit confused: wouldn't this leak information about the validation data into training?

Yes, there is some small data leakage when doing cross-validation with the same quantization for all folds, and it gives slightly different results than doing it the "correct" way, but as sample sizes increase, this difference shrinks.

That said, I see that, for example, the lightgbm authors want to move away from this CV behavior, which is why I was asking whether there would be a recommendation for xgboost's CV function. I also think that in xgboost's case there is less pre-processing than in e.g. catboost, so the time and memory gains would be much smaller.
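For comparison, a rough sketch of the "correct" per-fold variant, where each training fold is quantized independently and its validation fold reuses only that fold's bins (hypothetical helper code, not an existing xgboost API):

from sklearn.model_selection import KFold
from xgboost import QuantileDMatrix


def per_fold_dmatrices(X, y, nfold=5):
    # Quantize from the training fold only, so no validation
    # information enters the sketch used for binning.
    for train_idx, test_idx in KFold(n_splits=nfold).split(X):
        Xy_train = QuantileDMatrix(X[train_idx], y[train_idx])
        # Reference the training fold's quantiles, not the full data's.
        Xy_test = QuantileDMatrix(X[test_idx], y[test_idx], ref=Xy_train)
        yield Xy_train, Xy_test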

@ExpandingMan

The only part of the draft that I have significant concerns about is the default parameter values. Some of the default values are problematic; in particular, see this issue, in which the default value for the number of threads can cause programs to silently hang if it exceeds the number of Julia process threads. I would not be at all surprised if that were not the only such issue (are there any parameters that reference an index? I can't think of any off the top of my head, but if so, they will differ between 0-based and 1-based languages). I think it's reasonable to require that language bindings can always accept a null argument to reproduce the libxgboost defaults. I also think it's reasonable to have a policy of preferring libxgboost defaults where they do not cause undesired side effects, but declaring that as a strict policy seems risky.

On a somewhat less important note, the iteration number seems like it could present some concerns. Currently, in the Julia wrapper, it looks like we start with round 1 and don't translate it, which, if I'm reading your documentation correctly, sounds like it might actually be a bug. This can of course be easily fixed, but I think on the Julia side we would prefer to keep iteration numbers 1-based for consistency with indexing.

Finally, I agree that the wrapper packages should stay separate from additional general machine learning functionality like CV. I've encountered a few people who claimed we should add it, but there's a handful of packages with CV functionality in Julia, and I'd imagine the number of such packages in Python must be astronomical. Early stopping, on the other hand, I don't see the same way, as it's much harder to provide a general early-stopping interface that works across different ML algorithms (arguably such a thing could not even be logically consistent). I do think it would be a good idea for the core library to add early stopping if the developers are so inclined.

@trivialfis (Member, Author) commented Nov 8, 2023

> Some of the default values are problematic

We can always make exceptions. This is a guideline, not a policy, as stated in the first paragraph.

> On a somewhat less important note, the iteration number seems like it could present some concerns.

Is it possible to translate it to 0-based under the hood?

> I do think it would be a good idea for the core library to add early stopping if the developers are so inclined.

This might not happen in the near future. Having ES inside the core would mean moving all the logic of callbacks, custom objectives, and metrics into the C++ core, which is quite challenging, to say the least.

@ExpandingMan

> Is it possible to translate it to 0-based under the hood?

Yes, definitely; I was just a bit unsure whether that would comply with your proposed guidelines because of reporting and all. If we can do the same thing we do with array indices, I have no concern here.
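Something like the following boundary translation, sketched in Python for illustration (the Julia wrapper would do the equivalent at its own surface): a 1-based round number at the user-facing side maps directly onto libxgboost's 0-based, half-open iteration_range.

def predict_at_round(booster, dmatrix, round_1based):
    # iteration_range is 0-based and half-open, so 1-based round N
    # corresponds to iterations [0, N).
    return booster.predict(dmatrix, iteration_range=(0, round_1based))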

> This might not happen in the near future. Having ES inside the core would mean moving all the logic of callbacks, custom objectives, and metrics into the C++ core, which is quite challenging, to say the least.

Oh well. I had a feeling it might be quite involved, but thought I'd mention it just in case.

trivialfis merged commit 59684b2 into dmlc:master on Nov 29, 2023
1 check passed
trivialfis deleted the doc-consistency branch on November 29, 2023 05:13