Auto encoding for categorical data during inference. #11088

trivialfis · 2024-12-11T06:03:42Z

We are working on automatic re-encoding for categorical features during inference. This teaches the booster to handle data encoded differently than the training dataset and eliminates the need for a scikit-learn pipeline for data encoding when using DataFrame inputs.

Removed the Spark variants: Spark DataFrames don't carry a categorical encoding. Use the StringIndexer instead.

Strange behaviour with Xgboost when dealing with categorical data type, Possibly a bug #9676

Tracking PRs:


trivialfis · 2024-12-11T12:55:10Z

@david-cortes May I ask how the tmp is kept alive during the construction of QDM?

xgboost/R-package/R/xgb.DMatrix.R

Line 572 in f4f3bd4

rm(tmp)

The setData from the proxy DMatrix only holds a reference (a C pointer) to the input data, so if the data is released early, it's a use-after-free error. However, testing qdm.from_iterator seems fine.

david-cortes · 2024-12-11T17:15:37Z

@trivialfis In that function, the DMatrix is set in the line right before tmp gets deleted. Or are you saying that the data still needs to be kept after the DMatrix has been set? If so, at which point is it safe to deallocate the data?

Regarding the feature: since the idea is to have this feature in different interfaces, how would it work behind the scenes?

Would be ideal if the categorical encodings could get saved in the booster and be used in plots/trees-to-tables/jsons/etc. (#9927). Better yet if it's a standardized C-level attribute so that the encodings could survive transfers from one interface to another.

I see some potential difficulties though:

Pandas allows arbitrary Python objects as categories (not JSON-friendly).
Different data formats (pandas, arrow, polars, etc.) might have different limitations for what kind of values they allow as categories (e.g. strings-only vs. also integers).
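
To make the first difficulty concrete, here is a minimal sketch using only the Python standard library. The `Level` class and the `is_json_friendly` helper are hypothetical names made up for illustration; they are not part of pandas or XGBoost. The sketch only shows that plain strings and integers serialize to JSON as-is, while arbitrary objects do not.

```python
import json


class Level:
    """Hypothetical user-defined object used as a category value."""

    def __init__(self, name):
        self.name = name


def is_json_friendly(categories):
    """Return True if the category levels can be serialized to JSON as-is."""
    try:
        json.dumps(list(categories))
    except TypeError:
        # json.dumps raises TypeError for objects it has no encoder for.
        return False
    return True


print(is_json_friendly(["red", "green", "blue"]))        # plain strings
print(is_json_friendly([1, 2, 3]))                       # plain integers
print(is_json_friendly([Level("red"), Level("green")]))  # arbitrary objects
```

Any scheme that saves the levels in a JSON model file would need to either reject the last case or define a conversion for it.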

trivialfis · 2024-12-11T17:25:58Z

Or are you saying that the data still needs to be kept after the DMatrix has been set

It needs to be kept until the next call to next or reset on the iterator. I got a bit confused since the data is deleted right after the setData call, yet there is no access error. So, in theory, it should be kept after the set and until the next iteration.
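
The lifetime rule can be sketched in plain Python. This is an illustration only, not the R package code or the XGBoost implementation: `Batch`, `Proxy`, and `KeepAliveIterator` are hypothetical stand-ins, and a weak reference is used to model the borrowed C pointer, which does not keep the data alive by itself.

```python
import gc
import weakref


class Batch:
    """Stand-in for one batch of input data (hypothetical)."""

    def __init__(self, values):
        self.values = values


class Proxy:
    """Stand-in for the proxy DMatrix: it only borrows the data."""

    def __init__(self):
        self._borrowed = None

    def set_data(self, batch):
        # A weak reference models the raw pointer: no ownership is taken.
        self._borrowed = weakref.ref(batch)

    def data_alive(self):
        return self._borrowed is not None and self._borrowed() is not None


class KeepAliveIterator:
    """Owns the current batch until the following next() or reset() call."""

    def __init__(self, loaders):
        self._loaders = loaders  # zero-argument callables producing batches
        self._pos = 0
        self._current = None

    def next(self, proxy):
        if self._pos == len(self._loaders):
            self._current = None
            return False
        tmp = self._loaders[self._pos]()
        proxy.set_data(tmp)
        self._current = tmp  # strong reference, replaced on the next call
        self._pos += 1
        return True

    def reset(self):
        self._current = None
        self._pos = 0


proxy = Proxy()
it = KeepAliveIterator([lambda: Batch([1, 2]), lambda: Batch([3, 4])])

it.next(proxy)
gc.collect()
alive_after_next = proxy.data_alive()   # the iterator still owns the batch

it.reset()
gc.collect()
alive_after_reset = proxy.data_alive()  # the batch has been released

print(alive_after_next, alive_after_reset)
```

If `next` dropped `tmp` without storing it in `self._current`, the borrowed reference would dangle as soon as the function returned, which is the use-after-free scenario discussed here.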

Regarding the feature: since the idea is to have this feature in different interfaces, how would it work behind the scenes?

We will store the levels in the booster as you suggested. Things will be handled in C++. We might allow users to optionally disable the encoder for performance reasons (searching through levels is not cheap in the context of inference, especially with strings).
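
As a rough illustration of the re-encoding idea, the sketch below maps inference-time category codes onto the training-time codes using the stored levels. It is pure Python and not the planned C++ implementation: the function names are hypothetical, and mapping unseen levels to -1 is an assumption, since the thread does not say how unseen categories will be treated.

```python
def build_recode_table(train_levels, infer_levels, missing=-1):
    """Map each inference-time code to the training-time code of the same level."""
    train_code = {level: code for code, level in enumerate(train_levels)}
    # One lookup per level, not per row. Unseen levels map to `missing`.
    return [train_code.get(level, missing) for level in infer_levels]


def recode(codes, table, missing=-1):
    """Re-encode a column of inference-time codes with a prebuilt table."""
    return [table[c] if c >= 0 else missing for c in codes]


train_levels = ["cat", "dog", "fish"]   # levels stored with the model
infer_levels = ["fish", "bird", "cat"]  # levels of the incoming data
infer_codes = [0, 2, 2, 1, -1]          # fish, cat, cat, bird, missing

table = build_recode_table(train_levels, infer_levels)
print(table)                       # [2, -1, 0]
print(recode(infer_codes, table))  # [2, 0, 0, -1, -1]
```

Building the table costs one search per level, so the string comparisons happen once per batch rather than once per row. Skipping this step entirely is what disabling the encoder would save.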

Better yet if it's a standardized C-level attribute so that the encodings could survive transfers from one interface to another.

Currently, I'm returning the categories in the arrow columnar format with the help of pyarrow. I haven't decided on the exact return type yet; in my experimental code, it's just a Python map (dictionary) from feature names to arrow arrays.

Pandas allows arbitrary python object for categorical encodings

We accept only strings and some other primitive types like integers. Still working on the typing part. The pandas.Index (used for representing categories) should have the same type as the feature column, so it can't be arbitrary and has to be something XGBoost can understand.

trivialfis mentioned this issue Dec 11, 2024

Strange behaviour with Xgboost when dealing with categorical data type, Possibly a bug #9676

Closed

david-cortes mentioned this issue Dec 11, 2024

[R] Ensure ProxyDMatrix creation keeps data until next iteration #11092

Merged

This was referenced Dec 15, 2024

[R] Move gc data protection to R side #11104

Open

Proposal for new R interface (discussion thread) #9734

Open

david-cortes mentioned this issue Dec 17, 2024

Plots for categorical splits don't show named categories #9927

Open
