Add support for categorical feature auto-encoding #8

Closed


nunosilva800
Contributor

@nunosilva800 nunosilva800 commented Oct 18, 2024

Adds support for automatically label-encoding categorical features when calling Booster#predict.

The Python API does this by relying on pandas DataFrames: https://github.com/microsoft/LightGBM/blob/e057ae08e6bf6c6c84f276a127423fb145ca5fdb/python-package/lightgbm/basic.py#L1131-L1137
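
For readers unfamiliar with the term, label encoding replaces each category value with an integer code based on its position in a fixed category list. A minimal Ruby sketch of that idea follows. `encode_column` and the sample categories are hypothetical names made up for illustration, and mapping unseen values to NaN is an assumption about the desired behavior, not something taken from this PR.

```ruby
# Hypothetical helper: label-encode one column against a fixed category list.
# The position of a value in `categories` becomes its numeric code.
def encode_column(values, categories)
  index = categories.each_with_index.to_h # e.g. {"red" => 0, "green" => 1, ...}
  values.map do |v|
    code = index[v]
    # Assumption: values missing from the category list become NaN.
    code.nil? ? Float::NAN : code.to_f
  end
end

categories = ["red", "green", "blue"] # assumed to come from the trained model
encoded = encode_column(["green", "blue", "purple"], categories)
```

Here "green" and "blue" are replaced by their positions in the list, while "purple", which the list does not contain, ends up as NaN.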

It would be great for this Ruby library to provide a similar quality-of-life feature.

@nunosilva800 nunosilva800 force-pushed the categorical-feature-encoder branch 2 times, most recently from fa23e11 to a78769f Compare October 18, 2024 14:47
@nunosilva800 nunosilva800 marked this pull request as ready for review October 18, 2024 14:48
@nunosilva800 nunosilva800 force-pushed the categorical-feature-encoder branch from a78769f to a6216cb Compare October 18, 2024 16:12
@ankane
Owner

ankane commented Nov 12, 2024

Hi @nunosilva800, thanks for another PR. Happy to include support for this, but it'd be good if the code matched Python as closely as possible (rather than a separate class).

Also, please share the code to generate the model file rather than checking it in directly.

@nunosilva800 nunosilva800 marked this pull request as draft December 11, 2024 15:01
@nunosilva800 nunosilva800 force-pushed the categorical-feature-encoder branch from a6216cb to bf2b9fe Compare December 11, 2024 15:08
@nunosilva800 nunosilva800 force-pushed the categorical-feature-encoder branch from bf2b9fe to fcced8b Compare December 11, 2024 17:02
@nunosilva800 nunosilva800 force-pushed the categorical-feature-encoder branch from fcced8b to 6f5045b Compare December 11, 2024 17:03
@nunosilva800 nunosilva800 marked this pull request as ready for review December 11, 2024 17:36
@nunosilva800
Contributor Author

@ankane I've updated the files in test/support to match the LightGBM v4 Python API and added the categorical model file.

> good if the code matched Python as closely as possible (rather than a separate class).

I'd like to challenge that. The Python package overrides the input data when its type is a pandas DataFrame. The whole logic lives inside _data_from_pandas, but it is stateless (it does not mutate self) and could therefore very well be extracted somewhere else.

Here in Ruby land, there is no pandas DataFrame (apart from maybe mrkn/pandas.rb), so the approach here is to always transform the features whose names are defined as categorical in the model.
My reasoning for extracting this into a separate class is twofold:

  1. It makes the code easier to navigate, and easier to figure out what is happening if this transformation causes unexpected predictions.
  2. If it later turns out to be better to let the caller decide when to apply this transformation, they can do so by calling LightGBM::CategoricalFeatureEncoder directly before invoking LightGBM::Booster#predict.
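
To make the second point concrete, here is a rough sketch of a caller applying such a transformation before prediction. `SketchCategoricalEncoder`, its constructor arguments, and the sample rows are all hypothetical stand-ins for illustration. This is not the class from this PR, and passing unseen values through as NaN is an assumption.

```ruby
# Hypothetical stand-in, not the encoder class from this PR.
class SketchCategoricalEncoder
  # feature_names: ordered feature names of the model
  # categories: { "feature name" => [category values in training order] }
  def initialize(feature_names, categories)
    @feature_names = feature_names
    @lookups = categories.transform_values { |cats| cats.each_with_index.to_h }
  end

  # Returns new numeric rows in feature order; the input rows are not mutated.
  def apply(rows)
    rows.map do |row|
      @feature_names.map do |name|
        value = row.fetch(name)
        lookup = @lookups[name]
        next value if lookup.nil? # non-categorical features pass through
        lookup.fetch(value, Float::NAN) # assumption: unseen category => NaN
      end
    end
  end
end

encoder = SketchCategoricalEncoder.new(
  ["age", "color"],
  { "color" => ["red", "green"] }
)
rows = [{ "age" => 31, "color" => "green" }, { "age" => 45, "color" => "teal" }]
encoded_rows = encoder.apply(rows)
# encoded_rows is what the caller would then hand to Booster#predict.
```

Because `apply` builds new arrays instead of modifying its input, the caller can inspect the encoded rows on their own when a prediction looks wrong.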

What do you think?

ankane added a commit that referenced this pull request Dec 16, 2024
Co-authored-by: Nuno Silva <nunosilva800@gmail.com>
@ankane ankane closed this in 1506387 Dec 16, 2024
@ankane
Owner

ankane commented Dec 16, 2024

Added a version of this in the commit above that matches the Python code a bit better.
