Lift restrictions on feature names ("LightGBMError: Do not support special JSON characters in feature name") #6202
Comments
Thanks for using LightGBM. Please provide a reproducible example showing exactly how you hit this error and describing what you expected to happen.
Your submission here suggesting that non-ASCII characters or feature names with For example, consider the following code:

import lightgbm as lgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1_000, n_features=2)
feature_names = [
    "name_with_underscores",
    # what Google translate provides in Chinese for "feature"
    "特征"
]
dtrain = lgb.Dataset(X, label=y, feature_name=feature_names)
model = lgb.train(
    train_set=dtrain,
    params={"objective": "regression"},
    num_boost_round=5
)
model.feature_name()
# ['name_with_underscores', '特征']

Using If you're unfamiliar with how to create reproducible examples when asking for software help, this guide is useful: https://stackoverflow.com/help/minimal-reproducible-example.
You could try modifying the example I've given here, and providing the following (which were asked for in the issue template when you clicked
Sorry, I wasn't sure what the trigger was in my case; it turns out it was a comma. I have updated my main post with a reproducible example. Stack Overflow is full of suggestions to remove anything non-ASCII, though.
Ah ok, I see you've edited this since it was initially posted to include some more examples.
Yes, characters like that are not allowed in feature names. You can search the repo for that error message and find the corresponding code here: LightGBM/include/LightGBM/dataset.h Lines 889 to 892 in 18dbd65
which calls this: LightGBM/include/LightGBM/utils/common.h Lines 886 to 902 in 18dbd65
You can see that it is specifically a very small subset of characters that are forbidden in feature names.
LightGBM supports reading training data from TSV (tab-separated), CSV (comma-separated), and LibSVM formats. It also writes out model data (including feature names) to JSON and to a LightGBM-specific text format. Characters that are used in encoding/decoding such data, like To prevent having to worry about such problems in LightGBM, the library prohibits those characters. We feel that's a small inconvenience in exchange for the reduction in maintenance burden and other sources of user pain (like anything parsing LightGBM model files needing to also account for such escaping). When you say "lift restrictions", which of these behaviors would you prefer LightGBM took on?
I'd welcome a PR to improve this error message ("special JSON characters" is not very informative), but before we commit to any other change I'd like to hear your thoughts on how you'd prefer LightGBM handle this situation.
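To make the parsing concern above concrete, here is a small sketch using only the Python standard library. It does not call LightGBM and is not how LightGBM reads or writes files; the header and feature names are hypothetical. It only shows why a delimiter character inside a name forces every reader and writer to agree on quoting or escaping rules.

```python
import csv
import io

# Hypothetical header with two features; the first name contains a comma.
feature_names = ["distance, km", "weight"]

# Naive writer: join with commas, no quoting or escaping.
naive_header = ",".join(feature_names)

# Naive reader: split on commas. The comma inside the name cannot be told
# apart from a delimiter, so two names come back as three fields.
naive_parsed = naive_header.split(",")
print(naive_parsed)

# A quoting-aware writer and reader round-trip the names, but every
# producer and consumer of the file must then implement the same rules.
buffer = io.StringIO()
csv.writer(buffer).writerow(feature_names)
quoted_parsed = next(csv.reader(io.StringIO(buffer.getvalue())))
print(quoted_parsed)
```

The second half is the kind of escaping logic that, per the comment above, the library avoids having to maintain by prohibiting these characters.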
I'm sorry that you found that answer that implied that non-ASCII feature names were an issue. Non-ASCII feature names have been supported in LightGBM since April 2020. For example, here's a post from another LightGBM maintainer back in 2021 about a similar question: #2478 (comment)
Thank you so much for such a fast and informative answer! It's actually not a "small inconvenience" at all when you try to add LightGBM to existing models (which all work with established feature names without any questions) and it breaks :-) IMHO the best way to deal with such characters would be to escape them inside LightGBM (transparently to the user). I guess other libraries do that, since no one else restricts feature names. The ideal scenario would be compatibility with other ML libraries, i.e., no restrictions and no renaming.
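Until something like that exists inside the library, the escaping can be approximated on the user side. The sketch below is only an illustration and makes several assumptions: the forbidden set is taken to be the seven characters `"`, `,`, `:`, `[`, `]`, `{`, `}` (my reading of the check in common.h referenced above; verify it against your LightGBM version), the helper names `escape_name` and `unescape_name` are hypothetical, and the code does not call LightGBM, so whether other characters (for example spaces) survive a round trip through a trained model is not covered.

```python
from urllib.parse import unquote

# Assumed forbidden characters, plus "%" itself so the encoding is reversible.
_CHARS_TO_ESCAPE = frozenset('",:[]{}%')


def escape_name(name):
    """Percent-encode only the assumed forbidden characters (hypothetical helper)."""
    return "".join(
        f"%{ord(ch):02X}" if ch in _CHARS_TO_ESCAPE else ch for ch in name
    )


def unescape_name(name):
    """Reverse escape_name using the standard library's percent-decoding."""
    return unquote(name)


original = ["distance, km", "[bioteam].[physio].prevweak_velocity_mean", "特征"]
escaped = [escape_name(n) for n in original]
restored = [unescape_name(n) for n in escaped]

print(escaped)
print(restored == original)
```

The escaped names would be what gets passed as `feature_name=`, with `unescape_name` applied to whatever `model.feature_name()` returns for display. The obvious cost is that the escaped names are what appear in saved model files.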
Closed in favor of being in #2302. We decided to keep all feature requests in one place. Welcome to contribute this feature! Please re-open this issue (or post a comment if you are not a topic starter) if you are actively working on implementing this feature.
Summary
Currently, it can be hard to plug LightGBM into an existing ML system because of how strict it is about feature names.
Underscores, commas, dots, square brackets, or even non-English characters trigger "LightGBMError: Do not support special JSON characters in feature name."
Often it's hard to even tell which name LightGBM is complaining about; you have to scroll through thousands of features to figure out which one is named "wrongly".
I know for sure that feature names should have no influence on the model training process. It would be great if this limitation could be lifted.
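As a stopgap for locating the offending names before training, a check like the following can be run over the feature list. It is a sketch under one loud assumption: that the rejected characters are exactly `"`, `,`, `:`, `[`, `]`, `{`, `}`, which is my reading of the check in common.h referenced in the comments and should be verified against the installed LightGBM version. The function name `find_rejected_names` is hypothetical and not part of LightGBM.

```python
# Assumed forbidden set; verify against your LightGBM version.
FORBIDDEN_CHARS = frozenset('",:[]{}')


def find_rejected_names(feature_names):
    """Map each offending name to the sorted forbidden characters it contains."""
    rejected = {}
    for name in feature_names:
        bad = sorted(set(name) & FORBIDDEN_CHARS)
        if bad:
            rejected[name] = bad
    return rejected


names = [
    "name_with_underscores",
    "特征",
    "distance, km",
    "[bioteam].[physio].prevweak_velocity_mean",
]
report = find_rejected_names(names)
for name, chars in report.items():
    print(f"{name!r} contains {chars}")
```

Under that assumption, only the names containing a comma or square brackets are reported; underscores, dots, and non-ASCII characters pass.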
Motivation
This limitation is very cumbersome. I am not aware of any other machine learning library that imposes such restrictions.
Often features come in groups, and it's convenient to use underscores and dots/brackets for separation, for example "[bioteam].[physio].prevweak_velocity_mean". Without the ability to group, in practice, feature names quickly become lengthy and totally unreadable.
Similarly, commas are often used to separate a name from its units: "distance, km".
Or the dataset comes in some national language, be it Chinese, French, or Russian, and stakeholders would love to see features in their native language. We have UTF, so let's use it and work on allowing arbitrary feature names. Let's not limit the creativity of data scientists!
Description
I don't know the technical reasons for this, but I can't find any logical reason to have this limitation.
Environment:
Python==3.8
lightgbm==4.1.0
OS==Windows
Locale=Russian