INVALID_ARGUMENT: "Too much categorical conditions" - how many is too many? #118
Comments
This doesn't look right; I don't see how this would be triggered for this data spec. Can you please provide a bit more information about your dataset and hyperparameters?
This is just an ordinary Kaggle tabular dataset. Perhaps it's going OOM on predict? RAM is nearly full, but swap still has > 60 GB.
With max_depth lowered to 25, I was able to get a prediction. Still curious why having 100 features in the tree is negatively impactful?
I believe the issue is that the trees are very, very deep (especially for GBTs). For inference, ydf transforms the model to use a buffer that contains all the categorical splits (i.e., splits of the type "featureA in [featureAvalue1, featureAvalue4, ...]"). This buffer can have at most std::numeric_limits<uint32_t>::max() entries. Each split occupies 100 entries (1 per feature), so you're limited to ~43 million categorical splits in this case. This sounds like a lot, but if your trees have max_depth 100, you will have a lot of splits.

In C++, YDF has support for an inference engine that does not have this limitation (probably; I haven't tried it). However, this engine is much slower than what we expose in PYDF. It sounds like exposing the slow engine might be a useful solution for some less common models such as the one you built - I'll try to prioritize this.
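A back-of-the-envelope check of the limit described above (a minimal sketch; the constants come from the comment, not from reading ydf's source):

```python
# The inference buffer holds at most uint32-max entries, and each categorical
# split occupies one entry per feature (100 features in this data spec).
UINT32_MAX = 2**32 - 1      # std::numeric_limits<uint32_t>::max()
ENTRIES_PER_SPLIT = 100     # one entry per feature

max_splits = UINT32_MAX // ENTRIES_PER_SPLIT
print(max_splits)           # 42949672, i.e. ~43 million categorical splits

# Why max_depth=100 can exceed this: a full binary tree of depth d has
# 2**d - 1 internal nodes (potential splits), which passes 43 million
# long before d reaches 100.
def full_tree_splits(depth: int) -> int:
    return 2**depth - 1

assert full_tree_splits(25) < max_splits  # depth 25 fits comfortably
assert full_tree_splits(26) > max_splits  # 2**26 - 1 already exceeds the limit
```

Real trees are rarely full, so the practical bound depends on the actual number of categorical splits across all trees, but the exponential growth explains why lowering max_depth to 25 made the error go away.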
@rstz Based on your comment, I am curious how many people would also be looking to grow trees this deep. I have a different use case with a vocabulary of up to 1M tokens per feature set. I suppose the BQML frontend allows only 50k. Even at 50k features, what sort of depth and width would you expect to be required? The data shape would be roughly 20-30M rows, 50k columns, and a binary outcome. It might be suitable to just give a warning when the limit is about to be exceeded.

On the note of feature requests, I am curious whether YDF can support multilabel outcomes (not mutually exclusive outcomes, shared tree space)?
As always in ML, the answer will depend on your data, but I'll give some more-or-less educated guesses. The theory of boosting suggests using mostly small trees (see e.g. Intro to Statistical Learning, Chapter 8.2) to avoid overfitting with individual trees. In practice, we've seen YDF's default of 6, or values in the range 2-10, perform well. Small trees have the added advantage that the model is much smaller and inference can be much faster. I'd be interested in how model quality changes with max_depth in your use case. As an aside, note that we have seen GBTs perform better when ignoring the max_depth parameter altogether; instead, set the growing strategy accordingly.

Having 50k or more features often happens when using one-hot encoding on categorical features. One-hot encoding is not recommended when using decision forests. Instead, categorical features should be fed directly. This allows the tree to perform splits on multiple categories at once (e.g. "featureA in [featureAvalue1, featureAvalue4]").

Re: multi-label outcomes - can you please open a separate issue for this? I think Yggdrasil might have a solution for this, but it's probably not yet exposed in the Python API.

[1] Categorical Sets are unfortunately broken in the Python API until our pending fix to #113 has landed. The fix and a tutorial will be included in the next release.
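To illustrate why feeding categorical features directly beats one-hot encoding, here is a minimal sketch with made-up data (the `color` feature and split values are hypothetical, not from the reporter's dataset):

```python
# A native categorical split tests set membership in one condition,
# e.g. "color in {red, orange}". With one-hot encoding, the same partition
# needs a chain of single-column splits, one per category in the set.
rows = ["red", "blue", "orange", "green", "red"]

# Native categorical split: one condition partitions the rows.
left = [c in {"red", "orange"} for c in rows]

# One-hot encoding: each category becomes its own 0/1 column, and the tree
# must split on color_is_red and color_is_orange separately (two conditions,
# i.e. a deeper tree for the same decision).
categories = {"red", "blue", "orange", "green"}
one_hot = {cat: [int(c == cat) for c in rows] for cat in categories}
left_via_one_hot = [
    bool(one_hot["red"][i] or one_hot["orange"][i]) for i in range(len(rows))
]

assert left == left_via_one_hot  # same partition: 1 split vs. 2
```

With large vocabularies the gap compounds: a set of k categories costs one native split but up to k one-hot splits, which is part of why deep trees become necessary when one-hot encoding is used.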
Yep, this sounds about right. There is definitely something to be said about the stats, but I think this library is one of the first to elegantly enable such wide (particularly highly sparse) datasets. In that sense, I am very curious going forward about how tens to hundreds of thousands of features affect desirable tree depths. Ohhh sure, the growing strategy definitely makes the MOST sense. Yes, I have kept features relatively sparse by avoiding OHE. It's unfortunate that the Python API categorical sets aren't working -- this is an incredibly important feature of ydf!
Categorical Sets are now working in 0.7.0 - we plan to publish guides in the near future. |
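For reference, a hedged sketch of what a categorical-set feature can look like in the Python API as of 0.7.0. The column/label names and data are made up, and the exact input formats are an assumption; consult the upcoming guides for the supported usage:

```python
# In ydf, a column whose values are lists of strings (e.g. tokenized text)
# can be consumed as a CATEGORICAL_SET feature, letting a single split test
# set membership over tokens. Guarded so the sketch is inert without ydf.
import importlib.util

train = {
    # Hypothetical tokenized feature: variable-length lists of strings.
    "tokens": [["cheap", "red"], ["expensive"], ["cheap", "blue"], ["expensive", "red"]],
    "label": ["yes", "no", "yes", "no"],
}

if importlib.util.find_spec("ydf") is not None:
    import ydf  # pip install ydf>=0.7.0
    learner = ydf.RandomForestLearner(label="label")
    # model = learner.train(train)  # trains on the list-valued column as-is
```

The training call is left commented out because this toy dataset is far too small to be meaningful; the point is only the shape of the input data.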
When I try to evaluate my model or make predictions on the val set, I get the following error:
Here is the DataSpec from .describe():
There are probably a lot of categories, but I would have thought that with categorical sets it would be fine?