Prediction is extremely slow for categorical decision stumps #109
Hi, apologies for the late reply. I will try to reproduce this - it very much sounds like a bug somewhere. If you can provide any more details on the hyperparameters you're using, I'd be happy to know them.
hyperparameters:
One tree looks like this:
Hi, I cannot reproduce this yet in Colab, so let's try to dig deeper here. I'm training and predicting on 10K rows with 5000 features, no validation data / no early stopping, with 100 trees.
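For reference, a minimal sketch of how such a setup could look (the learner choice, hyperparameter values and the synthetic data are my assumptions, not the exact script used here):

```python
import numpy as np
import pandas as pd
import ydf  # assumption: a recent version of the ydf Python API

# Synthetic data at the scale described above: 10K rows and 5000 categorical
# features with 256 possible values each (assumed shape, not the real data).
# Scale n_features down for a quick local check.
n_rows, n_features = 10_000, 5_000
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {f"f{i}": rng.integers(0, 256, n_rows).astype(str) for i in range(n_features)}
)
df["label"] = rng.integers(0, 2, n_rows).astype(str)

# max_depth=2 (a single split per tree) is an assumed way to get stumps;
# validation_ratio=0.0 disables the validation split / early stopping.
learner = ydf.GradientBoostedTreesLearner(
    label="label",
    num_trees=100,
    max_depth=2,
    validation_ratio=0.0,
)
model = learner.train(df)
predictions = model.predict(df)
```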
Training the model takes ~half a minute (note the reported time is wrong, because it excludes dataset loading)
and the forest consists of 100 stumps. Prediction time is
The tree seems to look like yours (e.g.
A significant amount of time is spent getting the dataset into YDF's format. Note that we can strip out this conversion by creating a VerticalDataset (YDF's internal format) separately:
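Something along these lines (a sketch; passing the model's data_spec is my assumption about how to keep the column semantics consistent with the trained model):

```python
# Convert the pandas DataFrame into YDF's internal VerticalDataset once,
# reusing the trained model's dataspec so the columns keep the same semantics.
ds = ydf.create_vertical_dataset(df, data_spec=model.data_spec())
```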
Prediction on the VerticalDataset takes less time as expected:
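For example, timing it could look like this (a sketch):

```python
import time

start = time.perf_counter()
predictions = model.predict(ds)  # predict directly on the VerticalDataset
print(f"predict() on the VerticalDataset took {time.perf_counter() - start:.3f}s")
```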
Do you have an idea where my experiment diverges from your setup?
In the example you give, we can see that the features are treated as categorical features instead of numerical features. For instance, in your case, if a column were treated numerically, each numerical split would have 255 possible conditions (e.g. value >= threshold). Make sure to use ydf.Feature(feat, ydf.Semantic.NUMERICAL) instead of ydf.Feature(feat, ydf.Semantic.CATEGORICAL). Bonus: When training a model with
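For example (a sketch; the feature names and learner are placeholders, and the feature columns are assumed to hold integer values):

```python
feature_names = [f"f{i}" for i in range(5000)]  # placeholder names

# Force the columns to be read as numerical rather than categorical.
learner = ydf.GradientBoostedTreesLearner(
    label="label",
    num_trees=100,
    features=[ydf.Feature(f, ydf.Semantic.NUMERICAL) for f in feature_names],
)
model = learner.train(df)
```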
Unfortunately I do need the features to be categorical. I can understand that the training takes a long time, and I have no problem with that. The problem (probably a bug) is that model.predict also takes a very long time when it should be quite trivial to evaluate this fast! (It is the same kind of tree that cv2's cascade classifier uses, and those are evaluated in a fraction of a microsecond.)
@rstz Thank you so much for your efforts! I think your results do match up with mine, but I would have expected the prediction to be much, much faster. After all, it is the same model OpenCV uses as their cascade classifier in their object detection pyramid. One 200x200 image is evaluated in a few milliseconds, which is about the same order of magnitude as 10k samples. That is about 10000 times faster than 20s! So that is why I considered this time "very long". I didn't know about ydf.create_vertical_dataset, so I will use it in the future, since I don't really process the dataframe during or after training. But even then it is quite slow.
I have categorical features with 256 possible values each. I train a small forest with ~10 non-oblique decision stumps. Each tree thus has only one decision node based on exactly one feature.
The model trains somewhat slowly, but what surprises me is how slow the prediction method runs. I don't know the exact implementation, but it seems that each tree node is represented as a list, which is extremely slow. Just a few samples take multiple seconds to predict, which is much slower than I would expect even with a naive implementation. If I get the feature values of the samples by hand from the dataframe and then check whether they are contained in the tree node's positive set, it is faster than the predict method (see the sketch below).
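For reference, the hand-rolled check looks roughly like this (a sketch with hypothetical names; `stumps` is assumed to be a list of (feature_name, positive_value_set) pairs extracted from the trained trees):

```python
import numpy as np
import pandas as pd

def manual_stump_predict(df: pd.DataFrame, stumps) -> np.ndarray:
    """Evaluate a forest of categorical decision stumps by hand.

    Each stump votes 1 when the sample's category value is contained in that
    stump's positive set, and the votes are averaged over the stumps.
    """
    scores = np.zeros(len(df), dtype=float)
    for feature, positive_set in stumps:
        scores += df[feature].isin(list(positive_set)).to_numpy(dtype=float)
    return scores / max(len(stumps), 1)
```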