Large categorical sets dump too much output to stderr, can cause crash #113
Hi, thank you for the detailed report. I agree that this looks weird. If you are able to share the dataset with us, it might help with debugging, but feel free to let us know if that is not possible and we can try to reproduce this based on the description. Independently, it sounds like we should consider muting this message when it appears too often. We don't have a mechanism for that in the C++ logging yet, but maybe we can add one.
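For illustration only, the "mute a message that repeats too often" idea could look roughly like the sketch below. The real mechanism would live in YDF's C++ logging layer; nothing here reflects actual YDF code, and the function name and limit are made up.

```python
# Illustrative sketch of rate-limited logging, not real YDF code.
from collections import Counter

_message_counts: Counter = Counter()

def log_limited(msg: str, limit: int = 10) -> None:
    """Print msg at most `limit` times, then announce suppression once."""
    _message_counts[msg] += 1
    if _message_counts[msg] <= limit:
        print(msg)
    elif _message_counts[msg] == limit + 1:
        print(f"(Further repetitions of this message are suppressed.) {msg}")
```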
@rstz This dataset is culled from an entirely customized set of processes, stored in a SQLite database (50 MB) on my machine, and trimmed down to core features by other custom code before being handed to YDF. As such, it's not easy to share (and it also contains "secret sauce" for a product I'm building). The strings in the categorical set could probably be simulated by generating random strings. I can try to build a more isolated repro that doesn't require all my data and code.
@rstz Here's a simple repro. With NUM_ROWS=100, the warning messages appear only very rarely. Set it to 200 to get consistent warnings. Set it to 20000 to get a crash. Reduce num_trees to see the warnings/crashes happen less frequently.
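The repro script itself was attached to the issue and is not reproduced here. The following is a minimal sketch of what it plausibly looked like, based on the description in this thread: the names NUM_ROWS and num_trees come from the comment above, while the column names ("tags", "label"), string lengths, set sizes, and the assumption that ydf auto-detects a list-valued pandas column as a categorical set are all guesses.

```python
# Hypothetical reconstruction of the attached repro (not the original script).
import random
import string

import pandas as pd
import ydf

NUM_ROWS = 200     # 100: warnings are rare; 200: consistent warnings; 20000: crash.
VOCAB_SIZE = 3500  # Roughly matches the ~3500 unique strings in the real dataset.

# Random multi-character strings that do not parse as numbers.
vocab = ["".join(random.choices(string.ascii_lowercase, k=8))
         for _ in range(VOCAB_SIZE)]

df = pd.DataFrame({
    # List-valued column, intended to be treated as a categorical set.
    "tags": [random.sample(vocab, random.randint(1, 20)) for _ in range(NUM_ROWS)],
    # 10-class label, as in the original report.
    "label": [str(random.randint(0, 9)) for _ in range(NUM_ROWS)],
})

model = ydf.RandomForestLearner(label="label", num_trees=50).train(df)
```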
Great, thank you, I'll have a look. Minor update: I'm starting to think that this is, in fact, a bug in the way the Python API handles categorical sets...
Just confirming that this is a bug and will be fixed for the next release (commit coming soon) |
See #113. PiperOrigin-RevId: 659879007
See also #113 PiperOrigin-RevId: 661228947
Closing this - both the repeated logs and the underlying issue have been addressed at head and will be included in the next release |
I'm using RandomForestLearner to train a 10-class categorization model on roughly 15000 examples and 12 features. One of the features is a categorical set with about 3500 unique strings. The strings are each longer than two characters and do not parse as numbers.
When training the model on the dataset, I get a tremendous amount of output to stderr which is directly related to the one categorical set feature (removing the feature stops this output). It goes like this, repeatedly:
In some cases, the training might eventually fail with the following error:
From experimentation, I have found that:
- Despite the warnings, training completes and I can inspect the model with print(model.describe()). The resulting model generally works well.
- The sheer volume of the above output, for my dataset with num_trees=50, is over 200 MB! I have to redirect the stdout/stderr of the Python process to a file in order to keep things reasonable.
- I can eliminate that output by setting the verbosity to 0: model = learner.train(df, verbose=0). But I'd rather not do that, as it mutes other useful output.

Seems to me there is a bug somewhere in the handling of categorical sets.
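For reference, the verbose=0 workaround mentioned above in context, as a sketch: the learner setup is an assumption, and df stands for the training DataFrame described earlier.

```python
import ydf

# Hypothetical setup mirroring the report (10-class label, num_trees=50).
learner = ydf.RandomForestLearner(label="label", num_trees=50)

# verbose=0 silences the flood of warnings, but also mutes the other
# training logs the reporter would like to keep.
model = learner.train(df, verbose=0)  # df: the training DataFrame from above
```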