Large categorical sets dump too much output to stderr, can cause crash #113

Closed
CodingDoug opened this issue Jul 2, 2024 · 6 comments

@CodingDoug

I'm using RandomForestLearner to train a 10-class classification model on roughly 15000 examples with 12 features. One of the features is a categorical set containing about 3500 unique strings. Each string is longer than two characters and does not parse as a number.
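
To give a rough idea of the setup, here's a simplified sketch (the column names and values are placeholders, and the real dataset has 12 features and roughly 15000 rows):

import pandas as pd
import ydf

# Placeholder data. The real label has 10 classes, and the real categorical-set
# column ("self_class_words") contains about 3500 unique strings, each longer
# than two characters.
df = pd.DataFrame({
    "label": ["class_0", "class_1", "class_2"] * 4,
    "some_numeric_feature": [1.0, 2.5, 0.3, 4.0] * 3,
    # A categorical set: a Python list of strings per row.
    "self_class_words": [["alpha", "beta"], ["gamma"], ["alpha", "delta"]] * 4,
})

learner = ydf.RandomForestLearner(
    task=ydf.Task.CLASSIFICATION,
    label="label",
)
model = learner.train(df)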

When training the model on the dataset, I get a tremendous amount of output to stderr which is directly related to the one categorical set feature (removing the feature stops this output). It goes like this, repeatedly:

[WARNING 24-07-02 10:01:56.5843 EDT training.cc:4832] The effective split of examples does not match the expected split returned by the splitter algorithm. This problem can be caused by (1) large floating point values (e.g. value>=10e30) or (2) a bug in the software. You can turn this error in a warning with internal_error_on_wrong_splitter_statistics=false.

Details:
Num examples: 208
Effective num positive examples: 17
Expected num positive example: 20
Effective num negative examples: 191
Condition: na_value: false
attribute: 11
condition {
  contains_condition {
    elements: 22
  }
}
num_training_examples_without_weight: 208
num_training_examples_with_weight: 390.9984073638916
split_score: 0.0180455502
num_pos_training_examples_without_weight: 20
num_pos_training_examples_with_weight: 16.226399675011635

Attribute spec: type: CATEGORICAL_SET
name: "self_class_words"
categorical {
  number_of_unique_values: 1190
  items {
    key: "token"
    value {
      index: 235
      count: 48
    }
  }

  // items repeats many times here, each with a different key value.

}
count_nas: 0
dtype: DTYPE_BYTES

In some cases, the training might eventually fail with the following error:

Traceback (most recent call last):
  File ".../test.py", line 257, in <module>
    train()
  File ".../test.py", line 210, in train
    model = learner.train(
            ^^^^^^^^^^^^^^
  File ".../.venv/lib/python3.12/site-packages/ydf/learner/specialized_learners.py", line 2690, in train
    return super().train(ds=ds, valid=valid, verbose=verbose)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../.venv/lib/python3.12/site-packages/ydf/learner/generic_learner.py", line 202, in train
    model = self._train_from_dataset(ds, valid)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../.venv/lib/python3.12/site-packages/ydf/learner/generic_learner.py", line 258, in _train_from_dataset
    cc_model = learner.Train(**train_args)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: INTERNAL: No examples fed to the node trainer

From experimentation, I have found that:

  • There is a relationship between the size of the categorical set and the number of trees configured for the learner. Increasing either one eventually crosses a threshold beyond which the failure happens deterministically.
  • With my full dataset, I can get about 50 trees. I definitely cannot train with the default 300 trees without also drastically reducing the number of samples (which in turn reduces the size of the categorical set).
  • If training succeeds without crashing, I can see that the set is actually used in multiple trees by examining the output of print(model.describe()). The resulting model generally works well.

The sheer volume of the above output, for my dataset with num_trees=50, is over 200 MB! I have to redirect the stdout/stderr of the Python process to a file to keep things manageable. I found that I can eliminate the output by setting the verbosity to 0: model = learner.train(df, verbose=0). But I'd rather not do that, as it also mutes other useful output.
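
For concreteness, here's a minimal sketch of that workaround (the data is made up, not my real dataset):

import pandas as pd
import ydf

# Tiny synthetic dataset with a categorical-set column (a list of strings per row).
df = pd.DataFrame({
    "category": ["a", "b"] * 50,
    "set": [["xx", "yy"], ["zz", "ww"]] * 50,
})

learner = ydf.RandomForestLearner(
    task=ydf.Task.CLASSIFICATION,
    label="category",
)

# verbose=0 silences the splitter warnings, but it also mutes the rest of the
# training log; the alternative is to keep the default verbosity and redirect
# the process's stdout/stderr to a file at the shell level.
model = learner.train(df, verbose=0)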

It seems to me there is a bug somewhere in the handling of categorical sets.

@rstz
Collaborator

rstz commented Jul 2, 2024

Hi, thank you for the detailed report. I agree that this looks weird. If you are able to share the dataset with us, that would help with debugging, but if that's not possible, let us know and we can try to reproduce it from the description. Independently, it sounds like we should consider muting this message when it appears too often; we don't yet have a mechanism for that in our C++ logging, but maybe we can add one.

@CodingDoug
Author

@rstz This dataset is culled from an entirely custom set of processes, stored in a SQLite database (50 MB) on my machine and trimmed down to the core features by other custom code before being handed to YDF. As such, it's not easy to share (and it also contains "secret sauce" for a product I'm building). The strings in the categorical set could probably be simulated by generating random strings. I can try to build a more isolated repro that doesn't require all of my data and code.

@CodingDoug
Author

CodingDoug commented Jul 2, 2024

@rstz Here's a simple repro. With NUM_ROWS=100, the warning messages appear only rarely. Set it to 200 to get consistent warnings. Set it to 20000 to get a crash. Reduce num_trees to make the warnings/crashes less frequent.

import ydf
import pandas as pd
import random
import string

NUM_ROWS = 20000  # 100: warnings are rare; 200: consistent warnings; 20000: crash
SET_SIZE = 5      # number of random two-letter tokens in each categorical set

CATEGORIES = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
rows = []

for i in range(NUM_ROWS):
    # Each row gets a categorical set of SET_SIZE random two-letter strings.
    set = [''.join(random.choices(string.ascii_lowercase, k=2)) for _ in range(SET_SIZE)]
    rows.append(dict(
        category = CATEGORIES[i % len(CATEGORIES)],
        set = set,
    ))

df = pd.DataFrame.from_records(rows)

learner = ydf.RandomForestLearner(
    task=ydf.Task.CLASSIFICATION,
    # num_trees=50,  # fewer trees make the warnings/crash less frequent
    label="category",
    features=[
        "set",
    ],
)

model = learner.train(
    df,
    # verbose=0,  # silences the warnings, but also the rest of the training log
)

@rstz
Collaborator

rstz commented Jul 2, 2024

Great, thank you, I'll have a look.

Minor update: I'm starting to think that this is, in fact, a bug in the way the Python API handles categorical sets....

@rstz
Collaborator

rstz commented Jul 19, 2024

Just confirming that this is a bug; it will be fixed in the next release (commit coming soon).

copybara-service bot pushed a commit that referenced this issue Aug 6, 2024
See #113.

PiperOrigin-RevId: 659879007
copybara-service bot pushed a commit that referenced this issue Aug 9, 2024
@rstz
Collaborator

rstz commented Aug 20, 2024

Closing this - both the repeated logging and the underlying issue have been addressed at head, and the fixes will be included in the next release.

@rstz rstz closed this as completed Aug 20, 2024