Large categorical sets dump too much output to stderr, can cause crash #113

Closed
CodingDoug opened this issue Jul 2, 2024 · 6 comments

@CodingDoug

I'm using RandomForestLearner to train a 10-class classification model on roughly 15000 examples with 12 features. One of the features is a categorical set containing about 3500 unique strings. Each string is longer than two characters and does not parse as a number.
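
To give a rough idea of the setup, here's a simplified sketch (the column names and values are placeholders, and the real dataset has 12 features and roughly 15000 rows):

import pandas as pd
import ydf

# Placeholder data. The real label has 10 classes, and the real categorical-set
# column ("self_class_words") contains about 3500 unique strings, each longer
# than two characters.
df = pd.DataFrame({
    "label": ["class_0", "class_1", "class_2"] * 4,
    "some_numeric_feature": [1.0, 2.5, 0.3, 4.0] * 3,
    # A categorical set: a Python list of strings per row.
    "self_class_words": [["alpha", "beta"], ["gamma"], ["alpha", "delta"]] * 4,
})

learner = ydf.RandomForestLearner(
    task=ydf.Task.CLASSIFICATION,
    label="label",
)
model = learner.train(df)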

When training the model on the dataset, I get a tremendous amount of output to stderr which is directly related to the one categorical set feature (removing the feature stops this output). It goes like this, repeatedly:

[WARNING 24-07-02 10:01:56.5843 EDT training.cc:4832] The effective split of examples does not match the expected split returned by the splitter algorithm. This problem can be caused by (1) large floating point values (e.g. value>=10e30) or (2) a bug in the software. You can turn this error in a warning with internal_error_on_wrong_splitter_statistics=false.

Details:
Num examples: 208
Effective num positive examples: 17
Expected num positive example: 20
Effective num negative examples: 191
Condition: na_value: false
attribute: 11
condition {
  contains_condition {
    elements: 22
  }
}
num_training_examples_without_weight: 208
num_training_examples_with_weight: 390.9984073638916
split_score: 0.0180455502
num_pos_training_examples_without_weight: 20
num_pos_training_examples_with_weight: 16.226399675011635

Attribute spec: type: CATEGORICAL_SET
name: "self_class_words"
categorical {
  number_of_unique_values: 1190
  items {
    key: "token"
    value {
      index: 235
      count: 48
    }
  }

  // items repeats many times here, each with a different key value.

}
count_nas: 0
dtype: DTYPE_BYTES

In some cases, the training might eventually fail with the following error:

Traceback (most recent call last):
  File ".../test.py", line 257, in <module>
    train()
  File ".../test.py", line 210, in train
    model = learner.train(
            ^^^^^^^^^^^^^^
  File ".../.venv/lib/python3.12/site-packages/ydf/learner/specialized_learners.py", line 2690, in train
    return super().train(ds=ds, valid=valid, verbose=verbose)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../.venv/lib/python3.12/site-packages/ydf/learner/generic_learner.py", line 202, in train
    model = self._train_from_dataset(ds, valid)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../.venv/lib/python3.12/site-packages/ydf/learner/generic_learner.py", line 258, in _train_from_dataset
    cc_model = learner.Train(**train_args)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: INTERNAL: No examples fed to the node trainer

From experimentation, I have found that:

  • There is a relationship between the size of the categorical set and the number of trees configured for the learner. Increasing either one eventually crosses a threshold beyond which the failure happens deterministically.
  • With my full dataset, I can get about 50 trees. I definitely cannot train with the default 300 trees without also drastically reducing the number of samples (which in turn reduces the size of the categorical set).
  • If training succeeds without crashing, I can see that the set is actually used in multiple trees by examining the output of print(model.describe()). The resulting model generally works well.

The sheer volume of the above output, for my dataset with num_trees=50, is over 200 MB! I have to redirect the stdout/stderr of the Python process to a file to keep things manageable. I found that I can eliminate the output by setting the verbosity to 0: model = learner.train(df, verbose=0). But I'd rather not do that, as it also mutes other useful output.
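
For concreteness, here's a minimal sketch of that workaround (the data is made up, not my real dataset):

import pandas as pd
import ydf

# Tiny synthetic dataset with a categorical-set column (a list of strings per row).
df = pd.DataFrame({
    "category": ["a", "b"] * 50,
    "set": [["xx", "yy"], ["zz", "ww"]] * 50,
})

learner = ydf.RandomForestLearner(
    task=ydf.Task.CLASSIFICATION,
    label="category",
)

# verbose=0 silences the splitter warnings, but it also mutes the rest of the
# training log; the alternative is to keep the default verbosity and redirect
# the process's stdout/stderr to a file at the shell level.
model = learner.train(df, verbose=0)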

It seems to me there is a bug somewhere in the handling of categorical sets.

@rstz
Collaborator

rstz commented Jul 2, 2024

Hi, thank you for the detailed report. I agree that this looks weird. If you are able to share the dataset with us, that would help with debugging, but if that's not possible, let us know and we can try to reproduce it from the description. Independently, it sounds like we should consider muting this message when it appears too often; we don't yet have a mechanism for that in our C++ logging, but maybe we can add one.

@CodingDoug
Author

@rstz This dataset is culled from an entirely custom set of processes, stored in a SQLite database (50 MB) on my machine and trimmed down to the core features by other custom code before being handed to YDF. As such, it's not easy to share (and it also contains "secret sauce" for a product I'm building). The strings in the categorical set could probably be simulated by generating random strings. I can try to build a more isolated repro that doesn't require all of my data and code.

@CodingDoug
Author

CodingDoug commented Jul 2, 2024

@rstz Here's a simple repro. With NUM_ROWS=100, the warning messages appear only rarely. Set it to 200 to get consistent warnings. Set it to 20000 to get a crash. Reduce num_trees to make the warnings/crashes less frequent.

import ydf
import pandas as pd
import random
import string

NUM_ROWS = 20000  # 100: warnings are rare; 200: consistent warnings; 20000: crash
SET_SIZE = 5      # number of random two-letter tokens in each categorical set

CATEGORIES = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
rows = []

for i in range(NUM_ROWS):
    # Each row gets a categorical set of SET_SIZE random two-letter strings.
    set = [''.join(random.choices(string.ascii_lowercase, k=2)) for _ in range(SET_SIZE)]
    rows.append(dict(
        category = CATEGORIES[i % len(CATEGORIES)],
        set = set,
    ))

df = pd.DataFrame.from_records(rows)

learner = ydf.RandomForestLearner(
    task=ydf.Task.CLASSIFICATION,
    # num_trees=50,  # fewer trees make the warnings/crash less frequent
    label="category",
    features=[
        "set",
    ],
)

model = learner.train(
    df,
    # verbose=0,  # silences the warnings, but also the rest of the training log
)

@rstz
Collaborator

rstz commented Jul 2, 2024

Great, thank you, I'll have a look.

Minor update: I'm starting to think that this is, in fact, a bug in the way the Python API handles categorical sets....

@rstz
Collaborator

rstz commented Jul 19, 2024

Just confirming that this is a bug; it will be fixed in the next release (commit coming soon).

copybara-service bot pushed a commit that referenced this issue Aug 6, 2024
See #113.

PiperOrigin-RevId: 659879007
copybara-service bot pushed a commit that referenced this issue Aug 9, 2024
@rstz
Collaborator

rstz commented Aug 20, 2024

Closing this - both the repeated logging and the underlying issue have been addressed at head, and the fixes will be included in the next release.

@rstz rstz closed this as completed Aug 20, 2024