TextVectorization: output_mode={multi_hot, count} promise int arrays but output floats #18973

nicdumz · 2023-12-20T11:35:19Z

Documentation for output_mode currently reads:

"multi_hot": Outputs a single int array per batch, of either vocab_size or max_tokens size, containing 1s in all elements where the token mapped to that index exists at least once in the batch item.
"count": Like "multi_hot", but the int array contains a count of the number of times the token at that index appeared in the batch item.

But this isn't actually the case. A little test to show this:

 v = keras.layers.TextVectorization(output_mode="count")
 v.adapt(["foo", "bar", "baz"])
 self.assertEqual(v(["foo lol"]).dtype, tf.int64)  # AssertionError: tf.float32 != tf.int64

The source in fact currently outputs ints for output_mode="int", but floats for every other mode. This seems to have been introduced as part of ef72bfb.
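For reference, here is a minimal plain-Python sketch of what the quoted documentation describes for "count" and "multi_hot", producing integer outputs. It is not the Keras implementation: the helper names, the toy vocabulary order, and the choice to send out-of-vocabulary tokens to index 0 are all assumptions made for illustration.

```python
# Plain-Python sketch of the documented "count" / "multi_hot" semantics.
# count_vector, multi_hot_vector and the vocabulary below are hypothetical
# illustrations, not Keras APIs.

def count_vector(tokens, vocab):
    """Return a list of ints: entry i is how often vocab[i] occurs in tokens."""
    index = {tok: i for i, tok in enumerate(vocab)}
    counts = [0] * len(vocab)
    for tok in tokens:
        # Assumption of this sketch: unknown tokens are counted in slot 0.
        counts[index.get(tok, 0)] += 1
    return counts


def multi_hot_vector(tokens, vocab):
    """Return a list of ints: 1 where the token occurs at least once, else 0."""
    return [1 if c > 0 else 0 for c in count_vector(tokens, vocab)]


vocab = ["[UNK]", "foo", "bar", "baz"]  # toy ordering, assumed
tokens = "foo lol foo".split()

counts = count_vector(tokens, vocab)
multi_hot = multi_hot_vector(tokens, vocab)
print(counts)     # [1, 2, 0, 0]
print(multi_hot)  # [1, 1, 0, 0]
```

Both results are integer-valued by construction, which is the behaviour the documentation wording suggests and the dtype the test above expects.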


nicdumz · 2023-12-20T12:07:16Z

(IndexLookup has exactly the same code, fwiw)

divyashreepathihalli · 2023-12-21T04:30:46Z

@nicdumz I tried with all backends and it seems to be returning int.

Can you please double-check?

nicdumz · 2023-12-21T20:45:40Z

@divyashreepathihalli :

With a test program containing:

import tensorflow as tf, tensorflow.version as tv

print(f"{tv.VERSION}, {tv.COMPILER_VERSION}, {tv.GIT_VERSION}")

v = tf.keras.layers.TextVectorization(output_mode="count")
v.adapt(["foo", "bar", "baz"])
print(v(["bar baz"]).dtype)

Output is:

2.15.0, Ubuntu Clang 17.0.2 (++20231003073124+b2417f51dbbd-1~exp1~20231003073217.50), v2.15.0-2-g0b15fdfcb3f
<dtype: 'float32'>

I would have expected an int64 output.

divyashreepathihalli · 2023-12-21T20:56:27Z

Oh I see, this is a tf.keras issue. The commit you linked was in the Keras 3 repo.
I have moved the issue to tf_keras here: keras-team/tf-keras#711

nicdumz · 2023-12-21T20:58:04Z

Thank you, sorry I was not aware of the difference; and thanks for the redirect.

github-actions bot assigned sachinprasadhs Dec 20, 2023

nicdumz referenced this issue Dec 20, 2023

Fix TextVectorization + Sequential bug (#696)

ef72bfb

* Fix custom functional reload issue * Fix issue with TextVectorization as first Sequential input * Fix text vectorization output spec

sachinprasadhs added the keras-team-review-pending and type:Bug labels Dec 20, 2023

divyashreepathihalli closed this as completed Dec 21, 2023

divyashreepathihalli removed the keras-team-review-pending label Dec 21, 2023

divyashreepathihalli mentioned this issue Dec 21, 2023

TextVectorization: output_mode={multi_hot, count} promise int arrays but output floats keras-team/tf-keras#711

Open
