Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TextVectorization: output_mode={multi_hot, count} promise int arrays but output floats #18973

Closed
nicdumz opened this issue Dec 20, 2023 · 5 comments
Assignees
Labels

Comments

@nicdumz
Copy link

nicdumz commented Dec 20, 2023

Documentation for output_mode currently reads:

"multi_hot": Outputs a single int array per batch, of either vocab_size or max_tokens size, containing 1s in all elements where the token mapped to that index exists at least once in the batch item.
"count": Like "multi_hot", but the int array contains a count of the number of times the token at that index appeared in the batch item.

But this isn't actually the case. A little test to show this:

 v = keras.layers.TextVectorization(output_mode="count")
 v.adapt(["foo", "bar", "baz"])
 self.assertEqual(v(["foo lol"]).dtype, tf.int64)  # AssertionError: tf.float32 != tf.int64

Source in fact currently outputs ints for output_mode="int", but floats for everything else. This seems to have been introduced as part of ef72bfb

nicdumz referenced this issue Dec 20, 2023
* Fix custom functional reload issue

* Fix issue with TextVectorization as first Sequential input

* Fix text vectorization output spec
@nicdumz
Copy link
Author

nicdumz commented Dec 20, 2023

(IndexLookup has exactly the same code, fwiw)

@sachinprasadhs sachinprasadhs added keras-team-review-pending Pending review by a Keras team member. type:Bug labels Dec 20, 2023
@divyashreepathihalli
Copy link
Collaborator

@nicdumz I tried with all backends and it seems to retuning int
image

can you please double check?

@divyashreepathihalli divyashreepathihalli removed the keras-team-review-pending Pending review by a Keras team member. label Dec 21, 2023
@nicdumz
Copy link
Author

nicdumz commented Dec 21, 2023

@divyashreepathihalli :

With a test program containing:

import tensorflow as tf, tensorflow.version as tv

print(f"{tv.VERSION}, {tv.COMPILER_VERSION}, {tv.GIT_VERSION}")

v = tf.keras.layers.TextVectorization(output_mode="count")
v.adapt(["foo", "bar", "baz"])
print(v(["bar baz"]).dtype)

Output is:

2.15.0, Ubuntu Clang 17.0.2 (++20231003073124+b2417f51dbbd-1~exp1~20231003073217.50), v2.15.0-2-g0b15fdfcb3f
<dtype: 'float32'>

I would have expected an int64 output.

@divyashreepathihalli
Copy link
Collaborator

oh I see, this is a tf keras issue. The change commit you linked was in the Keras 3 repo.
I have moved to issue to tf_keras here - keras-team/tf-keras#711

@nicdumz
Copy link
Author

nicdumz commented Dec 21, 2023

Thank you, sorry I was not aware of the difference; and thanks for the redirect.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants