Masked tensor support as CompositeTensors? #18413
Labels: `keras-team-review-pending` (pending review by a Keras team member), `type:feature` (the user is asking for a new feature), `type:support` (user is asking for help / an implementation question; Stack Overflow would be better suited).
I'm not sure this is the correct place for this, and I'm even less confident that it's my place to say any of it, but having spent a good deal of time playing with masked tensors I figured I'd share my thoughts/frustrations in the hope that some aspects of the design could be reconsidered for the next major version bump. Apologies in advance for the long read.
## TL;DR

## Masking in metrics

### Abusable - both intentionally and accidentally
Let's say I find a training script for a keras-based SOTA image classification model. I can almost guarantee you I can take that script and, with a minor tweak to the model architecture and without any retraining or manual weight adjustment, get 100% top-1 accuracy. How? By adding a masking layer to the final predictions such that all but the most confident prediction are masked, the metrics will simply ignore all the rest.
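To make this concrete, here's a toy sketch (NumPy only, invented numbers) of a metric that averages only over unmasked entries, and how it can be gamed:

```python
import numpy as np

# Toy logits for 4 samples, 3 classes; the model is right on only 1 of 4.
logits = np.array([
    [2.0, 1.0, 0.5],   # predicts class 0, label 0 -> correct
    [0.2, 0.9, 0.4],   # predicts class 1, label 2 -> wrong
    [0.3, 0.4, 0.8],   # predicts class 2, label 0 -> wrong
    [1.0, 0.2, 0.1],   # predicts class 0, label 1 -> wrong
])
labels = np.array([0, 2, 0, 1])
preds = logits.argmax(axis=-1)

def masked_accuracy(preds, labels, mask):
    # A masked metric averages only over unmasked entries,
    # mirroring how a propagated mask is folded into the metric.
    mask = mask.astype(float)
    return (preds == labels).astype(float) @ mask / mask.sum()

honest = masked_accuracy(preds, labels, np.ones(4, bool))  # 0.25

# "Mask away" everything except the single most confident
# (and here, correct) prediction: the metric reports perfection.
confidence = logits.max(axis=-1)
abusive_mask = confidence == confidence.max()
inflated = masked_accuracy(preds, labels, abusive_mask)    # 1.0
print(honest, inflated)
```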
This seems absurd to me. Now maybe you're thinking, "That's a bit far-fetched; we shouldn't have to program such that our code can't intentionally be abused this way," and to some extent I'd agree. But mask propagation is incredibly easy to implement incorrectly, and it fails silently when such a mistake is made. These silent failures can be highly misleading.
Example: I recently wanted to see how well a very basic resnet-based architecture would do on common NLP tasks. I took one of the official keras/keras-nlp examples and replaced the transformer model with a basic resnet architecture, and lo and behold, accuracy improved significantly. Does this mean resnets are the future of NLP?
No. At least, that's not something I can say from my experiments. Why not? Well, my resnet had residual connections that I made using `x + y` instead of `Add()([x, y])`. This (silently) destroyed the mask, so the loss and metrics included contributions from padding. Unsurprisingly, it's very easy for the model to learn that only PAD tokens appear after a PAD token, and when your dataset has a non-trivial amount of padding, this significantly improves metrics.

### Not used correctly in any keras NLP example (from what I understand)
I accept that masking can be relevant in more than just NLP cases, but that seems to be the most common use case in the examples. AFAIK there's not a single NLP example that does this correctly, however. To illustrate, consider a sequence generation example where the labels are a padded version of "hello world", padded to token length 6. We'll assume a causal decoder that takes inputs and evaluates based on a shifted version. The inputs should be masked to exclude padding tokens, as should the targets - but I've yet to see an example which applies this shift to the masks.
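With illustrative token ids (the START/END/PAD conventions here are assumptions), the shift looks like:

```python
import numpy as np

PAD, END = 0, 3
# "hello world" with a START token, padded to token length 6.
inputs  = np.array([9, 1, 2, END, PAD, PAD])   # 9 = START, 1 = hello, 2 = world
targets = np.array([1, 2, END, PAD, PAD, PAD]) # inputs shifted left by one

input_mask  = inputs != PAD   # [1, 1, 1, 1, 0, 0]
target_mask = targets != PAD  # [1, 1, 1, 0, 0, 0]

# The masks disagree at the 4th token: reusing the input mask for the
# targets credits the model for predicting PAD after END.
mismatch = np.flatnonzero(input_mask != target_mask)
print(mismatch)  # [3]
```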
Note the mismatch between the input and target mask at the 4th token. As a result, every example I've seen gives credit to a model for learning that every "(END)" token is followed by a "(PAD)" token. In my opinion, this is not a meaningful thing to learn, and serves only to artificially inflate metric scores, particularly for short sequences.
To further illustrate the resnet example I described above: after accidentally using `+` instead of `Add()`, the mask my metrics used covered the padding positions as well. In this case, a model can achieve 50% accuracy by learning nothing more than that `PAD` always follows `END`/`PAD` tokens (and this can be improved by increasing the maximum sequence length/amount of padding).

### Metrics issues: resolution
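For concreteness, building per-token `sample_weights` at dataset-construction time can be as simple as the following framework-free sketch (token ids and the padding convention are illustrative):

```python
import numpy as np

PAD = 0

def make_example(token_ids, max_len):
    """Pad, shift, and build sample weights instead of relying on masks."""
    ids = np.full(max_len + 1, PAD)
    ids[:len(token_ids)] = token_ids
    inputs, targets = ids[:-1], ids[1:]
    # Weight real target tokens 1.0 and padding 0.0 - the loss/metrics
    # then ignore padding, with no mask propagation to get wrong.
    sample_weights = (targets != PAD).astype(np.float32)
    return inputs, targets, sample_weights

inputs, targets, w = make_example([9, 1, 2, 3], max_len=6)
print(w)  # [1. 1. 1. 0. 0. 0.]
```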
Both of these issues can be resolved using `sample_weights` instead of masking. I'm unaware of any case where masking can do anything that appropriate `sample_weights` couldn't, and implementation during dataset construction is straightforward.

On this point, my question is: if
`sample_weights` can do everything masking can, isn't abusable, and is (in my opinion) less confusing, can we completely remove consideration of masks from metrics?

## Masking in Models
Despite the above, I accept there's a valid place for masking in model construction, be it in NLP, GNNs, point cloud networks or other use cases. In tensorflow there are `RaggedTensor`s that can potentially be used for these representations, but masked tensors are often better because they allow for constant shapes, which can have a significant effect on performance, especially with XLA compilation.

That said, having played around with supporting masks in my models, I find the developer experience unintuitive, error prone and difficult to debug. Supporting them in a layer involves setting a public attribute, changing function signatures for `call`, `build` and `compute_output_[spec|shape]`, and potentially adding a new `compute_output_mask`. If you make a mistake in any of these, there is generally no error, and the fallback behaviour is almost never what's intended. At best you get a warning (not always), but that warning doesn't include a stack trace, so all you know is you've made a mistake somewhere. Outside of layer implementations there isn't even a way to check whether a tensor is masked or not using the public API, and even unit tests depend on private API use to test some functionality. The fact that there are numerous open bugs related to masking makes me believe I'm not the only one struggling.
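To illustrate the failure mode, here's a toy re-creation of the propagation rule - deliberately not the real Keras API - showing how a layer that forgets to declare the attribute silently drops the mask rather than raising:

```python
# Toy model of Keras-style mask propagation (not the real API):
# a layer that forgets to declare supports_masking silently drops
# the incoming mask instead of raising an error.
class ToyLayer:
    supports_masking = False

    def __call__(self, x, mask=None):
        out = self.call(x)
        if mask is None or not self.supports_masking:
            # Mask silently destroyed - no error, no stack trace.
            return out, None
        return out, self.compute_mask(mask)

    def call(self, x):
        return x

    def compute_mask(self, mask):
        return mask

class MaskAware(ToyLayer):
    supports_masking = True

mask = [True, True, False]
_, m1 = MaskAware()("x", mask)  # mask propagated
_, m2 = ToyLayer()("x", mask)   # mask gone, silently
print(m1, m2)  # [True, True, False] None
```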
This is completely different to my experience with composite tensors in tensorflow. If I want a layer to support a `RaggedTensor`, I simply make sure my `call` method supports `RaggedTensor` inputs. Same for `SparseTensor`s. If I try and use a `RaggedTensor` input in an unsupported way, it throws an error and I know exactly what I need to fix. If I want the output to have a different structure, I return a different structure. If I want to check whether a tensor is a regular tensor or a ragged/sparse tensor when building models with the functional API, I just look at its type. There is no analogous `supports_masking` attribute to set or `compute_output_mask` method to implement, and the number and names of arguments in other methods remain unchanged.

Now I appreciate that `keras_core` does not currently support composite tensors, but I'm very much hoping that's on the cards, at least in a limited capacity with a few concrete classes (e.g. `SparseTensor`, `RaggedTensor`) or a more general framework for users to define their own composite structures like tensorflow and jax have.

My questions on this point are:
Relevant issues:

- `Merge` (e.g. `Add`, `Maximum`) layers do not propagate masks #18416