Better normalization options for SoftmaxWithLoss layer #3296
Conversation
Force-pushed from dcaeee0 to f76fd37
Thanks @cdoersch -- this makes the normalization options clearer and looks correct to me. Ideally there would be a test for each option, but maybe that's not needed, given that the current set of tests passes and the choice of normalization constant is factored into a separate method that's called in both Forward and Backward. Maybe @longjon would like to take a look before merge since he added the ...
It makes sense to address the divide-by-zero problem in this PR as well, since that's an issue with normalization. If there are 0 valid items (which can happen in many applications for some batches), then I think the loss should be 0 and the gradient also 0. Simple ways to address this: (a) make the minimum allowed denominator 1 (since it's always an integer) or (b) special-case the denominator code to not divide if 0.
} else {
  top[0]->mutable_cpu_data()[0] = loss / outer_num_;
}
top[0]->mutable_cpu_data()[0] = loss / get_normalizer(normalization_, count);
Potential divide-by-zero here (and other similar locations).
Yes, the old code had the same problem, but it might as well get fixed here, since a separate PR fixing the divide-by-zero would conflict with this one.
@seanbell this is a good point. I initially didn't worry about this because it indicates that you have a batch with zero labeled examples; it seems to me like you would want to error out if this happens. However, on second thought, I realize that in some cases datasets may accidentally contain examples that are totally unlabeled, which means that once in a great while a randomly-selected batch will contain no labels at all, leading to heisenbug behavior. Do you think we should log a warning if we correct the denominator away from zero?
I don't think a warning is always necessary. With multi-task setups, you can have auxiliary classification tasks that only have valid labels once in a while, so the log would get flooded with warnings in those cases. I'd be happy with a warning that is on by default but easy to disable. More debug info is better than less.
But would you really want normalization turned on at all in that use case? i.e., if the labeled examples are so rare that many batches don't even contain one, do people really want an example to get half the weight if it just happens to occur in the same batch with another example where the label is defined? I guess I don't really have a strong opinion about this, so if you have a use case, then I'm fine with doing it your way.
My personal opinion here is that scaling down the loss by the number of non-ignored labels is the wrong thing to do to begin with, and this normalization should not be the default; but if it is used, it shouldn't have a special case for 0 non-ignored labels, as the NaN/inf problem that occurs with 0 non-ignored labels is just the most extreme and illustrative case of the general problem with this normalization strategy. For example, say you're using a batch size of 128 -- then, relative to a batch with 128 valid labels, in a batch with 64 valid labels each valid instance's error gradient is scaled up by a factor of 2; with 16 valid labels, by a factor of 8; with 1 valid label, by a factor of 128. Naturally, with no valid labels, the error gradient is scaled up by infinity -- which is of course bad, but only exactly as bad as it should be. With 1 valid label, things are already pretty bad: scaling that one instance's gradient up by a factor of 128 is likely to lead to quite unstable and high-variance learning if you chose your learning rate expecting your update to reflect the gradient averaged over 128 independent instances. That said, I think we should change the default to ...
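A toy, standalone snippet (not from Caffe) that just reproduces the arithmetic in the comment above: under "valid" normalization, each labeled instance's contribution to the update is scaled by batch_size / num_valid relative to a fully labeled batch.

```cpp
#include <cstdio>

int main() {
  const int batch_size = 128;
  const int valid_counts[] = {128, 64, 16, 1};
  for (int k : valid_counts) {
    // With "valid" normalization, each labeled instance's gradient is divided
    // by k instead of batch_size, so its contribution grows by batch_size / k.
    std::printf("valid labels: %3d -> per-instance gradient scale: %gx\n",
                k, static_cast<double>(batch_size) / k);
  }
  return 0;
}
```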
In my use case, I have a very large model, a small batch size (4), and use gradient accumulation to fit it on the GPU. With this setup, a single iteration with no valid labels is enough to totally kill training: the moment you have a NaN, everything is dead, so you might as well crash with a debug message in the normalizer code. What I am saying is that you may want to instead ignore the iteration and avoid the crash. I can see that it might not be the best default, but I think it's an option worth considering.
I'm not sure about a check, as it's possible you might not actually be backpropagating the error and just want to compute the softmax loss for debugging/display purposes (e.g. you're using ...).
That's a valid use case too (though the ...). I was just saying that I have a use case for avoiding divide-by-zero (noisy, partially labeled datasets with small batch sizes, and certain multi-task setups), and I thought others might as well. It could be a configurable option, with the default being to divide by zero and output NaN. But if nobody else runs into this problem, it could be left out for now. Edit: another use case. I have a multi-task setup where some batches have semantic segmentation labels and others do not. For those that do have labels, I use "valid" normalization. For those that don't, the normalizing constant is 0, and I set ...
Since weight decay is the same each iteration, I think it does make sense to scale by the number of valid labels. Otherwise, the ratio of regularizer-to-data changes from batch to batch. Sorry if I'm side-tracking this PR. I'm arguing for something (avoiding divide-by-zero) which could be discussed in another PR.
The ratio does change from minibatch to minibatch, but it's still an unbiased estimator of the objective if minibatches are sampled uniformly. But this statement made me go write down some math, and I realized I was wrong -- the ... Anyway, I retract my earlier hard-line and probably mathematically unfounded position and would support the special case you suggested, @seanbell (it doesn't need to be in this PR, but it's also fine with me if it is). Thanks for the discussion.
Considering that we have one 'don't care', one 'weak support', and one 'support' from someone who actually has a use case, I think it makes sense to add this to the PR. I'll implement it as a one-line change at the end of the ...
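A minimal sketch of what that one-line change might look like, assuming the normalization constant is computed by the get_normalizer helper shown in the diff above; the helper name here (clamp_normalizer) is hypothetical, and the real change could simply be a std::max at the end of get_normalizer.

```cpp
#include <algorithm>  // std::max

// Option (a) from the discussion: clamp the denominator to at least 1, so a
// batch with zero valid labels produces zero loss and zero gradient instead
// of a divide-by-zero (NaN/inf). Since the count is an integer, this only
// changes the zero case.
template <typename Dtype>
Dtype clamp_normalizer(Dtype normalizer) {  // hypothetical helper name
  return std::max(Dtype(1), normalizer);
}
```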
Force-pushed from 9963079 to d5a78e1
Force-pushed from d5a78e1 to 8b2aa70
LGTM, thanks for the additional normalization options and clarifications of existing ones @cdoersch!
Better normalization options for SoftmaxWithLoss layer
Thanks! This is very helpful.
sig-ce loss handles all the same normalizations as the softmax loss; refer to BVLC#3296 for more detail
sig-ce loss handles all the same normalizations as the softmax loss; refer to BVLC#3296 for more detail. n.b. this changes the default normalization for this loss! previously this loss normalized by batch size, but now it normalizes by the total number of outputs/targets.
sig-ce loss handles all the same normalizations as the softmax loss; refer to BVLC#3296 for more detail. note: the default normalization remains batch size, but valid normalization might be a better idea
sig-ce loss handles all the same normalizations as the softmax loss; refer to BVLC#3296 for more detail. this preserves the default normalization for sig-ce loss: batch size.
The current SoftmaxWithLoss layer has two options for normalizing the output: either divide by the number of 'valid' samples (those without the ignore label), or by the batch size. Notably missing is any way to turn normalization off completely. This is needed in the case where the batches are not all the same size, and hence batches with more examples need to get more weight. One might expect that normalize = false would do this, but confusingly it still divides by the batch size.
This PR replaces the existing 'normalize' boolean parameter in caffe.proto with an enum that has four different options, with more informative names. Two of the options (VALID and BATCH_SIZE) mimic existing behavior, and the current boolean parameter is still supported for backwards compatibility (but is deprecated). The NONE option allows you to turn normalization off completely, and the FULL option allows you to normalize by the full shape of the output map, i.e., like VALID but locations with the 'ignore' label are still included in the count for normalization.
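For concreteness, here is a hedged sketch of how those four modes could map to a normalization constant inside a helper like the get_normalizer call shown in the diff above; outer_num_ and inner_num_ follow the layer's usual naming for batch size and per-example prediction count, but the exact code in this PR may differ.

```cpp
#include <algorithm>  // std::max

// Sketch (not verbatim Caffe code) of how the four modes might map to a
// divisor for the accumulated loss. Here outer_num_ is the batch size,
// inner_num_ is the number of predictions per example (e.g. spatial
// locations), and valid_count is the number of predictions whose label is
// not the ignore label.
enum NormalizationMode { FULL, VALID, BATCH_SIZE, NONE };

template <typename Dtype>
Dtype normalizer_sketch(NormalizationMode mode, int outer_num_,
                        int inner_num_, int valid_count) {
  Dtype normalizer;
  switch (mode) {
    case FULL:        // every prediction counts, ignored or not
      normalizer = Dtype(outer_num_ * inner_num_);
      break;
    case VALID:       // only non-ignored predictions (old normalize: true)
      normalizer = Dtype(valid_count);
      break;
    case BATCH_SIZE:  // divide by batch size only (old normalize: false)
      normalizer = Dtype(outer_num_);
      break;
    case NONE:        // no normalization: per-prediction losses are summed
    default:
      normalizer = Dtype(1);
      break;
  }
  // Divide-by-zero guard from the discussion above: with zero valid labels,
  // VALID normalization would otherwise yield NaN/inf loss and gradients.
  return std::max(Dtype(1), normalizer);
}
```

With NONE, the loss is simply summed over all non-ignored predictions, which is the behavior the description above asks for when batches are not all the same size.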
Note that there's still a bit of a mess here, since it seems that SoftmaxWithLoss is the only layer that actually reads LossParameter's normalization options. This remains unchanged for this PR.