Better normalization options for SoftmaxWithLoss layer #3296
Conversation
Force-pushed from dcaeee0 to f76fd37
Thanks @cdoersch -- this makes the normalization options clearer and looks correct to me. Ideally there would be a test for each option, but maybe that's not needed, given that the current set of tests passes and the choice of normalization constant is factored into a separate method that's called in both Forward and Backward. Maybe @longjon would like to take a look before merge since he added the ...
It makes sense to address the divide-by-zero problem in this PR as well, since that's an issue with normalization. If there are 0 valid items (which can happen in many applications for some batches), then I think the loss should be 0 and the gradient also 0. Simple ways to address this: (a) make the minimum allowed denominator 1 (since it's always an integer) or (b) special-case the denominator code to not divide if 0.
} else {
  top[0]->mutable_cpu_data()[0] = loss / outer_num_;
}
top[0]->mutable_cpu_data()[0] = loss / get_normalizer(normalization_, count);
Potential divide-by-zero here (and other similar locations).
Yes, the old code had the same problem, but it might as well get fixed here, since a separate PR fixing the divide-by-zero would conflict with this one.
@seanbell this is a good point. I initially didn't worry about this because it indicates that you have a batch with zero labeled examples; it seems to me like you would want to error out if this happens. However, on second thought, I realize that in some cases datasets may accidentally contain examples that are totally unlabeled, which means that once in a great while a randomly-selected batch will contain no labels at all, leading to heisenbug behavior. Do you think we should log a warning if we correct the denominator away from zero?
I don't think a warning is always necessary. With multi-task setups, you can have auxiliary classification tasks that only have valid labels once in a while, so the log would get flooded with warnings in those cases. I'd be happy with a warning that is on by default but easy to disable. More debug info is better than less.
But would you really want normalization turned on at all in that use case? i.e., if the labeled examples are so rare that many batches don't even contain one, do people really want an example to get half the weight if it just happens to occur in the same batch with another example where the label is defined? I guess I don't really have a strong opinion about this, so if you have a use case, then I'm fine with doing it your way.
My personal opinion here is that scaling down the loss by the number of non-ignored labels is the wrong thing to do to begin with, and this normalization should not be the default; but if it is used, it shouldn't have a special case for 0 non-ignored labels, as the NaN/inf problem that occurs with 0 non-ignored labels is just the most extreme and illustrative case of the general problem with this normalization strategy. For example, say you're using a batch size of 128 -- then, relative to a batch with 128 valid labels, in a batch with 64 valid labels each valid instance's error gradient is scaled up by a factor of 2; with 16 valid labels, by a factor of 8; with 1 valid label, by a factor of 128. Naturally, with no valid labels, the error gradient is scaled up by infinity -- which is of course bad, but only exactly as bad as it should be. With 1 valid label, things are already pretty bad: scaling that one instance's gradient up by a factor of 128 is likely to lead to quite unstable and high-variance learning if you chose your learning rate expecting your update to reflect the gradient averaged over 128 independent instances. That said, I think we should change the default to ...
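A toy, standalone snippet (not from Caffe) that just reproduces the arithmetic in the comment above: under "valid" normalization, each labeled instance's contribution to the update is scaled by batch_size / num_valid relative to a fully labeled batch.

```cpp
#include <cstdio>

int main() {
  const int batch_size = 128;
  const int valid_counts[] = {128, 64, 16, 1};
  for (int k : valid_counts) {
    // With "valid" normalization, each labeled instance's gradient is divided
    // by k instead of batch_size, so its contribution grows by batch_size / k.
    std::printf("valid labels: %3d -> per-instance gradient scale: %gx\n",
                k, static_cast<double>(batch_size) / k);
  }
  return 0;
}
```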
In my use case, I have a very large model, a small batch size (4), and use gradient accumulation to fit it on the GPU. With this setup, a single iteration with no valid labels is enough to totally kill training: the moment you have a NaN, everything is dead, so you might as well crash with a debug message in the normalizer code. What I am saying is that you may want to instead ignore the iteration and avoid the crash. I can see that it might not be the best default, but I think it's an option worth considering.
I'm not sure about a check, as it's possible you might not actually be backpropagating the error and just want to compute the softmax loss for debugging/display purposes (e.g. you're using ...).
That's a valid use case too (though the ...). I was just saying that I have a use case for avoiding divide-by-zero (noisy, partially labeled datasets with small batch sizes, and certain multi-task setups), and I thought others might as well. It could be a configurable option, with the default being to divide by zero and output NaN. But if nobody else runs into this problem, it could be left out for now. Edit: another use case. I have a multi-task setup where some batches have semantic segmentation labels and others do not. For those that do have labels, I use "valid" normalization. For those that don't, the normalizing constant is 0, and I set ...
Since weight decay is the same each iteration, I think it does make sense to scale by the number of valid labels. Otherwise, the ratio of regularizer-to-data changes from batch to batch. Sorry if I'm side-tracking this PR. I'm arguing for something (avoiding divide-by-zero) which could be discussed in another PR.
The ratio does change from minibatch to minibatch, but it's still an unbiased estimator of the objective if minibatches are sampled uniformly. But this statement made me go write down some math, and I realized I was wrong -- the ... Anyway, I retract my earlier hard-line and probably mathematically unfounded position and would support the special case you suggested, @seanbell (it doesn't need to be in this PR, but it's also fine with me if it is). Thanks for the discussion.
Considering that we have one 'don't care', one 'weak support', and one 'support' from someone who actually has a use case, I think it makes sense to add this to the PR. I'll implement it as a one-line change at the end of the ...
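A minimal sketch of what that one-line change might look like, assuming the normalization constant is computed by the get_normalizer helper shown in the diff above; the helper name here (clamp_normalizer) is hypothetical, and the real change could simply be a std::max at the end of get_normalizer.

```cpp
#include <algorithm>  // std::max

// Option (a) from the discussion: clamp the denominator to at least 1, so a
// batch with zero valid labels produces zero loss and zero gradient instead
// of a divide-by-zero (NaN/inf). Since the count is an integer, this only
// changes the zero case.
template <typename Dtype>
Dtype clamp_normalizer(Dtype normalizer) {  // hypothetical helper name
  return std::max(Dtype(1), normalizer);
}
```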
Force-pushed from 9963079 to d5a78e1
Force-pushed from d5a78e1 to 8b2aa70
LGTM, thanks for the additional normalization options and clarifications of existing ones @cdoersch!
Better normalization options for SoftmaxWithLoss layer
Thanks! This is very helpful.
sig-ce loss handles all the same normalizations as the softmax loss; refer to BVLC#3296 for more detail
sig-ce loss handles all the same normalizations as the softmax loss; refer to BVLC#3296 for more detail. n.b. this changes the default normalization for this loss! previously this loss normalized by batch size, but now it normalizes by the total number of outputs/targets.
sig-ce loss handles all the same normalizations as the softmax loss; refer to BVLC#3296 for more detail. note: the default normalization remains batch size, but valid normalization might be a better idea
sig-ce loss handles all the same normalizations as the softmax loss; refer to BVLC#3296 for more detail. this preserves the default normalization for sig-ce loss: batch size.
The current SoftmaxWithLoss layer has two options for normalizing the output: either divide by the number of 'valid' samples (those without the ignore label), or by the batch size. Notably missing is any way to turn normalization off completely. This is needed in the case where the batches are not all the same size, and hence batches with more examples need to get more weight. One might expect that normalize = false would do this, but confusingly it still divides by the batch size.
This PR replaces the existing 'normalize' boolean parameter in caffe.proto with an enum that has four different options, with more informative names. Two of the options (VALID and BATCH_SIZE) mimic existing behavior, and the current boolean parameter is still supported for backwards compatibility (but is deprecated). The NONE option allows you to turn normalization off completely, and the FULL option allows you to normalize by the full shape of the output map, i.e., like VALID but locations with the 'ignore' label are still included in the count for normalization.
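For concreteness, here is a hedged sketch of how those four modes could map to a normalization constant inside a helper like the get_normalizer call shown in the diff above; outer_num_ and inner_num_ follow the layer's usual naming for batch size and per-example prediction count, but the exact code in this PR may differ.

```cpp
#include <algorithm>  // std::max

// Sketch (not verbatim Caffe code) of how the four modes might map to a
// divisor for the accumulated loss. Here outer_num_ is the batch size,
// inner_num_ is the number of predictions per example (e.g. spatial
// locations), and valid_count is the number of predictions whose label is
// not the ignore label.
enum NormalizationMode { FULL, VALID, BATCH_SIZE, NONE };

template <typename Dtype>
Dtype normalizer_sketch(NormalizationMode mode, int outer_num_,
                        int inner_num_, int valid_count) {
  Dtype normalizer;
  switch (mode) {
    case FULL:        // every prediction counts, ignored or not
      normalizer = Dtype(outer_num_ * inner_num_);
      break;
    case VALID:       // only non-ignored predictions (old normalize: true)
      normalizer = Dtype(valid_count);
      break;
    case BATCH_SIZE:  // divide by batch size only (old normalize: false)
      normalizer = Dtype(outer_num_);
      break;
    case NONE:        // no normalization: per-prediction losses are summed
    default:
      normalizer = Dtype(1);
      break;
  }
  // Divide-by-zero guard from the discussion above: with zero valid labels,
  // VALID normalization would otherwise yield NaN/inf loss and gradients.
  return std::max(Dtype(1), normalizer);
}
```

With NONE, the loss is simply summed over all non-ignored predictions, which is the behavior the description above asks for when batches are not all the same size.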
Note that there's still a bit of a mess here, since it seems that SoftmaxWithLoss is the only layer that actually reads LossParameter's normalization options. This remains unchanged for this PR.