
Question about Softmax.categorical_crossentopy #654

Closed
rizar opened this issue May 21, 2015 · 13 comments

@rizar
Contributor

rizar commented May 21, 2015

I guess I do not get something simple, but I will still ask: why are we not using theano.nnet.softmax there?

@rizar rizar added the question label May 21, 2015
@bartvm
Member

bartvm commented May 21, 2015

Because it often ends up not being numerically stable in my experience. See #150. I'm quite sure other people have experienced similar problems (e.g. https://groups.google.com/forum/#!searchin/blocks-users/softmax/blocks-users/f_oPjQi1o-8/6Gg7PgjvMk4J)

@rizar
Contributor Author

rizar commented May 21, 2015

But all we do there is max subtraction, which should be implemented in Theano as well... I will take a look at what happens under the hood. Let's keep the issue open for now.
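
For context, the manual stabilisation amounts to something like this sketch (placeholder names, not the actual Blocks code):

    import theano.tensor as T

    x = T.matrix('x')  # pre-softmax activations, one row per example

    # Subtracting the per-row max before exponentiating keeps exp() from
    # overflowing; mathematically nothing changes because the factor exp(max)
    # cancels between numerator and denominator.
    shifted = x - x.max(axis=1, keepdims=True)
    stable_softmax = T.exp(shifted) / T.exp(shifted).sum(axis=1, keepdims=True)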

@dwf
Contributor

dwf commented May 21, 2015 via email

@vdumoulin
Contributor

In the past I got bit by that when there were reshapes scattered in my graph: the optimization wasn't applied because it was confused by the reshapes.

@lamblin
Contributor

lamblin commented May 22, 2015

Those optimizers absolutely need to be worked on in Theano. I opened Theano/Theano#2944, but there would be more work to do to handle the reshapes as well.

@rizar
Contributor Author

rizar commented May 22, 2015

Actually, the C code for the CPU seems to perform this max subtraction: https://github.com/Theano/Theano/blob/master/theano/tensor/nnet/nnet.py#L206

I could not get through the GPU version. Anyway, if theano.nnet.softmax is currently unreliable, why are we using it in the apply method of the same brick?

@rizar
Contributor Author

rizar commented May 22, 2015

And now I am even more confused, because the max is also subtracted in the CUDA code: https://github.com/Theano/Theano/blob/master/theano/sandbox/cuda/kernel_codegen.py#L151

@bartvm
Member

bartvm commented May 22, 2015

It's not just subtracting the max. There's also the gradient computation, which could end up looking completely different (because Softmax comes with SoftmaxGrad, which might perform different computations from the gradient of the manual expression).
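
A rough way to see the difference is to print the two gradient graphs side by side (a sketch, with made-up variable names):

    import theano
    import theano.tensor as T
    from theano.tensor.nnet import softmax

    x = T.matrix('x')

    # Gradient through Theano's Softmax Op: the graph contains SoftmaxGrad.
    cost_op = T.log(softmax(x)).sum()
    # Gradient through a hand-written softmax: the graph is whatever Theano
    # derives from the elemwise expression.
    e = T.exp(x - x.max(axis=1, keepdims=True))
    cost_manual = T.log(e / e.sum(axis=1, keepdims=True)).sum()

    theano.printing.debugprint(T.grad(cost_op, x))
    theano.printing.debugprint(T.grad(cost_manual, x))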


@rizar
Contributor Author

rizar commented May 22, 2015

Okay, I see. But then, as I said above, we should not use it in apply either. Created #659.

@rizar rizar closed this as completed May 22, 2015
@lamblin
Contributor

lamblin commented May 22, 2015

The computation in SoftmaxGrad is the same as the one automatically derived from the expression involving exp; we checked that with @harmdevries89 when working on Theano/Theano#2050.
What is missing is an optimization for the case where you do not need the gradient through the whole softmax expression, but only through one index, for instance the usual case of log(softmax(...)[target_idx]). In that case there is a much simpler expression, and the graph gets rewritten to use CrossEntropySoftmaxArgmaxDx or something like that, which performs the simpler computation.
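
For illustration, the indexed form would look roughly like this (a sketch with placeholder names):

    import theano.tensor as T
    from theano.tensor.nnet import softmax

    x = T.matrix('x')   # (batch, classes) pre-softmax scores
    y = T.ivector('y')  # integer target indices

    # Only the gradient through the target index is needed, so the whole
    # expression can in principle be rewritten into a much cheaper fused Op.
    nll = -T.log(softmax(x))[T.arange(y.shape[0]), y]
    cost = nll.mean()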

@mjwillson

@lamblin I noticed that Theano itself has theano.tensor.nnet.categorical_crossentropy.

For the one-hot case this uses CrossentropyCategorical1Hot, which seems to have a grad implemented, and its docs state:

    :note: In the case that the coding distribution is the output of a
           softmax, an application of this Op will probably be optimized
           away in favour of one with a C implementation.

While I don't like the sound of "probably", is there a reason not to use this on top of theano.tensor.nnet.softmax in Softmax.categorical_crossentopy? Is there an easy way to find out which optimisations are applied under the hood?
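
Concretely, the combination I have in mind would be something like this (placeholder names):

    import theano.tensor as T
    from theano.tensor.nnet import softmax, categorical_crossentropy

    x = T.matrix('x')   # (batch, classes) pre-softmax scores
    y = T.ivector('y')  # integer targets, i.e. the one-hot case

    # With integer targets this goes through CrossentropyCategorical1Hot,
    # which the docstring suggests should get fused with the softmax by an
    # optimization.
    cost = categorical_crossentropy(softmax(x), y).mean()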

It'd be good to clear this stuff up anyway, either here or in Theano. I'm not enough of an expert to debug it authoritatively, but for me, using Blocks' Softmax.categorical_crossentopy seems to cause Theano to do something suspiciously slow / inefficient (with profiling turned on, around 2/3 of the time is spent in Softmax and Elemwise exp, which seems way too much...).

@lamblin
Contributor

lamblin commented Jul 14, 2015

So, there is an optimization that transforms crossentropy_categorical_onehot(softmax(...), target) into crossentropy_softmax_argmax_1hot_with_bias(..., target), and the equivalent for the gradient.

There are also optimizations trying to catch the equivalent expression (in the 2D case) softmax(...)[arange(batch_size), target] (here) and its gradient (here).

The indexing form is easier to type, but the crossentropy_categorical_onehot may be easier to optimize.
However, I think the main problem is when we do not simply have crossentropy_categorical_onehot(softmax), but some reshaping, indexing, masking, or scan between the two.

@mjwillson

OK, yeah fair enough -- I can see this getting more complicated with recurrent models.

Still, in the case of Softmax.categorical_crossentopy, at least in the one-hot case, from what I can see it does implement exactly the combination crossentropy_categorical_onehot(softmax(...), target), no? In that case, would it not be better for it to take advantage of the Theano optimisation?

I've now figured out how to see the Theano function graph after optimisations, and it is optimising this combination to something involving CrossentropySoftmaxArgmax1HotWithBias (for me at least), and the gradient to something using CrossentropySoftmax1HotWithBiasDx.
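
For reference, this is roughly how I'm checking it (a sketch, placeholder names):

    import theano
    import theano.tensor as T
    from theano.tensor.nnet import softmax

    x = T.matrix('x')
    y = T.ivector('y')
    cost = -T.log(softmax(x))[T.arange(y.shape[0]), y].mean()
    f = theano.function([x, y], [cost, T.grad(cost, x)])

    # List the Ops left in the optimised graph; if the rewrite fired, the
    # fused crossentropy/softmax Ops should show up here.
    print([type(node.op).__name__ for node in f.maker.fgraph.toposort()])
    # Or dump the whole graph:
    theano.printing.debugprint(f)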

There are GPU implementations of both of these, which is nice to take advantage of. For a smaller model I'm finding the final softmax activation to be the bottleneck, which feels a bit wrong (the matrix operations should be the bottleneck), but after looking at the function graph I think it is probably just down to a slowish implementation, on the CPU at least.

Also, with the manual version of softmax where you do the max subtraction symbolically, it seems you end up with stuff in the function graph relating to the gradient with respect to the max operation, which is unnecessary, and it feels a bit funny to be doing this kind of numerical trick symbolically. It feels like it needs sorting out upstream so this kind of hack isn't needed...
