
Question about Softmax.categorical_crossentopy #654

Closed
rizar opened this issue May 21, 2015 · 13 comments

@rizar
Contributor

rizar commented May 21, 2015

I guess I do not get something simple, but I will still ask: why are we not using theano.nnet.softmax there?

@rizar rizar added the question label May 21, 2015
@bartvm
Member

bartvm commented May 21, 2015

Because it often ends up not being numerically stable in my experience. See #150. I'm quite sure other people have experienced similar problems (e.g. https://groups.google.com/forum/#!searchin/blocks-users/softmax/blocks-users/f_oPjQi1o-8/6Gg7PgjvMk4J)

@rizar
Contributor Author

rizar commented May 21, 2015

But all we do there is max subtraction, which should be implemented in Theano as well... I will take a look at what happens under the hood. Let's keep the issue open for now.
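
For context, the manual stabilisation amounts to something like this sketch (placeholder names, not the actual Blocks code):

    import theano.tensor as T

    x = T.matrix('x')  # pre-softmax activations, one row per example

    # Subtracting the per-row max before exponentiating keeps exp() from
    # overflowing; mathematically nothing changes because the factor exp(max)
    # cancels between numerator and denominator.
    shifted = x - x.max(axis=1, keepdims=True)
    stable_softmax = T.exp(shifted) / T.exp(shifted).sum(axis=1, keepdims=True)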

@dwf
Contributor

dwf commented May 21, 2015 via email

@vdumoulin
Contributor

In the past I got bit by that when there were reshapes scattered in my graph: the optimization wasn't applied because it was confused by the reshapes.

@lamblin
Contributor

lamblin commented May 22, 2015

Those optimizers absolutely need to be worked on in Theano. I opened Theano/Theano#2944, but there would be more work to do to handle the reshapes as well.

@rizar
Contributor Author

rizar commented May 22, 2015

Actually, the C code for the CPU seems to perform this max subtraction: https://github.com/Theano/Theano/blob/master/theano/tensor/nnet/nnet.py#L206

I could not get through the GPU version. Anyway, if theano.nnet.softmax is currently unreliable, why are we using it in the apply method of the same brick?

@rizar
Contributor Author

rizar commented May 22, 2015

And now I am even more confused, because the max is also subtracted in the CUDA code: https://github.com/Theano/Theano/blob/master/theano/sandbox/cuda/kernel_codegen.py#L151

@bartvm
Member

bartvm commented May 22, 2015

It's not just subtracting the max. There's also the gradient computation, which could end up looking completely different (because Softmax comes with SoftmaxGrad, which might perform different computations from the gradient of the manual expression).
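
A rough way to see the difference is to print the two gradient graphs side by side (a sketch, with made-up variable names):

    import theano
    import theano.tensor as T
    from theano.tensor.nnet import softmax

    x = T.matrix('x')

    # Gradient through Theano's Softmax Op: the graph contains SoftmaxGrad.
    cost_op = T.log(softmax(x)).sum()
    # Gradient through a hand-written softmax: the graph is whatever Theano
    # derives from the elemwise expression.
    e = T.exp(x - x.max(axis=1, keepdims=True))
    cost_manual = T.log(e / e.sum(axis=1, keepdims=True)).sum()

    theano.printing.debugprint(T.grad(cost_op, x))
    theano.printing.debugprint(T.grad(cost_manual, x))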


@rizar
Contributor Author

rizar commented May 22, 2015

Okay, I see. But then, as I said above, we should not use it in apply either. Created #659.

@rizar rizar closed this as completed May 22, 2015
@lamblin
Contributor

lamblin commented May 22, 2015

The computation in SoftmaxGrad is the same as the one automatically derived from the expression involving exp; we checked that with @harmdevries89 when working on Theano/Theano#2050.
What is missing is an optimization for the case where you do not need the gradient through the whole softmax expression, but only through one index, for instance the usual case of log(softmax(...)[target_idx]). In that case there is a much simpler expression, and the graph gets rewritten to use CrossEntropySoftmaxArgmaxDx or something like that, which performs the simpler computation.
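
For illustration, the indexed form would look roughly like this (a sketch with placeholder names):

    import theano.tensor as T
    from theano.tensor.nnet import softmax

    x = T.matrix('x')   # (batch, classes) pre-softmax scores
    y = T.ivector('y')  # integer target indices

    # Only the gradient through the target index is needed, so the whole
    # expression can in principle be rewritten into a much cheaper fused Op.
    nll = -T.log(softmax(x))[T.arange(y.shape[0]), y]
    cost = nll.mean()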

@mjwillson

@lamblin I noticed that Theano itself has theano.tensor.nnet.categorical_crossentropy.

For the one-hot case this uses CrossentropyCategorical1Hot, which seems to have a grad implemented, and its docs state:

    :note: In the case that the coding distribution is the output of a
           softmax, an application of this Op will probably be optimized
           away in favour of one with a C implementation.

While I don't like the sound of "probably", is there a reason not to use this on top of theano.tensor.nnet.softmax in Softmax.categorical_crossentopy? Is there an easy way to find out which optimisations are applied under the hood?
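
Concretely, the combination I have in mind would be something like this (placeholder names):

    import theano.tensor as T
    from theano.tensor.nnet import softmax, categorical_crossentropy

    x = T.matrix('x')   # (batch, classes) pre-softmax scores
    y = T.ivector('y')  # integer targets, i.e. the one-hot case

    # With integer targets this goes through CrossentropyCategorical1Hot,
    # which the docstring suggests should get fused with the softmax by an
    # optimization.
    cost = categorical_crossentropy(softmax(x), y).mean()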

It'd be good to clear this stuff up anyway, either here or in Theano. I'm not enough of an expert to debug it authoritatively, but for me, using Blocks' Softmax.categorical_crossentopy seems to cause Theano to do something suspiciously slow / inefficient (with profiling turned on, around 2/3 of the time is spent in Softmax and Elemwise exp, which seems way too much...).

@lamblin
Contributor

lamblin commented Jul 14, 2015

So, there is an optimization that transforms crossentropy_categorical_onehot(softmax(...), target) into crossentropy_softmax_argmax_1hot_with_bias(..., target), and the equivalent for the gradient.

There are also optimizations trying to catch the equivalent expression (in the 2D case) softmax(...)[arange(batch_size), target] (here) and its gradient (here).

The indexing form is easier to type, but the crossentropy_categorical_onehot may be easier to optimize.
However, I think the main problem is when we do not simply have crossentropy_categorical_onehot(softmax), but some reshaping, indexing, masking, or scan between the two.

@mjwillson

OK, yeah fair enough -- I can see this getting more complicated with recurrent models.

Still, in the case of Softmax.categorical_crossentopy, at least in the one-hot case, from what I can see it does implement exactly the combination crossentropy_categorical_onehot(softmax(...), target), no? In that case, would it not be better for it to take advantage of the Theano optimisation?

I've now figured out how to see the Theano function graph after optimisations, and it is optimising this combination to something involving CrossentropySoftmaxArgmax1HotWithBias (for me at least), and the gradient to something using CrossentropySoftmax1HotWithBiasDx.
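
For reference, this is roughly how I'm checking it (a sketch, placeholder names):

    import theano
    import theano.tensor as T
    from theano.tensor.nnet import softmax

    x = T.matrix('x')
    y = T.ivector('y')
    cost = -T.log(softmax(x))[T.arange(y.shape[0]), y].mean()
    f = theano.function([x, y], [cost, T.grad(cost, x)])

    # List the Ops left in the optimised graph; if the rewrite fired, the
    # fused crossentropy/softmax Ops should show up here.
    print([type(node.op).__name__ for node in f.maker.fgraph.toposort()])
    # Or dump the whole graph:
    theano.printing.debugprint(f)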

There are GPU implementations of both of these, which is nice to take advantage of. For a smaller model I'm finding the final softmax activation to be the bottleneck, which feels a bit wrong (the matrix operations should be the bottleneck), but after looking at the function graph I think it is probably just down to a slowish implementation, on the CPU at least.

Also, with the manual version of softmax where you do the max subtraction symbolically, it seems you end up with stuff in the function graph relating to the gradient with respect to the max operation, which is unnecessary, and it feels a bit funny to be doing this kind of numerical trick symbolically. It feels like it needs sorting out upstream so this kind of hack isn't needed...
