Question about Softmax.categorical_cross_entropy
#654
Comments
Because it often ends up not being numerically stable in my experience. See #150. I'm quite sure other people have experienced similar problems (e.g. https://groups.google.com/forum/#!searchin/blocks-users/softmax/blocks-users/f_oPjQi1o-8/6Gg7PgjvMk4J)
But all we do there is max subtraction, which should be implemented in Theano as well... I will take a look at what happens under the hood. Let's keep the issue open so far.
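For readers following along, here is a minimal standalone sketch (mine, not code from blocks or this thread) of the max-subtraction trick being discussed: subtracting the row-wise maximum before exponentiating keeps `exp()` from overflowing while leaving the softmax result unchanged. Variable names are placeholders.

```python
# Illustrative only: a numerically stabilised softmax via max subtraction.
import numpy
import theano
import theano.tensor as tt

x = tt.matrix('x')  # (batch, classes) pre-softmax scores
shifted = x - x.max(axis=1, keepdims=True)      # the max subtraction
probs = tt.exp(shifted) / tt.exp(shifted).sum(axis=1, keepdims=True)

f = theano.function([x], probs)
print(f(numpy.array([[1000., 1001., 1002.]], dtype=theano.config.floatX)))
# a naive exp(1000.) would overflow to inf; the shifted version stays finite
```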
Pylearn2 did this as well. The optimizer that is supposed to handle this is apparently not that reliable.
In the past I got bit by that when there were reshapes scattered in my graph: the optimization wasn't applied because it was confused by the reshapes.
Those optimizers absolutely need to be worked on in Theano. I opened Theano/Theano#2944, but there would be more work to do to handle the reshapes as well.
Actually, the C code for the CPU seems to perform this max subtraction: https://github.com/Theano/Theano/blob/master/theano/tensor/nnet/nnet.py#L206 I could not get through the GPU version. Anyway, if
And now I am even more confused, because
It's not just subtracting the max. There's also the gradient computation.
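To illustrate the point about the gradient (my own numpy sketch, not from the thread): when softmax and cross-entropy are handled together, the gradient with respect to the scores reduces to the simple, stable expression `softmax(x) - onehot(y)`, whereas differentiating the composed expression symbolically goes through terms like `1/p` that can blow up when probabilities underflow.

```python
# Illustrative numpy check of the fused gradient softmax(x) - onehot(y).
import numpy as np

def stable_softmax(x):
    z = x - x.max(axis=1, keepdims=True)   # max subtraction for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

x = np.array([[2.0, -1.0, 0.5]])
y = np.array([0])                          # index of the correct class

p = stable_softmax(x)
grad = p.copy()
grad[np.arange(len(y)), y] -= 1.0          # dL/dx = softmax(x) - onehot(y)
print(grad)
```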
Okay, I see. But then, as I said above, we should not use it in
The computation in
@lamblin I noticed that theano itself provides a categorical cross-entropy function. For the one-hot case this uses CrossentropyCategorical1Hot, which seems to have a grad implemented, and its docs state that an application of the op will "probably" be optimized away.

While I don't like the sound of "probably", is there a reason not to use this on top of the softmax output? It'd be good to clear this stuff up anyway, either here or in theano. I'm not enough of an expert to debug it authoritatively, but for me using blocks' Softmax.categorical_cross_entropy seems to cause theano to do something suspiciously slow / inefficient (turning on profiling, around 2/3 of the time is spent in Softmax and Elemwise exp, which seems way too much...)
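For reference, a minimal sketch of the pattern being asked about, under the assumption that `theano.tensor.nnet.categorical_crossentropy` is the function referred to (variable names are illustrative, not blocks code): calling it on top of `nnet.softmax` with integer targets is the combination the optimizer is supposed to recognise and fuse into the ops mentioned later in the thread.

```python
# Sketch of categorical cross-entropy applied on top of nnet.softmax.
import theano
import theano.tensor as tt
from theano.tensor.nnet import softmax, categorical_crossentropy

x = tt.matrix('x')    # (batch, classes) pre-softmax scores
y = tt.ivector('y')   # integer class labels, one per row of x

probs = softmax(x)
cost = categorical_crossentropy(probs, y).mean()   # integer targets take the
                                                   # CrossentropyCategorical1Hot path
grad = tt.grad(cost, x)
f = theano.function([x, y], [cost, grad])
```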
So, there is an optimization that transforms the cross-entropy-of-a-softmax expression into the fused op. There are also optimizations trying to catch the equivalent expression (in the 2D case) written directly with indexing. The indexing form is easier to type, but the
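The "indexing form" mentioned above is presumably something along these lines (my reconstruction of the elided expression, not a quote from the comment): picking out the log-probability of the correct class by advanced indexing, a pattern Theano's optimizations also try to recognise.

```python
# Reconstruction of the indexing form (assumed, not quoted from the thread).
import theano
import theano.tensor as tt

x = tt.matrix('x')
y = tt.ivector('y')

log_probs = tt.log(tt.nnet.softmax(x))
cost = -log_probs[tt.arange(y.shape[0]), y].mean()   # log p of the true class

f = theano.function([x, y], cost)
```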
OK, yeah fair enough -- I can see this getting more complicated with recurrent models. Still, in the case of Softmax.categorical_cross_entropy, at least in the 1-hot vector case, from what I can see this does implement exactly the combination of crossentropy_categorical_onehot(softmax(...), target), no? In which case would it not be better for it to take advantage of the theano optimisation?

I've now figured out how to see the theano function graph after optimisations, and it is optimising this combination to something involving CrossentropySoftmaxArgmax1HotWithBias, for me at least, and the gradient to something using CrossentropySoftmax1HotWithBiasDx. There are GPU implementations of both of these, which is nice to take advantage of.

For a smaller model I'm finding the final softmax activation to be the bottleneck, which feels a bit wrong (the matrix operations should be the bottleneck), but after looking at the function graph I think it is probably just down to a slowish implementation, on the CPU at least.

Also, with the manual version of softmax where you do the max subtraction symbolically, it seems you end up with stuff in the function graph relating to the gradient with respect to the max operation, which is unnecessary and feels a bit funny to be doing this kind of numerical stuff symbolically. Feels like it needs sorting out upstream so this kind of hack isn't needed...
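A small sketch of the kind of graph inspection described above (illustrative, not the commenter's actual code): `theano.printing.debugprint` on a compiled function shows which ops the optimizer actually selected, so one can check for the fused CrossentropySoftmax*1Hot* ops, or see whether a hand-written max-subtracted softmax leaves extra max-gradient nodes in the graph.

```python
# Illustrative graph inspection; variable names are placeholders.
import theano
import theano.tensor as tt
from theano.tensor.nnet import softmax, categorical_crossentropy

x = tt.matrix('x')
y = tt.ivector('y')

# Built-in softmax + cross-entropy: expected to fuse into
# CrossentropySoftmaxArgmax1HotWithBias / CrossentropySoftmax1HotWithBiasDx.
cost = categorical_crossentropy(softmax(x), y).mean()
f = theano.function([x, y], [cost, tt.grad(cost, x)])
theano.printing.debugprint(f)

# Hand-written, symbolically max-subtracted softmax: compare its graph.
shifted = x - x.max(axis=1, keepdims=True)
manual = tt.exp(shifted) / tt.exp(shifted).sum(axis=1, keepdims=True)
cost2 = categorical_crossentropy(manual, y).mean()
g = theano.function([x, y], [cost2, tt.grad(cost2, x)])
theano.printing.debugprint(g)
```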
I guess I do not get something simple, but I will still ask: why are we not using theano.nnet.softmax there?