
Softmax

Previously, we introduced logistic regression, which solves binary classification problems. Softmax regression is a generalization of logistic regression that lets us solve multi-class classification problems.

Here is how it is derived:

Loss function of softmax regression

We first recall the loss function of logistic regression:

$$L = -\sum_{i=1}^{n}\Big[\,y_i\log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\Big]$$

where

$$\hat{y}_i = \sigma\!\left(w^\top x_i + b\right) = \frac{1}{1 + e^{-(w^\top x_i + b)}}$$

In logistic regression, we categorize the samples into a 0-class and a 1-class. The quantity $\hat{y}_i$ is regarded as the probability that sample $i$ belongs to the 1-class, so the probability of the 0-class is $1 - \hat{y}_i$.
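
As a small reminder, here is a minimal sketch of the two formulas above (function names are my own choosing, not from the original post):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(X, y, w, b):
    """Binary cross-entropy loss over n samples; y contains 0/1 labels."""
    y_hat = sigmoid(X.dot(w) + b)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))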

In softmax regression, instead, we would like to categorize the samples into $K$ classes. Softmax regression therefore has many more parameters: with $m$ input features per sample, we have $m \times K$ weight parameters, $K$ bias parameters, and a $K$-dimensional dependent variable for each sample. The dependent variable of each sample can be expressed as a one-hot vector $y = (y_1, y_2, \ldots, y_K)$, in which exactly one of the $K$ entries is 1 and all the others are 0.
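
As an illustration (a toy example of mine, not from the original data), integer class labels can be converted to such one-hot vectors like this:

import numpy as np

def to_one_hot(labels, K):
    """Convert integer class labels (0 .. K-1) into one-hot row vectors."""
    y = np.zeros((len(labels), K))
    y[np.arange(len(labels)), labels] = 1
    return y

print(to_one_hot([0, 2, 1], 3))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]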

We first write down the linear scores of the softmax model for each sample; for simplicity, I will drop the per-sample subscript:

$$a_k = w_k^\top x + b_k, \qquad k = 1, 2, \ldots, K$$

Then we normalize it:

$$\hat{y}_k = \frac{e^{a_k}}{\sum_{j=1}^{K} e^{a_j}}$$

It is easy to see that:

$$\sum_{k=1}^{K}\hat{y}_k = 1$$
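
As a quick numeric check (a minimal sketch of the two formulas above, with made-up scores), the softmax outputs indeed sum to 1:

import numpy as np

def softmax(a):
    """Naive softmax of a 1-D score vector a = (a_1, ..., a_K)."""
    e = np.exp(a)
    return e / e.sum()

a = np.array([2.0, 1.0, 0.1])   # made-up scores a_k for K = 3 classes
y_hat = softmax(a)
print(y_hat)         # roughly [0.659 0.242 0.099]
print(y_hat.sum())   # 1.0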

Then the loss function of softmax regression for a single sample is defined as:

$$L = -\sum_{k=1}^{K} y_k \log\hat{y}_k$$

Then the loss function across all samples will be:

$$L = -\sum_{i=1}^{n}\sum_{k=1}^{K} y_{ik}\log\hat{y}_{ik}$$
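
A minimal numpy sketch of this cross-entropy loss (assuming `Y` and `Y_hat` are $n \times K$ arrays of one-hot labels and softmax outputs; the names are my own choosing):

import numpy as np

def cross_entropy(Y, Y_hat, eps=1e-12):
    """Cross-entropy loss summed over n samples.
    Y     : (n, K) one-hot labels y_ik
    Y_hat : (n, K) softmax outputs y_hat_ik
    eps guards against log(0)."""
    return -np.sum(Y * np.log(Y_hat + eps))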

Gradient of softmax loss function

Again we start with the derivatives with respect to each parameter for a single sample. For one specific sample $x$, assume its true class is the $i$-th class, that is, $y_i = 1$ and $y_k = 0$ for all $k \neq i$. Then the loss function can be rewritten as:

$$L = -\log\hat{y}_i = -\log\frac{e^{a_i}}{\sum_{j=1}^{K} e^{a_j}}$$

Then we can work out the derivatives piece by piece:

$$\frac{\partial L}{\partial \hat{y}_i} = -\frac{1}{\hat{y}_i}$$

$$\frac{\partial \hat{y}_i}{\partial a_i} = \hat{y}_i\,(1 - \hat{y}_i)$$

$$\frac{\partial \hat{y}_i}{\partial a_k} = -\hat{y}_i\,\hat{y}_k \qquad (k \neq i)$$

here $\frac{\partial a_k}{\partial w_k} = x$, and

$$\frac{\partial a_k}{\partial b_k} = 1$$

Combining these with the chain rule, we obtain the derivative with respect to each parameter for one input sample.

For the true class $k = i$,

$$\frac{\partial L}{\partial w_i} = (\hat{y}_i - 1)\,x, \qquad \frac{\partial L}{\partial b_i} = \hat{y}_i - 1$$

and similarly for $k \neq i$,

$$\frac{\partial L}{\partial w_k} = \hat{y}_k\,x, \qquad \frac{\partial L}{\partial b_k} = \hat{y}_k$$

To make the implementation easier and faster, we also vectorize the derivatives above. Writing $y$ for the one-hot label vector and $\hat{y}$ for the vector of softmax outputs, we get:

$$\frac{\partial L}{\partial W} = x\,(\hat{y} - y)^\top, \qquad \frac{\partial L}{\partial b} = \hat{y} - y$$

With the single-sample gradient above, it is easy to accumulate the gradient over all samples: stack the samples into a matrix $X$ (with a column of ones appended for the bias) and compute $X^\top(\hat{Y} - Y)$, where $\hat{Y}$ and $Y$ are the $n \times K$ matrices of softmax outputs and one-hot labels. Check the Python implementation code below.
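
To convince ourselves the formula is correct, here is a small finite-difference check (my own sketch, not part of the original code; the function names and random data are made up). It compares the analytic gradient $X^\top(\hat{Y} - Y)$ with a numerical estimate:

import numpy as np

def softmax_rows(A):
    E = np.exp(A - A.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def loss(X, Y, theta):
    """Cross-entropy loss; X already has a bias column of ones."""
    Y_hat = softmax_rows(X.dot(theta))
    return -np.sum(Y * np.log(Y_hat))

def analytic_grad(X, Y, theta):
    return X.T.dot(softmax_rows(X.dot(theta)) - Y)

# tiny random problem: n=5 samples, m=3 features, K=4 classes
rng = np.random.RandomState(0)
n, m, K = 5, 3, 4
X = np.column_stack((rng.randn(n, m), np.ones(n)))
Y = np.eye(K)[rng.randint(0, K, n)]
theta = rng.randn(m + 1, K)

# central-difference estimate for one entry, e.g. theta[0, 0]
eps = 1e-6
t1, t2 = theta.copy(), theta.copy()
t1[0, 0] += eps
t2[0, 0] -= eps
num = (loss(X, Y, t1) - loss(X, Y, t2)) / (2 * eps)
print(num, analytic_grad(X, Y, theta)[0, 0])   # the two numbers should agree closely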

One more problem in the Python implementation

We noted above that we need to evaluate the softmax function for every input sample, and this can exceed the largest value a floating-point number can represent. For example, when some $a_k$ is very large, say 9999, computing the softmax requires $e^{9999}$, which will definitely overflow the float limit. The computer cannot evaluate this directly, so how can we implement it? (Note: does this also happen in logistic regression? Yes! A modified version will also be added to the logistic regression post.)

We use the following formula to calculate the softmax function instead:

$$\hat{y}_k = \frac{e^{a_k - M}}{\sum_{j=1}^{K} e^{a_j - M}}$$

where we set

$$M = \max_{k} a_k$$

Since every exponent $a_k - M \le 0$, each term is at most 1, so we never reach positive infinity.
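
A small demonstration of the trick (with made-up scores; the naive softmax from earlier would overflow here, while the shifted version does not):

import numpy as np

def softmax_stable(a):
    """Softmax with the maximum subtracted first, so every exponent is <= 0."""
    a = a - np.max(a)          # M = max_k a_k
    e = np.exp(a)
    return e / e.sum()

a = np.array([9999.0, 1.0, 2.0])
# np.exp(a) would overflow to inf in the naive softmax,
# but the shifted version is fine:
print(softmax_stable(a))       # [1. 0. 0.] up to numerical precision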

Python implementation of softmax regression using gradient descent

With the gradients derived above, we can easily implement softmax regression in Python.

import numpy as np

class softmax_regression:
  """
  Softmax regression with cross-entropy loss function, solved by gradient descent.
  """
  def __init__(self):
    self.w = None
    self.b = None
    self.y_pred = None

  def train(self, X, y, learning_rate=0.001, epoch=30000, methods="crossentropy"):
    """
    If the data is not in the required format, this function will not work.
    @param X, independent variable, example: X = np.array([[1,2,],[1,3,],[1,4], ...]), with size n * m: n samples, each with m features
    @param y, dependent variable, example: y = np.array([[0,1,0,],[1,0,0,],[0,0,1,], ...]) with size n * K: n samples, each with a K-dimensional one-hot response vector (one of the K positions is 1, the others are 0)
    Both X and y should be array-like.
    @param learning_rate, step size for gradient descent
    @param epoch, number of passes the training process makes over the whole dataset
    @param methods, the loss function; "crossentropy" is the only supported option
    @return, the full parameter matrix theta (weight rows stacked with the bias row)
    """
    n = len(X)
    m = len(X[0])
    K = len(y[0])
    if len(y) != n:
      print("Error, the input X, y should be the same length, while you have len(X)=%d and len(y)=%d" % (n, len(y)))
      return None
    self.n, self.m, self.K = n, m, K

    # following codes reformat the input
    X = np.array(X)
    y = np.array(y)

    # add a column of ones for the bias parameter
    X = np.column_stack((X, np.ones(n)))
    self.X = X
    self.y = y

    # initialize the parameters w and b (stacked into theta), then run gradient descent
    theta = np.random.randn(m + 1, K)
    for i in range(epoch):
      theta -= learning_rate * self.getGradient(theta, methods)
      self.epoch = i
    self.w = theta[:-1, :]   # weight matrix, size m * K
    self.b = theta[-1, :]    # bias vector, length K
    return theta

  def getGradient(self, theta, methods="crossentropy"):
    """
    Get the gradient of the loss function at parameter theta
    """
    methods_from = ["crossentropy"]
    if methods == "crossentropy":
      n, m, K = self.n, self.m, self.K
      a = self.X.dot(theta)
      # subtract the row-wise maximum before exponentiating, for numerical stability
      a_max = np.max(a, axis=1).reshape(n, 1)
      a -= a_max
      a_exp = np.exp(a)
      a_sum = np.sum(a_exp, axis=1).reshape(n, 1)
      s = a_exp / a_sum - self.y                 # softmax outputs minus one-hot labels, size n * K
      s = s.reshape(n, 1, K)
      delta = np.sum(s * (self.X.reshape(n, m + 1, 1)), axis=0)   # equals X^T (y_hat - y), size (m+1) * K
    else:
      raise ValueError("the methods parameter can only be one of %s" % methods_from)
    return delta

  def predict(self, X, y=None):
    """
    @param X, independent variable.
    Example: X = np.array([[1,2,],[1,3,],[1,4], ...])
    @param y, optional dependent variable in the same one-hot format as in train()
    @return, the predicted class probabilities for the input X, size n * K; if y is given, the root-mean-square error against the one-hot labels is stored in self.rmse
    """
    n = len(X)
    X = np.array(X)
    # linear scores, then the (numerically stable) softmax to get class probabilities
    a = X.dot(self.w) + self.b
    a -= np.max(a, axis=1).reshape(n, 1)
    a_exp = np.exp(a)
    y_pred = a_exp / np.sum(a_exp, axis=1).reshape(n, 1)
    self.y_pred = y_pred
    if y is not None:
      if n != len(y):
        print("Error, the input X, y should be the same length, while you have len(X)=%d and len(y)=%d" % (n, len(y)))
      else:
        y = np.array(y)
        self.rmse = np.sqrt(np.sum(np.square(y - y_pred)) / n)
    return self.y_pred
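
As a quick usage sketch (with a made-up toy dataset of my own, not from the original post), training on three Gaussian clusters might look like this:

import numpy as np

# toy data: 3 clusters in 2-D, 30 points each
rng = np.random.RandomState(1)
centers = np.array([[0, 0], [4, 0], [0, 4]])
X = np.vstack([c + 0.5 * rng.randn(30, 2) for c in centers])
labels = np.repeat([0, 1, 2], 30)
y = np.eye(3)[labels]                     # one-hot labels, size 90 * 3

model = softmax_regression()
model.train(X, y, learning_rate=0.001, epoch=5000)
probs = model.predict(X, y)               # class probabilities, size 90 * 3
print("training accuracy:", np.mean(np.argmax(probs, axis=1) == labels))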

Summary

We introduced how softmax regression works and provided Python code implementing softmax regression with gradient descent. With the knowledge introduced so far, we can now construct basic neural networks. I am updating the "deep learning from scratch" series in parallel with this softmax post. You can check the link if you would like to know how deep neural networks work: Deep Learning from Scratch