
Softmax

Previously, we introduced logistic regression, which solves binary classification problems. Softmax regression is a generalization of logistic regression that lets us solve multi-class classification problems.

Here is how it is derived:

Loss function of softmax regression

We first recall the loss function of logistic regression:

$$L = -\sum_{i=1}^{n}\Big[\,y_i\log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\Big]$$

where

$$\hat{y}_i = \sigma\!\left(w^\top x_i + b\right) = \frac{1}{1 + e^{-(w^\top x_i + b)}}$$

In logistic regression, we categorize the samples into a 0-class and a 1-class. The quantity $\hat{y}_i$ is regarded as the probability that sample $i$ belongs to the 1-class, so the probability of the 0-class is $1 - \hat{y}_i$.
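
As a small reminder, here is a minimal sketch of the two formulas above (function names are my own choosing, not from the original post):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(X, y, w, b):
    """Binary cross-entropy loss over n samples; y contains 0/1 labels."""
    y_hat = sigmoid(X.dot(w) + b)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))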

In softmax regression, instead, we would like to categorize the samples into $K$ classes. Softmax regression therefore has many more parameters: with $m$ input features per sample, we have $m \times K$ weight parameters, $K$ bias parameters, and a $K$-dimensional dependent variable for each sample. The dependent variable of each sample can be expressed as a one-hot vector $y = (y_1, y_2, \ldots, y_K)$, in which exactly one of the $K$ entries is 1 and all the others are 0.
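
As an illustration (a toy example of mine, not from the original data), integer class labels can be converted to such one-hot vectors like this:

import numpy as np

def to_one_hot(labels, K):
    """Convert integer class labels (0 .. K-1) into one-hot row vectors."""
    y = np.zeros((len(labels), K))
    y[np.arange(len(labels)), labels] = 1
    return y

print(to_one_hot([0, 2, 1], 3))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]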

We first write down the linear scores of the softmax model for each sample; for simplicity, I will drop the per-sample subscript:

$$a_k = w_k^\top x + b_k, \qquad k = 1, 2, \ldots, K$$

Then we normalize it:

$$\hat{y}_k = \frac{e^{a_k}}{\sum_{j=1}^{K} e^{a_j}}$$

It is easy to see that:

$$\sum_{k=1}^{K}\hat{y}_k = 1$$
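
As a quick numeric check (a minimal sketch of the two formulas above, with made-up scores), the softmax outputs indeed sum to 1:

import numpy as np

def softmax(a):
    """Naive softmax of a 1-D score vector a = (a_1, ..., a_K)."""
    e = np.exp(a)
    return e / e.sum()

a = np.array([2.0, 1.0, 0.1])   # made-up scores a_k for K = 3 classes
y_hat = softmax(a)
print(y_hat)         # roughly [0.659 0.242 0.099]
print(y_hat.sum())   # 1.0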

Then the loss function of softmax regression for a single sample is defined as:

$$L = -\sum_{k=1}^{K} y_k \log\hat{y}_k$$

Then the loss function across all samples will be:

$$L = -\sum_{i=1}^{n}\sum_{k=1}^{K} y_{ik}\log\hat{y}_{ik}$$
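
A minimal numpy sketch of this cross-entropy loss (assuming `Y` and `Y_hat` are $n \times K$ arrays of one-hot labels and softmax outputs; the names are my own choosing):

import numpy as np

def cross_entropy(Y, Y_hat, eps=1e-12):
    """Cross-entropy loss summed over n samples.
    Y     : (n, K) one-hot labels y_ik
    Y_hat : (n, K) softmax outputs y_hat_ik
    eps guards against log(0)."""
    return -np.sum(Y * np.log(Y_hat + eps))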

Gradient of softmax loss function

Again we start with the derivatives with respect to each parameter for a single sample. For one specific sample $x$, assume its true class is the $i$-th class, that is, $y_i = 1$ and $y_k = 0$ for all $k \neq i$. Then the loss function can be rewritten as:

$$L = -\log\hat{y}_i = -\log\frac{e^{a_i}}{\sum_{j=1}^{K} e^{a_j}}$$

Then we can work out the derivatives piece by piece:

$$\frac{\partial L}{\partial \hat{y}_i} = -\frac{1}{\hat{y}_i}$$

$$\frac{\partial \hat{y}_i}{\partial a_i} = \hat{y}_i\,(1 - \hat{y}_i)$$

$$\frac{\partial \hat{y}_i}{\partial a_k} = -\hat{y}_i\,\hat{y}_k \qquad (k \neq i)$$

here $\frac{\partial a_k}{\partial w_k} = x$, and

$$\frac{\partial a_k}{\partial b_k} = 1$$

Combining these with the chain rule, we obtain the derivative with respect to each parameter for one input sample.

For the true class $k = i$,

$$\frac{\partial L}{\partial w_i} = (\hat{y}_i - 1)\,x, \qquad \frac{\partial L}{\partial b_i} = \hat{y}_i - 1$$

and similarly for $k \neq i$,

$$\frac{\partial L}{\partial w_k} = \hat{y}_k\,x, \qquad \frac{\partial L}{\partial b_k} = \hat{y}_k$$

To make the implementation easier and faster, we also vectorize the derivatives above. Writing $y$ for the one-hot label vector and $\hat{y}$ for the vector of softmax outputs, we get:

$$\frac{\partial L}{\partial W} = x\,(\hat{y} - y)^\top, \qquad \frac{\partial L}{\partial b} = \hat{y} - y$$

With the single-sample gradient above, it is easy to accumulate the gradient over all samples: stack the samples into a matrix $X$ (with a column of ones appended for the bias) and compute $X^\top(\hat{Y} - Y)$, where $\hat{Y}$ and $Y$ are the $n \times K$ matrices of softmax outputs and one-hot labels. Check the Python implementation code below.
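
To convince ourselves the formula is correct, here is a small finite-difference check (my own sketch, not part of the original code; the function names and random data are made up). It compares the analytic gradient $X^\top(\hat{Y} - Y)$ with a numerical estimate:

import numpy as np

def softmax_rows(A):
    E = np.exp(A - A.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def loss(X, Y, theta):
    """Cross-entropy loss; X already has a bias column of ones."""
    Y_hat = softmax_rows(X.dot(theta))
    return -np.sum(Y * np.log(Y_hat))

def analytic_grad(X, Y, theta):
    return X.T.dot(softmax_rows(X.dot(theta)) - Y)

# tiny random problem: n=5 samples, m=3 features, K=4 classes
rng = np.random.RandomState(0)
n, m, K = 5, 3, 4
X = np.column_stack((rng.randn(n, m), np.ones(n)))
Y = np.eye(K)[rng.randint(0, K, n)]
theta = rng.randn(m + 1, K)

# central-difference estimate for one entry, e.g. theta[0, 0]
eps = 1e-6
t1, t2 = theta.copy(), theta.copy()
t1[0, 0] += eps
t2[0, 0] -= eps
num = (loss(X, Y, t1) - loss(X, Y, t2)) / (2 * eps)
print(num, analytic_grad(X, Y, theta)[0, 0])   # the two numbers should agree closely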

One more problem in the Python implementation

We noted above that we need to evaluate the softmax function for every input sample, and this can exceed the largest value a floating-point number can represent. For example, when some $a_k$ is very large, say 9999, computing the softmax requires $e^{9999}$, which will definitely overflow the float limit. The computer cannot evaluate this directly, so how can we implement it? (Note: does this also happen in logistic regression? Yes! A modified version will also be added to the logistic regression post.)

We use the following formula to calculate the softmax function instead:

$$\hat{y}_k = \frac{e^{a_k - M}}{\sum_{j=1}^{K} e^{a_j - M}}$$

where we set

$$M = \max_{k} a_k$$

Since every exponent $a_k - M \le 0$, each term is at most 1, so we never reach positive infinity.
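
A small demonstration of the trick (with made-up scores; the naive softmax from earlier would overflow here, while the shifted version does not):

import numpy as np

def softmax_stable(a):
    """Softmax with the maximum subtracted first, so every exponent is <= 0."""
    a = a - np.max(a)          # M = max_k a_k
    e = np.exp(a)
    return e / e.sum()

a = np.array([9999.0, 1.0, 2.0])
# np.exp(a) would overflow to inf in the naive softmax,
# but the shifted version is fine:
print(softmax_stable(a))       # [1. 0. 0.] up to numerical precision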

Python implementation of softmax regression using gradient descent

With the gradients derived above, we can easily implement softmax regression in Python.

import numpy as np

class softmax_regression:
  """
  Softmax regression with cross-entropy loss function, solved by gradient descent.
  """
  def __init__(self):
    self.w = None
    self.b = None
    self.y_pred = None

  def train(self, X, y, learning_rate=0.001, epoch=30000, methods="crossentropy"):
    """
    If the data is not in the required format, this function will not work.
    @param X, independent variable, example: X = np.array([[1,2,],[1,3,],[1,4], ...]), with size n * m: n samples, each with m features
    @param y, dependent variable, example: y = np.array([[0,1,0,],[1,0,0,],[0,0,1,], ...]) with size n * K: n samples, each with a K-dimensional one-hot response vector (one of the K positions is 1, the others are 0)
    Both X and y should be array-like.
    @param learning_rate, step size for gradient descent
    @param epoch, number of passes the training process makes over the whole dataset
    @param methods, the loss function; "crossentropy" is the only supported option
    @return, the full parameter matrix theta (weight rows stacked with the bias row)
    """
    n = len(X)
    m = len(X[0])
    K = len(y[0])
    if len(y) != n:
      print("Error, the input X, y should be the same length, while you have len(X)=%d and len(y)=%d" % (n, len(y)))
      return None
    self.n, self.m, self.K = n, m, K

    # following codes reformat the input
    X = np.array(X)
    y = np.array(y)

    # add a column of ones for the bias parameter
    X = np.column_stack((X, np.ones(n)))
    self.X = X
    self.y = y

    # initialize the parameters w and b (stacked into theta), then run gradient descent
    theta = np.random.randn(m + 1, K)
    for i in range(epoch):
      theta -= learning_rate * self.getGradient(theta, methods)
      self.epoch = i
    self.w = theta[:-1, :]   # weight matrix, size m * K
    self.b = theta[-1, :]    # bias vector, length K
    return theta

  def getGradient(self, theta, methods="crossentropy"):
    """
    Get the gradient of the loss function at parameter theta
    """
    methods_from = ["crossentropy"]
    if methods == "crossentropy":
      n, m, K = self.n, self.m, self.K
      a = self.X.dot(theta)
      # subtract the row-wise maximum before exponentiating, for numerical stability
      a_max = np.max(a, axis=1).reshape(n, 1)
      a -= a_max
      a_exp = np.exp(a)
      a_sum = np.sum(a_exp, axis=1).reshape(n, 1)
      s = a_exp / a_sum - self.y                 # softmax outputs minus one-hot labels, size n * K
      s = s.reshape(n, 1, K)
      delta = np.sum(s * (self.X.reshape(n, m + 1, 1)), axis=0)   # equals X^T (y_hat - y), size (m+1) * K
    else:
      raise ValueError("the methods parameter can only be one of %s" % methods_from)
    return delta

  def predict(self, X, y=None):
    """
    @param X, independent variable.
    Example: X = np.array([[1,2,],[1,3,],[1,4], ...])
    @param y, optional dependent variable in the same one-hot format as in train()
    @return, the predicted class probabilities for the input X, size n * K; if y is given, the root-mean-square error against the one-hot labels is stored in self.rmse
    """
    n = len(X)
    X = np.array(X)
    # linear scores, then the (numerically stable) softmax to get class probabilities
    a = X.dot(self.w) + self.b
    a -= np.max(a, axis=1).reshape(n, 1)
    a_exp = np.exp(a)
    y_pred = a_exp / np.sum(a_exp, axis=1).reshape(n, 1)
    self.y_pred = y_pred
    if y is not None:
      if n != len(y):
        print("Error, the input X, y should be the same length, while you have len(X)=%d and len(y)=%d" % (n, len(y)))
      else:
        y = np.array(y)
        self.rmse = np.sqrt(np.sum(np.square(y - y_pred)) / n)
    return self.y_pred
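
As a quick usage sketch (with a made-up toy dataset of my own, not from the original post), training on three Gaussian clusters might look like this:

import numpy as np

# toy data: 3 clusters in 2-D, 30 points each
rng = np.random.RandomState(1)
centers = np.array([[0, 0], [4, 0], [0, 4]])
X = np.vstack([c + 0.5 * rng.randn(30, 2) for c in centers])
labels = np.repeat([0, 1, 2], 30)
y = np.eye(3)[labels]                     # one-hot labels, size 90 * 3

model = softmax_regression()
model.train(X, y, learning_rate=0.001, epoch=5000)
probs = model.predict(X, y)               # class probabilities, size 90 * 3
print("training accuracy:", np.mean(np.argmax(probs, axis=1) == labels))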

Summary

We introduced how softmax regression works and provided Python code implementing softmax regression with gradient descent. With the knowledge introduced so far, we can now construct basic neural networks. I am updating the "deep learning from scratch" series in parallel with this softmax post. You can check the link if you would like to know how deep neural networks work: Deep Learning from Scratch