Below are details about common activation functions. Choosing a suitable activation function can improve neural network performance.
Sigmoid Activation Function
Advantages:
- Non-linear: Captures non-linear relationships in the data.
- Probability Output: Outputs values between 0 and 1, which can be interpreted as probabilities for binary classification.
- Smooth Gradient: The function is smooth and differentiable everywhere, aiding gradient-based optimization.
- Historical Use: Mimics the firing rate of biological neurons, making it a traditional choice in neural networks.
Disadvantages:
- Saturating Function: For very high or very low inputs the gradient becomes very small, causing the vanishing gradient problem (illustrated in the sketch after this list).
- Not Zero-Centered: Outputs are always positive, leading to inefficient gradient updates and slower convergence.
- Computational Cost: Involves exponential calculations, making it more computationally expensive than ReLU.
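To make the (0, 1) output range and the saturation concrete, here is a minimal NumPy sketch of the sigmoid function and its derivative; the helper names sigmoid and sigmoid_grad are illustrative, not taken from any particular library.

```python
import numpy as np

def sigmoid(x):
    # 1 / (1 + e^(-x)): maps any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative: sigmoid(x) * (1 - sigmoid(x)); peaks at 0.25 when x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(x))       # outputs stay strictly inside (0, 1) and are never zero-centered
print(sigmoid_grad(x))  # gradient is nearly 0 at |x| = 10, showing saturation
```

At x = ±10 the gradient is roughly 4.5e-5, which is why long chains of sigmoid layers tend to suffer from vanishing gradients.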
Tanh (Hyperbolic Tangent) Activation Function
Advantages:
- Non-linear: Also captures non-linear relationships in the data.
- Zero-Centered Output: Outputs range from -1 to 1, helping to center the data and potentially speeding up convergence.
- Smooth Gradient: The gradient is larger for inputs close to zero than sigmoid's, reducing (but not eliminating) the vanishing gradient problem.
Disadvantages:
- Saturating Function: For very high or very low inputs the gradient still becomes small, causing the vanishing gradient problem, though less severely than sigmoid (see the sketch after this list).
- Computational Cost: Similar to sigmoid, it involves exponential calculations, making it computationally more expensive than ReLU.
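A minimal NumPy sketch of tanh's output range and gradient follows; the helper name tanh_grad is again illustrative.

```python
import numpy as np

def tanh_grad(x):
    # Derivative: 1 - tanh(x)^2; peaks at 1.0 when x = 0 (four times sigmoid's peak of 0.25)
    return 1.0 - np.tanh(x) ** 2

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(np.tanh(x))    # outputs lie in (-1, 1) and are centered around 0
print(tanh_grad(x))  # close to 0 at |x| = 5 (saturation), but larger than sigmoid's gradient near 0
```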
ReLU (Rectified Linear Unit) Activation Function
Advantages:
- Efficient Computation: Simple thresholding at zero makes it computationally cheap.
- Non-Saturating Gradient: Does not saturate for positive inputs, maintaining large gradients and efficient learning. Negative inputs, however, receive a zero gradient, which underlies the dying ReLU problem and can contribute to vanishing gradients for those units.
- Sparse Activation: Produces sparse outputs (many neurons output exactly zero), which can improve efficiency and generalization.
Disadvantages:
- Not Zero-Centered: Like sigmoid, ReLU outputs are not zero-centered, which can cause issues in gradient updates (this can be mitigated with batch normalization).
- Dying ReLU Problem: Neurons can "die" if they get stuck outputting zero for all inputs; their gradient is then always zero and they stop learning (see the sketch after this list).
- Unbounded Output: Outputs can grow very large, potentially requiring careful weight initialization and regularization.
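A minimal NumPy sketch of ReLU and its gradient, showing the sparse output and the zero-gradient region behind the dying ReLU problem; the helper names relu and relu_grad are illustrative.

```python
import numpy as np

def relu(x):
    # max(0, x): a simple threshold, no exponentials
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 1 for positive inputs and exactly 0 otherwise (the "dead" region)
    return (x > 0).astype(x.dtype)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))       # negative inputs collapse to 0, giving sparse activations
print(relu_grad(x))  # zero gradient for x <= 0: a neuron stuck in this region stops learning
```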