# Activation Functions

Neural networks rely on a nonlinear transformation to learn nonlinear relationships in data. These nonlinear transformations are typically fixed functions that are applied after a linear transformation of the data. The linear transformation uses learned weights, while the nonlinear function is fixed in that there are no learned parameters. In most cases, these nonlinear functions can be thought of as activation functions that indicate the state of a unit within a layer of a neural network, given some data.
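As a sketch of this idea (hypothetical names, not slugnet's actual API), a layer applies a fixed nonlinearity to the output of a learned linear transformation:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # learned weights of the linear transformation
b = np.zeros(4)               # learned bias
x = rng.normal(size=3)        # input data

z = x @ W + b                 # linear transformation (learned parameters)
a = np.maximum(0.0, z)        # fixed nonlinearity (here ReLU); no parameters to learn
```

Training adjusts `W` and `b`; the activation function itself stays fixed.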

class slugnet.activation.ReLU

Bases: slugnet.activation.Activation

The common rectified linear unit, or ReLU activation function.

A rectified linear unit implements the nonlinear function $\phi(z) = \max(0, z)$.
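A minimal NumPy sketch of this function (illustrative; slugnet's own implementation may differ):

```python
import numpy as np

def relu(z):
    # max(0, z), applied elementwise
    return np.maximum(0.0, z)

relu(np.array([-2.0, 0.0, 3.0]))  # → array([0., 0., 3.])
```

Negative inputs are clamped to zero; positive inputs pass through unchanged.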

class slugnet.activation.Tanh

Bases: slugnet.activation.Activation

The hyperbolic tangent activation function.

A hyperbolic tangent activation function implements the nonlinearity given by $\phi(z) = \tanh(z)$, which is equivalent to $\frac{e^z - e^{-z}}{e^z + e^{-z}}$.
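The exponential form can be checked directly against NumPy's built-in `np.tanh` (a sketch for illustration, not slugnet's code):

```python
import numpy as np

def tanh(z):
    # (e^z - e^{-z}) / (e^z + e^{-z})
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

z = np.array([-1.0, 0.0, 1.0])
np.allclose(tanh(z), np.tanh(z))  # → True
```

Unlike ReLU, tanh is bounded, squashing its input into $(-1, 1)$ and centering outputs around zero.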

class slugnet.activation.Sigmoid

Bases: slugnet.activation.Activation

Represent a probability distribution over two classes.

The sigmoid function is given by $\sigma(z) = \frac{1}{1 + e^{-z}}$.
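A minimal sketch of this formula (illustrative; slugnet's implementation may differ):

```python
import numpy as np

def sigmoid(z):
    # 1 / (1 + e^{-z}); outputs lie in the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

sigmoid(0.0)  # → 0.5
```

Because the output lies in $(0, 1)$, it can be read as the probability of one of the two classes.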

class slugnet.activation.Softmax

Bases: slugnet.activation.Activation

Represent a probability distribution over $K$ classes.

The softmax activation function is given by

$$\mathrm{softmax}(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$$

where $K$ is the number of classes. We can see that softmax is a generalization of the sigmoid function to $K$ classes. Below, we derive the sigmoid function using softmax with two classes.

With two classes, the softmax probability of the first class is

$$P(y = 1 \mid z) = \frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{-(z_1 - z_2)}}$$

We substitute $z = z_1 - z_2$ because we only need one variable to represent the probability distribution over two classes. This leaves us with the definition of the sigmoid function, $\sigma(z) = \frac{1}{1 + e^{-z}}$.
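The derivation above can be checked numerically with a small sketch (illustrative; not slugnet's implementation):

```python
import numpy as np

def softmax(z):
    # Subtracting max(z) before exponentiating is a standard numerical-stability
    # trick; it does not change the result, since the shift cancels in the ratio.
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([1.0, 2.0, 3.0]))  # entries are positive and sum to 1

# Two-class case: softmax([z, 0])[0] equals 1 / (1 + e^{-z}), the sigmoid.
two_class = softmax(np.array([2.0, 0.0]))[0]
```

Larger logits receive proportionally larger probabilities, and the two-class case collapses to the sigmoid exactly as derived.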