*2.6. Activation Layers*

As mentioned in the previous section, the main purpose of the activation layer is to provide a non-linear transform. These functions must be differentiable, since gradients are propagated through them during training. There are several types of activation functions, including sigmoid (4), tanh (5) and rectified linear unit (ReLU) (6):

$$\text{sigmoid}(\mathbf{x}) = \frac{1}{1 + \mathbf{e}^{-\mathbf{x}}} \tag{4}$$

$$\tanh(\mathbf{x}) = \frac{\mathbf{e}^{\mathbf{x}} - \mathbf{e}^{-\mathbf{x}}}{\mathbf{e}^{\mathbf{x}} + \mathbf{e}^{-\mathbf{x}}} \tag{5}$$

$$\text{ReLU}(\mathbf{x}) = \max(\mathbf{x}, \mathbf{0}). \tag{6}$$
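As a concrete illustration, the following is a minimal NumPy sketch of equations (4)-(6); the function names are ours, not part of the original text:

```python
import numpy as np

def sigmoid(x):
    # Eq. (4): squashes input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Eq. (5): squashes input into (-1, 1); equivalent to np.tanh
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def relu(x):
    # Eq. (6): zero for negative inputs, identity for positive inputs
    return np.maximum(x, 0.0)

x = np.linspace(-3.0, 3.0, 7)
print(sigmoid(x))  # all values in (0, 1)
print(tanh(x))     # all values in (-1, 1)
print(relu(x))     # negative entries clipped to 0; positives unchanged
```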

Unfortunately, all of these activation functions have flaws. When training with gradient descent, ReLU can suffer from exploding gradients, because it does not bound its output: positive inputs pass through unchanged. ReLU also causes the dead ReLU problem: when a unit receives a negative input, it outputs 0, which blocks the gradient along that path for that update, or, in the worst case, the unit never updates again for the rest of training. Sigmoid and tanh, on the other hand, suffer from the vanishing gradient problem: the derivative of the sigmoid is at most 1/4, so gradients backpropagated through many such layers shrink exponentially with depth. Comparing these two groups of activation functions, the problems of ReLU can be mitigated by adding a normalization layer, and ReLU is also faster to compute. For these reasons, ReLU is now the most commonly used activation function.
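The gradient behaviour described above can be checked numerically. Below is a hedged sketch (the helper names are ours) that evaluates the derivatives used during backpropagation and shows both the 1/4 bound on the sigmoid gradient and the zero gradient of a dead ReLU:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Derivatives that multiply the gradient during backpropagation.
def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)               # peaks at 1/4 when x = 0

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2       # peaks at 1 when x = 0

def d_relu(x):
    return np.where(np.asarray(x) > 0, 1.0, 0.0)  # exactly 0 for negative inputs

x = np.linspace(-5.0, 5.0, 1001)
print(d_sigmoid(x).max())   # ~0.25: each sigmoid layer scales gradients by <= 1/4
print(d_tanh(x).max())      # ~1.0: tanh gradients saturate less severely
print(d_relu(-2.0))         # 0.0: a negative pre-activation blocks the gradient ("dead ReLU")

# Stacking n sigmoid layers shrinks the gradient by at most (1/4)^n.
print(0.25 ** 10)           # ~9.5e-07: vanishing gradient after only 10 layers
```

The last line makes the vanishing-gradient argument concrete: after ten sigmoid layers, the surviving gradient magnitude is bounded by roughly 10^-6 of its original value.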
