I don't think it makes sense to choose an activation function based on desired properties of the output; you can always insert a calibration step that maps the 'neural network score' to whatever units you actually want (dollars, probability, etc.).
So I think the preference between activation functions mostly boils down to the intrinsic properties of the functions themselves (like whether or not they're continuously differentiable). And when the two options differ by just a linear transformation, I don't think there's a meaningful difference between them at all, since the network's weights can absorb that transformation.
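For a concrete case (assuming the comparison at hand is something like sigmoid vs. tanh, which I'm using purely as an illustration), the identity tanh(x) = 2·sigmoid(2x) − 1 makes the linear relationship explicit:

```python
import math

def sigmoid(x):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

# tanh is just a rescaled and shifted sigmoid:
#   tanh(x) = 2 * sigmoid(2x) - 1
# so a layer ending in tanh and one ending in sigmoid differ
# only by a fixed linear transformation of the output.
for x in [-3.0, -0.5, 0.0, 1.2, 4.0]:
    assert abs(math.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0)) < 1e-12
```

Since the next layer (or a calibration step) can fold that scale and shift into its own weights and bias, the choice between the two doesn't change what the network can represent.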