In this video, we will discuss a problem with the sigmoid activation function that kept neural networks from taking off sooner: the vanishing gradient problem.

Recall from the previous video that, for a very simple network of only two neurons, the derivatives of the error with respect to the weights were as follows. See how small the gradients are and, more importantly, how small the gradient of the error with respect to w1 is.

It turns out that because we are using the sigmoid function as the activation function, all the intermediate values in the network are between 0 and 1, and the derivative of the sigmoid itself is never larger than 0.25. So when we do backpropagation, we keep multiplying factors that are less than 1 by each other, and the gradients tend to get smaller and smaller as we move backward through the network. This means that the neurons in the earlier layers learn very slowly compared to the neurons in the later layers. The earliest layers in the network are the slowest to train. The result is a training process that takes too long and a prediction accuracy that is compromised.

This is the reason why we do not use the sigmoid function or similar functions as activation functions in the hidden layers, since they are prone to the vanishing gradient problem. In the next video, we will learn about other activation functions that have become very popular and are now used almost all the time in the hidden layers, since they help overcome the vanishing gradient problem.
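To make the shrinking effect concrete, here is a minimal sketch in Python. It is not the network from the previous video: it assumes a chain of six single sigmoid neurons, a squared error of the form E = 0.5 * (a_last - target)^2, and made-up values for the input, weights, biases, and target. The helper name chain_gradients is hypothetical as well. The sketch runs one forward pass and one backward pass, then prints the gradient of the error with respect to each weight.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def chain_gradients(x, weights, biases, target):
    """Return dE/dw for every weight in a chain of single sigmoid neurons.

    Assumes E = 0.5 * (a_last - target) ** 2 (an illustrative choice).
    """
    # Forward pass: activations[0] is the input, activations[i + 1] is the
    # output of neuron i.
    activations = [x]
    for w, b in zip(weights, biases):
        activations.append(sigmoid(w * activations[-1] + b))

    # Backward pass: delta holds dE/dz for the neuron currently visited.
    a_last = activations[-1]
    delta = (a_last - target) * a_last * (1.0 - a_last)

    grads = [0.0] * len(weights)
    for i in reversed(range(len(weights))):
        # dE/dw_i = dE/dz_i * (input to neuron i)
        grads[i] = delta * activations[i]
        if i > 0:
            # Step back one neuron: multiply by w_i and by the sigmoid
            # derivative a * (1 - a) of the previous neuron, which is <= 0.25.
            a_prev = activations[i]
            delta = delta * weights[i] * a_prev * (1.0 - a_prev)
    return grads


# Hypothetical values chosen only for illustration.
grads = chain_gradients(x=1.0, weights=[1.0] * 6, biases=[0.0] * 6, target=0.0)
for i, g in enumerate(grads, start=1):
    print(f"dE/dw{i} = {g:.3e}")
```

With these assumed values, each step backward multiplies the running term by a weight and by a sigmoid derivative of at most 0.25, so the printed gradients get smaller the closer we get to the first weight.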