Notes on Neural Networks Part I – Backpropagation

As far as I know, the story behind deep learning goes like this: originally there were neural networks with only one hidden layer, which is not powerful enough; to increase their power, one may add more hidden layers. However, with the standard gradient-based learning algorithm, i.e. backpropagation, the gradients vanish quickly in the hidden layers when there is more than one, and due to the non-convexity of the objective, deep neural networks also suffer from poor local optima and overfitting, which kept neural networks in a downturn for a while. Neural networks seem to be born with the concept of “feature/representation learning”, but it was done purely in a supervised fashion until 2006, when Hinton et al. proposed unsupervised feature/representation learning to address these critical issues in training deep neural nets. The idea is called unsupervised pre-training, and it aims to initialize the deep neural network with much better weights than randomly assigned ones. After pre-training, the subsequent supervised learning (called fine-tuning) works much better than purely supervised learning from scratch. There are mainly two ways to do the pre-training: one uses Restricted Boltzmann Machines, the other uses stacked Autoencoders (however, some recent studies suggest that with rectifier units, deep neural nets can be trained well even without unsupervised pre-training).

Okay, back to the basics. Let me write down the classic Backpropagation learning algorithm for neural networks here:

Notations:

a_i^k : scalar output of node i at the k-th layer; a^k is an n^k \times 1 matrix; note that a^1 at the 1st layer, i.e. the input layer, is the input vector x_j of the j-th instance, and a^K at the output layer is the predicted output; the corresponding target vector is denoted t.

\delta_i^k : “Pseudo-error” for node i at the k-th layer, used for calculating gradients; \delta^k is an n^k \times 1 matrix.

W_{ij}^k : weight of the link from node i at the k-th layer to node j at the (k+1)-th layer; W^k is an n^k \times n^{k+1} matrix.

b_j^k : bias into node j at the (k+1)-th layer; b^k is an n^{k+1} \times 1 matrix.

h(x) : element-wise activation function applied to x; usually the sigmoid, tanh, or a rectifier.

\eta : learning rate.

x \circ y : element-wise product of two matrices.

Backpropagation with stochastic gradient descent, using the squared error loss function and the sigmoid activation function h(x):


Initialization: initialize all weights W^k and biases b^k (typically with small random values).
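To make this concrete, here is a minimal NumPy sketch of the initialization step; the layer widths, the variable names, and the 0.1 scale of the random weights are arbitrary choices of mine, not part of the algorithm:

import numpy as np

# layer widths n^1, ..., n^K (a toy example: 4 inputs, 5 hidden units, 3 outputs)
sizes = [4, 5, 3]
rng = np.random.default_rng(0)

# 0-indexed lists: W[k-1] corresponds to W^k (shape n^k x n^{k+1}), b[k-1] to b^k
W = [rng.normal(0.0, 0.1, size=(sizes[k], sizes[k + 1])) for k in range(len(sizes) - 1)]
b = [np.zeros(sizes[k + 1]) for k in range(len(sizes) - 1)]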

For each training instance, repeat the following until convergence or until the maximum number of iterations is reached:

% forward propagation

For each layer k > 1:

(1)   \begin{equation*} a^k = h( (W^{k-1})^T a^{k-1} + b^{k-1}) \end{equation*}
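Continuing the sketch above, the forward pass of equation 1 might look like this in NumPy (sigmoid and forward are my own helper names):

import numpy as np

def sigmoid(z):
    # element-wise logistic activation h(z)
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W, b):
    # forward propagation, equation (1): a^k = h((W^{k-1})^T a^{k-1} + b^{k-1})
    a = [x]                                  # a^1 is the input vector
    for Wk, bk in zip(W, b):
        a.append(sigmoid(Wk.T @ a[-1] + bk))
    return a                                 # a[k-1] corresponds to a^k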

% Back propagation of “pseudo errors”

For the top layer, say the K-th layer,

(2)   \begin{equation*} \delta^K = a^K \circ (\mathbf{1} - a^K) \circ (a^K - t) \end{equation*}

For each hidden layer k, going down from k = K-1 to k = 2,

(3)   \begin{equation*} \delta^k = a^k \circ (\mathbf{1} - a^k) \circ ( W^k \delta^{k+1}) \end{equation*}
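Continuing the sketch, equations 2 and 3 could be coded roughly like this (the list bookkeeping and function name are mine):

def backprop_deltas(a, W, t):
    # pseudo-errors for sigmoid units: eq. (2) at the output layer,
    # eq. (3) at each hidden layer; returns [delta^2, ..., delta^K]
    deltas = [a[-1] * (1 - a[-1]) * (a[-1] - t)]                  # delta^K, eq. (2)
    for k in range(len(a) - 2, 0, -1):                            # layers K-1 down to 2
        deltas.insert(0, a[k] * (1 - a[k]) * (W[k] @ deltas[0]))  # eq. (3)
    return deltas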

% Calculate and apply gradients

For the k-th layer (excluding the output layer),

(4)   \begin{equation*} \nabla_{W^k} = a^k (\delta^{k+1})^T \end{equation*}

(5)   \begin{equation*} \nabla_{b^k} = \delta^{k+1} \end{equation*}

(6)   \begin{equation*} W^k_{new} = W^k_{old} - \eta \nabla_{W^k} \end{equation*}

(7)   \begin{equation*} b^k_{new} = b^k_{old} - \eta \nabla_{b^k} \end{equation*}
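Equations 4-7 then become a small update routine, and the whole stochastic gradient descent loop ties the pieces together (again my own sketch; eta and data are placeholders):

import numpy as np

def sgd_step(a, deltas, W, b, eta):
    # gradients (eqs. 4-5) and in-place parameter update (eqs. 6-7)
    for k in range(len(W)):
        grad_W = np.outer(a[k], deltas[k])   # eq. (4): a^k (delta^{k+1})^T
        grad_b = deltas[k]                   # eq. (5): delta^{k+1}
        W[k] -= eta * grad_W
        b[k] -= eta * grad_b

# one stochastic pass over a dataset of (x, t) pairs:
# for x, t in data:
#     a = forward(x, W, b)
#     deltas = backprop_deltas(a, W, t)
#     sgd_step(a, deltas, W, b, eta=0.1)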


For general loss functions and activation functions, writing z^k = (W^{k-1})^T a^{k-1} + b^{k-1} for the pre-activation at layer k (so that a^k = h(z^k)), we only need to replace equations 2 and 3 with:

(8)   \begin{equation*} \delta^K = h'(z^K) \circ \frac{\partial E}{\partial a^K} \end{equation*}

And,

(9)   \begin{equation*} \delta^k = h'(z^k) \circ ( W^k \delta^{k+1}) \end{equation*}

Where h'(z^k) is the element-wise derivative of the activation function h, evaluated at each dimension of the pre-activation vector z^k, and \frac{\partial E}{\partial a^K} is the element-wise gradient of the error/loss function with respect to the predicted output a^K = h(z^K).
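As a quick sanity check, for the sigmoid activation and the squared error loss E = \frac{1}{2} \| a^K - t \|^2 we have

\begin{equation*} h'(z^K) = a^K \circ (\mathbf{1} - a^K), \qquad \frac{\partial E}{\partial a^K} = a^K - t, \end{equation*}

so equation 8 reduces exactly to equation 2 (and, in the same way, equation 9 reduces to equation 3).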