Neural Network Representation

I. A Shallow Neural Network Representation

Let’s use a shallow neural network for educational purposes. Shallow neural networks are neural networks with only one hidden layer. Therefore, there are only 3 layers in this neural network: an input layer, a hidden layer, and an output layer.

Let’s denote $X$ as our input, which has 3 features. Therefore,

$$X = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}$$

Meanwhile, $y$ is our label, which is a scalar.

From the given information, we define the following shallow neural network:

[Figure: Shallow Neural Network]

You may notice some new notation in the network diagram, but do not worry; we will explain it now.

The $i$-th layer is commonly represented as $a^{[i]}$, where $a$ stands for activation and $i$ indicates the $i$-th layer. Although no activation function is applied to the input layer, it is sometimes referred to as the $a^{[0]}$ layer; the index $i$ then counts from 1 at the first hidden layer up to the output layer. Moreover, $a^{[i]}$ is a vector:

$$a^{[i]} = \begin{bmatrix} a^{[i]}_1 \\ \vdots \\ a^{[i]}_n \end{bmatrix}$$

The $j$-th element of the $i$-th layer is denoted as $a^{[i]}_j$. For example, the first neuron in the first hidden layer is referred to as $a^{[1]}_1$.

Let’s talk about weights and biases. The weight matrix and the bias of the $i$-th layer are denoted as $W^{[i]}$ and $b^{[i]}$, respectively. Their shapes are the following:

$$W^{[i]} = (n_i, n_{i-1})$$

and

$$b^{[i]} = (n_i, 1)$$

where $n_i$ is the number of neurons in the $i$-th layer.

For example, in our network, $W^{[1]} = (4, 3)$ and $b^{[1]} = (4, 1)$, since there are 4 neurons in the first hidden layer and 3 features in the input layer.
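
As a minimal sketch of how these shapes translate into code (using NumPy; the variable names W1, b1, W2, b2 are my own and not part of the original notation), here is a parameter setup for our 3-feature input and 4-neuron hidden layer:

```python
import numpy as np

n_x = 3   # number of input features (n_0)
n_1 = 4   # neurons in the first hidden layer
n_2 = 1   # neurons in the output layer

# W[i] has shape (n_i, n_{i-1}); b[i] has shape (n_i, 1)
W1 = np.random.randn(n_1, n_x) * 0.01   # (4, 3)
b1 = np.zeros((n_1, 1))                 # (4, 1)
W2 = np.random.randn(n_2, n_1) * 0.01   # (1, 4)
b2 = np.zeros((n_2, 1))                 # (1, 1)

print(W1.shape, b1.shape)   # (4, 3) (4, 1)
```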

II. Computing a Neural Network Output

1. Single training example

Let’s do a forward pass on our neural network. The value of the first neuron in the first hidden layer is:

$$a^{[1]}_1 = \sigma(W^{[1]T}_1 x + b^{[1]}_1)$$

This applies the sigmoid activation function to a linear combination of the inputs. Similarly, we can compute the remaining neurons:

$$\begin{aligned}
a^{[1]}_1 &= \sigma(W^{[1]T}_1 x + b^{[1]}_1) \\
a^{[1]}_2 &= \sigma(W^{[1]T}_2 x + b^{[1]}_2) \\
a^{[1]}_3 &= \sigma(W^{[1]T}_3 x + b^{[1]}_3) \\
a^{[1]}_4 &= \sigma(W^{[1]T}_4 x + b^{[1]}_4)
\end{aligned}$$

We can vectorize these equations as follows:

$$\begin{bmatrix} W^{[1]T}_1 \\ W^{[1]T}_2 \\ W^{[1]T}_3 \\ W^{[1]T}_4 \end{bmatrix} x + \begin{bmatrix} b^{[1]}_1 \\ b^{[1]}_2 \\ b^{[1]}_3 \\ b^{[1]}_4 \end{bmatrix} = W^{[1]} x + b^{[1]}$$

where $x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}$, so that $a^{[1]} = \sigma(W^{[1]} x + b^{[1]})$.
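
To make the single-example forward pass concrete, here is a small NumPy sketch; the `sigmoid` helper, the example input values, and the variable names are assumptions for illustration only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([[0.5], [-1.2], [3.0]])   # a single input x, shape (3, 1)

W1 = np.random.randn(4, 3) * 0.01      # W[1], shape (4, 3)
b1 = np.zeros((4, 1))                  # b[1], shape (4, 1)

z1 = W1 @ x + b1                       # all four linear terms at once, shape (4, 1)
a1 = sigmoid(z1)                       # a[1] = sigma(W[1] x + b[1]), shape (4, 1)
print(a1.shape)                        # (4, 1)
```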

2. Multiple training examples

Let $X = \begin{bmatrix} x^{(1)} & x^{(2)} & \dots & x^{(m)} \end{bmatrix}$, where each column $x^{(i)}$ is one training example and $m$ is the number of examples.

Then, using matrix multiplication, we can calculate the hidden layer’s activation matrix:

$$A^{[1]} = \sigma(W^{[1]}X + b^{[1]})$$

where

  • $X: (n_{\text{features}}, m)$
  • $W^{[i]}: (n_i, n_{i-1})$
  • $b^{[i]}: (n_i, 1)$, broadcast across the $m$ columns
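
As a rough sketch under the same assumptions as before (NumPy, made-up values for $m$ and the inputs), the vectorized computation relies on broadcasting to add $b^{[1]}$ to every column of $W^{[1]}X$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

m = 5                              # number of training examples (made up for this demo)
X = np.random.randn(3, m)          # column i is example x^(i); shape (n_features, m) = (3, 5)

W1 = np.random.randn(4, 3) * 0.01  # W[1], shape (n_1, n_features)
b1 = np.zeros((4, 1))              # b[1], shape (n_1, 1); broadcast across the m columns

A1 = sigmoid(W1 @ X + b1)          # A[1], shape (4, m); column i is a[1] for x^(i)
print(A1.shape)                    # (4, 5)
```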