I. How a Neural Network learns
We can divide the learning process of a Neural Network into two steps:
- Forward propagation: the Neural Network passes the inputs through its layers, each layer performing calculations on the outputs of the previous one, until the final layer produces the predictions.
- Backward propagation: the Neural Network efficiently calculates the gradient of the loss with respect to the weights of a layer by reusing the gradient already computed for the next layer (the chain rule). Algorithms such as Gradient Descent can then minimize the loss by updating the weights with these gradients.
The hard work, and the appeal, of Neural Networks lies in Backward Propagation. Imagine a model with millions of parameters (millions of weights and biases): writing out millions of gradient expressions by hand would take more time than training the model itself!
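To get a feel for what autograd will automate, here is a minimal hand-written sketch of both steps for a single neuron with a squared-error loss (a toy example of ours; the numbers are made up). A real network repeats this chain-rule bookkeeping for every weight:
# Toy example: one neuron, one training sample, squared-error loss.
x, y = 3.0, 10.0                 # input and target
w, b = 0.5, 0.1                  # the two parameters
# Forward propagation: compute the prediction, then the loss.
pred = w * x + b                 # pred = 1.6
loss = (pred - y) ** 2           # loss = 70.56
# Backward propagation: the chain rule, written out by hand.
dloss_dpred = 2 * (pred - y)     # d(loss)/d(pred) = -16.8
dloss_dw = dloss_dpred * x       # d(pred)/dw = x, so -50.4
dloss_db = dloss_dpred * 1       # d(pred)/db = 1, so -16.8
Two parameters already take three hand-derived expressions; millions of parameters are hopeless without automation.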
II. PyTorch tracks our computations
PyTorch is our savior here! PyTorch’s Tensor can track the computations it has been passed through using DAGs (we will cover this in detail in another post). But to enable this feature, as well as automatic gradient computation for a Tensor, the Tensor must be created with the `requires_grad` argument enabled. For example, take a Tensor `a`:
In [15]: a = torch.tensor([[1,2],[3,4]], requires_grad=True)
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[15], line 1
----> 1 a = torch.tensor([[1,2],[3,4]], requires_grad=True)
RuntimeError: Only Tensors of floating point and complex dtype can require gradients
We got an error telling us that only floating-point and complex Tensors can require gradients: gradients are only defined over continuous (real or complex) values, not integers.
In [16]: a = torch.tensor([1., 2., 3., 4.], requires_grad=True)
In [17]: a
Out[17]: tensor([1., 2., 3., 4.], requires_grad=True)
Here, we created a Tensor `a` with four elements. Moreover, we initialized the Tensor with an extra argument, `requires_grad=True`! This argument tells PyTorch to keep track of the computation tree of Tensor `a`. Let’s do some computations with it to check whether PyTorch does that (notice the `grad_fn` part of each resulting Tensor):
In [18]: b = 2 * a
In [19]: c = b + 5
In [20]: d = c ** 2
In [21]: d
Out[21]:
tensor([ 49., 81., 121., 169.], grad_fn=<PowBackward0>)
Let’s summarize what we have done here. We multiplied Tensor `a` by 2 and assigned the result to `b`. After that, we added 5 to `b` (assigned to `c`), then raised `c` to the power of 2. In other words, `d = (2 * a + 5) ** 2`, elementwise.
By using the `grad_fn` attribute, we can trace back the operation that produced the current Tensor:
In [22]: d.grad_fn
Out[22]: <PowBackward0 at 0x7f0432ec9e20>
In [23]: d.grad_fn.next_functions
Out[23]: ((<AddBackward0 at 0x7f043225a430>, 0),)
`d.grad_fn` shows us `PowBackward0`, because we raised the Tensor to the power of 2. We can go further and view the previous operation through the `next_functions` attribute, which returns a tuple containing the previous operations (each wrapped in another tuple with its input index) and allows us to traverse back through them. In the following example, we track back to the addition and the multiplication, which are the very same objects as `c`’s and `b`’s `grad_fn`:
In [24]: d.grad_fn.next_functions[0][0]
Out[24]: <AddBackward0 at 0x7f043225a430>
In [25]: d.grad_fn.next_functions[0][0].next_functions[0][0]
Out[25]: <MulBackward0 at 0x7f04320e1370>
In [26]: c.grad_fn
Out[26]: <AddBackward0 at 0x7f043225a430>
In [27]: b.grad_fn
Out[27]: <MulBackward0 at 0x7f04320e1370>
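With `grad_fn` and `next_functions`, we can even write a tiny helper (our own sketch, not a built-in PyTorch utility) that walks the whole graph backwards and prints every recorded operation:
def print_graph(fn, depth=0):
    # Recursively follow next_functions until we run out of nodes.
    if fn is None:
        return
    print("  " * depth + type(fn).__name__)
    for next_fn, _ in fn.next_functions:
        print_graph(next_fn, depth + 1)

print_graph(d.grad_fn)
# PowBackward0
#   AddBackward0
#     MulBackward0
#       AccumulateGrad
The `AccumulateGrad` node at the bottom is the leaf of the graph: it is where the computed gradient is deposited into `a.grad` when we later call `backward()`.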
III. PyTorch’s Autograd
1. Derive the Gradients
Since it can “remember” the computation history, PyTorch gives us a handy method to calculate the gradients of the component Tensors based on that history.
But first, let’s reduce `d` to a scalar, since PyTorch only lets us call `backward()` without arguments on a scalar output. We also check what the `grad_fn` of `out` is:
In [30]: out = d.sum()
In [44]: out.grad_fn
Out[44]: <SumBackward0 at 0x7f0432f240d0>
Then, call the `backward()` method on `out` to calculate the gradients based on the computation graph.
In [31]: out.backward()
As a reminder, we did the following operations:
In [16]: a = torch.tensor([1.,2.,3.,4.], requires_grad=True)
In [18]: b = 2 * a
In [19]: c = b + 5
In [20]: d = c ** 2
In [30]: out = d.sum()
The gradient of `a` is

$$\frac{\partial\,\text{out}}{\partial a_i} = \frac{\partial}{\partial a_i}\,(2a_i + 5)^2 = 4\,(2a_i + 5)$$

which, for $a = [1, 2, 3, 4]$, gives $[28, 36, 44, 52]$. This is essentially the same as PyTorch’s answer:
In [23]: a.grad
Out[23]: tensor([28., 36., 44., 52.])
PyTorch calculates the gradients using the chain rule, in a similar manner to how we would compute them by hand.
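We can double-check this by applying the chain rule ourselves and comparing against autograd’s result (a quick sanity check of ours, continuing the session above):
# out = sum((2a + 5)**2), so d(out)/da = 2 * (2a + 5) * 2 = 4 * (2a + 5)
with torch.no_grad():                        # plain arithmetic, no graph needed
    manual_grad = 4 * (2 * a + 5)
print(torch.allclose(a.grad, manual_grad))   # True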
2. Gradient Accumulation
a) Tensors
To keep the design simple and the memory footprint low, PyTorch does not overwrite gradients: each call to `backward()` adds the newly computed gradients into the Tensor’s `.grad` buffer, on top of whatever is already stored there.
For example:
In [7]: out.backward(retain_graph=True)
In [8]: a.grad
Out[8]: tensor([28., 36., 44., 52.])
The result is as expected. Let’s calculate the gradients again!
In [9]: out.backward(retain_graph=True)
You may expect the result to be the same as the previous command. But it is not! In fact, the result is doubled, as the newly calculated gradients accumulate on top of the previous ones.
In [10]: a.grad
Out[10]: tensor([ 56., 72., 88., 104.])
When calling `out.backward(retain_graph=True)`, PyTorch retains the computation graph used to calculate the gradients, allowing subsequent calls to `out.backward()` to accumulate the gradients of the tensor `a`. Without `retain_graph=True`, the computation graph is “freed” after the first call to `.backward()`, meaning the temporary values used to calculate the gradients are cleared, and it is not possible to call `.backward()` on the same graph again.
Note: “freed” means that the intermediate values used to calculate the gradients are discarded, not the gradients themselves.
Since we rarely use `retain_graph`, this is usually not an issue.
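If you do find yourself calling `backward()` repeatedly on a retained graph, you can reset the accumulated gradient by hand before the next call; continuing the example above:
a.grad.zero_()                    # clear the accumulated values in place
out.backward(retain_graph=True)   # safe: the graph was retained earlier
print(a.grad)                     # tensor([28., 36., 44., 52.]) again
Setting `a.grad = None` works as well; PyTorch simply allocates a fresh buffer on the next backward pass.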
b) PyTorch Optimizers
In a training loop, gradient accumulation shows up even without `retain_graph=True`: every forward pass builds a fresh computation graph, so `backward()` can be called once per iteration, and each call adds the new gradients into the model parameters’ `.grad` buffers. The optimizer’s `step()` then updates the parameters using whatever has accumulated in those buffers.
Therefore, we must set the gradients to zero at every iteration (typically via `optimizer.zero_grad()`) to avoid unintended gradient accumulation.
For example, we create a simple Neural Network and train it twice: once with the gradients zeroed at every iteration, and once without.
import torch
from torch import nn
from torch.optim import SGD
Declaring the dataset (we stack a column of ones onto `t_c` so each sample has two input features):
t_c = [0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0]
t_u = [35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4]
t_c = torch.tensor(t_c)
t_c = torch.column_stack([t_c, torch.ones_like(t_c)])
t_u = torch.tensor(t_u)
Then, we define a Neural Network which has one hidden layer.
class MyNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Sequential(nn.Linear(2, 2), nn.ReLU())
        self.output = nn.Linear(2, 1)

    def forward(self, x):
        x = self.layer1(x)
        return self.output(x)
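As a quick sanity check (our addition, not part of the original training code): the network maps the `(11, 2)` input to an `(11, 1)` output, which is why we will need `.squeeze()` before computing the loss against the `(11,)` target.
print(MyNetwork()(t_c).shape)  # torch.Size([11, 1])
print(t_u.shape)               # torch.Size([11])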
After that, we define the first model (the one whose gradients we zero), with an SGD optimizer and MSE loss.
model1 = MyNetwork()
optimizer = SGD(model1.parameters(), lr=1e-3)
criterion = nn.MSELoss()
Then, we train the model for 5000 iterations:
for _ in range(5000):
    optimizer.zero_grad()
    output = model1(t_c).squeeze()  # (11, 1) -> (11,), to match the target's shape
    loss = criterion(output, t_u)
    loss.backward()
    optimizer.step()
    if _ % 1000 == 0:
        print(loss.item())
The output of the snippet is:
2889.27978515625
22.725187301635742
12.245370864868164
8.080016136169434
7.614316940307617
Let’s define another model, `model2`, and do the exact same steps, except `optimizer.zero_grad()`:
model2 = MyNetwork()
optimizer = SGD(model2.parameters(), lr=1e-3)
criterion = nn.MSELoss()
for _ in range(5000):
    # optimizer.zero_grad()
    output = model2(t_c).squeeze()
    loss = criterion(output, t_u)
    loss.backward()
    optimizer.step()
    if _ % 1000 == 0:
        print(loss.item())
The result of the code is:
3048.69921875
nan
nan
nan
nan
The second model’s output includes four `nan` values: the accumulated gradients grow without bound, the updates overshoot, and the loss diverges; whereas in the first model, the loss decreases to around 7.6.
Therefore, setting the gradients to zero plays a very important role in the training loop. It keeps the gradients from accumulating across iterations, and thus lets the model actually converge.