Stochastic gradient descent (SGD) is the fundamental algorithm for training an artificial neural network (ANN). Although there are many different types of artificial neural networks, most of them can be trained using gradient-based optimization. Let’s look at some pseudocode first.

Stochastic gradient descent

This algorithm is one of many gradient-based algorithms used for training neural networks. There are only two core steps in these algorithms:

  1. Compute the gradient of the loss with respect to the network’s parameters,
  2. Subtract a fraction of the gradient from the current parameters.

These steps are repeated for a fixed number of steps, or until the parameters stop changing (which is usually described as “until convergence”). This leads to a general framework you can find in the PyTorch implementation.
for epoch in 1..epochs:

	batch = random sample from X and y
	
	predictions = current predictions from the network
	loss = error computed from predictions and true labels
	
	gradients = list of partial derivatives for each parameter in the network
	
	for p in network.parameters:
		p -= learning_rate * gradients[p]
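
The loop above can be sketched concretely in plain NumPy. This fits a 1-D linear model; the data, model, and hyperparameters are all assumed for illustration:

```python
import numpy as np

# SGD on a 1-D linear model y = w*x + b; data and hyperparameters assumed.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=100)
y = 3.0 * X + 0.5                 # ground truth: w = 3.0, b = 0.5

w, b = 0.0, 0.0                   # parameters to learn
learning_rate = 0.1
for _ in range(500):
    # random sample from X and y
    idx = rng.integers(0, len(X), size=10)
    xb, yb = X[idx], y[idx]

    # prediction error; also dL/dprediction for the 0.5*MSE loss
    err = (w * xb + b) - yb

    # gradient of the loss with respect to each parameter
    grad_w = np.mean(err * xb)
    grad_b = np.mean(err)

    # subtract a fraction of the gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
```

After a few hundred steps the parameters should be close to the ground-truth values.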

Computing the gradient

A neural network can be written as a composition of functions:

$$f(x) = (f_L \circ f_{L-1} \circ \dots \circ f_1)(x),$$

where $f_i$ is the $i$-th layer and $\theta_1, \dots, \theta_L$ are the parameters of the network. The parameters are tied to each layer: $\theta_i$ belongs to $f_i$. The network output is

$$\hat{y} = f(x),$$

and the loss function is

$$\mathcal{L}(\hat{y}, y),$$

which can be seen as adding another “layer” with a fixed parameter $y$.
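
A toy numeric sketch of this composition view, with functions and values assumed for illustration:

```python
# Two toy "layers" with their parameters baked in (all values assumed):
f1 = lambda x: 2.0 * x           # layer 1
f2 = lambda h: h + 1.0           # layer 2
network = lambda x: f2(f1(x))    # the composition f2 after f1

y_hat = network(3.0)             # network output: 2*3 + 1 = 7
loss = (y_hat - 5.0) ** 2        # the loss as one more "layer" with fixed y = 5
```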

Backpropagation

The main idea behind the backpropagation algorithm is to use the chain rule, which is just applying the composite derivative rule over and over again.

First, we compute the forward message by passing the input through the network and caching the results in each layer. To obtain the partial derivative for parameter $\theta_i$ in layer $i$, we need to compute

$$\frac{\partial \mathcal{L}}{\partial \theta_i} = \frac{\partial \mathcal{L}}{\partial f_i} \frac{\partial f_i}{\partial \theta_i},$$

which is called the parameter message. To compute the parameter message, we need to compute the backward message

$$\delta_i = \frac{\partial \mathcal{L}}{\partial f_i} = \frac{\partial \mathcal{L}}{\partial f_L} \frac{\partial f_L}{\partial f_{L-1}} \cdots \frac{\partial f_{i+1}}{\partial f_i},$$

for which we need, by the same chain rule, the backward messages of all the $i+1$, $i+2$, $\dots$, $L$ layers. The loss derivative $\partial \mathcal{L} / \partial \hat{y}$ has no parameters and can be computed using only the prediction value $\hat{y}$ (and the fixed label $y$). We set it as $\delta_{L+1}$ and use it to initiate the algorithm. For each layer $i$, we only need to know how to compute its derivative given the numerical value of the following layer’s derivative $\delta_{i+1}$ and the layer’s input $x_i$.
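
These messages can be checked numerically on a tiny two-layer chain; the layer functions and values below are assumed for illustration:

```python
# Two scalar "layers": f1(x) = w1 * x and f2(h) = w2 * h,
# followed by the loss L = 0.5 * (f2 - y)^2 (all values assumed).
w1, w2, x, y = 1.5, -2.0, 3.0, 1.0

# forward messages, cached layer by layer
h1 = w1 * x                # output of layer 1, input of layer 2
h2 = w2 * h1               # output of layer 2 = the prediction

# backward messages, starting from the loss derivative
delta2 = h2 - y            # dL/dh2, needs only the prediction and the label
delta1 = delta2 * w2       # dL/dh1 = dL/dh2 * dh2/dh1

# parameter messages
grad_w2 = delta2 * h1      # dL/dw2 = dL/dh2 * dh2/dw2
grad_w1 = delta1 * x       # dL/dw1 = dL/dh1 * dh1/dw1

# sanity check against a finite-difference approximation
eps = 1e-6
loss = lambda a, b: 0.5 * (b * (a * x) - y) ** 2
num_w1 = (loss(w1 + eps, w2) - loss(w1 - eps, w2)) / (2 * eps)
assert abs(num_w1 - grad_w1) < 1e-3
```

Note that each gradient only uses the cached forward values and the backward message of the layer above, which is exactly what the framework below exploits.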

Framework outline

The following Python `Layer` code can be used as a template for layers. The math inside it is illustrative only.

class Layer():
    def __init__(self):
        # initialize weights and other layer parameters
        self.params = [...]
        self.grads = [...]  # zeros, one for each parameter

    def forward(self, X):
        """Compute the forward message and cache the input X."""
        self.X = X  # cache the input for the backward pass

        # compute the layer output for input X (illustrative math)
        return X + 1

    def backward(self, delta):
        """Compute the backward and parameter messages."""

        # compute the parameter message (illustrative math)
        for i in range(len(self.params)):
            self.grads[i] = max(0, self.params[i] * self.X * delta)

        # compute and return the backward message (illustrative math)
        dX = self.X * delta
        return dX

    def parameters(self):
        """Yield (parameter, gradient) pairs."""
        for idx in range(len(self.params)):
            yield self.params[idx], self.grads[idx]
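
As a concrete (assumed) instance of this template, here is a minimal elementwise scale layer with a single learnable scalar:

```python
import numpy as np

class ScaleLayer:
    """Elementwise y = w * X with one learnable scalar w (assumed example)."""
    def __init__(self, w=1.0):
        self.params = [np.array(w)]
        self.grads = [np.zeros(())]   # one zero gradient per parameter

    def forward(self, X):
        self.X = X                    # cache the input for the backward pass
        return self.params[0] * X

    def backward(self, delta):
        # parameter message: dL/dw, summed over the batch
        self.grads[0] = np.sum(self.X * delta)
        # backward message: dL/dX
        return self.params[0] * delta

    def parameters(self):
        for p, g in zip(self.params, self.grads):
            yield p, g
```

The 0-d NumPy array for `w` matters: a plain float could not be updated in place by the optimizer below.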
 

The network can use the same template with forward and backward methods; a (minimal) network would typically look like this:

class Network():
    def __init__(self):
        # define the layers
        self.layer1 = ...
        self.layer2 = ...
        self.layers = [self.layer1, self.layer2]

    def forward(self, X):
        """Forward pass through the whole network."""
        for layer in self.layers:
            X = layer.forward(X)
        return X

    def backward(self, delta):
        """Backward pass through the network."""
        # propagate backward through the layers in reverse order
        for layer in reversed(self.layers):
            delta = layer.backward(delta)
        return delta

    def _parameters(self):
        """Yield all parameters from all layers."""
        for layer in self.layers:
            if layer.params:  # skip parameter-free layers
                yield from layer.parameters()

    def parameters(self):
        """Public method that returns a fresh generator."""
        return self._parameters()
 

Notice that torch doesn’t force you to implement backward for the network. The reason is that the PyTorch developers are more clever and spent a decade perfecting it :). If you’re interested in how it’s done, read the Gentle Introduction to torch.autograd.

Optimizer

To finish this basic torch-like framework, we need the optimizer. The only role the optimizer has is to take the computed gradients and perform an update step. An outline could look like this.

class Optimizer():
    def __init__(self, params, learning_rate):
        self.params = params  # a callable that returns a fresh (param, grad) generator
        self.lr = learning_rate

    def step(self):
        for param, grad in self.params():  # call it to get a fresh generator
            param -= self.lr * grad

A word of warning: this update assumes that param is a mutable object, that is, a NumPy array, a torch tensor, etc. Don’t store plain ints or floats; the in-place subtraction won’t update anything stored in the layer.
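
A quick demonstration of this pitfall, with assumed values:

```python
import numpy as np

params = [np.array([1.0, 2.0])]   # mutable: the stored array is updated in place
for p in params:
    p -= 0.1 * np.ones(2)
# params[0] is now array([0.9, 1.9])

x = 5                             # immutable: only the loop variable rebinds
for v in [x]:
    v -= 1
# x is still 5
```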

Wrap up

With this framework outline, the training loop created previously changes to

import numpy as np
from math import ceil
 
# some training data
X = np.array([...])
y = np.array([...])
 
n = X.shape[0]
rng = np.random.default_rng()
batch_size = ceil(n**0.5)
num_epochs = 1000
 
network = Network()
loss_fn = Loss()  # the loss follows the same layer template with forward/backward
optimizer = Optimizer(network.parameters, learning_rate=0.01)
 
for _ in range(num_epochs):
    # sample random batch from training data
    idx = rng.integers(0, n, size=batch_size)
    Xb = X[idx]
    yb = y[idx]
 
    # get predictions
    preds = network.forward(Xb)
 
    # compute gradients
    delta = loss_fn.backward(preds, yb)
    network.backward(delta)
 
    # update parameters
    optimizer.step()
 

Next time, I will use this outline to implement the precursor of neural networks: the perceptron.