Stochastic gradient descent (SGD) is the fundamental algorithm for training artificial neural networks (ANNs). Although there are many different types of artificial neural networks, most of them can be trained using gradient-based optimization. Let's look at some pseudocode first.
Stochastic gradient descent
This algorithm is one of many gradient-based algorithms used for training neural networks. There are only two core steps in these algorithms:
- Compute the gradient of the loss with respect to the network's parameters,
- Subtract a fraction of the gradient from the current parameters.

These steps are repeated for a fixed number of iterations, or until the parameters stop changing noticeably (usually described as "until convergence"). This leads to a general framework you can also find in the PyTorch implementation.
```
for epoch in 1..epochs:
    batch = random sample from X and y
    predictions = network output for the batch
    loss = error computed from predictions and true labels
    gradients = partial derivative of the loss for each parameter
    for p in network.parameters:
        p -= learning_rate * gradients[p]
```
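The pseudocode above can be sketched concretely in NumPy. The model, data, and hyperparameters below are my own illustration (a hypothetical one-parameter model y = w·x trained with squared error), not something from the text:

```python
import numpy as np

# Toy data for the hypothetical model y = w * x; the true parameter is 3.0.
rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 3.0 * X

w = 0.0            # the single "network parameter"
learning_rate = 0.1
epochs = 200
batch_size = 10

for epoch in range(epochs):
    # sample a random batch from X and y
    idx = rng.integers(0, len(X), size=batch_size)
    Xb, yb = X[idx], y[idx]
    predictions = w * Xb
    # gradient of the mean squared error 0.5 * (pred - y)^2 w.r.t. w
    gradient = np.mean((predictions - yb) * Xb)
    # subtract a fraction of the gradient
    w -= learning_rate * gradient

print(w)  # very close to the true value 3.0
```

Each iteration performs exactly the two core steps: compute the gradient on a random batch, then subtract a fraction of it.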
Computing the gradient
A neural network can be written as a composition of functions:

$$f(x; \theta) = (f_L \circ f_{L-1} \circ \cdots \circ f_1)(x),$$

where $f_i$ is the $i$-th layer and $\theta = (\theta_1, \ldots, \theta_L)$ are the parameters of the network. Each parameter $\theta_i$ is tied to its layer. The network output is

$$\hat{y} = f(x; \theta),$$

and the loss function is

$$\mathcal{L}(\hat{y}, y),$$

which can be seen as adding another "layer" with a fixed parameter $y$.
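The composition above can be made concrete with two illustrative "layers" and a loss (the functions and numbers here are my own toy example, not from the text):

```python
# Two toy layers and a loss, composed as in the formulas above.
def f1(x, theta1):        # first layer: scale by theta1
    return theta1 * x

def f2(h, theta2):        # second layer: shift by theta2
    return h + theta2

def loss(y_hat, y):       # the extra "layer" with fixed parameter y
    return 0.5 * (y_hat - y) ** 2

x, y = 2.0, 5.0
theta1, theta2 = 3.0, -1.0
y_hat = f2(f1(x, theta1), theta2)  # network output: (3.0 * 2.0) - 1.0
print(y_hat, loss(y_hat, y))       # 5.0 0.0
```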
Backpropagation
The main idea behind the backpropagation algorithm is to use the chain rule, which is just applying the composite-function derivative rule over and over again.
First, we compute the forward message by passing the input through the network and caching the results in each layer. To obtain the partial derivative for parameter $\theta_i$ in layer $i$, we need to compute

$$\frac{\partial \mathcal{L}}{\partial \theta_i} = \frac{\partial \mathcal{L}}{\partial f_i} \frac{\partial f_i}{\partial \theta_i},$$

which is called the parameter message. To compute the parameter message, we need to compute the backward message

$$\delta_i = \frac{\partial \mathcal{L}}{\partial f_i} = \frac{\partial \mathcal{L}}{\partial f_L} \frac{\partial f_L}{\partial f_{L-1}} \cdots \frac{\partial f_{i+1}}{\partial f_i},$$

for which we need, by the same chain rule, the backward messages of layers $i+1$, $i+2$, …, $L$. The loss derivative has no parameters and can be computed using only the prediction value $\hat{y}$. We set it as $\delta = \partial \mathcal{L} / \partial \hat{y}$ and use it to initiate the algorithm. For each layer $i$, we only need to know how to compute its derivatives given the numerical value of the following layer's backward message $\delta_{i+1}$ and the layer's input $x_i$.
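The message passing above can be checked numerically on a tiny two-layer example. The layers and values here are my own illustration; the finite-difference check confirms the chain-rule gradient:

```python
# Chain rule on a toy two-layer network, checked against a finite difference.
def f1(x, theta1):
    return theta1 * x          # layer 1: scale

def f2(h, theta2):
    return h + theta2          # layer 2: shift

def loss(y_hat, y):
    return 0.5 * (y_hat - y) ** 2

x, y = 2.0, 3.0
theta1, theta2 = 1.5, 0.5

# forward pass, caching intermediate values
h1 = f1(x, theta1)
y_hat = f2(h1, theta2)

# backward messages, starting from the loss derivative
delta_loss = y_hat - y         # dL/dy_hat initiates the algorithm
delta_2 = delta_loss * 1.0     # df2/dh1 = 1, passed back through layer 2
grad_theta1 = delta_2 * x      # parameter message of layer 1: df1/dtheta1 = x

# finite-difference check of dL/dtheta1
eps = 1e-6
numeric = (loss(f2(f1(x, theta1 + eps), theta2), y)
           - loss(f2(f1(x, theta1 - eps), theta2), y)) / (2 * eps)
print(abs(grad_theta1 - numeric) < 1e-6)  # True
```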
Framework outline
The following Python class can serve as a template for layers. The math inside is purely illustrative.
```python
class Layer():
    def __init__(self):
        # initialize weights
        # check layer parameters
        self.params = [...]
        self.grads = [...]  # zeroes, one for each parameter

    def forward(self, X):
        """Compute forward message and cache input X"""
        self.X = X  # caching the input
        # compute layer output for input X
        return X + 1

    def backward(self, delta):
        """Compute the backward and parameter message."""
        # compute parameter message
        for i in range(len(self.params)):
            self.grads[i] = max(0, self.params[i] * self.X * delta)
        # compute and return the backward message
        dX = self.X * delta
        return dX

    def parameters(self):
        """Generator to return parameters"""
        for idx in range(len(self.params)):
            yield self.params[idx], self.grads[idx]
```
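As a concrete instance of this template, here is a hypothetical scalar layer computing w * X. The layer name and its math are my own illustration, with real (checkable) gradients instead of the placeholder math above:

```python
import numpy as np

class ScaleLayer:
    """A concrete layer following the template: forward computes w * X."""
    def __init__(self, w=2.0):
        self.params = [np.array(w)]
        self.grads = [np.zeros(())]

    def forward(self, X):
        self.X = X                      # cache the input for backward
        return self.params[0] * X

    def backward(self, delta):
        # parameter message: d(w * X)/dw = X, summed over the batch
        self.grads[0] = np.sum(self.X * delta)
        # backward message: d(w * X)/dX = w
        return self.params[0] * delta

    def parameters(self):
        for idx in range(len(self.params)):
            yield self.params[idx], self.grads[idx]

layer = ScaleLayer(w=3.0)
out = layer.forward(np.array([1.0, 2.0]))
dX = layer.backward(np.array([1.0, 1.0]))
print(out, dX, layer.grads[0])  # [3. 6.] [3. 3.] 3.0
```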
The network can use the same forward/backward template; a minimal network would typically look like this:
```python
class Network():
    def __init__(self):
        # define the layers
        self.layer1 = ...
        self.layer2 = ...
        self.layers = [self.layer1, self.layer2]

    def forward(self, X):
        """Forward pass through the whole network"""
        for layer in self.layers:
            X = layer.forward(X)
        return X

    def backward(self, delta):
        """Backward pass through the network"""
        # propagate backward through layers in reverse
        for layer in reversed(self.layers):
            delta = layer.backward(delta)
        return delta

    def _parameters(self):
        """Yield all parameters from all layers"""
        for layer in self.layers:
            if layer.has_params():
                yield from layer.parameters()

    def parameters(self):
        """Public method returns a fresh generator"""
        return self._parameters()
```
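A quick way to see the reversed backward traversal in action is a tiny self-contained network. The two stub layers (AddOne, Double) are hypothetical stand-ins I made up for this check, not from the text:

```python
class AddOne:
    def forward(self, X):
        return X + 1
    def backward(self, delta):
        return delta * 1          # d(X + 1)/dX = 1

class Double:
    def forward(self, X):
        return 2 * X
    def backward(self, delta):
        return delta * 2          # d(2X)/dX = 2

class TinyNetwork:
    def __init__(self):
        self.layers = [AddOne(), Double()]
    def forward(self, X):
        for layer in self.layers:
            X = layer.forward(X)
        return X
    def backward(self, delta):
        # propagate in reverse layer order, as in Network.backward above
        for layer in reversed(self.layers):
            delta = layer.backward(delta)
        return delta

net = TinyNetwork()
print(net.forward(3))     # (3 + 1) * 2 = 8
print(net.backward(1.0))  # d(out)/d(in) = 2.0
```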
Notice that torch doesn't force you to implement backward for the network. The reason is that the PyTorch developers are more clever and spent a decade perfecting it :). If you're interested in how it's done, read the Gentle Introduction to torch.autograd.
Optimizer
To finish this basic torch-like framework, we need the optimizer. The optimizer's only role is to take the computed gradients and perform an update step. An outline snippet could look like this:
```python
class Optimizer():
    def __init__(self, params, learning_rate):
        self.params = params  # could be a generator factory or a list
        self.lr = learning_rate

    def step(self):
        for param, grad in self.params():  # in case self.params returns a generator
            param -= self.lr * grad
```

A word of warning: this syntax assumes that param is a mutable object, such as a NumPy array or a torch tensor. Don't store plain int or float parameters; in-place subtraction on those only rebinds a local name and won't update anything.
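The mutability caveat is easy to demonstrate. This small example (mine, for illustration) shows that `-=` mutates a NumPy array in place but silently fails to update a plain float:

```python
import numpy as np

# One array parameter and one plain-float "parameter".
params = [np.array([1.0, 2.0]), 5.0]

p = params[0]
p -= 0.1 * np.array([1.0, 1.0])  # mutates the array held by the list
print(params[0])                 # the update is visible: [0.9 1.9]

q = params[1]
q -= 0.1                         # rebinds the local name q only
print(params[1])                 # still 5.0: the float never updates
```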
Wrap up
With this framework outline, the training loop created previously changes to:
```python
import numpy as np
from math import ceil

# some training data
X = np.array([...])
y = np.array([...])
n = X.shape[0]

rng = np.random.default_rng()
batch_size = ceil(n**0.5)
num_epochs = 1000

network = Network()
loss_fn = Loss()
optimizer = Optimizer(network.parameters, learning_rate=0.01)

for _ in range(num_epochs):
    # sample random batch from training data
    idx = rng.integers(0, n, size=batch_size)
    Xb = X[idx]
    yb = y[idx]
    # get predictions
    preds = network.forward(Xb)
    # compute gradients
    delta = loss_fn.backward(preds, yb)
    network.backward(delta)
    # update parameters
    optimizer.step()
```
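The loop uses a Loss object that the text doesn't define. One possible sketch, assuming a mean-squared-error loss whose backward produces the initial delta message:

```python
import numpy as np

class Loss:
    """A squared-error loss matching the loss_fn usage above
    (my sketch; the text does not define Loss)."""
    def forward(self, preds, y):
        return 0.5 * np.mean((preds - y) ** 2)

    def backward(self, preds, y):
        # initial backward message dL/dpreds, averaged over the batch
        return (preds - y) / len(y)

loss_fn = Loss()
preds = np.array([1.0, 2.0])
y = np.array([1.0, 4.0])
print(loss_fn.forward(preds, y))   # 1.0
print(loss_fn.backward(preds, y))  # [ 0. -1.]
```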
Next time, I will use this outline to implement the precursor of neural networks: the perceptron.