Mean squared error (MSE) is defined as

$$\mathrm{MSE}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2,$$

where both the truth $y$ and the prediction $\hat{y}$ are 1D arrays of $n$ values. The loss is always non-negative; the larger the difference from the truth, the bigger the penalty; and it is easily differentiable.
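As a concrete example (the numbers here are arbitrary), take $y = (1, 2, 3)$ and $\hat{y} = (1, 2, 5)$:

$$\mathrm{MSE} = \frac{(1-1)^2 + (2-2)^2 + (3-5)^2}{3} = \frac{4}{3}.$$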

We are interested in the derivative with respect to the prediction $\hat{y}$. By the chain rule, the derivative for the $i$-th label is

$$\frac{\partial\, \mathrm{MSE}}{\partial \hat{y}_i} = \frac{1}{n} \cdot 2 \left(y_i - \hat{y}_i\right) \cdot (-1) = \frac{2}{n} \left(\hat{y}_i - y_i\right).$$

Therefore, the gradient is

$$\nabla_{\hat{y}}\, \mathrm{MSE} = \frac{2}{n} \left(\hat{y} - y\right).$$
Some people like to omit the constant factor $\frac{2}{n}$ (or just the $2$, by defining the loss with a $\frac{1}{2}$ in front). While optimizing, such a factor doesn’t really matter: it only rescales the gradient, so the optimum stays the same.
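Continuing the example above (same arbitrary numbers), the gradient at $\hat{y} = (1, 2, 5)$ with $y = (1, 2, 3)$ is

$$\nabla_{\hat{y}}\, \mathrm{MSE} = \frac{2}{3} \left((1, 2, 5) - (1, 2, 3)\right) = \left(0,\ 0,\ \tfrac{4}{3}\right),$$

so only the third prediction, the one that is off, receives a correcting signal.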

Implementation

Let’s jump directly into the code:

import numpy as np


class MSELoss:
    def forward(self, y, y_pred):
        # Scalar loss: mean of squared differences between truth and prediction.
        assert len(y.shape) == 1 and len(y_pred.shape) == 1, "Not a 1D array."
        assert y.shape == y_pred.shape, "Dimension mismatch"
        return np.mean(np.power(y - y_pred, 2))

    def backward(self, y, y_pred):
        # Gradient of the loss with respect to y_pred: (2 / n) * (y_pred - y).
        assert len(y.shape) == 1 and len(y_pred.shape) == 1, "Not a 1D array."
        assert y.shape == y_pred.shape, "Dimension mismatch"
        n = y.shape[0]
        return (2.0 / n) * (y_pred - y)
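
As a quick sanity check, here is a minimal usage sketch (array values chosen arbitrarily) that also compares backward against a numerical finite-difference gradient:

loss = MSELoss()
y = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 5.0])

print(loss.forward(y, y_pred))   # 1.3333... (= 4/3)
print(loss.backward(y, y_pred))  # [0. 0. 1.3333]

# Numerically approximate the gradient by bumping each prediction by eps.
eps = 1e-6
numeric = np.zeros_like(y_pred)
for i in range(y_pred.shape[0]):
    bumped = y_pred.copy()
    bumped[i] += eps
    numeric[i] = (loss.forward(y, bumped) - loss.forward(y, y_pred)) / eps

print(np.allclose(numeric, loss.backward(y, y_pred), atol=1e-4))  # True

If the analytic and numeric gradients agree, the backward pass matches the formula derived above.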