ReLU

The Rectified Linear Unit is the most basic activation function. For each element of the input vector, it returns 0 if the value is negative, and the value itself otherwise.

The forward message can be rewritten as

$$\mathrm{ReLU}(X) = \max(0, X) = X \cdot [X > 0],$$

where the maximum, the indicator $[X > 0]$ and the product are all taken element-wise. The Python code in NumPy can therefore be either

def forward(self, X):
	self.X = X  # store the input for the backward message
	return X * (X > 0)

or

def forward(self, X):
	self.X = X  # store the input for the backward message
	return (np.abs(X) + X) / 2  # (|X| + X) / 2 equals X where X > 0 and 0 elsewhere

However, the first version will almost surely require fewer operations, though it's not as fancy. Notice that we need to store the input matrix X for the backward message, or better, store only the boolean mask (X > 0), as sketched below.
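A minimal sketch of this mask-based variant could look like the following; it assumes the same layer-style class as the snippets above, and the backward message would then reuse self.mask instead of recomputing (self.X > 0):

def forward(self, X):
	# Keep only the boolean mask; the actual input values are not needed later.
	self.mask = X > 0
	return X * self.mask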

For the backward message, the activation function doesn't have any parameters, but we still need a backward message for the previous layers. For a single sample input, the derivative with respect to $x_i$ is given by

$$\frac{\partial y_i}{\partial x_i} = [x_i > 0].$$

The full chain rule results in

$$\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial y_i} \cdot [x_i > 0].$$

def backward(self, dY):
	# The gradient passes through only where the input was positive.
	return dY * (self.X > 0)

In all cases, the comparison X > 0 is performed element-wise, and so is the * operator. Beware that at 0 the gradient technically doesn't exist; however, it is standard to set it to 0 (as this implementation does).
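As a quick sanity check, here is a small usage example; the ReLU class name is hypothetical and is assumed to wrap the forward and backward methods above, with NumPy imported as np:

import numpy as np

relu = ReLU()                        # hypothetical class holding the methods above
X = np.array([[-1.5, 0.0, 2.0]])
Y = relu.forward(X)                  # negative entry is zeroed out: 0, 0, 2
dX = relu.backward(np.ones_like(X))  # gradient mask: 0, 0, 1 (the entry at exactly 0 gets gradient 0)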