Understanding Backpropagation from Scratch

Soumyadip Sarkar
Nov 19, 2024

In the world of machine learning, neural networks have revolutionized how we approach complex problems like image recognition, natural language processing, and more. At the heart of training neural networks lies an algorithm called backpropagation. This algorithm adjusts the weights of the network to minimize the error between the predicted output and the actual output.

This article aims to demystify backpropagation by breaking it down into simple terms, mathematical explanations, and practical examples. Whether you’re a beginner or someone looking to refresh your understanding, this guide is for you.

The Basics of Neural Networks

Before diving into backpropagation, let’s briefly revisit how neural networks work.

A neural network consists of layers of interconnected nodes or neurons:

  • Input Layer: Receives the input data.
  • Hidden Layers: Perform computations and feature extraction.
  • Output Layer: Produces the final output.

Each connection between neurons has an associated weight that determines the strength and direction of the signal.

Activation Functions introduce non-linearity into the network, allowing it to learn complex patterns. Common activation functions include the sigmoid, tanh, and ReLU functions.
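
As a quick illustration, here is a minimal sketch of these three activations in NumPy (the function names are just illustrative):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))  # squashes any input into (0, 1)

def tanh(z):
    return np.tanh(z)            # squashes any input into (-1, 1)

def relu(z):
    return np.maximum(0, z)      # zero for negative inputs, identity otherwise

print(sigmoid(0.0), tanh(0.0), relu(-2.0))  # 0.5 0.0 0.0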

The Need for Backpropagation

Training a neural network involves finding the optimal set of weights that minimizes the difference between the predicted output and the actual output. This difference is measured using a loss function (also known as a cost function), such as Mean Squared Error (MSE) for regression tasks or Cross-Entropy Loss for classification tasks.

But how do we adjust the weights to minimize the loss?

Enter backpropagation: an algorithm that efficiently computes the gradient of the loss function with respect to each weight in the network. By knowing these gradients, we can adjust the weights in the direction that most reduces the loss using an optimization algorithm like Gradient Descent.
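
Concretely, a single gradient descent step nudges each weight against its gradient. A minimal sketch (the numeric values here are purely illustrative):

learning_rate = 0.1
w = 0.4       # current weight
grad = -0.01  # gradient of the loss with respect to w, supplied by backpropagation

w = w - learning_rate * grad  # step in the direction that reduces the loss
print(w)  # roughly 0.401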

Mathematical Foundations

To understand backpropagation, we need to delve into some math. Don’t worry; we’ll keep it as straightforward as possible.

1. The Chain Rule

Backpropagation relies heavily on the chain rule from calculus, which allows us to compute the derivative of a composite function.

If we have functions f(x) and g(x), then the derivative of the composite function h(x) = f(g(x)) is:

h′(x) = f′(g(x)) · g′(x)

A neural network's output is a composition of many such functions (weighted sums followed by activations), so the chain rule lets us compute the gradient one layer at a time.
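
To make this concrete, here is a small sketch that checks the chain rule numerically for h(x) = sigmoid(w·x + b), comparing the analytic derivative with a finite-difference approximation (the values are illustrative):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w, b, x = 0.4, 0.1, 0.5

# Chain rule: h'(x) = sigmoid'(w*x + b) * w, with sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
z = w * x + b
analytic = sigmoid(z) * (1 - sigmoid(z)) * w

# Finite-difference approximation of the same derivative
eps = 1e-6
numeric = (sigmoid(w * (x + eps) + b) - sigmoid(w * (x - eps) + b)) / (2 * eps)

print(analytic, numeric)  # the two values agree to several decimal places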

2. Loss Function

Let’s denote:

  • Predicted Output: ŷ
  • Actual Output: y
  • Loss Function: L(y, ŷ)

For simplicity, we'll use the Mean Squared Error (MSE) loss function:

L(y, ŷ) = ½ (y - ŷ)²

The factor 1/2 simplifies the derivative: differentiating with respect to ŷ gives ∂L/∂ŷ = -(y - ŷ), with no leftover factor of 2.
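
A quick sketch of this loss and its derivative in Python (the function names are illustrative):

def mse_loss(y, y_hat):
    return 0.5 * (y - y_hat) ** 2

def mse_loss_derivative(y, y_hat):
    return -(y - y_hat)  # dL/d(y_hat)

print(mse_loss(0.6, 0.5), mse_loss_derivative(0.6, 0.5))  # approximately 0.005 and -0.1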

Let’s Dive Into How Backpropagation Works

Let’s consider a simple neural network with one input, one hidden neuron, and one output neuron.

Step 1: Forward Pass

  1. Input: x
  2. Weights: w1 (input to hidden), w2 (hidden to output)
  3. Biases: b1​ (hidden layer), b2​ (output layer)
  4. Activation Function: Sigmoid

Hidden Layer Computation:

z1 = w1 · x + b1
a1 = σ(z1)

Output Layer Computation:

z2 = w2 · a1 + b2
ŷ = σ(z2)

Here σ is the sigmoid activation, σ(z) = 1 / (1 + e^(-z)).

Step 2: Compute Loss

L(y, ŷ) = ½ (y - ŷ)²

Step 3: Backward Pass (Compute Gradients)

Our goal is to compute the gradients of the loss with respect to each weight and bias: ∂L/∂w1, ∂L/∂b1, ∂L/∂w2, ∂L/∂b2.

Gradient with respect to w2:

Using the chain rule:

∂L/∂w2 = (∂L/∂ŷ) · (∂ŷ/∂z2) · (∂z2/∂w2)

Compute each part:

∂L/∂ŷ = -(y - ŷ)
∂ŷ/∂z2 = ŷ (1 - ŷ)   (the derivative of the sigmoid)
∂z2/∂w2 = a1

So:

∂L/∂w2 = -(y - ŷ) · ŷ (1 - ŷ) · a1

Similarly, compute the gradients for w1, b1, and b2 by continuing the chain back through the network. Writing δ2 = ∂L/∂z2 = -(y - ŷ) · ŷ (1 - ŷ), they are:

∂L/∂b2 = δ2
∂L/∂w1 = δ2 · w2 · a1 (1 - a1) · x
∂L/∂b1 = δ2 · w2 · a1 (1 - a1)

Okay, Let’s Walk Through This With An Example

Let's plug in some numbers to see backpropagation in action.

Given:

  • Input: x = 0.5
  • Weights: w1 = 0.4, w2 = 0.7
  • Biases: b1 = 0.1, b2 = −0.3
  • Actual Output: y = 0.6

Forward Pass:

  1. Compute z1 and a1:

z1 = w1 · x + b1 = 0.4 · 0.5 + 0.1 = 0.3
a1 = σ(0.3) ≈ 0.5744

  2. Compute z2 and ŷ:

z2 = w2 · a1 + b2 = 0.7 · 0.5744 - 0.3 ≈ 0.1021
ŷ = σ(0.1021) ≈ 0.5255

  3. Compute Loss:

L = ½ (0.6 - 0.5255)² ≈ 0.0028

Backward Pass:

δ2 = -(y - ŷ) · ŷ (1 - ŷ) ≈ -0.0745 · 0.2493 ≈ -0.0186
∂L/∂w2 = δ2 · a1 ≈ -0.0107
∂L/∂b2 = δ2 ≈ -0.0186
∂L/∂w1 = δ2 · w2 · a1 (1 - a1) · x ≈ -0.0016
∂L/∂b1 = δ2 · w2 · a1 (1 - a1) ≈ -0.0032

Update Weights and Biases (with learning rate 0.1, as in the code below):

w1 ← w1 - 0.1 · ∂L/∂w1 ≈ 0.4002
b1 ← b1 - 0.1 · ∂L/∂b1 ≈ 0.1003
w2 ← w2 - 0.1 · ∂L/∂w2 ≈ 0.7011
b2 ← b2 - 0.1 · ∂L/∂b2 ≈ -0.2981

Result: The weights and biases have been updated to reduce the loss.

Implementing Backpropagation from Scratch

Let’s translate our understanding into code using Python. We’ll implement the same example.

import numpy as np

# Sigmoid activation function and its derivative
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(a):
    # takes the sigmoid output a, since sigmoid'(z) = a * (1 - a)
    return a * (1 - a)

# Given parameters
x = 0.5
y = 0.6
w1 = 0.4
w2 = 0.7
b1 = 0.1
b2 = -0.3
learning_rate = 0.1

# Forward Pass
z1 = w1 * x + b1
a1 = sigmoid(z1)

z2 = w2 * a1 + b2
a2 = sigmoid(z2)
hat_y = a2

# Compute Loss
loss = 0.5 * (y - hat_y) ** 2

# Backward Pass
dL_dhat_y = -(y - hat_y)
dhat_y_dz2 = sigmoid_derivative(hat_y)
dL_dz2 = dL_dhat_y * dhat_y_dz2

dz2_dw2 = a1
dL_dw2 = dL_dz2 * dz2_dw2

dL_db2 = dL_dz2

dL_da1 = dL_dz2 * w2
da1_dz1 = sigmoid_derivative(a1)
dL_dz1 = dL_da1 * da1_dz1

dz1_dw1 = x
dL_dw1 = dL_dz1 * dz1_dw1

dL_db1 = dL_dz1

# Update Weights and Biases
w2 -= learning_rate * dL_dw2
b2 -= learning_rate * dL_db2
w1 -= learning_rate * dL_dw1
b1 -= learning_rate * dL_db1

# Output updated parameters
print(f"Updated w1: {w1}")
print(f"Updated b1: {b1}")
print(f"Updated w2: {w2}")
print(f"Updated b2: {b2}")

Output:

Updated w1: 0.40015893033226196
Updated b1: 0.10031786066452393
Updated w2: 0.7010670395458765
Updated b2: -0.2981424781163503

This code snippet mirrors our manual calculations and updates the weights and biases accordingly.
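
As a sanity check, the analytic gradients can be compared against finite differences. Here is a minimal sketch that continues from the script above (the helper loss_for is hypothetical, written just for this check, and uses the original parameter values):

# Numerically estimate dL/dw2 by nudging w2 and re-running the forward pass
def loss_for(w1, b1, w2, b2, x, y):
    a1 = sigmoid(w1 * x + b1)
    y_hat = sigmoid(w2 * a1 + b2)
    return 0.5 * (y - y_hat) ** 2

eps = 1e-6
w1_0, b1_0, w2_0, b2_0 = 0.4, 0.1, 0.7, -0.3  # parameters before the update
numeric_dL_dw2 = (loss_for(w1_0, b1_0, w2_0 + eps, b2_0, x, y)
                  - loss_for(w1_0, b1_0, w2_0 - eps, b2_0, x, y)) / (2 * eps)
print(numeric_dL_dw2, dL_dw2)  # both are approximately -0.0107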

Conclusion

Backpropagation is a fundamental algorithm for training neural networks. By understanding its mathematical underpinnings and working through examples, we gain deeper insights into how neural networks learn.

Key takeaways:

  • Chain Rule: Essential for computing gradients in composite functions.
  • Gradients: Indicate the direction to adjust weights to minimize loss.
  • Learning Rate: Controls the step size during weight updates.

By implementing backpropagation from scratch, we appreciate the complexities hidden behind high-level frameworks and become better equipped to troubleshoot and optimize neural networks in practice.
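
To see these pieces working together, the single update from the example can be repeated in a loop, and the loss shrinks with each pass. A minimal sketch using the same toy network and starting values (the choice of 1,000 iterations is arbitrary):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x, y = 0.5, 0.6
w1, b1, w2, b2 = 0.4, 0.1, 0.7, -0.3
lr = 0.1

for _ in range(1000):
    # Forward pass
    a1 = sigmoid(w1 * x + b1)
    y_hat = sigmoid(w2 * a1 + b2)

    # Backward pass (the same chain-rule gradients derived above)
    dL_dz2 = -(y - y_hat) * y_hat * (1 - y_hat)
    dL_dz1 = dL_dz2 * w2 * a1 * (1 - a1)

    # Gradient descent updates
    w2 -= lr * dL_dz2 * a1
    b2 -= lr * dL_dz2
    w1 -= lr * dL_dz1 * x
    b1 -= lr * dL_dz1

a1 = sigmoid(w1 * x + b1)
y_hat = sigmoid(w2 * a1 + b2)
print(0.5 * (y - y_hat) ** 2)  # far smaller than the initial loss of about 0.0028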


Thank you for reading! If you found this article helpful, feel free to share it with others.
