Key Takeaway

  • Learning = minimizing the cost function by nudging every weight and bias in the direction that reduces cost
  • That direction is given by the negative gradient
  • Backpropagation is how we compute that gradient efficiently

This chapter answers:
How does a neural network actually learn?

Introduction

Learning in neural networks = adjusting weights and biases to make predictions less wrong.

  • This chapter: how the network updates parameters.
  • Core idea: Gradient Descent.

Two Main Goals of This Video

  • Understand the intuition of gradient descent.
  • Connect that intuition to neural network parameters (all weights and biases).

What Should Learning Optimize?

We need one number that says “how bad the network currently is”.

  • This number is called the cost (or loss over dataset).
    • If cost is high, predictions are bad.
    • If cost is low, predictions are better.

Cost Function (Simple Form)

A cost function is a mathematical formula used to measure how wrong a model’s predictions are.

The neural network as a function:

  • Input: 784 pixel values
  • Output: 10 numbers (scores for the digits 0–9)
  • Parameters: 13,002 weights and biases

The cost function, by contrast:

  • Input: the 13,002 weights and biases
  • Output: 1 number (the cost)
  • "Parameters": the many, many training examples

For one training example:

$$\text{cost} = \sum_{j} (a_j - y_j)^2$$

  • $a_j$: output activation of neuron $j$
  • $y_j$: target value of neuron $j$
  • Squaring has two main purposes:
    • It prevents positive and negative errors from canceling each other out
    • It penalizes larger errors more heavily
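As a sketch in plain Python (no framework; the 3-class toy output below is made up to keep it short, while the real network has 10 outputs):

```python
def example_cost(activations, targets):
    """Squared-error cost for one training example:
    sum over output neurons of (a_j - y_j)^2."""
    return sum((a - y) ** 2 for a, y in zip(activations, targets))

a = [0.1, 0.2, 0.9]   # toy 3-class network output
y = [0.0, 0.0, 1.0]   # one-hot target: the last class is correct
print(example_cost(a, y))  # 0.01 + 0.04 + 0.01 ≈ 0.06
```

Note how a confident, correct output (0.9 vs. target 1.0) contributes only a tiny penalty, while a badly wrong neuron would dominate the sum because of the squaring.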

Cost Function = average loss over all training examples

$$C = \frac{1}{n} \sum_{i=1}^{n} \text{cost}_i$$

  • $n$: number of training examples; the sum runs over the whole training set.
  • $C$ is the average error over the training set.
  • Learning goal: make $C$ as small as possible.
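A sketch of the averaging in plain Python (the two-example, two-output dataset here is made up purely for illustration):

```python
def average_cost(all_activations, all_targets):
    """C = (1/n) * sum over training examples of the per-example
    squared-error cost."""
    per_example = [
        sum((a - y) ** 2 for a, y in zip(acts, targs))
        for acts, targs in zip(all_activations, all_targets)
    ]
    return sum(per_example) / len(per_example)

outputs = [[1.0, 0.0], [0.0, 1.0]]  # first example is completely wrong
targets = [[0.0, 1.0], [0.0, 1.0]]  # second example is exactly right
print(average_cost(outputs, targets))  # (2.0 + 0.0) / 2 = 1.0
```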

Finding the Minimum of the Cost Function

  • Figure: cost function curve for a single weight

Explaining the derivative

  • Derivative = the instantaneous rate of change of a function.

  • Write the cost as $C(w)$, because different values of $w$ produce different cost values;

    • compute the derivative $\frac{dC}{dw}$ to see whether the cost is increasing or decreasing at the current $w$,
    • and use that slope information to adjust $w$ step by step until $C(w)$ gets closer to its minimum.
  • The figure labels $\frac{dC}{dw}$:

    • this is the derivative of $C(w)$ with respect to $w$
    • the derivative tells us the slope of the curve
    • at the minimum point the curve is flat, so the slope is zero at the bottom of the curve.
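To make the slope idea concrete, here is a sketch using a finite-difference approximation of the derivative on a toy one-weight cost ($C(w) = (w - 3)^2$ is an assumed example with its minimum at $w = 3$, not the network's real cost):

```python
def C(w):
    return (w - 3) ** 2  # toy one-weight cost, minimum at w = 3

def slope(f, w, h=1e-6):
    """Central finite-difference approximation of the derivative f'(w)."""
    return (f(w + h) - f(w - h)) / (2 * h)

print(slope(C, 0.0))  # ≈ -6: cost is decreasing here, so increase w
print(slope(C, 5.0))  # ≈ +4: cost is increasing here, so decrease w
print(slope(C, 3.0))  # ≈ 0: flat -- we are at the minimum
```

The sign of the slope is all the update rule needs: it says which direction is "downhill" at the current $w$.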

Gradient Descent

Imagine the cost function as a landscape (hills and valleys). We want to move downhill.

Gradient descent (∇ = gradient symbol):

  • is the method the model uses to learn.
  • It works by repeatedly adjusting the parameters (weights and biases) a little bit so that the cost function becomes smaller.
  • At each step, it looks at the gradient, which tells how the cost changes with respect to each parameter, and then moves the parameters in the opposite direction of that gradient.
    • Parameters (all weights + biases) are coordinates in a high-dimensional space.
    • At current position, compute the slope direction that increases cost fastest.
    • Move in the opposite direction to decrease cost fastest.

Update Rule

If the parameter vector is $\theta$:

$$\theta \leftarrow \theta - \eta \, \nabla C(\theta)$$

New parameters = old parameters − a small step opposite to the gradient.

  • Symbols mean:
    • $\theta$: the parameters (all weights and biases)
    • $C$: cost function
    • $\nabla C$: gradient vector (all partial derivatives)
    • $\eta$: learning rate (step size)
  • Each step, the model adjusts its parameters a little bit in order to reduce the cost.

For each parameter separately:

new value = old value - learning rate × the partial derivative of the cost with respect to that parameter
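A minimal sketch of this per-parameter rule in plain Python (the parameter and gradient values are made up for illustration):

```python
def gradient_step(params, grads, lr):
    """One gradient-descent update: each parameter moves opposite
    its own partial derivative, scaled by the learning rate."""
    return [p - lr * g for p, g in zip(params, grads)]

params = [0.5, -1.0]
grads = [2.0, -4.0]   # hypothetical partial derivatives dC/dp
print(gradient_step(params, grads, lr=0.1))  # ≈ [0.3, -0.6]
```

The first parameter has a positive derivative, so it is pushed down; the second has a negative derivative, so it is pushed up.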

Why Gradient?

  • The gradient points in the direction of steepest increase of the function.
  • The negative gradient points in the direction of steepest decrease.
  • So moving along $-\nabla C$ is the most efficient local direction to reduce cost.

Learning Rate (η)

Links: What is Gradient Descent - GeeksforGeeks

  • The learning rate is an important hyperparameter in gradient descent
  • that controls how big a step to take downhill along the gradient when updating the model's parameters.
  • It determines how quickly or slowly the algorithm converges toward the minimum of the cost function.
    • If Learning Rate too small:
      • Learning is very slow.
    • If Learning Rate too large:
      • May overshoot and bounce around, or even diverge.
  • Practical training needs a reasonable learning rate schedule.
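A sketch of the three regimes on a toy cost ($C(w) = (w - 3)^2$, derivative $2(w - 3)$, both assumed for illustration):

```python
def descend(lr, steps=20, w=0.0):
    """Gradient descent on C(w) = (w - 3)^2, whose derivative is 2(w - 3)."""
    for _ in range(steps):
        w = w - lr * 2 * (w - 3)
    return w

print(descend(lr=0.01))  # too small: crawls, still far from 3 after 20 steps
print(descend(lr=0.1))   # reasonable: lands close to the minimum at 3
print(descend(lr=1.1))   # too large: overshoots back and forth and diverges
```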

High-Dimensional Perspective

  • A small network can already have thousands of parameters.
  • Real networks can have millions or billions.
  • We cannot visualize this space directly, but gradient descent still applies mathematically.

Why Random Initialization?

  • If all weights start the same, many neurons stay symmetric and learn the same thing.
  • Random initialization breaks symmetry.
  • Then different neurons can learn different useful features.
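A tiny sketch of why identical initialization is a trap, using a made-up 1-2-1 linear network with hand-derived gradients (all values here are illustrative, not from the chapter):

```python
import random

def gradients(x, target, w, v):
    """Toy 1-2-1 linear net: hidden h_i = w_i * x, output = v_1*h_1 + v_2*h_2.
    Returns hand-derived gradients of the cost (output - target)^2
    with respect to the hidden weights w and the output weights v."""
    h = [w[0] * x, w[1] * x]
    err = (v[0] * h[0] + v[1] * h[1]) - target
    dw = [2 * err * v[0] * x, 2 * err * v[1] * x]
    dv = [2 * err * h[0], 2 * err * h[1]]
    return dw, dv

# Symmetric start: both hidden neurons receive exactly the same gradient,
# so every update keeps them identical -- they can never specialize.
dw, dv = gradients(x=1.0, target=2.0, w=[0.5, 0.5], v=[0.5, 0.5])
print(dw[0] == dw[1], dv[0] == dv[1])  # True True

# Random start breaks the symmetry: the two neurons get different updates.
random.seed(1)
w = [random.uniform(-1, 1) for _ in range(2)]
v = [random.uniform(-1, 1) for _ in range(2)]
dw, dv = gradients(x=1.0, target=2.0, w=w, v=v)
print(dw[0] == dw[1], dv[0] == dv[1])  # False False
```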

Example

Example 1: One-Parameter Gradient Descent

  • Suppose: $C(w) = (w - 3)^2$ (a toy cost with its minimum at $w = 3$)

  • Then: $\frac{dC}{dw} = 2(w - 3)$

  • Let initial $w = 0$ and learning rate $\eta = 0.1$ (illustrative values).

    • Step 1:
      • Gradient at $w = 0$: $2(0 - 3) = -6$
      • Update: $w \leftarrow 0 - 0.1 \times (-6) = 0.6$
    • Step 2:
      • Gradient at $w = 0.6$: $2(0.6 - 3) = -4.8$
      • Update: $w \leftarrow 0.6 - 0.1 \times (-4.8) = 1.08$
  • You can see $w$ gradually moves toward 3, where the cost is minimal.

  • The update formula stays the same whether the slope is negative or positive;

    • only the direction of the update changes.
    • If $\frac{dC}{dw} = 0$, the update term is zero, so $w$ does not change; this may mean we have already reached the minimum point.
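The two steps above can be replayed in code; the toy cost $C(w) = (w - 3)^2$ with initial $w = 0$ and $\eta = 0.1$ are the assumed illustrative values:

```python
w, lr = 0.0, 0.1

for step in range(1, 3):
    grad = 2 * (w - 3)   # derivative of C(w) = (w - 3)^2
    w = w - lr * grad
    print(step, grad, w)
# step 1: grad = -6.0,  w ≈ 0.6
# step 2: grad ≈ -4.8,  w ≈ 1.08
```

Running more iterations would show $w$ creeping ever closer to 3, with the steps shrinking as the slope flattens.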

Example 2: Network Parameter Update (Conceptual)

  • Suppose one weight $w$ has derivative $\frac{\partial C}{\partial w}$.
    • Update: $w \leftarrow w - \eta \, \frac{\partial C}{\partial w}$
  • If the derivative is positive, increasing $w$ increases the cost, so the update makes $w$ smaller.
  • If the derivative is negative, increasing $w$ decreases the cost, so the update makes $w$ larger.

Summary

  • Neural network learning is an optimization problem.
  • Define cost, compute gradients, update parameters repeatedly.
  • Gradient descent is the core engine behind this process.
  • Chapter 2 builds the optimization intuition; Chapter 3 explains backprop in detail.

Derivative Power Rule

$$\frac{d}{dw} w^n = n \, w^{n-1}$$

Examples:

  • $\frac{d}{dw} w^2 = 2w$
  • $\frac{d}{dw} w^3 = 3w^2$