Core Idea

Cost tells us how bad the current parameters are.
Gradient tells us how to change the parameters to make the cost smaller.

1. Single-Parameter Case

This is the simplest possible case, used only to illustrate the idea.

  • Suppose $C(w) = (w - 3)^2$

    • Assume this cost function is smallest when $w = 3$
    • $w$ is the parameter
    • $C(w)$ is the cost
    • the gradient is the derivative: $\frac{dC}{dw} = 2(w - 3)$
  • If $w = 1$, then $C(1) = (1 - 3)^2 = 4$

  • The derivative is $\frac{dC}{dw} = 2(w - 3)$, so at $w = 1$, $\frac{dC}{dw} = 2(1 - 3) = -4$

  • This means:

    • current cost = 4
    • current gradient = -4
  • The update rule is $w \leftarrow w - \eta \frac{dC}{dw}$, where $\eta$ is the learning rate

  • Since the current gradient is $-4$, the slope is negative at $w = 1$.

    • A negative gradient means that increasing $w$ will reduce the cost.
    • So gradient descent moves in the opposite direction of the gradient.
  • Using the update rule $w \leftarrow w - \eta \frac{dC}{dw}$:

    • if $\eta = 0.1$, then $w_{\text{new}} = 1 - 0.1 \times (-4) = 1.4$
  • This means:

    • the parameter moves from $w = 1$ to $w = 1.4$
    • it moves closer to $w = 3$, which is the value that minimizes the cost
    • after the update, the cost becomes smaller: $C(1.4) = (1.4 - 3)^2 = 2.56 < 4$
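The single-parameter walkthrough above can be sketched in a few lines of code. The cost function, its derivative, the starting point $w = 1$, and the learning rate $0.1$ all come from the example; this is a minimal illustration, not a real training implementation:

```python
# Gradient descent on the example cost C(w) = (w - 3)^2, smallest at w = 3.
def cost(w):
    return (w - 3) ** 2

def grad(w):
    # derivative dC/dw = 2(w - 3)
    return 2 * (w - 3)

w = 1.0    # current parameter value
eta = 0.1  # learning rate from the example

print(cost(w), grad(w))  # cost 4.0, gradient -4.0

# one update step: w <- w - eta * dC/dw
w = w - eta * grad(w)
print(w, cost(w))  # w moves from 1 toward 1.4; the cost drops below 4
```

Repeating the update step moves $w$ ever closer to the minimizer $w = 3$.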

2. Multi-Parameter Case

This is the real neural network case.

  • The cost depends on many parameters: $C(w_1, w_2, \ldots, w_n)$
  • So we use partial derivatives instead of a single ordinary derivative.
    • A partial derivative measures how a function changes with respect to one variable while all other variables are kept fixed.
  • The gradient vector is built one component at a time:
    • $\frac{\partial C}{\partial w_1}$: change only $w_1$, keep others fixed
    • $\frac{\partial C}{\partial w_2}$: change only $w_2$, keep others fixed
    • $\frac{\partial C}{\partial w_n}$: change only $w_n$, keep others fixed

Then we stack them together:

$$\nabla C = \left[ \frac{\partial C}{\partial w_1}, \frac{\partial C}{\partial w_2}, \ldots, \frac{\partial C}{\partial w_n} \right]$$

At the current parameter values, the cost function gives one cost value, which is a single number.
From this one cost function, we compute a gradient vector by measuring how that cost changes with respect to each parameter, one parameter at a time.

  • cost = a scalar: one number that measures how bad the current parameters are
    • The cost function produces one scalar value.
    • scalar: a single numerical value
  • gradient = a vector: a collection of partial derivatives showing how that same cost changes with respect to each parameter
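The "scalar cost, vector gradient" idea can be seen concretely with finite differences: nudge one parameter at a time, keep the others fixed, and record how the single cost number changes. The quadratic cost below is a made-up stand-in for a network's cost function:

```python
# One scalar cost over several parameters (stand-in for a network's cost).
TARGETS = [3.0, -1.0, 2.0]

def cost(params):
    # sum of squared distances from a target point -> a single number
    return sum((p - t) ** 2 for p, t in zip(params, TARGETS))

def numerical_gradient(f, params, h=1e-6):
    # one partial derivative per parameter, built one component at a time
    grad = []
    for i in range(len(params)):
        nudged = list(params)
        nudged[i] += h  # change only parameter i, keep others fixed
        grad.append((f(nudged) - f(params)) / h)
    return grad

params = [1.0, 0.0, 0.0]
print(cost(params))                       # cost: one scalar
print(numerical_gradient(cost, params))   # gradient: one value per parameter
```

The cost function returns one number, yet the gradient has as many components as there are parameters, because each component asks the same question about a different parameter.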

3. Relationship Between Cost and Gradient

  • cost tells us how bad the current parameters are
  • gradient tells us how each parameter should change to reduce the cost

In short:

  • cost = current error
  • gradient = direction for improvement

4. Training Process

  1. Randomly initialize weights and biases
  2. Run a forward pass
  3. Compute the cost
  4. Compute the gradient for each parameter
  5. Update parameters using gradient descent

For all parameters together, the update is applied to the whole vector at once:

$$\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla C$$
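The five training steps can be sketched as a loop. The tiny quadratic cost and its analytic gradient below are stand-ins for a real network's forward pass and backpropagation:

```python
import random

# Stand-in cost and gradient (a real network would compute these
# with a forward pass and backpropagation).
TARGETS = [3.0, -1.0, 2.0]

def cost(params):
    return sum((p - t) ** 2 for p, t in zip(params, TARGETS))

def gradient(params):
    return [2 * (p - t) for p, t in zip(params, TARGETS)]

random.seed(0)
params = [random.uniform(-1, 1) for _ in TARGETS]  # 1. random initialization
eta = 0.1

for step in range(100):
    c = cost(params)       # 2-3. "forward pass" and cost (one scalar)
    g = gradient(params)   # 4. gradient for each parameter (one vector)
    # 5. move every parameter opposite its gradient component
    params = [p - eta * gi for p, gi in zip(params, g)]

print(params)  # close to the targets [3.0, -1.0, 2.0]
```

Each pass through the loop repeats the cycle above, so the cost shrinks step by step.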

5. Important Distinction

Do not confuse parameter values with gradient values.

For example:

  • parameter value: $w = 1$ (from the single-parameter example)
  • gradient value: $\frac{dC}{dw} = -4$

The parameters are the current values inside the model.
The gradient values tell us how sensitive the cost is to those parameters.

6. Final Summary

  • single parameter → derivative

  • multiple parameters → gradient vector

  • Cost measures current performance.

  • Gradient shows how to adjust the parameters to make the cost smaller.