Core Idea

Cost tells us how bad the current parameters are.
Gradient tells us how to change the parameters to make the cost smaller.

1. Single-Parameter Case

This is the simplest possible case, used only to illustrate the idea.

  • Suppose $C(w) = (w - 3)^2$

    • Assume this cost function is smallest when $w = 3$
    • $w$ is the parameter
    • $C(w)$ is the cost
    • the gradient is the derivative: $\frac{dC}{dw} = 2(w - 3)$
  • If $w = 1$, then $C(1) = (1 - 3)^2 = 4$

  • The derivative is $\frac{dC}{dw} = 2(w - 3)$, so at $w = 1$, $\frac{dC}{dw} = 2(1 - 3) = -4$

  • This means:

    • current cost = 4
    • current gradient = -4
  • The update rule is $w \leftarrow w - \eta \frac{dC}{dw}$, where $\eta$ is the learning rate

  • Since the current gradient is $-4$, the slope is negative at $w = 1$.

    • A negative gradient means that increasing $w$ will reduce the cost.
    • So gradient descent moves in the opposite direction of the gradient.
  • Using the update rule $w \leftarrow w - \eta \frac{dC}{dw}$:

    • if $\eta = 0.1$, then $w_{\text{new}} = 1 - 0.1 \times (-4) = 1.4$
  • This means:

    • the parameter moves from $w = 1$ to $w = 1.4$
    • it moves closer to $w = 3$, which is the value that minimizes the cost
    • after the update, the cost becomes smaller: $C(1.4) = (1.4 - 3)^2 = 2.56 < 4$
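The single-parameter walkthrough above can be sketched in a few lines of code. The cost function, its derivative, the starting point $w = 1$, and the learning rate $0.1$ all come from the example; this is a minimal illustration, not a real training implementation:

```python
# Gradient descent on the example cost C(w) = (w - 3)^2, smallest at w = 3.
def cost(w):
    return (w - 3) ** 2

def grad(w):
    # derivative dC/dw = 2(w - 3)
    return 2 * (w - 3)

w = 1.0    # current parameter value
eta = 0.1  # learning rate from the example

print(cost(w), grad(w))  # cost 4.0, gradient -4.0

# one update step: w <- w - eta * dC/dw
w = w - eta * grad(w)
print(w, cost(w))  # w moves from 1 toward 1.4; the cost drops below 4
```

Repeating the update step moves $w$ ever closer to the minimizer $w = 3$.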

2. Multi-Parameter Case

This is the real neural network case.

  • The cost depends on many parameters: $C(w_1, w_2, \ldots, w_n)$
  • So we use partial derivatives instead of a single ordinary derivative.
    • A partial derivative measures how a function changes with respect to one variable while all other variables are kept fixed.
  • The gradient vector is built one component at a time:
    • $\frac{\partial C}{\partial w_1}$: change only $w_1$, keep others fixed
    • $\frac{\partial C}{\partial w_2}$: change only $w_2$, keep others fixed
    • $\frac{\partial C}{\partial w_n}$: change only $w_n$, keep others fixed

Then we stack them together:

$$\nabla C = \left[ \frac{\partial C}{\partial w_1}, \frac{\partial C}{\partial w_2}, \ldots, \frac{\partial C}{\partial w_n} \right]$$

At the current parameter values, the cost function gives one cost value, which is a single number.
From this one cost function, we compute a gradient vector by measuring how that cost changes with respect to each parameter, one parameter at a time.

  • cost = a scalar: one number that measures how bad the current parameters are
    • The cost function produces one scalar value.
    • scalar: a single numerical value
  • gradient = a vector: a collection of partial derivatives showing how that same cost changes with respect to each parameter
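The "scalar cost, vector gradient" idea can be seen concretely with finite differences: nudge one parameter at a time, keep the others fixed, and record how the single cost number changes. The quadratic cost below is a made-up stand-in for a network's cost function:

```python
# One scalar cost over several parameters (stand-in for a network's cost).
TARGETS = [3.0, -1.0, 2.0]

def cost(params):
    # sum of squared distances from a target point -> a single number
    return sum((p - t) ** 2 for p, t in zip(params, TARGETS))

def numerical_gradient(f, params, h=1e-6):
    # one partial derivative per parameter, built one component at a time
    grad = []
    for i in range(len(params)):
        nudged = list(params)
        nudged[i] += h  # change only parameter i, keep others fixed
        grad.append((f(nudged) - f(params)) / h)
    return grad

params = [1.0, 0.0, 0.0]
print(cost(params))                       # cost: one scalar
print(numerical_gradient(cost, params))   # gradient: one value per parameter
```

The cost function returns one number, yet the gradient has as many components as there are parameters, because each component asks the same question about a different parameter.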

3. Relationship Between Cost and Gradient

  • cost tells us how bad the current parameters are
  • gradient tells us how each parameter should change to reduce the cost

In short:

  • cost = current error
  • gradient = direction for improvement

4. Training Process

  1. Randomly initialize weights and biases
  2. Run a forward pass
  3. Compute the cost
  4. Compute the gradient for each parameter
  5. Update parameters using gradient descent

For all parameters together, the update is applied to the whole vector at once:

$$\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla C$$
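The five training steps can be sketched as a loop. The tiny quadratic cost and its analytic gradient below are stand-ins for a real network's forward pass and backpropagation:

```python
import random

# Stand-in cost and gradient (a real network would compute these
# with a forward pass and backpropagation).
TARGETS = [3.0, -1.0, 2.0]

def cost(params):
    return sum((p - t) ** 2 for p, t in zip(params, TARGETS))

def gradient(params):
    return [2 * (p - t) for p, t in zip(params, TARGETS)]

random.seed(0)
params = [random.uniform(-1, 1) for _ in TARGETS]  # 1. random initialization
eta = 0.1

for step in range(100):
    c = cost(params)       # 2-3. "forward pass" and cost (one scalar)
    g = gradient(params)   # 4. gradient for each parameter (one vector)
    # 5. move every parameter opposite its gradient component
    params = [p - eta * gi for p, gi in zip(params, g)]

print(params)  # close to the targets [3.0, -1.0, 2.0]
```

Each pass through the loop repeats the cycle above, so the cost shrinks step by step.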

5. Important Distinction

Do not confuse parameter values with gradient values.

For example:

  • parameter value: $w = 1$ (from the single-parameter example)
  • gradient value: $\frac{dC}{dw} = -4$

The parameters are the current values inside the model.
The gradient values tell us how sensitive the cost is to those parameters.

6. Final Summary

  • single parameter → derivative

  • multiple parameters → gradient vector

  • Cost measures current performance.

  • Gradient shows how to adjust the parameters to make the cost smaller.