Understanding gradient descent — not just what it is

Every ML explanation gives you the same analogy: imagine you are standing on a hilly landscape in the dark. You want to get to the lowest point. You cannot see far, so you feel which direction slopes downward and take a small step that way. Repeat until you stop descending.

This is fine as far as it goes. The problem is that it leaves out everything that actually matters when gradient descent fails, slows down, or behaves unexpectedly — which is most of the time.

What the analogy skips

The "hilly landscape" is your loss surface — a function that maps every possible set of model weights to a loss value. For a model with millions of parameters, this surface exists in millions of dimensions. You cannot visualise it. The hill analogy is technically true but geometrically misleading.

What matters in practice is not the shape of the surface but how you navigate it. That is where learning rate and batch size come in.

The learning rate is the most important hyperparameter

The learning rate controls how large each step is. Too large and you overshoot minima, bouncing around or diverging entirely. Too small and training takes forever or gets stuck in a flat region.
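To make both failure modes concrete, here is a minimal sketch in plain Python. It assumes the simplest possible loss, f(w) = w**2, whose gradient is 2*w. The function name descend and the three learning rates are illustrative choices, not values from a real training run.

```python
def descend(lr, steps=20, w=1.0):
    """Run plain gradient descent on f(w) = w**2 and return the final w."""
    for _ in range(steps):
        grad = 2.0 * w       # derivative of w**2 at the current w
        w = w - lr * grad    # the gradient descent update
    return w

# Too small, about right, and too large for this particular loss.
for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr}: w after 20 steps = {descend(lr):.4g}")
```

On this loss each update multiplies w by (1 - 2 * lr). The smallest rate barely moves, the middle one shrinks w quickly, and the largest makes the magnitude of w grow on every step. Real loss surfaces are not this tidy, but the same three regimes show up in a loss curve.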

python

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

A learning rate of 0.01 is a reasonable starting point for SGD. But "reasonable" is doing a lot of work there. In practice you will need to try a range, usually spaced by powers of ten (0.1, 0.01, 0.001), and watch the loss curve.

A learning rate schedule adjusts the rate during training — starting high to move quickly, then reducing to settle into a minimum:

python

scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

This halves the learning rate every 10 epochs, provided you call scheduler.step() once per epoch. Some form of decay schedule shows up in most serious training runs.
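If you want to see the numbers without running a training loop, the same schedule can be written as a one-line formula. This is a sketch in plain Python, assuming the scheduler is stepped once per epoch. The helper name step_decay is made up for illustration and is not part of PyTorch.

```python
def step_decay(base_lr, epoch, step_size=10, gamma=0.5):
    """Learning rate in effect after `epoch` completed epochs of step decay."""
    # Every full block of `step_size` epochs multiplies the rate by `gamma`.
    return base_lr * gamma ** (epoch // step_size)

for epoch in (0, 9, 10, 19, 20, 30):
    print(f"epoch {epoch}: lr = {step_decay(0.01, epoch)}")
```

The rate stays at 0.01 for epochs 0 to 9, drops to 0.005 at epoch 10, and to 0.0025 at epoch 20. Printing the schedule like this before a long run is a cheap way to check that the rate does not decay to nothing halfway through.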

Batch size and the gradient estimate

In full-batch gradient descent, you compute the gradient using the entire training set. This is accurate but slow. Stochastic gradient descent (SGD), in the strict sense, uses a single example per step: fast but noisy. Mini-batch gradient descent is the practical middle ground: compute the gradient over a small batch (typically 32–256 examples) and update. This is what most people mean when they say "SGD", and it is what you get when you pair torch.optim.SGD with a batched data loader.

The noise from mini-batches is not purely a problem. It is widely believed to act as a form of implicit regularisation, and it can help the optimizer escape shallow local minima that full-batch descent would get stuck in.
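You can measure that noise directly. The sketch below uses only the standard library and a made-up one-parameter regression problem (y = 3x plus noise). It compares the full-batch gradient with repeated estimates at a few batch sizes. The data, the helper grad_mse, and the batch sizes are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)

# Synthetic data for y = 3x + noise; the model is y_hat = w * x.
xs = [random.uniform(-1.0, 1.0) for _ in range(1000)]
ys = [3.0 * x + random.gauss(0.0, 0.1) for x in xs]

def grad_mse(w, indices):
    """Gradient of the mean squared error with respect to w over the given examples."""
    indices = list(indices)
    return sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in indices) / len(indices)

w = 0.0
full = grad_mse(w, range(len(xs)))  # the gradient full-batch descent would use
print(f"full-batch gradient: {full:.3f}")

spreads = {}
for batch_size in (1, 32, 256):
    # Draw 200 random batches and record the gradient estimate from each.
    estimates = [
        grad_mse(w, random.sample(range(len(xs)), batch_size))
        for _ in range(200)
    ]
    spreads[batch_size] = statistics.stdev(estimates)
    print(f"batch_size={batch_size}: mean={statistics.mean(estimates):.3f} "
          f"spread={spreads[batch_size]:.3f}")
```

The estimates centre on the full-batch value at every batch size. What changes is the spread, which shrinks as the batch grows. That spread is the noise the paragraph above is talking about.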

What "converged" actually means

Training does not end when the loss reaches zero. It ends when the loss stops meaningfully decreasing — or when it starts increasing on the validation set (overfitting). Watching both curves together is how you decide when to stop.

python

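# train_one_epoch and evaluate are placeholders for your own helpers;
# each is assumed to return the mean loss over its data loader.
for epoch in range(num_epochs):
    train_loss = train_one_epoch(model, train_loader, optimizer)  # updates the weights
    val_loss = evaluate(model, val_loader)  # measurement only, no weight updates
    print(f"Epoch {epoch}: train={train_loss:.4f} val={val_loss:.4f}")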

If val_loss rises while train_loss falls, you are overfitting. If both are high and flat, you are underfitting: the usual suspects are a learning rate that is too low (or so high that the loss cannot settle) or a model that is underpowered.
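One common way to turn "stop when validation loss stops improving" into code is early stopping with a patience counter. The sketch below is one possible version, not a standard API. The function name, the patience and min_delta parameters, and the loss history are all hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the epoch index at which training would stop, or None if it never triggers."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # New best validation loss: reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Made-up validation losses: improving at first, then rising.
history = [0.90, 0.70, 0.60, 0.58, 0.59, 0.61, 0.64, 0.70]
print(early_stopping_epoch(history))
```

With this history the best value is at epoch 3, and the function stops three epochs later, at epoch 6. In a real loop you would also keep a copy of the weights from the best epoch, since the final weights are by construction a few epochs past it.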

The thing to implement once

Implement gradient descent from scratch for a simple linear regression problem — no PyTorch, no scikit-learn. Just NumPy. Compute the gradient manually, update the weights, watch the loss fall.

It takes an afternoon. It will change how you read every training log for the rest of your ML work.
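If you want something to check your own attempt against, here is one minimal sketch of that exercise. It assumes a one-feature problem with synthetic data (y = 2x + 1 plus noise) and a mean squared error loss. The learning rate, step count, and data are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2x + 1 plus a little noise (values chosen for illustration).
X = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * X + 1.0 + rng.normal(0.0, 0.1, size=200)

w, b = 0.0, 0.0   # parameters to learn
lr = 0.1          # learning rate

losses = []
for step in range(500):
    error = (w * X + b) - y                # prediction minus target
    losses.append(np.mean(error ** 2))     # mean squared error at this step
    grad_w = 2.0 * np.mean(error * X)      # dL/dw, derived by hand
    grad_b = 2.0 * np.mean(error)          # dL/db, derived by hand
    w -= lr * grad_w
    b -= lr * grad_b

print(f"w={w:.3f} b={b:.3f} first loss={losses[0]:.4f} last loss={losses[-1]:.4f}")
```

The learned w and b should land close to 2 and 1. Once it works, change lr to 1.5 or 0.0001 and rerun it. Seeing those loss curves is most of the point of the exercise.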

