Every ML explanation gives you the same analogy: imagine you are standing on a hilly landscape in the dark. You want to get to the lowest point. You cannot see far, so you feel which direction slopes downward and take a small step that way. Repeat until you stop descending.
This is fine as far as it goes. The problem is that it leaves out everything that actually matters when gradient descent fails, slows down, or behaves unexpectedly — which is most of the time.
What the analogy skips
The "hilly landscape" is your loss surface — a function that maps every possible set of model weights to a loss value. For a model with millions of parameters, this surface exists in millions of dimensions. You cannot visualise it. The hill analogy is technically true but geometrically misleading.
What you control most directly in practice is not the shape of the surface but how you navigate it. That is where learning rate and batch size come in.
The learning rate is the most important hyperparameter
The learning rate controls how large each step is. Too large and you overshoot minima, bouncing around or diverging entirely. Too small and training takes forever or gets stuck in a flat region.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
A learning rate of 0.01 is a reasonable default for SGD. But "reasonable" is doing a lot of work there. In practice you will need to try a range and watch the loss curve.
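To make "try a range" concrete, here is a minimal sketch on a toy problem. The function f(w) = w**2, the starting point, and the sweep values are all assumptions chosen for illustration, not recommendations for a real model.

```python
# Toy problem (an assumption for illustration): minimise f(w) = w**2,
# whose gradient is 2*w and whose minimum sits at w = 0.

def run_gradient_descent(lr, steps=50, w0=5.0):
    """Run plain gradient descent on f(w) = w**2 and return the loss history."""
    w = w0
    losses = []
    for _ in range(steps):
        grad = 2.0 * w        # df/dw
        w = w - lr * grad     # the gradient descent update
        losses.append(w ** 2)
    return losses


# Hypothetical sweep values; a real sweep would plot the full curves.
for lr in (0.001, 0.01, 0.1, 1.1):
    final_loss = run_gradient_descent(lr)[-1]
    print(f"lr={lr}: final loss = {final_loss:.4g}")
```

On this toy surface the smallest rate has barely moved after 50 steps, 0.1 lands very close to the minimum, and 1.1 overshoots further on every step until the loss blows up. Real loss surfaces are messier, but the same three regimes show up in real loss curves.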
A learning rate schedule adjusts the rate during training — starting high to move quickly, then reducing to settle into a minimum:
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
This halves the learning rate every 10 epochs, provided you call scheduler.step() once per epoch, after that epoch's optimizer updates. Some form of decay schedule shows up in most serious training runs.
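If you want to see what a step schedule does without running a training loop, the arithmetic fits in one line. The helper below is a hypothetical stand-in written in plain Python, not part of PyTorch; it assumes the rate is multiplied by gamma once every step_size epochs.

```python
def step_decay_lr(base_lr, epoch, step_size=10, gamma=0.5):
    """Learning rate at a given epoch under step decay.

    Hypothetical helper: multiplies base_lr by gamma once for every
    completed block of step_size epochs.
    """
    return base_lr * gamma ** (epoch // step_size)


for epoch in (0, 9, 10, 19, 20, 30):
    print(epoch, step_decay_lr(0.01, epoch))
```

With a base rate of 0.01, the rate stays at 0.01 for epochs 0 to 9, drops to 0.005 at epoch 10, and to 0.0025 at epoch 20.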
Batch size and the gradient estimate
In full-batch gradient descent, you compute the gradient using the entire training set. This is accurate but slow. Stochastic gradient descent (SGD) in the strict sense uses a single example per step, which is fast but noisy. Mini-batch gradient descent is the practical middle ground: compute the gradient over a small batch (typically 32–256 examples) and update. In everyday usage, including torch.optim.SGD, "SGD" usually means this mini-batch variant.
The noise from mini-batches is not purely a problem. It can act as a form of implicit regularisation, helping the optimizer escape shallow local minima that full-batch descent can get stuck in.
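You can measure that noise directly. The sketch below compares mini-batch gradients with the full-batch gradient on a small synthetic regression problem. The data, the weights, and the helper names are assumptions made up for this example, and it only shows the size of the noise, not its regularising effect.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data (an assumption for illustration).
n_samples = 1000
X = rng.normal(size=(n_samples, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=n_samples)

w = np.zeros(3)  # current weights, deliberately far from the optimum


def mse_gradient(X_batch, y_batch, w):
    """Gradient of mean squared error with respect to w on one batch."""
    residual = X_batch @ w - y_batch
    return 2.0 * X_batch.T @ residual / len(y_batch)


full_grad = mse_gradient(X, y, w)


def gradient_noise(batch_size, n_trials=200):
    """Average distance between a mini-batch gradient and the full-batch one."""
    errors = []
    for _ in range(n_trials):
        idx = rng.choice(n_samples, size=batch_size, replace=False)
        batch_grad = mse_gradient(X[idx], y[idx], w)
        errors.append(np.linalg.norm(batch_grad - full_grad))
    return float(np.mean(errors))


for batch_size in (1, 32, 256):
    print(f"batch_size={batch_size}: mean gradient error = {gradient_noise(batch_size):.3f}")
```

The error shrinks as the batch grows: a single example gives a rough guess at the true gradient, and a batch of 256 gives a much tighter one. That is the trade you are making when you pick a batch size.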
What "converged" actually means
Training does not end when the loss reaches zero. It ends when the loss stops meaningfully decreasing — or when it starts increasing on the validation set (overfitting). Watching both curves together is how you decide when to stop.
# train_one_epoch and evaluate are helper functions you define yourself
for epoch in range(num_epochs):
    train_loss = train_one_epoch(model, train_loader, optimizer)
    val_loss = evaluate(model, val_loader)
    print(f"Epoch {epoch}: train={train_loss:.4f} val={val_loss:.4f}")
If val_loss rises while train_loss falls, you are overfitting. If both are high and flat, the usual suspects are a learning rate that is too low or a model that is underpowered.
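One common way to turn "watch the validation curve" into a rule is patience-based early stopping: stop once the validation loss has gone several epochs without improving. The sketch below is one possible version. The function name, the patience and min_delta parameters, and the loss values are all made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the epoch index at which training would stop, or None.

    Stops once the validation loss has failed to beat its best value
    by more than min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_loss - min_delta:
            best_loss = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Made-up validation losses: improving at first, then rising (overfitting).
history = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]
print(early_stopping_epoch(history))  # prints 6
```

Here the best validation loss is at epoch 3, and the rule stops at epoch 6 after three epochs without improvement. In a real run you would also keep a copy of the weights from the best epoch.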
The thing to implement once
Implement gradient descent from scratch for a simple linear regression problem — no PyTorch, no scikit-learn. Just NumPy. Compute the gradient manually, update the weights, watch the loss fall.
It takes an afternoon. It will change how you read every training log for the rest of your ML work.
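If you want something to check your own attempt against, here is one possible shape for that exercise. It is a minimal sketch, not the only way to do it: the synthetic data (y = 3x + 2 plus noise), the learning rate, and the step count are all assumptions picked to make the toy problem converge.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (an assumption): y = 3x + 2 plus a little noise.
x = rng.uniform(-1.0, 1.0, size=200)
y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=200)

w, b = 0.0, 0.0   # slope and intercept, both start at zero
lr = 0.1          # learning rate picked by hand for this toy problem

losses = []
for step in range(500):
    pred = w * x + b
    error = pred - y
    losses.append(np.mean(error ** 2))   # mean squared error

    # Gradients of the MSE, derived by hand:
    #   dL/dw = 2 * mean(error * x)
    #   dL/db = 2 * mean(error)
    grad_w = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)

    # The gradient descent update
    w -= lr * grad_w
    b -= lr * grad_b

    if step % 100 == 0:
        print(f"step {step}: loss={losses[-1]:.4f} w={w:.3f} b={b:.3f}")
```

The recovered w and b should end up close to 3 and 2. Try changing lr to 0.001 or 1.5 and watch what happens to the printed loss.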