From Derivatives to Adam: Building Optimization Intuition with PyTorch
From first and second derivatives to Adam optimizer: a hands-on walkthrough of key optimization ideas using PyTorch. Simple, minimal examples with gradients, momentum, and adaptive learning strategies to build deep intuition.
Optimization is the backbone of machine learning, statistics, and applied mathematics. In this guide, we'll walk from first principles like the first and second derivative, through classical optimizers like Gradient Descent and Momentum, and build up to the powerful Adam optimizer. We use small, intuitive examples with PyTorch to illustrate each step.
In this article, we try to explain:
- Convexity and the first and second derivative tests
- How Newton-Raphson and Gradient Descent work
- How Momentum, Nesterov Accelerated Gradient, AdaGrad, RMSProp, and Adam evolve from earlier ideas
- PyTorch implementations for each method
- A brief mention of other optimization techniques outside the scope
Derivatives and Convexity

An optimization problem minimizes a function f : R^n → R.
A function is convex if the line segment between any two points on its graph lies on or above the graph. Formally, f is convex if:
f(λx + (1−λ)y) ≤ λf(x) + (1−λ)f(y), ∀ x, y ∈ R^n, λ ∈ [0, 1]
A convex function is one whose graph curves upward, like a bowl. This is important because convex functions have a nice property: any local minimum is also a global minimum, which makes them much easier to work with.
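To make the inequality concrete, here is a quick numerical check of the definition for f(x) = x² at two arbitrary points (a sketch, not a proof; the points and the grid of λ values are just illustrative):

import torch

def f(x):
    return x ** 2  # a convex function

x, y = torch.tensor(-1.5), torch.tensor(4.0)   # two arbitrary points
for lam in torch.linspace(0, 1, 5):
    lhs = f(lam * x + (1 - lam) * y)           # f at the interpolated point
    rhs = lam * f(x) + (1 - lam) * f(y)        # chord between f(x) and f(y)
    print(f"lambda={lam.item():.2f}  f(interp)={lhs.item():.3f} <= chord={rhs.item():.3f} : {(lhs <= rhs).item()}")

Every line should print True; for a non-convex function, at least one λ would violate the inequality.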
What’s a Gradient? (And Why It Matters)
Imagine you're hiking on a hill, and you want to know which direction is steepest to climb or descend. The gradient of a function is like a compass pointing you in the direction of the steepest increase. In mathematical terms, the gradient is a vector of partial derivatives: it tells you how the function changes in each direction.
For example, suppose you have a simple function:
f(x, y) = x² + y²
This represents a bowl-shaped surface. The gradient is:
∇f(x, y) = (2x, 2y)
This tells you how the function increases in the x and y directions.
What’s the Hessian?
While the gradient tells you the direction of steepest ascent, the Hessian tells you how the steepness itself changes: it captures the curvature of the surface. It’s a matrix of second-order partial derivatives.
For our function:
f(x, y) = x² + y²
The Hessian is:
H = [[2, 0], [0, 2]], i.e. a 2×2 matrix with 2 on the diagonal and 0 elsewhere.
This matrix tells us how the slope (the gradient) changes as we move across the surface. Since both diagonal values are positive, the function curves upward in all directions.
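Both quantities can also be read off numerically with autograd. Here is a minimal 2D sketch (the evaluation point (1.0, 2.0) is arbitrary), using torch.autograd.functional.hessian for the second derivatives:

import torch

def f(xy):
    return (xy ** 2).sum()   # f(x, y) = x² + y²

p = torch.tensor([1.0, 2.0], requires_grad=True)
grad = torch.autograd.grad(f(p), p)[0]           # (2x, 2y) -> [2., 4.]
hess = torch.autograd.functional.hessian(f, p)   # [[2., 0.], [0., 2.]]
print("Gradient:", grad)
print("Hessian:\n", hess)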
First Derivative Test
The first derivative of a function (the gradient, in the multivariate case) tells us the slope, i.e. the rate of change. For scalar functions:
- Positive gradient: function is increasing
- Negative gradient: function is decreasing
A local minimum x* satisfies ∇f(x*) = 0, i.e. the derivative (equivalently, the slope of the tangent) at a minimum or maximum is zero.
Second Derivative Test
The second derivative tells us about curvature:
- Positive: convex (a minimum is possible)
- Negative: concave (a maximum is possible)
If ∇f(x*) = 0 and ∇²f(x*) ≻ 0, then x* is a local minimum.
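For example, take f(x) = (x − 3)², the same function used in the code below: ∇f(x) = 2(x − 3) = 0 gives x* = 3, and ∇²f(x) = 2 > 0, so x* = 3 is a (global) minimum.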
What Does “Positive Semi-Definite” Mean?
A matrix is positive semi-definite if the quadratic form it defines never "curves down": the function either curves up or stays flat. In our example, the Hessian has positive values on the diagonal and zeros elsewhere, so it is positive definite (which is even stronger than semi-definite).
Because the Hessian above is positive semi-definite everywhere (in fact, it is constant), we can say that f(x, y) = x² + y² is a convex function.
Here's a breakdown of the different types of definiteness and what each one means.

| Type | Symbolic Condition | Eigenvalues | Curvature Meaning | Function Shape |
|---|---|---|---|---|
| Positive Definite | xᵀHx > 0 for all x ≠ 0 | All positive | Curves upward in every direction | Bowl (strictly convex) |
| Positive Semi-Definite | xᵀHx ≥ 0 for all x | All non-negative | Curves upward or flat, never downward | Flat bowl or valley |
| Negative Definite | xᵀHx < 0 for all x ≠ 0 | All negative | Curves downward in every direction | Dome (strictly concave) |
| Negative Semi-Definite | xᵀHx ≤ 0 for all x | All non-positive | Curves downward or flat, never upward | Flat dome or ridge |
| Indefinite | xᵀHx can be positive or negative | Mixed signs | Curves up in some directions, down in others | Saddle shape |
Summary:
- Positive Definite: Think of a round bowl: every direction curves upward → convex.
- Positive Semi-Definite: Like a bowl with some flat directions → still convex, but less strict.
- Negative Definite: Upside-down bowl: curves downward → concave.
- Negative Semi-Definite: Flat dome or ridge: never curves upward.
- Indefinite: Saddle: some directions curve up, others down → not convex or concave.
Why This Matters for Optimization
- For minimization problems, we want the function to be convex.
- So, check whether the function's Hessian is positive semi-definite (or definite) in the region of interest; a quick way to do this is to inspect its eigenvalues, as in the sketch below.
- If it's indefinite, the function has saddle points, which makes it trickier to optimize.
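A minimal sketch of such a check, using the (constant) Hessian of x² + y² as the example matrix: look at the eigenvalues of the symmetric Hessian and classify by their signs.

import torch

H = torch.tensor([[2.0, 0.0],
                  [0.0, 2.0]])          # Hessian of x² + y² (constant)
eig = torch.linalg.eigvalsh(H)          # eigenvalues of a symmetric matrix
print("Eigenvalues:", eig)
if (eig > 0).all():
    print("Positive definite -> strictly convex here")
elif (eig >= 0).all():
    print("Positive semi-definite -> convex here")
elif (eig < 0).all():
    print("Negative definite -> strictly concave here")
elif (eig <= 0).all():
    print("Negative semi-definite -> concave here")
else:
    print("Indefinite -> saddle behaviour")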
Here’s how to compute gradient and Hessian in PyTorch:
import torch

def f(x):
    return (x - 3) ** 2

# First derivative
x = torch.tensor([2.0], requires_grad=True)
y = f(x)
y.backward()                       # populates x.grad with ∇f(x)
print('Gradient:', x.grad.item())  # 2*(x - 3) = -2 at x = 2.0

# Second derivative (Hessian for 1D)
x = torch.tensor([2.0], requires_grad=True)
y = f(x)
g = torch.autograd.grad(y, x, create_graph=True)[0]  # ∇f(x), kept in the graph
H = torch.autograd.grad(g, x)[0]                     # ∇²f(x)
print('Hessian:', H.item())  # Should be 2
Remember:
- Gradient = direction of steepest change.
- Hessian = how that steepest direction changes (the curvature).
- If the Hessian is positive semi-definite everywhere, the function is convex — like a bowl that never dips down.
Below is a quick preview of how the different optimizers behave on toy optimization problems; we look at each step an optimizer takes on its way to the minimum.
The code provided in this article to understand these optimizers is available in this Kaggle notebook.
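As a minimal sketch of such a step-by-step comparison (the hyperparameters below are illustrative, not tuned), the following loop runs several of PyTorch's built-in optimizers on the same toy function f(x) = (x − 3)² and prints the first few iterates:

import torch

def make_opt(name, params):
    if name == "SGD":       return torch.optim.SGD(params, lr=0.1)
    if name == "Momentum":  return torch.optim.SGD(params, lr=0.1, momentum=0.9)
    if name == "Adagrad":   return torch.optim.Adagrad(params, lr=0.5)
    if name == "RMSprop":   return torch.optim.RMSprop(params, lr=0.1, alpha=0.9)
    return torch.optim.Adam(params, lr=0.1)

for name in ["SGD", "Momentum", "Adagrad", "RMSprop", "Adam"]:
    x = torch.nn.Parameter(torch.tensor([0.0]))   # same starting point for every optimizer
    opt = make_opt(name, [x])
    path = []
    for _ in range(20):
        loss = (x - 3) ** 2
        opt.zero_grad()
        loss.backward()
        opt.step()
        path.append(round(x.item(), 3))
    print(f"{name:>8}: {path[:5]} ... final={path[-1]}")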
Newton-Raphson Method for Minimization
Originally used to find roots f(x) = 0, we adapt Newton-Raphson to find stationary points ∇f(x) = 0:
x_{k+1} = x_k − ∇f(x_k) / ∇²f(x_k)
Explanation:
- ∇f(x_k): gradient (slope)
- ∇²f(x_k): Hessian (curvature)
- This step updates x_k by moving in the direction of zero slope, adjusted by curvature
This is slightly different from the original Newton-Raphson method: instead of finding a root of the function (function-over-derivative), we find a root of the gradient (gradient-over-Hessian), i.e. a point where the slope is zero.
Convergence to the minimum is quite fast: even if the initialization is very far away, only a few steps are needed (for a quadratic like the one below, a single Newton step lands exactly on the minimum).
PyTorch snippet (for 1D):

import torch

# The value of x is our initial guess / starting position
x = torch.tensor([9999.0], requires_grad=True)
for _ in range(5):
    f = (x - 3.9999) ** 2
    grad = torch.autograd.grad(f, x, create_graph=True)[0]
    hess = torch.autograd.grad(grad, x, create_graph=True)[0]
    with torch.no_grad():
        x -= grad / hess
    print(x.item())
print("Newton result:", x.item())  # Should be very close to 3.9999
Limitations:
- Requires the second derivative (expensive in high dimensions)
- Not stable unless the function is very smooth
Gradient Descent (GD)
Gradient Descent updates the variable in the direction opposite to the gradient of the loss function.
The update rule is:
x_{k+1} = x_k − η∇f(x_k)
Terms:
- x_k: current value of the parameter
- η: learning rate (step size), a small positive scalar
- ∇f(x_k): gradient (direction of steepest ascent) of the function at x_k
No second derivative is involved; we simply move in the direction of the negative gradient to reach the minimum. Convergence depends on the initialization and the learning rate: if the initialization is far from the minimum, more steps are needed, the learning rate has to be tuned, and the method might stop early.
PyTorch version:
import torch

x = torch.tensor([2.0], requires_grad=True)  # initialization
lr = 0.1  # learning rate
for i in range(20):
    f = (x - 3) ** 2
    f.backward()
    with torch.no_grad():
        x -= lr * x.grad
    _ = x.grad.zero_()
    print(f"Result at step {i+1} : {x.item()}")
print("GD result:", x.item())
Try increasing the initial value of x to a very large number in the snippet above: you'll notice that 20 steps are no longer enough to converge and that lr has to be tuned for convergence.
Stochastic Gradient Descent (SGD)
SGD follows the same update rule as Gradient Descent, but instead of using the full dataset (or exact function), it uses a randomly selected subset (mini-batch). It introduces noise but often improves generalization.
Use PyTorch’s built-in SGD:
import torch

x = torch.nn.Parameter(torch.tensor([2.0]))
opt = torch.optim.SGD([x], lr=0.1)  # omitted momentum argument (next method)
for i in range(20):
    loss = (x - 3) ** 2
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(f"Result at step {i+1} : {x.item()}")
print("SGD result:", x.item())
Momentum
Momentum adds a velocity term that accumulates gradients over time, allowing the optimizer to gain speed in directions with consistent gradient:
v_{k+1} = βv_k + η∇f(x_k)
x_{k+1} = x_k − v_{k+1}
Terms:
- v_k: velocity (running average of gradients)
- β: momentum coefficient (e.g., 0.9)
- η: learning rate (step size), a small positive scalar
PyTorch implementation:
import torch

x = torch.tensor([5.0], requires_grad=True)
def f(x): return (x - 3) ** 2
v = torch.tensor([0.0])
lr = 0.1
beta = 0.9
for i in range(50):
    y = f(x)
    y.backward()
    with torch.no_grad():
        v = beta * v + lr * x.grad  # calculating momentum
        x -= v
    _ = x.grad.zero_()
    print(f"Result at step {i+1} : {x.item()}")
print('Momentum Result:', x.item())
OR
import torch

x = torch.nn.Parameter(torch.tensor([2.0]))
opt = torch.optim.SGD([x], lr=0.1, momentum=0.9)
for i in range(20):
    loss = (x - 3) ** 2
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(f"Result at step {i+1} : {x.item()}")
print("Momentum result:", x.item())
Nesterov Accelerated Gradient (NAG)
NAG improves upon momentum by computing the gradient after a lookahead step:
x_lookahead = x_k − βv_k
v_{k+1} = βv_k + η∇f(x_lookahead)
x_{k+1} = x_k − v_{k+1}
Terms:
- x_lookahead: lookahead point
- v_{k+1}: updated velocity
This anticipates the next step and can lead to faster convergence.
In PyTorch:
import torch

x = torch.tensor([0.0], requires_grad=True)
def f(x): return (x - 3) ** 2
v = torch.tensor([0.0])
lr = 0.1
beta = 0.9
for i in range(50):
    x_lookahead = x - beta * v
    x_lookahead.retain_grad()
    y = f(x_lookahead)
    y.backward()
    with torch.no_grad():
        v = beta * v + lr * x_lookahead.grad
        x -= v
    _ = x.grad.zero_()
    print(f"Result at step {i+1} : {x.item()}")
print('NAG Result:', x.item())
OR
import torch

x = torch.nn.Parameter(torch.tensor([2.0]))
opt = torch.optim.SGD([x], lr=0.1, momentum=0.9, nesterov=True)
for i in range(20):
    loss = (x - 3) ** 2
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(f"Result at step {i+1} : {x.item()}")
print("NAG result:", x.item())
AdaGrad
AdaGrad adapts the learning rate for each parameter by scaling inversely proportional to the square root of past squared gradients:
G_k = G_{k−1} + (∇f(x_k))²
x_{k+1} = x_k − (η / (√G_k + ϵ)) ∇f(x_k)
Terms:
- G_k: sum of past squared gradients (per dimension)
- ϵ: small constant to prevent divide-by-zero
PyTorch:
import torch

x = torch.tensor([0.0], requires_grad=True)
def f(x): return (x - 3) ** 2
G = torch.tensor([0.0])
eps = 1e-8
lr = 0.5
for i in range(50):
    y = f(x)
    y.backward()
    with torch.no_grad():
        G += x.grad ** 2
        x -= lr / (G.sqrt() + eps) * x.grad
    _ = x.grad.zero_()
    print(f"Result at step {i+1} : {x.item()}")
print('AdaGrad Result:', x.item())
OR

import torch

x = torch.nn.Parameter(torch.tensor([2.0]))
opt = torch.optim.Adagrad([x], lr=0.5)
for i in range(20):
    loss = (x - 3) ** 2
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(f"Result at step {i+1} : {x.item()}")
print("Adagrad result:", x.item())
RMSProp
RMSProp fixes AdaGrad's aggressively shrinking learning rate by using an exponentially weighted moving average of squared gradients:
G_k = βG_{k−1} + (1−β)(∇f(x_k))²
x_{k+1} = x_k − (η / (√G_k + ϵ)) ∇f(x_k)
Terms:
- β: decay factor (e.g. 0.9)
The second equation, i.e. the update rule, is the same as in AdaGrad.
PyTorch:
import torch

x = torch.tensor([0.0], requires_grad=True)
def f(x): return (x - 3) ** 2
G = torch.tensor([0.0])
gamma = 0.9  # plays the role of β in the formula above
lr = 0.1
eps = 1e-8
for i in range(50):
    y = f(x)
    y.backward()
    with torch.no_grad():
        G = gamma * G + (1 - gamma) * x.grad ** 2
        x -= lr / (G.sqrt() + eps) * x.grad
    _ = x.grad.zero_()
    print(f"Result at step {i+1} : {x.item()}")
print('RMSProp Result:', x.item())
OR
import torch

x = torch.nn.Parameter(torch.tensor([2.0]))
opt = torch.optim.RMSprop([x], lr=0.1, alpha=0.9)
for i in range(20):
    loss = (x - 3) ** 2
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(f"Result at step {i+1} : {x.item()}")
print("RMSProp result:", x.item())
Adam Optimizer
Adam, which stands for Adaptive Moment Estimation, combines Momentum and RMSProp with bias correction. It keeps track of both the mean and the uncentered variance of the gradients:
m_k = β_1 m_{k−1} + (1−β_1) ∇f(x_k)   (1st moment)
v_k = β_2 v_{k−1} + (1−β_2) (∇f(x_k))²   (2nd moment)
m'_k = m_k / (1−β_1^k)
v'_k = v_k / (1−β_2^k)
x_{k+1} = x_k − η m'_k / (√v'_k + ϵ)
Terms:
- m_k: exponential moving average of gradients (1st moment)
- v_k: exponential moving average of squared gradients (2nd moment)
- β_1, β_2: decay rates for the moment estimates (default: 0.9, 0.999)
- ϵ: small constant for numerical stability
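To see why the bias correction is needed: both moving averages start at zero, so at the first step m_1 = (1−β_1)∇f(x_1), which is shrunk heavily toward zero. Dividing by (1−β_1^1) = (1−β_1) gives m'_1 = ∇f(x_1), exactly undoing that initial bias; the same argument applies to v_k with β_2, and the correction fades as k grows because β^k → 0.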
PyTorch:
import torch

x = torch.tensor([0.0], requires_grad=True)
def f(x): return (x - 3) ** 2
m = torch.tensor([0.0])
v = torch.tensor([0.0])
beta1, beta2 = 0.9, 0.999
lr = 0.1
eps = 1e-8
for t in range(1, 51):
    y = f(x)
    y.backward()
    with torch.no_grad():
        g = x.grad
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (v_hat.sqrt() + eps)
    _ = x.grad.zero_()
    print(f"Result at step {t} : {x.item()}")
print('Adam Result:', x.item())
OR
import torch

x = torch.nn.Parameter(torch.tensor([2.0]))
opt = torch.optim.Adam([x], lr=0.1)
for i in range(20):
    loss = (x - 3) ** 2
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(f"Result at step {i+1} : {x.item()}")
print("Adam result:", x.item())
What We Didn't Cover: Second-Order and Constrained Optimization
There exist powerful methods in libraries and tools like SciPy, COIN-OR, IPOPT, cvxpy, Gurobi, etc., that go beyond first-order optimizers:
- Interior-Point Methods
- Trust-Region Methods
- Sequential Least Squares Programming (SLSQP)
- Newton-CG, BFGS, L-BFGS
These are useful when:
- You need exact Hessians (second derivatives)
- You have complex equality/inequality constraints
- The problem is small enough for expensive matrix operations
But these optimizers are not well suited to deep learning because:
- They don't scale easily to high dimensions or GPU workloads.
- PyTorch is mostly used with first-order optimizers in deep learning.
- They are best used via SciPy's scipy.optimize.minimize() and other packages and tools like cvxpy and IPOPT, although GPU support and the related speedup are limited.
Hence these methods are especially useful for small-to-medium-sized constrained optimization problems, or when second-order information (like Hessians) is cheap to compute.
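For completeness, here is a minimal sketch of a constrained problem in SciPy (assuming SciPy is installed; the objective and constraint are made up for illustration): minimize (x − 3)² + (y − 2)² subject to x + y ≤ 4.

import numpy as np
from scipy.optimize import minimize

def objective(z):
    x, y = z
    return (x - 3) ** 2 + (y - 2) ** 2

# Inequality constraints in SciPy are expressed as g(z) >= 0; here: 4 - (x + y) >= 0
constraints = [{"type": "ineq", "fun": lambda z: 4.0 - z[0] - z[1]}]

result = minimize(objective, x0=np.array([0.0, 0.0]), method="SLSQP",
                  constraints=constraints)
print(result.x, result.fun)  # the optimum lies on the boundary x + y = 4

Because the unconstrained minimum (3, 2) violates the constraint, the solver returns the closest feasible point on the line x + y = 4, roughly (2.5, 1.5).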
We explored optimization from first principles through a progression of ideas, each adding something new:
- Gradient Descent: basic local slope
- Momentum/Nesterov: smoothing with inertia
- AdaGrad/RMSProp: adaptive step sizes
- Adam: combines momentum and RMSProp with bias correction
Understanding each helps demystify why Adam became the go-to optimizer for deep learning.