The Adams Family: Overview of Gradient Optimizers
Gradient-based iterative optimization has been the cornerstone of modern machine learning and has been studied for decades. The dominant optimizer in this field is currently Adam (Kingma and Ba 2014), which is used across virtually every task, model, and dataset. Since its inception, many improvements have been made to the original Adam optimizer, resulting in variants that are computationally cheaper, consume less memory, and/or converge faster. This post briefly discusses some of the well-known improvements and variations of Adam.
Optimization and Deep Learning are inseparable. In fact, Deep Learning would not be possible without the giants who made significant contributions to optimization research. Gradient Descent and co. have been the standard optimization algorithms used to train models of all sizes, shapes and dynamics. It is the simplicity of the idea of gradient descent as well as its impressive results that have contributed to its success.
While plain gradient descent is good enough for small models and datasets, its slow convergence and lack of robustness around local minima leave much to be desired. Enter Adam: it converges quickly, its update mechanism is robust, and the algorithm is simple enough to be optimised for heavy-duty distributed training on large models and big datasets.
Adam has been the industry-standard optimizer for deep-learning tasks for almost a decade. While it might seem that little progress has been made on optimizers in industrial practice, academia has proposed a wide variety of variants that improve on the original Adam.
Notation
Symbol | Meaning |
--- | --- |
w | Weights in the weight matrix |
grad | Gradients of the parameters in the gradient matrix |
lr (alpha) | Learning rate, or step size of the update |
Basics
Stochastic Gradient Descent
Stochastic gradient descent (SGD), in practice usually run in its mini-batch form, is an optimization algorithm that is commonly used in machine learning. It is an iterative method that minimizes an objective function by updating the parameters of the model in the direction of the negative gradient of the objective function with respect to those parameters. The objective function is typically a sum of many individual loss terms, where each term corresponds to a single training example.
The standard form of SGD updates the model parameters using the following formulation:
def SGD(w, grad, lr):
    # Take a step against the gradient, scaled by the learning rate
    w = w - lr * grad
    return w
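As a toy illustration (not from the original post), the update above can be iterated to minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3):
w = 0.0
for step in range(100):
    grad = 2 * (w - 3)            # gradient of f(w) = (w - 3)^2
    w = SGD(w, grad, lr=0.1)
print(w)                          # approaches the minimizer w = 3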
Momentum
Momentum is a modification of SGD that helps to accelerate the convergence of the optimization algorithm. It accomplishes this by adding a fraction of the previous update to the current update:
v = beta * v + (1 - beta) * gradient
w = w - alpha * v
where v is the momentum vector, and beta is the momentum parameter. By adding the momentum vector to the current update, the optimization algorithm is able to move more quickly through regions of the parameter space with a consistent gradient.
RMSProp
RMSProp is another modification of SGD that helps to accelerate the convergence of the optimization algorithm. It accomplishes this by normalizing the gradient by a moving average of its magnitude:
s = beta * s + (1 - beta) * gradient^2
w = w - alpha * gradient / sqrt(s + epsilon)
where s is the moving average of the squared gradient, beta is the decay rate, and epsilon is a small constant to prevent division by zero. By normalizing the gradient in this way, RMSProp is able to take larger steps in directions with a consistent gradient and smaller steps in directions with a noisy or inconsistent gradient.
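A similar sketch for RMSProp, again with illustrative names and defaults; s starts at zero:
import numpy as np

def rmsprop_step(w, s, gradient, alpha=0.001, beta=0.9, epsilon=1e-8):
    # Moving average of the squared gradient tracks the recent gradient magnitude
    s = beta * s + (1 - beta) * gradient ** 2
    # Divide by its square root so noisy directions get smaller steps
    w = w - alpha * gradient / np.sqrt(s + epsilon)
    return w, s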
The Adams
The Adams are a family of optimization algorithms that build on the basic ideas of SGD, momentum, and RMSProp. The most well-known members of the Adams family are Adam, AdaMax, AdamW, AMSGrad, RAdam, NAdam, AdamP, Adan, and QHAdam, all of which are covered below.
Adam
Adam is an optimization algorithm that combines the ideas of momentum and RMSProp. It accomplishes this by computing a moving average of the gradient and the squared gradient, and using these moving averages to update the model parameters:
m = beta1 * m + (1 - beta1) * gradient
v = beta2 * v + (1 - beta2) * gradient^2
w = w - alpha * m / (sqrt(v) + epsilon)
where m is the moving average of the gradient, v is the moving average of the squared gradient, beta1 and beta2 are the exponential decay rates for the moving averages, and epsilon is a small constant to prevent division by zero. (The original algorithm also applies bias corrections, dividing m by 1 - beta1^t and v by 1 - beta2^t; the simplified updates in this post omit them.)
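For completeness, here is a sketch of the full Adam step as described in the paper, with the bias corrections included; t is the 1-based step count, and the function name and defaults are my own:
import numpy as np

def adam_step(w, m, v, gradient, t, alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    # Moving averages of the gradient and the squared gradient
    m = beta1 * m + (1 - beta1) * gradient
    v = beta2 * v + (1 - beta2) * gradient ** 2
    # Bias corrections counteract the zero initialization of m and v
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + epsilon)
    return w, m, v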
AdaMax
AdaMax is a modification of Adam that replaces the L2-norm-based second-moment estimate with the L-infinity norm of the gradient:
m = beta1 * m + (1 - beta1) * gradient
u = max(beta2 * u, abs(gradient))
w = w - alpha * m / (u + epsilon)
where u is the exponentially weighted infinity norm of the gradients (a decayed running maximum of their absolute values).
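A corresponding sketch for AdaMax, matching the simplified update above (illustrative names and defaults, no bias correction):
import numpy as np

def adamax_step(w, m, u, gradient, alpha=0.002, beta1=0.9, beta2=0.999, epsilon=1e-8):
    m = beta1 * m + (1 - beta1) * gradient
    # Decayed running maximum of the gradient magnitudes (the infinity norm)
    u = np.maximum(beta2 * u, np.abs(gradient))
    w = w - alpha * m / (u + epsilon)
    return w, m, u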
AdamW
AdamW is a modification of Adam that decouples weight decay from the gradient-based update, applying it directly to the weights instead of folding it into the gradient as L2 regularization:
m = beta1 * m + (1 - beta1) * gradient
v = beta2 * v + (1 - beta2) * gradient^2
w = w - alpha * (m / (sqrt(v) + epsilon) + weight_decay * w)
where weight_decay is the weight decay parameter.
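To make the decoupling concrete, here is a sketch of an AdamW step; the comment marks where plain L2 regularization would act instead (names and defaults are illustrative):
import numpy as np

def adamw_step(w, m, v, gradient, alpha=0.001, beta1=0.9, beta2=0.999,
               epsilon=1e-8, weight_decay=0.01):
    # Plain L2 regularization would add weight_decay * w to `gradient` here,
    # letting the decay flow through the moment estimates
    m = beta1 * m + (1 - beta1) * gradient
    v = beta2 * v + (1 - beta2) * gradient ** 2
    # Decoupled weight decay: applied directly to w, outside the adaptive scaling
    w = w - alpha * (m / (np.sqrt(v) + epsilon) + weight_decay * w)
    return w, m, v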
AMSGrad
AMSGrad is a modification of Adam that prevents the second-moment estimate used in the update from decreasing over time:
m = beta1 * m + (1 - beta1) * gradient
v = beta2 * v + (1 - beta2) * gradient^2
v_hat = max(v_hat, v)
w = w - alpha * m / (sqrt(v_hat) + epsilon)
where v_hat is the running maximum of the moving average of the squared gradient.
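And a sketch of an AMSGrad step; v_hat starts at zero and only ever grows (names and defaults are illustrative):
import numpy as np

def amsgrad_step(w, m, v, v_hat, gradient, alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    m = beta1 * m + (1 - beta1) * gradient
    v = beta2 * v + (1 - beta2) * gradient ** 2
    # Keep the largest second-moment estimate seen so far
    v_hat = np.maximum(v_hat, v)
    w = w - alpha * m / (np.sqrt(v_hat) + epsilon)
    return w, m, v, v_hat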
RAdam
RAdam is a modification of Adam that uses a dynamic rectified update rule:
m = beta1 * m + (1 - beta1) * gradient
v = beta2 * v + (1 - beta2) * gradient^2
rho_inf = 2 / (1 - beta2) - 1
rho_t = rho_inf - 2 * t * beta2^t / (1 - beta2^t)
r_t = sqrt((rho_t - 4) * (rho_t - 2) * rho_inf / ((rho_inf - 4) * (rho_inf - 2) * rho_t))
w = w - alpha * r_t * m / (sqrt(v) + epsilon)
where rho_inf is the maximum length of the approximated simple moving average (SMA), rho_t is the SMA length at step t, and r_t is a rectification term that corrects for the variance of the adaptive learning rate. When rho_t <= 4 the variance is intractable, and RAdam falls back to a plain momentum (SGD-style) update for that step.
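A sketch of a full RAdam step with that fallback made explicit; bias correction of m is included here because the rectification is derived with it (names and defaults are illustrative):
import numpy as np

def radam_step(w, m, v, gradient, t, alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    m = beta1 * m + (1 - beta1) * gradient
    v = beta2 * v + (1 - beta2) * gradient ** 2
    m_hat = m / (1 - beta1 ** t)                       # bias-corrected momentum
    rho_inf = 2 / (1 - beta2) - 1
    rho_t = rho_inf - 2 * t * beta2 ** t / (1 - beta2 ** t)
    if rho_t > 4:
        # Variance of the adaptive learning rate is tractable: rectify and use it
        r_t = ((rho_t - 4) * (rho_t - 2) * rho_inf /
               ((rho_inf - 4) * (rho_inf - 2) * rho_t)) ** 0.5
        v_hat = np.sqrt(v / (1 - beta2 ** t))
        w = w - alpha * r_t * m_hat / (v_hat + epsilon)
    else:
        # Too early in training: fall back to a plain momentum (SGD-style) update
        w = w - alpha * m_hat
    return w, m, v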
NAdam
NAdam is a modification of Adam that uses a combination of the Nesterov momentum and Adam update rules:
m = beta1 * m + (1 - beta1) * gradient
v = beta2 * v + (1 - beta2) * gradient^2
m_hat = beta1 * m + (1 - beta1) * gradient
w = w - alpha * m_hat / (sqrt(v) + epsilon)
where m_hat is a Nesterov-style look-ahead estimate of the momentum at the next time step.
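For reference, a sketch of NAdam with bias corrections, roughly following Dozat's formulation and ignoring the momentum-decay schedule that some implementations add (names and defaults are illustrative):
import numpy as np

def nadam_step(w, m, v, gradient, t, alpha=0.002, beta1=0.9, beta2=0.999, epsilon=1e-8):
    m = beta1 * m + (1 - beta1) * gradient
    v = beta2 * v + (1 - beta2) * gradient ** 2
    # Nesterov look-ahead: weight the updated momentum one step ahead and add a
    # bias-corrected contribution from the current gradient
    m_hat = (beta1 * m / (1 - beta1 ** (t + 1)) +
             (1 - beta1) * gradient / (1 - beta1 ** t))
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + epsilon)
    return w, m, v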
AdamP
AdamP is a modification of Adam aimed at scale-invariant weights (for example, weights immediately followed by batch or layer normalization). For such weights the Adam update keeps growing the weight norm, which shrinks the effective step size over training; AdamP counteracts this by projecting the radial component out of the update. A simplified sketch:
m = beta1 * m + (1 - beta1) * gradient
v = beta2 * v + (1 - beta2) * gradient^2
step = m / (sqrt(v) + epsilon)
if cos_similarity(w, gradient) < delta:
    step = step - (dot(w, step) / dot(w, w)) * w
w = w - alpha * step
where delta is a threshold used to detect scale-invariant weights and the projection removes the component of the update parallel to w; the full algorithm also scales weight decay by a wd_ratio factor for projected parameters.
Adan
Adan (Adaptive Nesterov momentum) is a more recent member of the family that estimates Nesterov momentum from gradient differences and uses decoupled weight decay. A simplified sketch, written in this post's convention of weighting the previous average by beta (the Adan paper writes the decay factors the other way around):
m = beta1 * m + (1 - beta1) * gradient
d = beta2 * d + (1 - beta2) * (gradient - prev_gradient)
v = beta3 * v + (1 - beta3) * (gradient + beta2 * (gradient - prev_gradient))^2
w = (w - alpha * (m + beta2 * d) / (sqrt(v) + epsilon)) / (1 + alpha * weight_decay)
where d is a moving average of the difference between consecutive gradients, v is a moving average of the squared Nesterov-corrected gradient, and weight_decay is applied in decoupled form by shrinking the weights after the step.
QHAdam
QHAdam is a variant of Adam proposed by Ma and Yarats (2019) in their paper "Quasi-Hyperbolic Momentum and Adam for Deep Learning". QHAdam replaces Adam's moment estimates in the update with quasi-hyperbolic terms: weighted averages of the raw gradient (and squared gradient) and their exponential moving averages, controlled by two extra hyperparameters nu1 and nu2. This lets QHAdam interpolate between Adam and updates that rely more on the instantaneous gradient, and the authors report improvements over Adam and related optimizers on a range of deep learning tasks.
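Since this is the only variant above without pseudocode, here is a hedged sketch of the QHAdam update; with nu1 = nu2 = 1 it reduces to Adam (names and defaults are illustrative):
import numpy as np

def qhadam_step(w, m, v, gradient, t, alpha=0.001, beta1=0.9, beta2=0.999,
                nu1=0.7, nu2=1.0, epsilon=1e-8):
    m = beta1 * m + (1 - beta1) * gradient
    v = beta2 * v + (1 - beta2) * gradient ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Quasi-hyperbolic mixing of the instantaneous gradient and the moving averages
    update = (1 - nu1) * gradient + nu1 * m_hat
    denom = np.sqrt((1 - nu2) * gradient ** 2 + nu2 * v_hat) + epsilon
    w = w - alpha * update / denom
    return w, m, v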
Citation
This post can be cited as:
Minhas, Bhavnick. (Mar 2023). The Adams Family: Overview of Gradient Optimizers. B’Log. https://blog.bhavnicksm.com/the-adam-family .
Or use the following BibTeX:
@article{minhas2023adamsfam,
title = "The Adams Family: Overview of Gradient Optimizers",
author = "Minhas, Bhavnick",
journal = "B'Log",
year = "2023",
month = "Mar",
url = "https://blog.bhavnicksm.com/the-adam-family"
}
References
- Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.