The Adams Family: Overview of Gradient Optimizers
Gradient-based iterative optimization has been the cornerstone of modern machine learning and has been studied for decades. The dominant optimizer in this field is currently Adam (Kingma and Ba 2014), which is used across virtually every task, model, and dataset. Since its inception, many improvements have been made to the original Adam optimizer, resulting in variants that are computationally cheaper, consume less memory, and/or converge faster. This post briefly discusses some of the well-known improvements and variations of Adam.
Optimization and Deep Learning are inseparable. In fact, Deep Learning would not be possible without the giants who made significant contributions to optimization research. Gradient Descent and co. have been the standard optimization algorithms used to train models of all sizes, shapes and dynamics. It is the simplicity of the idea of gradient descent as well as its impressive results that have contributed to its success.
While plain gradient descent is good enough for small models and datasets, its slow convergence and lack of robustness around local minima leave much to be desired. Enter Adam: it converges quickly, its update mechanism is robust, and the algorithm is simple enough to be optimised for heavy-duty distributed training on large models and big datasets.
Adam has been the industry-standard optimizer for deep-learning tasks for almost a decade. While it might seem that little progress has been made on optimizers in industrial practice, academia has proposed a wide variety of variants that improve on the original Adam.
Notation
Symbol | Meaning |
--- | --- |
w | Weights in the weight matrix |
grad | Gradients of the parameters in the gradient matrix |
lr (alpha) | Learning rate, or step size of the update |
Basics
Stochastic Gradient Descent
Stochastic gradient descent (SGD), in practice usually run in its mini-batch form, is an optimization algorithm that is commonly used in machine learning. It is an iterative method that minimizes an objective function by updating the parameters of the model in the direction of the negative gradient of the objective function with respect to those parameters. The objective function is typically a sum of many individual loss terms, where each term corresponds to a single training example.
The standard form of SGD updates the model parameters using the following formulation:
def SGD(w, grad, lr):
    # Take a step against the gradient, scaled by the learning rate
    w = w - lr * grad
    return w
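As a toy illustration (not from the original post), the update above can be iterated to minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3):
w = 0.0
for step in range(100):
    grad = 2 * (w - 3)            # gradient of f(w) = (w - 3)^2
    w = SGD(w, grad, lr=0.1)
print(w)                          # approaches the minimizer w = 3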
Momentum
Momentum is a modification of SGD that helps to accelerate the convergence of the optimization algorithm. It accomplishes this by adding a fraction of the previous update to the current update:
v = beta * v + (1 - beta) * gradient
w = w - alpha * v
where v is the momentum vector, and beta is the momentum parameter. By adding the momentum vector to the current update, the optimization algorithm is able to move more quickly through regions of the parameter space with a consistent gradient.
RMSProp
RMSProp is another modification of SGD that helps to accelerate the convergence of the optimization algorithm. It accomplishes this by normalizing the gradient by a moving average of its magnitude:
s = beta * s + (1 - beta) * gradient^2
w = w - alpha * gradient / sqrt(s + epsilon)
where s is the moving average of the squared gradient, beta is the decay rate, and epsilon is a small constant to prevent division by zero. By normalizing the gradient in this way, RMSProp is able to take larger steps in directions with a consistent gradient and smaller steps in directions with a noisy or inconsistent gradient.
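A similar sketch for RMSProp, again with illustrative names and defaults; s starts at zero:
import numpy as np

def rmsprop_step(w, s, gradient, alpha=0.001, beta=0.9, epsilon=1e-8):
    # Moving average of the squared gradient tracks the recent gradient magnitude
    s = beta * s + (1 - beta) * gradient ** 2
    # Divide by its square root so noisy directions get smaller steps
    w = w - alpha * gradient / np.sqrt(s + epsilon)
    return w, s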
The Adams
The Adams are a family of optimization algorithms that build on the basic ideas of SGD, momentum, and RMSProp. The most well-known members of the Adams family are Adam, AdaMax, AdamW, AMSGrad, RAdam, NAdam, AdamP, Adan, and QHAdam, all of which are covered below.
Adam
Adam is an optimization algorithm that combines the ideas of momentum and RMSProp. It accomplishes this by computing a moving average of the gradient and the squared gradient, and using these moving averages to update the model parameters:
m = beta1 * m + (1 - beta1) * gradient
v = beta2 * v + (1 - beta2) * gradient^2
w = w - alpha * m / (sqrt(v) + epsilon)
where m is the moving average of the gradient, v is the moving average of the squared gradient, beta1 and beta2 are the exponential decay rates for the moving averages, and epsilon is a small constant to prevent division by zero. (The original algorithm also applies bias corrections, dividing m by 1 - beta1^t and v by 1 - beta2^t; the simplified updates in this post omit them.)
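For completeness, here is a sketch of the full Adam step as described in the paper, with the bias corrections included; t is the 1-based step count, and the function name and defaults are my own:
import numpy as np

def adam_step(w, m, v, gradient, t, alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    # Moving averages of the gradient and the squared gradient
    m = beta1 * m + (1 - beta1) * gradient
    v = beta2 * v + (1 - beta2) * gradient ** 2
    # Bias corrections counteract the zero initialization of m and v
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + epsilon)
    return w, m, v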
AdaMax
AdaMax is a modification of Adam that replaces the L2-norm-based second-moment estimate with the L-infinity norm of the gradient:
m = beta1 * m + (1 - beta1) * gradient
u = max(beta2 * u, abs(gradient))
w = w - alpha * m / (u + epsilon)
where u is the exponentially weighted infinity norm of the gradients (a decayed running maximum of their absolute values).
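A corresponding sketch for AdaMax, matching the simplified update above (illustrative names and defaults, no bias correction):
import numpy as np

def adamax_step(w, m, u, gradient, alpha=0.002, beta1=0.9, beta2=0.999, epsilon=1e-8):
    m = beta1 * m + (1 - beta1) * gradient
    # Decayed running maximum of the gradient magnitudes (the infinity norm)
    u = np.maximum(beta2 * u, np.abs(gradient))
    w = w - alpha * m / (u + epsilon)
    return w, m, u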
AdamW
AdamW is a modification of Adam that decouples weight decay from the gradient-based update, applying it directly to the weights instead of folding it into the gradient as L2 regularization:
m = beta1 * m + (1 - beta1) * gradient
v = beta2 * v + (1 - beta2) * gradient^2
w = w - alpha * (m / (sqrt(v) + epsilon) + weight_decay * w)
where weight_decay is the weight decay parameter.
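To make the decoupling concrete, here is a sketch of an AdamW step; the comment marks where plain L2 regularization would act instead (names and defaults are illustrative):
import numpy as np

def adamw_step(w, m, v, gradient, alpha=0.001, beta1=0.9, beta2=0.999,
               epsilon=1e-8, weight_decay=0.01):
    # Plain L2 regularization would add weight_decay * w to `gradient` here,
    # letting the decay flow through the moment estimates
    m = beta1 * m + (1 - beta1) * gradient
    v = beta2 * v + (1 - beta2) * gradient ** 2
    # Decoupled weight decay: applied directly to w, outside the adaptive scaling
    w = w - alpha * (m / (np.sqrt(v) + epsilon) + weight_decay * w)
    return w, m, v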
AMSGrad
AMSGrad is a modification of Adam that prevents the second-moment estimate used in the update from decreasing over time:
m = beta1 * m + (1 - beta1) * gradient
v = beta2 * v + (1 - beta2) * gradient^2
v_hat = max(v_hat, v)
w = w - alpha * m / (sqrt(v_hat) + epsilon)
where v_hat is the running maximum of the moving average of the squared gradient.
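And a sketch of an AMSGrad step; v_hat starts at zero and only ever grows (names and defaults are illustrative):
import numpy as np

def amsgrad_step(w, m, v, v_hat, gradient, alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    m = beta1 * m + (1 - beta1) * gradient
    v = beta2 * v + (1 - beta2) * gradient ** 2
    # Keep the largest second-moment estimate seen so far
    v_hat = np.maximum(v_hat, v)
    w = w - alpha * m / (np.sqrt(v_hat) + epsilon)
    return w, m, v, v_hat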
RAdam
RAdam is a modification of Adam that uses a dynamic rectified update rule:
m = beta1 * m + (1 - beta1) * gradient
v = beta2 * v + (1 - beta2) * gradient^2
rho_inf = 2 / (1 - beta2) - 1
rho_t = rho_inf - 2 * t * beta2^t / (1 - beta2^t)
r_t = sqrt((rho_t - 4) * (rho_t - 2) * rho_inf / ((rho_inf - 4) * (rho_inf - 2) * rho_t))
w = w - alpha * r_t * m / (sqrt(v) + epsilon)
where rho_inf is the maximum length of the approximated simple moving average (SMA), rho_t is the SMA length at step t, and r_t is a rectification term that corrects for the variance of the adaptive learning rate. When rho_t <= 4 the variance is intractable, and RAdam falls back to a plain momentum (SGD-style) update for that step.
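A sketch of a full RAdam step with that fallback made explicit; bias correction of m is included here because the rectification is derived with it (names and defaults are illustrative):
import numpy as np

def radam_step(w, m, v, gradient, t, alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    m = beta1 * m + (1 - beta1) * gradient
    v = beta2 * v + (1 - beta2) * gradient ** 2
    m_hat = m / (1 - beta1 ** t)                       # bias-corrected momentum
    rho_inf = 2 / (1 - beta2) - 1
    rho_t = rho_inf - 2 * t * beta2 ** t / (1 - beta2 ** t)
    if rho_t > 4:
        # Variance of the adaptive learning rate is tractable: rectify and use it
        r_t = ((rho_t - 4) * (rho_t - 2) * rho_inf /
               ((rho_inf - 4) * (rho_inf - 2) * rho_t)) ** 0.5
        v_hat = np.sqrt(v / (1 - beta2 ** t))
        w = w - alpha * r_t * m_hat / (v_hat + epsilon)
    else:
        # Too early in training: fall back to a plain momentum (SGD-style) update
        w = w - alpha * m_hat
    return w, m, v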
NAdam
NAdam is a modification of Adam that uses a combination of the Nesterov momentum and Adam update rules:
m = beta1 * m + (1 - beta1) * gradient
v = beta2 * v + (1 - beta2) * gradient^2
m_hat = beta1 * m + (1 - beta1) * gradient
w = w - alpha * m_hat / (sqrt(v) + epsilon)
where m_hat is a Nesterov-style look-ahead estimate of the momentum at the next time step.
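For reference, a sketch of NAdam with bias corrections, roughly following Dozat's formulation and ignoring the momentum-decay schedule that some implementations add (names and defaults are illustrative):
import numpy as np

def nadam_step(w, m, v, gradient, t, alpha=0.002, beta1=0.9, beta2=0.999, epsilon=1e-8):
    m = beta1 * m + (1 - beta1) * gradient
    v = beta2 * v + (1 - beta2) * gradient ** 2
    # Nesterov look-ahead: weight the updated momentum one step ahead and add a
    # bias-corrected contribution from the current gradient
    m_hat = (beta1 * m / (1 - beta1 ** (t + 1)) +
             (1 - beta1) * gradient / (1 - beta1 ** t))
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + epsilon)
    return w, m, v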
AdamP
AdamP is a modification of Adam aimed at scale-invariant weights (for example, weights immediately followed by batch or layer normalization). For such weights the Adam update keeps growing the weight norm, which shrinks the effective step size over training; AdamP counteracts this by projecting the radial component out of the update. A simplified sketch:
m = beta1 * m + (1 - beta1) * gradient
v = beta2 * v + (1 - beta2) * gradient^2
step = m / (sqrt(v) + epsilon)
if cos_similarity(w, gradient) < delta:
    step = step - (dot(w, step) / dot(w, w)) * w
w = w - alpha * step
where delta is a threshold used to detect scale-invariant weights and the projection removes the component of the update parallel to w; the full algorithm also scales weight decay by a wd_ratio factor for projected parameters.
Adan
Adan (Adaptive Nesterov momentum) is a more recent member of the family that estimates Nesterov momentum from gradient differences and uses decoupled weight decay. A simplified sketch, written in this post's convention of weighting the previous average by beta (the Adan paper writes the decay factors the other way around):
m = beta1 * m + (1 - beta1) * gradient
d = beta2 * d + (1 - beta2) * (gradient - prev_gradient)
v = beta3 * v + (1 - beta3) * (gradient + beta2 * (gradient - prev_gradient))^2
w = (w - alpha * (m + beta2 * d) / (sqrt(v) + epsilon)) / (1 + alpha * weight_decay)
where d is a moving average of the difference between consecutive gradients, v is a moving average of the squared Nesterov-corrected gradient, and weight_decay is applied in decoupled form by shrinking the weights after the step.
QHAdam
QHAdam is a variant of Adam proposed by Ma and Yarats (2019) in their paper "Quasi-Hyperbolic Momentum and Adam for Deep Learning". QHAdam replaces Adam's moment estimates in the update with quasi-hyperbolic terms: weighted averages of the raw gradient (and squared gradient) and their exponential moving averages, controlled by two extra hyperparameters nu1 and nu2. This lets QHAdam interpolate between Adam and updates that rely more on the instantaneous gradient, and the authors report improvements over Adam and related optimizers on a range of deep learning tasks.
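Since this is the only variant above without pseudocode, here is a hedged sketch of the QHAdam update; with nu1 = nu2 = 1 it reduces to Adam (names and defaults are illustrative):
import numpy as np

def qhadam_step(w, m, v, gradient, t, alpha=0.001, beta1=0.9, beta2=0.999,
                nu1=0.7, nu2=1.0, epsilon=1e-8):
    m = beta1 * m + (1 - beta1) * gradient
    v = beta2 * v + (1 - beta2) * gradient ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Quasi-hyperbolic mixing of the instantaneous gradient and the moving averages
    update = (1 - nu1) * gradient + nu1 * m_hat
    denom = np.sqrt((1 - nu2) * gradient ** 2 + nu2 * v_hat) + epsilon
    w = w - alpha * update / denom
    return w, m, v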
Citation
This post can be cited as:
Minhas, Bhavnick. (Mar 2023). The Adams Family: Overview of Gradient Optimizers. B’Log. https://blog.bhavnicksm.com/the-adam-family .
Or use the following BibTeX:
@article{minhas2023adamsfam,
title = "The Adams Family: Overview of Gradient Optimizers",
author = "Minhas, Bhavnick",
journal = "B'Log",
year = "2023",
month = "Mar",
url = "https://blog.bhavnicksm.com/the-adam-family"
}
References
- Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.