ZeRO: Memory Optimization for Distributed AI

Introduction to ZeRO

ZeRO (Zero Redundancy Optimizer) is a memory optimization strategy for data-parallel training. Instead of keeping a full copy of the training state on every GPU, it partitions that state across GPUs, so larger models can be trained on the same hardware. In this article, we will explore how ZeRO works and what it offers.

Understanding ZeRO

ZeRO has three levels, often called stages, and each one partitions more of the training state than the last. ZeRO-1 partitions only the optimizer states. ZeRO-2 partitions both optimizer states and gradients. ZeRO-3 partitions optimizer states, gradients, and model parameters.

Memory Problem in DDP

In Distributed Data Parallel (DDP) training, every GPU holds a complete copy of the model parameters, gradients, and optimizer states. Adding GPUs speeds up training but does nothing to lower the memory needed per GPU, so for large models most of the VRAM goes to state that is duplicated on every device.

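For example, a 7B-parameter model trained with Adam in FP32 needs 16 bytes per parameter: 4 for the weight, 4 for the gradient, and 8 for Adam's two moment estimates. That comes to 7B × 16 bytes = 112 GB per GPU in DDP, before counting activations. ZeRO reduces this by partitioning the optimizer states, gradients, and model parameters across GPUs.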
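The 112 GB figure can be reproduced with a few lines of arithmetic. The sketch below is only illustrative: the function name `ddp_model_state_gb` and its parameters are made up for this example, and it assumes FP32 weights and gradients (4 bytes each) plus Adam's two FP32 moment estimates (8 bytes). Activations and temporary buffers are not counted.

```python
def ddp_model_state_gb(num_params, bytes_param=4, bytes_grad=4, bytes_optim=8):
    """Per-GPU memory (in GB) for model state under plain DDP.

    Assumes every GPU stores the full weights, gradients, and optimizer
    states. Activations and temporary buffers are ignored.
    """
    bytes_per_param = bytes_param + bytes_grad + bytes_optim
    return num_params * bytes_per_param / 1e9


# 7B parameters, Adam, FP32: 16 bytes per parameter.
print(f"{ddp_model_state_gb(7e9):.1f} GB")  # 112.0 GB
```

Note that the result does not depend on the number of GPUs, which is exactly the redundancy ZeRO targets.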

How ZeRO Works

Let’s take a look at how ZeRO-1 and ZeRO-2 work. In ZeRO-1, each GPU holds the full model parameters and gradients but stores only 1/N of the optimizer states, where N is the number of GPUs. In ZeRO-2, the gradients are partitioned as well, so each GPU keeps only 1/N of the optimizer states and 1/N of the gradients.
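To see what this partitioning means in numbers, here is a rough per-GPU estimate for each level. It is a sketch under stated assumptions, not a measurement: the function name `zero_model_state_gb` is hypothetical, the byte sizes assume FP32 with Adam, the 7B model on 8 GPUs is just an example configuration, and activations, buffers, and fragmentation are ignored. Level 0 stands for plain DDP.

```python
def zero_model_state_gb(num_params, num_gpus, stage,
                        bytes_param=4, bytes_grad=4, bytes_optim=8):
    """Estimate per-GPU model-state memory (GB) for a given ZeRO level.

    stage 0: nothing partitioned (plain DDP)
    stage 1: optimizer states partitioned
    stage 2: optimizer states and gradients partitioned
    stage 3: optimizer states, gradients, and parameters partitioned
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")

    params = num_params * bytes_param
    grads = num_params * bytes_grad
    optim = num_params * bytes_optim

    # Each level divides one more component across the GPUs.
    if stage >= 1:
        optim /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        params /= num_gpus

    return (params + grads + optim) / 1e9


for stage in range(4):
    gb = zero_model_state_gb(7e9, num_gpus=8, stage=stage)
    print(f"stage {stage}: {gb:.1f} GB per GPU")
# stage 0: 112.0 GB per GPU
# stage 1: 63.0 GB per GPU
# stage 2: 38.5 GB per GPU
# stage 3: 14.0 GB per GPU
```

Under these assumptions, most of the ZeRO-1 saving comes from the optimizer states, because with Adam they account for half of the 16 bytes per parameter.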

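ZeRO-2 uses reduce-scatter instead of all-reduce for gradients, so each GPU ends up with only the summed gradients for its own partition, which is all it needs to update its slice of the optimizer states. This saves gradient memory on top of the optimizer-state savings of ZeRO-1. Total communication volume stays roughly the same as in DDP: the reduce-scatter moves about half the data of an all-reduce, but the updated parameters must then be all-gathered so that every GPU has the full model again.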
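The reduce-scatter step can be illustrated without any GPUs. The following is a single-process simulation in plain Python, not a real collective: the function name `simulate_reduce_scatter` is made up for this sketch, the gradient values are arbitrary, and it assumes the gradient length divides evenly by the number of ranks. In an actual training job this would be a distributed collective such as PyTorch's `torch.distributed.reduce_scatter`.

```python
def simulate_reduce_scatter(per_rank_grads):
    """Simulate a sum reduce-scatter across ranks in one process.

    per_rank_grads[r] is the full local gradient (list of floats) that
    rank r computed on its own mini-batch. The return value holds one
    list per rank: the summed gradient for that rank's partition only.
    """
    world_size = len(per_rank_grads)
    grad_len = len(per_rank_grads[0])
    if grad_len % world_size != 0:
        raise ValueError("this sketch needs an evenly divisible gradient")

    shard_len = grad_len // world_size
    shards = []
    for rank in range(world_size):
        start, end = rank * shard_len, (rank + 1) * shard_len
        # Sum this partition's entries over all ranks; rank keeps only this.
        shards.append([sum(g[i] for g in per_rank_grads)
                       for i in range(start, end)])
    return shards


local_grads = [
    [1.0, 2.0, 3.0, 4.0],      # full gradient computed on rank 0
    [10.0, 20.0, 30.0, 40.0],  # full gradient computed on rank 1
]
print(simulate_reduce_scatter(local_grads))
# [[11.0, 22.0], [33.0, 44.0]]
```

Rank 0 ends up holding only the summed gradients for the first half of the parameters and rank 1 only those for the second half, whereas an all-reduce would leave the full summed gradient `[11.0, 22.0, 33.0, 44.0]` on both ranks.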

Benefits of ZeRO

The main benefit of ZeRO is lower per-GPU memory for model state, and the savings grow with the number of GPUs, because each partitioned component shrinks to 1/N of its original size. By eliminating redundancy across GPUs, ZeRO makes it possible to train larger models on the same hardware, which makes it a valuable tool for distributed AI.