MiCS: Near-linear Scaling for Training Gigantic Model on Public Cloud
Existing general purpose frameworks for gigantic model training, i.e., models with billions to trillions of parameters, cannot scale efficiently on public cloud environments due to large communication overheads. In this paper, we propose MiCS, which Minimizes the Communication Scale to bring down communication overhead. Specifically, by decreasing the number of participants in a communication collective, MiCS can utilize existing heterogeneous network bandwidth on the cloud, reduce network traffic over slower links, and amortize expensive global gradient synchronization overheads. Our evaluation on AWS shows that the system throughput of MiCS is up to 2.89× that of the state-of-the-art large model training systems. MiCS achieves near-linear scaling efficiency, which is up to 1.27× that of DeepSpeed. MiCS allows us to train a proprietary model with 100 billion parameters on 512 GPUs with 99.4 theoretical computation power of each GPU on a public cloud with less GPU memory and more restricted networks than DGX-A100 clusters.
READ FULL TEXT