The US government's Frontier supercomputer tests out training a trillion-parameter AI model
Researchers with Oak Ridge National Laboratory and the Université Paris-Saclay have experimented with training large-scale language models on the world’s most powerful publicly disclosed supercomputer, Oak Ridge’s ‘Frontier’ system. The results show that a) the US government is able to carry out a non-trivial training run, and b) the US government has a long way to go before its supercomputers do things at the same scale as private companies.
Here, the researchers work through the practicalities of training large language models of 22B, 175B, and 1 trillion parameters. The idea is to understand what it takes to train LLMs efficiently at this scale, and also to identify the particular difficulties of using the Frontier supercomputer, which runs on AMD (MI250X) GPUs rather than NVIDIA GPUs. After some hyperparameter tuning and analysis, they figured out stable settings for training 22 billion and 175 billion parameter models. Once they had those, they “finally trained a trillion parameter model”, though only for a few steps.
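To make the scaling concrete, here's a minimal sketch of the arithmetic behind 3D parallelism (tensor × pipeline × data), the scheme Megatron-DeepSpeed uses. The TP/PP splits below are hypothetical illustrations, not the settings reported in the paper; only the total GPU counts come from the write-up.

```python
# Hedged sketch: how a 3D-parallel training job decomposes a GPU count into
# tensor-parallel (TP), pipeline-parallel (PP), and data-parallel (DP) groups.
# The TP/PP values are hypothetical; only the GPU totals are from the write-up.

def data_parallel_degree(world_size: int, tp: int, pp: int) -> int:
    """The GPU count must factor exactly as tp * pp * dp."""
    assert world_size % (tp * pp) == 0, "GPU count must be divisible by tp * pp"
    return world_size // (tp * pp)

for world_size, tp, pp in [(1024, 8, 8), (3072, 8, 12)]:
    dp = data_parallel_degree(world_size, tp, pp)
    print(f"{world_size} GPUs -> TP={tp} x PP={pp} x DP={dp}")
```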
- Required porting Megatron-DeepSpeed to Frontier's infrastructure
- Rewrote various NVIDIA-optimized CUDA operations in AMD's HIP
- Reimplemented several operations to work on AMD's ROCm software stack (see the portability sketch after this list)
- Customized PyTorch Distributed to work with the SLURM HPC scheduler (see the launcher sketch after this list)
- Scaled training from 1,024 GPUs (for the 175B model) to 3,072 GPUs (for the 1T model)
- The 1T parameter training run used only 4% of the system's 75,264 GPUs
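One reason the ROCm port is tractable is that PyTorch's ROCm builds reuse the `torch.cuda` namespace, so much device-selection code runs unchanged on MI250X hardware. A hedged illustration using standard PyTorch APIs, not anything specific to the team's codebase:

```python
import torch

# On ROCm builds of PyTorch the torch.cuda namespace is reused, so the same
# device checks run on AMD MI250X GPUs; torch.version.hip is set instead of
# torch.version.cuda. This is generic PyTorch behavior, not the paper's code.
def describe_accelerator() -> str:
    if not torch.cuda.is_available():
        return "no GPU visible"
    backend = "ROCm/HIP" if torch.version.hip else "CUDA"
    return f"{torch.cuda.get_device_name(0)} via {backend}"

print(describe_accelerator())
```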
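And here is a minimal sketch of the SLURM-to-PyTorch-Distributed wiring the bullet above refers to, assuming only standard SLURM environment variables. It is not the team's actual launcher; in practice `master_addr` would come from the job script (e.g., the first node in the allocation).

```python
import os
import torch.distributed as dist

# Minimal sketch (not the team's launcher): each task started by `srun`
# derives its global rank and world size from standard SLURM environment
# variables, then joins one process group for collective communication.
def init_distributed_from_slurm(master_addr: str, master_port: int = 29500):
    rank = int(os.environ["SLURM_PROCID"])         # global rank of this task
    world_size = int(os.environ["SLURM_NTASKS"])   # total number of tasks
    local_rank = int(os.environ["SLURM_LOCALID"])  # GPU index on this node

    os.environ.setdefault("MASTER_ADDR", master_addr)
    os.environ.setdefault("MASTER_PORT", str(master_port))

    # "nccl" selects RCCL on ROCm builds of PyTorch, so this line is the
    # same on NVIDIA and AMD systems.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    return rank, local_rank, world_size
```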
Why this matters
The best the US’s largest supercomputer can do is behind industry: In 2023, there were a bunch of public GPU training runs on the level of a few thousand GPUs. There were also some very large non-public training runs in 2022 and 2023 (e.g., GPT-4 and Claude 2) which are broadly believed to be significantly larger than that. Now the very important question is: how ambitious is the US government willing to be here, and will it be satisfied with its best supercomputer playing second fiddle to the clusters of the private sector?