AI TRAINING
Google researchers establish scaling laws for efficient distributed AI training
Google researchers have studied the 'scaling laws' for a type of distributed training pioneered by Google DeepMind called DiLoCo. Their results are surprising - they show that when well-tuned, DiLoCo scales better with model size than data-parallel training, and can outperform data-parallel training even at small model sizes. In other words, distributed training techniques - where you train one AI system across multiple data centers - can match or exceed the performance and efficiency of training within a single datacenter. This has significant implications for AI policy, though these will need to be proved out at larger scales before they come to pass. The most important idea this research suggests is that it may be possible to train an AI system across multiple distinct data centers and obtain the same quality of system as one trained in a single large-scale facility.
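To make the recipe concrete, here is a minimal single-process sketch of a DiLoCo-style outer/inner loop: each replica takes many local optimizer steps, and only the "pseudo-gradients" (how far each replica drifted from the shared parameters) are exchanged and averaged. This is a sketch under assumptions, not the paper's exact configuration: the `shards` data pipeline is a hypothetical stand-in, M/H and learning rates are illustrative, and for brevity the inner optimizer state is re-initialized each round.

```python
# Minimal sketch of DiLoCo-style low-communication training (illustrative,
# not the authors' implementation). Assumes `shards` is a hypothetical list
# of M batch iterators, one per replica.
import copy
import torch
import torch.nn as nn

def diloco_train(model: nn.Module, shards, M: int = 4, H: int = 500,
                 outer_steps: int = 10, inner_lr: float = 1e-4,
                 outer_lr: float = 0.7):
    global_params = [p.detach().clone() for p in model.parameters()]
    # Outer optimizer: SGD with Nesterov momentum over the averaged
    # pseudo-gradients, following the published DiLoCo recipe.
    outer_opt = torch.optim.SGD(global_params, lr=outer_lr,
                                momentum=0.9, nesterov=True)
    for _ in range(outer_steps):
        pseudo_grads = [torch.zeros_like(p) for p in global_params]
        for m in range(M):
            # Each replica starts the round from the shared global params.
            replica = copy.deepcopy(model)
            with torch.no_grad():
                for p, g in zip(replica.parameters(), global_params):
                    p.copy_(g)
            inner_opt = torch.optim.AdamW(replica.parameters(), lr=inner_lr)
            data = iter(shards[m])
            for _ in range(H):  # H local steps with zero communication
                inputs, targets = next(data)
                loss = nn.functional.cross_entropy(replica(inputs), targets)
                inner_opt.zero_grad()
                loss.backward()
                inner_opt.step()
            # Pseudo-gradient: drift from the global params. Only this
            # (one tensor per parameter, once per H steps) crosses the wire.
            with torch.no_grad():
                for pg, g, p in zip(pseudo_grads, global_params,
                                    replica.parameters()):
                    pg += (g - p) / M  # average across the M replicas
        # Outer step: descend along the averaged pseudo-gradient.
        for g, pg in zip(global_params, pseudo_grads):
            g.grad = pg
        outer_opt.step()
        outer_opt.zero_grad()
    with torch.no_grad():
        for p, g in zip(model.parameters(), global_params):
            p.copy_(g)
    return model
```

The key design point is that nodes exchange one pseudo-gradient per parameter every H steps instead of one gradient per parameter every step, which is what makes cross-datacenter links viable.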
- The study focuses on predicting evaluation loss and optimal hyperparameter choices for specific model sizes.
- Tests on models with 4 billion and 10 billion parameters confirmed that the scaling predictions were accurate.
- DiLoCo reduced total communication between nodes by a factor of over 100 (see the back-of-the-envelope comparison after this list for where that factor comes from).
- Simulated training of larger models such as Llama3 405B and DeepSeek-V3 671B suggested promising computational efficiency.
- Even with a single replica (M = 1), DiLoCo attained lower evaluation loss and higher zero-shot accuracy than standard data-parallel training.
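The communication reduction follows directly from synchronization frequency: data-parallel training exchanges gradients every step, while DiLoCo exchanges pseudo-gradients once every H inner steps. A quick back-of-the-envelope comparison, with illustrative numbers rather than figures from the paper:

```python
# Back-of-the-envelope communication comparison (illustrative numbers).
params = 10e9            # a 10B-parameter model
bytes_per_param = 2      # bf16 gradients / pseudo-gradients
steps = 100_000          # total inner training steps
H = 500                  # DiLoCo synchronizes once every H steps

dp_traffic = params * bytes_per_param * steps           # sync every step
diloco_traffic = params * bytes_per_param * (steps // H)  # sync every H steps
print(f"data-parallel: {dp_traffic / 1e15:.1f} PB exchanged per worker")
print(f"DiLoCo       : {diloco_traffic / 1e12:.1f} TB exchanged per worker")
print(f"reduction    : {dp_traffic / diloco_traffic:.0f}x")  # roughly H
```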
Why this matters:
Distributed training breaks some of the core assumptions of AI policy: if a 70B model can be trained across 10 distinct datacenters, then policy tools built on the assumption of centralization, like monitoring large compute clusters or export controls aimed at single large facilities, might be invalidated.