AI INFRASTRUCTURE
HuggingFace launches the Boom project for large-scale distributed model training
HuggingFace has started the 'Boom' project, whose stated goal is to 'train a decoder-only Transformer language model at the 70-100 billion parameter scale for +20T tokens'. They estimate the compute requirement at ~5 million H100-hours, equivalent to month-long allocations of 512 H100s each from ~10 different datacenters. HuggingFace is apparently validating the project now: it is in discussion with 12 data center operators, has already confirmed compute from ~6 of them, and plans to start a pilot in March/April. If HuggingFace succeeds, AI policy could end up looking quite different.
- Goal is to train a model at the 70-100 billion parameter scale.
- Training set involves over 20 trillion tokens.
- Requires approximately 5 million H100-hours of compute (see the back-of-envelope sketch after this list).
- Collaborating with 12 different data center operators for distributed resources.
- Pilot program is expected to begin in March or April of 2025.
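As a rough sanity check on these figures, the sketch below combines the standard 6·N·D training-FLOPs approximation with an assumed H100 BF16 peak throughput and utilization. The parameter count (mid-range of 70-100B), the ~989 TFLOP/s peak, and the 45% MFU are illustrative assumptions, not numbers from the announcement.

```python
# Back-of-envelope check of the Boom compute estimate.
# Assumptions (not from the announcement): ~85B parameters, 20T training
# tokens, H100 SXM BF16 peak ~989 TFLOP/s, ~45% model FLOPs utilization.

params = 85e9             # mid-range of the 70-100B target
tokens = 20e12            # the project targets 20T+ tokens
train_flops = 6 * params * tokens          # standard 6*N*D approximation

h100_peak_flops = 989e12  # assumed H100 BF16 peak, FLOP/s
mfu = 0.45                # assumed model FLOPs utilization
effective_flops = h100_peak_flops * mfu

gpu_hours = train_flops / effective_flops / 3600
print(f"Estimated compute: {gpu_hours / 1e6:.1f}M H100-hours")  # ~6.4M

# How many month-long 512-GPU allocations does ~5M H100-hours correspond to?
hours_per_allocation = 512 * 24 * 30       # one datacenter, one month
allocations = 5e6 / hours_per_allocation
print(f"Month-long 512-H100 allocations: {allocations:.1f}")    # ~13.6
```

With these assumptions the estimate lands in the same ballpark as the quoted ~5 million H100-hours; the exact figure depends heavily on the parameter count chosen and on the utilization actually achieved across heterogeneous datacenters.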
Why does it matter?
This project serves as a real-world test of distributed training at scale. Success would demonstrate that state-of-the-art models no longer require a single massive, co-located supercomputer, decentralizing the power of AI development.