Machine Learning Research
Mistral released the Ministral 3 model family, built with cascade distillation
Mistral compressed Mistral Small 3.1 into much smaller versions, yielding a family of relatively small, open-weights, vision-language models that perform better by some measures than competing models of similar size. The method combines pruning and distillation. Mistral AI released weights for the Ministral 3 family in parameter counts of 14 billion, 8 billion, and 3 billion. Each size comes in base, instruction-tuned, and reasoning variants. The team detailed its recipe for distilling the models in a paper. Starting with a larger parent, they alternately pruned it (removing less important parameters) and distilled it (training the smaller model to mimic the larger model's outputs) into progressively smaller children.
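As a rough illustration of the distillation step described above (a minimal sketch, not Mistral's published recipe; the model handles and shapes are placeholders), a student can be trained to match a frozen teacher's next-token distributions with a KL-divergence loss:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence between teacher and student next-token distributions.

    student_logits, teacher_logits: tensors of shape (batch, seq_len, vocab).
    """
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # "batchmean" matches the standard KL-distillation formulation; the t**2
    # factor keeps gradient scale comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t ** 2)

# Hypothetical training step (teacher frozen, student updated):
# with torch.no_grad():
#     teacher_logits = teacher(input_ids).logits
# student_logits = student(input_ids).logits
# loss = distillation_loss(student_logits, teacher_logits)
# loss.backward()
```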
- The team pruned Mistral Small 3.1 (24B) to create Ministral 3 14B, which then served as the parent for the 8B and 3B versions.
- Pruning removed the layers whose outputs differed least from their inputs and narrowed the fully connected layers (see the pruning sketch after this list).
- During pretraining, the pruned models were trained to mimic Mistral Small 3.1; during fine-tuning, they mimicked the larger Mistral Medium 3.
- Instruction tuning used ODPO, a technique in which an LLM compares responses to steer the model toward preferred outputs (see the preference-tuning sketch after this list).
- Reasoning variants were trained on step-by-step examples in math and coding and further improved with GRPO (see the GRPO sketch after this list).
- Ministral 3 14B base outperformed its parent model on GPQA Diamond and benchmarks for math and multimodal understanding.
- Training required only 1 trillion to 3 trillion tokens, far fewer than the 15 trillion to 36 trillion tokens used for Qwen 3 or Llama 3.
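The layer-removal criterion above can be sketched as scoring each transformer block by how little it changes its input on a small calibration set and dropping the blocks that change it least. The helper below is a hypothetical illustration, not Mistral's code; the per-layer hidden-state lists and their shapes are assumptions.

```python
import torch

@torch.no_grad()
def rank_layers_by_importance(hidden_states_in, hidden_states_out):
    """Score each layer by how much it changes its input.

    hidden_states_in / hidden_states_out: lists of tensors, one per layer,
    each of shape (batch, seq_len, hidden_dim), collected on a calibration
    set. A layer whose output is nearly identical to its input (cosine
    similarity close to 1) is a candidate for removal.
    """
    scores = []
    for x_in, x_out in zip(hidden_states_in, hidden_states_out):
        cos = torch.nn.functional.cosine_similarity(x_in, x_out, dim=-1)
        # Importance = average change the layer makes to its input.
        scores.append(1.0 - cos.mean().item())
    # Indices sorted from least to most important; the front of the list
    # would be pruned first.
    return sorted(range(len(scores)), key=lambda i: scores[i])
```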
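The article doesn't spell out ODPO's objective, so as a stand-in the preference-tuning sketch below shows the closely related DPO loss, which nudges the policy toward a chosen response over a rejected one relative to a frozen reference model. Log-probabilities are assumed to be summed over response tokens.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO-style preference loss (an assumed stand-in for ODPO).

    Each argument is a tensor of per-example summed log-probabilities of the
    chosen / rejected response under the trained policy or the frozen
    reference model.
    """
    # Implicit reward: how much more the policy favors a response than the
    # reference model does.
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss pushes the chosen reward above the rejected reward.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```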
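For the reasoning variants, GRPO estimates advantages by comparing each sampled answer's reward to the mean reward of its group, avoiding a separate value model. The GRPO sketch below shows only that group-relative advantage computation; sampling, the reward function, and the clipped policy-gradient update are omitted, and the tensor shape is an assumption.

```python
import torch

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages as used in GRPO.

    rewards: tensor of shape (num_prompts, group_size) holding the scalar
    reward (e.g., correctness of a math answer) for each of several
    completions sampled per prompt.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    # Each completion is scored relative to the other completions for the
    # same prompt; no learned value function is needed.
    return (rewards - mean) / (std + eps)

# These advantages then weight a PPO-style clipped policy-gradient update
# on the tokens of each completion.
```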
Why it matters
Cascade distillation offers a way to produce a high-performance model family from a single parent at a fraction of the usual cost. Training runs are shorter and the algorithm is relatively simple, potentially enabling developers to build multiple model sizes without proportionately higher training costs.