Machine Learning Research
Mistral released the Ministral 3 model family, built with cascade distillation
Mistral compressed Mistral Small 3.1 into much smaller versions, yielding a family of relatively small, open-weights, vision-language models that perform better by some measures than competing models of similar size. The method combines pruning and distillation. Mistral AI released weights for the Ministral 3 family in parameter counts of 14 billion, 8 billion, and 3 billion. Each size comes in base, instruction-tuned, and reasoning variants. The team detailed its recipe for distilling the models in a paper. Starting with a larger parent, they alternately pruned it (removing less important parameters) and distilled it (training the smaller model to mimic the larger model's outputs) into progressively smaller children.
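As a rough illustration of the distillation step described above (a minimal sketch, not Mistral's published recipe; the model handles and shapes are placeholders), a student can be trained to match a frozen teacher's next-token distributions with a KL-divergence loss:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence between teacher and student next-token distributions.

    student_logits, teacher_logits: tensors of shape (batch, seq_len, vocab).
    """
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # "batchmean" matches the standard KL-distillation formulation; the t**2
    # factor keeps gradient scale comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t ** 2)

# Hypothetical training step (teacher frozen, student updated):
# with torch.no_grad():
#     teacher_logits = teacher(input_ids).logits
# student_logits = student(input_ids).logits
# loss = distillation_loss(student_logits, teacher_logits)
# loss.backward()
```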
- The team pruned Mistral Small 3.1 (24B) to create Ministral 3 14B, which then served as the parent for the 8B and 3B versions.
- Pruning removed the layers whose outputs differed least from their inputs and narrowed the fully connected layers (see the pruning sketch after this list).
- During pretraining, the pruned models were trained to mimic Mistral Small 3.1; during fine-tuning, they mimicked the larger Mistral Medium 3.
- Instruction tuning used ODPO, a technique in which an LLM compares responses to steer the model toward preferred outputs (see the preference-tuning sketch after this list).
- Reasoning variants were trained on step-by-step examples in math and coding and further improved with GRPO (see the GRPO sketch after this list).
- Ministral 3 14B base outperformed its parent model on GPQA Diamond and benchmarks for math and multimodal understanding.
- Training required only 1 trillion to 3 trillion tokens, far fewer than the 15 trillion to 36 trillion tokens used for Qwen 3 or Llama 3.
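The layer-removal criterion above can be sketched as scoring each transformer block by how little it changes its input on a small calibration set and dropping the blocks that change it least. The helper below is a hypothetical illustration, not Mistral's code; the per-layer hidden-state lists and their shapes are assumptions.

```python
import torch

@torch.no_grad()
def rank_layers_by_importance(hidden_states_in, hidden_states_out):
    """Score each layer by how much it changes its input.

    hidden_states_in / hidden_states_out: lists of tensors, one per layer,
    each of shape (batch, seq_len, hidden_dim), collected on a calibration
    set. A layer whose output is nearly identical to its input (cosine
    similarity close to 1) is a candidate for removal.
    """
    scores = []
    for x_in, x_out in zip(hidden_states_in, hidden_states_out):
        cos = torch.nn.functional.cosine_similarity(x_in, x_out, dim=-1)
        # Importance = average change the layer makes to its input.
        scores.append(1.0 - cos.mean().item())
    # Indices sorted from least to most important; the front of the list
    # would be pruned first.
    return sorted(range(len(scores)), key=lambda i: scores[i])
```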
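The article doesn't spell out ODPO's objective, so as a stand-in the preference-tuning sketch below shows the closely related DPO loss, which nudges the policy toward a chosen response over a rejected one relative to a frozen reference model. Log-probabilities are assumed to be summed over response tokens.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO-style preference loss (an assumed stand-in for ODPO).

    Each argument is a tensor of per-example summed log-probabilities of the
    chosen / rejected response under the trained policy or the frozen
    reference model.
    """
    # Implicit reward: how much more the policy favors a response than the
    # reference model does.
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss pushes the chosen reward above the rejected reward.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```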
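For the reasoning variants, GRPO estimates advantages by comparing each sampled answer's reward to the mean reward of its group, avoiding a separate value model. The GRPO sketch below shows only that group-relative advantage computation; sampling, the reward function, and the clipped policy-gradient update are omitted, and the tensor shape is an assumption.

```python
import torch

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages as used in GRPO.

    rewards: tensor of shape (num_prompts, group_size) holding the scalar
    reward (e.g., correctness of a math answer) for each of several
    completions sampled per prompt.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    # Each completion is scored relative to the other completions for the
    # same prompt; no learned value function is needed.
    return (rewards - mean) / (std + eps)

# These advantages then weight a PPO-style clipped policy-gradient update
# on the tokens of each completion.
```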
Why it matters
Cascade distillation offers a way to produce a high-performance model family from a single parent at a fraction of the usual cost. Training runs are shorter and the algorithm is relatively simple, potentially enabling developers to build multiple model sizes without proportionately higher training costs.