MI Történik?

Artificial intelligence news in Hungarian, updated daily


PostTrainBench: LLMs fine-tune other LLMs, nearing human performance

Researchers at the University of Tübingen have built and released PostTrainBench, a test of how well frontier language models from companies like Anthropic, OpenAI, and Google can fine-tune open-weight models. The results show that frontier models can already eke out 20%+ improvements on specific benchmarks through fine-tuning, compared to 60%+ for a human.

Each agent is given an input consisting of benchmark tasks to improve performance on, a model to fine-tune, standard resources (one H200 GPU for 10 hours), and an agent harness (e.g., Claude gets Claude Code and GPT gets Codex). Agents also receive a prompt, a testing script, task context, and web search access, and they produce a fine-tuned model along with training logs. The approach is general, so you could target whatever benchmark seemed high-signal to you; here, the researchers use AIME 2025, BFCL, GPQA, GSM8K, and HumanEval. Tested models include Qwen 3 1.7B and 3B, SmolLM-3B, and Gemma 3 4B.

OpenAI's GPT 5.1 Codex Max does the best overall, scoring an aggregated 30%+ improvement across all tested models and benchmarks, followed by Opus 4.5 (20%+) and Gemini 3 Pro (~18%).
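The headline numbers above are aggregates of per-(model, benchmark) gains. As a rough illustration of how such a score could be computed, here is a minimal sketch that averages relative improvements over baseline; the figures are invented and PostTrainBench's exact scoring formula is an assumption, not taken from the paper.

```python
# Hypothetical illustration of aggregating fine-tuning gains.
# All accuracy numbers below are made up; the averaging rule is an
# assumption about how a leaderboard like PostTrainBench might score agents.

baseline = {
    ("Qwen3-1.7B", "GSM8K"): 0.50,
    ("Qwen3-1.7B", "HumanEval"): 0.40,
    ("Gemma-3-4B", "GSM8K"): 0.60,
}

tuned = {
    ("Qwen3-1.7B", "GSM8K"): 0.62,
    ("Qwen3-1.7B", "HumanEval"): 0.46,
    ("Gemma-3-4B", "GSM8K"): 0.69,
}

def relative_improvement(base: float, post: float) -> float:
    """Relative gain over the baseline score, e.g. 0.50 -> 0.62 is +24%."""
    return (post - base) / base

# Average the relative gains across every (model, benchmark) pair.
gains = [relative_improvement(baseline[k], tuned[k]) for k in baseline]
aggregate = sum(gains) / len(gains)
print(f"aggregate improvement: {aggregate:.1%}")  # prints "aggregate improvement: 18.0%"
```

Under this toy rule, an agent that lifts every benchmark by a similar relative margin gets roughly that margin as its aggregate score, which is how a single "30%+" figure can summarize many model/benchmark pairs.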
Why does this matter?

Benchmarks like this give us a sense of how well AI systems can perform many of the tasks an AI researcher does. PostTrainBench also measures how well they handle an inherently complicated, multi-step, long-time-horizon task. These properties make it a useful lens on how well AI systems are doing at components of AI research itself, and here the evidence is that today's frontier models are already within striking distance of a human. I'd expect we'll see a system beat the human baseline by September 2026.

View the original source (English) →