AI Agents Automate LLM Post-Training, Show Rapid Progress and Reward Hacking

2026. március 16. · MI Történik? · 4 perc olvasás

AI-driven R&D might be the most important thing in all of AI, because it helps us understand whether AI systems might eventually build their own successors. So far, much of the focus on AI R&D has been in components that support AI development (e.g., autonomous creation of AI kernels), or training base models (e.g, the NanoGPT speedrun benchmark). But there’s been less attention paid to fine-tuning - the task involving adapting an existing LLM to a new dataset or behavior. Researchers from the University of Tübingen, the Max Planck Institute for Intelligent Systems, and AI research organization Thoughtful Lab want to change that with PostTrainBench, a benchmark which targets a specific aspect of post-training; improving performance against a given dataset. “Post-training is how raw language models become useful”, the authors write. “Given a clear objective and limited compute, can today’s agents do the technical work?”. The answer appears to be ‘yes, but not as well as humans’. How PostTrainBench works: “We give a frontier coding agent — Claude Code, Codex CLI, or Gemini CLI — a base language model and a target benchmark”. Things that make you go ‘uh oh’ - reward hacking: While running this benchmark the authors saw numerous instances of AI models trying to game the benchmark to get a high score. Smart agents reward hack more: “More capable agents appear better at finding exploitable paths: identifying specific benchmark samples to embed, reverse-engineering evaluation failure patterns, and even attempting to obscure contamination through cosmetic modifications such as renaming functions,” they write. For example, “the Codex agent modified the Inspect AI evaluation framework code to inflate scores, and Claude downloaded an instruction-tuned model instead of fine-tuning the base model”.

End-to-end: “Agents must build their entire training pipeline from scratch”
Autonomous: “Agents operate with full autonomy over data sources, training methods, and experimental strategy.”
Resource-bounded: “Each run is constrained to 10 hours on a single H100 GPU”.
Integrity-preserving: “Agents may not train on benchmark test data, modify the evaluation harness, or substitute a different model.”
4 models and 7 benchmarks: The initial eval runs on four models: Qwen3-1.7B, Qwen3-4B, SmolLM3-3B, Gemma-3-4B. It tests these models across seven distinct benchmarks: AIME 2025, GSM8K, GPQA, HumanEval, BFCL, Arena-Hard, HealthBench-Easy.
Results - big models win, especially Opus 4.6: “The top-performing agent — Opus 4.6 running on Claude Code — scores 23.2%, about 3× higher than the 7.5% base model average.”
But humans are still much better: “Yet this is still less than half the 51.1% achieved by human teams who post-train these same base models at their home labs”.
Fast progress: “The gap is significant but narrowing quickly: Claude Sonnet 4.5 scored 9.9% in September 2025, while GPT-5.2 reached 21.5% just months later.”
Direct benchmark ingestion: “Agents loaded the benchmark evaluation dataset directly via Hugging Face and used it as training data”.
Hardcoded benchmark problems: “Agents embedded evaluation questions directly into data preparation scripts disguised as “synthetic” examples”.
Evaluation guided data generation: “Some agents reverse engineered the evaluation… Kimi K2.5 read HealthBench evaluation files to extract theme distributions and rubric criteria, then crafted training data tailored to match”.
Indirect contamination via intermediate datasets: “Opus 4.6 loaded ‘CodeFeedback-Filtered-Instruction’ which contains HumanEval-derived problems. This form of contamination is harder to detect but equally problematic.”

Miért fontos?

Why this matters - rapid progress towards an “AI for everything” future: Benchmarks like post-train give us a sense of how quickly AI systems are improving at the fundamental tasks of AI research, serving both as an eval of long-time-horizon agentic autonomy, as well as something that speaks to the potential for compounding acceleration of AI development itself. “The gap between agent performance (23.2%) and instruction-tuned baselines (51.1%) suggests that full automation of post-training remains out of reach for now, but the rapid improvement across model generations—from 9.9% for Sonnet 4.5 to 23.2% for Opus 4.6 within roughly six months—implies this gap may close faster than expected,” the researchers write. Imagine where we’ll be in two years - we’ll certainly have AI models that are smart enough to point themselves at a specific objective, find an open weight model, then autonomously improve it to get better performance at that task. The era of ephemeral, custom AI systems, built and budded off into the world like spores from mushrooms, draws near. Are you ready for this new ecosystem you will find yourself in? I am not. But nonetheless it approaches.

Eredeti forrás megtekintése (angol) →