ChipBench: AI Chip Design Harder Than Current Benchmarks Suggest
- Many existing Verilog benchmarks contain simple functional modules of 10 to 76 lines; in real-world deployments, Verilog modules can exceed 10,000 lines.
- Insufficient focus on debugging: bugs that make it into physical hardware are extremely costly to fix, so it may be better to concentrate LLM effort on debugging chip designs.
- The focus on Verilog crowds out reference model evaluation: “In industrial workflows, reference model generation is even more resource-intensive than Verilog design, reflected in a 1:1 - 5:1 ratio of verification engineers (write reference model) to design engineers (write Verilog)”.
- Verilog writing: based on 44 modules from real-world hardware. “Our dataset features 3.8x longer code length and 13.9x more cells than VerilogEval.” The tests span three categories: self-contained module tests, hierarchical modules that are not self-contained, and CPU IP modules sourced directly from open-source CPU projects.
- Verilog debugging: 89 test cases covering four error types (timing, arithmetic, assignment, and state machine bugs), built by manually injecting faults into known-good Verilog modules. There are two types of debugging tests, zero-shot and one-shot: “The zero-shot test provides the model with the module description and buggy implementation, indicating that an error exists without providing localization details. The one-shot test provides identical information but supplements it with simulation waveform data (.vcd files)”. (A minimal sketch of reading such waveform data follows this list.)
- Reference model generation: 132 samples, enabling evaluation of reference model generation across Python, SystemC, and CXXRTL. (A toy Python reference model follows this list.)
- Verilog generation:
    - CPU IP: Highest is 22.22% (Claude 4.5 Opus, Gemini 3 Flash, GPT 5.2).
    - Non-Self-Contained: Highest is 50% (DeepSeek-Coder).
    - Self-Contained: Highest is 36.67% (Claude 4.5 Opus, Gemini 3 Flash).
- Python reference model generation:
    - CPU IP: 11.1% (Claude 4.5 Sonnet, Gemini 3 Flash).
    - Non-Self-Contained: 0% (pass@1).
    - Self-Contained: Highest is 40% (Claude 4.5 Haiku, Claude 4.5 Opus, Gemini 2.5 Pro, GPT-5).
- Verilog debugging: generally better performance, but still no model cracks 50% pass@1 when averaged across tasks. (For reference, the standard pass@k estimator is sketched after this list.)
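To make the one-shot debugging setup concrete: a .vcd file is just a text log of signal value changes over simulation time, so turning it into something a model can read in a prompt is straightforward. Here's a minimal sketch of a pure-Python VCD reader; the file name and signal names are hypothetical, not taken from the benchmark.

```python
# Minimal .vcd (Value Change Dump) reader -- a sketch of how the one-shot
# debugging setup could turn simulation waveforms into text an LLM can read.
# File path and signal names are hypothetical, not from the benchmark.
from collections import defaultdict

def parse_vcd(path):
    """Map each signal name to a list of (time, value) changes."""
    id_to_name = {}
    changes = defaultdict(list)
    time = 0
    with open(path) as f:
        for line in f:
            tok = line.split()
            if not tok:
                continue
            if tok[0] == "$var":
                # e.g. "$var wire 8 ! count [7:0] $end" -> id "!", name "count"
                id_to_name[tok[3]] = tok[4]
            elif tok[0].startswith("#"):
                time = int(tok[0][1:])          # timestamp in simulation units
            elif tok[0][0] in "01xzXZ" and len(tok) == 1:
                sig_id = tok[0][1:]             # scalar change, e.g. "0!"
                changes[id_to_name.get(sig_id, sig_id)].append((time, tok[0][0]))
            elif tok[0][0] in "bB":
                sig_id = tok[1]                 # vector change, e.g. "b1010 !"
                changes[id_to_name.get(sig_id, sig_id)].append((time, tok[0][1:]))
    return changes

# Render a compact trace for the prompt, e.g.:
# for t, v in parse_vcd("buggy_counter.vcd")["count"]: print(t, v)
```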
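And to make "reference model" concrete: in verification, a reference (golden) model re-implements the intended behavior of the RTL in software, and the testbench checks the design against it cycle by cycle. Here's a toy Python example for a hypothetical 8-bit counter; it's my illustration, not one of the 132 benchmark samples.

```python
# Toy cycle-accurate golden model for a hypothetical 8-bit counter with
# synchronous reset and enable. My example, not a benchmark sample.
class CounterRefModel:
    """Golden model: mirrors the intended RTL behavior one clock at a time."""

    def __init__(self, width=8):
        self.max = (1 << width) - 1
        self.count = 0

    def step(self, rst, en):
        """Advance one rising clock edge; return the expected count output."""
        if rst:                               # synchronous reset has priority
            self.count = 0
        elif en:                              # count up, wrapping on overflow
            self.count = (self.count + 1) & self.max
        return self.count

# Verification drives the DUT and the model with identical stimulus and
# compares outputs every cycle:
ref = CounterRefModel()
for cycle, (rst, en) in enumerate([(1, 0), (0, 1), (0, 1), (0, 0)]):
    expected = ref.step(rst, en)
    # assert dut_output[cycle] == expected
```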
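A note on scoring: the results above are pass@1 numbers. Assuming ChipBench scores the usual way (n completions sampled per problem, c of which pass the testbench), the standard unbiased pass@k estimator from Chen et al. (2021) looks like this:

```python
# Standard unbiased pass@k estimator (Chen et al., 2021). I'm assuming
# ChipBench uses this conventional formulation; the paper reports pass@1.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k draws from n samples passes,
    given that c of the n samples are correct."""
    if n - c < k:
        return 1.0               # too few failing samples to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 3 correct out of 10 samples: pass@1 = 0.3
assert abs(pass_at_k(10, 3, 1) - 0.3) < 1e-9
```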
Though some AI systems have been used to build chips, they’ve typically been highly specialized, or embedded inside extremely good scaffolds that elicit good chip-design behavior and stop them from causing problems. What the researchers show here is that out-of-the-box LLMs are still pretty shitty at general-purpose, real-world chip design: “Current models have significant limitations in AI-aided chip design and remain far from ready for real industrial workflow integration.” At the same time, I can’t escape the feeling that there’s a scaffold for “being good at Verilog” which a contemporary AI system might be able to build if asked to, and which would radically improve performance on this benchmark.