ChipMATE

Multi-Agent Training via Reinforcement Learning for Enhanced RTL Generation

Existing API-based agentic systems for RTL code generation are fundamentally misaligned with industrial practice: they assume a golden testbench at generation time, depend on closed-source APIs incompatible with chip vendors' air-gapped security requirements, and cannot be trained on vendors' proprietary RTL codebases. We present ChipMATE, the first self-trained multi-agent framework for RTL generation. Inspired by industrial practice — where correctness emerges from cross-comparison between independently written RTL modules and reference models — ChipMATE pairs a Verilog agent with a Python reference-model agent that mutually verify each other's outputs without any golden oracle. With a backtrack-based inference workflow and a two-stage training pipeline, ChipMATE achieves 75.0% and 80.1% pass@1 on VerilogEval V2 with 4B and 9B base models — outperforming all existing self-trained models and even DeepSeek V4 (1600B parameters).

Overview of ChipMATE. (a) A multi-agent cross-verification workflow with a backtrack mechanism: a Verilog agent and a Python reference-model agent cross-verify each other's outputs. (b) The two-stage training pipeline, complemented by a hybrid reference-model data generation framework.

Introduction

LLMs excel at general-purpose programming but struggle with Register-Transfer Level (RTL) code generation, a cornerstone of modern chip design — largely due to the scarcity of high-quality RTL data in public corpora. Recent agentic workflows like MAGE and VerilogCoder orchestrate LLMs through task decomposition and iterative self-correction, achieving promising results on academic benchmarks.

Despite these advances, current agentic RTL pipelines are fundamentally misaligned with industrial chip-design practice in three coupled ways:

  • No golden testbench in production — In real chip vendors, testbenches are written by dedicated verification engineers after the RTL, not before. Even when one exists, it is often imperfect — not a reliable oracle.
  • API-LLMs conflict with security posture — RTL is treated as first-class IP; core development servers are air-gapped to prevent any data leakage to third-party endpoints.
  • Internal codebases sit unused — Years of production-grade RTL — far superior to any public training data — remain entirely unused because API-based LLMs cannot be fine-tuned on proprietary code.

Recent self-trained models (QiMeng-CodeV-R1, RTLSeek) remove the API dependency, but generate code in a single turn: no mechanism to check or correct their own output. This is unsurprising — even a senior design engineer rarely writes correct RTL on the first attempt. The chip industry solves this through a cross-verification workflow: design engineers write RTL while verification engineers independently implement a reference model in a high-level language; neither side is assumed correct, and the two iteratively compare outputs until the design is verified.

This industrial workflow directly inspires ChipMATE: a multi-agent system where one agent generates Verilog in the role of a design engineer and another generates a Python reference model in the role of a verification engineer. The two agents iteratively cross-verify each other to produce high-quality RTL code.

Three Challenges

  • (C1) Error propagation without a golden oracle — A mismatch between agents does not reveal which side is wrong. Naively asking one agent to "correct" the other compounds errors turn by turn, producing worse results than a single-model baseline.
  • (C2) Weak individual capability — Off-the-shelf Qwen3.5-4B/9B achieve below 45% on both Verilog and reference-model generation. Each agent's individual capability must be strengthened before any meaningful collaboration can occur.
  • (C3) No training data for reference-model generation — Even DeepSeek-R1 achieves below 20% pass@1 when converting Verilog into Python reference models. Direct distillation is impractical.

Three Contributions

  • Cross-verification multi-agent workflow (C1) — A backtrack mechanism that automatically reverts to a previous turn whenever the current turn produces worse results — preventing error compounding.
  • Two-stage training pipeline (C2) — Stage 1 trains each agent separately with SFT + RL to saturate individual capability; Stage 2 trains the team jointly with our X-GRPO algorithm to learn collaboration.
  • Hybrid data-generation framework (C3) — Combines API distillation, IR-based AST conversion, and category-specific augmentation to produce 64.4K high-quality reference-model samples — at a fraction of pure-API cost.

ChipMATE: Three Components

Cross-Verification Workflow

A Verilog agent and a Python reference-model agent mutually verify each other's outputs. A backtrack mechanism reverts to the last accepted turn whenever the current turn does not strictly improve the match rate.

Two-Stage Training Pipeline

Stage 1: each agent is trained separately with SFT + RL to saturate individual capability. Stage 2: both agents are trained jointly with our X-GRPO algorithm to learn collaboration.

Hybrid Data Generation

Combines API distillation, IR-level conversion, and category-specific augmentation to produce 64.4K verified Python reference-model samples for training.

Cross-Verification Workflow

Two agents independently generate N candidate implementations from the same specification. A cross-language comparison tool simulates both on 1000 random stimuli; if outputs disagree, each agent self-corrects over multiple turns — without ever seeing the other agent's code.
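The comparator's core check can be sketched as follows. Here `dut` and `ref` are hypothetical stand-ins for the compiled Verilog simulation and the Python reference model, reduced to plain callables on random input words; this is a sketch of the idea, not the actual tool.

```python
import random

def match_rate(dut, ref, n_stimuli=1000, width=8, seed=0):
    """Fraction of random stimuli on which two implementations agree.

    `dut` and `ref` are placeholders for the two independently generated
    implementations; each maps a `width`-bit input word to an output word.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_stimuli):
        x = rng.getrandbits(width)       # one random stimulus
        hits += int(dut(x) == ref(x))    # count agreeing outputs
    return hits / n_stimuli

# Toy pair: a reference adder vs. a "DUT" with a bug at one input value.
ref_model = lambda x: (x + 3) & 0xFF
buggy_dut = lambda x: 0 if x == 7 else (x + 3) & 0xFF
rate = match_rate(buggy_dut, ref_model)  # high but below 1.0: a mismatch to diagnose
```

In the real workflow the disagreement rate drives the self-correction loop; a rate of 1.0 over all stimuli is the acceptance condition.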

(1) Backtrack Mechanism

A correction is accepted only if it strictly improves the match rate; otherwise the agent reverts to its last accepted version. This prevents the agent from drifting into worse states turn after turn, the error-compounding failure mode that naive multi-agent correction loops fall into.
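The acceptance rule can be written as a small control loop. `revise` and `score` are hypothetical callbacks standing in for the agent's self-correction step and the cross-agent match rate; a minimal sketch, not the framework's implementation.

```python
def run_with_backtrack(initial, revise, score, max_turns=3):
    """Accept a revision only if it strictly improves the match rate;
    otherwise keep (backtrack to) the last accepted version.

    `revise(code)` proposes a corrected implementation;
    `score(code)` returns the cross-agent match rate in [0, 1].
    """
    best, best_score = initial, score(initial)
    for _ in range(max_turns):
        if best_score == 1.0:            # full agreement: done
            break
        candidate = revise(best)
        s = score(candidate)
        if s > best_score:               # strict improvement: accept
            best, best_score = candidate, s
        # else: discard the candidate and revise again from `best`
    return best, best_score
```

Note that rejected candidates never become the starting point of the next turn, which is exactly what blocks error compounding.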

(2) Waveform → Natural Language

Raw waveform mismatches are unparseable for LLMs. We build a converter that locates the first divergent cycle and packages the surrounding I/O context into a structured description the agent can directly act upon — bridging the low-level cycle-accurate world and the natural-language reasoning world.
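The first-divergence search at the heart of the converter can be sketched like this; the real tool also packages input stimuli and multi-signal context, and the function and argument names below are illustrative, not the actual API.

```python
def describe_mismatch(signal, expected, actual, window=2):
    """Locate the first divergent cycle between two output traces and
    package the surrounding context as a natural-language description.

    `expected` and `actual` are per-cycle output values of one signal.
    """
    for cycle, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            lo = max(0, cycle - window)
            hi = min(len(expected), len(actual), cycle + window + 1)
            ctx = ", ".join(
                f"cycle {c}: expected={expected[c]} actual={actual[c]}"
                for c in range(lo, hi)
            )
            return (f"First mismatch on '{signal}' at cycle {cycle}: "
                    f"expected {e}, got {a}. Context: {ctx}")
    return f"No mismatch on '{signal}'."
```

The resulting string is something an LLM can reason over directly, unlike a raw value-change dump.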

Prompt Design

We mirror the combinational/sequential separation found in standard RTL textbooks — LLMs already partially internalize this from pretraining. Each prompt has four sections: code skeleton (with module name and parameterized port list), combinational-logic guidelines (latch avoidance, full-case coverage), sequential-logic guidelines (reset handling, non-blocking conventions), and a few-shot example. This structured prompt alone lifts first-attempt pass rate by 1–5% on frontier API models.
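The four-section structure could be assembled as below. The guideline wording is illustrative placeholder text, not the authors' actual prompt; only the four-section layout comes from the description above.

```python
def build_prompt(skeleton, fewshot_example):
    """Assemble the four-section prompt: skeleton, combinational guidelines,
    sequential guidelines, and a few-shot example (section bodies here are
    placeholders)."""
    sections = [
        ("Code skeleton", skeleton),
        ("Combinational-logic guidelines",
         "Avoid unintended latches; cover every branch of each case statement."),
        ("Sequential-logic guidelines",
         "Reset all state explicitly; use non-blocking assignments in clocked blocks."),
        ("Few-shot example", fewshot_example),
    ]
    return "\n\n".join(f"### {title}\n{body}" for title, body in sections)
```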

Two-Stage Training Pipeline

X-GRPO trajectory sampling. (a) Standard GRPO collapses to group size 1 across multi-turn rollouts when dense per-turn rewards are used. (b) X-GRPO restores meaningful group variance: both agents independently sample K candidates per turn, which are paired into K×K candidate pairs; the best-scoring pair becomes the shared prefix for the next turn.

After Stage 1, both agents already excel at their respective code-generation tasks. Stage 2 teaches them to collaborate. When a mismatch arises, each agent must analyze the diagnostic information, determine whether the fault lies in its own code or the other agent's, and apply self-correction only when it identifies a genuine error in its own implementation.

X-GRPO Trajectory Sampling

Standard GRPO is ill-suited to multi-turn multi-agent settings: assigning a dense reward at the end of each turn collapses the effective group size to 1 for every turn beyond the first. We propose X-GRPO, drawing from Tree-of-Thought and AT-GRPO, which restores meaningful within-group variance across turns by sampling K×K candidate pairs and selecting the best as the shared prefix.
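One X-GRPO turn might be sketched as follows, with hypothetical rollout callbacks standing in for the two agents and `score_pair` for the cross-comparison match rate; a sketch of the sampling scheme, not the training code.

```python
import itertools

def xgrpo_turn(sample_verilog, sample_python, score_pair, K=4):
    """One X-GRPO turn: each agent samples K candidates, all K*K pairs are
    scored, and the best-scoring pair becomes the shared prefix for the
    next turn. The full score list is the group used for advantages.
    """
    v_cands = [sample_verilog() for _ in range(K)]
    p_cands = [sample_python() for _ in range(K)]
    pairs = list(itertools.product(v_cands, p_cands))   # K*K candidate pairs
    scores = [score_pair(v, p) for v, p in pairs]       # per-pair rewards
    best = max(range(len(pairs)), key=scores.__getitem__)
    return pairs[best], scores[best], scores
```

Because every turn scores K×K pairs rather than a single continuation, the within-group variance that GRPO's advantage estimate needs is preserved beyond the first turn.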

Hierarchical Reward Design

The per-agent reward has three components:

  • Local reward (Rlocal) — Multi-tiered: {0, 0.1, 0.2, 0.2 + 0.8c} for compile failure, runtime error, I/O port mismatch, and partial pass rate c. Encourages syntactically correct, high-pass code.
  • Correct-fix bonus (Rfix) — Sparse binary reward for successfully resolving a previously mismatched stimulus. Withheld if both agents agree on a wrong answer.
  • Team-match reward (Rmatch) — Dense reward proportional to the overall match ratio between the two agents — a smooth gradient toward mutual agreement.

Aggregate: R = δ_local·R_local + δ_fix·R_fix + δ_match·R_match, with δ_local = 10, δ_fix = 0.2, δ_match = 0.5.
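Written out as a sketch, with the tier values and coefficients as stated above; the `status` labels are hypothetical names for the failure modes, not the framework's identifiers.

```python
def local_reward(status, pass_rate=0.0):
    """Multi-tiered local reward: 0 for compile failure, 0.1 for runtime
    error, 0.2 for I/O port mismatch, else 0.2 + 0.8c for pass rate c."""
    tiers = {"compile_fail": 0.0, "runtime_error": 0.1, "port_mismatch": 0.2}
    if status in tiers:
        return tiers[status]
    return 0.2 + 0.8 * pass_rate

def total_reward(status, pass_rate, fixed_mismatch, match_ratio,
                 d_local=10, d_fix=0.2, d_match=0.5):
    """R = d_local*R_local + d_fix*R_fix + d_match*R_match.

    `fixed_mismatch` is True only when a previously mismatched stimulus was
    genuinely resolved (the bonus is withheld when both agents agree on a
    wrong answer); `match_ratio` is the overall agent-agreement ratio.
    """
    r_fix = 1.0 if fixed_mismatch else 0.0
    return (d_local * local_reward(status, pass_rate)
            + d_fix * r_fix
            + d_match * match_ratio)
```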

Hybrid Reference-Model Data Generation

No public Python reference-model dataset exists. We build a three-pipeline hybrid framework yielding 64.4K high-quality samples — at a fraction of pure-API cost.

Hybrid data generation framework: three pipelines combining LLM-API agentic distillation, IR-based AST-level conversion, and category-specific augmentation to produce the reference-model training corpus.

API Distillation (~25K)

A frontier LLM (DeepSeek-R1) generates Python reference models with chain-of-thought reasoning, each verified by our cross-language comparator. Cost: $2,000+ and 200+ hours of compute for only ~25K verified samples.

IR-Based Conversion (~36K)

Deterministic AST-level pipeline using PyVerilog: parse → behavioral lowering → top-module wrapping. Cost-free, completes in <2 hours, yields ~36K additional verified samples.
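As a toy illustration of the AST-level idea, the snippet below converts a single combinational `assign` into Python source via a regex. This is a drastic simplification under stated assumptions: the real pipeline uses PyVerilog's parser and handles sequential logic, constants, and full modules, none of which this sketch attempts.

```python
import re

def assign_to_python(stmt):
    """Convert one combinational Verilog `assign` (simple bitwise RHS only)
    into a Python lambda-definition string. Verilog's &, |, ^, ~ operators
    happen to be valid Python, which is what makes this toy mapping work."""
    m = re.match(r"assign\s+(\w+)\s*=\s*(.+);\s*$", stmt.strip())
    if not m:
        raise ValueError(f"unsupported statement: {stmt!r}")
    lhs, rhs = m.groups()
    # Treat every identifier in the RHS as an input port, in sorted order.
    args = sorted(set(re.findall(r"[A-Za-z_]\w*", rhs)))
    return f"{lhs} = lambda {', '.join(args)}: {rhs}"
```

For example, `assign y = a & b;` becomes the source string `y = lambda a, b: a & b`, which can be executed to obtain a cycle-free behavioral model of the gate.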

Targeted Augmentation (~3.4K)

After identifying weaknesses in FSMs, multi-cycle protocol blocks, and bit-level arithmetic, we collect targeted Verilog examples and convert them via the IR pipeline, shifting these categories' share of the corpus from ~15% to 28%.

Experimental Results

We evaluate ChipMATE against large foundation LLMs (GPT-4o, Claude Opus 4.7, DeepSeek V4), specialized models (CodeV-R1), and base Qwen3.5 on four benchmarks: VerilogEval v2, RTLLM v2, ChipBench-SC, and CVDP cid03.

75.0%
VerilogEval v2 / 4B

ChipMATE-Agents-4B pass@1

80.1%
VerilogEval v2 / 9B

ChipMATE-Agents-9B pass@1

1600B
DeepSeek V4 beaten

9B ChipMATE outperforms 1.6T DeepSeek V4

64.4K
Reference-model samples

Generated by our hybrid pipeline

End-to-End Verilog Generation

pass@k (%) on four benchmarks against foundation LLMs (20×–180× larger than our models), specialized RTL models, and base models.

| Type | Model | Size | VerilogEval v2 (p@1 / p@5) | RTLLM v2 (p@1 / p@5) | ChipBench-SC (p@1 / p@5) | CVDP cid03 (p@1 / p@5) |
| --- | --- | --- | --- | --- | --- | --- |
| Foundation | GPT-4o | — | 64.1 / 73.7 | 56.5 / 70.3 | 20.0 / 33.3 | 39.0 / 40.4 |
| Foundation | Claude Opus 4.7 | — | 86.9 / 90.4 | 64.8 / 68.0 | 31.3 / 46.7 | 42.8 / 47.9 |
| Foundation | DeepSeek Coder | 236B | 68.5 / 80.8 | 57.6 / 70.0 | 16.7 / 30.0 | 22.3 / 37.2 |
| Foundation | DeepSeek V4 | 1.6T | 67.3 / 80.1 | 58.8 / 66.0 | 18.0 / 36.7 | 21.5 / 34.6 |
| Foundation | DeepSeek R1 | 671B | 77.5 / 84.7 | 64.7 / 75.8 | 26.7 / 40.0 | 27.7 / 42.1 |
| Specialized | CodeV-R1 (distill) | 7B | 65.2 / 75.2 | 57.2 / 71.9 | 13.3 / 26.7 | 26.2 / 42.1 |
| Specialized | CodeV-R1 | 7B | 68.8 / 78.2 | 68.0 / 78.2 | 30.0 / 40.0 | 26.8 / 43.3 |
| Base | Qwen3.5-4B | 4B | 41.7 / 60.9 | 34.3 / 49.7 | 6.7 / 10.0 | 11.8 / 13.9 |
| Base | Qwen3.5-9B | 9B | 48.5 / 66.6 | 36.1 / 57.8 | 13.3 / 20.0 | 13.3 / 21.5 |
| ChipMATE (Ours) | ChipMATE-Verilog-4B | 4B | 67.4 / 71.8 | 68.0 / 74.6 | 26.7 / 33.3 | 24.7 / 39.2 |
| ChipMATE (Ours) | ChipMATE-Agents-4B | 4B | 75.0 / 76.3 | 74.6 / 77.3 | 33.3 / 43.3 | 32.1 / 41.3 |
| ChipMATE (Ours) | ChipMATE-Verilog-9B | 9B | 75.3 / 77.6 | 71.9 / 75.8 | 30.0 / 36.7 | 28.1 / 42.1 |
| ChipMATE (Ours) | ChipMATE-Agents-9B | 9B | 80.1 / 82.4 | 75.8 / 77.3 | 36.7 / 43.3 | 40.4 / 44.6 |

Table 1. ChipMATE-Agents-9B achieves 6.7%–13.6% higher pass@1 than the previous SOTA self-trained model (CodeV-R1), and outperforms all API-based LLMs that are 20–180× larger.

Python Reference-Model Generation

Even the 4B ChipMATE-Python ranks second or better on every benchmark, well above all foundation models. Reference-model generation requires predicting cycle-accurate hardware behavior, and targeted fine-tuning on this specific task pays off dramatically.

| Type | Model | Size | VerilogEval v2 (p@1 / p@5) | RTLLM v2 (p@1 / p@5) | ChipBench-SC (p@1 / p@5) | CVDP cid03 (p@1 / p@5) |
| --- | --- | --- | --- | --- | --- | --- |
| Foundation | DeepSeek Coder | 236B | 60.1 / 73.1 | 42.4 / 50.0 | 28.7 / 40.0 | 21.5 / 35.1 |
| Foundation | DeepSeek V4 | 1.6T | 59.7 / 71.8 | 44.4 / 54.3 | 30.7 / 40.0 | 24.7 / 36.5 |
| Foundation | DeepSeek R1 | 671B | 57.1 / 70.7 | 49.6 / 57.8 | 28.7 / 36.7 | 26.2 / 37.2 |
| Specialized | CodeV-R1 (distill) | 7B | 40.4 / 48.5 | 36.0 / 46.2 | 23.3 / 30.0 | 24.7 / 37.2 |
| Specialized | CodeV-R1 | 7B | 45.3 / 56.4 | 42.7 / 51.8 | 28.7 / 33.3 | 22.3 / 34.6 |
| Base | Qwen3.5-4B | 4B | 46.8 / 63.7 | 38.2 / 46.1 | 16.7 / 23.3 | 10.2 / 12.7 |
| Base | Qwen3.5-9B | 9B | 48.0 / 64.7 | 40.1 / 51.8 | 20.0 / 26.7 | 11.4 / 19.9 |
| ChipMATE (Ours) | ChipMATE-Python-4B | 4B | 77.6 / 80.1 | 75.3 / 80.1 | 41.3 / 46.7 | 34.2 / 45.3 |
| ChipMATE (Ours) | ChipMATE-Python-9B | 9B | 82.4 / 83.5 | 77.3 / 81.3 | 50.0 / 53.3 | 43.4 / 49.4 |

Table 2. ChipMATE-Python-9B achieves 5.4%–15.2% higher pass@1 than ChipMATE-Verilog-9B across all four benchmarks — suggesting that LLMs can readily leverage pre-trained Python priors once they receive targeted fine-tuning on hardware-behavior simulation.

Python vs. Verilog Agent

The Python reference-model agent consistently outperforms the Verilog agent. Our multi-agent workflow lifts Verilog generation toward the level of the stronger Python agent — showing that reference-model accuracy sets the upper bound for the entire pipeline.

Pass@k of three ChipMATE-4B variants across benchmarks. The Python agent achieves the highest accuracy and the Verilog agent the lowest; the multi-agent workflow consistently lands between them, lifting Verilog generation toward the stronger Python agent's level.

Ablation & Workflow Exploration

We trace the contribution of each technique on VerilogEval v2 pass@1, and explore the design space of the agentic workflow by sweeping the per-turn sampling budget and the maximum number of turns.

Ablation study on VerilogEval v2 pass@1, tracing each technique and training stage (BT = backtracking). SFT yields the largest gain (+19.2% on 4B, +22.0% on 9B); single-agent and multi-agent RL contribute a further 1.4–6.5%; naive multi-agent without BT drops 11.6–14.3%; enabling BT rebounds by 16–18.6%.

Workflow design exploration on ChipMATE-4B: accuracy as a function of per-turn sampling budget N and maximum number of turns T. Accuracy does not increase monotonically with either factor; N = 3 with T = 3 achieves the peak (75.6 pass@1) at minimal inference cost.

Key Findings

SFT Carries the Bulk

SFT yields the largest gain (+19.2% on 4B, +22.0% on 9B), equipping the LLMs with foundational Verilog knowledge. RL then refines.

Backtrack is the Multi-Agent Linchpin

Multi-agent without backtracking drops accuracy by 11.6–14.3% due to error compounding. Adding backtrack rebounds by 16–18.6%, surpassing single-agent by +4.4%.

Reference Model Sets the Ceiling

Across all four benchmarks the Python agent achieves 5.4–15.2% higher pass@1 than the Verilog agent. Targeted Python fine-tuning is more impactful than larger models.

3 Samples × 3 Turns is Sweet Spot

Accuracy plateaus at T = 3; larger candidate pools enlarge the cross-verifier's selection task and can lock it onto plausible-but-incorrect pairs. Best-of-3 with T = 3 is optimal, with diminishing returns thereafter.