ChipMATE

Multi-Agent Training via Reinforcement Learning for Enhanced RTL Generation

Existing API-based agentic systems for RTL code generation are fundamentally misaligned with industrial practice: they assume a golden testbench at generation time, depend on closed-source APIs incompatible with chip vendors' air-gapped security requirements, and cannot be trained on vendors' proprietary RTL codebases. We present ChipMATE, the first self-trained multi-agent framework for RTL generation. Inspired by industrial practice, where correctness emerges from cross-comparison between independently written RTL modules and reference models, ChipMATE pairs a Verilog agent with a Python reference-model agent; the two mutually verify each other's outputs without any golden oracle. With a backtrack-based inference workflow and a two-stage training pipeline, ChipMATE achieves 75.0% and 80.1% pass@1 on VerilogEval v2 with 4B and 9B base models, respectively, outperforming all existing self-trained models and even DeepSeek V4 (1.6T parameters).

Overview of ChipMATE. (a) A multi-agent cross-verification workflow with a backtrack mechanism. (b) The two-stage training pipeline, complemented by a hybrid reference-model data generation framework.

Introduction

LLMs excel at general-purpose programming but struggle with Register-Transfer Level (RTL) code generation, a cornerstone of modern chip design — largely due to the scarcity of high-quality RTL data in public corpora. Recent agentic workflows like MAGE and VerilogCoder orchestrate LLMs through task decomposition and iterative self-correction, achieving promising results on academic benchmarks.

Despite these advances, current agentic RTL pipelines are fundamentally misaligned with industrial chip-design practice in three coupled ways:

No golden testbench in production: At real chip vendors, testbenches are written by dedicated verification engineers after the RTL, not before. Even when one exists, it is often imperfect and cannot serve as a reliable oracle.
API-LLMs conflict with security posture: RTL is treated as first-class IP; core development servers are air-gapped to prevent any data leakage to third-party endpoints.
Internal codebases sit unused: Years of production-grade RTL, far superior to any public training data, remain entirely unused because API-based LLMs cannot be fine-tuned on proprietary code.

Recent self-trained models (QiMeng-CodeV-R1, RTLSeek) remove the API dependency, but they generate code in a single turn, with no mechanism to check or correct their own output. That is a serious limitation: even a senior design engineer rarely writes correct RTL on the first attempt. The chip industry solves this through a cross-verification workflow: design engineers write RTL while verification engineers independently implement a reference model in a high-level language; neither side is assumed correct, and the two iteratively compare outputs until the design is verified.

This industrial workflow directly inspires ChipMATE: a multi-agent system where one agent generates Verilog in the role of a design engineer and another generates a Python reference model in the role of a verification engineer. The two agents iteratively cross-verify each other to produce high-quality RTL code.

Three Challenges

(C1) Error propagation without a golden oracle: A mismatch between agents does not reveal which side is wrong. Naively asking one agent to "correct" the other compounds errors turn by turn, producing worse results than a single-model baseline.
(C2) Weak individual capability: Off-the-shelf Qwen3.5-4B/9B score below 50% pass@1 on both Verilog and reference-model generation. Each agent's individual capability must be strengthened before any meaningful collaboration can occur.
(C3) No training data for reference-model generation: Even DeepSeek-R1 achieves below 20% pass@1 when converting Verilog into Python reference models, so direct distillation is impractical.

Three Contributions

Cross-verification multi-agent workflow (C1): A backtrack mechanism automatically reverts to the last accepted turn whenever the current turn does not strictly improve the match rate, preventing error compounding.
Two-stage training pipeline (C2): Stage 1 trains each agent separately with SFT + RL to saturate individual capability; Stage 2 trains the team jointly with our X-GRPO algorithm to learn collaboration.
Hybrid data-generation framework (C3): Combines API distillation, IR-based AST conversion, and category-specific augmentation to produce 64.4K high-quality reference-model samples at a fraction of pure-API cost.

ChipMATE: Three Components

Cross-Verification Workflow

A Verilog agent and a Python reference-model agent mutually verify each other's outputs. A backtrack mechanism reverts to the last accepted turn whenever the current turn does not strictly improve the match rate.

Two-Stage Training Pipeline

Stage 1: each agent is trained separately with SFT + RL to saturate individual capability. Stage 2: both agents are trained jointly with our X-GRPO algorithm to learn collaboration.

Hybrid Data Generation

Combines API distillation, IR-level conversion, and category-specific augmentation to produce 64.4K verified Python reference-model samples for training.

Cross-Verification Workflow

Two agents independently generate N candidate implementations from the same specification. A cross-language comparison tool simulates both on 1000 random stimuli; if outputs disagree, each agent self-corrects over multiple turns — without ever seeing the other agent's code.
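The comparison step can be pictured with a small sketch. It is not the ChipMATE comparison tool: both "implementations" here are plain Python callables (the real tool simulates Verilog against Python), the design is purely combinational, and the names `random_stimuli`, `match_rate`, `rtl_adder`, and `ref_adder_buggy` are hypothetical.

```python
import random
from typing import Callable, Dict, List

Stimulus = Dict[str, int]
Model = Callable[[Stimulus], Dict[str, int]]

def random_stimuli(widths: Dict[str, int], n: int = 1000, seed: int = 0) -> List[Stimulus]:
    """Draw n random input vectors; widths maps each port name to its bit width."""
    rng = random.Random(seed)
    return [{name: rng.getrandbits(w) for name, w in widths.items()} for _ in range(n)]

def match_rate(model_a: Model, model_b: Model, stimuli: List[Stimulus]) -> float:
    """Fraction of stimuli on which both models produce identical outputs."""
    agree = sum(model_a(s) == model_b(s) for s in stimuli)
    return agree / len(stimuli)

# Hypothetical stand-ins for the two agents' outputs: an 8-bit adder with carry-out.
def rtl_adder(s: Stimulus) -> Dict[str, int]:
    total = s["a"] + s["b"]
    return {"sum": total & 0xFF, "cout": total >> 8}

def ref_adder_buggy(s: Stimulus) -> Dict[str, int]:
    # Bug for illustration: the carry-out is never set.
    return {"sum": (s["a"] + s["b"]) & 0xFF, "cout": 0}

stimuli = random_stimuli({"a": 8, "b": 8})
print("self-agreement:", match_rate(rtl_adder, rtl_adder, stimuli))
print("rtl vs buggy reference:", match_rate(rtl_adder, ref_adder_buggy, stimuli))
```

A match rate below 1.0 only says that the two sides disagree; it does not say which side is wrong, which is why the correction step needs the safeguards described next.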

(1) Backtrack Mechanism

A correction is accepted only if it strictly improves the match rate; otherwise the agent reverts to its last accepted version. This prevents the agent from drifting into worse states turn over turn, the failure mode that naive multi-agent setups without backtracking suffer from.
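A minimal sketch of this acceptance rule, simplified to one agent refining against a fixed scoring function. The function `refine_with_backtrack` and its `propose` and `score` callbacks are hypothetical; in ChipMATE the score would be the match rate against the other agent's code.

```python
from typing import Callable, List, Tuple, TypeVar

Code = TypeVar("Code")

def refine_with_backtrack(
    initial: Code,
    propose: Callable[[Code], Code],   # asks the agent for a corrected version
    score: Callable[[Code], float],    # match rate in [0, 1]
    max_turns: int = 3,
) -> Tuple[Code, List[float]]:
    """Keep a proposal only if it strictly improves the score; else backtrack."""
    accepted = initial
    best = score(accepted)
    history = [best]
    for _ in range(max_turns):
        if best >= 1.0:                # full agreement, nothing left to fix
            break
        candidate = propose(accepted)  # always start from the last accepted version
        candidate_score = score(candidate)
        if candidate_score > best:     # strict improvement required
            accepted, best = candidate, candidate_score
        history.append(best)           # unchanged when the proposal was rejected
    return accepted, history

# Toy run: the first correction is worse and gets discarded, the second is kept.
scores = {"v0": 0.6, "v1": 0.4, "v2": 0.9}
proposals = iter(["v1", "v2"])
final, trace = refine_with_backtrack("v0", lambda _: next(proposals), scores.__getitem__, max_turns=2)
print(final, trace)
```

Because rejected proposals never become the starting point of the next turn, the accepted score is non-decreasing across turns.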

(2) Waveform → Natural Language

Raw waveform mismatches are hard for LLMs to interpret. We build a converter that locates the first divergent cycle and packages the surrounding I/O context into a structured description the agent can act on directly, bridging the low-level cycle-accurate world and the natural-language reasoning world.
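As an illustration of the idea, the sketch below scans two per-cycle output traces, finds the first cycle where they differ, and renders that cycle plus a few preceding ones as text. The trace format (one dict of signal values per cycle), the wording of the report, and the function name `describe_first_mismatch` are assumptions, not the converter used in ChipMATE.

```python
from typing import Dict, List, Optional

Trace = List[Dict[str, int]]  # one {signal: value} dict per clock cycle

def describe_first_mismatch(
    inputs: Trace, rtl_out: Trace, ref_out: Trace, context: int = 2
) -> Optional[str]:
    """Return a text report for the first divergent cycle, or None if traces agree."""
    for cycle, (rtl, ref) in enumerate(zip(rtl_out, ref_out)):
        differing = sorted(sig for sig in rtl if rtl[sig] != ref.get(sig))
        if differing:
            start = max(0, cycle - context)
            lines = [f"First mismatch at cycle {cycle} on signal(s): {', '.join(differing)}."]
            for c in range(start, cycle + 1):
                lines.append(
                    f"  cycle {c}: inputs={inputs[c]} rtl={rtl_out[c]} ref={ref_out[c]}"
                )
            return "\n".join(lines)
    return None

# Toy counter: the reference model ignores the reset asserted in cycle 3.
inputs = [{"rst": 1, "en": 0}, {"rst": 0, "en": 1}, {"rst": 0, "en": 1}, {"rst": 1, "en": 0}]
rtl_out = [{"q": 0}, {"q": 1}, {"q": 2}, {"q": 0}]
ref_out = [{"q": 0}, {"q": 1}, {"q": 2}, {"q": 3}]
report = describe_first_mismatch(inputs, rtl_out, ref_out)
print(report)
```

Reporting only the first divergence keeps the prompt short and points the agent at the earliest symptom, since later mismatches in a sequential design are often consequences of the first one.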

Prompt Design

We mirror the combinational/sequential separation found in standard RTL textbooks — LLMs already partially internalize this from pretraining. Each prompt has four sections: code skeleton (with module name and parameterized port list), combinational-logic guidelines (latch avoidance, full-case coverage), sequential-logic guidelines (reset handling, non-blocking conventions), and a few-shot example. This structured prompt alone lifts first-attempt pass rate by 1–5% on frontier API models.
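One way to assemble such a four-section prompt is sketched below. The section titles follow the description above, but the guideline wording, the `##` heading marker, and the `build_prompt` helper are illustrative assumptions rather than the prompts used in the paper.

```python
from typing import List

# Hypothetical guideline text; the actual prompt wording is not given in the source.
COMB_GUIDELINES = (
    "- Assign every output on every path to avoid inferred latches.\n"
    "- Cover all cases in case statements (add a default branch)."
)
SEQ_GUIDELINES = (
    "- Handle reset explicitly in every clocked block.\n"
    "- Use non-blocking assignments (<=) in clocked blocks."
)

def build_prompt(spec: str, module_name: str, ports: List[str], few_shot: str) -> str:
    """Assemble specification + the four prompt sections into one string."""
    port_list = ",\n".join(f"    {p}" for p in ports)
    skeleton = (
        f"module {module_name} (\n{port_list}\n);\n"
        "    // implementation goes here\n"
        "endmodule"
    )
    sections = [
        ("Code skeleton", skeleton),
        ("Combinational-logic guidelines", COMB_GUIDELINES),
        ("Sequential-logic guidelines", SEQ_GUIDELINES),
        ("Few-shot example", few_shot),
    ]
    body = "\n\n".join(f"## {title}\n{text}" for title, text in sections)
    return f"Specification:\n{spec}\n\n{body}"

prompt = build_prompt(
    spec="An 8-bit register with synchronous reset.",
    module_name="top_module",
    ports=["input clk", "input rst", "input [7:0] d", "output reg [7:0] q"],
    few_shot="(a short worked example would go here)",
)
print(prompt)
```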

Two-Stage Training Pipeline

(a) Standard GRPO collapses to group size 1 across multi-turn rollouts when dense per-turn rewards are used. (b) X-GRPO restores meaningful group variance: both agents independently sample K candidates per turn, paired into K×K candidate pairs. The best-scoring pair becomes the shared prefix for the next turn.

After Stage 1, both agents already excel at their respective code-generation tasks. Stage 2 teaches them to collaborate. When a mismatch arises, each agent must analyze the diagnostic information, determine whether the fault lies in its own code or the other agent's, and apply self-correction only when it identifies a genuine error in its own implementation.

X-GRPO Trajectory Sampling

Standard GRPO is ill-suited to multi-turn multi-agent settings: assigning a dense reward at the end of each turn collapses the effective group size to 1 for every turn beyond the first. We propose X-GRPO, drawing from Tree-of-Thought and AT-GRPO, which restores meaningful within-group variance across turns by sampling K×K candidate pairs and selecting the best as the shared prefix.
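The sampling step for a single turn can be sketched as follows. Only the K×K pairing and the best-pair selection come from the description above; scoring pairs with one joint reward and normalizing advantages over the K×K group (as in standard GRPO) are assumptions, and `xgrpo_turn` and `toy_reward` are hypothetical names.

```python
from itertools import product
from statistics import mean, pstdev
from typing import Callable, Dict, List, Tuple

Pair = Tuple[int, int]  # (index of Verilog candidate, index of Python candidate)

def xgrpo_turn(
    verilog_cands: List[object],
    python_cands: List[object],
    pair_reward: Callable[[object, object], float],
) -> Tuple[Pair, Dict[Pair, float]]:
    """Score all K x K pairs, return the best pair and group-normalized advantages."""
    pairs = list(product(range(len(verilog_cands)), range(len(python_cands))))
    rewards = {(i, j): pair_reward(verilog_cands[i], python_cands[j]) for i, j in pairs}
    values = list(rewards.values())
    mu, sigma = mean(values), pstdev(values)
    # Assumed GRPO-style normalization: (reward - group mean) / group std.
    advantages = {p: (r - mu) / (sigma + 1e-8) for p, r in rewards.items()}
    best = max(pairs, key=rewards.__getitem__)  # shared prefix for the next turn
    return best, advantages

# Toy candidates: integers stand in for code; the reward peaks when the two agree.
def toy_reward(v: int, p: int) -> float:
    return 1.0 / (1.0 + abs(v - p))

best_pair, adv = xgrpo_turn([3, 5, 8], [4, 8, 9], toy_reward)
print(best_pair, len(adv))
```

With K = 3 the group contains nine pairs per turn, so the advantages keep a non-trivial spread even after the first turn, which is the property the text attributes to X-GRPO.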

Hierarchical Reward Design

The per-agent reward has three components:

Local reward (R_local): A tiered score of 0 for a compile failure, 0.1 for a runtime error, 0.2 for an I/O port mismatch, and 0.2 + 0.8c for code that runs with partial pass rate c. Encourages syntactically correct code with a high pass rate.
Correct-fix bonus (R_fix): Sparse binary reward for successfully resolving a previously mismatched stimulus. Withheld if both agents agree on a wrong answer.
Team-match reward (R_match): Dense reward proportional to the overall match ratio between the two agents, giving a smooth gradient toward mutual agreement.

Aggregate: R = δ_local·R_local + δ_fix·R_fix + δ_match·R_match, with δ_local=10, δ_fix=0.2, δ_match=0.5.
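The reward can be written out as a small sketch. The tier values and the default weights are copied from the text above; the status strings and the function names `local_reward`, `fix_bonus`, and `total_reward` are hypothetical.

```python
def local_reward(status: str, pass_rate: float = 0.0) -> float:
    """Tiered local reward: 0 / 0.1 / 0.2 for failures, 0.2 + 0.8c once the code runs."""
    tiers = {"compile_failure": 0.0, "runtime_error": 0.1, "port_mismatch": 0.2}
    if status in tiers:
        return tiers[status]
    if status == "simulated":
        return 0.2 + 0.8 * pass_rate
    raise ValueError(f"unknown status: {status}")

def fix_bonus(resolved_mismatch: bool, agreed_value_is_correct: bool) -> float:
    """Sparse bonus, withheld when the agents converge on a wrong answer."""
    return 1.0 if resolved_mismatch and agreed_value_is_correct else 0.0

def total_reward(
    r_local: float,
    r_fix: float,
    r_match: float,
    w_local: float = 10.0,  # delta_local as stated in the text
    w_fix: float = 0.2,     # delta_fix as stated in the text
    w_match: float = 0.5,   # delta_match as stated in the text
) -> float:
    """Weighted sum R = w_local*R_local + w_fix*R_fix + w_match*R_match."""
    return w_local * r_local + w_fix * r_fix + w_match * r_match

# Example: code passes half the tests, fixes one mismatch correctly, 90% team match.
r = total_reward(local_reward("simulated", 0.5), fix_bonus(True, True), 0.9)
print(round(r, 2))
```

Checking `agreed_value_is_correct` requires ground truth, so this bonus would only be computable during training, not at inference time.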

Hybrid Reference-Model Data Generation

No public Python reference-model dataset exists. We build a three-pipeline hybrid framework yielding 64.4K high-quality samples — at a fraction of pure-API cost.

Hybrid data generation framework — combining LLM-API distillation, IR-based conversion, and category-specific augmentation to produce the reference-model training corpus.

API Distillation (~25K)

Frontier LLM (DeepSeek-R1) generates Python references with CoT, verified by our cross-language comparator. Cost: $2,000+ and 200+ hours of compute, yielding only ~25K verified samples.

IR-Based Conversion (~36K)

Deterministic AST-level pipeline using PyVerilog: parse → behavioral lowering → top-module wrapping. Cost-free, completes in <2 hours, yields ~36K additional verified samples.

Targeted Augmentation (~3.4K)

After identifying weaknesses in FSMs, multi-cycle protocol blocks, and bit-level arithmetic, we collect targeted Verilog examples and convert them via the IR pipeline. This raises these categories' share from ~15% to 28%.

Experimental Results

We evaluate ChipMATE against large foundation LLMs (GPT-4o, Claude Opus 4.7, DeepSeek V4), specialized models (CodeV-R1), and base Qwen3.5 on four benchmarks: VerilogEval v2, RTLLM v2, ChipBench-SC, and CVDP cid03.

75.0% pass@1 on VerilogEval v2 with ChipMATE-Agents-4B
80.1% pass@1 on VerilogEval v2 with ChipMATE-Agents-9B
1.6T-parameter DeepSeek V4 outperformed by the 9B ChipMATE
64.4K reference-model samples generated by our hybrid pipeline

End-to-End Verilog Generation

pass@k (%) on four benchmarks against foundation LLMs (20×–180× larger than our models), specialized RTL models, and base models.

Type	Model	Size	VerilogEval v2 p@1	VerilogEval v2 p@5	RTLLM v2 p@1	RTLLM v2 p@5	ChipBench-SC p@1	ChipBench-SC p@5	CVDP cid03 p@1	CVDP cid03 p@5
Foundation	GPT-4o	n/a	64.1	73.7	56.5	70.3	20.0	33.3	39.0	40.4
Foundation	Claude Opus 4.7	n/a	86.9	90.4	64.8	68.0	31.3	46.7	42.8	47.9
Foundation	DeepSeek Coder	236B	68.5	80.8	57.6	70.0	16.7	30.0	22.3	37.2
Foundation	DeepSeek V4	1.6T	67.3	80.1	58.8	66.0	18.0	36.7	21.5	34.6
Foundation	DeepSeek R1	671B	77.5	84.7	64.7	75.8	26.7	40.0	27.7	42.1
Specialized	CodeV-R1 (distill)	7B	65.2	75.2	57.2	71.9	13.3	26.7	26.2	42.1
Specialized	CodeV-R1	7B	68.8	78.2	68.0	78.2	30.0	40.0	26.8	43.3
Base	Qwen3.5-4B	4B	41.7	60.9	34.3	49.7	6.7	10.0	11.8	13.9
Base	Qwen3.5-9B	9B	48.5	66.6	36.1	57.8	13.3	20.0	13.3	21.5
ChipMATE (Ours)	ChipMATE-Verilog-4B	4B	67.4	71.8	68.0	74.6	26.7	33.3	24.7	39.2
ChipMATE (Ours)	ChipMATE-Agents-4B	4B	75.0	76.3	74.6	77.3	33.3	43.3	32.1	41.3
ChipMATE (Ours)	ChipMATE-Verilog-9B	9B	75.3	77.6	71.9	75.8	30.0	36.7	28.1	42.1
ChipMATE (Ours)	ChipMATE-Agents-9B	9B	80.1	82.4	75.8	77.3	36.7	43.3	40.4	44.6

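Table 1. ChipMATE-Agents-9B achieves 6.7 to 13.6 points higher pass@1 than the previous SOTA self-trained model (CodeV-R1). It also outperforms every listed foundation model on pass@1, including DeepSeek models that are 20–180× larger, with one exception: Claude Opus 4.7 remains ahead on VerilogEval v2 (86.9 vs. 80.1) and CVDP cid03 (42.8 vs. 40.4).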
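The gap over CodeV-R1 quoted in the caption can be recomputed from the pass@1 columns of Table 1. The snippet below is only a worked check; the variable names are arbitrary and the differences are in percentage points.

```python
benchmarks = ["VerilogEval v2", "RTLLM v2", "ChipBench-SC", "CVDP cid03"]
agents_9b = [80.1, 75.8, 36.7, 40.4]  # ChipMATE-Agents-9B pass@1, Table 1
codev_r1 = [68.8, 68.0, 30.0, 26.8]   # CodeV-R1 pass@1, Table 1

# Per-benchmark difference, rounded to one decimal like the table entries.
gaps = {b: round(a - c, 1) for b, a, c in zip(benchmarks, agents_9b, codev_r1)}
for name, gap in gaps.items():
    print(f"{name}: +{gap}")
print("range:", min(gaps.values()), "to", max(gaps.values()))  # range: 6.7 to 13.6
```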

Python Reference-Model Generation

Even the 4B ChipMATE-Python ranks second on pass@1 on every benchmark, behind only ChipMATE-Python-9B and well above all foundation models. Reference-model generation requires predicting cycle-accurate hardware behavior; targeted fine-tuning on this specific task pays off dramatically.

Type	Model	Size	VerilogEval v2 p@1	VerilogEval v2 p@5	RTLLM v2 p@1	RTLLM v2 p@5	ChipBench-SC p@1	ChipBench-SC p@5	CVDP cid03 p@1	CVDP cid03 p@5
Foundation	DeepSeek Coder	236B	60.1	73.1	42.4	50.0	28.7	40.0	21.5	35.1
Foundation	DeepSeek V4	1.6T	59.7	71.8	44.4	54.3	30.7	40.0	24.7	36.5
Foundation	DeepSeek R1	671B	57.1	70.7	49.6	57.8	28.7	36.7	26.2	37.2
Specialized	CodeV-R1 (distill)	7B	40.4	48.5	36.0	46.2	23.3	30.0	24.7	37.2
Specialized	CodeV-R1	7B	45.3	56.4	42.7	51.8	28.7	33.3	22.3	34.6
Base	Qwen3.5-4B	4B	46.8	63.7	38.2	46.1	16.7	23.3	10.2	12.7
Base	Qwen3.5-9B	9B	48.0	64.7	40.1	51.8	20.0	26.7	11.4	19.9
ChipMATE (Ours)	ChipMATE-Python-4B	4B	77.6	80.1	75.3	80.1	41.3	46.7	34.2	45.3
ChipMATE (Ours)	ChipMATE-Python-9B	9B	82.4	83.5	77.3	81.3	50.0	53.3	43.4	49.4

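Table 2. ChipMATE-Python-9B achieves higher pass@1 than ChipMATE-Verilog-9B on all four benchmarks, with gaps of 7.1, 5.4, 20.0, and 15.3 points on VerilogEval v2, RTLLM v2, ChipBench-SC, and CVDP cid03 respectively (computed from Tables 1 and 2). This suggests that LLMs can readily leverage pre-trained Python priors once they receive targeted fine-tuning on hardware-behavior simulation.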

Python vs. Verilog Agent

The Python reference-model agent consistently outperforms the Verilog agent. Our multi-agent workflow lifts Verilog generation toward the level of the stronger Python agent — showing that reference-model accuracy sets the upper bound for the entire pipeline.

Pass@k of three ChipMATE-4B variants across benchmarks: the Python agent achieves the highest accuracy, the Verilog agent the lowest, and the multi-agent workflow lands between them.

Pass@k of three ChipMATE-4B variants. The multi-agent workflow consistently lands between the two single agents, lifting Verilog generation toward the stronger Python agent's level.

Ablation & Workflow Exploration

We trace the contribution of each technique on VerilogEval v2 pass@1, and explore the design space of the agentic workflow by sweeping the per-turn sampling budget and the maximum number of turns.

Ablation study on VerilogEval v2 pass@1: SFT yields the largest gain (+19.2% on 4B, +22.0% on 9B); single-agent RL and multi-agent RL contribute a further 1.4–6.5%; multi-agent without backtracking causes an unexpected drop of 11.6–14.3%; introducing backtracking rebounds by 16–18.6%.

Ablation study on VerilogEval v2 pass@1, tracing each technique and training stage. BT = Backtracking. Note the 11.6–14.3% drop from naive multi-agent (no BT) and the 16–18.6% rebound once BT is enabled.

Heatmap of accuracy as a function of per-turn sampling budget N and maximum number of turns T, with peak at N=3 / T=3.

Workflow design exploration on ChipMATE-4B. Accuracy does not increase monotonically with either factor; Best-of-3 with T = 3 achieves the peak (75.6 pass@1) at minimal inference cost.

Key Findings

SFT Carries the Bulk

SFT yields the largest gain (+19.2% on 4B, +22.0% on 9B), equipping the LLMs with foundational Verilog knowledge. RL then refines.

Backtrack is the Multi-Agent Linchpin

Without backtracking, the multi-agent workflow loses 11.6–14.3% accuracy to error compounding. Adding backtracking recovers 16–18.6%, surpassing the single-agent result by 4.4%.

Reference Model Sets the Ceiling

Across all four benchmarks, the 9B Python agent achieves 5.4 to 20.0 points higher pass@1 than the 9B Verilog agent (Tables 1 and 2). Targeted Python fine-tuning matters more than model size.

3 Samples × 3 Turns is Sweet Spot

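Accuracy peaks at T = 3 and does not improve with more turns. Larger candidate pools enlarge the cross-verifier's selection task, making it more likely to lock onto plausible-but-incorrect pairs. Best-of-3 with T = 3 is the best setting found; larger budgets add inference cost without improving accuracy.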
