



Di Huang

Institute of Computing Technology Chinese Academy of Sciences

## The Problem of Hardware and Software Design Automation

#### The automatic design of hardware and software has long been a pursuit of researchers



#### CHESS MACHINE of the 18th cen tury was actually run by man inside,

A Chess-Playing Machine

Electronic computers can be set up to play a fairly strong game, raising the question of whether they can "think"

by Claude E. Shannon

POR CENTURIES philosophers and it was undertaken with a serious purpose in mind. The investigation of the chesswhether or not the human brain is playing problem is intended to develop essentially a machine. Could a machine be designed that would be capable of "thinking"? During the past decade several large-scale electronic computing machines have been constructed which are capable of something very close to the reasoning process. These new computers were designed primarily to carry

playing problem is intended to develop techniques that can be used for more practical applications.

The chess machine is an ideal one to

start with for several reasons. The prob lem is sharply defined, both in the allowed operations (the moves of chess) and in the ultimate goal (checkmate) It is neither so simple as to be trivial no out purely numerical calculations. They perform automatically a long sequence such a machine could be pitted against ent giving a clear meas

ady a considerable literaubject of chess-playing ma-ng the late 18th and early s a Hungarian inventor ang von Kempelen aspe with a device known as Chess Automaton, which nent to large audiences. f papers purporting to ex-ation, including an analyti-Edgar Allan Poe, soon apof the analysts concluded human chess master con-Some years later the exact eration was exposed (see

ing satisfactory moves in such an end game, the problem is relatively simple, but the idea was quite advanced for that

AN electronic computer can be set up to play a complete game. In order to explain the actual setup of a chess machine, it may be best to start with a general picture of a computer and its

A general-purpose electronic com puter is an extremely complicated de-vice containing several thousand vacuum tubes, relays and other elements. The quite simple. The machine has four main parts: 1) an "arithmetic organ," 2) a control element, 3) a numerical memory and 4) a program memory. (In some designs the two memory functions are carried out in the same physical apparatus.) The manner of operation is exactly analogous to a human computer carrying out a series of numerical calcu lations with an ordinary desk computing machine. The arithmetic organ corres-ponds to the desk computing machine, the control element to the human opera tor, the numerical memory to the work sheet on which intermediate and final results are recorded, and the program nest attempt to design a memory to the computing routine de

In an electronic computing machine, the numerical memory consists of a large number of "boxes," each capable of hold-ing a number. To set up a problem on the computer, it is necessary to assign box numbers to all numerical quantities

... playing a fair game of chess ... ... translating from one language to another ... ... designing electrical filters and relay circuits ...

A Chess Playing Machine. 1950.

Application of recursive arithmetic to the problem of circuit systems. Summaries of the Summer Institute of Symbolic Logic. 1957. Application of Theorem Proving to Problem Solving. 1969.

 $\Pi(f)$  and a suitable code for the number n. (Both f and n are variables here.) Then the machine is to go to work and is to produce a code for a number if and only if f(n) is defined; and if f(n) is defined the coded number is to be f(n). (2) Similar to (1) but now the machine is to reproduce on the tape not only f(n) but (in coded form) all steps that the machine with the program  $\Pi(f)$  would go through in computing f(n) from n.

The author asserts that problem (1) can be solved because, using the standard notation and terminology for recursive functions introduced by Kleene, / has a Gödel number z and if f(n) is defined  $f(n) = U(\mu_y T_1(z, n, y))$ . So if one constructs the program for B-computing  $U(\mu_y T_1(z, n, y))$  one has solved (1). Problem (2) is said to be more difficult and the second paper is mostly devoted to constructing in detail a program solving (2). It seems to the reviewer, however, that since the computation that machine B goes through when programmed by  $\Pi(f)$  and fed n on the tape is also a partial recursive function of  $\Pi(f)$  and n, (2) could be settled by a short argument like that used in (1)

ALONZO CHURCH. Application of recursive arithmetic to the problem of circuit synthesis Summaries of talks presented at the Summer Institute for Symbolic Logic Cornell University, 1957, 2nd edn., Communications Research Division, Institute for Defense Analyses, Princeton, N. J., 1960, pp. 3-50, 3a-45a.

In this early paper on the application of formal logic to circuits and automata Church develops and extends a theory of restricted recursive arithmetic first presented in his review (XX 286) of an article by E. C. Berkeley. The author here presents several alternatives for the recursion schemata of that system: Let  $s = (s_1, \ldots, s_k)$ ,  $r = (r_1, \ldots, r_n)$ . The schemata for restricted recursion (A) are

$$r_i(0) = P_i[s(0)]$$
  $i = 1, ..., n$   
 $r_i(t+1) = Q_i[s(t), s(t+1), r(t)]$ 

and for wider restricted recursion (B) are

 $r_i(0) = P_{i0}[s(0)], \ldots, r_i(h) = P_{ih}[s(0), \ldots, s(h)]^*$  $+b), r(t), \ldots, r(t+h)];$ 



C), the schemata are the same as for (B) except  $= 0, \ldots, h.$ 

automata can be treated by means of restricted hich is reducible to (A). etic are studied. The synthesis problem: given a

which is an extension of recursive arithmetic, to es for a circuit which satisfies the requirement. th requirement and recursion equivalences, to s the requirement.

st for the case of one free variable (Case 1):

$$(a_{2j}), o_l(0), \ldots, o_l(b_l), o_l(t), \ldots, o_l(t+b_{2l})]$$
  
 $j = 1, \ldots, \mu, \quad l = 1, \ldots, \gamma$ 

n it a test to determine if a solution is possible thesis problem is included by Wang in XXV 373. requirement with two or more free variables is

rith quantifiers is treated juantifier. Several subcases sined. Case 4, in which the ure given about a possible

Given input string alpha and output strings beta, is it possible to automatically construct a circuit that satisfies the alpha and beta constraints?

Automatic Programming

#### Introduction

The automatic writing, checking, and debugging of computer programs are problems of great interest both for their independent importance and as useful tools for intelligent machines. This section shows how a theorem prover can be used to solve certain automatic programming problems. The formalization given here will be used to precisely state and solve the problem of automatic generation of programs, including recursive programs, along with concurrent generation of proofs of the correct ness of these programs. Thus any programs automatically written by this method have no errors.

We shall take LISP " as our example of a programming language. In the LISP language, a function is described by two entities: (1) its value, and (2) its side effect. Side effects can



Methods for describing erations, as well as methriting of programs in a guage, were presented in plicity, in this section LISP, in which a LISP the standard notion of a value but no side effect.

1969

The automatic writing, checking, and debugging of computer programs are problems of great interest both for their independent importance and as useful tools for intelligent machines.

#### **Problem Formulation**

#### **Problem Statement**

**Description I/O Verification** 







Boolean Circuits / Programs

# Define the set X of functions and cout design constraints described in natural and the sign language, is and the result is simply the low XLEN bits of the result. ADDI rd. rs1, 0 is used to implement the MV rd., rs1 assembler pseudoinstruction. Condition R (x, y) Function Description Truth Tables



**Basic Boolean Circuits** 

The set of executable operations Y

AND

ЭR

NOT

## **Example: Existing Processor Design Flow**



Verify

## **Challenge 1: Large Solution Space**



~1080 particles



"half universe"



## **Challenge 1: Large Solution Space**



## **Challenge 2: High Accuracy Requirement**





VS





CAT, DOG, DUCK

Image classification Object detection ~90% ~80%





Voice recognition ~90%

OA agent

QA agent ~90%



## **Our Attempt: Top-down Prior**



## **Our Attempt: Bottom-up Feedback**



"Then, get the target through feedback and monotonic iteration."

| Library   | Automated High-Performance<br>Library Generation |
|-----------|--------------------------------------------------|
| Compiler  | Automated Tensor Program Transcompiler           |
| Compiler  | Automated Compiler<br>Tool-Chain Design          |
| OS        | Automated OS Configuration Optimization          |
| D         | Automated HDL Generation                         |
| Processor | Automated Front-End Design                       |



| Library  | Automated High-Performance<br>Library Generation |
|----------|--------------------------------------------------|
| Compiler | Automated Tensor Program Transcompiler           |
| Compiler | Automated Compiler<br>Tool-Chain Design          |
| OS       | Automated OS Configuration Optimization          |
|          |                                                  |







**Processor** 

Automated Front-End Design

Automated HDL Generation













I want to improve the speed of creating processes, how do I modify the linux kernel's configuration options? Please provide specific options directly and give the recommend Modifying the kernel's configuration options to improve the speed of creating Linux image processes can be a complex task, and it largely depends on the specific use case and requirements ....Keep in mind that changing kernel configurations should be done with caution and with a good understanding of the potential impacts on your system. 1 CONFIG BASE SMALL: Recommended Setting: y description: ... **2 CONFIG PREEMPT:** Recommended Setting: y (for desktop systems), n (for servers) Description:... 7 CONFIG HZ: **Recommended Setting: 1000** Description:...

CONFIG\_CC\_IS\_GCC=y
CONFIG\_GCC\_VERSION=100301
CONFIG\_CLANG\_VERSION=0
CONFIG\_AS\_IS\_GNU=y
CONFIG\_AS\_VERSION=23500
CONFIG\_LD\_IS\_BFD=y
CONFIG\_LD\_VERSION=23500
CONFIG\_LLD\_VERSION=0
CONFIG\_CC\_CAN\_LINK=y

# CONFIG\_TEST\_MEMINIT is not set # CONFIG\_TEST\_FREE\_PAGES is not set CONFIG\_ARCH\_USE\_MEMTEST=y





QiMena-TensorOp [IJCAl 25], QiMena-Attention [ACL 25] QiMena-GEMM [AAAl 25], QiMena-Xpiler [OSDI 25] BabelTower [ICML 22], AutoOS [ICML 24] CodeV, QiMena-CPU [IJCAl 24, 25]



cudd
ctern \*N \_global\_\_void matmul(half \*A, half \*B, float \*D)
{
 int ix \ (blockIdx x \* blockDim.x + threadIdx.x)/32;
 int iy = (blockIdx.) \* blockDim.y + threadIdx.y);

 wmma::fragment a\_frag;
 wmma::fragment ab\_frag;

 wmma::fragment ab\_frag;

 wmma::fill\_frowent(ab\_frag, 0.0f);

 int a\_row = ix \* 16;
 int b\_row = iy \* 11;
 for (int k=0; k<512; k+=16) {
 int a\_col = k;
 int b\_col = k;
 if (a\_row < 512 && a\_col < 512 && b\_row < 512 && b\_col < 512) {
 // Load the input:
 wmma::load\_matrix\_suc(a\_frag, A + a\_col + a\_row \* 51, 512);
 wmma::load\_matrix\_suc(b\_frag, B + b\_col + b\_col \* 512, 512);
 vmma::load\_matrix\_suc(b\_frag, B + b\_col + b\_col \* 512, 512);
 vmma::load\_matrix\_suc(b\_frag, B + b\_col + b\_col \* 512, 512);
 vmma::load\_matrix\_suc(b\_frag, B + b\_col + b\_col \* 512, 512);
 vmma::load\_matrix\_suc(b\_frag, B + b\_col + b\_col \* 512, 512);
 vmma::load\_matrix\_suc(b\_frag, B + b\_col + b\_col \* 512, 512);
 vmma::load\_matrix\_suc(b\_frag, B + b\_col + b\_col \* 512, 512);
 vmma::load\_matrix\_suc(b\_frag, B + b\_col + b\_col \* 512, 512);
 vmma::load\_matrix\_suc(b\_frag, B + b\_col + b\_col \* 512, 512);
 vmma::load\_matrix\_suc(b\_frag, B + b\_col + b\_col \* 512, 512);
 vmma::load\_matrix\_suc(b\_frag, B + b\_col + b\_col \* 512, 512);
 vmma::load\_matrix\_suc(b\_frag, B + b\_col + b\_col \* 512, 512);
 vmma::load\_matrix\_suc(b\_frag, B + b\_col + b\_col \* 512, 512);
 vmma::load\_matrix\_suc(b\_frag, B + b\_col + b\_col \* 512, 512);
 vmma::load\_matrix\_suc(b\_frag, B + b\_col + b\_col \* 512, 512);
 vmma::load\_matrix\_suc(b\_frag, B + b\_col + b\_col \* 512, 512);
 vmma::load\_matrix\_suc(b\_frag, B + b\_col + b\_col \* 512, 512);
 vmma::load\_matrix\_suc(b\_frag, B + b\_col + b\_col \* 512, 512);
 vmma::load\_matrix\_suc(b\_frag, B + b\_col + b\_col \* 512, 512);
 vmma::load\_matrix\_suc(b\_frag, B + b\_col + b\_col \* 512, 512);
 vmma::load\_matrix\_suc(b\_frag, B + b\_col + b\_col \* 512, 512);
 vmma::load\_matr





extern "C" void matmul

#pragma omp paralle

for (int i = 0; i <
 for (int j = 0;</pre>

for (int p

\_\_m512i

\_\_m512i

acc = \_

\_mm512\_stor



































## **Automatically Generate the CPU from a Finite Set of IOs**

## **Test case for verification**



Get I/O spec



Problem: How to generate a CPU given a finite set of IOs (truth tables).

## **Binary Decision Diagram**





#### **Binary Decision Diagram (BDD):**

- A Boolean function can be represented as a rooted, directed, acyclic graph, which consists of several (decision) nodes and two terminal nodes.
- The two terminal nodes are labeled 0 (FALSE) and 1 (TRUE).
- Each (decision) node u is labeled by a Boolean variable and has two child nodes called low child and high child. The edge from node u to a low (or high) child represents an assignment of the value FALSE (or TRUE, respectively) to variable xi.

## **Binary <del>Decision</del>** Speculation Diagram

Problem: How to generate large-scale and accurate CPU given a limited set of IOs.





Difficulty: Fitting an almost **infinitely large** truth table.

32-bit RISC-V CPU

(~1800-bit input/output)

 $10^{10^{540}}$ 

## **Reduce the Search Space with Bool Distance**



Boolean Distance: Monte Carlo method to approximately calculate

$$Dist(f,g) = C_{\Omega}(f) + C_{\Omega}(g) - C_{\Omega}(\tau)$$

Merging the nodes with 0 Boolean Distance to narrow down the space

 $10^{10^{540}}$ 



 $10^6$ 

## **Speculate, Compute, and Verify**

#### **Speculate Circuit Diagram Generation**

**BSD** (Binary Speculation Diagram)



MCTS-based leaf nodes speculation



**Compute and Verify** Accurately pinpoint errors: if one occurs, identify the leaf node and continue sampling and expanding it.







**Iteration:** Continuously correct errors, and a rigorous theoretical proof demonstrates that by iteratively expanding the guessed leaf nodes, more accurate generation results can be achieved.

Theorem 1 (The accuracy of BSD boosts after expansion). After expanding the generated BSD

**Theorem 1** (The accuracy of BSD boosts after expansion). After expanding the generated BSD  $\mathcal{F}_k$  by any input bit  $x_i$  to  $\mathcal{F}_{k+1}$ , the accuracy of expansion ended with  $\mathcal{F}_k$  will be no larger than the accuracy of expansion ended with  $\mathcal{F}_{k+1}$ , that is,

$$Acc(\mathcal{F}_k) \le Acc(\mathcal{F}_{k+1}).$$
 (3)

## 2021: QiMeng-CPU-v1

| Object      | Output        | Gates               | Team                                | Method                           |
|-------------|---------------|---------------------|-------------------------------------|----------------------------------|
| Adder       | Circuit logic | ~200 Gates          | Nvidia<br>[Roy et al. 2021]         | Deep RL                          |
| Simple ALU  | Circuit logic | ~1000 Gates         | Univ. TW<br>[Chen et al. 2020]      | Decision Tree                    |
| Simple ALU  | Circuit logic | ~2500 Gates         | Univ. Tokyo<br>[Rai et al. 2021]    | Assemble<br>Learning             |
| 8- bit CPU  | Circuit logic | ~1000 Gates         | Univ. NY<br>[Blocklove et al. 2023] | Deep Learning                    |
| 32- bit CPU | 2021          | ~4,000,000<br>Gates | ICT, CAS                            | Binary<br>Speculation<br>Diagram |

The design scale is increased by thousands of times

## 2021: QiMeng-CPU-v1

#### Frontend design of a 32-bit RISC-V CPU (4 million gates) in 5 Hours



- 2021.12: Tapout of QiMeng-CPU-v1
- 2022.05: testing, running Linux OS and SPEC
   CPU2000 successfully, on par with Intel 486





## 2024: QiMeng-CPU-v2

**QiMeng-CPU-v2**, an automatically designed **superscalar** CPU core, achieving performance comparable to the **ARM Cortex-A53**. **Automated instruction-level parallelism**: By introducing internal processor state information, it predicts data dependencies between instructions to achieve parallel execution across instructions.



Decouple inter-instruction data dependencies



Predict dependent data



**382**× compared to QiMeng-CPU-v1













## The Challenges of Constructing Prior with LLMs

#### **Data Scarcity**



#### **Large Semantic Gap**



#### **Observations and Methods**



- There is an **asymmetry** in the conversion between formal languages (code) and informal languages (natural language), where summarization is easier than generation.
- Multi-level summarization helps bridge the semantic gap between Verilog and natural language.

#### **CodeV**

TABLE II

COMPARISON OF OUR CODEV SERIES AGAINST VARIOUS BASELINE MODELS. RESULTS ARE CITED FROM THE ORIGINAL PAPER.

| Typo            | Model                 | Model     | Open         | Open VerilogEval-Machine (%) |        |         | VerilogEval-Human (%) |        |         | RTLLM v1.1 (%) |       |
|-----------------|-----------------------|-----------|--------------|------------------------------|--------|---------|-----------------------|--------|---------|----------------|-------|
| Type            | Woder                 | size      | source       | pass@1                       | pass@5 | pass@10 | pass@1                | pass@5 | pass@10 | Syntax         | Func. |
|                 | GPT-3.5               | 18        | ×            | 60.9                         | 75.0   | 79.9    | 33.5                  | 45.9   | 50.0    | 79.3           | 51.7  |
|                 | GPT-4                 | -         | ×            | 60.0                         | 70.6   | 73.5    | 43.5                  | 55.8   | 58.9    | 100.0          | 65.5  |
|                 | StarCoder [35]        | 15B       | $\checkmark$ | 46.8                         | 54.5   | 59.6    | 18.1                  | 26.1   | 30.4    | 93.1           | 27.6  |
| Base LLMs       | CodeLlama [36]        | <b>7B</b> | ✓            | 43.1                         | 47.1   | 47.7    | 18.2                  | 22.7   | 24.3    | 86.2           | 31.0  |
|                 | DeepSeek-Coder [37]   | 6.7B      | ✓            | 52.2                         | 55.4   | 56.8    | 30.2                  | 33.9   | 34.9    | 93.1           | 44.8  |
|                 | CodeQwen [38]         | 7B        | ✓            | 46.5                         | 54.9   | 56.4    | 22.5                  | 26.1   | 28.0    | 86.2           | 41.4  |
|                 | Qwen2.5-Coder [39]    | 7B        | $\checkmark$ | 66.2                         | 79.2   | 83.9    | 34.6                  | 45.6   | 51.0    | 89.6           | 41.4  |
|                 | ChipNeMo [40]         | 7B        | ×            | 43.4                         | -      | E       | 22.4                  | Ē      |         | <b>(3</b> )    | -     |
|                 | RTLCoder-Mistral [41] | <b>7B</b> | $\checkmark$ | 62.5                         | 72.2   | 76.6    | 36.7                  | 45.5   | 49.2    | 96.6           | 48.3  |
|                 | RTLCoder-DS [41]      | 6.7B      | $\checkmark$ | 61.2                         | 76.5   | 81.8    | 41.6                  | 50.1   | 53.4    | 93.1           | 48.3  |
|                 | BetterV-CL [42]       | <b>7B</b> | ×            | 64.2                         | 75.4   | 79.1    | 40.9                  | 50.0   | 53.3    | 8              | -     |
| Fine-Tuned LLMs | BetterV-DS [42]       | 6.7B      | ×            | 67.8                         | 79.1   | 84.0    | 45.9                  | 53.3   | 57.6    | <b>₩</b> 3     | _     |
|                 | BetterV-CQ [42]       | <b>7B</b> | ×            | 68.1                         | 79.4   | 84.5    | 46.1                  | 53.7   | 58.2    | <del>-</del>   | -     |
|                 | CraftRTL-CL [43]      | <b>7B</b> | ×            | 78.1                         | 85.5   | 87.8    | 63.1                  | 67.8   | 69.7    | 93.9           | 52.9  |
|                 | CraftRTL-DS [43]      | 6.7B      | ×            | 77.8                         | 85.5   | 88.1    | 65.4                  | 70.0   | 72.1    | 84.3           | 58.8  |
|                 | CodeV-Verilog-CL      | 7B        | <b>√</b>     | 78.1                         | 86.0   | 88.5    | 45.2                  | 59.5   | 63.8    | 93.1           | 62.1  |
| C-1-WW-:1       | CodeV-Verilog-DS      | 6.7B      | ✓            | 77.9                         | 88.6   | 90.7    | 52.7                  | 62.5   | 67.3    | 89.7           | 55.2  |
| CodeV-Verilog   | CodeV-Verilog-CQ      | <b>7B</b> | ✓            | 77.6                         | 88.2   | 90.7    | 53.2                  | 65.1   | 68.5    | 93.1           | 55.2  |
|                 | CodeV-Verilog-QC      | 7B        | ✓            | 80.1                         | 87.9   | 90.5    | 59.2                  | 65.8   | 69.1    | 96.6           | 51.7  |
|                 | CodeV-All-CL          | 7B        | <b>√</b>     | 78.5                         | 85.6   | 87.6    | 46.6                  | 58.8   | 62.5    | 96.6           | 55.2  |
| C- 1-V A11      | CodeV-All-DS          | 6.7B      | <b>√</b>     | 79.8                         | 86.0   | 86.7    | 53.0                  | 63.3   | 67.2    | 96.6           | 51.7  |
| CodeV-All       | CodeV-All-CQ          | <b>7B</b> | ✓            | 79.9                         | 88.3   | 91.1    | 54.1                  | 65.1   | 68.6    | 93.1           | 58.6  |
|                 | CodeV-All-QC          | <b>7B</b> | ✓            | 81.9                         | 89.9   | 92.0    | 56.6                  | 67.9   | 71.4    | 96.6           | 55.2  |

#### Achieves a leading position in VerilogEval and RTLLM benchmarks

## CodeV-R1



#### CodeV-R1

```
different[0], neighbor is in[1]. But the example says that out different[2] is in[2] XOR in[3]. So the code for out different would be:
out_different[3] = 0 (no neighbor to the left)
out_different[2] = in[2] ^ in[3]
out_different[1] = in[1] ^ in[2]
out_different[0] = in[0] ^ in[1]
Wait, but the example says that in[3]'s neighbor to the left is in[0]. Wait, the example says: \"For this part, treat the vector as wrapping around, so in[3]'s neighbor to the left
is in[0].\" Wait, that's conflicting. The example says that in[3]'s neighbor to the left is in[0], but according to the previous understanding, the neighbor to the left of in[3] is
in[2], which is lower index. So this is a problem. Let me re-read the problem statement.
                                                                                             'Re-read the problem
Problem statement for out different:
(3) out_different: Each bit of this output vector should indicate whether the corresponding input bit is different from its neighbor to the left is in[0]. Represented in [1] should indicate if in[0] is different from in[3]. For this part, treat the vector as wrapping around, so in[3]'s neighbor to the left is in[0].
d indicate if in[2] is different from in[3]. For this part, treat the vector as wrapping around, so in[3]'s neighbor to the left is in[0].
Ah, so the neighbor to the left is the higher index. So for in[3], the neighbor to the left is in[2], not in[0]. But the example says that out different[2] is in[2] different from i
n[3], which aligns with that. So perhaps the wrapping around is not the case here. The example says that in[3]'s neighbor to the left is in[0]. Wait, that's conflicting. Wait, the p
roblem says: \"For this part, treat the vector as wrapping around, so in[3]'s neighbor to the left is in[0].\" So when the vector is wrapped around, the left neighbor of in[3] is in
[0]. So the neighbor to the left is the higher index, but if there is none, then wrap around to the right. So for each bit i, the neighbor to the left is (i-1) mod 4. Wait, no. Let
me think again.
In the case of wrapping around, the neighbor to the left of in[3] is in[0], which is the rightmost bit. So the left neighbor is considered the higher index Reflect there is n
o higher index, so the left neighbor wraps around to the right. So the left neighbor of in[3] is in[0]. So the code for out different[i] is in[i] XOR in[(i-1) mod 4]. Wait, but how
to handle that. Let's see:
For i=3: neighbor to the left is in[0] (wrapping around). So out different[3] = in[3] ^ in[0]
i=2: neighbor to the left is in[3], so out different[2] = in[2] ^ in[3]
i=1: neighbor to the left is in[2], so out_different[1] = in[1] ^ in[2]
i=0: neighbor to the left is in[1], so out different[0] = in[0] ^ in[1]
Wait, but the example says that out different[2] is in[2] different from in[3], which is correct. So the code for out different is:
                                                                                                                                     CodeV-R1 Reasoning
out different[3] = in[3] ^ in[0]
out different[2] = in[2] ^ in[3]
out_different[1] = in[1] ^ in[2]
```

CodeV-R1: Reasoning out\_different[0] = in[0] ^ in[1]

## Tiny Model Size, Small Computation, Ultimate Performance





https://huggingface.co/zhuyaoyu/CodeV-R1-Qwen-7B https://huggingface.co/collections/zhuyaoyu/codev-series-683f16115c357416e17b1bb5













## The Dilemma of Transcompilation



## The Dilemma of Transcompilation



## **QiMeng-Xpiler**



[1] Tenspiler: A Verified-Lifting-Based Compiler for Tensor Operations

## **Transcompiler across Different Systems**

| Source Method | Compilation Accuracy            |        |        |      | Computation Accuracy |        |        |      |            |
|---------------|---------------------------------|--------|--------|------|----------------------|--------|--------|------|------------|
|               | Method                          | CUDA C | BANG C | Hip  | C with VNNI          | CUDA C | BANG C | Hip  | C with VNN |
|               | GPT-4 Zero-Shot                 | -      | 0      | 82.7 | 9.5                  | -      | 0      | 82.7 | 4.2        |
|               | OpenAl of Zero-Shot             | 200    | 0      | 85.7 | 61.9                 | -      | 0      | 82.7 | 60.7       |
|               | GPT-4 Few-Shot                  | -      | 50.6   | 97.0 | 84.5                 | -      | 7.7    | 96.4 | 30.4       |
| CUDAC         | OpenAl of Few-Shot              | 23     | 51.8   | 98.2 | 85.1                 |        | 48.2   | 98.2 | 55.4       |
|               | Falcon We SMT                   | 23     | 82.1   | 98.2 | 88.1                 |        | 34.2   | 98.2 | 38.3       |
|               | Falcon w/o SMT + Self-Debugging |        | 87.5   | 98.8 | 89.3                 | _      | 414    | 98.2 | 58.9       |
|               | FALCON                          |        | 100    | 100  | 100                  | -      | 91.7   | 100  | 95.2       |
|               | GPT-4 Zero-Shot                 | 74.4   | -      | 76.8 | 0                    | 0      |        | 0    | 0          |
|               | OpenAl of Zero-Shot             | 27.4   | -      | 97.0 | 9.5                  | 0      | -      | -0   | 4.2        |
|               | GPT-4 Few-Shot                  | 69.0   | -      | 66.1 | 23.8                 | 6.5    | -      | 6.5  | 13.1       |
| BANG C        | OpenAl of Few-Shot              | 71.4   | -      | 97.0 | 41.7                 | 10.1   | -      | 7.7  | 23.2       |
|               | Falcon w/o SMT                  | 85.1   | -      | 84.5 | 47.6                 | 77.4   | -      | 78.6 | 41.1       |
|               | Falcon We SMT + Self-Debugging  | 88.1   | -      | 88.7 | 50.6                 | 77.4   | -      | 78.6 | 41.1       |
|               | FALCON                          | 100    | -      | 100  | 100                  | 95,8   | -      | 97.0 | 95.2       |
|               | GPT-4 Zero-Shot                 | 97.0   | 0      |      | 23.8                 | 97.0   | 0.     | -    | 5.4        |
|               | OpenAI o l Zero-Shot            | 96.2   | 0      | -    | 45.8                 | 98.2   | 0      | -    | 4.2        |
|               | GPT-4 Few-Shot                  | 97.0   | 35.1   | -    | 85.1                 | 97.0   | 5.4    | -    | 24.4       |
| Hip           | OpenAl of Few-Shot              | 98.8   | 42.3   | -    | 88.7                 | 98.2   | 9.0    | -    | 30.4       |
|               | Falcon 9/o SMT                  | 98.2   | 60.7   | -    | 65.5                 | 97.6   | 52.4   | -    | 57.1       |
|               | Falcon w/o SMT + Self-Debugging | 96.6   | 62.5   | _    | 66.1                 | 98.7   | 52.4   | -    | 57.1       |
|               | FALCON                          | 100    | 100    | -    | 100                  | 100    | 86.9   | -    | 96.4       |
|               | GFT-4 Zero-Shot                 | 57.1   | 0      | 60.1 | -                    | 8.3    | 0      | 8.9  | -          |
|               | OpenAI o1 Zero-Shot             | 66.1   | 0      | 97.0 | -                    | 10.1   | 0      | 96.4 | -          |
|               | GPT-4 Few-Shot                  | 81.5   | 41.7   | 14.4 | -                    | 14.5   | 6.0    | 12.5 | -          |
| with VNNI     | OpenAl of Few-Shot              | 87.5   | 55.4   | 97.0 | -                    | 51.2   | 10,7   | 96.4 | -          |
|               | Falcon Wo SMT                   | 95.8   | 78.0   | 87.5 | -                    | 83.9   | 58.3   | 85.7 | -          |
|               | Falcon We SMT + Self-Debugging  | 97.0   | 84.5   | 89.3 | -                    | 83.9   | 58.3   | 85.7 | -          |
|               | ENLOW                           | 100    | 99.4   | 100  |                      | 98.2   | 88.7   | 99.4 | -          |

#### **Experimental Results on Different Processors**

| Deformable Attention<br>( ~ 200LoCs) |                                    | 833.00                        | NG C             | C with VNNI -><br>CUDA C |                    |  |
|--------------------------------------|------------------------------------|-------------------------------|------------------|--------------------------|--------------------|--|
| 100                                  |                                    | Costs                         | Performance      | Costs                    | Performance        |  |
| Senior<br>Coder                      | Manual<br>w/ Falcon<br>Time Saving | ~6 d<br>4.5 + 0.5 h<br>~28.8× | 100%<br>69.20%   | ~1 d<br>2.1 h<br>~11.4×  | 100%<br>132.50 %   |  |
| Junior<br>Coder                      | Manual<br>w/ Falcon<br>Time Saving | ~30 d<br>4.5 + 3 h<br>~96.0×  | 49.85%<br>65.17% | ~3 d<br>2.1 h<br>~34.3×  | 75.76%<br>132.50 % |  |

**Productivity Improvement** 



#### Performance Comparison with Different PyTorch Backend Libraries

Enable rapid migration between software ecosystems:

- Automatic program translation across different processors (e.g., NVIDIA GPU, Cambricon MLU, AMD MI, Intel DLBoost) and programming models (e.g., SIMT, SIMD).
- Real-world C/CUDA-to-BANG program translation with a functional correctness rate exceeding 91.7%.
- Performance can reach up to 2× that of manually optimized vendor-provided code.
  - Programming efficiency can be improved by up to 96× compared to manual development.

## **Conclusion: From Top-down and Bottom-up to Self-Evolution**

| Library   | Automated High-Performance<br>Library Generation |  |  |
|-----------|--------------------------------------------------|--|--|
| Compiler  | Automated Tensor Program Transcompiler           |  |  |
| Compiler  | Automated Compiler<br>Tool-Chain Design          |  |  |
| OS        | Automated OS Configuration Optimization          |  |  |
| D         | Automated HDL Generation                         |  |  |
| Processor | Automated Front-End Design                       |  |  |

## Prior





https://qimeng-ict.github.io

## **Conclusion: From Top-down and Bottom-up to Self-Evolution**

| Library   | Automated High-Performance<br>Library Generation |
|-----------|--------------------------------------------------|
| Compiler  | Automated Tensor Program Transcompiler           |
| Compiler  | Automated Compiler<br>Tool-Chain Design          |
| OS        | Automated OS Configuration Optimization          |
| Duococco  | Automated HDL Generation                         |
| Processor | Automated Front-End Design                       |

#### **Prior**



**Feedback** 

