<table>
<thead>
<tr>
<th>著者</th>
<th>ポスト・カタコウゼ・ケンハシロ⑨スケキマスナカホ</th>
<th>岡本院・高橋・葉谷内麻仁・河野数</th>
</tr>
</thead>
<tbody>
<tr>
<td>タイトル</td>
<td>Phase-mode pipelined parallel multiplier</td>
<td></td>
</tr>
<tr>
<td>原稿のタイトル</td>
<td>IEEE Transactions on Applied Superconductivity</td>
<td></td>
</tr>
<tr>
<td>卷</td>
<td>11</td>
<td></td>
</tr>
<tr>
<td>号</td>
<td>1</td>
<td></td>
</tr>
<tr>
<td>ページ</td>
<td>541-544</td>
<td></td>
</tr>
<tr>
<td>年</td>
<td>2001</td>
<td></td>
</tr>
<tr>
<td>URL</td>
<td><a href="http://hdl.handle.net/10097/48260">http://hdl.handle.net/10097/48260</a></td>
<td></td>
</tr>
</tbody>
</table>

doi: 10.1109/77.919402
Phase-Mode Pipelined Parallel Multiplier

Takeshi Onomi, Kiyoshi Yanagisawa, Masashi Seki, and Koji Nakajima

Abstract—We propose a pipelined parallel multiplier in phase-mode logic. The multiplier can be composed of combinations of gates which are the basic devices of the phase-mode logic. Experimental operations of the ICF gate and the Adder cell for the multiplier are reported. The proposed multiplier has a Wallace-tree structure comprising trees of carry save adders for the addition of partial products. This structure has a regular layout, hence it is suitable for a pipeline scheme. In the final stage of multiplication, a fast carry lookahead adder is used for generating a multiplication result. Using a Verilog-HDL simulation, we show that the parallel multiplier with 2.5kA/cm² Nb/AlOₓ/Nb junctions can operate over 10 GHz.

Index Terms—multiplier, phase mode, pipeline, single flux quantum.

I. INTRODUCTION

Recently, requirements of high-speed computation are increasing on the various areas such as nuclear science, weather forecasting, information processing of communication systems, molecular science, development of complex computer systems, etc. A single-flux-quantum (SFQ) logic has a great potential for such high-speed digital computations [1]-[3]. We have proposed phase-mode logic, which is an SFQ logic for a digital computation.

In this paper, we propose a pipelined parallel multiplier in phase-mode logic. Firstly, we report experimental results of the basic gate of phase-mode logic and the adder cell which are used for designing the multiplier. Secondly, a design of a carry lookahead (CLA) adder is described. This CLA adder plays an important role in the final stage of multiplication. Thirdly, we propose a multiplier having trees of carry save adders for the addition of partial products. This structure has a regular layout, hence it is suitable for a pipeline scheme. Finally, we discuss the performance of the multiplier. Using a Verilog-HDL simulation, we show that the parallel multiplier with 2.5kA/cm² Nb/AlOₓ/Nb junctions can operate over 10 GHz.

II. ICF GATE AND ADDER CELL

In this section, we describe the experimental test results of the basic gate and the adder cell which are the most basic circuits of pipelined parallel adder and multiplier.

A. ICF (INHIBIT controlled by fluxon) gate

We have proposed ICF(INHIBIT controlled by fluxon) gate as the basic device of the phase-mode logic. Since the first proposal, some types of ICF gates have been proposed. In this section, we describe the ICF gate designed by RSFQ scheme.

Fig. 1 shows the ICF gate and its Moore diagram. This gate has the same structure of D Flip-Flop with complementary outputs in RSFQ logic [4]. However, some modifications of the D Flip-Flop are needed to realize the perfect operation of the ICF gate. The modifications are the addition of Reset and C terminals. The operations of ICF gate are as follows: *If the internal state of the gate is "0", that is, no SFQ is in the loop including J2, J3, J7, L2, and J6, an SFQ from X outputs to A terminal through J4 and J7. *If the internal state is "0", an SFQ from Y is stored in the loop consisting of J2, J3, J7, L2, and J6 after passing through J1 and J7. *If the internal state is "1", an SFQ from X is combined with the stored SFQ, and outputs to B by J1, J3, J4, and J8 switching. *If the internal state is "1", an SFQ from Re is combined with the stored SFQ, and outputs to C by J10 and J6 switching. *If the internal state is "0", an SFQ from Re is emitted by J9.

B. Adder Cell

A serial input adder cell is achieved by feeding A output of an ICF gate back to Y input. Fig. 3 shows the adder cell using an ICF gate. Fig. 4 shows the low-speed test result of a
fabricated adder cell. This result shows the proper operation of the adder cell. The measured bias margin is ±7% which is lower than designed value (±39%).

Fig. 2. ICF gate fabricated by NEC 2.5kA/cm² Nb/AlOx/Nb standard process. (a) Microphotograph. (b) Low-speed test result. Three lower traces show output voltages of the SFQ/DC(T-flip-flop) pulse detectors.

Fig. 3. Adder cell. (a) Circuit diagram. (b) Moore diagram. (c) Symbol.

Fig. 4. Low-speed test result of the fabricated adder circuit.

III. CARRY LOOKAHEAD ADDER

A carry lookahead (CLA) adder described in this section plays an important role in a final stage of multiplication. In phase-mode logic, a parallel ripple carry adder has been proposed [1]. A ripple carry adder has the simplest structure, however it has a weak point of an increase of the operating time which is proportional to the number of bits. Most of parallel-adder circuits using semiconductor devices are based on a carry-lookahead architecture. In the RSFQ logic, a fast pipelined parallel CLA using only two types of cells (inverter and D-flip-flop) has been proposed [5]. A CLA adder has regularity in its structure and therefore it is suitable for a pipelined scheme. In this section, we describe a design of the carry lookahead adder using the ICF gates and adder cells described in the previous section. The CLA adder includes three arithmetical blocks which are Preprocessing, Carry Lookahead, and Postprocessing. In the following sections, the timing of the system is controlled by a traditional method of Phase-Mode logic [1]. The system operates asynchronously by using one timing signal being attached to one word. This method has the disadvantage of relatively lower-speed operation than synchronized circuits. However, this method is simple and does not need much attention to timing design. Timing signals to each of the bits on a pipeline stage are provided by signal distribution trees.

A. Preprocessing

\[
A(a_0b_0, \ldots, a_nb_n), B(b_0b_1, \ldots, b_{n-1}b_n), \text{and } C(c_0c_1, \ldots, c_{n-1}c_n) \text{ denote } N \text{ bits augend, addend, and carry, respectively.}
\]

The Preprocessing block generates the \(P\) (propagate) signal and the \(G\) (generate) signal represented by following equations

\[
\begin{align*}
P_i &= a_i \land b_i \quad (1) \\
G_i &= a_i \lor b_i \quad (1')
\end{align*}
\]

Namely, \(g_i=1\) represents the generation of the carry \(c_i\), and \(p_i=1\) represents the generation of the carry \(c_i\) when \(c_i=1\). \(P\) and \(G\) signals can be generated by a PG generator shown in Fig. 5.

\[
Q \text{ signal } (q_i=a_i \land b_i) \text{ in Fig.5 is used for generating Sum in the Postprocessing block.}
\]

B. Carry Lookahead

This block generates carry signals by using \(P\) and \(G\) signals. The carry \(c_i\) is represented by equation

\[
c_i = g_i + p_i \cdot c_{i-1} \quad (3)
\]

In a block including continuous bits from bit \(j\) to bit \(i\), \(P\) and \(G\) signals can be defined as \(p_j\) and \(g_j\). \(p_j\) signifies that a carry will propagate from bit \(j\) to bit \(i\). Similarly \(g_j\) denotes that a carry is generated in at least one of the bit positions from \(j\) to \(i\) inclusive and propagated to bit position \(i\). The carry \(c_{i-1}+g_{i-1}\) can be calculated efficiently by using the operator (\(\Delta\)) introduced by Brent and Kung [6]. The operator is used:

\[
P_{i,j} = p_{i+k+1} \land g_{i+k+1} \Delta (p_{k+1,j}, g_{k+1,j}) \quad (4)
\]

which is defined as

\[
P_{i,j} = p_{i+k+1} \land p_{k+1,j} \quad \text{and } g_{i,j} = g_{i+k+1} + p_{i+k+1}g_{k+1,j} \quad (5)
\]

The calculations of \(P=P_1\), \(P_k\) and \(G=G_1+G_k\), \(P_s\) can be achieved by using three ICF gates as shown in Fig.6. Various carry calculation methods using the \(\Delta\) operator have been proposed. Fig.7 shows the some methods having a tradeoff between speed and number of circuit elements. Fig.7(a) shows the notations of the \(\Delta\) operator cell and the dummy cell having 'only the data shiR function. Fig.7(b) and (c) are Sklansky's method [7] and Kogge and Stone's method [8], respectively. These methods have small stages of pipeline, hence they can achieve high-speed carry calculations. On the other hand, Brent and Kung's method [6] shown in Fig.7(d)
has small circuit elements. Table I shows the comparison of these methods in a 32-bit CLA.

C. Postprocessing
A sum of bit $i$ ($s_i$) is obtained by equation

$$s_i = p_i \oplus c_{i-1}. \quad (6)$$

Using $q_i$ generated by Preprocessing, a sum can be calculated as $s_i=q_i\oplus r_{i-1}$. This operation can be achieved by using Sum generator shown in Fig. 8.

![Delta operator cell](image)

**Fig. 6.** $\Delta$ operator cell.

**TABLE I**

<table>
<thead>
<tr>
<th>Scheme</th>
<th>Depth of tree</th>
</tr>
</thead>
<tbody>
<tr>
<td>Sklansky</td>
<td>5</td>
</tr>
<tr>
<td>Kogge-Stone</td>
<td>5</td>
</tr>
<tr>
<td>Brent-Kung</td>
<td>9</td>
</tr>
</tbody>
</table>

![Comparison of 32-bit CLA algorithms](image)

IV. PIPELINED PARALLEL MULTIPLIER

In this section, we describe a design of the multiplier with a Wallace-tree structure [9]. This structure has a simple and regular layout therefore can easily comprise the combinations of ICF gates and adder cells in the previous section. The multiplier includes three arithmetical blocks which are a generation of partial products, an addition of partial products, and a generation of the multiplication result.

A. Generation of partial products
An AND cell shown in Fig. 10 is used for generating a partial product of a multiplication. AND cells forms an AND array shown in Fig. 11. The AND array sends SFQs (partial products) to each of bit lines serially.

![AND cell](image)

**Fig. 10.** AND cell. (a)Symbol of the cell using ICF gate. (b)Notation. (c)Moore diagram.

![AND array](image)

**Fig. 11.** AND array (4-bit).

B. Addition of partial products
We propose a multiplier with a tree structure using a carry save adder (CSA). The CSA can be realized by using a full adder circuit with an additional adder cell to store the carry output as shown in Fig. 12. The CSA has usually three inputs and two outputs. The CSA proposed in this section can easily expand into the cell with more inputs as shown in Fig. 12(b). After a partial products input to the CSA, the result of the addition is sent to outputs by reset signal. Fig. 13 shows the CSA cell using an ICF gate. Fig. 14 shows a 7-input CSA array which varies the addition of seven numbers into the one of three numbers.

![CSA cell](image)

**Fig. 12.** (n,m)CSA. n and m denote input number and output number, respectively. (a) (3,2)CSA. (b) (7,3)CSA.

![CSA cell](image)

**Fig. 13.** CSA cell. (a)Symbol of the cell using ICF gate. (b)Notation. (c)Moore diagram.

C. Generation of multiplication result
After additions of partial products are executed in order, the addition of two numbers remains finally. Using the carry lookahead adder in the previous section, the multiplication result can be obtained.

Fig. 15 shows an example of a 32x32-bit multiplier using (7,3) CSA.
The delay time of CSA is related to a time $O(n)$ being proportional to a number of serial inputs and a time $O(\log(n))$ depending on bit number of the adder. Accordingly, the processing time of one-stage CSA is represented by

$$T = nT_a + mT_{add},$$

where $n$ is a number of inputs, $m$ is a bit number of the adder, $T_a$ is a time interval between input pulses, and $T_{add}$ is a delay time of carry propagation per one bit. If the multiplier is designed by using a 3-input CSA array, the processing time per stage of the pipeline is minimum, therefore, the throughput is maximum. However, the tree structure is large and complicated because of an increase of the number of pipeline stages. On the other hand, if the number of inputs is increased, the scale of trees can be reduced by the decrease of pipeline stages. However, the throughput is decreased. Namely, it means a tradeoff between operation speed and integration scale. While we can not expect the maximum throughput, the integration scale of CSA trees can be decreased by using a 7-input CSA. Table II, Table III and Table IV shows the estimations of 32-bit multiplier without CLA block, CLA adder, and Kogge-Stone CLA block, respectively. These estimations were carried out by Verilog-HDL simulations assuming $J = 2.5\, \text{kA/cm}^2$. A final multiplication result was obtained by a 64-bit Kogge-Stone CLA adder. The maximum throughput of the CLA is estimated to be over 15 GOPS by the Verilog-HDL simulation. Hence, total throughput of the multiplier is limited by the throughput of CSA trees shown in Table II. As a result, a 32-bit multiplication over 10 GHz can be achieved.

### Table II

<table>
<thead>
<tr>
<th>Maximum input number of CSA</th>
<th>Depth of the tree</th>
<th>Throughput (GOPS)</th>
</tr>
</thead>
<tbody>
<tr>
<td>3</td>
<td>9</td>
<td>13.3</td>
</tr>
<tr>
<td>7</td>
<td>5</td>
<td>6.7</td>
</tr>
<tr>
<td>32</td>
<td>3</td>
<td>2.3</td>
</tr>
</tbody>
</table>

### Table III

<table>
<thead>
<tr>
<th>Block</th>
<th>Number of ICF gates</th>
</tr>
</thead>
<tbody>
<tr>
<td>AND Array</td>
<td>1024</td>
</tr>
<tr>
<td>32-input CSA Tree</td>
<td>561</td>
</tr>
<tr>
<td>7-input CSA Tree</td>
<td>1110</td>
</tr>
<tr>
<td>3-input CSA Tree</td>
<td>2541</td>
</tr>
</tbody>
</table>

### Table IV

<table>
<thead>
<tr>
<th>Word length</th>
<th>Depth of the tree</th>
<th>Delay (ps)</th>
</tr>
</thead>
<tbody>
<tr>
<td>8bit</td>
<td>3</td>
<td>304</td>
</tr>
<tr>
<td>16bit</td>
<td>4</td>
<td>376</td>
</tr>
<tr>
<td>32bit</td>
<td>5</td>
<td>446</td>
</tr>
<tr>
<td>64bit</td>
<td>6</td>
<td>517</td>
</tr>
</tbody>
</table>

VI. CONCLUSION

We have proposed a pipelined parallel multiplier using phase-mode logic. The multiplier is comprised of combinations of ICF gates which are the basic devices of the phase-mode logic. Experimental operations of the ICF gate and the Adder cell for the multiplier have been confirmed. The proposed multiplier has a Wallace-tree structure comprising trees of carry save adders for the addition of partial products. This structure has regularity in its layout, hence it is suitable for a pipelined scheme. On the final stage of multiplication, a fast carry lookahead adder is used for generating a multiplication result. Using a Verilog-HDL simulation, we have shown that the parallel multiplier with 2.5 kA/cm$^2$ Nb/AlOx/Nb junctions can achieve fast multiplication over 10 GHz.

### REFERENCES


