# Ferroelectric FETs-Based Nonvolatile Logic-in-Memory Circuits

Xunzhao Yin, *Student Member, IEEE*, Xiaoming Chen, *Member, IEEE*, Michael Niemier, *Senior Member, IEEE*, and Xiaobo Sharon Hu, *Fellow, IEEE*

Abstract-Among the beyond-complementary metal-oxide-semiconductor (CMOS) devices being explored, ferroelectric field-effect transistors (FeFETs) are considered one of the most promising. FeFETs are being studied by all major semiconductor manufacturers, and experimentally, FeFETs are making rapid progress. FeFETs also stand out with the unique hysteretic $I_{ds}$-$V_{gs}$ characteristic that allows a device to function as both a switch and a nonvolatile (NV) storage element. We exploit this FeFET property to build two categories of fine-grained logic-in-memory (LiM) circuits: 1) ternary content addressable memory (TCAM) which integrates efficient and compact logic/processing elements into various levels of memory hierarchy; 2) basic logic function units for constructing larger and more complex LiM circuits. Two writing schemes (with and without negative supply voltages, respectively) for FeFETs are introduced in our LiM designs. The resulting designs are compared with existing LiM approaches based on CMOS, magnetic tunnel junctions (MTJs), resistive random access memories (ReRAMs), ferroelectric tunnel junctions (FTJs), etc., that afford the same circuit-level functionality. Simulation results show that FeFET-based NV TCAMs offer lower area overhead than MTJ (79% less) and CMOS (42% less) equivalents, as well as better search energy-delay products (EDPs) than TCAM designs based on MTJ ($149\times$), ReRAM ($1.7\times$), and CMOS ($1.3\times$) in array evaluations. NV FeFET-based LiM basic circuit blocks are also more efficient than functional equivalents based on MTJs in terms of propagation delay ($4.2\times$) and dynamic power ($2.5\times$). A case study for an FeFET-based LiM accumulator further demonstrates that by employing an FeFET as both a switch and an NV storage element, the FeFET-based accumulator can save area (36%) and power consumption (40%) when compared with a conventional CMOS accumulator with the same structure.

Index Terms—Ferroelectric FET (FeFET), logic-in-memory (LiM), nonvolatile (NV) memory

## I. INTRODUCTION

IT IS becoming increasingly difficult for complementary metal–oxide–semiconductor (CMOS) technology scaling to provide high performance and energy efficiency that emerging applications demand. Furthermore, information processing applications related to data mining, scientific computing, video/image streaming, etc., will continue to stress the processor-memory hierarchy. Recent work [1], [2] suggests that in order for future microprocessors to match traditional Moore's Law performance scaling trends, 576 terabits of data must be moved from registers/memory to logic *every second*. If each operand moves a distance of 1 mm (i.e., over 10% of the die), 58 W of a 65-W power budget would be allocated to just data transfers. However, if this distance can be reduced by $10 \times$, 90% of a 65-W power budget could be devoted to computations. Colocating processor elements and memory would obviously have a significant, positive impact. To address these challenges, researchers are looking to emerging devices, innovative circuits and architectures, and combinations thereof.

Manuscript received March 9, 2018; revised August 3, 2018; accepted September 5, 2018. Date of publication October 4, 2018; date of current version December 28, 2018. This work was supported in part by the Semiconductor Research Corporation and DARPA and in part by the SRC STARnet centers, MARCO and DARPA. The work of X. Chen was supported by an Innovative Project of Institute of Computing Technology, CAS under Grant 5120186140. (*Corresponding author: Xunzhao Yin.*)

X. Yin, M. Niemier, and X. S. Hu are with the Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556 USA (e-mail: xyin1@nd.edu; mniemier@nd.edu; shu@nd.edu).

X. Chen is with the State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China (e-mail: chenxiaoming@ict.ac.cn).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TVLSI.2018.2871119

When looking at ways in which memory and logic elements can be brought closer together, we first consider "coarse-grained" efforts that encompass separate logic and memory elements that are in closer proximity to each other. Near-data processing [3]/processing-in-memory (PIM) prototypes have been heavily pursued since the 1990s [4], [5]. While projections suggested that many application classes could benefit from PIM systems, commoditized products did not materialize, a trend due in no small part to the economics of manufacturing logic in a dynamic random access memory process (or vice versa) [3]. More recently, 3-D integration is paving a new path toward realizable PIM systems. As examples, studies suggest that systems such as Micron's hybrid memory cube [6] could reduce execution time and system energy by $15 \times$ and $18 \times$, respectively, for MapReduce [7], while the N3XT project suggests $1000 \times$ improvements in energy efficiency for abundant data applications [8].

In contrast to "coarse-grained" efforts, in this paper, we study how emerging technologies can impact the performance, energy efficiency, and area of "fine-grained" logic-in-memory (LiM) circuits, which tightly integrate processing and storage elements. Frequently, based on emerging technologies, these LiM structures integrate nonvolatile (NV) storage elements with the logic itself. Here, we present two categories of LiMs: ternary content addressable memories (TCAMs) and basic logic function units. TCAMs perform parallel searches for a given piece of data against a table of stored data, and return information as to whether a match occurs. TCAMs have obvious utility in networking hardware and other applications, e.g., in routers, database search applications, and associative memories [9]. Basic Boolean logic function LiM structures employ the emerging devices as both an NV storage element and a variable resistor, and perform basic logic functions such as NAND, NOR, etc., based on the inputs as well as the bits stored in the emerging devices. These LiM structures might be repeatedly used over the course of a given computation, e.g., for a sum-of-absolute-differences (SAD) calculation commonly used in compression, motion detection, and so on [10].

1063-8210 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications\_standards/publications/rights/index.html for more information.

We are especially interested in how ferroelectric field-effect transistors (FeFETs) [11] that 1) are compatible with current CMOS technologies [12] and 2) have been experimentally demonstrated [12]-[17], can lead to more efficient LiMs. Researchers have been investigating LiM designs based on resistive random access memories (ReRAMs), and spin-transfer torque random access memories (STT-RAMs) [18], [19]. Both devices use high-resistance states (HRSs) and low-resistance states (LRSs) to encode binary states. However, these technologies face challenges. For example, STT-RAM-based memories may have low variable resistance (from $10\,\Omega$ to $100\,\text{k}\Omega$ in general [19], [20]), low HRS/LRS ratios, and two-terminal structures. These shortcomings can lead to relatively high energy consumption and extra transistors for write operations and to maintain acceptable output swings. In contrast, FeFETs are three-terminal devices. By tuning the thickness of the ferroelectric (FE) material at the gate, hysteresis can be introduced into a device's I-V characteristic allowing for a 1-T NV storage element. FeFETs can also exhibit high ON–OFF ratios ($I_{ON}/I_{OFF} \sim 10^6$ [21], [22]) and provide inherent gain and higher $I_{ON}$ than a MOS field-effect transistor (MOSFET) as their structure is otherwise similar to a MOSFET [13], [23], [24]. Only one access transistor per FeFET is needed to facilitate writes, so that write/sense circuitry can be reduced.

We propose two writing schemes for FeFETs, with and without negative voltages, respectively, and incorporate them into our LiM designs. Building off of the preliminary FeFET LiM designs from [25]–[27], we first consider FeFET TCAMs in terms of their structure, operations, and layout at the cell level. We then evaluate TCAM arrays with varying word widths as well as row numbers against other TCAM arrays [i.e., based on CMOS, magnetic tunnel junctions (MTJs), and ReRAMs]. We also examine the energy efficiency of FeFET-based TCAMs in the context of an enhanced graphics processing unit (GPU) architecture which utilizes TCAMs as an associative memory [28]. Our results show that an FeFET-based TCAM can save up to 42% area compared with a conventional 16-T CMOS-based TCAM with similar search energy and delay. In addition, an FeFET-based TCAM array with 64 rows can offer a maximum benefit of $7.5 \times / 149 \times$ in energy-delay product (EDP) versus ReRAM-/MTJ-based designs.

Besides the TCAM design, we propose FeFET-based LiMs for basic logic function units by exploring two design styles: 1) dynamic current mode logic (DyCML) and 2) dynamic logic (DL). As a case study, we present the two LiM design styles for a full adder (FA). Our designs are compared with other equivalent LiM designs at the circuit level. Notably, FeFET-based approaches are more efficient than MTJ-based designs when considering metrics such as propagation delay  $(4.2\times)$  and dynamic power  $(2.5\times)$ . Compared with CMOS equivalents, FeFET-based designs still exhibit modest improvements in the aforementioned metrics while also offering nonvolatility and reduced device count. In addition, for an accumulator design based on the DL FA LiM design, the FeFET-based accumulator needs 35% fewer transistors and consumes 40% less power than a conventional CMOS accumulator.



Fig. 1. (a) FeFET structure and its equivalent circuit representation showing FE capacitance and the capacitance of the underlying MOSFET. (b) FeFET I-V curves with tunable hysteresis (from [26]).

The rest of the paper is organized as follows. Section II provides relevant background on FeFETs and presents two write schemes for FeFETs. In Section III, we discuss existing work based on FeFETs and related work regarding fine-grained LiM designs based on other emerging devices. Section IV describes the design of FeFET TCAMs, including their cell structures, operations and layouts, and the TCAM architecture for the application-level evaluations. Section V discusses FeFET-based LiM designs based on the two design styles. In Section VI, we present evaluation results, while Section VII concludes.

## II. BACKGROUND

In this section, we describe some basics of FeFETs, including the FeFET device, the FeFET simulation model, and two general writing schemes for FeFETs.

#### A. FeFET Device

An FeFET is built by stacking an FE layer on the gate of a MOSFET as shown in Fig. 1(a). The equivalent circuit is also shown in Fig. 1(a), where the FE capacitance ( $C_{\text{FE}}$ ) couples with the capacitance of the underlying MOSFET ( $C_{\text{MOS}}$ ). Per [11], there is a negative change in polarization of the FE layer with respect to the electric field, leading to a negative FE capacitance (i.e.,  $C_{\text{FE}} < 0$ ). A large  $C_{\text{FE}}/C_{\text{MOS}}$  ratio stabilizes the FE layer in the negative capacitance region, and therefore, the FE layer does not retain remnant polarization. This leads to a voltage step-up action in the device, which can result in steep-switching behavior. This type of device is referred to as a negative capacitance field-effect transistor (NCFET), and is being explored by both academia and industry [21], [29], [30].

As the FE layer thickness increases, and the $C_{\rm FE}/C_{\rm MOS}$ ratio is sufficiently low, the polarization of the FE layer can be retained, leading to hysteretic behavior in an NCFET's transfer characteristic and, hence, nonvolatility. Such an NCFET with hysteresis is called an FeFET, and recently, experimental progress has been demonstrated with promising performance and utility in embedded NV memories for low-cost Internet-of-Things (IoT) applications [12]. Per Fig. 1(b), device hysteresis can span over positive and negative gate–source voltages ($V_{\rm gs}$), and remains at high or low current in the absence of a gate–source voltage (i.e., $V_{\rm gs} = 0$). Per [31], electrostatic coupling between an FeFET's channel and drain on $C_{\rm FE}$ and $C_{\rm MOS}$ can alter the position and width of the hysteresis loop, making the hysteresis tunable by the voltages at drain and gate.

The $I_{\rm ON}/I_{\rm OFF}$ ratio of FeFETs corresponding to the two logic states ($I_{\rm ON}$ and $I_{\rm OFF}$ represent logic "1" and "0," respectively, consistent with the writing schemes in Section II-C) can be up to $10^6$ due to the inherent gain of the underlying MOSFET [22]. This allows FeFETs to act as switches instead of variable resistors. Also, the FeFET's three-terminal structure separates the writing or polarization switching path (by applying sufficient positive/negative $V_{\rm gs}$) from the reading or state sensing path (via the drain–source current). This provides more flexibility and less complexity in the design space when considering application-driven circuit and device optimizations versus other two-terminal NV devices (e.g., MTJs).

#### B. FeFET Simulation Model

In this paper, we adopt an FeFET model [15] which is based on the time-dependent Landau–Khalatnikov (LK) equation [32] and compatible with the simulation program with integrated circuit emphasis (SPICE). The LK equation describes the polarization-electric field behavior of an FE layer

$$E = \alpha P + \beta P^3 + \gamma P^5 + \rho \frac{\mathrm{d}P}{\mathrm{d}t} \tag{1}$$

where $\alpha$, $\beta$, and $\gamma$ are the static coefficients and $\rho$ is a kinetic coefficient associated with the FE material. These coefficients in the model are calibrated to experimental data on hafnium zirconium oxide ($\alpha = -7 \times 10^9$ m/F, $\beta = 3.3 \times 10^{10}$ $\text{m}^5/\text{F}/\text{coul}^2$, $\gamma = -2 \times 10^9$ $\text{m}^9/\text{F}/\text{coul}^4$, and $\rho = 0.25$) when the FE thickness is 5.7 nm. The FeFET behavior is simulated by combining the self-consistent LK equation with the 45-nm predictive technology model (PTM) [33]. Fig. 1(b) illustrates a set of representative $I_{ds}$-$V_{gs}$ curves of an FeFET with tunable hysteresis given by this model.
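As a rough illustration of what (1) implies in steady state ($\mathrm{d}P/\mathrm{d}t = 0$), the sketch below takes the static coefficients quoted above at face value in SI units and solves for the nonzero polarization at zero field and for the turning point of the S-shaped $P$-$E$ curve. The function names are hypothetical, and this is not the SPICE-compatible model of [15]: the kinetic term and the coupling to the underlying MOSFET are ignored.

```python
import math

# Static LK coefficients as quoted in the text (assumed to be in SI units).
ALPHA = -7e9
BETA = 3.3e10
GAMMA = -2e9


def static_field(p):
    """Right-hand side of (1) with dP/dt = 0."""
    return ALPHA * p + BETA * p**3 + GAMMA * p**5


def smallest_positive_root(a, b, c):
    """Smallest positive root x of a*x^2 + b*x + c = 0."""
    disc = b * b - 4 * a * c
    roots = [(-b + sign * math.sqrt(disc)) / (2 * a) for sign in (1, -1)]
    return min(r for r in roots if r > 0)


def remnant_polarization():
    """Nonzero P with E = 0: GAMMA*x^2 + BETA*x + ALPHA = 0, where x = P^2."""
    return math.sqrt(smallest_positive_root(GAMMA, BETA, ALPHA))


def turning_point():
    """Point where dE/dP = 0: 5*GAMMA*x^2 + 3*BETA*x + ALPHA = 0, x = P^2."""
    p_c = math.sqrt(smallest_positive_root(5 * GAMMA, 3 * BETA, ALPHA))
    return p_c, abs(static_field(p_c))


if __name__ == "__main__":
    p_r = remnant_polarization()
    p_c, e_c = turning_point()
    print(f"zero-field polarization: {p_r:.3f}")
    print(f"turning point: P = {p_c:.3f}, |E| = {e_c:.3e}")
```

The existence of a nonzero zero-field solution is what gives the FE layer a retained state; whether an FeFET actually shows hysteresis additionally depends on the $C_{\rm FE}/C_{\rm MOS}$ ratio discussed in Section II-A, which this sketch does not model.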

#### C. FeFET Writing Schemes

It is clear from the hysteresis shown in Fig. 1 that an FeFET can be written by applying a positive or negative  $V_{gs}$  to turn the conduction state to ON or OFF, respectively (voltage-based writing), and store this NV state as logic within the FeFET in the absence of voltage supply.

We realize two different writing schemes for FeFETs involving different voltage supply requirements as shown in Fig. 2. In Fig. 2(a), the source of the FeFET is grounded, and by applying positive/negative $V_{DD}$ to the gate, the polarization of the FE material within the FeFET is changed, and logic "1"/"0" is written into the device [25], [26]. This writing scheme (referred to as WS1) uses a single access transistor associated with the gate terminal, and is relatively straightforward and simple. However, WS1 requires an extra negative voltage (i.e., $-V_{DD}$) to achieve the negative $V_{gs}$, demanding additional routing overhead. Fig. 2(b) shows a different writing scheme (referred to as WS2), where the source of the FeFET is connected to the inverted gate voltage during writing. Though WS2 has extra area overhead and possibly larger performance and energy impacts due to the parasitic capacitance at the source terminal, it fully eliminates the need for an additional negative supply voltage. The need for a negative voltage results in two shortcomings. First, the overall routing cost is increased due to the extra supply voltage. Second, when cascading logic gates, every gate output has to be converted to a negative voltage by using a voltage shifter if the subsequent gates are FeFET-based LiMs, which again increases the overall area and energy cost.

Fig. 2. Writing schemes for FeFETs. (a) WS1: FeFET writing scheme with negative voltage supply. (b) WS2: FeFET writing scheme without negative voltage supply. (c) Normalized polarization of FE layer during the write. The first row is gate–source voltage, and the second row is the normalized polarization.

From Fig. 2(c), we can see that the polarization of the FeFET is determined by the gate-source voltage of the device, and the resulting $I_{ds}$ versus $V_{gs}$ characteristic of the FeFET is illustrated in Fig. 1(b). The difference between WS1 and WS2 lies only in the voltages applied to the gate and source terminals; both write schemes produce the same gate-source voltage $V_{gs}$ across the device (either positive $V_{DD}$ or negative $V_{DD}$). Thus, WS1 and WS2 do not change the device characteristics shown in Fig. 1(b), and there is no threshold voltage shift for the underlying MOSFET structure between the two write schemes.
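The equivalence of the two schemes at the device terminals can be checked with a few lines of arithmetic. The sketch below is only a terminal-voltage bookkeeping exercise: the value of `VDD` and the function names are assumptions for illustration, and access transistors and parasitics are ignored.

```python
VDD = 1.0  # hypothetical supply voltage in volts (not specified in this section)


def ws1_bias(bit):
    """WS1: source grounded, gate driven to +VDD for "1" or -VDD for "0"."""
    gate = VDD if bit else -VDD
    source = 0.0
    return gate, source


def ws2_bias(bit):
    """WS2: gate driven to VDD or 0, source tied to the inverted gate level."""
    gate = VDD if bit else 0.0
    source = VDD - gate
    return gate, source


def vgs(bias):
    """Gate-source voltage seen by the FeFET for a (gate, source) pair."""
    gate, source = bias
    return gate - source


if __name__ == "__main__":
    for bit in (0, 1):
        print(bit, "WS1:", ws1_bias(bit), "WS2:", ws2_bias(bit),
              "Vgs:", vgs(ws1_bias(bit)), vgs(ws2_bias(bit)))
```

Both schemes give the same $V_{gs}$ for each written bit, while only WS1 ever places a negative level on a terminal.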

## III. RELATED WORK

In this section, we first discuss existing work based on FeFETs, and then briefly review related work on fine-grained LiM circuit designs using MTJs or ReRAMs, as well as their limitations compared with FeFETs.

FeFETs have been actively studied in two categories: 1) low-power circuit designs that leverage the steep slope property of FeFETs due to the negative capacitance coupling effect within the gate stack of the devices [29] (this type of device is referred to as an NCFET); 2) memory-based circuit elements that leverage the NV storage property of FeFETs due to their hysteretic characteristics, e.g., lookup tables [34], memory designs [21], and so on. In this paper, we explore the utility of FeFETs as both storage elements and switches, and build memory-based circuit designs, i.e., LiMs with MOSFETs and FeFETs.

Regarding TCAMs, Fig. 3(a) illustrates a conventional 16-T CMOS NOR-type TCAM cell design. An NV, 4-T-2MTJ cell circuit was proposed in [35], which consumes 40%/14% of the area of a 12-T/16-T CMOS-based TCAM, essentially due to the fact that the MTJs are placed on top of the MOSFETs. Due to its small output swing stemming from an MTJ's low tunneling magnetoresistance ratio, a sense amplifier (SA) and separate access transistors are added to this TCAM cell to achieve full output swing, resulting in a 9-T-2MTJ TCAM design [36] [see Fig. 3(b)]. ReRAM-based TCAM designs have also been studied [18], [37], [38]. A 2-T-2R TCAM [Fig. 3(c)] has a compact structure and has been utilized as an example to overcome the existing challenges of degraded sensing margins caused by low HRS/LRS ratios of ReRAMs [18]. This design has also been used in an enhanced GPU architecture [39].

Fig. 3. Schematic of TCAMs based on different technologies. (a) 16-T NOR-type CMOS cell. (b) 9-T-2MTJ cell. (c) 2-T-2R cell.

Besides the NV TCAM designs, another kind of LiM design employs the concept of current mode logic (CML) and a pair of emerging devices such as MTJs, ferroelectric tunnel junctions (FTJs), etc., as a general circuit design style for basic logic functions. Namely, the design consists of two NV devices storing complementary resistive states as logic values, and a CML tree implementing the desired logic function. Hybrid circuits based on CMOS, MTJs, and FTJs have been considered for fine-grained LiMs due to nonvolatility, fast access capability, and high write endurance [40]–[42]. NV LiM FAs were designed based on CMOS/MTJ and CMOS/FTJ technologies [41]–[44], and lower delay, dynamic power, and static power were reported when compared with CMOS equivalents. All of the designs share one common feature: they employ a pair of NV devices as storage elements and part of the logic tree, as they determine the output by comparing the currents flowing through the two devices. FeFETs can be readily used to construct LiMs following the same CML style as other emerging-device-based LiMs. However, FeFETs' three-terminal structure also offers an FeFET-specific design style that makes FeFETs area and energy efficient for LiMs. We will show both design styles for FeFETs in Section V.

Compared with FeFETs, MTJs, ReRAMs, etc., have different device characteristics, and these characteristics lead to different, yet "weak" performance and power metrics in building LiMs. First, MTJs/ReRAMs exhibit relatively low resistance ratios (between 100%–250% for MTJs and $10^1$–$10^4$ for ReRAMs) compared with FeFETs, implying that: 1) the current flowing through the devices is weak with respect to drive capability and still contributes to leakage power due to the low resistance value, and 2) in order to achieve full voltage swing, the devices can only be used in CML-style circuits when employed in LiM designs (by storing complementary bits and performing a logic function via sensing the differential currents flowing through them). Second, these devices have two terminals, which requires additional transistors for read and write operations. However, given that an FeFET has a sufficiently high resistance ratio ($10^6$) and three terminals (for separate writing/reading paths), it can serve as both a switch and an NV storage element that is both area and energy efficient. In Sections VI-C and VI-D, we compare the FeFET-based LiM designs employing different writing schemes against other technology-based LiM designs to quantitatively capture the advantages of FeFETs.

Fig. 4. Two FeFET TCAM cells. (a) With WS1. (b) With WS2. Precharge pMOS and SA are included.

## IV. FEFET-BASED TCAM DESIGN

In this section, we present our FeFET-based TCAM designs. Based on the two FeFET writing schemes, we present two TCAM cells and describe the differences between them via "apples-to-apples" comparisons with respect to structures, operation schemes, layouts, and other metrics. We then compare the FeFET-based TCAMs with other technology-based designs to highlight the benefits that FeFETs bring over other NV devices. Finally, we employ our designs in a TCAM-centric array architecture and evaluate the designs in Section VI.

#### A. FeFET-Based TCAM Cell Design

The concept of our FeFET-based TCAM design, along with other FeFET-based LiM circuits, was initially proposed in [26]. We discussed the basic FeFET-based TCAM design employing WS1 in [25] as shown in Fig. 4(a), which consists of two parallel FeFETs that are connected to the matchline (ML) via two transistors. In addition to storing the complementary bits of a logic value, the two FeFETs can also both store logic "0," which represents the "don't care" state. In the cell schematic, the transistors $M_1/T_1$ and $M_2/T_2$ serve as two pull-down paths for ML to ground. The inputs to the transistors $T_1$ and $T_2$ (SL and $\overline{SL}$) together with the memory state stored in $M_1$ and $M_2$ (S and $\overline{S}$) determine whether the pull-down paths are ON or OFF and provide an XNOR output $\overline{S \oplus SL}$ at ML.

The design in Fig. 4(a) requires a negative supply voltage to be applied to the bitlines when writing a logic "0" into the FeFETs. The negative supply would lead to an additional supply rail and extra routing overhead. We propose a new TCAM cell design to eliminate the need for a negative supply. Specifically, we modify the TCAM cell structure based on WS2 so as to utilize the bitlines (BL and $\overline{BL}$) for writing. Per Fig. 4(b), the TCAM cell keeps the transistors $M_1/T_1$ and $M_2/T_2$ the same as in Fig. 4(a), serving as the pull-down paths, but from the ML to BL/$\overline{BL}$, respectively, instead of to ground. In this new design, BL/$\overline{BL}$ serve as the ground for the discharging current, and the negative voltage needed in Fig. 4(a) is eliminated.

Fig. 5. FeFET-based TCAM cell simulation waveforms. Input refers to the waveforms of bitlines BL and $\overline{BL}$.

#### B. Search and Write Operations

The TCAM designs shown in Fig. 4 reflect the ideas of two different writing schemes described in Section II-C, and differ in the need for a negative voltage supply. In this section, we describe the search and write operations for both TCAMs.

The two TCAM designs both employ the same matchline connection to the comparison transistors [i.e., the connection of ML to $T_1$ and $T_2$ in both Fig. 4(a) and (b)]. They perform the search operation as follows: when *CLK* is low, matchline ML is precharged to a high level; and when *CLK* is high, the cell compares the input with the data stored in the FeFETs. If there is a match, both pull-down paths are OFF, and the matchline ML is not discharged and stays high. If there is a mismatch, at least one of the pull-down paths is ON, resulting in the discharge of the ML. In the "don't care" state, the ML always stays high, regardless of the input data. Note that in Fig. 4(b), the bitlines are set to zero by switching the bitline driving buffers during the search operation. Because both designs perform the same search operation, a single set of input and output waveforms for one TCAM cell is shown in Fig. 5. To illustrate that FeFETs can retain states, and the TCAM cell still functions as intended during a power supply interruption, we periodically set $V_{DD} = 0$ in our simulation. In all cases, the cell functions exactly as it did before the power supply interruption.
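The search behavior described above can be captured in a small behavioral model. This is a logic-level sketch only, not a circuit simulation; the state encoding in `ENCODE` and the pairing of each FeFET with a searchline polarity are assumptions chosen so that the matchline realizes the XNOR function stated in the text (the actual pairing is fixed by Fig. 4).

```python
# Assumed encoding of the two FeFET states (M1, M2); 1 means the FeFET is ON.
ENCODE = {"1": (1, 0), "0": (0, 1), "X": (0, 0)}


def cell_discharges(stored, search_bit):
    """True if one of the two pull-down paths of a cell conducts."""
    m1, m2 = ENCODE[stored]
    # Assumed pairing: the path through M1 is gated by the complement of the
    # search bit, and the path through M2 by the search bit itself.
    path1 = m1 and (1 - search_bit)
    path2 = m2 and search_bit
    return bool(path1 or path2)


def matchline(word, key):
    """Return 1 if the precharged ML stays high (match), 0 if it discharges."""
    mismatch = any(cell_discharges(s, int(k)) for s, k in zip(word, key))
    return 0 if mismatch else 1


def search(table, key):
    """Indices of all stored words whose matchline stays high for this key."""
    return [row for row, word in enumerate(table) if matchline(word, key)]


if __name__ == "__main__":
    table = ["10X1", "1011", "0XXX"]
    print(search(table, "1011"))
    print(search(table, "0000"))
```

Because every cell of a word shares one matchline, a single conducting path anywhere in the word is enough to signal a mismatch, and a cell holding "X" never contributes a path.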

The two TCAM designs employ different writing schemes through the different connections at the source terminals of the FeFETs. For the TCAM cell shown in Fig. 4(a), to perform a wordwise writing operation, the wordline (WL) is activated for the word to be written (i.e., setting the voltage of WL to $V_{DD}$), and the voltages of the bitlines (BL and $\overline{BL}$) are set according to the input data (i.e., $V_{DD}$ for logic "1" and $-V_{DD}$ for logic "0") to switch the FE polarization within an FeFET. $-V_{DD}$ is applied to the wordlines of unselected words to ensure that the gate–source voltages of the access transistors in those words remain nonpositive during the writing operation, so that no write disturbance occurs to those words. The searchlines (SL/$\overline{SL}$) are driven to ground by the searchline buffers during the write operation to eliminate static current. This design requires $-V_{DD}$ for writing (WS1).

For the TCAM cell shown in Fig. 4(b), the source terminals of $M_1$ and $M_2$ are connected to BL and $\overline{BL}$, respectively, in order to eliminate the negative supply voltage. BL and $\overline{BL}$ are also used for the input data. In other words, the voltages of BL and $\overline{BL}$ are set according to the input data, i.e., applying $V_{DD}$ to BL and 0 to $\overline{BL}$ for logic "1," and applying 0 to BL and $V_{DD}$ to $\overline{BL}$ for logic "0," to switch the state in the FeFETs. In this cell, the wordlines of unselected words are set to 0 [instead of $-V_{DD}$ for the design of Fig. 4(a)] to ensure that no write disturbance occurs to the unselected words. Note that in order to write a "don't care" state, two separate wordlines are employed in the cell since the writing operation has two steps: write logic "0" into one FeFET, and then write logic "0" to the other FeFET. This design eliminates the need for the negative voltage supply, making it possible to cascade logic gates without level shifters.
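The write-disturb argument for unselected words reduces to a sign check on the access transistors' gate–source voltage. The sketch below encodes the wordline and bitline levels stated above for the two cells; the value of `VDD`, the names, and the simplification that the bitline-side terminal of an nMOS access transistor acts as its source are assumptions for illustration.

```python
VDD = 1.0  # hypothetical supply voltage in volts

# Levels taken from the text: bitline levels used for data, and the level
# applied to the wordlines of unselected words during a write.
SCHEMES = {
    "WS1": {"unselected_wl": -VDD, "bitline_levels": (VDD, -VDD)},
    "WS2": {"unselected_wl": 0.0, "bitline_levels": (VDD, 0.0)},
}


def access_vgs(wordline, bitline):
    """Gate-source voltage of an access transistor (assumed nMOS, with the
    bitline-side terminal treated as the source)."""
    return wordline - bitline


def unselected_rows_stay_off(scheme, unselected_wl=None):
    """True if no bitline level can give a positive access-transistor Vgs."""
    cfg = SCHEMES[scheme]
    wl = cfg["unselected_wl"] if unselected_wl is None else unselected_wl
    return all(access_vgs(wl, bl) <= 0 for bl in cfg["bitline_levels"])


if __name__ == "__main__":
    for name in SCHEMES:
        print(name, unselected_rows_stay_off(name))
    # Holding unselected WS1 wordlines at 0 instead of -VDD would not suffice.
    print("WS1 with WL = 0:", unselected_rows_stay_off("WS1", unselected_wl=0.0))
```

The last check shows why the cell of Fig. 4(a) needs $-V_{DD}$ on unselected wordlines: with a bitline at $-V_{DD}$, a grounded wordline would leave the access transistor with a positive gate–source voltage.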

#### C. Layout and Area

To determine whether the proposed FeFET-based TCAMs can truly be competitive with functional equivalents based on CMOS and/or other emerging technologies, it is necessary to make "apples-to-apples" comparisons with other approaches in terms of area, latency, and energy. For the area metric, both FeFET-based designs require six transistors per TCAM cell, and an FeFET has an area similar to that of a conventional MOSFET [see the structure of an FeFET shown in Fig. 1(a)]. The layouts of $2 \times 2$ TCAM cells are shown in Fig. 6(a) and (b) for the two proposed designs, respectively. Note that the TCAM cell with WS2 has a larger area than the one with WS1 due to the separate wordlines WL0 and WL1.

Fig. 7 compares the FeFET TCAM cell sizes with other TCAM designs from the literature. Based on the "push rule" static random access memory (SRAM) scaling trend (i.e., $124F^2$ at 65 nm and $171F^2$ at 45 nm for CMOS SRAM area estimation) [45], [46], the area of a 16-T CMOS TCAM is projected to be 1.12 $\mu\text{m}^2$. Based on the layouts shown in Fig. 6, the FeFET-based TCAM cell size with WS1 and WS2 is estimated to be 58% and 86% of that of the 16-T CMOS design, respectively. When comparing with other emerging technologies (e.g., MTJ and ReRAM), we observe that the FeFET-based TCAM cell size with WS1 and WS2 is 21% and 30% of that of the 4-T-2MTJ TCAM, respectively, due to a more advanced technology node, smaller device area, and compatibility with the CMOS process, while ReRAM-based TCAMs have slightly smaller areas due to the reduced transistor counts (i.e., 2-T-2R TCAMs). The data points in Fig. 7 confirm this conjecture. Though existing FeFET-based TCAMs do not offer an area advantage over ReRAM-based TCAMs, FeFET-based TCAMs can be superior in terms of energy and delay, which will be discussed in Section VI.
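The headline area savings follow directly from the ratios quoted above. The short calculation below only restates that arithmetic; the constant and function names are hypothetical, and the absolute FeFET cell areas are derived here from the quoted percentages rather than reported values.

```python
CMOS_16T_AREA_UM2 = 1.12  # projected 16-T CMOS TCAM cell area from the text

# FeFET cell size as a fraction of each reference design, as quoted in the text.
FRACTION_OF_CMOS = {"WS1": 0.58, "WS2": 0.86}
FRACTION_OF_MTJ = {"WS1": 0.21, "WS2": 0.30}


def fefet_cell_area_um2(scheme):
    """Absolute FeFET cell area implied by the CMOS projection (derived)."""
    return CMOS_16T_AREA_UM2 * FRACTION_OF_CMOS[scheme]


def area_saving(fraction):
    """Fractional area saving relative to the reference design."""
    return 1.0 - fraction


if __name__ == "__main__":
    for scheme in ("WS1", "WS2"):
        print(scheme,
              f"area ~ {fefet_cell_area_um2(scheme):.2f} um^2,",
              f"saving vs CMOS {area_saving(FRACTION_OF_CMOS[scheme]):.0%},",
              f"saving vs MTJ {area_saving(FRACTION_OF_MTJ[scheme]):.0%}")
```

For WS1 this reproduces the 42% (versus CMOS) and 79% (versus MTJ) savings quoted in the abstract.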

#### D. TCAM Array Architecture

Fig. 6. Layout of $2 \times 2$ TCAM cells. (a) TCAM cell with WS1. (b) TCAM cell with WS2. $\lambda$: half-feature size *F*.

Fig. 7. Comparisons of TCAM cell sizes. The CMOS TCAM area projection is based on the scaling trend of push-rule SRAM according to the International Solid-State Circuits Conference trends from [46].

Fig. 8. Architecture of an $M \times N$ TCAM array. Red wordlines for WS2.

To ensure fair comparison of energy and delay, we evaluate all technologies in the context of similar TCAM array architectures that contain the same components. Specifically, we use the TCAM structure illustrated in Fig. 8. The array consists of the TCAM core, the input buffer/driver, the output SA (an inverter in our case), the clock signal, and the output encoder. The TCAM core contains M words with a word length of N bits. The matchlines (MLs) and wordlines (WLs) are placed horizontally, while the searchlines (SL/$\overline{SL}$) and bitlines (BL/$\overline{BL}$) are placed vertically within the TCAM cell grid. The searchlines and bitlines are driven by the input buffer, and at the end of each matchline, an SA detects the voltage of the matchline, and outputs the indicator of match/mismatch to the encoder, which sends a "hit" signal and the corresponding address of the matched entry.
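The output path of the array (matchline, inverter-style SA, encoder) can be sketched behaviorally as follows. The switching threshold of $V_{DD}/2$, the lowest-address-wins priority, and all names are assumptions made for illustration; the text only states that the encoder reports a "hit" and the address of the matched entry.

```python
def sense(ml_voltage, vdd=1.0):
    """Inverter-style SA: output is 1 when the matchline has discharged.

    The VDD/2 switching threshold is an assumption for this sketch.
    """
    return 1 if ml_voltage < vdd / 2 else 0


def encode(sa_outputs):
    """Return (hit, address) for a list of SA outputs, one per word.

    An SA output of 0 means the matchline stayed high (match). When several
    words match, the lowest address is reported (assumed priority).
    """
    for address, mismatch in enumerate(sa_outputs):
        if not mismatch:
            return True, address
    return False, None


if __name__ == "__main__":
    ml_voltages = [0.0, 1.0, 1.0, 0.05]  # hypothetical end-of-search ML levels
    print(encode([sense(v) for v in ml_voltages]))
```

A real encoder may resolve multiple matches differently; the fixed priority here simply makes the returned address deterministic.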

## V. FEFET-BASED LIM CIRCUIT DESIGNS

In this section, we discuss two other circuit design styles, DyCML [47] and DL [48], that are amenable to LiM structures. We first discuss how to employ FeFETs to realize DyCML LiM. Then, we will present FeFET-based DL LiM, a more compact and FeFET-specific LiM design.

#### A. FeFET-Based DyCML LiM Design

The DyCML design style has been exploited by many emerging NV devices (e.g., MTJ [40], [41], [43], [44], [49] and FTJ [42]) to realize NV LiM circuits. Besides the low power consumption and high performance advantages, DyCML's property of using the complementary signals is also desirable for the two-terminal devices.

FeFETs, when considered as switches, are also suitable for DyCML for building NV LiM circuits, and could offer additional improvements with respect to performance and energy efficiency compared with other NV device based LiMs. However, since the writing mechanisms of FeFETs are different from MTJs and FTJs, we cannot readily replace MTJs or FTJs in their DyCML-based LiM circuits with FeFETs, and new circuit designs are needed. Fig. 9 illustrates our proposed FeFET-based DyCML LiM circuit structure for both write schemes WS1 and WS2. The structure consists of four basic parts: 1) a clocked pull-up network (top center part) where  $M_3$  and  $M_4$  facilitate precharging, and  $M_5$  and  $M_6$  perform latching operations to maintain the circuit output postevaluation; 2) a logic network that implements the desired logic functionality; 3) a dynamic current source (lower center part); and 4) an NV storage based on two FeFETs plus two/four access transistors depending on the writing schemes. WS1 requires only two access transistors (drawn in black), while WS2 requires two additional transistors (shown in red) to monitor the source voltages of the FeFETs.

Depending on the complementary bits stored in the two FeFETs and the implementation of the logic network, the pull-up network generates the corresponding complementary outputs. Our proposed FeFET-based LiMs enable various functions, such as NOR/OR and NAND/AND, based on different structures in the logic network. The insets in Fig. 9 show the NAND/AND and NOR/OR LiM designs, respectively.

Fig. 9. General circuit structure of FeFET-based DyCML LiM circuits. The two transistors shown in red are required for WS2.

Fig. 10. Schematic of FeFET-based DyCML LiM 1-bit FA. The red transistors are the extra ones for WS2.

Though our FeFET-based DyCML LiM structure appears similar to the MTJ-based LiM structure proposed in [41], [49], the writing mechanisms of the two structures are different, which leads to different connections in the access transistors associated with the storage devices. In [41] and [49], the MTJ-based LiM requires two inverters whose outputs are connected via the two serial MTJs to implement the writing operation (current-based), while the FeFET-based LiM requires two access transistors to pass a bias voltage to the FeFETs (voltage-based).

Based on the FeFET-based DyCML LiM circuit structure mentioned above, we have designed NV FAs, as illustrated in Fig. 10. Fig. 11 shows the simulation waveforms corresponding to the FeFET-based FA with WS1, which demonstrates the correct functionality. During the precharge phase, wordlines (WL0 and WL1) are activated, and input B is written to FeFETs. Note that in WS2, in order to avoid static current caused by short-circuit current paths, the wordline associated with writing "0" is activated when CLK is at a low level, and the wordline for writing "1" is activated when CLK is at a high level. Then, the FA adds inputs A, B and  $C_i$ , and outputs S and  $C_o$  in the evaluate phase. Based on WS1 or WS2, this LiM FA requires 4 FeFETs and 28 MOSFETs (24 for DyCML and 4 for FeFET writing) or 4 FeFETs and 32 MOSFETs (24 for DyCML and 8 for FeFET writing), respectively.
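The operating sequence of the LiM FA (write B during precharge, then add A, B, and $C_i$ during evaluation) can be sketched at the behavioral level as follows. The class and method names are hypothetical, and the model ignores all electrical detail; it only captures the stored operand and the complementary outputs.

```python
from itertools import product


class DyCMLFullAdderModel:
    """Behavioral stand-in for the LiM FA: operand B is held locally."""

    def __init__(self) -> None:
        self.b = 0  # bit held by the FeFET storage (B and its complement)

    def write(self, b: int) -> None:
        """Precharge phase: the wordlines are activated and B is written."""
        self.b = b & 1

    def evaluate(self, a: int, ci: int) -> dict:
        """Evaluate phase: produce S, Co and their complements."""
        s = a ^ self.b ^ ci
        co = (a & self.b) | (a & ci) | (self.b & ci)
        return {"S": s, "S_bar": 1 - s, "Co": co, "Co_bar": 1 - co}


fa = DyCMLFullAdderModel()
for a, b, ci in product((0, 1), repeat=3):
    fa.write(b)
    out = fa.evaluate(a, ci)
    print(a, b, ci, "->", out["S"], out["Co"])
```

Because `b` persists between calls, successive evaluations with new A and $C_i$ values reuse the stored operand without rewriting it, which is the property the LiM structure exploits.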



Fig. 11. Simulation waveforms of FeFET-based DyCML LiM 1-bit FA.

Unlike its CMOS equivalent, the FeFET-based DyCML FA employs FeFETs in the pull-down network as both switches and memory, and obtains nonvolatility at the expense of additional access transistors (which can be reduced by half if the FeFETs storing the same bits share an access transistor). Compared with the MTJ-based LiM FA in [41], our proposed designs use fewer devices and offer much higher performance (more than four times faster). More detailed comparisons are given in Section VI-C. In the context of application-level utility, we refer to the case studies for SAD in [40] and hardware security [50]–[54], which begin to consider the benefits of nonvolatility/local gate storage for specific problems of interest.

## B. FeFET-Based Dynamic Logic Design

The unique properties of FeFETs offer new opportunities for constructing LiM circuits that are not favorable for other emerging devices. Specifically, we consider designing DL style-based LiM circuits. CMOS DL gates find utility when improved performance and reduced area are demanded (e.g., in ARM Cortex A8 processors [55]). A DL circuit consists of a pull-up network that is simply a pMOS transistor with a clocked gate, an nMOS pull-down network that is similar in composition to the ones implemented in CMOS, and a clocked nMOS device that connects the pull-down network and ground [48]. By applying a clock signal, DL circuits use a sequence of precharge and conditional evaluation phases to realize complex logic functions. Transistor counts can be reduced to roughly half of DyCML's, logic delay is improved, and static power dissipation is eliminated. However, most emerging NV devices (e.g., MTJs and FTJs) cannot leverage the advantages offered by DL, since they behave as variable resistors and suffer from considerable leakage current even in a high-resistance state. In contrast, FeFETs can be employed to cut off conducting paths, which makes DL circuits more appealing.

We propose a generic FeFET-based DL LiM circuit structure employing the two writing schemes, as shown in Fig. 12. A NAND gate and a NOR gate are also shown as representative examples. Note that we assume one of the two inputs is stored locally by leveraging the nonvolatility of FeFETs. Using conventional DL as context, FeFETs (along with associated access transistors) can be distributed in the pull-down network (i.e., with other *N*-channel devices) and can serve as both a logic switch and an NV storage element. Specifically, for WS1, bit *S* stored in the FeFET is written via the access transistor, which is controlled by wordline WL as well as the external input *Y*. Input *Y* is set to produce either a positive or negative gate–source voltage for the FeFET to change its state to "1" or "0," respectively, thus achieving NV bit storage based on device hysteresis, albeit at the expense of an access transistor. For WS2, another access transistor (the device shown in red in Fig. 12) is needed at the source of the FeFET to deliver the inverted *Y* input. In this case, input *Y* is set to either $V_{\text{DD}}$ or 0, which is consistent with the output swing of the circuit.

Fig. 12. General structure of FeFET-based DL circuits. The red transistors are the extra ones needed for WS2.

Fig. 13. Schematic of FeFET-based DL 1-bit FA. The red transistor is the access transistor needed for WS2.
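The precharge/evaluate behavior of a DL gate with one locally stored input can be sketched as below. This is a logic-level illustration only. It assumes the conventional DL arrangement (series pull-down devices for NAND, parallel devices for NOR), and the function name `dl_gate` is hypothetical.

```python
def dl_gate(kind: str, stored_bit: int, x: int, clk: int) -> int:
    """Output of a two-input DL gate whose second input is the stored bit.

    clk == 0: precharge, the clocked pMOS pulls the output high.
    clk == 1: evaluate, the output discharges only if the pull-down path conducts.
    """
    if clk == 0:
        return 1
    if kind == "NAND":
        conducts = bool(stored_bit and x)  # series path: both devices must be ON
    elif kind == "NOR":
        conducts = bool(stored_bit or x)   # parallel paths: either device suffices
    else:
        raise ValueError("kind must be 'NAND' or 'NOR'")
    return 0 if conducts else 1


for kind in ("NAND", "NOR"):
    for stored_bit in (0, 1):
        for x in (0, 1):
            print(kind, stored_bit, x, "->", dl_gate(kind, stored_bit, x, clk=1))
```

A stored "0" keeps its branch cut off during evaluation, which is the property that resistive devices such as MTJs and FTJs cannot provide.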

Fig. 13 shows the schematic of an FeFET-based DL 1-bit FA (the schematic in black is the FA for WS1, while the schematic including black and red transistors is for WS2). It is similar to a conventional DL FA, but the transistors associated with input B are replaced by the FeFET-based NV memory elements. As the memory elements store the same bit, the access transistor can be shared by the three FeFETs, which reduces the transistor count. Fig. 14 shows the simulation waveforms of the FeFET-based DL FA. All possible input combinations (with different stored bits) have been tested. Note that in WS2, if CLK is at a low level and WL turns on the access transistors, a short-circuit current path is formed from $V_{DD}$ to $\overline{Y}$, causing significant static power. Thus, the write operation should occur either when CLK is high or when all the inputs are zero to avoid short-circuit current. (Otherwise, additional transistors should be added along all the pull-down network paths that connect to $\overline{Y}$, and should be turned OFF to cut off the short-circuit paths during the write operation.) One example LiM circuit employing WS2 is given in the case study. As will be seen in Section VI-C, due to the reduced transistor count and the DL style employed, this FeFET-based NV LiM FA achieves better dynamic power efficiency as well as delay than other NV LiM FAs.

Fig. 14. Simulation waveforms of FeFET-based DL LiM 1-bit FA.

## VI. EVALUATION

In this section, we first discuss the write time and write energy of the emerging devices. We then present a detailed performance and energy study of our FeFET-based TCAM array as well as the LiM FA circuits, and compare them with equivalent designs based on CMOS, MTJ, FTJ, and ReRAM technologies. We examine a number of figures of merit, including performance, energy consumption, device count, and nonvolatility. We also present data for: 1) a TCAM-based GPU [56] as a case study to evaluate the energy efficiency of FeFET-based TCAMs in the context of an associative memory-based computing system and 2) an FeFET DL-based accumulator to evaluate the potential feasibility of FeFET-based LiM in building logic/sequential circuit blocks as well as the performance and energy improvements over other technologies.

## A. TCAM Array Evaluation

As illustrated in Section II-A, a three-terminal FeFET utilizes separate write and read paths to employ a voltage-based write mechanism, and can therefore significantly reduce the write-related metrics. We simulated a single FeFET device in HSPICE for the write time and write energy, and extracted the data for MTJ and ReRAM from the DESTINY simulator [57]. Table I shows the comparisons between the FeFET and other emerging devices. FeFETs consume much less time and energy for write than ReRAM and MTJ devices.

TABLE I WRITE TIME AND ENERGY FOR EMERGING DEVICES



Fig. 15. Sixty-four-bit TCAM latencies in different sizes.

As the technology process and device modeling techniques advance, the write metrics of these devices will be improved, but FeFETs will still benefit the most from the voltage-based write mechanism.

As noted in Section IV-D, we replace the generic TCAM cell in Fig. 8 with technology-specific designs. All delay/energy evaluations are conducted via HSPICE simulation.

We compare four TCAM designs based on different devices: CMOS, FeFET, ReRAM, and MTJ, with the last three being NV. For FeFET-based TCAMs with two different writing schemes (WS1 and WS2), the FeFET model discussed in Section II-B and a 45-nm PTM model [33] are used. We assume minimum-sized transistors for the TCAM cell and SA. For the conventional CMOS-based TCAM [Fig. 3(a)], we use the same 45-nm PTM model and the same minimum transistor sizes used for the FeFETs for the sake of comparison across technologies. Comparisons at 22-nm technology are also considered as in [26]. For the ReRAM-based TCAM, we adopt the 2-T-2R design [Fig. 3(c)], a common example used in the literature [18]. We assume 20 M$\Omega$ for HRS and 20 k$\Omega$ for LRS to implement the ReRAM-based TCAM, and simulations based on these HRS and LRS values show similar or even better search energy per bit than the existing literature [37], [38]. For MTJ-based TCAMs, we employ the 9-T-2MTJ TCAM that uses a single-ended SA and a single pass transistor to achieve full-swing output per Fig. 3(b) [36]. We assume MTJs with a parallel resistance $(R_p)$ of 3 k$\Omega$ and a magnetoresistance ratio of 120% [20] in the 9-T-2MTJ TCAM. All the simulations are based on 1-ns pulses with 50% duty cycle, and energy is calculated by multiplying the measured supply currents associated with search operations by the supply voltages over a single pulse.
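One way to read the energy calculation above is $E = V_{\text{DD}} \int i(t)\,dt$ over the pulse. The sketch below assumes the supply current is available as sampled time/current pairs and uses trapezoidal integration. The function name `supply_energy` and the sample waveform are hypothetical and are not taken from the simulations reported here.

```python
from typing import Sequence


def supply_energy(times_s: Sequence[float],
                  currents_a: Sequence[float],
                  vdd_v: float) -> float:
    """Trapezoidal estimate of E = VDD * integral of i(t) dt, in joules."""
    if len(times_s) != len(currents_a) or len(times_s) < 2:
        raise ValueError("need at least two matching time/current samples")
    charge_c = 0.0
    for k in range(1, len(times_s)):
        dt = times_s[k] - times_s[k - 1]
        # Area of one trapezoid between consecutive current samples.
        charge_c += 0.5 * (currents_a[k] + currents_a[k - 1]) * dt
    return vdd_v * charge_c


# Hypothetical waveform: a constant 10-uA draw from a 1-V supply over a 1-ns pulse.
times = [0.0, 0.5e-9, 1.0e-9]
currents = [10e-6, 10e-6, 10e-6]
print(supply_energy(times, currents, vdd_v=1.0))
```

With several supplies, the same function would be applied per supply and the results summed.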

Here, we summarize the delay and energy comparisons for the four different TCAMs assuming a 64-bit word with different numbers of rows. We choose a 64-bit word as it is of sufficient size for many applications such as network switches and routers [58]. Fig. 15 shows the search delays of 64-bit TCAMs based on different technologies at different sizes. The delay is measured for the worst case, where only 1 bit mismatches. For small-sized arrays, the ReRAM- and MTJ-based TCAMs have lower delay due to the larger discharging current and the small load capacitance (one transistor per bit for MTJ-based TCAM) at the matchline, respectively. However, for large-sized arrays, the delay of the buffers that drive the bitlines and searchlines across the array increases, and moreover, the in-cell SA of the MTJ-based TCAM slows down, resulting in a rapidly growing total delay of the MTJ-based TCAM versus other TCAM arrays. The FeFET-based TCAM with WS1 is faster than the CMOS-based TCAM because the FeFET has a larger $I_{ON}$ as well as a better $I_{ON}/I_{OFF}$ ratio, which leads to a larger discharging current upon a mismatch. However, with WS2, the FeFET-based TCAM has bitline buffers associated with the current paths, resulting in larger parasitic capacitance and thus larger delay.

Fig. 16. Sixty-four-bit TCAM search energies and EDP in different sizes.

The total search energy per operation consists of two parts: the buffer energy and the cell energy. As TCAM size increases, the buffer sizes grow as well to drive the large TCAM array, and thus the buffer energy increases. The cell energy depends on the schematics of the TCAM designs. For FeFET- and ReRAM-based TCAMs, the cell consumes precharging energy; for CMOS- and MTJ-based TCAMs, the cell consumes precharging energy plus static energy due to the leakage of SRAM and the conducting current associated with constantly ON paths, respectively. Fig. 16 shows the 64-bit TCAM search energy for different TCAMs. Note that the MTJ-based TCAM always conducts a large static current due to its in-cell SA and low resistance values, causing significant energy consumption as high as 820 fJ per bit. To allow easier viewing of the graphs, the data for the MTJ-based TCAM are not included in Fig. 16. FeFET-based TCAMs have similar latencies and energies, as they have similar capacitance at the matchline and searchlines. FeFET-based TCAMs are also denser than CMOS-based TCAMs and are NV. According to Figs. 15 and 16, the FeFET-based TCAM with WS1 has EDPs that are $1.7\times$ and $149\times$ better (at 64 rows) than ReRAM- and MTJ-based designs, respectively, and the FeFET-based TCAM with WS2 has EDPs that are $1.5\times$ and $133\times$ better than ReRAM- and MTJ-based designs, respectively.

## B. Case Study: TCAM-Based Associative Memory in a GPU

To further demonstrate the benefit of the proposed TCAMs, we evaluate the TCAM-based associative memory employed in an AMD Southern Islands GPU device for energy reduction in the context of GP-GPU applications. A low-power GPU architecture introduced in [56] integrates a TCAM array with each of the four main floating point units (FPUs), as shown in Fig. 17. The TCAM arrays store the frequently used patterns and are used as associative memory. When the system starts executing certain applications, it sends the input operands to both the FPU and TCAM block simultaneously. If a match happens, the low-power TCAM array disables the corresponding high-power FPU execution and provides the actual result. With low-power TCAM designs, such a TCAM-based GPU architecture can provide unique advantages in many energy-conscious applications ranging from mobile devices to data centers.

Fig. 17. Framework integrating FPUs with TCAM as associative memory systems. The framework originates from [59].

We compare the four TCAM arrays discussed in Section VI-A in this GPU architecture. The evaluation process adopted here follows that proposed in [59]. The TCAM arrays use the minimum transistor sizes for buffer, cell, and SA to save energy while satisfying the basic search functionality within the same cycle time of the FPUs. We assume 32-bit words for SQRT, 64-bit words for ADD/MUL, and 96-bit words for MAC, as they require varying numbers of input operands. Several OpenCL applications, including three image processing and three general applications, are run on the GPU platform. The data for the applications are from the Caltech 101 data set [60], with 10% of the data used for training and 100% used for testing. The trained data are then stored in the corresponding TCAM arrays.

For architecture-level evaluation, we need to evaluate the average energy consumption per operation for TCAMs at different sizes and for all the FPUs. Table II summarizes the energy consumption of TCAMs based on the four technologies. The individual FPU energy consumptions are obtained from the synthesized six-stage FPU design [56]. From the table, it can be observed that MTJ-based TCAMs consume much more energy than even the individual FPUs, especially when the number of rows in the TCAM array is large. Thus, we will not show the MTJ-based TCAM data in later discussions. For the other three TCAM designs, their energy values suggest the potential for improved energy efficiency when they are integrated in the enhanced GPU architecture. Using MUL as an example, FeFET- (WS1 and WS2), CMOS-, and ReRAM-based TCAMs achieve $14\times$, $14\times$, $11\times$, and $8.5\times$ energy efficiency at 64 rows compared with the corresponding FPU energy. The data also show that FeFET-based TCAMs consume the least energy among all the designs.
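The MUL comparison above is a ratio of the FPU energy per operation to the 64-row TCAM energy per operation. The short sketch below recomputes these ratios from the 64-bit, 64-row entries of Table II. The variable names are hypothetical, and the values are transcribed by hand from the table.

```python
# Energy per operation in femtojoules, transcribed from Table II.
FPU_MUL_FJ = 9691.0
TCAM_64BIT_64ROW_FJ = {
    "FeFET WS1": 714.0,
    "FeFET WS2": 703.9,
    "CMOS": 895.8,
    "ReRAM": 1159.6,
    "MTJ": 52488.0,
}

# Ratio > 1 means a TCAM lookup costs less energy than one FPU operation.
ratios = {device: FPU_MUL_FJ / energy
          for device, energy in TCAM_64BIT_64ROW_FJ.items()}

for device, ratio in ratios.items():
    print(f"{device}: {ratio:.2f}x")
```

A ratio below one, as for the MTJ-based array at this size, means a search costs more than the FPU operation it is meant to replace.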

Fig. 18 shows the normalized total energy consumption of the GPU with different TCAM sizes for representative applications. The energy values are normalized to that of a GPU without the TCAM array. All energy curves have

TABLE II Energy (in Femtojoules) Per Operation for FPUs and TCAM Arrays

| Module | FPU (1.0 V) | Device | TCAM* 4-row | TCAM* 16-row | TCAM* 64-row |
|--------|-------------|--------|-------------|--------------|--------------|
| ADD (64-bit) / MUL (64-bit) | 4742 / 9691 | FeFET WS1 | 63.6 | 175.6 | 714.0 |
|  |  | FeFET WS2 | 62.5 | 172.3 | 703.9 |
|  |  | CMOS | 75.9 | 223.8 | 895.8 |
|  |  | MTJ | 2149.0 | 9368.0 | 52488.0 |
|  |  | ReRAM | 79.0 | 243.4 | 1159.6 |
| SQRT (32-bit) | 9983 | FeFET WS1 | 33.4 | 95.8 | 442.7 |
|  |  | FeFET WS2 | 32.6 | 94.2 | 436.8 |
|  |  | CMOS | 39.5 | 120.5 | 524.5 |
|  |  | MTJ | 1078.0 | 4699.0 | 26337.0 |
|  |  | ReRAM | 41.5 | 135.9 | 569.7 |
| MAC (96-bit) | 12051 | FeFET WS1 | 93.5 | 255.0 | 984.7 |
|  |  | FeFET WS2 | 91.8 | 248.1 | 966.2 |
|  |  | CMOS | 110.8 | 321.7 | 1234.0 |
|  |  | MTJ | 3220.0 | 14039.0 | 78623.0 |
|  |  | ReRAM | 116.2 | 350.8 | 1515.5 |

\* Only 3 row sizes (out of 7) are shown, others are omitted for space.

a similar trend: for small TCAM sizes, the total energy consumption of the enhanced GPU architecture decreases as TCAM size increases, since more frequently referenced input patterns can be prestored in the TCAM and higher hit rates can be achieved, leading to fewer FPU operations. After reaching the minimum energy points, increasing TCAM size does not improve the hit rate enough to compensate for the higher energy consumption associated with TCAMs, and the total energy starts to increase. Because of this, we omitted the data for larger (>64-row) and smaller (1-row and 2-row) sizes. From the figure, we conclude that FeFET-based TCAMs can achieve better energy efficiency than CMOS- and ReRAM-based TCAMs in the enhanced GPU architecture. Depending on the application, the energy efficiency varies with TCAM size. On average, FeFET-based TCAMs with WS1 and WS2 achieve 45% and 48% energy savings over the six applications, respectively, while the average energy savings of CMOS- and ReRAM-based TCAMs are 36% and 37%, respectively (for 32-row TCAM size).
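The tradeoff between hit rate and TCAM energy can be illustrated with a deliberately simplified per-operation model. It assumes that every operation pays the TCAM search energy and that the FPU energy is fully avoided on a hit. This is an illustrative assumption, not the evaluation flow of [59], and the hit rates used are hypothetical. The energy values are taken from Table II.

```python
def normalized_energy(e_tcam_fj: float, e_fpu_fj: float, hit_rate: float) -> float:
    """Energy per operation relative to an FPU without a TCAM.

    Assumed model: E = E_TCAM + (1 - hit_rate) * E_FPU.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must lie in [0, 1]")
    return (e_tcam_fj + (1.0 - hit_rate) * e_fpu_fj) / e_fpu_fj


def break_even_hit_rate(e_tcam_fj: float, e_fpu_fj: float) -> float:
    """Hit rate above which adding the TCAM saves energy under the assumed model."""
    return e_tcam_fj / e_fpu_fj


# ADD with FeFET WS1 TCAMs of 4 and 64 rows (femtojoules, from Table II).
E_FPU_ADD = 4742.0
for rows, e_tcam in ((4, 63.6), (64, 714.0)):
    print(rows, "rows: break-even hit rate =",
          round(break_even_hit_rate(e_tcam, E_FPU_ADD), 3))
    for hit_rate in (0.2, 0.5):  # hypothetical hit rates
        print("  hit rate", hit_rate, "->",
              round(normalized_energy(e_tcam, E_FPU_ADD, hit_rate), 3))
```

Under this model, a larger array raises the break-even hit rate, so it pays off only if the additional rows raise the actual hit rate by more than the added search cost, which is consistent with the minimum-energy points described above.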

## C. Logic-in-Memory Full Adder Evaluation

In this section, we compare and contrast the FeFET-based LiM circuits described in Section V with LiM circuits based on CMOS and other emerging technologies at similar feature sizes. We specifically consider the performance and power of the LiM FA. The metrics considered include propagation delay  $(T_d)$ , dynamic power  $(P_{\text{DYN}})$ , and static power dissipation  $(P_{\text{static}})$ . The simulation results for FeFET-based designs are obtained using HSPICE based on the FeFET device model discussed in Section II-B and 45-nm Arizona State University PTMs [33]. Data for other implementations are directly obtained from the relevant papers.

Table III summarizes the data for different FA designs. We examine the FeFET-based DyCML and DL LiM FA designs (rows 2 and 3, and rows 7 and 8, respectively), the conventional CMOS-based DyCML and DL FAs (rows 4 and 9), an MTJ-based DyCML LiM FA (row 5), and an FTJ-based DyCML LiM FA (row 6). All data are based on similar technology nodes (40 or 45 nm). For FeFET-based FAs, Table III shows the static power values when the FAs are powered ON rather than the standby power, which can be extremely low due to nonvolatility.

The FeFET-based DyCML or DL FAs with WS1 and WS2 (rows 2 and 3 or rows 7 and 8) have similar delay,



Fig. 18. Normalized energy consumption of TCAM-based GPU integrating different technology-based TCAMs of different sizes.

TABLE III Performance and Power of FAs

| Row | Device | Design style | Technology node | Transistor count | $V_{\text{DD}}$ (V) | $T_d$ (ps) | $P_{\text{DYN}}$ ($\mu$W) | $P_{\text{static}}$ (nW) | NV or V |
|-----|-----------|---------------|--------------------------|-------------------|------------------------|----------------------------|----------------------------|-----------------------------|---------|
| 2   | FeFET WS1 | DyCML Fig. 10 | 45nm*                    | 28 MOS + 4 FeFETs | 1                      | 20.3                       | 1.10                       | 133.6                       | NV      |
| 3   | FeFET WS2 | DyCML Fig. 10 | 45nm*                    | 32 MOS + 4 FeFETs | 1                      | 18.9                       | 1.05                       | 148.8                       | NV      |
| 4   | CMOS      | DyCML [62]    | 45nm*                    | 28 MOS            | 1                      | 34.0                       | 0.95                       | 120.7                       | V       |
| 5   | MTJ       | DyCML [42]    | $40 \text{nm}^{\dagger}$ | 34 MOS + 4 MTJs   | 1.2                    | 87.4                       | 1.98                       | N.A.                        | NV      |
| 6   | FTJ       | DyCML [43]    | $40 \text{nm}^{\dagger}$ | 30 MOS + 4 FTJs   | 1.2                    | 500                        | 1.70                       | N.A.                        | NV      |
| 7   | FeFET WS1 | DL Fig.13     | 45nm*                    | 17 MOS + 3 FeFETs | 1                      | 21.8                       | 0.79                       | 73.6                        | NV      |
| 8   | FeFET WS2 | DL Fig.13     | 45nm*                    | 18 MOS + 3 FeFETs | 1                      | 20.6                       | 0.80                       | 80.0                        | NV      |
| 9   | CMOS      | DL            | 45nm*                    | 19 MOS            | 1                      | 19.4                       | 0.80                       | 79.8                        | V       |

In SPICE simulations, a 500-MHz signal is applied to CLK.

\*: based on ASU 45-nm PTM [34].

†: based on STMicroelectronics 40-nm design kit [62].

dynamic power, and static power, indicating that the writing schemes have little impact on the performance and power of the circuits. Comparing different technology-based designs, we first examine the data associated with DyCML-style FAs (rows 2–6). Compared with a conventional CMOS-based DyCML FA, the FeFET-based DyCML FAs have similar dynamic power, which is expected since they have similar topologies. However, the FeFET-based DyCML FAs exhibit nonvolatility with minimal area/transistor overhead. When comparing the FeFET-based NV FAs with other NV approaches, improvements over other approaches in the published literature are observed. Notably, comparing the FeFET-based DyCML FA with WS2 against MTJ/FTJ-based designs, the propagation delay of the FeFET approach is $4.6\times/26.5\times$ better, while the dynamic power is $1.9\times/1.6\times$ better. Per Table III, the FeFET DyCML FAs also use fewer devices than the MTJ-based design (32 or 36 versus 38), and the WS1 version uses fewer than the FTJ-based design (32 versus 34). Static power dissipation is unavailable for the MTJ/FTJ approaches, and hence, no comparison is made for this metric.

Comparing the FeFET-based DL FAs with FeFET-based DyCML FAs, one can see that FeFET-based DL FAs have a much lower transistor count and still offer comparable performance, power consumption, and nonvolatility. As a CMOS-based DL FA and FeFET-based DL FAs also have similar topologies, they are expected to have similar delay and dynamic power, differing only in nonvolatility. Moreover, improvements in terms of area, power, and performance are obtained due to the FeFET-specific DL design style when compared with other emerging technology-based designs. Notably, the area-delay-power products of the FeFET-based DL FAs with WS1 and WS2 are both $19\times/84\times$ better than those of the MTJ/FTJ designs (area is assumed to be proportional to device count).
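The improvement factors quoted for the FA designs follow from the entries of Table III. The sketch below recomputes the power-delay product (PDP, $T_d \times P_{\text{DYN}}$) and the area-delay-power product, with area taken as the total device count as stated above. The dictionary keys and function names are hypothetical, and the values are transcribed by hand from the table.

```python
# (total device count, propagation delay in ps, dynamic power in uW), from Table III.
FA_DESIGNS = {
    "FeFET DyCML WS1": (28 + 4, 20.3, 1.10),
    "FeFET DyCML WS2": (32 + 4, 18.9, 1.05),
    "MTJ DyCML": (34 + 4, 87.4, 1.98),
    "FTJ DyCML": (30 + 4, 500.0, 1.70),
    "FeFET DL WS1": (17 + 3, 21.8, 0.79),
    "FeFET DL WS2": (18 + 3, 20.6, 0.80),
}


def pdp(design: str) -> float:
    """Power-delay product in attojoules (ps * uW)."""
    _, td_ps, pdyn_uw = FA_DESIGNS[design]
    return td_ps * pdyn_uw


def adp(design: str) -> float:
    """Area-delay-power product, with area approximated by device count."""
    devices, td_ps, pdyn_uw = FA_DESIGNS[design]
    return devices * td_ps * pdyn_uw


def improvement(metric, baseline: str, proposed: str) -> float:
    """How many times smaller the metric is for `proposed` than for `baseline`."""
    return metric(baseline) / metric(proposed)


for baseline in ("MTJ DyCML", "FTJ DyCML"):
    for proposed in ("FeFET DyCML WS1", "FeFET DyCML WS2"):
        print(f"PDP, {proposed} vs {baseline}: "
              f"{improvement(pdp, baseline, proposed):.1f}x")
    for proposed in ("FeFET DL WS1", "FeFET DL WS2"):
        print(f"ADP, {proposed} vs {baseline}: "
              f"{improvement(adp, baseline, proposed):.1f}x")
```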

There are several reasons for the improvements associated with the FeFET-based circuits. First, FeFETs have higher $I_{\rm ON}$ currents ($\sim$100 $\mu$A), while MTJs and FTJs have smaller $I_{\rm ON}$ currents ($\sim$10 $\mu$A),<sup>1</sup> which leads to the FeFET's improved performance over MTJs and FTJs. Furthermore, FeFETs have a high $I_{\rm ON}/I_{\rm OFF}$ ratio ($\sim 10^6$), while MTJs and FTJs simply serve as tunable resistors (with just 120% magnetoresistance ratio and 220% tunnel electroresistance ratio, respectively [41], [42]). This feature enables FeFET-based LiM circuits to have stronger driving capability as well as less dynamic power consumption. In addition, the FeFET LiM structures require fewer transistors than other NV LiM circuits due to the FeFET being a three-terminal device. Consequently, FeFETs' ON/OFF states are controlled by changing the gate bias via a single access transistor, while MTJ-/FTJ-based designs need to monitor tunnel resistance by reversing the current direction or applied voltage, which requires two or more write transistors. Finally, in DL, FeFETs can serve as both a storage element and a switch, while MTJs, for example, cannot.

## D. Case Study of an LiM-Based Accumulator

To further demonstrate the benefit of FeFETs in terms of nonvolatility and dual functionality as a switch and a storage element, we present a case study of a more complex LiM circuit. As FeFET WS2 allows for cascadable logic gates without the need of negative voltages, FeFET-based LiMs can be combined with other circuits to build more complex circuits that offer more compact topologies and improved energy efficiency. Here, we introduce an FeFET-based accumulator

<sup>1</sup>Note that the current numbers are extracted from [41] and [42]. With higher  $I_{on}$  currents of MTJs/FTJs demonstrated, the evaluation metrics for MTJ- and FTJ-based designs are expected to be improved.

TABLE IV Performance and Power of Accumulators

| Device | Design style | Transistor count | $V_{\text{DD}}$ (V) | $T_d$ (ps) | $P_{\text{DYN}}$ ($\mu$W) | $P_{\text{static}}$ (nW) | NV or V |
|--------|-----------------|-------------------|------------------------|------------|--------------------------------|-----------------------------|---------|
| FeFET  | DL + NAND latch | 37 MOS + 3 FeFETs | 1                      | 41.5       | 3.14                           | 132.8                       | NV      |
| FeFET  | DL + TX latch   | 31 MOS + 3 FeFETs | 1                      | 41.5       | 2.63                           | 116.8                       | NV      |
| CMOS   | DL + DFF        | 53 MOS            | 1                      | 38.0       | 4.41                           | 104.5                       | V       |



Fig. 19. Components of accumulator. (a) Conventional volatile accumulator. (b) NV accumulator.



Fig. 20. Schematic of FeFET-based accumulator.

that has adopted the FeFET-based DL LiM with WS2. Such an NV accumulator can be quite desirable for calculations such as SAD, which is commonly used in low-power applications, e.g., image processing.

A conventional accumulator consists of a volatile FA for addition and a D flip-flop for storage [see Fig. 19(a)]. By leveraging the storage function of FeFETs, we can build an NV accumulator by combining the FeFET-based DL FA with a latch [see Fig. 19(b)]. The schematic of the accumulator is shown in Fig. 20. Besides the latch design in the green box, three extra transistors (shown in red) are added between the pull-up and pull-down networks to avoid short-circuit current during the precharging/writing phase of the FA. When CLK is high, the FA performs the addition operation, and output S is transferred via the latch to Q and $\overline{Q}$. On the falling edge of CLK, the values Q and $\overline{Q}$ are latched and written back into the FeFETs within the FA, while the FA simultaneously precharges the output S. With this operating sequence, an FeFET works as a switch during the addition operation and as an NV storage element during the write-back operation.
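The two-phase operating sequence (add while CLK is high, write back on the falling edge) can be sketched behaviorally as follows. The class and function names are hypothetical. Composing several 1-bit slices into a ripple-carry word is an assumption made only to give a runnable multi-bit example; the text describes a single FA with a latch.

```python
class NVAccumulatorSlice:
    """One FA plus latch: the stored bit plays the role of the FeFET state."""

    def __init__(self, stored: int = 0) -> None:
        self.stored = stored   # NV bit held in the FeFETs
        self.latched = stored  # value Q held by the latch

    def clk_high(self, a: int, ci: int) -> tuple:
        """Evaluate phase: the FeFETs act as switches and S passes to the latch."""
        s = a ^ self.stored ^ ci
        co = (a & self.stored) | (a & ci) | (self.stored & ci)
        self.latched = s
        return s, co

    def clk_falling(self) -> None:
        """Write-back phase: the latched Q is written into the FeFETs."""
        self.stored = self.latched


def accumulate(slices: list, value: int) -> int:
    """Add `value` to the stored word (ripple carry assumed) and return the new word."""
    carry = 0
    for i, sl in enumerate(slices):
        _, carry = sl.clk_high((value >> i) & 1, carry)
    for sl in slices:
        sl.clk_falling()
    return sum(sl.stored << i for i, sl in enumerate(slices))


slices = [NVAccumulatorSlice() for _ in range(4)]
for operand in (3, 5, 9):
    print(operand, "->", accumulate(slices, operand))
```

The running total lives only in `stored`, mirroring the point that no separate flip-flop is needed to hold the accumulated value.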

Table IV summarizes the performance and power data for different accumulator designs. From the table, we can see that if an FeFET is used as a latch, the device count of the accumulator is reduced by roughly the size of one latch, as only one latch is needed to cascade an FA instead of a D flip-flop. The decrease in active transistors further reduces the power consumption of the accumulator at similar delay (41.5 ps versus 38.0 ps in Table IV). Notably, the FeFET-based accumulator that consists of a DL adder and a transmission gate-based latch uses 36% fewer devices and consumes 40% less dynamic power than a conventional CMOS accumulator. Moreover, experimental results of $10^7$ [63] and $10^{12}$ [64] endurance cycles have been demonstrated, paving the way toward sufficient device endurance for the realization of embedded storage solutions.
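The savings quoted above can be checked directly against Table IV. The sketch below counts FeFETs as devices alongside MOSFETs, consistent with the area assumption used for the FA comparison. The names are hypothetical, and the values are transcribed by hand from the table.

```python
# (total device count, dynamic power in uW), from Table IV.
ACCUMULATORS = {
    "FeFET DL + NAND latch": (37 + 3, 3.14),
    "FeFET DL + TX latch": (31 + 3, 2.63),
    "CMOS DL + DFF": (53, 4.41),
}


def saving(baseline: float, proposed: float) -> float:
    """Fractional reduction of `proposed` relative to `baseline`."""
    return (baseline - proposed) / baseline


base_devices, base_power = ACCUMULATORS["CMOS DL + DFF"]
for name in ("FeFET DL + NAND latch", "FeFET DL + TX latch"):
    devices, power = ACCUMULATORS[name]
    print(f"{name}: {saving(base_devices, devices):.1%} fewer devices, "
          f"{saving(base_power, power):.1%} less dynamic power")
```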

#### VII. CONCLUSION

In this paper, we introduce two types of NV LiM circuits based on FeFETs: one is TCAM and the other is Boolean logic gates. These circuits fully exploit the unique hysteretic behavior of FeFETs, and two writing schemes are presented for FeFETs involving different voltage supply requirements and transistor connections. Due to the FeFET's three-terminal structure, high $I_{\rm ON}$, and high $I_{\rm ON}/I_{\rm OFF}$ ratio, these FeFET-based circuits offer area, performance, and energy-efficiency advantages over CMOS and other emerging device-based equivalents.

FeFET-based TCAM designs with write schemes WS1 and WS2 require 42% and 10% less area overhead than a CMOS-based TCAM, respectively. For delay and energy consumption, the FeFET-based TCAM with WS1 achieves better EDPs than ReRAM- $(1.7\times)$ and MTJ-based $(149\times)$ designs. The FeFET-based TCAM with WS2 also achieves better EDPs than ReRAM- $(1.5\times)$ and MTJ-based $(133\times)$ designs. In the TCAM-based GPU evaluation, on average, FeFET-based TCAMs with WS1 and WS2 achieve more energy savings (45% and 48%, respectively) than ReRAM- and CMOS-based TCAMs. The results above indicate potential benefits of FeFET-based TCAMs in related applications such as network routers and switches.

Besides TCAM designs, FeFET-based LiMs based on the DyCML and DL styles are also presented. In the FA study, the FeFET-based DyCML LiMs outperform a volatile CMOS-based functional equivalent in terms of propagation delay at similar dynamic power. Simulation results suggest that the FeFET-based DyCML FA with WS1 has $7.7\times$ and $38\times$ better power-delay products (PDPs) than MTJ- and FTJ-based equivalents, respectively. The FeFET-based DyCML FA with WS2 has $8.7\times$ and $43\times$ better PDPs than MTJ- and FTJ-based equivalents, respectively. The FeFET-based DL FAs with WS1 and WS2 both achieve $19\times$ and $84\times$ better area-delay-power products than the MTJ and FTJ designs, respectively. In the accumulator study, we further compare the area, performance, and power of an FeFET DL FA-based accumulator with those of a conventional volatile accumulator. Results show that using FeFETs as a latch saves 36% area and 40% dynamic power over a CMOS accumulator, with similar delay.

We should point out that all of our projections are likely pessimistic as we assume MOSFETs *in lieu* of FeFETs for nonhysteretic devices. From fabrication perspectives, nonhysteretic FeFETs should be compatible with all FeFET structures assumed here, and also offer improved switching slopes, which should lead to additional improvements with respect to delay and power [15]. In addition, we will consider more application-specific case studies (i.e., for TCAM designs [28], nanofunctions such as SAD, and circuitry required for critical path operations per [55]).

#### REFERENCES

- [1] S. Borkar and A. A. Chien, "The future of microprocessors," *Commun. ACM*, vol. 54, no. 5, pp. 67–77, May 2011, doi: 10.1145/1941487.1941507.
- [2] H.-S. P. Wong and S. Salahuddin, "Memory leads the way to better computing," *Nature Nanotechnol.*, vol. 10, no. 3, pp. 191–194, 2015.
- [3] R. Balasubramonian *et al.*, "Near-data processing: Insights from a MICRO-46 workshop," *IEEE Micro*, vol. 34, no. 4, pp. 36–42, Jul. 2014.
- [4] C. E. Kozyrakis *et al.*, "Scalable processors in the billion-transistor era: IRAM," *Computer*, vol. 30, no. 9, pp. 75–78, Sep. 1997.
- [5] J. Draper *et al.*, "The architecture of the diva processing-in-memory chip," in *Proc. 16th Int. Conf. Supercomput. (ICS)*, New York, NY, USA, 2002, pp. 14–25, doi: 10.1145/514191.514197.
- [6] J. T. Pawlowski, "Hybrid memory cube (HMC)," in *Proc. HotChips*, 2011, pp. 1–24.
- [7] S. H. Pugsley *et al.*, "NDC: Analyzing the impact of 3D-stacked memory+logic devices on MapReduce workloads," in *Proc. ISPASS*, Mar. 2014, pp. 190–200.
- [8] M. M. S. Aly *et al.*, "Energy-efficient abundant-data computing: The N3XT 1,000x," *Computer*, vol. 48, no. 12, pp. 24–33, Dec. 2015.
- [9] R. Karam, R. Puri, S. Ghosh, and S. Bhunia, "Emerging trends in design and applications of memory-based computing and content-addressable memories," *Proc. IEEE*, vol. 103, no. 8, pp. 1311–1330, Aug. 2015.
- [10] K. Sun, "Adaptive step-size motion estimation based on statistical sum of absolute differences," U.S. Patent 6014181, Jan. 11, 2000.
- [11] S. Salahuddin and S. Datta, "Use of negative capacitance to provide voltage amplification for low power nanoscale devices," *Nano Lett.*, vol. 8, no. 2, pp. 405–410, 2007.
- [12] M. Trentzsch *et al.*, "A 28 nm HKMG super low power embedded NVM technology based on ferroelectric FETs," in *IEDM Tech. Dig.*, Dec. 2016, pp. 11.5.1–11.5.4.
- [13] K.-S. Li *et al.*, "Sub-60 mV-swing negative-capacitance FinFET without hysteresis," in *IEDM Tech. Dig.*, Dec. 2015, pp. 6–22.
- [14] P. Sharma *et al.*, "Impact of total and partial dipole switching on the switching slope of gate-last negative capacitance FETs with ferroelectric hafnium zirconium oxide gate stack," in *Proc. Symp. VLSI Technol.*, 2017, pp. T154–T155.
- [15] A. Aziz, S. Ghosh, S. Dutta, and S. K. Gupta, "Physics-based circuit-compatible SPICE model for ferroelectric transistors," *IEEE Electron Device Lett.*, vol. 37, no. 6, pp. 805–808, Jun. 2016.
- [16] M. H. Lee *et al.*, "Prospects for ferroelectric HfZrO<sub>x</sub> FETs with experimentally CET = 0.98 nm, SS<sub>for</sub> =42 mV/dec, SS<sub>rev</sub> = 28 mV/dec, switch-off <0.2 V, and hysteresis-free strategies," in *IEDM Tech. Dig.*, Dec. 2015, pp. 5–22.
- [17] Y. Katoh, S. Fujieda, Y. Hayashi, and T. Kunio, "Non-volatile FCG (ferroelectric-capacitor and transistor-gate connection) memory cell with non-destructive read-out operation," in *Symp. VLSI Technol. Dig. Tech. Papers*, 1996, pp. 56–57.
- [18] J. Li *et al.*, "1 Mb 0.41 $\mu$m<sup>2</sup> 2T-2R cell nonvolatile TCAM with two-bit encoding and clocked self-referenced sensing," *IEEE J. Solid-State Circuits*, vol. 49, no. 4, pp. 896–907, Apr. 2014.
- [19] L. Xue, Y. Cheng, J. Yang, P. Wang, and Y. Xie, "ODESY: A novel 3T-3MTJ cell design with optimized area density, scalability and latency," in *Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD)*, Nov. 2016, pp. 1–8.
- [20] C. J. Lin *et al.*, "45 nm low power CMOS logic compatible embedded STT MRAM utilizing a reverse-connection 1T/1Mtj cell," in *IEDM Tech. Dig.*, 2009, pp. 1–4.
- [21] S. George *et al.*, "Nonvolatile memory design based on ferroelectric FETs," in *Proc. 53rd Annu. Design Autom. Conf.*, 2016, p. 118.
- [22] C. W. Yeung, A. I. Khan, A. Sarker, S. Salahuddin, and C. Hu, "Low power negative capacitance FETs for future quantum-well body technology," in *Proc. Int. Symp. VLSI Technol., Syst., Appl. (VLSI-TSA)*, 2013, pp. 1–2.
- [23] A. I. Khan, C. W. Yeung, C. Hu, and S. Salahuddin, "Ferroelectric negative capacitance MOSFET: Capacitance tuning & antiferroelectric operation," in *IEDM Tech. Dig.*, Dec. 2011, pp. 3–11.
- [24] S. Dasgupta *et al.*, "Sub-kT/q switching in strong inversion in PbZr<sub>0.52</sub>Ti<sub>0.48</sub>O<sub>3</sub> gated negative capacitance FETs," *IEEE J. Explor. Solid-State Computat. Devices Circuits*, vol. 1, pp. 43–48, 2015.

- [25] X. Yin, M. Niemier, and X. S. Hu, "Design and benchmarking of ferroelectric FET based TCAM," in *Proc. Design, Automat. Test Eur. Conf. Exhibit. (DATE)*, 2017, pp. 1444–1449.
- [26] X. Yin *et al.*, "Exploiting ferroelectric FETs for low-power non-volatile logic-in-memory circuits," in *Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD)*, Nov. 2016, pp. 1–8.
- [27] A. Aziz *et al.*, "Computing with ferroelectric FETs: Devices, models, systems, and applications," in *Proc. Design, Automat. Test Eur. Conf. Exhibit. (DATE)*, 2018, pp. 1289–1298.
- [28] M. Imani, A. Rahimi, and T. S. Rosing, "Resistive configurable associative memory for approximate computing," in *Proc. Design, Automat. Test Eur. Conf. Exhibit. (DATE)*, 2016, pp. 1327–1332.
- [29] S. George *et al.*, "Device circuit co design of FEFET based logic for low voltage processors," in *Proc. IEEE Comput. Soc. Annu. Symp. VLSI (ISVLSI)*, Jul. 2016, pp. 649–654.
- [30] EE Times. (Apr. 1, 2016). FinFet's Father Forecasts Future. [Online]. Available: http://www.eetimes.com/document.asp?doc\_id=1329333
- [31] A. I. Khan, "Negative capacitance for ultra-low power computing," Ph.D. dissertation, Dept. EECS, Univ. California Berkeley, Berkeley, CA, USA, 2015.
- [32] T. K. Song, "Landau-Khalatnikov simulations for ferroelectric switching in ferroelectric random access memory application," *J. Korean Phys. Soc.*, vol. 46, no. 1, pp. 5–9, 2005.
- [33] R. Vattikonda, W. Wang, and Y. Cao, "Modeling and minimization of PMOS NBTI effect for robust nanometer design," in *Proc. 43rd Annu. Design Autom. Conf.*, 2006, pp. 1047–1052.
- [34] K. Li, Y. Xiong, M. Tang, Y. Qin, Z. Li, and Y. Zhou, "Design and implementation of FeFET-based lookup table," in *Proc. 12th IEEE Int. Conf. Solid-State Integr. Circuit Technol. (ICSICT)*, Oct. 2014, pp. 1–3.
- [35] S. Matsunaga *et al.*, "A 3.14 μm<sup>2</sup> 4T-2MTJ-cell fully parallel TCAM based on nonvolatile logic-in-memory architecture," in *Proc. Symp. VLSI Circuits (VLSIC)*, 2012, pp. 44–45.
- [36] S. Matsunaga, A. Katsumata, M. Natsui, T. Endoh, H. Ohno, and T. Hanyu, "Design of a nine-transistor/two-magnetic-tunnel-junction-cell-based low-energy nonvolatile ternary content-addressable memory," *Jpn. J. Appl. Phys.*, vol. 51, no. 2S, p. 02BM06, 2012.
- [37] M.-F. Chang *et al.*, "A 3T1R nonvolatile TCAM using MLC ReRAM with sub-1 ns search time," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2015, pp. 1–3.
- [38] C.-C. Lin *et al.*, "A 256 b-wordlength ReRAM-based TCAM with 1 ns search-time and 14× improvement in wordlength-energy-efficiency-density product using 2.5T1R cell," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Jan./Feb. 2016, pp. 136–137.
- [39] M. Imani, S. Patil, and T. S. Rosing, "MASC: Ultra-low energy multiple-access single-charge TCAM for approximate computing," in *Proc. Conf. Design, Autom. Test Eur.*, 2016, pp. 373–378.
- [40] A. Mochizuki, H. Kimura, M. Ibuki, and T. Hanyu, "TMR-based logic-in-memory circuit for low-power VLSI," *IEICE Trans. Fundam. Electron., Commun. Comput. Sci.*, vol. 88, no. 6, pp. 1408–1415, 2005.
- [41] E. Deng, Y. Zhang, J.-O. Klein, D. Ravelsona, C. Chappert, and W. Zhao, "Low power magnetic full-adder based on spin transfer torque MRAM," *IEEE Trans. Magn.*, vol. 49, no. 9, pp. 4982–4987, Sep. 2013.
- [42] Z. Wang *et al.*, "A physics-based compact model of ferroelectric tunnel junction for memory and logic design," *J. Phys. D, Appl. Phys.*, vol. 47, no. 4, p. 045001, Dec. 2013.
- [43] S. Matsunaga *et al.*, "Fabrication of a nonvolatile full adder based on logic-in-memory architecture using magnetic tunnel junctions," *Appl. Phys. Express*, vol. 1, no. 9, p. 091301, Aug. 2008.
- [44] S. Matsunaga *et al.*, "MTJ-based nonvolatile logic-in-memory circuit, future prospects and issues," in *Proc. Design, Automat. Test Eur.*, 2009, pp. 433–435.
- [45] S. Jeloka, N. B. Akesh, D. Sylvester, and D. Blaauw, "A 28 nm configurable memory (TCAM/BCAM/SRAM) using push-rule 6T bit cell enabling logic-in-memory," *IEEE J. Solid-State Circuits*, vol. 51, no. 4, pp. 1009–1021, Apr. 2016.
- [46] S. G. Narendra, L. C. Fujino, and K. C. Smith, "Through the looking glass? The 2015 edition: Trends in solid-state circuits from ISSCC," *IEEE Solid-State Circuits Mag.*, vol. 7, no. 1, pp. 14–24, Feb. 2015.
- [47] M. W. Allam and M. I. Elmasry, "Dynamic current mode logic (DyCML): A new low-power high-performance logic style," *IEEE J. Solid-State Circuits*, vol. 36, no. 3, pp. 550–558, Mar. 2001.
- [48] L. Wanhammar, *DSP Integrated Circuits*. New York, NY, USA: Academic, 1999.

- [49] E. Deng, Z. Wang, J.-O. Klein, G. Prenat, B. Dieny, and W. Zhao, "High-frequency low-power magnetic full-adder based on magnetic tunnel junction with spin-Hall assistance," *IEEE Trans. Magn.*, vol. 51, no. 11, Nov. 2015, Art. no. 1401704.
- [50] Y. Bi, K. Shamsi, J.-S. Yuan, Y. Jin, M. Niemier, and X. S. Hu, "Tunnel FET current mode logic for DPA-resilient circuit designs," *IEEE Trans. Emerg. Topics Comput.*, vol. 5, no. 3, pp. 340–352, Jul./Sep. 2017.
- [51] T. Winograd, H. Salmani, H. Mahmoodi, K. Gaj, and H. Homayoun, "Hybrid STT-CMOS designs for reverse-engineering prevention," in *Proc. 53rd ACM/EDAC/IEEE Design Automat. Conf. (DAC)*, Jun. 2016, pp. 1–6.
- [52] Y. Bi *et al.*, "Emerging technology-based design of primitives for hardware security," *ACM J. Emerg. Technol. Comput. Syst.*, vol. 13, no. 1, p. 3, 2016.
- [53] Y. Bi, X. S. Hu, Y. Jin, M. Niemier, K. Shamsi, and X. Yin, "Enhancing hardware security with emerging transistor technologies," in *Proc. 26th Ed. Great Lakes Symp. VLSI*, 2016, pp. 305–310.
- [54] A. Chen, X. S. Hu, Y. Jin, M. Niemier, and X. Yin, "Using emerging technologies for hardware security beyond PUFs," in *Proc. Design, Automat. Test Eur.*, 2016, pp. 1544–1549.
- [55] D. Williamson, "ARM Cortex-A8: A high-performance processor for low-power applications," in *Unique Chips and Systems*. Boca Raton, FL, USA: CRC Press, 2007, pp. 91–118.
- [56] A. Rahimi, A. Ghofrani, K.-T. Cheng, L. Benini, and R. K. Gupta, "Approximate associative memory for energy-efficient GPUs," in *Proc. Design, Automat. Test Eur. Conf.*, 2015, pp. 1497–1502.
- [57] M. Poremba, S. Mittal, D. Li, J. S. Vetter, and Y. Xie, "DESTINY: A tool for modeling emerging 3D NVM and eDRAM caches," in *Proc. Design, Automat. Test Eur. Conf. Exhibit.*, 2015, pp. 1543–1546.
- [58] H. J. Chao and B. Liu, *High Performance Switches and Routers*. Hoboken, NJ, USA: Wiley, 2007.
- [59] M. Imani, S. Patil, and T. Rosing, "Approximate computing using multiple-access single-charge associative memory," *IEEE Trans. Emerg. Topics Comput.*, vol. 6, no. 3, pp. 305–316, Jul./Sep. 2018.
- [60] Computational Vision at CalTech. Accessed: Jan. 8, 2016. [Online]. Available: http://www.vision.caltech.edu/Image\_Datasets/Caltech1
- [61] F. Ren and D. Marković, "True energy-performance analysis of the MTJ-based logic-in-memory architecture (1-bit full adder)," *IEEE Trans. Electron Devices*, vol. 57, no. 5, pp. 1023–1028, May 2010.
- [62] "Design rule manual for CMOS 40 nm," STMicroelectron., Crolles, France, Tech. Rep. C40, 2012.
- [63] K. Chatterjee *et al.*, "Self-aligned, gate last, FDSOI, ferroelectric gate memory device with 5.5-nm Hf<sub>0.8</sub>Zr<sub>0.2</sub>O<sub>2</sub>, high endurance and breakdown recovery," *IEEE Electron Device Lett.*, vol. 38, no. 10, pp. 1379–1382, 2017.
- [64] C.-H. Cheng and A. Chin, "Low-leakage-current DRAM-like memory using a one-transistor ferroelectric MOSFET with a Hf-based gate dielectric," *IEEE Electron Device Lett.*, vol. 35, no. 1, pp. 138–140, Jan. 2014.



Xunzhao Yin (S'16) received the B.S. degree in electronic engineering from Tsinghua University, Beijing, China, in 2013. He is currently working toward the Ph.D. degree at the Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN, USA.

Since 2015, he has been a Research Assistant at the University of Notre Dame, where he is currently a member of the Center for Low Energy System Technology and the Applications and Systems Driven Center for Energy Efficient Integrated Nanotechnologies, and where he is involved in novel circuits and systems based on beyond-CMOS technologies. His current research interests include low-power circuit design and novel computing paradigms with both CMOS and emerging technologies.



Xiaoming Chen (S'12–M'15) received the B.S. and Ph.D. degrees in electronic engineering from Tsinghua University, Beijing, China, in 2009 and 2014, respectively.

From 2016 to 2017, he was a Visiting Assistant Professor with the Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN, USA. He is currently an Associate Professor at the State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. His current research interests include hardware security, GPU-accelerated machine learning, and emerging nonvolatile devices.

Dr. Chen was a recipient of the 2015 EDAA Outstanding Dissertation Award.



Michael Niemier (S'00–M'03–SM'11) received the B.S., M.S., and Ph.D. degrees in computer engineering from the University of Notre Dame, Notre Dame, IN, USA, in 1998, 2000, and 2004, respectively.

He was a National Science Foundation Graduate Research Fellow with the University of Notre Dame. He was a Faculty Member with the Georgia Institute of Technology, Atlanta, GA, USA. He is currently an Associate Professor at the University of Notre Dame. His current research interests include designing, facilitating, and evaluating architectures for emerging technologies with a current emphasis on emerging transistor technologies.

Dr. Niemier is an active member of the program committees for DAC, DATE, ICCAD, and so on. He served as the Chair for the emerging technologies track at DAC, DATE, ICCAD, and so on. He was a recipient of the IBM Faculty Award, the Best Paper Award at the IEEE Symposium on Nanoscale Architectures in 2009, and the Joyce Award for Excellence in Teaching at the University of Notre Dame in 2014.



Xiaobo Sharon Hu (S'85–M'89–SM'02–F'16) received the B.S. degree from Tianjin University, Tianjin, China, in 1982, the M.S. degree from the Polytechnic Institute of New York, New York, NY, USA, in 1984, and the Ph.D. degree from Purdue University, West Lafayette, IN, USA in 1989.

She is currently a Professor at the Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN, USA. Her current research interests include computing with beyond-CMOS technologies, low-power system design, and cyber-physical systems. She has authored or coauthored more than 300 papers in the related areas.

Dr. Hu is the General Chair for the 2018 Design Automation Conference and was the Program Chair and TPC Chair of 2016 DAC and 2015 DAC, respectively. She was a recipient of the NSF CAREER Award in 1997 and the Best Paper Awards from the Design Automation Conference in 2001 and the IEEE Symposium on Nanoscale Architectures in 2009. She served as an Associate Editor for the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, ACM Transactions on Design Automation of Electronic Systems, and ACM Transactions on Embedded Computing. She serves as an Associate Editor for ACM Transactions on Cyber-Physical Systems.