











Sapan Agarwal, Alexander Hsia, Robin Jacobs-Gedrim, David R. Hughart, Steven J. Plimpton, Conrad D. James, Matthew J. Marinella

Sandia National Laboratories



Hardware Acceleration of Adaptive Neural Algorithms



Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA0003525.

# The Von Neumann Bottleneck





# Use Resistive Memories for Local Computation

Ohm's law performs multiplication: $V = I \times R$, or equivalently $I = G \times V$



- A resistive memory, or ReRAM, is a programmable resistor
  - Applying a small voltage allows the conductance to be read: I = G × V
  - Applying a large voltage changes the resistance



## Directly Process in the Memory Itself





Analog circuits combine computation and data access efficiently and naturally

Effectively, large-scale processing in memory with a multiplier and adder at each real-valued memory location
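A minimal NumPy sketch of this idea, assuming ideal devices and illustrative conductance and voltage values that do not come from the slides: each device multiplies through Ohm's law, and each column wire adds the resulting currents, so a single read yields a vector-matrix product.

```python
import numpy as np

# Illustrative conductances in siemens (assumed values):
# N = 3 rows (inputs) by M = 2 columns (outputs).
G = np.array([[1.0e-6, 2.0e-6],
              [3.0e-6, 4.0e-6],
              [5.0e-6, 6.0e-6]])

# Read voltages applied to the N rows, in volts (assumed values).
V = np.array([0.1, 0.2, 0.3])

# Multiplier at each location: Ohm's law gives I_ij = G_ij * V_i.
device_currents = G * V[:, None]

# Adder on each column wire: the device currents sum along the column.
column_currents = device_currents.sum(axis=0)

# The same result written as a single vector-matrix multiply.
assert np.allclose(column_currents, V @ G, rtol=1e-9, atol=0.0)
print(column_currents)
```

The sketch ignores wire resistance, noise, and read nonlinearity, all of which matter in a real array.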

# Crossbars Can Perform Parallel Reads and Writes



Energy to charge the crossbar is $CV^2$: $E \propto C \propto$ number of ReRAMs $\propto N \times M$

 $E \sim O(N \times M)$ 

# SRAM Arrays Require Charging Columns Multiple Times



SRAMs must be read one row at a time, charging M columns.

Each column wire length is O(N).

Energy = N rows × M columns × O(N) wire length

Energy ~ $O(N^2 \times M)$

O(N) times worse than a crossbar!
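The scaling argument can be checked with a few lines of Python. This is only a sketch of the asymptotic counts: the function names are hypothetical and every proportionality constant is assumed to be 1, so only the ratio is meaningful.

```python
def crossbar_energy(n_rows, n_cols):
    """Relative energy to charge an N x M crossbar once: E ~ N * M."""
    return n_rows * n_cols


def sram_energy(n_rows, n_cols):
    """Relative energy to read an N x M SRAM one row at a time:
    N row reads, each charging M columns whose wire length is O(N)."""
    return n_rows * n_cols * n_rows


for n in (256, 1024, 4096):
    ratio = sram_energy(n, n) / crossbar_energy(n, n)
    print(f"N = M = {n}: SRAM / crossbar energy ratio ~ {ratio:.0f}")
```

The ratio grows linearly with N, which is the O(N) advantage claimed for the crossbar.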

# Want To Accelerate Many Different Neural Algorithms

- Backpropagation
- Sparse Coding
- Liquid State Machine








### **General Purpose Neural Architecture**



The analog crossbar handles the $O(N^2)$ operations, leaving only $O(N)$ operations for the digital core

Train based on input vectors

# Can Run Neural Networks on this Architecture







### **Back Propagation**





# Design & Model Detailed Architecture



[Figure: neural core circuit schematic; legible labels include the comparator (cmp), reset switch, and integrating capacitor $C_f$.]

# **Row & Column Driver Circuitry**









#### Array driver pass transistors



## **Compare Architectures**



1024 × 1024 = 1M array; totals are summed over one training cycle of 3 operations:

- Vector Matrix Multiply
- Matrix Vector Multiply
- Outer Product Update
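To make the three operations concrete, here is a hedged NumPy sketch of one training step for a single linear layer with a squared-error loss. The layer sizes, learning rate, and data are illustrative assumptions, and the arrays stand in for an ideal, noise-free crossbar.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 4, 3
W = rng.normal(scale=0.1, size=(n_in, n_out))  # weights stored in the crossbar
x = np.array([0.5, -1.0, 0.25, 1.0])           # layer input (assumed values)
target = np.array([0.0, 1.0, 0.0])             # desired output (assumed values)
learning_rate = 0.1

# 1) Vector-matrix multiply: forward pass, O(N^2) work done by the array.
y = x @ W

# Output error: O(N) work left for the digital core.
delta_out = y - target
loss_before = 0.5 * np.sum(delta_out ** 2)

# 2) Matrix-vector multiply: send the error back through the transpose.
delta_in = W @ delta_out

# 3) Outer-product update: every weight changes in one parallel step.
W -= learning_rate * np.outer(x, delta_out)

loss_after = 0.5 * np.sum((x @ W - target) ** 2)
print(loss_before, loss_after)
```

Each numbered step touches all N × M weights, while the error computation between them is only a vector operation.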



\*\*\*Requires 100 MΩ on-state devices

### **Neural Core Energy Analysis**





# Multiscale Model of a Neural Training Accelerator





# **CROSS SIM**

#### https://cross-sim.sandia.gov

[Screenshot: CrossSim (Crossbar Simulator) website]

#### About CrossSim

CrossSim is a crossbar simulator […] crossbars for both neuromorph[…] digital memories. It provides a clean Python API so that different algorithms can be built upon cr[…] properties and variability. The […] fast approximate numerical models including both analytic noise models as well as experimentally derived lookup tables. A slower, but more accurate circuit simulation of the devices using the parallel SPICE simulator Xyce is also being developed and will be included in a future release.

#### Download

- Download the user manual here: CrossSim\_manual.pdf
- Download CrossSim v0.2 here: cross sim-0.2.0.tar.gz
- Download example scripts here: examples.tar.gz

#### Contact Us

Please email Sapan Agarwal for any questions or if you would like to contribute to the source code: sagarwa@sandia.gov

#### Selected Publications Using CrossSim

- S. Agarwal, R. B. Jacobs-Gedrim, A. H. Hsia, D. R. Hughart, E. J. Fuller, A. A. Talin, C. D. James, S. J. Plimpton, and M. J. Marinella, "Achieving Ideal Accuracies in Analog Neuromorphic Computing Using Periodic Carry," in 2017 IEEE Symposium on VLSI Technology, Kyoto, Japan, 2017.
- Y. van de Burgt, E. Lubberman, E. J. Fuller, S. T. Keene, G. C. Faria, S. Agarwal, M. J. Marinella, A. Al…

#### Simple Python API:

result = neural\_core.run\_xbar\_mvm(vector) # Do a matrix vector multiplication



# Simple API to model crossbars



neural\_core = MakeCore(params=params) # Create a neural\_core object that models a crossbar

neural\_core.set\_matrix(weights) # set the initial weights
result = neural\_core.run\_xbar\_vmm(vector) # Do a vector matrix multiply
result = neural\_core.run\_xbar\_mvm(vector) # Do the transpose, a matrix vector mult.
neural\_core.update\_matrix(vector1,vector2) # Do an outer product update
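For readers without CrossSim installed, the four calls above can be imitated with a small NumPy class. This is a hypothetical stand-in, not the CrossSim implementation: the class name `IdealCore`, the noise-free behavior, and the orientation conventions (which side the vector multiplies from, and the order of the outer product) are all assumptions made for illustration.

```python
import numpy as np


class IdealCore:
    """Hypothetical noise-free stand-in that mirrors the call names on the slide."""

    def __init__(self):
        self.matrix = None

    def set_matrix(self, weights):
        # Store a private copy of the initial weights.
        self.matrix = np.array(weights, dtype=float)

    def run_xbar_vmm(self, vector):
        # Vector-matrix multiply (assumed convention: vector @ matrix).
        return np.asarray(vector, dtype=float) @ self.matrix

    def run_xbar_mvm(self, vector):
        # Transpose operation (assumed convention: matrix @ vector).
        return self.matrix @ np.asarray(vector, dtype=float)

    def update_matrix(self, vector1, vector2):
        # Outer-product update (assumed order: rows from vector1, columns from vector2).
        self.matrix += np.outer(vector1, vector2)


core = IdealCore()
core.set_matrix([[1.0, 2.0], [3.0, 4.0]])
vmm_result = core.run_xbar_vmm([1.0, 0.0])
mvm_result = core.run_xbar_mvm([1.0, 0.0])
core.update_matrix([1.0, 0.0], [0.0, 1.0])
print(vmm_result, mvm_result, core.matrix)
```

A stand-in like this is useful only for checking algorithm logic; the point of CrossSim is the device noise and nonlinearity that this sketch leaves out.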

### Go from Measurement to Accuracy




# Multi-ReRAM Synapse: Periodic Carry

If we need more bits per synapse, use multiple memristors

- Three 10-level ReRAMs could represent 1000 levels!
- Adding to the weight requires reading every ReRAM to account for any carries, then serially programming each ReRAM: VERY EXPENSIVE



- Use >10 levels to represent a base 10 system
- Ignore carry and program the crossbar in parallel.
- Periodically (once every few hundred cycles) read the ReRAM and perform the carry
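The scheme above can be sketched in plain Python. This is a simplified illustration under stated assumptions: integer device levels, non-negative weights, a hypothetical headroom of 16 levels per device (`MAX_LEVEL = 15`), and a short carry period chosen to keep the trace readable rather than the few hundred cycles mentioned on the slide.

```python
BASE = 10        # digits are interpreted in base 10
MAX_LEVEL = 15   # assumed: each device has more than 10 levels of headroom


def weight_value(digits):
    """Weight encoded by the devices, least-significant digit first."""
    return sum(d * BASE ** i for i, d in enumerate(digits))


def parallel_update(digits, increments):
    """Add to every device at once, ignoring carries; clip at device limits."""
    return [min(max(d + inc, 0), MAX_LEVEL) for d, inc in zip(digits, increments)]


def carry(digits):
    """Read each device and move any overflow into the next higher digit."""
    digits = list(digits)
    for i in range(len(digits) - 1):
        overflow = digits[i] // BASE
        digits[i] -= overflow * BASE
        digits[i + 1] += overflow
    return digits


digits = [0, 0, 0]
carry_period = 5
for step in range(1, 38):                        # 37 unit updates
    digits = parallel_update(digits, [1, 0, 0])  # +1 on the lowest digit only
    if step % carry_period == 0:
        digits = carry(digits)                   # occasional serial read and carry
digits = carry(digits)
print(digits, weight_value(digits))
```

The extra levels above 9 are what let the lowest device absorb updates between carries without saturating; the carry itself never changes the encoded weight, only how it is distributed across devices.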



### Periodic Carry Compensates for Write Noise





Read and reset every 100 pulses

Do 300,000 small (0.02% of weight range) updates

net of 1500 positive training pulses

Noise Sigma = 1.4% for single device

- (from  $\sigma_{noise}/G_{range} = 0.1\sqrt{\Delta G/G_{range}}$  )
- Write noise applied during updates and carries

Learn from a 0.5% Signal
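The numbers on this slide can be tied together with a short calculation. The 0.5% signal follows directly from the pulse counts. For the noise figure, the slide does not state which per-device update size was used; a step of 2% of the device range is assumed below only because it reproduces the quoted 1.4%.

```python
import math

total_updates = 300_000   # small updates applied in total
net_positive = 1_500      # net positive training pulses

# Fraction of the updates that carry signal.
signal_fraction = net_positive / total_updates


def write_noise_sigma(delta_g_fraction):
    """sigma_noise / G_range = 0.1 * sqrt(delta_G / G_range), as given on the slide."""
    return 0.1 * math.sqrt(delta_g_fraction)


# Assumed per-device step: 2% of the device conductance range.
sigma_single_device = write_noise_sigma(0.02)

print(f"signal fraction: {signal_fraction:.3%}")
print(f"noise sigma for a 2% step: {sigma_single_device:.2%}")
```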

### Periodic Carry Mitigates Write Nonlinearity



# TaO<sub>x</sub> Results





A/D and D/A are modeled; serial operations are modeled

- When resetting a weight, need to adjust the pulse size based on the current state to compensate for nonlinearity
- When reading a single weight, need to adjust the readout range to be smaller (change the capacitor on the integrator)

### Li-Ion Synaptic Transistor for Analog Computation (LISTA)





E. J. Fuller, et al, "Li-Ion Synaptic Transistor for Low Power Analog Computing," *Advanced Materials,* vol. 29, no. 4, p. 1604310, 2017.

### Summary



- Fundamental O(N) energy scaling advantage
- Use CrossSim to co-design from materials to algorithms
  - Use periodic carry to overcome noisy devices
- Need high-resistance (10-100 MΩ) devices
- Need low write nonlinearities

## **CROSS SIM**

https://cross-sim.sandia.gov



### **Extra Slides**



### **Overcoming the Power Limit**

Package Substrate





**Integrate Processing and Memory** 

### The Noise Limited Energy to Read a Crossbar Column is Independent of Crossbar Size





$$\text{Thermal Noise} = \left\langle \Delta I^2 \right\rangle = N \times \left( 4k_b T \times G_o \times \Delta f \right)$$

$$SNR^{2} = \frac{(N I_{o})^{2}}{\left\langle \Delta I^{2} \right\rangle}$$

$$\frac{1}{\Delta f} = 4k_{b}T \times SNR^{2} \times \frac{1}{V^{2} G_{o} \times N}$$

Measure N resistors and determine the total output current with some signal to noise ratio (SNR)<sup>\*</sup>

What is the minimum energy?

$$Energy = V^2 G_o \times N \times \frac{1}{\Delta f}$$

Power in each resistor × number of resistors

Determined by noise and SNR

If we double the number of resistors, we can double the speed to get the same energy and SNR.

This is because the noise scales as sqrt(N) while the signal scales as N

$$Energy = 4k_bT \times SNR^2$$

\*we are assuming we need some fixed precision on the output, and don't need full floating point accuracy
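A quick numerical check of this result, written as a sketch: the function name and the default read voltage, device conductance, and temperature are illustrative assumptions, and the single-device current is taken as $I_o = V \times G_o$. The bandwidth is solved from the SNR expression, and the energy is the total power divided by that bandwidth.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K


def column_read_energy(n, snr, v=0.1, g_o=1e-8, temp=300.0):
    """Noise-limited energy to read N resistors at a fixed SNR.

    v (volts), g_o (siemens) and temp (kelvin) are assumed example values.
    """
    total_power = v ** 2 * g_o * n  # power in each resistor x number of resistors
    # From SNR^2 = (N*I_o)^2 / (N * 4*k_b*T*G_o*df) with I_o = v * g_o:
    bandwidth = v ** 2 * g_o * n / (4 * K_B * temp * snr ** 2)
    return total_power / bandwidth  # energy = power x measurement time (1/df)


for n in (10, 100, 1000):
    print(n, column_read_energy(n, snr=100.0))
```

The printed energy is the same for every N and equals $4k_bT \times SNR^2$, because the allowed bandwidth grows in proportion to N.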

## Experimental Device Non-idealities

Device: Write Variability, Write Nonlinearity, Asymmetry, Read Noise

Circuit: A/D, D/A noise, parasitics







A/D and D/A are modeled; serial operations are modeled

- When resetting weight, need to adjust pulse size based on current state to compensate for nonlinearity
- When reading a single weight, need to adjust readout range to be smaller (change capacitor on the integrator)

# LISTA Results





[Figure: results plotted against Training Epoch (0 to 40).]

- Carry once every 1000 updates
- Use a single device per weight and subtract a reference current

### **Neural Core Latency Analysis**




### Neural Core Area Analysis

