

Storage Developer Conference September 22-23, 2020

## Analog Memory-Based Techniques for Accelerating Deep Neural Networks

Hsinyu (Sidney) Tsai IBM Almaden Research Center – San Jose, CA USA

#### Outline

- Introduction: Deep Neural Networks (DNN) and Analog Memory
- Phase Change Memory for DNN Training and Inference
- Energy Efficiency for Analog Memory-Based Techniques
- Summary


## The Rise of AI and DNN

- The rise of AI relied on improving algorithms, abundant data, and accelerating hardware.
- Deep neural networks (DNNs), a major field of AI, have surpassed human-level accuracy in some tasks and outperformed most rule-based models.



[Figure: example DNN applications (object detection, speech transcription, language translation) and the three drivers of AI progress (algorithms, data, hardware)]

2020 Storage Developer Conference. © IBM Research. All Rights Reserved.

#### AI Hardware: from Today to the Future

[Figure: AI hardware roadmap; panel labels: Forward (TPU1), Today, Near-term, Future...?]


## **Deep Neural Network (DNN)**




#### **Data Movement Cost**



Analog in-memory computing offers better energy efficiency and throughput




#### Hardware implementation



- How do we map a neural network onto an NVM crossbar array?




#### Hardware implementation




The weight is encoded with a conductance pair (G<sup>+</sup>, G<sup>-</sup>)



#### Hardware implementation



The product x<sub>i</sub> w<sub>ij</sub> is obtained using Ohm's Law




#### Hardware implementation



The sum is performed using Kirchhoff's current law
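The two laws above combine into an analog multiply-accumulate. The following is a minimal Python sketch, assuming ideal devices (no wire resistance, noise, or non-linearity). The function name `crossbar_mac` and all conductance values are illustrative and not taken from the talk.

```python
def crossbar_mac(voltages, g_plus, g_minus):
    """Return one output current per column of a differential crossbar.

    voltages: input activations x_i encoded as row voltages (volts).
    g_plus, g_minus: conductance matrices (siemens), indexed [row][column].
    """
    n_cols = len(g_plus[0])
    currents = []
    for j in range(n_cols):
        # Ohm's law gives each device current V_i * G_ij; Kirchhoff's
        # current law sums those currents along the shared column wire.
        i_plus = sum(v * row[j] for v, row in zip(voltages, g_plus))
        i_minus = sum(v * row[j] for v, row in zip(voltages, g_minus))
        # The weight is the difference of the pair, so subtract the currents.
        currents.append(i_plus - i_minus)
    return currents


x = [0.2, 0.1]                            # row voltages (illustrative)
g_plus = [[3e-6, 1e-6], [2e-6, 4e-6]]     # G+ devices (illustrative)
g_minus = [[1e-6, 2e-6], [2e-6, 1e-6]]    # G- devices (illustrative)
column_currents = crossbar_mac(x, g_plus, g_minus)
print(column_currents)
```

Every column is evaluated from the same row voltages, which is why the whole matrix-vector product takes one read step in the array.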


## **Analog Memory Options (Examples)**

- No ideal device exists for analog computing, but there are many promising candidates.




## Why Phase Change Memory (PCM)?

- Mature memory technology (large-scale demos & products)
- Large resistance contrast (allows more analog states)
- Much longer endurance than Flash
- Good physical understanding of device non-idealities, such as conductance drift



#### Outline

- Introduction: Deep Neural Networks (DNN) and Analog Memory
- Phase Change Memory for DNN Training and Inference
- Energy Efficiency for Analog Memory-Based Techniques
- Summary


## For Training: Achieving Software Accuracy



**Problem:** Conductance changes in PCM are ...

- Uni-directional
- Stochastic
- Non-linear $\rightarrow$ asymmetric



What do we really want? For <u>training</u>

- Gentle, symmetric conductance changes



#### Our published results in DNN training w/ PCM on MNIST dataset

- **2014** – IEDM $\rightarrow$ **82%** w/ "mixed-hardware-software" experiment
- **2018** – *Nature* $\rightarrow$ **98%** (i.e., software-equivalent) w/ new unit-cell


#### **Accuracy on MNIST Datasets**

 Software-equivalent training accuracy achieved with 2T2R+3T1C unit cell and "polarity inversion" technique



$$W = F \times (G^+ - G^-) + g^+ - g^-$$

Here $(G^+, G^-)$ is the More Significant Pair and $(g^+, g^-)$ is the Less Significant Pair.

- Symmetry → Weight update performed on g<sup>+</sup>
  - g<sup>-</sup> shared among many columns
- Dynamic Range → Gain factor F (e.g., F = 3)
- Non-Volatility → Weight transferred to PCMs infrequently (every 1000s of images)
- "CMOS variabilities" → Counteracted by "Polarity Inversion" technique

S. Ambrogio et al., Nature, 558, 60 (2018)
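The weight formula on this slide can be checked numerically. The following is a minimal sketch in arbitrary conductance units; `effective_weight` and the sample values are hypothetical, and F = 3 is the example gain given above.

```python
def effective_weight(G_plus, G_minus, g_plus, g_minus, F=3.0):
    """W = F * (G+ - G-) + g+ - g-, in arbitrary conductance units."""
    return F * (G_plus - G_minus) + g_plus - g_minus


# Illustrative values only: a weight before and after a small update on g+.
w_before = effective_weight(10.0, 4.0, 2.0, 1.5)
w_after = effective_weight(10.0, 4.0, 2.25, 1.5)
print(w_before, w_after, w_after - w_before)
```

A step applied to g<sup>+</sup> changes W by the same amount, while the same step applied to G<sup>+</sup> would be multiplied by F. This is consistent with using the less significant pair for gentle updates and the more significant pair for dynamic range.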


## Transfer Learning: ImageNet to CIFAR-10/100



#### **For Inference: Addressing PCM Non-Idealities**



- Stochastic
- Non-linear $\rightarrow$ asymmetric

What do we really want?

For training

- Gentle, symmetric conductance changes

For <u>inference</u>

- Precise tuning
- High yield
- No change over time

[Figure: plot with x-axis "Number of pulses"]

#### **Our recent results in DNN inference w/ PCM**

- **Jan 2019** – Adv. Electr. Mater. $\rightarrow$ programming schemes for 4 PCM devices
- **June 2019** – VLSI Tech. Symp. $\rightarrow$ software-equivalence in "mixed-hardware-software" experiment for Long Short-Term Memory (LSTM)
- **Dec 2019** – *IEDM* $\rightarrow$ effects of PCM "resistance drift" on DNN accuracy




## **Programming of Multi-PCM Weights**






#### Closed Loop Parallel Programming with Device Variation

An iterative, closed-loop RESET scheme for programming phase change memory (PCM) is more tolerant of device-to-device variations





H. Tsai, VLSI Symposium, T82-83 (2019)
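The general idea of closed-loop programming can be illustrated with a generic program-and-verify loop. This is a toy sketch and not the iterative RESET scheme from the cited paper: the device model (a hypothetical per-device `gains` factor standing in for device-to-device variation), the function name, and all values are assumptions.

```python
def program_closed_loop(targets, gains, tolerance=0.05, max_iters=20):
    """Program-and-verify: read every device, pulse only those out of range.

    targets: desired conductances (arbitrary units).
    gains: hypothetical per-device response factors; a device with gain 0.5
           moves only half as far as requested, one with 1.3 overshoots.
    Returns the final conductances and the number of iterations used.
    """
    conductances = [0.0] * len(targets)
    for iteration in range(max_iters):
        # Verify step: find devices still outside the tolerance band.
        pending = [k for k, (g, t) in enumerate(zip(conductances, targets))
                   if abs(g - t) > tolerance]
        if not pending:
            return conductances, iteration
        # Program step: in hardware these pulses would be applied in parallel.
        for k in pending:
            requested = targets[k] - conductances[k]
            conductances[k] += gains[k] * requested
    return conductances, max_iters


targets = [1.0, 2.0, 0.5]     # illustrative targets
gains = [0.6, 1.0, 1.3]       # illustrative device variation
conductances, n_iters = program_closed_loop(targets, gains)
print(conductances, n_iters)
```

Because each device is re-read before it is pulsed again, weak and strong devices both end up inside the tolerance band; they simply need different numbers of pulses.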

## Mapping of Multi-PCM Weights

 Mapping weights to 4 phase change memory (PCM) devices improves resilience to write noise and conductance saturation



## Long Short-Term Memory (LSTM)

- Long Short-Term Memory (LSTM) networks are used extensively for sequence modeling, e.g., speech recognition and translation
- LSTMs consist mostly of fully connected layers, which are well suited for analog acceleration



Long Short-Term Memory (LSTM) cell

## Language Modeling with LSTM

- Task: Predict the probability of the next character or word
- Training is supervised, but no labeling is needed
- Performance is measured by cross-entropy loss or perplexity
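The two metrics are directly related: perplexity is the exponential of the cross-entropy when the loss is measured in nats. A minimal sketch with made-up probabilities; the function names are illustrative.

```python
import math


def cross_entropy(probs_of_correct):
    """Mean negative log-likelihood (nats) assigned to the correct next tokens."""
    return -sum(math.log(p) for p in probs_of_correct) / len(probs_of_correct)


def perplexity(probs_of_correct):
    """Perplexity is exp(cross-entropy) for a loss measured in nats."""
    return math.exp(cross_entropy(probs_of_correct))


# Made-up example: the model gives each correct token a probability of 0.25,
# which is as uncertain as a uniform choice among four options.
probs = [0.25, 0.25, 0.25, 0.25]
print(cross_entropy(probs), perplexity(probs))
```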



[Figure: network architecture with embedding layer, LSTM layer 1, LSTM layer 2, and output layer]

## Mixed Hardware-Software Experiments with Long Short-Term Memory (LSTM)

 Software-equivalent accuracy was achieved on commonly used language modeling benchmarks, with 2.5M PCM devices in weights



H. Tsai, VLSI Symposium, T82-83 (2019)

#### **Conductance Drift**

- As the amorphous state relaxes, PCM conductance gradually decreases
- PCM drift can be quantified with a power-law time dependence, $G(t) = G(t_0)\,(t/t_0)^{-\nu}$, where $\boldsymbol{\nu}$ is the drift coefficient
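PCM drift is commonly modeled as a power law in time, with ν as the exponent. A minimal sketch under that assumption; the function name, the reference time `t0`, and the sample values (including ν = 0.05) are illustrative and not taken from the talk.

```python
def drifted_conductance(g0, t, nu, t0=1.0):
    """G(t) = G(t0) * (t / t0) ** (-nu), valid for t >= t0."""
    return g0 * (t / t0) ** (-nu)


g0 = 10.0    # conductance at t0, arbitrary units (illustrative)
nu = 0.05    # hypothetical drift coefficient
for t in (1.0, 1e2, 1e4, 1e6):
    print(t, drifted_conductance(g0, t, nu))
```

On a log-log plot this is a straight line with slope -ν, so each decade of elapsed time removes the same fraction of conductance.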




## **Drift Impact and Slope Correction**

- PCM drift causes weights to decrease at rates that differ from device to device, which increases LSTM loss over time
- Slope correction: tuning the slope of the activation function restores the signal
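One plausible reading of slope correction is sketched below for a single tanh neuron: the activation slope is scaled up to undo the average decay caused by drift. The power-law drift model, the choice of the mean ν as the correction, and all names and values are assumptions for illustration.

```python
import math


def neuron(weights, inputs, slope=1.0):
    """tanh activation whose input slope can be tuned."""
    return math.tanh(slope * sum(w * x for w, x in zip(weights, inputs)))


weights = [0.8, 0.5, 0.3]     # illustrative weights
inputs = [1.0, 0.5, 1.0]      # illustrative activations
nus = [0.04, 0.05, 0.06]      # hypothetical per-device drift coefficients
t_ratio = 1e4                 # elapsed time t / t0

# Each weight decays at its own rate.
drifted = [w * t_ratio ** (-nu) for w, nu in zip(weights, nus)]

# The correction compensates only the average decay, not each device.
mean_nu = sum(nus) / len(nus)
slope = t_ratio ** mean_nu

ideal = neuron(weights, inputs)
uncorrected = neuron(drifted, inputs)
corrected = neuron(drifted, inputs, slope)
print(ideal, uncorrected, corrected)
```

The corrected output is close to the ideal one but not identical, because the spread in ν between devices is not removed by a single slope factor.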




#### Impact of PCM Drift on ResNet-18

Impact of drift is much stronger since every weight is reused many times




#### **Dependence on Network Design**

Increasing the number of hidden layers or the size of each hidden layer increases drift resilience, but only IF slope correction is used

ResNet-18 Conv Network for CIFAR10



## Noise-aware DNN Training for Analog HW

- ResNet-34 trained on CIFAR-10 and ImageNet datasets
- Additive noise re-training can improve robustness of model and recover loss in inference accuracy



V. Joshi, Nature Communications (2020)
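The mechanism of additive-noise training can be shown on a toy problem. This sketch is not the procedure from the cited paper: it fits a single weight in y = w·x with SGD, perturbing the weight with Gaussian noise in each forward pass. The function name, noise level, learning rate, and data are all assumptions.

```python
import random


def train_with_weight_noise(data, noise_std=0.1, lr=0.05, epochs=200, seed=0):
    """Fit y = w * x by SGD with Gaussian noise added to w in the forward pass."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        for x, y in data:
            # Noise is injected for this forward pass only.
            w_noisy = w + rng.gauss(0.0, noise_std)
            error = w_noisy * x - y
            # The gradient step is applied to the clean stored weight.
            w -= lr * error * x
    return w


data = [(0.5, 1.0), (1.0, 2.0), (1.5, 3.0)]   # samples of y = 2x
w = train_with_weight_noise(data)
print(w)
```

In this linear toy the noise does not move the optimum, so the example only shows where the noise enters the training loop. In a deep network the same injection pushes training toward weights that tolerate perturbations at inference time.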

#### Outline

- Introduction: Deep Neural Networks (DNN) and Analog Memory
- Phase Change Memory for DNN Training and Inference
- Energy Efficiency for Analog Memory-Based Techniques
- Summary


## Hardware Approach for Energy Efficiency

1) Parallelism is key
2) Avoiding ADC (Analog-to-Digital Conversion) saves time, power & area
3) Do the necessary computations (squashing functions) but be as "approximate" as you can
4) Develop efficient and reconfigurable routing strategies to get vectors of data from the bottom of one array to the edge of the next one

#### AI hardware acceleration with analog memory: micro-architectures for low energy at high speed

This paper presents innovative micro-architectural designs for multi-layer Deep Neural Networks (DNNs) implemented in crossbar arrays of analog memories. Data are transferred in a

IBM J. R&D, IEEE vol. 63, pp. 8:1-8:14, 1 Nov.-Dec. 2019.






#### Where are we on the Roadmap?

- NVIDIA V100: 0.1 TOPs/sec/W
- Google TPU1: 2.3 TOPs/sec/W
  - Inference Only
  - NOT including data movement
- IBM internal Analog designs:
  - MNIST : 15.2 TOPs/sec/W
  - PTB LSTM : 14 TOPs/sec/W
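The figures listed above can be turned into ratios with simple arithmetic. This is only a sketch using the numbers from this slide. The comparison is not like-for-like, since the slide notes that the TPU1 figure is inference-only and excludes data movement. The dictionary keys and the `ratio` helper are illustrative names.

```python
# Energy efficiency in TOPs/sec/W, copied from the list above.
efficiency = {
    "NVIDIA V100": 0.1,
    "Google TPU1": 2.3,
    "IBM analog (MNIST)": 15.2,
    "IBM analog (PTB LSTM)": 14.0,
}


def ratio(a, b):
    """How many times more energy-efficient design a is than design b."""
    return efficiency[a] / efficiency[b]


for analog in ("IBM analog (MNIST)", "IBM analog (PTB LSTM)"):
    for baseline in ("NVIDIA V100", "Google TPU1"):
        print(f"{analog} vs {baseline}: {ratio(analog, baseline):.1f}x")
```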



AI roadmap from IBM AI Hardware Center announcement: www.ibm.com/blogs/research/2019/02/ai-hardware-center/


## How to Improve Energy Efficiency?

1) Reduce average NVM conductance $\rightarrow$ reduces array currents during Multiply-Accumulates

   $\rightarrow$ Current focus of various material and device design efforts

2) Reduce technology node

   **90 nm $\rightarrow$ 14 nm:** benefits just from scaling routing energy

Area efficiency for inference: 10–70 TOPs/sec/mm<sup>2</sup>

(vs. ~0.3 TOPs/sec/mm<sup>2</sup> for TPU v1: "In-Datacenter Performance Analysis of a Tensor Processing Unit")

IBM J. R&D, IEEE vol. 63, pp. 8:1-8:14, 1 Nov.-Dec. 2019.




#### Conclusion

- NVM-based crossbar arrays can accelerate Deep Machine Learning compared to GPUs
  - Multiply-accumulate performed at the data  $\rightarrow$  saves power and time
  - But conventional NVM devices (like PCM) are imperfect...
- Recent training results
  - Mixed-hardware-software experiments → software-equivalent training accuracy (S. Ambrogio et al., *Nature*, 558, 60 (2018))
- Recent inference results
  - Programming strategies for 4-PCM-based weights (C. Mackin et al., *Adv. Electr. Mater.*, 1900026 (2019))
  - Mixed-hardware-software experiments on LSTM (H. Tsai et al., VLSI Tech. Symp. (2019))
  - Impact of resistance drift in PCM (S. Ambrogio et al., IEDM, 6.1 (2019))
- Power projections based on real circuit designs
  - 100x better energy efficiency (+ 100x speedup) on fully-connected layers (for LSTM and other networks) (H.-Y. Chang et al., *IBM J. R&D*, (2019))

#### htsai@us.ibm.com


#### **Acknowledgements**



















- Lewis
- Kohji Hosokawa
- Jeff Burns
- Geoffrey Burr
- Stefano Ambrogio
- Narayanan
- Hsinyu Tsai
- Charles Mackin
- An Chen
- Katie Spoon
- Shelby
- Bulent Kurdi
- Jeff Welser
- Heike Riel
- Dario Gil
- Sudhir Gowda

**Management Support**

- Wilfried Haensch
- Kumar
- Vijay Narayanan

