# Vivienne Sze

MASSACHUSETTS INSTITUTE OF TECHNOLOGY



## Computing challenge for self-driving cars

## Self-driving cars use crazy amounts of power, and it's becoming a problem



Shelley, a self-driving Audi TT developed by Stanford University, uses the brains in the trunk to speed around a racetrack autonomously.

NIKKI KAHN/THE WASHINGTON POST/GETTY IMAGES

Cameras and radar generate ~6 gigabytes of data every 30 seconds

Self-driving car prototypes use approximately 2,500 Watts of computing power

The computing generates waste heat, and some prototypes need water cooling

SOURCE: WIRED, FEB 2018



## Robots consuming < 1 Watt for actuation









#### Low energy robotics

- Miniature aerial vehicles
- Lighter than air vehicles
- Miniature satellites
- Micro unmanned gliders



SOURCE: CMU



SOURCE: MIT, HARVARD



SOURCE: MIT, HARVARD



## Existing processors consume too much power

< 1 Watt



> 10 Watt

### Transistors are NOT getting more efficient



SOURCE: INTEL, PRESS REPORTS, BOB COLWELL, LINLEY GROUP, IB CONSULTING, THE ECONOMIST

Slowdown of Moore's Law and Dennard scaling

General-purpose microprocessors are not getting faster or more efficient

\*MAXIMUM SAFE POWER CONSUMPTION



## Power dominated by data movement

| Operation           | Energy (pJ) |
|---------------------|-------------|
| 8b Add              | 0.03        |
| 16b Add             | 0.05        |
| 32b Add             | 0.1         |
| 16b FP Add          | 0.4         |
| 32b FP Add          | 0.9         |
| 8b Mult             | 0.2         |
| 32b Mult            | 3.1         |
| 16b FP Mult         | 1.1         |
| 32b FP Mult         | 3.7         |
| 32b SRAM Read (8KB) | 5           |
| 32b DRAM Read       | 640         |

Relative energy cost: bar chart on a log scale from 1 to 10<sup>4</sup>

SOURCE: HOROWITZ, ISSCC 2014



Memory access is orders of magnitude higher energy than compute
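To make the "orders of magnitude" claim concrete, the short sketch below divides each energy in the table by a chosen baseline operation. The function name `relative_cost` and the choice of the 8-bit add as the default baseline are assumptions made for illustration; the energy values themselves are copied from the table.

```python
# Energy per operation in picojoules, copied from the table above
# (Horowitz, ISSCC 2014).
ENERGY_PJ = {
    "8b Add": 0.03,
    "16b Add": 0.05,
    "32b Add": 0.1,
    "16b FP Add": 0.4,
    "32b FP Add": 0.9,
    "8b Mult": 0.2,
    "32b Mult": 3.1,
    "16b FP Mult": 1.1,
    "32b FP Mult": 3.7,
    "32b SRAM Read (8KB)": 5.0,
    "32b DRAM Read": 640.0,
}


def relative_cost(baseline="8b Add"):
    """Return each operation's energy as a multiple of the baseline's energy."""
    base = ENERGY_PJ[baseline]
    return {op: pj / base for op, pj in ENERGY_PJ.items()}


if __name__ == "__main__":
    for op, ratio in relative_cost().items():
        print(f"{op:<22} {ratio:>10.1f}x")
```

With these numbers, a 32-bit DRAM read costs roughly 21,000 times an 8-bit add and about 170 times a 32-bit floating-point multiply.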



## Autonomous navigation uses a lot of data

#### Semantic Understanding

- High frame rate
- Large resolutions
- Data expansion





2 MILLION PIXELS

10× to 100× MORE PIXELS

Geometric Understanding

Growing map size



## Visual-inertial localization

#### Image sequence



#### IMU (Inertial Measurement Unit)



#### Visual-Inertial Odometry (VIO)\*

Determines location/orientation of robot from images and IMU

\*Subset of SLAM algorithm (Simultaneous Localization and Mapping)

#### Localization



## Localization at under 25 mW



JOINT WORK WITH SERTAC KARAMAN

First chip that performs complete **Visual-Inertial Odometry**

- Front-End for Camera (feature detection, tracking, and outlier elimination)
- Front-End for IMU (pre-integration of accelerometer and gyroscope data)
- **Back-End Optimization** of Pose Graph

Consumes 684× and 1582× less energy than mobile and desktop CPUs, respectively
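As a rough illustration of what "pre-integration of accelerometer and gyroscope data" means, the sketch below accumulates many IMU samples between two keyframes into a single relative rotation, velocity change, and position change. It is a simplified planar (2D) version that assumes no sensor bias, no noise model, and no gravity in the plane of motion. The function `preintegrate_planar` and its sample format are hypothetical and do not describe the chip's actual implementation.

```python
import math


def preintegrate_planar(samples, dt):
    """Accumulate IMU samples taken between two keyframes.

    samples: iterable of (gyro_z [rad/s], accel_x [m/s^2], accel_y [m/s^2]),
             all measured in the robot's body frame.
    dt:      sample period in seconds.

    Returns (dtheta, (dvx, dvy), (dpx, dpy)): the heading, velocity, and
    position changes expressed in the body frame of the first keyframe.
    """
    dtheta = 0.0
    dvx = dvy = 0.0
    dpx = dpy = 0.0
    for gyro_z, ax, ay in samples:
        # Rotate the body-frame acceleration into the first keyframe's frame.
        c, s = math.cos(dtheta), math.sin(dtheta)
        ax0 = c * ax - s * ay
        ay0 = s * ax + c * ay
        # Update position first so it uses the velocity at the start of the step.
        dpx += dvx * dt + 0.5 * ax0 * dt * dt
        dpy += dvy * dt + 0.5 * ay0 * dt * dt
        dvx += ax0 * dt
        dvy += ay0 * dt
        dtheta += gyro_z * dt
    return dtheta, (dvx, dvy), (dpx, dpy)


if __name__ == "__main__":
    # One second of constant 1 m/s^2 forward acceleration, no rotation.
    print(preintegrate_planar([(0.0, 1.0, 0.0)] * 100, dt=0.01))
```

The point of the technique is that the back-end optimizer can treat the three accumulated quantities as one measurement between keyframes instead of re-processing every raw IMU sample.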

[Figure: chip die photo, 5.0 mm along one edge; labeled blocks include Shared Memory, Graph, Marginal, Horizon States, and Linear Solver]

| Technology      | 65nm CMOS    | Supply        | 1 V         |
|-----------------|--------------|---------------|-------------|
| Chip area (mm²) | 4.0 × 5.0    | Resolution    | 752 × 480   |
| Core area (mm²) | 3.54 × 4.54  | Camera Rate   | 28 – 171 fps |
| Logic Gates     | 2,043 kgates | Keyframe Rate | 16 – 90 fps |
| SRAM            | 854KB        | Average Power | 24 mW       |
| VFE Frequency   | 62.5 MHz     | GOPS          | 10.5 – 59.1 |
| BE Frequency    | 83.3 MHz     | GFLOPS        | 1 – 5.7     |

SOURCE: ZHANG, RSS 2017; SULEIMAN, VLSI 2018





### Key methods to reduce data size



JOINT WORK WITH SERTAC KARAMAN

Navion: Fully integrated system — no off-chip processing or storage

SOURCE: SULEIMAN, VLSI 2018

## Understanding the environment

#### Depth Estimation



HIDDEN LAYER



#### Semantic Segmentation



State-of-the-art approaches use deep neural networks, which require up to several hundred million operations and weights to compute!

> 100× more complex than video compression



#### Properties we can leverage



#### Operations exhibit high parallelism → high throughput possible

#### Memory Access is the Bottleneck

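- Worst case: all memory reads and writes are DRAM accesses
  - Example: AlexNet has 724M MACs\* → 2896M DRAM accesses required (each MAC reads a weight, an input, and a partial sum, then writes the updated partial sum: 4 accesses)

\*MAC: multiply-and-accumulate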
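The worst-case figure follows from counting four memory accesses per MAC (read a weight, read an input, read a partial sum, write the updated partial sum). The sketch below reproduces that count and then gives a rough energy comparison. The energy part is an assumption for illustration only: it charges every access the 640 pJ of a 32-bit DRAM read and every MAC the cost of one 32-bit floating-point multiply plus add, using the Horowitz (ISSCC 2014) numbers.

```python
# The MAC count is quoted from the slide; the energy model is an
# illustrative assumption, not a measurement.
ALEXNET_MACS = 724_000_000   # AlexNet multiply-and-accumulate operations
ACCESSES_PER_MAC = 4         # read weight, read input, read psum, write psum
DRAM_ACCESS_PJ = 640.0       # assumed cost per access (32b DRAM read)
FP32_MAC_PJ = 3.7 + 0.9      # assumed cost per MAC (32b FP mult + 32b FP add)


def worst_case_dram_accesses(macs):
    """Number of DRAM accesses if no data is ever reused on chip."""
    return macs * ACCESSES_PER_MAC


def dram_energy_joules(macs):
    """Energy spent on DRAM traffic in the worst case."""
    return worst_case_dram_accesses(macs) * DRAM_ACCESS_PJ * 1e-12


def compute_energy_joules(macs):
    """Energy spent on the arithmetic alone."""
    return macs * FP32_MAC_PJ * 1e-12


if __name__ == "__main__":
    print(worst_case_dram_accesses(ALEXNET_MACS))
    print(dram_energy_joules(ALEXNET_MACS), compute_energy_joules(ALEXNET_MACS))
```

Under these assumptions the DRAM traffic costs about 1.85 J per inference against roughly 3.3 mJ for the arithmetic, which is why reducing data movement matters more than reducing the cost of the computation itself.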

#### Properties we can leverage





- Convolutional reuse (pixels, weights)
- Image reuse (pixels)
- Filter reuse (weights)
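The three kinds of reuse can be quantified for a single convolutional layer by dividing the total number of MACs by the number of distinct weights or distinct input pixels. The sketch below does this under simplifying assumptions (square filters, stride 1, no padding). The `ConvLayer` class and its field names are hypothetical, and the example shape is made up rather than taken from any real network.

```python
from dataclasses import dataclass


@dataclass
class ConvLayer:
    batch: int     # number of input images processed together
    filters: int   # number of filters (output channels)
    channels: int  # number of input channels
    height: int    # input height in pixels
    width: int     # input width in pixels
    kernel: int    # filter height and width (square filter assumed)

    @property
    def out_height(self):
        return self.height - self.kernel + 1  # stride 1, no padding

    @property
    def out_width(self):
        return self.width - self.kernel + 1

    def macs(self):
        return (self.batch * self.filters * self.channels
                * self.out_height * self.out_width * self.kernel ** 2)

    def weight_reuse(self):
        """Average MACs per distinct weight: convolutional reuse across
        output positions multiplied by filter reuse across the batch."""
        weights = self.filters * self.channels * self.kernel ** 2
        return self.macs() / weights

    def pixel_reuse(self):
        """Average MACs per distinct input pixel: convolutional reuse across
        overlapping windows multiplied by image reuse across filters."""
        pixels = self.batch * self.channels * self.height * self.width
        return self.macs() / pixels


if __name__ == "__main__":
    layer = ConvLayer(batch=4, filters=8, channels=3, height=6, width=6, kernel=3)
    print(layer.macs(), layer.weight_reuse(), layer.pixel_reuse())
```

Every unit of reuse is one access that can be served from a small local memory instead of DRAM, provided the hardware keeps the value nearby long enough.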



### Exploit data reuse at low-cost memories





Specialized hardware with small (< 1 kB) low-cost memory near compute

Farther and larger memories consume more power

Normalized Energy Cost\*

\*MEASURED FROM A COMMERCIAL 65nm PROCESS

#### Deep neural networks at under 0.3 W



*Exploits data reuse* for 100× reduction in memory accesses from global buffer and 1400× reduction in memory accesses from off-chip DRAM

4mm

SOURCE: CHEN, ISSCC 2016





## Where to go next: planning and mapping



Robot Exploration: decide where to go by computing Shannon Mutual Information
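As a minimal sketch of the quantity involved, the code below computes the Shannon mutual information between the occupancy of one map cell and one binary sensor reading. This is a deliberately simplified model: the sensor probabilities `p_hit_given_occ` and `p_hit_given_free` are assumed values, and a real range sensor produces a measurement along a whole beam of cells rather than a single bit.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a binary variable that is 1 with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def cell_mutual_information(p_occ, p_hit_given_occ=0.9, p_hit_given_free=0.1):
    """Mutual information I(M; Z) in bits between a cell's occupancy M and
    one binary sensor reading Z, computed as H(Z) - H(Z | M)."""
    p_hit = p_occ * p_hit_given_occ + (1.0 - p_occ) * p_hit_given_free
    h_z = binary_entropy(p_hit)
    h_z_given_m = (p_occ * binary_entropy(p_hit_given_occ)
                   + (1.0 - p_occ) * binary_entropy(p_hit_given_free))
    return h_z - h_z_given_m


if __name__ == "__main__":
    for p in (0.0, 0.5, 0.9, 1.0):
        print(p, cell_mutual_information(p))
```

A cell whose state is already known (probability 0 or 1) yields zero information, while a completely unknown cell (probability 0.5) yields the most, so an exploring robot that maximizes this quantity is drawn toward unmapped regions.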

## Challenge is data delivery to all cores

Process multiple beams in parallel





JOINT WORK WITH SERTAC KARAMAN

#### Data delivery from memory is limited





[Figure: Core 1, Core 2, Core 3, ..., Core N]

### Specialized memory architecture



Break up map into separate memory banks and use a novel storage pattern to minimize read conflicts when processing different beams in parallel
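The slides do not spell out the storage pattern, so the sketch below uses a classic skewed mapping purely as an illustration of the idea, not as the authors' scheme. It compares a column-interleaved layout with a skewed layout and counts how many reads collide on the same bank when several cores each read one cell per step. All function names and the 4-bank, 4-beam setup are assumptions.

```python
def column_interleaved_bank(x, y, num_banks):
    """Baseline layout: the bank depends only on the cell's column."""
    return x % num_banks


def skewed_bank(x, y, num_banks):
    """Skewed layout: the bank index shifts by one for each row."""
    return (x + y) % num_banks


def count_conflicts(beams, bank_of, num_banks):
    """Count reads that must wait because another core hit the same bank.

    beams[i][t] is the (x, y) cell that core i reads at step t.
    """
    conflicts = 0
    steps = min(len(beam) for beam in beams)
    for t in range(steps):
        banks = [bank_of(x, y, num_banks) for x, y in (beam[t] for beam in beams)]
        conflicts += len(banks) - len(set(banks))
    return conflicts


if __name__ == "__main__":
    # Four parallel beams travelling along rows, and four along columns.
    horizontal = [[(t, y) for t in range(4)] for y in range(4)]
    vertical = [[(x, t) for t in range(4)] for x in range(4)]
    for name, beams in (("horizontal", horizontal), ("vertical", vertical)):
        print(name,
              count_conflicts(beams, column_interleaved_bank, 4),
              count_conflicts(beams, skewed_bank, 4))
```

In this toy setup the column-interleaved layout stalls on every step for the horizontal beams, while the skewed layout serves both beam directions without conflicts. Beams at arbitrary angles would need a more careful pattern than this one.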

Compute the mutual information for an entire 20 m × 20 m map at 0.1 m resolution (200 × 200 = 40,000 cells) in under a second → a 100× speedup versus a CPU at one-tenth of the power



## Summary

Efficient computing is critical for advancing autonomous robots, particularly at smaller scales → Critical step to making autonomy ubiquitous!

To meet computing demands for power and speed, computing hardware needs to be redesigned from the ground up → Focus on data movement!

## Algorithms

Specialized hardware opens up new opportunities for the codesign of algorithms and hardware → Innovation opportunities for the future of robotics!

