

# **Virtual Acceleration Engine**



Michael J. Miller, CTO



- Virtualization is a replication of hardware resources via software
- Virtualized resources ease deployment and improve HW utilization in datacenters and across networks
  - Compute, storage, networking, applications
- First there was time sharing of CPUs & OS
  - DEC, CDC, IBM, Unix, Tymshare, VMware, ...

#### Then storage

- EMC, NFS, iCloud, SharePoint, Dropbox, Box, ...

#### Next was networking

- VPN, VLAN, NFV, OpenFlow, SDN, OVS, ...

#### Now applications

- Containers, Open Virtualization Format (OVF), ...

#### Flexibility is great, BUT it comes at a cost!

- Virtualizing can impact HW efficiency, throughput, latency and power
- Opportunity for flexible hardware acceleration





## Hardware Acceleration Addresses Bottlenecks, However...

#### Hardware accelerators speed up execution of well-defined tasks

- Overcome compute, memory latency and bandwidth challenges
- Move the data and compute resources closer together
- Heterogeneous solutions with GPU and FPGA in datacenters today

#### Today's challenges

- SW that is dependent on HW acceleration is limited in portability
- CPU & DRAM perform poorly accessing random unstructured data
- Getting the data to the HW adds overhead and latency
- Hardware accelerators are often an afterthought → less optimal performance

#### Today's solutions

- High level: OpenCL on FPGAs and GPUs, CUDA on GPUs, etc.
- Very narrow: TCP Offload Engines

#### What about accelerating other <u>embedded</u> tasks?

 Embedded data search (network addresses), packet classification, data analytics, anomaly detection, flow tracking analysis, security analysis, etc.



## Embedded Packet Filtering Function in Various Network Interface Stacks

- Same virtualized Packet Filter function everywhere
  - Very desirable to have unified control software across all platforms
  - Provides for different cost, performance and capacity





## Embedded Packet Classifier Function in Hierarchical Switch Architectures

- Same virtualized packet classifier function and API at each level
  - Different latency, throughput and capacity
  - Minimizes total cost of development/ownership to manage all levels





- Supports trend towards "virtual everything"
  - Open vSwitch, Virtual Machines, Network Functions Virtualization (NFV), SDN

#### Abstract Virtual Function

- Defined precisely at a functional level
- Given the same inputs, the same results will be produced
- Transparent to implementation
- Maps well into hardware
- Embeddable (makes use of natural boundaries)

#### Common API and RTL Module Interfaces

- API provided as a software library model
- Same across all implementations of the VAE
- Supports adaptation layers to <u>existing</u> higher level application code API

#### Scalable implementation

- Lowest performance & highest capacity: software on CPU core
- Highest performance with FPGA (or ASIC) plus MoSys Si
- Can be ported to future or alternative hardware solutions
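
As a rough illustration of the points above, the sketch below defines one API and two interchangeable implementations that return the same results for the same inputs. All class and method names (`PacketFilter`, `add_rule`, `lookup`, `FixedTableFilter`) are hypothetical and are not taken from the MoSys library; the fixed-size table only stands in for a capacity-limited hardware backend.

```python
from abc import ABC, abstractmethod


class PacketFilter(ABC):
    """Hypothetical common API for a packet-filter VAE (names are illustrative)."""

    @abstractmethod
    def add_rule(self, key, action):
        """Associate an exact-match key with an action."""

    @abstractmethod
    def lookup(self, key):
        """Return the action for key, or the default action on a miss."""


class SoftwareFilter(PacketFilter):
    """Software on a CPU core: capacity is bounded only by host memory."""

    def __init__(self, default_action="drop"):
        self._rules = {}
        self._default = default_action

    def add_rule(self, key, action):
        self._rules[key] = action

    def lookup(self, key):
        return self._rules.get(key, self._default)


class FixedTableFilter(PacketFilter):
    """Stand-in for an accelerator-backed table with a fixed number of slots."""

    def __init__(self, slots, default_action="drop"):
        self._table = [None] * slots
        self._default = default_action

    def _probe(self, key):
        # Linear probing: yield candidate slot indices starting at the hashed slot.
        start = hash(key) % len(self._table)
        for offset in range(len(self._table)):
            yield (start + offset) % len(self._table)

    def add_rule(self, key, action):
        for i in self._probe(key):
            if self._table[i] is None or self._table[i][0] == key:
                self._table[i] = (key, action)
                return
        raise MemoryError("filter table is full")

    def lookup(self, key):
        for i in self._probe(key):
            if self._table[i] is None:
                break  # empty slot ends the probe sequence (no deletions here)
            if self._table[i][0] == key:
                return self._table[i][1]
        return self._default


def load(filter_impl):
    """Program the same rules through the common API, whatever the backend."""
    filter_impl.add_rule(("10.0.0.1", 80), "forward")
    filter_impl.add_rule(("10.0.0.2", 22), "mirror")
    return filter_impl


backends = [load(SoftwareFilter()), load(FixedTableFilter(slots=8))]
queries = [("10.0.0.1", 80), ("10.0.0.2", 22), ("10.0.0.3", 443)]
results = [[b.lookup(q) for q in queries] for b in backends]
```

The calling code in `load` never names a concrete backend, which is the property the common API is meant to provide.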



### Virtual Accelerator Engine Scalability: Software Programmable, Hardware Performance

- **VAE example opportunities in:** 
  - Matching, filtering, classification, sorting, searching, structured computation
- Scalable across many platforms
  - From "C" → FPGA IP core w/BRAM → FPGA w/BE2 → FPGA w/PHE
  - Same high level software interface across all platforms
  - Same RTL Module Interface for FPGA and ASICs





## **Control Stack for Virtual Accelerator Engine**





## MoSys Programmable HyperSpeed Engine (PHE) Device: Accelerating Algorithms + Data Structures

#### Monolithic Silicon

- High density 1T-SRAM 128MB + ECC
- 32 high speed RISC cores

#### Tightly coupled cores & memory

- Direct connect via crosspoint switch
- No cache or TLB → no miss variability

#### Optimized Instruction Set

- Hash, compressed trie, etc.
- Packed bit fields
- 24b x 24b Multiplier

#### High parallelism

- Up to 8-way threaded cores
- 8-way threaded SRAM
- 16-way threaded 1T-SRAM

#### Low latency memory access

- 6ns to 25ns
- Up to 4x faster than DRAM
- 2 Level hierarchy possible





- Provides a wide range of performance/capacity: 100 to 1 possible
  - Enables a wider range of product SKUs
  - Easier adoption of HW acceleration
  - Migration path to future silicon and hardware

#### Reduced time to deployment of new features

Software definable function without the cost of RTL design timelines

#### High level platform portability

- Higher layers of code are not limited to specific hardware
- Use available HW VAE or software version of VAE
- Enables "graceful fall back"
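
A minimal sketch of what "graceful fall back" could look like in application code, assuming a hypothetical probe function (`open_hardware_vae`) that raises when no accelerator is present. None of these names come from a real driver or from MoSys software.

```python
def make_software_lookup(table, default="miss"):
    """Software version of the VAE: a plain dictionary lookup."""
    def lookup(key):
        return table.get(key, default)
    return lookup


def open_hardware_vae(table):
    """Hypothetical hardware probe. No accelerator exists in this sketch,
    so it always fails the way a missing device might."""
    raise RuntimeError("no VAE accelerator detected")


def open_vae(table, probe=open_hardware_vae):
    """Prefer the hardware VAE; fall back to software if the probe fails."""
    try:
        return "hardware", probe(table)
    except RuntimeError:
        return "software", make_software_lookup(table)


backend, lookup = open_vae({"10.0.0.1": "forward"})
```

The caller only ever uses `lookup`, so the higher layers of code are unchanged whichever backend was selected.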

#### SW investment is protected

- Insulated from hardware implementation by common API
- Programmers can take advantage of HW without knowing details
  - Today's software engineers are focused on a higher level than firmware or RTL
  - Allows programmers to focus on the bigger picture



# **Virtualized Graph Memory Engine**



#### **Abstract Graphs Can Be Used For Many Problems**





## Virtualized Graph Memory Engine Block Diagram

- Graph Memory Engine is composed of:
  - Graph Walker including the Graph Memory
  - Computation Engine which computes the next Edge as a function of input vector
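
The split between the two blocks can be sketched in software as follows. This is an illustrative model only: the dictionary layout, the `walk` function and the example graph are assumptions, not the engine's actual data structures.

```python
def walk(graph, start, vector):
    """Graph Walker: follow edges from start until no further edge applies.

    Each node entry may hold a 'compute' function (the Computation Engine
    step) that maps the input vector to an edge value, a dict of 'edges'
    keyed by edge value, and an optional 'default' next node.
    """
    node = start
    path = [node]
    while True:
        entry = graph[node]
        compute = entry.get("compute")
        if compute is None:          # leaf node: nothing left to compute
            return path
        edge_value = compute(vector)
        next_node = entry.get("edges", {}).get(edge_value, entry.get("default"))
        if next_node is None:        # no matching edge and no default edge
            return path
        node = next_node
        path.append(node)


# Hypothetical graph memory contents: classify on protocol, then on port.
graph = {
    "root": {"compute": lambda v: v["proto"],
             "edges": {"tcp": "tcp_node", "udp": "udp_node"}},
    "tcp_node": {"compute": lambda v: v["dport"],
                 "edges": {80: "allow"}, "default": "deny"},
    "udp_node": {},
    "allow": {},
    "deny": {},
}

web_path = walk(graph, "root", {"proto": "tcp", "dport": 80})
ssh_path = walk(graph, "root", {"proto": "tcp", "dport": 22})
```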





- Edges connect two nodes (n, m) with a unique edge value for a given node n
- Multiple graphs can be stored in the memory

```
add_node(n, comp_edge_operation)   // add node n with next action
add_edge(n, m, edge_value)         // add an edge from n to m
add_default_edge(n, m)             // add a default edge from n to m
has_node(n)                        // test if node n exists
has_edge(n, edge_value)            // test if there is an edge from n with edge_value
adjacent(n, m)                     // test if n and m are connected
list = neighbors(n)                // list of nodes adjacent to n
list = edge_values(n)              // list of all edge values of n
func = comp_edge_function(n)       // returns compute edge function
action = action(n, edge_value)     // returns action associated with edge
action = default_action(n)         // returns action for default edge
delete_edge(n, edge_value)         // delete the edge from n with edge_value
delete_node(n)                     // delete node n
```
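
A small in-memory model of these calls might look like the sketch below. It keeps the function names from the listing, but the behavior is assumed where the slide is silent: edges are treated as directed from n to m, the "action" of an edge is taken to be its destination node, and the class name `GraphMemory` is made up for illustration.

```python
class GraphMemory:
    """Software model of the graph API (semantics partly assumed, see above)."""

    def __init__(self):
        # n -> {"compute": callable or None, "edges": {edge_value: m}, "default": m or None}
        self._nodes = {}

    def add_node(self, n, comp_edge_operation=None):
        self._nodes[n] = {"compute": comp_edge_operation, "edges": {}, "default": None}

    def add_edge(self, n, m, edge_value):
        edges = self._nodes[n]["edges"]
        if edge_value in edges:
            raise ValueError("edge value must be unique for a given node")
        edges[edge_value] = m

    def add_default_edge(self, n, m):
        self._nodes[n]["default"] = m

    def has_node(self, n):
        return n in self._nodes

    def has_edge(self, n, edge_value):
        return n in self._nodes and edge_value in self._nodes[n]["edges"]

    def neighbors(self, n):
        entry = self._nodes[n]
        targets = list(entry["edges"].values())
        if entry["default"] is not None:
            targets.append(entry["default"])
        return list(dict.fromkeys(targets))  # drop duplicates, keep order

    def adjacent(self, n, m):
        return m in self.neighbors(n)

    def edge_values(self, n):
        return list(self._nodes[n]["edges"])

    def comp_edge_function(self, n):
        return self._nodes[n]["compute"]

    def action(self, n, edge_value):
        return self._nodes[n]["edges"][edge_value]  # assumed: action == next node

    def default_action(self, n):
        return self._nodes[n]["default"]

    def delete_edge(self, n, edge_value):
        del self._nodes[n]["edges"][edge_value]

    def delete_node(self, n):
        del self._nodes[n]
        for entry in self._nodes.values():   # also remove edges that pointed at n
            entry["edges"] = {v: m for v, m in entry["edges"].items() if m != n}
            if entry["default"] == n:
                entry["default"] = None


g = GraphMemory()
g.add_node("root", lambda vector: vector["proto"])
g.add_node("tcp")
g.add_node("drop")
g.add_edge("root", "tcp", "tcp")
g.add_default_edge("root", "drop")
```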



- MoSys Packet Classifier Platform utilizes an Adaptation Layer
  - Various Adaptation Layers for building and maintaining search graphs
  - Allows for options in function, throughput, latency and capacity
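
To make the idea of building a search graph concrete, here is a hedged sketch of one possible adaptation layer for multi-field rules. The rule format, the `WILDCARD` marker, and the preference for exact matches over wildcards (with backtracking) are all assumptions chosen for illustration; the actual MoSys adaptation layers are not described in this deck.

```python
WILDCARD = None  # assumed marker for "match any value" in a rule field


def new_node():
    return {"edges": {}, "default": None, "action": None}


def build_search_graph(rules):
    """Turn (fields, action) rules into a graph with one level per field.

    Exact field values become labeled edges; wildcards become default edges.
    """
    root = new_node()
    for fields, action in rules:
        node = root
        for value in fields:
            if value is WILDCARD:
                if node["default"] is None:
                    node["default"] = new_node()
                node = node["default"]
            else:
                node = node["edges"].setdefault(value, new_node())
        if node["action"] is None:   # on duplicate rules, the earlier one wins
            node["action"] = action
    return root


def classify(node, packet, depth=0):
    """Try the exact edge first, then back off to the default edge."""
    if depth == len(packet):
        return node["action"]
    exact = node["edges"].get(packet[depth])
    if exact is not None:
        result = classify(exact, packet, depth + 1)
        if result is not None:
            return result
    if node["default"] is not None:
        return classify(node["default"], packet, depth + 1)
    return None


# Rules over two fields: (protocol, destination port).
rules = [
    (("tcp", 80), "web"),
    (("tcp", WILDCARD), "tcp-other"),
    ((WILDCARD, 53), "dns"),
]
search_graph = build_search_graph(rules)
```

With these rules, `classify(search_graph, ("tcp", 22))` follows the exact `tcp` edge and then the default edge, while a packet that matches no rule returns `None`.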





### **Example Multi-Field Match**





# **MoSys Packet Classification Platform**



Supplied by MoSys




**Clone of Xilinx VCU1525 Card** 

## MoSys "Cheetah" Accelerator Dev Kit: 2 x 100G Network Accelerator

- XCVU9P Xilinx FPGA (other assembly options possible)
- 4 x 16GB DDR4 DIMMs
- 2 x QSFP28
- MoSys BE3 or PHE
- MoSys PHE: 1.5GHz, 1 Gb of 1T-SRAM + 32 RISC cores



## Future VAEs: Data Analytics?





#### Virtual Accelerator Engines provide:

- Software at hardware speeds
- Wide range of performance/capacity: 100 to 1 possible
- OEM platform portability
- Preservation of software investments
- Future-proof: ready for improvements in hardware acceleration

#### MoSys will be creating application specific platforms

- Using VAE technology for scalability
- For applications that are dominated by random memory access and unstructured data
- That require high throughput and low latency
- Applicable for embedded functions:
  - Searching, Filtering, Matching, Sorting, Security, Data analytics, Compute on Sparse or Random data

#### Equipment such as:

 Embedded Switching/Routing, Smart NICs, Security appliances, Application Servers, HPC, 5G edge, defense/aerospace, test/measurement equipment, datacenter acceleration



# Thank You,

# **Questions?**