# BLAZAR Programmable HyperSpeed Engine Intelligent Data Accelerators (PHE)

The **BLAZAR Family of Accelerator Engines** supports high bandwidth, fast random memory access rates, and *embedded* <u>In Memory Functions (IMF)</u> that solve critical memory access challenges for memory-bottlenecked applications such as network search, statistics, buffering, security, firewall, 8K video, anomaly detection, genomics, ML random forests, graph/tree/list walking, and traffic monitoring.

The **Programmable HyperSpeed Engine (PHE)** is the highest-performance device in the family and provides hardware and software acceleration options for tuning an application to reach HyperSpeed performance.

The PHE includes all the In Memory Functions of the Bandwidth Engine 3 RMW (see the BE3-RMW Product Brief):

- Two types of In Memory Functions: BANDWIDTH Functions for high throughput Data Movement applications and COMPUTE Functions for statistics, filtering, data manipulation and logical operations.
- IMFs lower latency up to 4x and increase available access rates 6x by avoiding memory bottlenecks
- Total of over 79 IMFs
- The 1Gb of memory replaces up to 8 QDR/RLDRAM memory devices
- Memory architecture allows up to 32 simultaneous accesses
- With the 32 RISC Processor Engines, applications can be accelerated by offloading functions from the FPGA into the Programmable HyperSpeed Engine (PHE), freeing up FPGA resources to add features
- Total aggregate of <u>24 Billion memory operations per second</u>
- Supports application acceleration for aggregate throughput rates ranging from 50Gb/s to over 400Gb/s per device.

# PRODUCT BRIEF



## **KEY FEATURES / PRODUCT OPTIONS**

- High bandwidth, low pin count serial interface
- Highly efficient, reliable transport command and data protocol optimized for 90% efficiency
- Eases board layout and signal integrity, no trace length matching required, operates over connectors
- 1Gb of 1T-SRAM (16Mx72b)
- High access rate SRAM class memory, up to 24 billion accesses/s, with tRC from 0.67 ns to 3.2 ns
- Latencies as low as 8ns internal, 40ns external (pin to pin)
- **Bandwidth IMFs:** BURST sequential read and write functions for Data Movement nearly double bandwidth
- **Compute IMFs:** Atomic RMW functions for statistics, metering, filtering, and mutex operations
- Reduction of I/O up to 7X and avoids stale data problems
- 32 programmable, 1.5 GHz, multi-threaded RISC Processor Engines (PE) with 256 threads
  - User-programmable In-Memory Functions (IMF): embedded <u>Bandwidth and Compute Functions</u> or algorithms
- Aggregate throughput rates ranging from 50Gb/s to over 717Gb/s per device
- Highest Single Chip Bandwidth up to 717 Gbps throughput

# **MoSys ACCELERATOR ENGINE Elements**

MoSys Engines have unique memory architectures that can replace SRAM/RLDRAM memories and <u>embed In Memory Functions (IMF)</u> that execute many times faster. A single embedded function can replace several traditional memory accesses.



**PROGRAMMABLE HYPERSPEED ENGINE IN MEMORY FUNCTIONS (IMF)**





MoSys makes available a new class of memory called **EFAM** (<u>Embedded In-Memory Function Accelerator Memory</u>).

Acceleration is achieved by embedding two types of <u>in-memory functions (IMF): Bandwidth and Compute Functions</u> that execute much faster in memory than they could outside of the memory.

IMF lowers latency up to 4X and increases available access rates 6X by avoiding memory bottlenecks.

Combining the speed of the embedded functions with the speed and parallelism of the memory architecture gives the hardware/software architect several ways to trade off speed and latency and maximize performance.

## **Types of IMFs**

- Bandwidth BURST Functions
  - Optimized for sequential Data Movement
- Compute RMW Functions
  - Optimized compute Read/Modify/Write

We provide over 79 optimized, *fixed* hardware IMFs for Burst and RMW.

### Fixed In Memory BANDWIDTH Functions

The BURST Functions are focused on DATA MOVEMENT where increased bandwidth throughput is paramount. The separate Blazar I/O buses and architecture support full duplex, simultaneous reads and writes.

The BURST Multi-Read/Multi-Write In-Memory Functions can combine up to 8 READS or 8 WRITES (x72b) into a single BURST function, increasing the data transferred per command. This reduces the command and address overhead, nearly <u>doubling the amount of data that can be moved with the same bandwidth</u>.

For example, buffering applications where transactions are larger than 72b benefit from the BURST functions.
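A rough back-of-the-envelope model shows why combining accesses can nearly double the data moved. The sketch below assumes, purely for illustration, that every command costs one 72b word of command/address overhead on the link and that each data word is 72b. The actual Blazar link framing is not described in this brief, so `COMMAND_OVERHEAD_WORDS` and `link_efficiency` are hypothetical.

```python
# Illustrative model only: assumes one 72b command/address word per command.
# The real link framing is not specified in this brief.
COMMAND_OVERHEAD_WORDS = 1  # hypothetical overhead per command, in 72b words


def link_efficiency(data_words_per_command: int) -> float:
    """Fraction of transferred 72b words that carry data rather than overhead."""
    total_words = data_words_per_command + COMMAND_OVERHEAD_WORDS
    return data_words_per_command / total_words


single = link_efficiency(1)   # one 72b data word per command
burst8 = link_efficiency(8)   # BURST of 8 x 72b data words per command
gain = burst8 / single        # data moved per unit of link bandwidth

print(f"single-word efficiency: {single:.2f}")
print(f"burst-of-8 efficiency:  {burst8:.2f}")
print(f"relative data moved:    {gain:.2f}x")
```

Under this assumed overhead, a burst of 8 moves roughly 1.78x as much data as single-word commands over the same link, which is in line with "nearly doubling". A different overhead per command would change the exact ratio.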

## **Typical Applications**

- Networking Search
- Statistics
- Buffering
- Security/Firewall
- Anomaly Detection
- Big Data Analysis
- 8K Video
- Genomics
- Graph/tree/list walking
- Monitoring
- High Speed Data Collection
- QDR/RLDRAM replacement

The PHE allows users to define their own application-specific functions, which are loaded into the Cluster Array for execution:

- Programmable User Defined Bandwidth and Compute (or algorithmic) Functions
- Each of the 32 RISC Processor Engines can execute the same function or a different one

### Fixed In Memory COMPUTE Functions

The RMW Functions accelerate atomic DATA MODIFICATION where coherent updates are needed. They reduce I/O overhead and ensure correct results with a short, ECC-protected pipeline that uses data forwarding.

This provides an alternative to a long pipeline that sends one command to READ a memory location, performs a second operation to MODIFY the value, and sends a third command to WRITE the new value back to the memory location.

It saves the time otherwise spent moving data in and out of memory and modifying it externally. RMWs are atomic, which ensures correct multi-threaded behavior.

I/O reduction up to 7X and avoids stale data problems.
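The stale data problem and the I/O saving can be illustrated with a toy model. This is a minimal sketch, not the device's command set: `ToyMemory`, `rmw_add`, and the way link transfers are counted are all hypothetical, and the transfer counts are an artifact of this simple accounting rather than an attempt to reproduce the 7X figure.

```python
# Toy model of a memory device; class and method names are hypothetical.
class ToyMemory:
    def __init__(self, size: int) -> None:
        self.cells = [0] * size
        self.link_transfers = 0  # commands and responses crossing the link

    def read(self, addr: int) -> int:
        self.link_transfers += 2  # read command out, data back
        return self.cells[addr]

    def write(self, addr: int, value: int) -> None:
        self.link_transfers += 1  # write command carrying the data
        self.cells[addr] = value

    def rmw_add(self, addr: int, delta: int) -> None:
        """Atomic in-memory add: one command, no data returned."""
        self.link_transfers += 1
        self.cells[addr] += delta


def external_increment_interleaved(mem: ToyMemory, addr: int) -> None:
    """Two clients each READ, MODIFY, WRITE, with their reads interleaved."""
    a = mem.read(addr)       # client A reads 0
    b = mem.read(addr)       # client B reads the same, soon stale, value
    mem.write(addr, a + 1)   # client A writes 1
    mem.write(addr, b + 1)   # client B overwrites with 1: one update is lost


external = ToyMemory(16)
external_increment_interleaved(external, addr=3)

in_memory = ToyMemory(16)
in_memory.rmw_add(3, 1)  # client A
in_memory.rmw_add(3, 1)  # client B

print(external.cells[3], external.link_transfers)    # 1 6
print(in_memory.cells[3], in_memory.link_transfers)  # 2 2
```

In the external case both clients modify a stale copy, so the counter ends at 1 instead of 2. With the atomic function the modification happens inside the memory, so no client ever holds a stale value and fewer transfers cross the link.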

### Programmable In Memory USER DEFINED Functions

Each group of IMFs delivers a different increment of performance acceleration. With User Defined embedded In Memory Functions, the PHE allows a design to achieve what we call <u>HyperSpeed</u>.

Using the 32 PE RISC processors, you can move frequently used, time-consuming functions into the Engine, as well as specialized computation or matching algorithms that are not only time consuming but also require considerable RTL resources.

And, by *putting multiple copies of a function into the Cluster Array, significant performance is achieved* through the power of parallel processing.

Or, simply include functions/algorithms that consume a lot of FPGA RTL resources. NOW, you can DO MORE in the FPGA!







## SOFTWARE DEFINED...HARDWARE ACCELERATED

#### SOFTWARE DEFINES PERFORMANCE OPTIONS

Each version of the Blazar family of accelerator engines offers the software architect many different speed options to <u>define</u> where in memory various elements of the software should reside, based on how best to accelerate system performance.

- Data tables can use the 1Gb of high speed memory
- BURST data reads/writes and computational R/M/W can best be executed in the engine by a single function call
- Using the 32 PE RISC processors, special algorithms can be embedded in the Cluster Array for faster execution, or several copies of an algorithm can be loaded for parallel processing
- Functions can execute simultaneously in multiple partitions and/or banks of the main memory. At the same time, user defined functions can execute in each PE within its cluster memory. If needed, function execution priorities can be defined with the Domain Priority feature on the BE3 and PHE.

#### HARDWARE ACCELERATES THE EXECUTION

- Serial High speed I/O up to 28 Gbps provides high bandwidth and low pin count, simplifying the hardware design
- High speed (tRC = 0.67 to 3.3 ns), parallel capable memory
- 32 PE RISC processors running at 1.5 GHz, 8 threads per PE, 256 total threads per device





# **CLUSTER ARCHITECTURE**

A Cluster has 4 RISC Processor Engines (PE) and Data Memory

- Each RISC Processor Engine
  - 1.5 GHz
  - 128 Internal Registers (IR)
  - 1k x 72b Instruction Memory (IM)




The 4 PEs are connected to Data Memory by a cross point switch

- Random Access Data Memory
  - 4k x 72b
  - Connected to the 4 Cluster PEs with a cross point switch
    - 0.67ns tRC to the 4 PEs in the Cluster
    - Allows simultaneous access by all four PEs

# **CLUSTER ARRAY ARCHITECTURE**

Eight (8) Clusters are combined to form the Cluster Array
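The per-device totals follow directly from the cluster figures. The sketch below is only a bookkeeping aid: `ClusterArrayConfig` is a hypothetical name, and it assumes "1k" and "4k" mean 1024 and 4096 words. The 8 threads per PE figure is taken from the hardware summary elsewhere in this brief.

```python
from dataclasses import dataclass

WORD_BITS = 72  # word width used throughout the brief


@dataclass(frozen=True)
class ClusterArrayConfig:
    clusters: int = 8                    # clusters per Cluster Array
    pes_per_cluster: int = 4             # RISC Processor Engines per cluster
    threads_per_pe: int = 8              # threads per PE
    instr_words_per_pe: int = 1024       # 1k x 72b instruction memory (assumed 1024)
    data_words_per_cluster: int = 4096   # 4k x 72b data memory (assumed 4096)

    @property
    def total_pes(self) -> int:
        return self.clusters * self.pes_per_cluster

    @property
    def total_threads(self) -> int:
        return self.total_pes * self.threads_per_pe

    @property
    def cluster_data_memory_bits(self) -> int:
        return self.clusters * self.data_words_per_cluster * WORD_BITS

    @property
    def instruction_memory_bits(self) -> int:
        return self.total_pes * self.instr_words_per_pe * WORD_BITS


cfg = ClusterArrayConfig()
print(cfg.total_pes)      # 32
print(cfg.total_threads)  # 256
print(cfg.cluster_data_memory_bits, cfg.instruction_memory_bits)
```

Eight clusters of four PEs give the 32 Processor Engines, and 8 threads per PE give the 256 threads quoted for the device.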







## SOFTWARE DEFINES PERFORMANCE OPTIONS

The PHE offers the software architect many different speed options to <u>define where in memory various elements of the software should reside, based on how best to accelerate system performance</u>.

- Data tables can use the 1Gb of high speed memory
- BURST data reads/writes and COMPUTE R/M/W can best be executed in the engine by a single function call
- Using the 32 PE RISC processors:
  - Special USER DEFINED algorithms can be embedded in the Cluster Array for faster execution, or several copies of the algorithm can be loaded for parallel processing
  - Up to 256 threads
- Simultaneous function execution in multiple partitions and/or banks of the main memory. Simultaneously, user defined functions can be executing in each PE within its cluster memory.
- Optional 4 Domains to set function execution priority
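One way to picture priority domains is as a priority queue in front of the engine. The brief does not describe how the Domain Priority feature arbitrates, so the sketch below rests on assumptions: lower domain numbers run first, and calls within a domain run in arrival order. `DomainQueue` and the function names are hypothetical.

```python
import heapq
from itertools import count

NUM_DOMAINS = 4  # the brief mentions an optional 4 domains


class DomainQueue:
    """Hypothetical dispatcher: lower domain number first, FIFO within a domain."""

    def __init__(self) -> None:
        self._heap = []        # entries are (domain, arrival order, function name)
        self._seq = count()    # tie-breaker that preserves arrival order

    def submit(self, domain: int, function_name: str) -> None:
        if not 0 <= domain < NUM_DOMAINS:
            raise ValueError(f"domain must be in 0..{NUM_DOMAINS - 1}")
        heapq.heappush(self._heap, (domain, next(self._seq), function_name))

    def next_function(self) -> str:
        _, _, name = heapq.heappop(self._heap)
        return name


q = DomainQueue()
q.submit(2, "stats_update")
q.submit(0, "table_lookup")
q.submit(2, "meter_update")
q.submit(1, "burst_write")

order = [q.next_function() for _ in range(4)]
print(order)  # ['table_lookup', 'burst_write', 'stats_update', 'meter_update']
```

The actual arbitration policy may differ, for example by weighting domains rather than using strict priority.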

#### HARDWARE ACCELERATES THE EXECUTION

- Serial High speed I/O up to 28 Gbps provides high bandwidth and low pin count, simplifying the hardware design
- High speed (tRC = 0.67 to 2.7 ns), parallel capable memory
- 32 PE RISC Processor Engines running at 1.5 GHz, 8 threads per PE, 256 total threads per device

## Software Defined - Hardware Accelerated

Software and System Architects can improve application performance by accelerating the memory access and utilizing the In Memory Compute Functions.

The different Accelerator Engine devices allow application tuning to achieve increasing levels of performance, up to our most powerful engine: the Programmable HyperSpeed Engine with 32 Processor Cores.

| Family  | Part Number | Description | Package (mm) | Lanes (Tx/Rx) | 10-12.5G per lane | 15G per lane | 25-28G per lane | BW (Gb/s) | tRC (ns) | Memory Size (Gb) | Access Rate (Billion Transactions/s) | R/W | RMW / ALU | Custom (32 RISC Cores) |
|---------|-------------|-------------|--------------|---------------|-------------------|--------------|-----------------|-----------|----------|------------------|--------------------------------------|-----|-----------|------------------------|
| BURST   | MSR620 | Bandwidth Engine 2 Burst<br>Serial 0.5Gb High Access Memory | FCBGA<br>19x19 | 16 | ✓ |   |   | 320 | 3.2 | 0.5 | 3.3 | ✓ |   |   |
| BURST   | MSR630 | Bandwidth Engine 3 Burst<br>Serial 1Gb High Access Memory | FCBGA<br>27x27 | 16 | ✓ | ✓ | ✓ | 717 | 2.7 | 1 | 6.5 | ✓ |   |   |
| RMW     | MSR820 | Bandwidth Engine 2 RMW<br>Serial 0.5Gb High Access Memory<br>with ALU for RMW functions | FCBGA<br>19x19 | 16 | ✓ |   |   | 320 | 3.2 | 0.5 | 3.3 | ✓ | ✓ |   |
| RMW     | MSR830 | Bandwidth Engine 3 RMW<br>Serial 1Gb High Access Memory with<br>ALU for RMW functions | FCBGA<br>27x27 | 16 | ✓ | ✓ | ✓ | 717 | 2.7 | 1 | 6.5 | ✓ | ✓ |   |
| Program | MSPS30 | Programmable Accelerator Engine<br>Serial Interface, 1Gb Memory, 32<br>RISC Processor cores for custom<br>algorithms, compute, functions | FCBGA<br>27x27 | 16 | ✓ | ✓ | ✓ | 717 | 2.7 | 1 | 24<br>Internal | ✓ | ✓ | ✓ |
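
As a rough aid for reading the table, the sketch below transcribes its bandwidth, memory size, and function columns into a small list and picks the first part, in table order, that meets a set of requirements. The `Device` record and `pick_device` helper are hypothetical and not a MoSys tool. Real part selection would also weigh package, lane rate, and cost.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Device:
    part: str
    bandwidth_gbps: int
    memory_gb: float
    rmw: bool          # has ALU for RMW functions
    risc_cores: int    # programmable RISC cores


# Values transcribed from the table above, in table order.
DEVICES = [
    Device("MSR620", 320, 0.5, rmw=False, risc_cores=0),
    Device("MSR630", 717, 1.0, rmw=False, risc_cores=0),
    Device("MSR820", 320, 0.5, rmw=True, risc_cores=0),
    Device("MSR830", 717, 1.0, rmw=True, risc_cores=0),
    Device("MSPS30", 717, 1.0, rmw=True, risc_cores=32),
]


def pick_device(min_bandwidth_gbps: int, min_memory_gb: float,
                need_rmw: bool = False,
                need_custom: bool = False) -> Optional[Device]:
    """Return the first device in table order that meets every requirement."""
    for dev in DEVICES:
        if dev.bandwidth_gbps < min_bandwidth_gbps:
            continue
        if dev.memory_gb < min_memory_gb:
            continue
        if need_rmw and not dev.rmw:
            continue
        if need_custom and dev.risc_cores == 0:
            continue
        return dev
    return None


print(pick_device(300, 0.5).part)                    # MSR620
print(pick_device(400, 1.0, need_rmw=True).part)     # MSR830
print(pick_device(100, 0.5, need_custom=True).part)  # MSPS30
```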

LEARN MORE: <u>www.mosys.com</u> <u>https://mosys.com/blazar-family-of-accelerator-engines/</u>

## **ACCELERATING FPGA APPLICATIONS**



With more and more demand on performance and the increasing need for additional features, an FPGA's resources can easily be consumed.

The Blazar PHE gives hardware/software architects many options to speed up application performance or create unique system architectures that execute faster.

Options to increase the performance of an FPGA design with BLAZAR Engines:

- Offload as many functions as possible
- Simplify software by moving complex functions or algorithms that consume execution time and RTL
- Take advantage of the High Speed Serial Interface to the high random access rate 1Gb memory (tRC 2.7 ns, device dependent)

Install multiple copies of the same algorithm or function, and the scheduler will find available processors.
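
The idea of installing several copies and letting a scheduler find a free processor can be sketched as follows. This is a toy model under stated assumptions: `ToyScheduler`, its methods, and the `hash_lookup` function name are hypothetical, each PE is treated as running one call at a time (the 8 threads per PE are ignored), and the device's real scheduling policy is not described in this brief.

```python
from typing import Optional

NUM_PES = 32  # RISC Processor Engines per device


class ToyScheduler:
    """Hypothetical dispatcher: each call goes to the first idle PE holding the function."""

    def __init__(self) -> None:
        self.loaded = {}    # PE index -> name of the function installed on it
        self.busy = set()   # PE indices currently executing a call

    def install(self, function_name: str, copies: int) -> None:
        free = [pe for pe in range(NUM_PES) if pe not in self.loaded]
        if copies > len(free):
            raise ValueError("not enough unprogrammed PEs")
        for pe in free[:copies]:
            self.loaded[pe] = function_name

    def dispatch(self, function_name: str) -> Optional[int]:
        for pe, name in self.loaded.items():
            if name == function_name and pe not in self.busy:
                self.busy.add(pe)
                return pe
        return None  # every copy is busy; the caller must wait

    def complete(self, pe: int) -> None:
        self.busy.discard(pe)


sched = ToyScheduler()
sched.install("hash_lookup", copies=4)

pes = [sched.dispatch("hash_lookup") for _ in range(5)]
print(pes)  # [0, 1, 2, 3, None]

sched.complete(1)
print(sched.dispatch("hash_lookup"))  # 1
```

With four copies installed, four calls run in parallel and the fifth waits until a copy is free, which is why adding copies raises throughput for a heavily used function.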



Info: www.mosys.com Email: Sales@mosys.com

