Systolic Arrays, Vector Processors and FPGAs (programmable fast rate arrays)

05 Mar

Systolic arrays

In computer architecture, a systolic array is a pipe network arrangement of processing units called cells. It is a specialized form of parallel computing, where cells (i.e. processors), compute data and store it independently of each other.

A systolic array is composed of matrix-like rows of data processing units called cells. Data processing units DPUs are similar to central processing units (CPU)s, (except for the usual lack of a program counter, since operation is transport-triggered, i.e., by the arrival of a data object). Each cell shares the information with its neighbours immediately after processing. The systolic array is often rectangular where data flows across the array between neighbour DPUs, often with different data flowing in different directions. The data streams entering and leaving the ports of the array are generated by auto-sequencing memory units, ASMs. Each ASM includes a data counter. In embedded systems a data stream may also be input from and/or output to an external source.

An example of a systolic algorithm might be designed for matrix multiplication. One matrix is fed in a row at a time from the top of the array and is passed down the array, the other matrix is fed in a column at a time from the left hand side of the array and passes from left to right. Dummy values are then passed in until each processor has seen one whole row and one whole column. At this point, the result of the multiplication is stored in the array and can now be output a row or a column at a time, flowing down or across the array.

Systolic arrays are arrays of DPUs which are connected to a small number of nearest neighbour DPUs in a mesh-like topology. DPUs perform a sequence of operations on data that flows between them. Because the traditional systolic array synthesis methods have been practiced by algebraic algorithms, only uniform arrays with only linear pipes can be obtained, so that the architectures are the same in all DPUs. The consequence is, that only applications with regular data dependencies can be implemented on classical systolic arrays. Like SIMD machines, clocked systolic arrays compute in “lock-step” with each processor undertaking alternate compute | communicate phases. But systolic arrays with asynchronous handshake between DPUs are called wavefront arrays. One well-known systolic array is Carnegie Mellon University’s iWarp processor, which has been manufactured by Intel. An iWarp system has a linear array processor connected by data buses going in both directions.


The systolic array paradigm, data-stream-driven by data counters, is the counterpart of the von Neumann paradigm, instruction-stream-driven by a program counter. Because a systolic array usually sends and receives multiple data streams, and multiple data counters are needed to generate these data streams, it supports data parallelism. The name derives from analogy with the regular pumping of blood by the heart.

H. T. Kung and Charles E. Leiserson published the first paper describing systolic arrays in 1978; however, the first machine known to have used a similar technique was the Colossus Mark II in 1944.


An application Example – Polynomial Evaluation

Horner’s rule for evaluating a polynomial is:

y = (…(((an * x + an − 1) * x + an − 2) * x + an − 3) * x + … + a1) * x + a0

A linear systolic array in which the processors are arranged in pairs: one multiplies its input by x and passes the result to the right, the next adds and passes the result to the right:

Advantages and Disadvantages


* Faster

* Scalable


* Expensive

* Highly specialized for particular applications

* Difficult to build

Super Systolic Array

The super systolic array is a generalization of the systolic array. Because the classical synthesis methods (algebraic, i. e. projection-based synthesis), yielding only uniform DPU arrays permitting only linear pipes, systolic arrays could be used only to implement applications with regular data dependencies. By using simulated annealing instead, Rainer Kress has introduced the generalized systolic array: the super systolic array. Its application is not restricted to applications with regular data dependenci


The KressArray is the reconfigurable version of the super systolic array. More information about the background may be obtained from the articles about Systolic array, Reconfigurable Computing, Configure Compiler, super systolic array and Configure/Software Co-Compiler.

Because of the wide applicability of the super systolic array its reconfigurability makes sense: the Kress Array, having been pioneered by Rainer Kress for reconfigurable computing.

Vector Processor

A vector processor, or array processor, is a central processing unit (CPU) that implements an instruction set containing instructions that operate on one-dimensional arrays of data called vectors. This is in contrast to a scalar processor, whose instructions operate on single data items. The vast majority of CPUs are scalar.

Vector processors first appeared in the 1970s, and formed the basis of most supercomputers through the 1980s and into the 1990s. Improvements in scalar processors, particularly microprocessors, resulted in the decline of traditional vector processors in supercomputers, and the appearance of vector processing techniques in mass market CPUs around the early 1990s. Today, most commodity CPUs implement architectures that feature instructions for some vector processing on multiple (vectorized) data sets, typically known as SIMD (Single Instruction, Multiple Data). Common examples include MMX, SSE, and AltiVec. Vector processing techniques are also found in video game console hardware and graphics accelerators. In 2000, IBM, Toshiba and Sony collaborated to create the Cell processor, consisting of one scalar processor and eight vector processors, which found use in the Sony PlayStation 3 among other applications.

Other CPU designs may include some multiple instructions for vector processing on multiple (vectorised) data sets, typically known as MIMD (Multiple Instruction, Multiple Data). Such designs are usually dedicated to a particular application and not commonly marketed for general purpose computing.


Vector processing development began in the early 1960s at Westinghouse in their Solomon project. Solomon’s goal was to dramatically increase math performance by using a large number of simple math co-processors under the control of a single master CPU. The CPU fed a single common instruction to all of the arithmetic logic units (ALUs), one per “cycle”, but with a different data point for each one to work on. This allowed the Solomon machine to apply a single algorithm to a large data set, fed in the form of an array. In 1962, Westinghouse cancelled the project, but the effort was re-started at the University of Illinois as the ILLIAC IV. Their version of the design originally called for a 1 GFLOPS machine with 256 ALUs, but, when it was finally delivered in 1972, it had only 64 ALUs and could reach only 100 to 150 MFLOPS. Nevertheless it showed that the basic concept was sound, and, when used on data-intensive applications, such as computational fluid dynamics, the “failed” ILLIAC was the fastest machine in the world. The ILLIAC approach of using separate ALUs for each data element is not common to later designs, and is often referred to under a separate category, massively parallel computing.

The first successful implementation of vector processing appears to be the Control Data Corporation STAR-100 and the Texas Instruments Advanced Scientific Computer (ASC). The basic ASC (i.e., “one pipe”) ALU used a pipeline architecture that supported both scalar and vector computations, with peak performance reaching approximately 20 MFLOPS, readily achieved when processing long vectors. Expanded ALU configurations supported “two pipes” or “four pipes” with a corresponding 2X or 4X performance gain. Memory bandwidth was sufficient to support these expanded modes. The STAR was otherwise slower than CDC’s own supercomputers like the CDC 7600, but at data related tasks they could keep up while being much smaller and less expensive. However the machine also took considerable time decoding the vector instructions and getting ready to run the process, so it required very specific data sets to work on before it actually sped anything up.

The vector technique was first fully exploited in the famous Cray-1. Instead of leaving the data in memory like the STAR and ASC, the Cray design had eight “vector registers,” which held sixty-four 64-bit words each. The vector instructions were applied between registers, which is much faster than talking to main memory. The Cray design used pipeline parallelism to implement vector instructions rather than multiple ALUs. In addition the design had completely separate pipelines for different instructions, for example, addition/subtraction was implemented in different hardware than multiplication. This allowed a batch of vector instructions themselves to be pipelined, a technique they called vector chaining. The Cray-1 normally had a performance of about 80 MFLOPS, but with up to three chains running it could peak at 240 MFLOPS – a respectable number even as of 2002.

Other examples followed. Control Data Corporation tried to re-enter the high-end market again with its ETA-10 machine, but it sold poorly and they took that as an opportunity to leave the supercomputing field entirely. In the early and mid-1980s Japanese companies (Fujitsu, Hitachi and Nippon Electric Corporation (NEC) introduced register-based vector machines similar to the Cray-1, typically being slightly faster and much smaller. Oregon-based Floating Point Systems (FPS) built add-on array processors for minicomputers, later building their own minisupercomputers. However Cray continued to be the performance leader, continually beating the competition with a series of machines that led to the Cray-2, Cray X-MP and Cray Y-MP. Since then, the supercomputer market has focused much more on massively parallel processing rather than better implementations of vector processors. However, recognizing the benefits of vector processing IBM developed Virtual Vector Architecture for use in supercomputers coupling several scalar processors to act as a vector processor.

Vector processing techniques have since been added to almost all modern CPU designs, although they are typically referred to as SIMD. In these implementations, the vector unit runs beside the main scalar CPU, and is fed data from programs that know it is there.


In general terms, CPUs are able to manipulate one or two pieces of data at a time. For instance, many CPUs have an instruction that essentially says “add A to B and put the result in C”. The data for A, B and C could be—in theory at least—encoded directly into the instruction. However things are rarely that simple. In general the data is rarely sent in raw form, and is instead “pointed to” by passing in an address to a memory location that holds the data. Decoding this address and getting the data out of the memory takes some time. As CPU speeds have increased, this memory latency has historically become a large impediment to performance

In order to reduce the amount of time this takes, most modern CPUs use a technique known as instruction pipelining in which the instructions pass through several sub-units in turn. The first sub-unit reads the address and decodes it, the next “fetches” the values at those addresses, and the next does the math itself. With pipelining the “trick” is to start decoding the next instruction even before the first has left the CPU, in the fashion of an assembly line, so the address decoder is constantly in use. Any particular instruction takes the same amount of time to complete, a time known as the latency, but the CPU can process an entire batch of operations much faster than if it did so one at a time.

Vector processors take this concept one step further. Instead of pipelining just the instructions, they also pipeline the data itself. They are fed instructions that say not just to add A to B, but to add all of the numbers “from here to here” to all of the numbers “from there to there”. Instead of constantly having to decode instructions and then fetch the data needed to complete them, it reads a single instruction from memory, and “knows” that the next address will be one larger than the last. This allows for significant savings in decoding time.

To illustrate what a difference this can make, consider the simple task of adding two groups of 10 numbers together. In a normal programming language you would write a “loop” that picked up each of the pairs of numbers in turn, and then added them. To the CPU, this would look something like this:

execute this loop 10 times

read the next instruction and decode it

fetch this number

fetch that number

add them

put the result here

end loop

But to a vector processor, this task looks considerably different:

read instruction and decode it

fetch these 10 numbers

fetch those 10 numbers

add them

put the results here

There are several savings inherent in this approach. For one, only two address translations are needed. Depending on the architecture, this can represent a significant savings by itself. Another saving is fetching and decoding the instruction itself, which has to be done only one time instead of ten. The code itself is also smaller, which can lead to more efficient memory use.

But more than that, a vector processor may have multiple functional units adding those numbers in parallel. The checking of dependencies between those numbers is not required as a vector instruction specifies multiple independent operations. This simplifies the control logic required, and can improve performance by avoiding stalls.

As mentioned earlier, the Cray implementations took this a step further, allowing several different types of operations to be carried out at the same time. Consider code that adds two numbers and then multiplies by a third; in the Cray, these would all be fetched at once, and both added and multiplied in a single operation. Using the pseudo code above, the Cray did:

read instruction and decode it

fetch these 10 numbers

fetch those 10 numbers

fetch another 10 numbers

add and multiply them

put the results here

The math operations thus completed far faster overall, the limiting factor being the time required to fetch the data from memory.

Not all problems can be attacked with this sort of solution. Adding these sorts of instructions necessarily adds complexity to the core CPU. That complexity typically makes other instructions run slower—i.e., whenever it is not adding up many numbers in a row. The more complex instructions also add to the complexity of the decoders, which might slow down the decoding of the more common instructions such as normal adding.

In fact, vector processors work best only when there are large amounts of data to be worked on. For this reason, these sorts of CPUs were found primarily in supercomputers, as the supercomputers themselves were, in general, found in places such as weather prediction centres and physics labs, where huge amounts of data are “crunched”.

FPGA (Field-programmable gate array)

A Field-programmable Gate Array (FPGA) is an integrated circuit designed to be configured by the customer or designer after manufacturing—hence “field-programmable”. The FPGA configuration is generally specified using a hardware description language (HDL), similar to that used for an application-specific integrated circuit (ASIC) (circuit diagrams were previously used to specify the configuration, as they were for ASICs, but this is increasingly rare). FPGAs can be used to implement any logical function that an ASIC could perform. The ability to update the functionality after shipping, partial re-configuration of the portion of the design and the low non-recurring engineering costs relative to an ASIC design (notwithstanding the generally higher unit cost), offer advantages for many applications

FPGAs contain programmable logic components called “logic blocks”, and a hierarchy of reconfigurable interconnects that allow the blocks to be “wired together”—somewhat like many (changeable) logic gates that can be inter-wired in (many) different configurations . Logic blocks can be configured to perform complex combinational functions, or merely simple logic gates like AND and XOR. In most FPGAs, the logic blocks also include memory elements, which may be simple flip-flops or more complete blocks of memory.

In addition to digital functions, some FPGAs have analog features. The most common analog feature is programmable slew rate and drive strength on each output pin, allowing the engineer to set slow rates on lightly loaded pins that would otherwise ring unacceptably, and to set stronger, faster rates on heavily loaded pins on high-speed channels that would otherwise run too slow. Another relatively common analog feature is differential comparators on input pins designed to be connected to differential signaling channels. A few “mixed signal FPGAs” have integrated peripheral Analog-to-Digital Converters (ADCs) and Digital-to-Analog Converters (DACs) with analog signal conditioning blocks allowing them to operate as a system-on-a-chip. Such devices blur the line between an FPGA, which carries digital ones and zeros on its internal programmable interconnect fabric, and field-programmable analog array (FPAA), which carries analog values on its internal programmable interconnect fabric.

FPGA design and programming

To define the behavior of the FPGA, the user provides a hardware description language (HDL) or a schematic design. The HDL form is more suited to work with large structures because it’s possible to just specify them numerically rather than having to draw every piece by hand. However, schematic entry can allow for easier visualisation of a design.

Then, using an electronic design automation tool, a technology-mapped netlist is generated. The netlist can then be fitted to the actual FPGA architecture using a process called place-and-route, usually performed by the FPGA Company’s proprietary place-and-route software. The user will validate the map, place and route results via timing analysis, simulation, and other verification methodologies. Once the design and validation process is complete, the binary file generated (also using the FPGA company’s proprietary software) is used to (re)configure the FPGA.

Going from schematic/HDL source files to actual configuration: The source files are fed to a software suite from the FPGA/CPLD vendor that through different steps will produce a file. This file is then transferred to the FPGA/CPLD via a serial interface (JTAG) or to an external memory device like an EEPROM.

The most common HDLs are VHDL and Verilog, although in an attempt to reduce the complexity of designing in HDLs, which have been compared to the equivalent of assembly languages, there are moves to raise the abstraction level through the introduction of alternative languages. National Instrument’s LabVIEW graphical programming language ( sometimes referred to as “G” ) has an FPGA add-in module available to target and program FPGA hardware. The LabVIEW approach drastically simplifies the FPGA programming process.

To simplify the design of complex systems in FPGAs, there exist libraries of predefined complex functions and circuits that have been tested and optimized to speed up the design process. These predefined circuits are commonly called IP cores, and are available from FPGA vendors and third-party IP suppliers (rarely free, and typically released under proprietary licenses). Other predefined circuits are available from developer communities such as OpenCores (typically released under free and open source licenses such as the GPL, BSD or similar license), and other sources.

In a typical design flow, an FPGA application developer will simulate the design at multiple stages throughout the design process. Initially the RTL description in VHDL or Verilog is simulated by creating test benches to simulate the system and observe results. Then, after the synthesis engine has mapped the design to a netlist, the netlist is translated to a gate level description where simulation is repeated to confirm the synthesis proceeded without errors. Finally the design is laid out in the FPGA at which point propagation delays can be added and the simulation run again with these values back-annotated onto the netlist.

Basic process technology types

* SRAM – based on static memory technology. In-system programmable and re-    programmable. Requires external boot devices. CMOS.

* Antifuse – One-time programmable. CMOS.

* PROM – Programmable Read-Only Memory technology. One-time             programmable because of plastic packaging.

* EPROM – Erasable Programmable Read-Only Memory technology. One-time     programmable but with window, can be erased with ultraviolet (UV) light. CMOS.

* EEPROM – Electrically Erasable Programmable Read-Only Memory technology.            Can be erased, even in plastic packages. Some but not all EEPROM devices   can be in-system programmed. CMOS.

* Flash – Flash-erase EPROM technology. Can be erased, even in plastic   packages. Some but not all flash devices can be in-system programmed.       Usually, a flash cell is smaller than an equivalent EEPROM cell and is therefore less expensive to manufacture. CMOS.

* Fuse – One-time programmable. Bipolar.

Major manufacturers

Xilinx and Altera are the current FPGA market leaders and long-time industry rivals. Together, they control over 80 percent of the market, with Xilinx alone representing over 50 percent.

Both Xilinx and Altera provide free Windows and Linux design software.

Other competitors include Lattice Semiconductor (SRAM based with integrated configuration Flash, instant-on, low power, live reconfiguration), Actel (antifuse, flash-based, mixed-signal), SiliconBlue Technologies (extremely low power SRAM-based FPGAs with option integrated nonvolatile configuration memory), Achronix (RAM based, 1.5 GHz fabric speed) who will be building their chips on Intels’ state-of-the art 22nm process, and QuickLogic (handheld focused CSSP, no general purpose FPGAs).

In March 2010, Tabula announced their new FPGA technology that uses time-multiplexed logic and interconnect for greater potential cost savings for high-density applications.

Bit Slicing

Bit slicing is a technique for constructing a processor from modules of smaller bit width. Each of these components processes one bit field or “slice” of an operand. The grouped processing components would then have the capability to process the chosen full word-length of a particular software design. Bit slice processors usually consist of an arithmetic logic unit (ALU) of 1, 2, 4 or 8 bits and control lines (including carry or overflow signals that are internal to the processor in non-bitsliced designs). For example, two 4-bit ALUs could be arranged side by side, with control lines between them, to form an 8-bit,16-bit,32-bit words (so the designer can add as many slices he wants to make it to manipulate longer words lengths). A microsequencer or Control ROM would be used to execute logic to provide data and control signals to regulate function of the component ALUs. Examples of bit-slice microprocessor modules can be seen in the Intel 3000 family, the AMD’s Am2900 family the National Semiconductor IMP-16 and IMP-8 family, and the 74181.

During the era this technique was most common (mid-1970s through 1980s), there was some debate over how much bus width was necessary in a given computer system, and silicon chip technology and parts were generally much more expensive than today. Using multiple simpler (and cheaper) ALUs was seen as a way to increase computing power in a cost effective manner. 32-bit architectures were being discussed but few were in production. 16-bit processors were common but expensive, and the 8-bit processors, such as the Z80 were widely used in the nascent home computer market. Combining components to produce bit slice products allowed engineers and students to create more powerful and complex computers at a more reasonable cost, using off-the-shelf components that could be custom-configured. The complexities of creating a new computer architecture were greatly reduced when the details of the ALU were already specified (and debugged). The main advantage in the late 60’s to mid 80’s being though, that bit slicing made it economically possible in smaller processors to use bipolar transistors, which switch much faster than NMOS or CMOS transistors. This allowed for much higher clockrates, for applications where speed was needed. For example DSP functions or matrix transformation, or as in the Xerox Alto the combination of flexibility and speed, before discrete CPUs was able to deliver that

Bit slicing (although it was not called that) was also used in computers before integrated circuits. The first bit-sliced machine was EDSAC 2, built at the University of Cambridge Mathematical Laboratory in 1956-8. Another example was the Storage Address Registers (StARs) of the IBM 1401, which were built on IBM Standard Modular System cards containing a one-bit slice of the four registers in the basic machine.

In more recent times, the term bitslicing was re-coined by Matthew Kwan to refer to the technique of using a general purpose CPU to implement multiple parallel simple virtual machines using general logic instructions to perform Single Instruction Multiple Data operations. This technique is also known as SWAR, SIMD Within A Register.

This was initially in reference to Eli Biham’s 1997 paper A Fast New DES Implementation in Software, which achieved significant gains in performance of DES by using this method.

Go to Exercise # 3

Leave a comment

Posted by on March 5, 2011 in Topic 2


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s

%d bloggers like this: