05 Mar

SIMD (Single Instruction/Multiple Data)

SIMD stands for Single Instruction Multiple Data. It is a way of packing N (usually a power of 2) like operations (e.g. 8 adds) into a single instruction. The data for the instruction operands is packed into registers capable of holding the extra data. The advantage of this format is that for the cost of doing a single instruction, N instructions worth of work are performed. This can translate into very large speedups for parallelizable algorithms.
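The packing idea can be sketched in plain Python. This is only a model, not real vector code: the helper names `scalar_add` and `simd_add` are hypothetical, the 4-lane width is an assumption, and the "instruction" counts are illustrative bookkeeping rather than hardware measurements.

```python
def scalar_add(a, b):
    """Add element by element: one simulated instruction per element."""
    out, instructions = [], 0
    for x, y in zip(a, b):
        out.append(x + y)
        instructions += 1
    return out, instructions


def simd_add(a, b, width=4):
    """Add `width` elements per simulated instruction (a model, not real SIMD)."""
    out, instructions = [], 0
    for i in range(0, len(a), width):
        # One simulated vector instruction covers up to `width` lanes at once.
        out.extend(x + y for x, y in zip(a[i:i + width], b[i:i + width]))
        instructions += 1
    return out, instructions


a = list(range(16))
b = [10] * 16
scalar_result, scalar_count = scalar_add(a, b)
vector_result, vector_count = simd_add(a, b, width=4)
print(scalar_result == vector_result, scalar_count, vector_count)  # True 16 4
```

Both versions produce the same sums; the modeled vector version simply issues a quarter as many instructions for 16 elements.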

Both PowerPC and IA-32 architectures have SIMD vector extensions. On PowerPC, the extension is called AltiVec. On IA-32, the vector extensions were introduced gradually, first as the Intel MultiMedia eXtensions (MMX) and later as the Intel Streaming SIMD Extensions (SSE, SSE2, SSE3). Common areas where SIMD can yield very large speedups are 3-D graphics (Electric Image, games), image processing (Quartz, Photoshop filters), video processing (MPEG, MPEG2, MPEG4), theater-quality audio (Dolby AC-3, DTS, mp3), and high-performance scientific calculations. SIMD units are present on all G4, G5, and Pentium 3/4/M class processors.

Why do we need SIMD?

SIMD offers greater flexibility and opportunities for better performance in video, audio and communications tasks which are increasingly important for applications. SIMD provides a cornerstone for robust and powerful multimedia capabilities that significantly extend the scalar instruction set.

How do AltiVec Features Compare with SSE, SSE2 & SSE3?

AltiVec and SSE/SSE2/SSE3 are similar in some ways. They are both Single Instruction Multiple Data (SIMD) vector units with what are formally 128-bit register files. A single instruction (e.g. add) encodes for the parallel addition of all elements in one register to the like elements in another register. Indeed, approximately 60% of the instructions in the AltiVec ISA have direct counterparts on the Intel SSE/SSE2/SSE3 architecture. There are some differences, however:

AltiVec
  • 32 separate registers
  • max throughput: 8 Flops / cycle
  • 32-bit saturated arithmetic
  • unsigned compares
  • throughput of 1/cycle for all instructions
  • IEEE-754 (Java subset) compliant

SSE, SSE2 & SSE3
  • 8 XMM registers
  • max throughput: 4 Flops / cycle
  • no 32-bit saturated arithmetic
  • no unsigned compares
  • throughput of one every other cycle for most instructions
  • Fully IEEE-754 compliant
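
Two items in the comparison, saturated arithmetic and unsigned compares, may be easier to see with numbers. The sketch below models a single unsigned 32-bit lane in plain Python; the helper names are hypothetical and it does not reproduce any particular AltiVec or SSE instruction.

```python
UINT32_MAX = 2**32 - 1


def wrapping_add_u32(x, y):
    """Ordinary modular add: overflow wraps around to small values."""
    return (x + y) & UINT32_MAX


def saturating_add_u32(x, y):
    """Saturated add: overflow clamps at the largest representable value."""
    return min(x + y, UINT32_MAX)


def as_signed32(value):
    """Reinterpret a 32-bit pattern as a two's-complement signed integer."""
    return value - 2**32 if value >= 2**31 else value


print(wrapping_add_u32(UINT32_MAX, 5))    # 4
print(saturating_add_u32(UINT32_MAX, 5))  # 4294967295

# The same bit pattern compares differently as unsigned and as signed.
pattern = 0xFFFFFFFF
print(pattern > 1)               # True  (unsigned view: 4294967295 > 1)
print(as_signed32(pattern) > 1)  # False (signed view: -1 > 1)
```

Saturation matters for pixel and audio data, where wrapping from "very bright" to "very dark" is a visible error, while an unsigned compare is needed whenever the top bit is data rather than a sign.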

Am I going to get 38.2 GFlops on a Dual 2.5 GHz G5 for everything?

No. The actual performance depends on the function and the algorithm used. The theoretical peak performance of a 2.5 GHz dual processor G5 machine is calculated as:

(2.5 x 10^9 cycles / s) * (8 FP ops / cycle) * (2 processors) = 40 GFlops

Thus, it is possible that you may write a function that performs even better than 38.2 GFlops. Other functions may never reach this speed. We advertise 38.2 GFlops because that is the speed of the fastest function we have tested. It comes from a convolution function that is among the many vectorized functions in Accelerate.framework, a standard part of Mac OS X. Below is a small table of a few Accelerate.framework functions and the average number of GFlops we measure for them over a number of runs on a 2.0 GHz dual processor machine.

Function                   GFlops
convolution (2048 x 256)   38.2
complex 1024 FFT           23.0
real 1024 FFT              19.8
dot product (1024)         18.3
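
The theoretical peak computed above is just a product of three numbers, so it is easy to redo for other configurations. A minimal sketch follows; the function name `peak_gflops` is made up for illustration, and the inputs are the figures quoted in the text.

```python
def peak_gflops(clock_hz, flops_per_cycle, processors):
    """Theoretical peak in GFlops: cycles/s * FP ops/cycle * processors."""
    return clock_hz * flops_per_cycle * processors / 1e9


peak = peak_gflops(2.5e9, 8, 2)
print(peak)                         # 40.0
print(round(38.2 / peak * 100, 1))  # 95.5
```

On these numbers the advertised 38.2 GFlops is about 95.5% of the theoretical peak of a dual 2.5 GHz machine.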

Why should a developer care about SIMD?

SIMD can provide a substantial boost in performance and capability for an application that makes significant use of 3D graphics, image processing, audio compression or other calculation-intense functions. Other features of a program may be accelerated by recoding to take advantage of the parallelism and additional operations of SIMD. Apple is adding SIMD capabilities to Core Graphics, QuickDraw and QuickTime. An application that calls them today will see improvements from SIMD without any changes. SIMD also offers the potential to create new applications that take advantage of its features and power. To take advantage of SIMD, an application must be reprogrammed or at least recompiled; however you do not need to rewrite the entire application. SIMD typically works best for that 10% of the application that consumes 80% of your CPU time — these functions typically have heavy computational and data loads, two areas where SIMD excels.
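
To see why speeding up only the hot part of a program can still pay off, Amdahl's law (not named in the text above, but the standard way to do this arithmetic) can be applied to the 80% figure. The kernel speedups of 4x and 8x below are assumptions chosen for illustration, not measurements.

```python
def overall_speedup(accelerated_fraction, kernel_speedup):
    """Amdahl's law: whole-program speedup when only a fraction is accelerated."""
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / kernel_speedup)


# 80% of CPU time is in code that SIMD makes 4x (or 8x) faster.
print(round(overall_speedup(0.8, 4), 2))  # 2.5
print(round(overall_speedup(0.8, 8), 2))  # 3.33
```

The untouched 20% bounds the gain: under these assumptions, even an infinitely fast kernel would give at most a 5x overall speedup.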

Is SIMD easy to learn?

Neither SIMD environment supported by Apple requires you to write in assembly. By taking advantage of the AltiVec C Programming Model and the Intel C Programming Model, developers may leverage their experience with C, C++ or Objective-C for easier entry into SIMD.

MIMD (Multiple Instruction stream, Multiple Data stream)

MIMD multiprocessing architecture is suitable for a wide variety of tasks in which completely independent and parallel execution of instructions touching different sets of data can be put to productive use. For this reason, and because it is easy to implement, MIMD predominates in multiprocessing.

Processing is divided into multiple threads, each with its own hardware processor state, within a single software-defined process or within multiple processes. Insofar as a system has multiple threads awaiting dispatch (either system or user threads), this architecture makes good use of hardware resources.

MIMD does raise issues of deadlock and resource contention, however, since threads may collide in their access to resources in an unpredictable way that is difficult to manage efficiently. MIMD requires special coding in the operating system of a computer but does not require application changes unless the programs themselves use multiple threads (MIMD is transparent to single-threaded programs under most operating systems, if the programs do not voluntarily relinquish control to the OS). Both system and user software may need to use software constructs such as semaphores (also called locks or gates) to prevent one thread from interfering with another if they should happen to cross paths in referencing the same data. This gating or locking process increases code complexity, lowers performance, and greatly increases the amount of testing required, although not usually enough to negate the advantages of multiprocessing.
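
The locking described above can be sketched with Python's standard `threading` module. This is a toy model with a hypothetical `SharedCounter` class: four threads update one shared value, and a `threading.Lock` plays the role of the semaphore or gate so that no update is lost.

```python
import threading


class SharedCounter:
    """A value shared between threads, guarded by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Only one thread at a time may run the read-modify-write below.
        with self._lock:
            self.value += 1


def run_workers(num_threads, increments_each):
    counter = SharedCounter()

    def worker():
        for _ in range(increments_each):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


total = run_workers(4, 10_000)
print(total)  # 40000
```

The lock makes the result predictable, at the cost the text mentions: every increment now pays for acquiring and releasing the lock.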

Similar conflicts can arise at the hardware level between processors (cache contention and corruption, for example), and must usually be resolved in hardware, or with a combination of software and hardware (e.g., cache-clear instructions).

In computing, MIMD (Multiple Instruction stream, Multiple Data stream) is a technique employed to achieve parallelism. Machines using MIMD have a number of processors that function asynchronously and independently. At any time, different processors may be executing different instructions on different pieces of data. MIMD architectures may be used in a number of application areas such as computer-aided design/computer-aided manufacturing, simulation, modeling, and as communication switches. MIMD machines fall into either the shared memory or the distributed memory category, depending on how the processors access memory. Shared memory machines may be of the bus-based, extended, or hierarchical type. Distributed memory machines may have hypercube or mesh interconnection schemes.

Shared Memory Model

The processors are all connected to a “globally available” memory, via either a software or hardware means. The operating system usually maintains its memory coherence.

From a programmer’s point-of-view, this memory model is better understood than the distributed memory model. Another advantage is that memory coherence is managed by the operating system and not the written program. Two known disadvantages are: scalability beyond thirty-two processors is difficult, and the shared memory model is less flexible than the distributed memory model.

Examples of shared memory machines (multiprocessors) include UMA (Uniform Memory Access), COMA (Cache-Only Memory Architecture), and NUMA (Non-Uniform Memory Access).


MIMD machines with shared memory have processors which share a common, central memory. In the simplest form, all processors are attached to a bus which connects them to memory.


MIMD machines with hierarchical shared memory use a hierarchy of buses to give processors access to each other’s memory. Processors on different boards may communicate through inter-nodal buses. Buses support communication between boards. With this type of architecture, the machine may support over nine thousand processors.

Distributed memory

In distributed memory MIMD machines, each processor has its own individual memory. No processor has direct knowledge of another processor’s memory. For data to be shared, it must be passed from one processor to another as a message. Since there is no shared memory, contention is not as great a problem with these machines. It is not economically feasible to connect a large number of processors directly to each other. A way to avoid this multitude of direct connections is to connect each processor to just a few others. This type of design can be inefficient because of the added time required to pass a message from one processor to another along the message path. The amount of time required for processors to perform simple message routing can be substantial. Systems were designed to reduce this time loss; hypercube and mesh are two of the popular interconnection schemes.

Examples of distributed memory machines (multicomputers) are MPP (massively parallel processors) and COW (clusters of workstations). An MPP is complex and expensive: many supercomputers coupled by broadband networks, for example with hypercube or mesh interconnections. A COW is the “home-made” version at a fraction of the price.
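
The message-passing rule described above (private memory, data shared only by sending it) can be modeled in a few lines. This is a sketch only: the `Node` class is hypothetical, and threads with `queue.Queue` inboxes stand in for separate processors and their interconnect.

```python
import queue
import threading


class Node:
    """A simulated processor with private memory and an inbox for messages."""

    def __init__(self, name):
        self.name = name
        self.memory = {}            # private: other nodes never touch this
        self.inbox = queue.Queue()  # the only way data reaches this node

    def send(self, other, key, value):
        other.inbox.put((self.name, key, value))

    def receive(self):
        sender, key, value = self.inbox.get()  # blocks until a message arrives
        self.memory[key] = value
        return sender


def demo():
    a, b = Node("A"), Node("B")
    a.memory["x"] = 42

    # A sends a copy of x; B stores whatever arrives in its own memory.
    threads = [
        threading.Thread(target=lambda: a.send(b, "x", a.memory["x"])),
        threading.Thread(target=b.receive),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return b.memory


received = demo()
print(received)  # {'x': 42}
```

Because B only ever sees a copy delivered through its inbox, there is no shared variable to lock, which is the reduced-contention property the text describes.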

Hypercube interconnection network

In an MIMD distributed memory machine with a hypercube system interconnection network containing four processors, a processor and a memory module are placed at each vertex of a square. The diameter of the system is the minimum number of steps it takes for one processor to send a message to the processor that is the farthest away. So, for example, the diameter of a 2-cube is 2. In a hypercube system with eight processors, with a processor and memory module placed at each vertex of a cube, the diameter is 3. In general, in a system that contains 2^N processors, with each processor directly connected to N other processors, the diameter of the system is N. One disadvantage of a hypercube system is that it must be configured in powers of two, so a machine may have to be built with many more processors than the application really needs.
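
The diameter rule can be checked by brute force. The sketch below assumes the usual labeling, which the text does not spell out: each of the 2^N processors gets an N-bit number, and two processors are directly connected when their numbers differ in exactly one bit. Function names are hypothetical.

```python
from itertools import combinations


def neighbors(node, dimensions):
    """Directly connected nodes: flip each of the N address bits in turn."""
    return [node ^ (1 << bit) for bit in range(dimensions)]


def hops(src, dst):
    """Minimum steps between two nodes: the number of differing address bits."""
    return bin(src ^ dst).count("1")


def diameter(dimensions):
    """Largest minimum hop count over all pairs of nodes."""
    nodes = range(2 ** dimensions)
    return max(hops(a, b) for a, b in combinations(nodes, 2))


print(neighbors(0b000, 3))                    # [1, 2, 4]
print(diameter(2), diameter(3), diameter(4))  # 2 3 4
```

Under this labeling the farthest processor is always the one with every address bit flipped, which is why the diameter equals N.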

Mesh interconnection network

In an MIMD distributed memory machine with a mesh interconnection network, processors are placed in a two-dimensional grid. Each processor is connected to its four immediate neighbors. Wraparound connections may be provided at the edges of the mesh. One advantage of the mesh interconnection network over the hypercube is that the mesh system need not be configured in powers of two. A disadvantage is that the diameter of the mesh network is greater than that of the hypercube for systems with more than four processors.
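
The diameter comparison can be made concrete for a few sizes. This sketch assumes a rectangular mesh without wraparound connections (wraparound links would shorten the mesh distances), and the function names are hypothetical.

```python
from itertools import product


def mesh_diameter(rows, cols):
    """Largest Manhattan distance between any two grid positions (no wraparound)."""
    nodes = list(product(range(rows), range(cols)))
    return max(abs(r1 - r2) + abs(c1 - c2)
               for (r1, c1) in nodes for (r2, c2) in nodes)


def hypercube_diameter(processors):
    """log2(processors); assumes `processors` is a power of two."""
    return processors.bit_length() - 1


for rows, cols in [(2, 2), (4, 4), (8, 8)]:
    count = rows * cols
    print(count, mesh_diameter(rows, cols), hypercube_diameter(count))
# 4 2 2
# 16 6 4
# 64 14 6
```

With four processors the two networks tie; beyond that, the mesh diameter grows roughly with the square root of the processor count, while the hypercube diameter grows with its logarithm.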

SISD (Single Instruction, Single Data)

In computing, SISD (Single Instruction, Single Data) is a term referring to a computer architecture in which a single processor, a uniprocessor, executes a single instruction stream, to operate on data stored in a single memory. This corresponds to the von Neumann architecture.

SISD is one of the four main classifications as defined in Flynn’s taxonomy. In this system classifications are based upon the number of concurrent instructions and data streams present in the computer architecture. According to Michael J. Flynn, SISD can have concurrent processing characteristics. Instruction fetching and pipelined execution of instructions are common examples found in most modern SISD computers.


Posted by on March 5, 2011 in Topic 2

