Non-Uniform Memory Access
Non-Uniform Memory Access or Non-Uniform Memory Architecture (NUMA) is a computer memory design used in multiprocessors, where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than non-local memory, that is, memory local to another processor or memory shared between processors.
NUMA architectures logically follow in scaling from symmetric multiprocessing (SMP) architectures. Their commercial development came in work by Burroughs (later Unisys), Convex Computer (later Hewlett-Packard), Silicon Graphics, Sequent Computer Systems, Data General and Digital during the 1990s. Techniques developed by these companies later featured in a variety of Unix-like operating systems, and somewhat in Windows NT.
Hierarchical memory organization
A parallel processor computer system having a large shared memory (Mps) is provided with shared memory caches (Cm) associated with the various modules of the memory system selectively connectable to each other and to the processors (P) over one or more crossbar or multi-stage interconnection networks. The memory caches (Cm) coexist with processor caches (Cp), located adjacent to each processor (P). The processor caches primarily store private data, but may also store shared-read-only data. The control logic for the shared memory parallel processor system utilizes the shared memory caches to cache shared data and improve the memory performance by reducing the memory access time for shared memory (511, 610). It also reduces the latency of a shared memory reference that has to be routed via the multiple processor multi-stage interconnection networks and increases its own throughput by effectively reducing the memory access time and avoids the need for cache coherence among the shared memory caches. Access to the shared memory caches is not restricted to one processor, or cluster of processors, but is distributed across the memory modules of the system and is accessible by all processors. Further, the shared memory caching scheme scales up with the number of processors in the system with minimum changes to addressing and other control mechanisms.
Hierarchical Memory Organization
The effective memory access time of a computer system has a substantial effect on the overall performance of the system. Therefore many techniques have been suggested to minimize this time, for example interleaving, multi-word access per cycle, caching, etc. Among these techniques caching has long been a very popular technique for this purpose.
In conventional uniprocessor or multi-processor system organizations caches have been placed in between the processor and the memory, and have been physically placed closer to the processor. This has been done to minimize the cache access time as much as possible, so that the effective memory access time could be minimized as much as possible.
Such caches usually have a much shorter access time than the next level of memory and are conventionally provided with at least limited associative access. However, cache memory is much more expensive than main store and various cache management policies are used to maintain the most frequently accessed data and instructions in cache.
In conventional multiple processor systems each processor is assigned its own cache. These processor-cache units are then connected via some interconnection facility (usually one or more shared busses) to the shared memory and can generally cache any memory location. Since each processor has its own cache, some mechanism for cache coherence needs to be supported in order to ensure that all the processors have a consistent image of the shared memory.
In order to make conventional multiple processor caches effective, the cache coherence mechanism has to be fast and therefore needs to be implemented in hardware. Cache coherence mechanisms that have been used or proposed for such systems, either employ: a centralized cache coherence manager and/or some “watch-dog” logic that monitors the traffic from the caches to the shared memory. This later technique also requires the broadcasting of all the traffic to the shared memory to all the processors’ “watch-dog” logic. Conventional multiple processor systems generally have used a dedicated, shared bus for this broadcast.
Such cache coherence mechanisms are suitable for systems that have a small number of processors (e.g. about 1 to 32 processors). But they are not feasible for systems that have a larger number of processors (e.g. 100s or 1000s of processors). The main reason for this is that the centralized cache coherence manager and the broadcast bus become a severe performance bottleneck as the number of processors in the system increases. Therefore, for shared memory systems with large numbers of processors, for example parallel processor systems, system architects have either decided to handle cache coherency via software or have avoided using a cache altogether.
Although these techniques solve the cache coherence problem, they do not reduce the shared memory’s effective access time, as well as a cached system with hardware cache coherence. Therefore, for systems with a large number of processors, there is a need for a better shared memory caching method that does not have the problems of the above two methods and that can help improve the overall performance of the system.
In shared memory parallel processor systems, a reduced shared memory access time can have a significant impact on the performance of the interconnection network and the overall system too. For example in the RP3 system (see Fig. 2) the overall system memory 205 is functionally spread across the memory modules assigned to each processor 201. In order for a processor 201 to access shared memory locations that are not in its memory module, it needs to send these references across the network 206. The time taken to satisfy this request is not only dependent on the memory access time of the shared memory reference, but also on the latency of the network 206. The latency of the network 206 is dependent on many parameters, among which are blocking in the network, queuing in the network and the busy time of its output ports.
These network parameters in turn are affected by the effective memory access time. Therefore, in such parallel systems, reducing the effective access time of the shared memory 205 can significantly improve the overall system performance.
Although each processor 201 in the RP3 system (Fig. 2) has a cache 203, this cache is mainly used to cache instructions and non-shared (i.e. private) data. Shared data can be cached in this cache 203, provided the software has obtained mutually exclusive use of this data via locking or some other arrangement in the memory 205 as will be well understood by those skilled in the art. Obtaining this mutual exclusion itself adds significant overhead to shared data references and therefore can degrade the performance of the system.
Even though there is a way to cache shared data (as mentioned above) in RP3 like processor caches, it is not desirable to cache all types of shared data. For example, a system/application may not want to cache shared locks, indices, pointers, etc. In order to improve the time taken to reference these types of shared data, RP3 like systems provide Fetch & Operation (F&O) type of instructions (e.g. Fetch & Add) and a combining network . But again one of the parameters that affects the performance of F&O type operations and combining networks is the shared memory’s effective memory reference time. Therefore improving the access time of the shared memory will also help here.
The following example gives an idea of how significantly shared memory access time can affect system performance. In the RP3 system shared data can reside in the memory module 205 that is attached to the processor (i.e. local memory) or in any other memory module (see Fig. 2). In the latter case the memory reference has to be routed via the network 206. Assuming that it takes one time unit to reference data from the cache 203, then the ratio of access times for the above mentioned two types of memory references is: EMI5.1
It is apparent that in the local shared memory access case the overhead contributed by the memory access time is 90% of the effective memory reference time. In the shared memory reference across the network case, this overhead is 56% of the effective memory reference time.
It may further be seen that improving the access time of the shared memory is very important for multiple processor systems. Since conventional methods for shared memory caching, incorporating hardware coherence mechanisms, do not scale (i.e. are feasible) as the number of processors in the system grow, new methods to cache shared data need to be developed.
The Shared Memory Cache proposed herein meets these needs very well and therefore is very attractive for systems with a large number of processors (e.g. parallel processor systems).
The caching scheme (organization) of the present invention is compared/contrasted with other caching schemes that have been previously proposed for improving the effective access time of shared memories. In order to effectively do this comparison, only known cache architectures that can support all of the following features have been considered to be relevant in view of the vast amount of published literature in the field of cache architectures etc. 1. The architecture must allow both Processor and Memory caches to coexist in the memory hierarchy. 2. Both the Processor and Memory caches are physically distributed in the system. 3. The architectures do not require any hardware, cache coherence mechanisms for the caches for the purpose of maintaining proper data coherence. 4. The caching scheme should be readily able to scale with the size of the system.
That is the architecture should be capable of supporting the same caching scheme as the number of processors increase in the system without significant revision. 5. The caching scheme should be feasible for large systems, for example, parallel processor systems with hundreds or even thousands of processors.
It is believed that the above are necessary features for any large multiple processor system from both the performance and cost point of view, these were set as minimum standards in designing the present system.
One of the present inventors generally describe the attributes of an experimental high speed multi-processor computing system having a large shared memory. The first, is tutorial in nature, and describes a system in which the present invention has particular utility. The second article broadly describes a memory element for such a system and generally describes certain cache management organizations suitable for use in such large shared memories. Both of these articles provide valuable background information for the present invention.
In high-end computing systems, for example the S/370, constituting a dual cache hierarchy have been proposed and used. In such systems, an L1 cache is attached to each processor and L2 cache is attached to several L1 caches. It should be noted that each processor has exclusive access to its own L1 cache. Similarly an L2 cache exclusively supports the L1 caches attached to it. It is possible for several such processor-L1-L2 cache subsystems to exist within a system environment.
In such a processor-L1-L2 cache system, the L2 cache size is generally larger than the cumulative size of the L1 caches. Furthermore, the L2 cache holds information that is a super set of any L1 cache attached to it. In fact the L2 information is generally a super set of the information held by all the L1 caches attached to it.
There is a fundamental difference between the processor and memory caches taught by the present invention and the L1-L2 cache scheme. This difference is that in the present invention the information held by the memory caches is not a super set of the processor cache information. For example, private data will be stored in the processor caches only, and not in the memory caches. In the L1-L2 cache scheme, this private data would be stored in both the L1 and L2 caches.
Further in the present invention the size of the memory cache need not depend on the size of the processor cache. That is unlike the L2 cache, the memory cache does not have to be larger than a processor cache, or larger than the cumulative size of all the processor caches. Also, shared read-only data, of a memory cache, can reside in several processor caches. Furthermore, shared read-only data from several memory caches, can reside in a processor cache. That is unlike the L1-L2 caches, there is no exclusive relationship between the processor and memory caches.
Only one other cache organization is known to the inventors in addition to the one proposed by the present disclosure, that satisfies all of the above requirements.
The cache organization of the present invention, referring briefly to Fig. 4, has the following advantages over the Hwang and Briggs organization. It does not need three separate networks 408 to interconnect the memory caches 405 to the processor 401 and the memory modules 407. According to the teachings of the present invention the processors 401 would be interconnected to the memory caches 405 and the memory modules 407 via the same network 408. As contrasted to the present invention the Hwang and Briggs organization uses the three separate networks 903, 911 and 912. Network 903 is used to connect the processors 906 to the memory modules 905. Network 911 connects the processors 906 to the shared memory caches 902 and uses network 912 to connect the memory caches 902 to the memory modules 905.
As will be apparent the cache organization of the present invention is considerably more cost effective than that disclosed in the Hwang and Briggs publication.
From a packaging point of view also, the cache organization of the present invention is a more effective and efficient organization, because it does not require the extra networks 911 and 912 needed by the organization as shown in Fig. 9. Further, the memory caches 405 can be packaged with the memory modules 407 that they are attached to, thus leading to significant manufacturing cost reductions.
The cache organization of the present invention is also inherently a higher performing organization than the one shown in Fig. 9. This is because, in the case of a memory cache miss, the memory reference has to be directed to the appropriate memory 905 via a network 912 and bus 910. In the organization of the present invention, the memory cache 405 is attached to its memory modules 407 directly over the very short bus 406. Generally the latency of the network and a bus will be considerably higher than that of a bus alone.
The Carrick-on-Shannon architecture proposed by Linn and Linn conceptualizes the use of a separate processor 1003 and memory caches 1007 as shown in the functional block diagram of Fig. 10.
A major difference between the present architecture and that of the Linn-Linn paper is that it (Linn-Linn paper) proposes a processor-memory cache scheme, that is based on the use of a shared bus 1005 to interconnect the several processors and memories of the system; while the herein disclosed memory architecture uses an interconnection network (e.g. a crossbar or MIN type network) to interconnect the several processors and memories of the system. This distinction is extremely important, because it determines the scalability of the parallel processor system. In practice, bus based schemes are limited to interconnect only a small number of processors (e.g. less than 65 processors). Since disclosed system is based on an interconnection network, it is scalable to connect a significantly larger number of processors (e.g. hundreds).
The Linn architecture also imposes a particular cache management policy to be used by the memory cache 1007, while the architecture of the present invention can utilize any cache management policy suitable to the system.
It should clearly be understood that it is not a simple matter to replace a bus with an interconnection network. This is because a bus based system architecture requires the use of broadcast to support Test-and-Set operations. The normal mode of operation of a bus is broadcast. But the normal mode of operation of an interconnection network is point-to-point communication. The interconnection network needs to be especially designed to support broadcast. Furthermore, the use of broadcast in interconnection networks, can have performance degradation and cost implications.
The architecture of the present invention does not require broadcast to be supported by the interconnection network. The Test-and-Set operation, used herein is atomically executed at the memory module (memory cache and memory logic) and the results are communicated directly to the processor requesting this operation. Thus, the results or negative acknowledgements are not broadcast to all the processors in the system. These results are communicated using point-to-point communication.
The Linn-Linn architecture also imposes some restrictions on the Test-and-Set (Tset) operations, for their memory caching scheme. As stated in their paper, they allow “only one Tset operation to be in progress or enquired at a memory module at any time. Any additional Tsets received would be negatively acknowledged”. They also suggest that this scheme can be modified to “accept as many Tsets as desired, as long as the same semaphore is referenced. In this situation, only the first Tset is enquired; all others on the same semaphore are simply acknowledged and discarded”.
Restricting only one Tsets operation to be enquired at the memory module at any one time is extremely limiting for a system with a large number of processors. In the present system no such restrictions are imposed. In fact, any number of Tset operations can be enquired. In particular, it should be noted that no Tset operations are discarded or negatively acknowledged. Such operations may be enquired. Each such enquired Tset operation is atomically executed by the memory cache and memory logic and the results of this operation are returned to the appropriate processor.
Furthermore, the support of other types of synchronization operations is envisioned for example the Fetch&Add. These operations are also executed atomically at the memory module and the results of these operations are returned to the appropriate processor.
The herein disclosed architecture also allows local memory modules to be attached to processors of the system. Examples of such memory modules are shown in Figs. 5 and 6. Similarly, local memory modules can also be configured for the examples shown in Figs. 7 and 8. If such a local memory is used, the present architecture would take advantage of it by storing private data and instructions in this memory. Since private data and instructions are cached by the processor cache only, system performance can be increased by avoiding accessing them from the memory across the interconnection network.
The architecture proposed by the Linn-Linn paper requires that a uniform address space be used. That is memory is not partitioned into local and shared memory.
It should be noted of course that the present architecture can also be used for a system that supports only uniformly addressed memory but this is not necessary.
Finally, the architecture proposed by Linn-Linn requires that a processor can have only one outstanding request to memory. That is the processor has to wait, to receive a response from the memory, for every memory request. It cannot execute any other request during this waiting period. This can severely limit the performance of large multi-processor systems.
The present architecture does not impose any such restrictions. Thus, the number of outstanding requests at a processor are only limited by: 1) the design of the processor: and 2) the nature of the computation being executed at the processor.
It should also be noted that the architecture of the present invention is also significantly different from cache proposals such as the one described in U.S. Patent 4,622,631 of Frank et al, in that the present architecture does not require any hardware cache coherence support. In contrast, the cache architecture disclosed in U.S. Patent 4,622,631 is primarily directed to a hardware cache coherence scheme. It also assumes that only a processor cache is used and that this cache stores both private and shared data. It will be apparent from the following description and from the high level functional block diagram of Fig. 4, that in the present architecture, each processor has its own dedicated cache 403 Cp, as well as a memory cache 405 Cm directly associated with each memory module.
It is the primary object of the present invention to provide an improved hierarchical memory system and to a method for managing shared data stored therein, which system is uniquely suited for use with a large multi-processor computing system, that automatically maintains data coherence for cached data, and wherein “private and shared read-only” data and shared data is cached in two separate caches.
It is another object of the invention to provide such a system wherein the two caches are processor caches and a memory caches where the memory caches are physically distributed throughout the multi-processor memory system, and which is easily scalable to large numbers of processors and does not require a special hardware means for maintaining data coherence.
The solution of the objects for the system and the method are described in the characterizing parts of claim 1 and claim 15 respectively.
The objects of the present invention are accomplished in general by a large distributed memory system having a plurality of separate, individually accessible memory modules wherein a separate memory cache is functionally associated with each memory module, said cache being functionally located between the memory module and any processor requiring access thereto.
According to a further aspect of the invention, the shared memory is particularly suited for use with a large multi-processor system wherein the memory caches may be shared between various processors and wherein each processor is provided with its own processor cache for storing data, private to that processor or shared on a read-only basis by all processors.
According to yet another aspect of the invention, each private processor cache is physically located adjacent to its respective processor and each shared memory cache is located physically adjacent to its own memory module or group of memory modules and wherein the individual processors and the individual memory modules are directly connectable to each other over a multistage interconnection network or a crossbar network.
The herein described hierarchical memory system architecture and methodology for use with shared memories in large, high speed multi-processor systems is designed to improve the effective access time for shared memory operations. The disclosed memory cache and processor cache organization is more effective than other known schemes which had been proposed for this purpose in the past, both from a cost and performance view. Further, the herein described systems does not require any hardware cache coherence support. Therefore, unlike conventional shared memory caching schemes and scheme discussed for the Carrick-on-Shannon architecture shown in Fig. 10, the present memory caching architecture and methodology can scale upwards as the number of processors in the system increases.
The only modifications necessary would be to increase the size of the address field or identifiers so that information may be returned to the proper requesting processor as will be well understood.
A detailed description of how external input/output devices can be interfaced to the system is not specifically set forth since virtually any I/O attachment mechanism can be accommodated within the system framework. Examples include attachment of I/O to the network(s) or directly to some of the processors.
While the invention has been described with respect to several preferred embodiments of the overall hierarchical memory system architecture, the underlying feature of the invention is the use of individual memory caches located between the memory module or modules which they serve and the communication network interconnecting the memory subsystem with the processor. It will be apparent that many modifications in form and detail may be made by those skilled in the art without departing from the essential spirit and scope of the invention as set forth in the appended claims.
Fig. 1 comprises high level functional block diagram of a conventional parallel processor system. Fig. 2 comprises a high level functional block diagram of the RP3 parallel processor system organization. Fig. 3 comprises a high level functional block diagram of a system organization incorporating the architectural principles of the present invention showing separate processor and memory caches. Fig. 4 comprises a high level functional block diagram of a variant of the organization shown in Fig. 3 wherein separate processor and memory caches can be supported and are organized as in the RP3 system architecture as shown in Fig. 2. Fig. 5 comprises a high level functional block diagram of still another variant of the multi-processor system of Fig. 3 including processor and memory caches with separate, private and shared memories. Fig. 6 comprises a high level functional block diagram of another multi-processor system configuration incorporating the teachings of the present invention showing processor and memory caches with separate, private and shared memories. Fig. 7 comprises a high level functional block diagram of another multi-processor system configuration illustrating the present invention including multiple memory modules attached to each of several memory caches, said system also being provided with appropriate processor caches.
While not shown in the figure, local memory (mp) can also be included as illustrated in Figs. 5 and 6. Fig. 8 comprises a high level functional block diagram of another multi-processor system configuration of the present invention illustrating a system and memory organization having multiple network output (memory) ports attached to separate memory caches.
While not shown in this figure, local memory can also be included as illustrated in Figs. 5 and 6. Fig. 9 (Prior Art) comprises a high level functional block diagram of the Hwang and Briggs architecture interconnecting processor and memory caches over a complex interconnection network. Fig. 10 (Prior Art) comprises a high level functional block diagram of the Carrick-on-Shannon system architecture’s processor and memory cache organization. Fig. 11 comprises a flow chart describing the operation of the control logic of a processor cache resident in a multi-processor system organization incorporating the features of the present invention. Fig. 12 comprises a flow chart describing the operation of the control logic of a memory cache resident in a multi-processor system incorporating the features of the present invention.
The herein described invention proposes that in shared memory type multiple processor systems, private data (i.e. non-shared data or shared read-only data) and shared data be cached in two separate caches. The private data cache should be organized closer to the processor, as done in conventional caching; while the shared memory cache should be organized closer to the memory. In the following description, the private data cache will be referred to as the Processor Cache and the shared data cache the Memory Cache. Further, each processor is assigned its own Processor Cache; while each shared memory bank or each network’s memory port is assigned a Memory Cache. Fig. 3 shows the organization of a generalized multiple processor 301 system that supports both of these types of caches 303 and 307.
Fig. 3 is an example of an organization of a basic system that supports such a dual cache architecture. Some other examples are shown in Figs. 4 to 8. The organization shown in these figures can be interpreted as either a logical or physical organization of a system. The organization shown in Fig. 4 demonstrates the location of the Memory and Processor Caches 405 and 403 in an RP3 type of tightly coupled multi-processor system. The organization shown in Figs. 5 and 6 illustrates the location of these caches in a system that supports separate private memories 505 and 605 and shared memories 511 and 610. The organization shown in Fig. 7 shows that a Memory Cache 707 can be attached to more than one shared memory module 709 to 710; while Fig. 8 shows that multiple interconnection network ports 806 to 807 can be interfaced to a single Memory Cache 808.
Before proceeding with a detailed description of the present invention a number of salient points of the proposed cache organization are presented.
It should be noted that in the subsequent description all of the reference numbers are keyed to their respective figures so that the first digit, or pair of digits in the case of Figs. 11 & 12, appear in that particular figure, thus, cache 303 is in Fig. 3, cache 808 is in Fig. 8 and functional block 1111 is is Fig. 11, etc.
It will be noted that in the basic organization of Fig. 3, the processor caches 303 and the memory caches 307 are physically distributed across the system, thus each of the processor caches is adjacent to the processor which it serves and each of the memory caches is adjacent to the module which it serves. There is no need for any signaling mechanism such as a special shared bus for “watch-dog” logic between the different processor and memory caches of the system such as is required in some of the prior art references for maintaining cache coherency.