The memory caches 307 are placed physically and functionally between the shared memory module or section which they serve and the rest of the system, e.g., communication network and processors. If the system’s memory modules 407 support both private and shared data, as in the RP3 system and as designated Mps in Figs. 4, 7 & 8, then each memory cache 405 is placed between these modules 407 and the rest of the system as shown in Fig. 4.
Referring to the memory architecture shown in Fig. 7, the memory caches 707 can be used to cache references to more than one memory module or bank of modules 709-710. However, each memory cache 707 is provided with exclusive access to its associated set of memory modules or banks 709-710.
Within the overall memory cache architecture of the present invention, there is no need for the system to support any hardware cache coherence schemes for either the processor cache or the memory cache. Cache coherence is not needed for the processor caches such as 703 in Fig. 7 because they cache private and shared read-only data and are accessible only to their associated processor, hence no other processor can change data resident therein out of sequence. It is also not needed for the proposed memory caches, e.g., 707 because each memory cache 707 is provided with exclusive access to its own memory modules 709-710.
It is to be understood however, that this does not preclude using software coherence techniques such as locks, time stamps, etc., within each of the memory cache management controls for mutually exclusive access to shared data, so that this data can be cached in a processor cache 703.
It is also not necessary for the overall multiprocessor/memory system to support processor caches such as 703 in order to support the memory caches 707. In fact, in systems that do not support processor caches 703, the memory caches 707 can also be used to cache private data if needed by dedicating certain designated areas of an individual memory cache to a particular processor.
Referring briefly to Fig. 8, a memory cache 808 can be attached to one or more network ports, e.g., 806-807, which are attached to the memory modules or banks 810-811 containing shared memory data. Similarly, the memory caches 808 do not preclude the use of buffers at the output ports 806-807 of the network 805, to hold outstanding requests to the memory caches 808 and memory modules 810-811 attached to the respective ports 806-807 through the memory cache 808. Such buffering as is well known in the art can help reduce tree saturation (hot spots) in the network.
Returning now to the overall description, in order to demonstrate the performance improvement potential that such Memory Caches have, their use in an RP3-like system is depicted in Fig. 4. The RP3 example referenced in the previous section described the various shared memory access times. If it is assumed that Memory Caches 405 (as proposed here) are used in an RP3-like architecture and that their access time is equivalent to the RP3 cache (i.e. one time unit), then the shared memory access times for a cache hit are:
Local Memory shared information access time = 1 + 1 = 2
Network Memory shared information access time = 1 + 6 + 1 = 8
On the other hand, the shared memory access times for a cache miss are:
Local Memory shared information access time = 1 + 1 + 9 = 11
Network Memory shared information access time = 1 + 6 + 1 + 9 = 17
(The above numbers have been derived by adding the memory access time overhead, of 9 time units, to the cache hit numbers).
Although for the cache miss case it is seen that the effective shared memory access time is degraded by 6% to 10%, there is a substantial improvement of 50% to 80% for the cache hit case. In fact these results are very attractive because they indicate that the Memory Caches 405 will improve the effective shared memory access time, as long as the Memory Cache 405 hit probability is higher than 0.12.
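The break-even point quoted above can be checked with a short calculation. The following Python sketch is illustrative only. The baseline times without a Memory Cache (10 and 16 time units) are not stated in this section; they are an assumption, inferred here by removing the one-unit Memory Cache lookup from the miss times. The names `effective_time` and `break_even` are hypothetical.

```python
from fractions import Fraction

# Time units from the text: cache lookup = 1, network traversal = 6, memory access = 9.
HIT = {"local": 1 + 1, "network": 1 + 6 + 1}            # 2 and 8
MISS = {"local": 1 + 1 + 9, "network": 1 + 6 + 1 + 9}   # 11 and 17
# Assumption: baseline without a Memory Cache is the miss time minus the 1-unit lookup.
BASELINE = {kind: t - 1 for kind, t in MISS.items()}    # 10 and 16


def effective_time(kind, hit_probability):
    """Expected shared memory access time for a given Memory Cache hit probability."""
    h = Fraction(hit_probability)
    return h * HIT[kind] + (1 - h) * MISS[kind]


def break_even(kind):
    """Hit probability at which the Memory Cache neither helps nor hurts."""
    return Fraction(MISS[kind] - BASELINE[kind], MISS[kind] - HIT[kind])


for kind in ("local", "network"):
    print(kind, float(break_even(kind)))
```

Under these assumptions both the local and the network case break even at a hit probability of 1/9 (about 0.11), which is consistent with the threshold of 0.12 quoted above.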
The above example indicates that Memory Caches can be very effective even when their cache hit probability is very low. Therefore they are believed to be very attractive for multiple processor systems.
To facilitate an understanding of the operation of the preferred embodiment of the present invention, reference will be made to the operation and function of a basic multi-processor system as shown in Fig. 1.
Such a parallel processor system can be seen to contain three distinct elements: processor element (PE) 101, memory element (ME) 102 and an interconnection network 103. A parallel processor system consists of several processors and memory elements that are connected to each other via the interconnection network. One or more networks can be used for this interconnection. In order to communicate across this network, a PE 101 sends a message over line 104 to the network 103. The network routes this message to the required ME. The memory element 102 receives this message over line 105, processes it and sends a reply message over line 106 to the network 103. The network then routes this message to the requesting PE. The PE receives this message over line 107 and processes it. It should be noted that the network can also be used to communicate between the PEs.
The details of the operation of the PE, ME and interconnection network are not relevant to the present invention and are consequently not discussed in detail. The following general description and reference to the many articles describing state-of-the-art multi-processor systems will allow those skilled in the art to practice the invention.
Parallel processor systems can support caches at the processors. One example of such a system is the RP3. The RP3 system organization is shown in Fig. 2. In the RP3 the cache 203 is managed by software, that is there is no hardware cache coherence scheme supported by the system. In the RP3 system, when the processor 201 generates a memory request, the memory request is transmitted via line 202 to the cache 203. If the memory request is cacheable and the cache memory contains the memory information requested, then the cache 203 accesses the requested information from its memory as required. The cache 203 then sends the required response back over line 202 to the processor 201.
But, if the memory request is not cacheable, or the cache memory does not contain the required information, then the request is sent via line 204 to the memory module 205 locally attached to the processor or across the network(s) 206. The memory module that receives this request accesses this information and sends an appropriate response to the requesting processor. If the information was cacheable, then the cache 203 updates its memory contents and then sends the response to the processor 201 over line 202. If the information is not cacheable, then the cache 203 does not update its memory contents, but sends the response to the processor 201 over line 202.
According to the present invention the parallel processor system is provided with and supports caches at both the processors and memory elements. One example of such a system is shown in Fig. 3. In such a system the references to memory are either marked private or shared, or they are identified by the caches 303 and 307 by examining the address range in which they map. (It should be noted that the particular scheme used is not important for the invention disclosed and described here). In such a system the processor cache 303 and the memory cache 307 do not require any hardware cache coherence support.
In the system shown in Fig. 3, when the processor 301 generates a memory request, the memory request is transmitted via line 302 to the processor cache 303. If the memory request is cacheable (i.e., it is a private or shared read-only memory reference) and the processor cache’s memory contains the memory information requested, then the processor cache 303 accesses the information as required. The processor cache 303 then sends the required response to the processor 301 via line 302. But, if the memory request is not cacheable, for example a shared read-write memory reference, or the processor cache’s memory does not contain the required information, then the request is sent via line 304 to the memory module 308 across the network(s) 305. The network 305 routes the message to the network port 306 to which the required memory module is attached.
The request is intercepted by the memory cache 307 which accesses the information as required.
The “cacheability” and “shareability” characteristics of a particular memory request would conventionally be carried in special fields as will be well understood by those skilled in the art. Whether the requested data is, in fact, valid and currently in the memory cache would of course be determined by a search in the particular memory cache’s directory.
The memory cache 307 then sends the required response to the processor 301, via the network 305 and line 306. But, if the memory request is not cacheable, or if the information requested is not in the appropriate cache, then the request is sent via line 309 to the memory module 308. The memory module that receives this request accesses this information and sends an appropriate response back over line 309 to the requesting processor. If the information was cacheable, then the appropriate cache 307 or 303 updates its memory contents and then sends the response to the processor 301. It should be noted that if the request was routed via the network to the memory module, then the response from the memory module will also be generally routed via the network.
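The request path of Fig. 3 described above can be sketched as follows. This is a minimal illustration and not the control logic of the invention itself: the `Cache` class, the `access` function and its two cacheability flags are hypothetical names, the caches are modeled as plain dictionaries with no replacement policy, and the network is not modeled at all.

```python
class Cache:
    """A trivially simple cache: a dictionary with no replacement policy."""

    def __init__(self):
        self.lines = {}


def access(address, cacheable_at_processor, cacheable_at_memory,
           processor_cache, memory_cache, memory_module):
    """Return (value, where_it_was_found) for a fetch request."""
    # Processor cache 303: consulted only for references that are cacheable there
    # (private or shared read-only data).
    if cacheable_at_processor and address in processor_cache.lines:
        return processor_cache.lines[address], "processor cache"

    # Otherwise the request crosses the network and is intercepted by memory cache 307.
    if cacheable_at_memory and address in memory_cache.lines:
        value, source = memory_cache.lines[address], "memory cache"
    else:
        # Miss or non-cacheable request: go on to memory module 308.
        value, source = memory_module[address], "memory module"
        if cacheable_at_memory:
            memory_cache.lines[address] = value

    # On the way back, the processor cache is updated if the reference is cacheable there.
    if cacheable_at_processor:
        processor_cache.lines[address] = value
    return value, source
```

For a shared read-write reference (cacheable only at the memory cache), a first access is served by the memory module and a repeated access by the memory cache; the processor cache is never filled, so no coherence action between processors is needed in this model.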
Other parallel processor organization examples supporting both processor and memory caches are shown in Figs. 4-8 as noted previously. The basic operating principle for the processor and memory caches in all these organizations is the same, as described above. The only difference in these organizations is the type of memory modules used and the location of these memory modules and the memory cache. In the discussion given below the differences in these organizations are highlighted.
In Fig. 4 the location of the memory cache 405 in an RP3 like parallel system organization is shown.
Fig. 2 shows a basic RP3 layout characterized by the shared main memory modules 205 being distributed across the whole system. As will be noted this same overall organization is shown in Fig. 4, it being noted that the cache blocks 203 are functionally equivalent to the processor caches (Cp) 403 of Fig. 4.
It will also be noted that this RP3-like system has one memory module 407 per processor and it is attached locally to the processor 401. This memory module 407 can be partitioned by the software to contain both private and shared information. Any processor 401 can access shared information in any other processor’s memory module, via the interconnection network 408. Therefore, if a memory cache 405 were to be incorporated in an RP3 type architecture, it would be placed between the memory module 407 and the connection 404 to the network 408 and the processor cache 403. The memory cache 405 is interfaced to the memory module 407 via a short bus 406. The rest of the system organization does not change.
The organization shown in Fig. 5 differs from the organization shown in Fig. 3, in that a separate memory module is used for the private memory 505 and the shared memory 511. The private memory module 505 is directly attached via line 504 to the processor cache 503, while the shared memory module 511 is directly attached via line 510 to the memory cache 509. For this organization, the memory request routing described above for Fig. 3 is modified as follows: The request is routed to the private memory 505 by the processor cache 503, only if the processor is requesting private information and the request is not cacheable or the information is not resident in the processor cache 503. The routing to the memory cache 509 is not modified nor is its operating criteria.
The organization shown in Fig. 6 differs from Fig. 5 in that the processor cache 603 is interfaced to the private memory 605 via the bus 604 used to interface to the network. In the Fig. 6 organization, the bus 604 will need to provide some module addressing capability, so that the processor cache 603 can uniquely select either the private memory 605 or the network 606.
The organization shown in Fig. 7 is similar to that shown in Fig. 3, except that a memory cache 707 is attached via line 708 to more than one memory module 709 to 710. In this case the memory cache 707 caches shared information resident in any of the memory modules 709 to 710. It should be noted here that multiple shared memory modules can also be interfaced as shown in Fig. 7, in the parallel system organization of Fig. 5 or 6.
The organization shown in Fig. 8 is similar to that shown in Fig. 7, except that multiple network ports 806 to 807 are interfaced to a single memory cache 808. The basic operation of the memory cache 808 does not change, although the memory cache 808 control logic will need some provision for selecting which of the network ports 806 to 807 to accept a request from. The actual method used to do this selection is not important for this invention. It should also be noted here that multiple network ports can also be interfaced as shown in Fig. 8, in the parallel systems organization of Fig. 5 or 6. This would not be possible in a Carrick-on-Shannon architecture such as shown in the Linn-Linn paper due to the nature of the serial bus.
It will also be noted that in both Figs. 7 & 8 each of the memory modules 709-710 and 810-811 are designated Mps indicating that they may contain both private and shared data. The memory modules 308 and 407 in Figs. 3 & 4 are similarly designated. This function would usually be done by simple address partitioning to mark off, for example, reserved private areas of storage. However, this function is well known in the art and does not directly relate to the present invention.
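As one possible illustration of the simple address partitioning mentioned above, the following sketch classifies a reference by the address range into which it maps. The boundary value and the function names are assumptions made for illustration only; the text does not specify any particular layout.

```python
# Hypothetical layout: the low part of each Mps module is reserved as private
# storage and the remainder holds shared data. The boundary is an assumed value.
PRIVATE_LIMIT = 0x4000


def is_private(offset_in_module):
    """True if the offset falls in the reserved private area of the module."""
    return 0 <= offset_in_module < PRIVATE_LIMIT


def is_shared(offset_in_module):
    """True if the offset falls in the shared area of the module."""
    return offset_in_module >= PRIVATE_LIMIT
```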
The broad organizational concepts of the present invention have been described with respect to a number of different system configurations all provided with a memory cache for each memory module or group of modules. In all cases the memory cache is functionally, and usually physically, located in close proximity to the module whose data is cached therein. The following description of Figs. 11 & 12 sets forth the broad functional sequence of operations which would be necessary to support such a memory architecture. It should be clearly understood that many variations in the details of the sequence could be implemented by those skilled in the art without departing from the spirit and scope of the invention.
An overview of the operation of the processor cache (Cp) control logic is shown in Fig. 11. It should be noted that only the control information relevant to the invention is shown in Fig. 11. The details of the cache organization and management policy e.g., replacement algorithms, store-through, store-in etc., are not important for the invention described here. This is because the invention does not impose any restrictions on these issues.
The processor cache receives a memory operation request from the processor subsystem in block 1101. The control proceeds to block 1102 where the request is checked to determine if it is intended for private memory. If not, line 1104 becomes active causing the system to proceed to block 1114 which causes the request to be sent to the shared memory module and/or the interconnection network depending upon the system configuration. If, on the other hand, the request was for private memory, then line 1103 would become active causing the control sequence to proceed to block 1105. A determination is made in this block as to whether or not the request is cacheable. If not, line 1107 becomes active and the control sequence proceeds to block 1113. This block causes the request to be routed to the private memory without being processed by the cache.
If it were determined that the request was cacheable, line 1106 would become active and the control sequence would proceed to block 1108. It should be noted that in systems like the RP3, the processor subsystem can decide if a request is cacheable or not and provide a control field in the memory request indicating this fact. However, if the system does not support this feature then the “cacheability” check made in block 1105 would be deleted. The control sequence would then proceed directly from output 1103 to block 1108 and block 1113 would similarly be deleted.
In block 1108, the cache directory is checked to determine if the requested information resides currently in the processor cache memory. If it does, line 1110 becomes active and the control sequence proceeds to block 1112. In block 1112 the information is fetched from the cache memory and the required cache management policy for accessing an item therein is executed. At this time any updating required in the cache or private memory is also performed. A suitable response is generated for transmitting to the originating processor. When these operations are completed, line 1115 becomes active and the control sequence proceeds to block 1118 which causes the generated response to be actually sent to the requesting processor.
If it had been determined in block 1108 that the requested information was not in the processor cache, line 1109 would become active and the system control sequence would proceed to block 1111. In this block a cache memory line is selected for storing the information requested and the required cache management policy is executed. The requested line is also fetched from private memory (e.g. if the memory operation requested was a fetch). When the requested line of information is received from the memory the control sequence proceeds to block 1117. In this block, the requested information (words) are selected from the line fetched from memory, a response to the originating processor is generated and the cache is updated as required by the resident cache management policy.
When these operations are completed the control sequence proceeds via line 1116 to block 1118 which causes the previously generated response to be transmitted back to the processor.
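The fetch sequence of Fig. 11 described above can be summarized in a few lines of code. The sketch below is a simplification under stated assumptions: the function name and its arguments are hypothetical, the cache and the private memory are modeled as dictionaries keyed by address, and line selection and replacement policy are omitted because the text leaves them open.

```python
def processor_cache_request(address, is_private, is_cacheable, cache, private_memory):
    """Return (route, value) for a fetch, following the block numbers of Fig. 11."""
    # Block 1102: is the request intended for private memory?
    if not is_private:
        # Block 1114: forward to shared memory and/or the interconnection network.
        return "forwarded to shared memory/network", None
    # Block 1105: is the request cacheable?
    if not is_cacheable:
        # Block 1113: route to private memory, bypassing the cache.
        return "private memory (uncached)", private_memory[address]
    # Block 1108: search the cache directory.
    if address in cache:
        # Block 1112: hit, read the information from the cache memory.
        return "cache hit", cache[address]
    # Blocks 1111 and 1117: miss, fetch from private memory and update the cache.
    value = private_memory[address]
    cache[address] = value
    # Block 1118: the response is sent back to the processor.
    return "cache miss", value
```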
The above description and the sequence of Fig. 11 is slanted primarily to a “fetch” request to the memory system from the processor. As will be readily appreciated, the memory operation could just as easily be a “store”. The operational sequence shown in Fig. 11 would be essentially the same for a store operation as will be readily appreciated by those skilled in the art and is not specifically included as it is considered to be obvious.
This completes the description of the sequence of operations which would be performed within the processor cache.
Proceeding now to Fig. 12, there is shown an overview of the sequence of operation of the memory cache control logic. It should be noted that only the control information relevant to the present invention is shown in this figure. The details of the cache organization and overall cache management policy are not specifically relevant to the invention described herein because the invention does not impose any restriction on these issues. It is also to be noted that the flow chart of this figure, as well as that of Fig. 11, is relatively functional and high level; however, any skilled system designer would have no difficulty in designing hardware logic to achieve these operations within such a cache memory hierarchy, whether in the processor cache or the memory cache.
Referring now to Fig. 12, block 1201, a memory operation request is received from the processor subsystem. This would be for example from blocks 1114 or 1113 of Fig. 11. The control sequence proceeds to block 1202 where a determination is made as to whether the request is for shared memory. If it is not, line 1204 becomes active and the control sequence proceeds to block 1207. Block 1207 causes the request to be sent to the memory module. This would be the case, for example, if the request were for private memory space. If the request had been for shared memory the control sequence would proceed to block 1216 via line 1203. In block 1216 a determination is made as to whether the request is cacheable. It is again noted that in multi-processor systems like the previously referenced RP3 the processor subsystem can decide/indicate if the request is or is not cacheable.
However if a particular system does not support such a feature then the check in block 1216 would be totally deleted and the control logic would proceed directly from block 1202 to block 1208. If the request is determined to be not cacheable, the control sequence would proceed via line 1206 again to block 1207 which was described previously. However, if the request is for shared memory and is cacheable the control sequence proceeds to block 1208 via line 1205. In this block, the cache directory is searched to determine if the requested information is currently resident in the memory cache. If it is determined that the information is present, the control sequence proceeds via line 1210 to block 1212. In this block the information is fetched from the cache memory and any required cache management policy is executed.
The cache and shared memory are also updated as required and finally a response to the processor is generated. Control sequence then proceeds via line 1213 to block 1218 which causes the previously generated response to be transmitted to the processor.
If the requested data were not resident in the cache as determined in block 1208, the control sequence would proceed to block 1211 via line 1209. In this block a line is chosen in the cache to store information. The required cache management procedures are executed and the requested line of information is fetched from memory (e.g. if the memory request was a fetch request). When the required line of data is received from memory, the control sequence proceeds to block 1214 wherein the required information, e.g., words, are selected from the line of data received from memory. The cache memory and controls are updated as required by the cache management policy and a response to the processor is generated.
The control sequence proceeds to block 1218 via line 1215 wherein the response is transmitted to the processor.
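The memory cache fetch sequence of Fig. 12 can be sketched in the same way, this time showing the line fetch and word selection steps. The sketch is illustrative only: the class name, the line size, the number of lines and the direct-mapped placement are all assumptions, since the text places no restriction on cache organization or management policy.

```python
LINE_WORDS = 4   # assumed line size in words
NUM_LINES = 8    # assumed number of cache lines (direct-mapped for simplicity)


class MemoryCache:
    def __init__(self, memory_module):
        self.memory = memory_module        # list of words, accessed only via this cache
        self.tags = [None] * NUM_LINES     # cache directory
        self.data = [None] * NUM_LINES     # cached lines

    def fetch(self, address, is_shared, is_cacheable):
        """Return (outcome, word) for a fetch, following the blocks of Fig. 12."""
        # Blocks 1202 and 1216: only shared, cacheable requests use the cache.
        if not (is_shared and is_cacheable):
            # Block 1207: send the request to the memory module.
            return "bypass", self.memory[address]
        line_number, offset = divmod(address, LINE_WORDS)
        index = line_number % NUM_LINES
        # Block 1208: search the cache directory.
        if self.tags[index] == line_number:
            # Block 1212: hit, read the word from the cache memory.
            return "hit", self.data[index][offset]
        # Block 1211: choose a cache line and fetch the requested line from memory.
        start = line_number * LINE_WORDS
        self.tags[index] = line_number
        self.data[index] = self.memory[start:start + LINE_WORDS]
        # Block 1214: select the requested word from the fetched line.
        return "miss", self.data[index][offset]
```

In this model a first shared, cacheable fetch to a line is a miss, later fetches to any word of that line are hits, and private or non-cacheable requests always pass through to the memory module.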
This completes the description of the operation of the memory cache control sequence. As stated previously, the high level functional flow chart of Fig. 12 is directed primarily to a fetch request from the processor requiring the data be accessed from memory, placed in the cache when necessary and subsequently transmitted to the processor. Slight modifications that would be necessary to serve a store operation would be obvious to those skilled in the art and accordingly such a separate flow chart is not shown nor deemed necessary.
It will further be noted that the processor cache control sequence of Fig. 11 and the memory cache control sequence of Fig. 12 would be suitable for use in any of the system architectures shown in Figs. 3-8. It is noted that any additional addressing or other control information that would be required for a memory operation request from a processor would be automatically extracted and placed in the request, but would have no bearing on the operation of the specific memory or processor cache control sequences.