Cache coherence multi step instruction

8/17/2023

I got into a debate with someone on Stack Overflow and I want to make sure I've got my facts straight. A document I was reading talks about maintaining cache coherency and mentions Intel as one of the manufacturers implementing it. However, it talks about on-chip cache coherency. I assume that on a multi-socket system the caches of separate CPUs on separate sockets are NOT kept coherent, hence the need for a memory barrier op-code. I think this is clear, but I'll clarify if there are questions.

I believe that the memory bandwidth required would be too great to implement cache coherency between physically separate CPUs. I guess there may be some systems that attempt to implement some kind of multi-CPU cache coherency, but I don't see that they could be common, and certainly not universal.

Anyway, thanks for your time and for clearing up any misconceptions I may have.

Back in the old days (60's-70's) I programmed on the DEC PDP8 series of computers. True, these were single-processor systems. That said, they had an interesting characteristic that would apply to cache coherency on modern systems. On these systems the ALU was placed between the memory subsystem and whatever used it. This allowed not only the instruction stream to use the ALU, but also the I/O bus devices. An example was a high-speed A/D performing an add to memory. There were other odd devices that would use AND, OR, and rotate. The DC02 Teletype device multiplexer could handle 128 ports with the virtual UART registers held in the RAM of the PDP8I.

Back to the future. Today's systems could incorporate some of this old technology to perform the most frequently used primitives. XADD, BTS, and BTC could be performed using a secondary ALU in the memory subsystem. This would handle MUTEX and atomic integer add. CAS could be implemented as well; this would require sending both the compare element and the set element. This isn't too unusual, as memory subsystems have used a Page and Cell scheme (IOW a packet is sent as opposed to a single strobe). This system would not necessarily be slower on a single-socket system, as the secondary ALU(s) could additionally be placed at appropriate cache levels. We have gobs of silicon space now, so why not use some of it wisely?

The IBM POWER7 processors include fixed-point ALU functionality in the memory controllers. The ALUs support ADD, AND, OR, XOR, Compare-and-Swap, and a few other operations on data sizes of 8, 16, 32, or 64 bits. On the Power 775 ("PERCS") system, these operations can be launched to any memory location in the system with a user-mode instruction. (There may be a few instructions to set up the 32 Byte command buffer, but a single instruction sends the command to any memory location in the system.) This feature is used in the RandomAccess benchmark of the HPC Challenge benchmark suite to deliver the highest performance on that benchmark: more than 4 times the performance of the full "K Computer", using 1/10th the cores and occupying about 1/40th the number of racks (~22 vs 864). However, like the K computer, the Power 775 system is not inexpensive.

An opportunity with more potential for volume is provided by the Hybrid Memory Cube. Because of the long-standing difficulty of implementing high-speed logic in semiconductor processes optimized for DRAM (and the converse difficulty of implementing DRAM in semiconductor processes optimized for high-speed logic), the Hybrid Memory Cube includes a die optimized for high-speed logic at the base of a stack of DRAM dies. (Intel's acquisition of Altera will make Intel one of the "Developer Members" of the Hybrid Memory Cube Consortium, should Intel decide to continue with this Altera project.) The through-silicon vias connecting these dies force the logic die to be about as large as the DRAM dies, and although part of the logic from the DRAMs has been moved to the logic die (implementing a basic DRAM controller with a simplified external interface), my guess is that there is a lot of silicon area on the logic die that is unused. (This is just a guess, I don't have any inside information on this.) I know from some of my recent work that ALUs are tiny: in a 45 nm process, a fully pipelined (1 GHz) 64-bit floating-point fused multiply-add unit takes only a bit over 0.04 mm^2.
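
To make the primitives named in the replies (XADD-style atomic integer add, Compare-and-Swap, and a mutex built from them) a bit more concrete, here is a minimal C++ sketch of how they look from software today. The helper names (atomic_add, compare_and_swap, SpinLock, run_counter_demo) are made up for illustration and do not come from the thread. The sketch says nothing about where the hardware performs the operation (core, cache, or memory controller); it only shows the interface such an ALU would have to serve.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Atomic integer add. Returns the value held before the add.
// (On x86-64, compilers commonly emit a LOCK XADD for this; that is an
// observation about typical code generation, not a guarantee.)
std::uint64_t atomic_add(std::atomic<std::uint64_t>& target, std::uint64_t value) {
    return target.fetch_add(value, std::memory_order_relaxed);
}

// Compare-and-swap. Note that BOTH the compare element (expected) and the
// set element (desired) must be supplied, as pointed out in the reply.
bool compare_and_swap(std::atomic<std::uint64_t>& target,
                      std::uint64_t expected, std::uint64_t desired) {
    return target.compare_exchange_strong(expected, desired,
                                          std::memory_order_acq_rel);
}

// Hypothetical minimal mutex built on CAS: 0 = unlocked, 1 = locked.
class SpinLock {
    std::atomic<std::uint64_t> state_{0};
public:
    void lock() {
        while (!compare_and_swap(state_, 0, 1)) {
            std::this_thread::yield();  // back off while another thread holds it
        }
    }
    void unlock() { state_.store(0, std::memory_order_release); }
};

// Increments one counter with atomic add and another under the spinlock.
// Returns the shared total if both counters agree, otherwise 0.
std::uint64_t run_counter_demo(int threads, int iters) {
    std::atomic<std::uint64_t> atomic_counter{0};
    std::uint64_t locked_counter = 0;  // plain integer, protected by the lock
    SpinLock lock;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                atomic_add(atomic_counter, 1);
                lock.lock();
                ++locked_counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return atomic_counter.load() == locked_counter ? locked_counter : 0;
}
```

With these definitions, run_counter_demo(4, 10000) should return 40000, since no increment is lost on either counter.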

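On the memory-barrier point raised in the question: as far as I understand it, mainstream multi-socket x86 systems do keep caches coherent across sockets in hardware, and barriers are about the order in which one core's writes become visible to others rather than about coherence itself. A minimal C++ sketch of that ordering concern, using a release store paired with an acquire load (pass_message is a hypothetical helper name, not something from the thread):

```cpp
#include <atomic>
#include <thread>

// Hands one integer from a producer thread to a consumer thread.
// The release/acquire pair on `ready` is what guarantees the consumer
// sees the payload written before the flag was set.
int pass_message(int value) {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};  // flag that publishes the payload
    int observed = -1;

    std::thread producer([&] {
        payload = value;                               // 1. write the data
        ready.store(true, std::memory_order_release);  // 2. publish it
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();                 // wait for the flag
        }
        observed = payload;                            // ordered after the flag read
    });

    producer.join();
    consumer.join();
    return observed;
}
```

Calling pass_message(42) should return 42. If both orderings were relaxed instead, the language would no longer guarantee that the consumer reads the updated payload, even on hardware whose caches are fully coherent.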