The Computex trade show has kicked off in Taiwan and AMD opened the show with a bang. Last week, we discussed rumors that AMD was preparing a Milan-X SKU for launch later this year. The Zen 3-based CPU would supposedly offer onboard HBM and a 3D-stacked architecture.
We don’t know if AMD will bring Milan-X to market in 2021, but the company has now shown off 3D die stacking in another way. During her Computex keynote, Lisa Su showed a 5900X with 64MB of SRAM integrated on top of the chiplet die. This is in addition to the L3 cache already integrated into the chiplet itself, granting a total of 96MB of L3 per chiplet, or 192MB per 5900X with two chiplets. The dies are connected with through-silicon vias (TSVs). AMD claims bandwidth of over 2TB/s. That’s higher than Zen 3’s L1 bandwidth, though access latencies are much higher. L3 latency is typically between 45 and 50 clock cycles, compared with a 4-cycle latency for L1.
The new “V-Cache” die isn’t exactly the same size as the chiplet below it, so there’s some additional silicon used to ensure there’s equal pressure across the compute die and cache die. The 64MB cache is said to be a bit less than half the size of a typical Zen 3 chiplet (80.7mm²).
This much L3 on a CPU is a little nutty. We can’t compare against desktop chips, because Intel and AMD have never shipped a CPU with this much cache dedicated to such a small number of cores. The closest analog on shipping CPUs would be something like IBM’s POWER9, which offers up to 120MB of L3 per chip, but again, not nearly this much per core. 192MB of L3 for just 12 cores is 16MB of L3 per core and 8MB per thread. There are also enough differences between POWER9 and Zen 3 that we can’t really look to the IBM CPU for much on how the additional cache would improve performance, though if you’re curious about the x86-versus-non-x86 question in general, Phoronix did a review with some benchmarks back in 2019.
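For readers who want to check the cache math, here is a minimal sketch. The per-chiplet figures come from AMD's demo as described above; the 12-core, 24-thread layout is the standard 5900X configuration, and the variable names are ours.

```python
# Cache figures from AMD's Computex demo of the modified 5900X.
BASE_L3_PER_CHIPLET_MB = 32   # L3 already built into each Zen 3 chiplet
VCACHE_PER_CHIPLET_MB = 64    # stacked SRAM die added on top of each chiplet
CHIPLETS = 2                  # a 5900X uses two compute chiplets
CORES = 12                    # standard 5900X core count
THREADS_PER_CORE = 2          # SMT enabled

l3_per_chiplet_mb = BASE_L3_PER_CHIPLET_MB + VCACHE_PER_CHIPLET_MB
total_l3_mb = l3_per_chiplet_mb * CHIPLETS
l3_per_core_mb = total_l3_mb / CORES
l3_per_thread_mb = l3_per_core_mb / THREADS_PER_CORE

print(f"L3 per chiplet: {l3_per_chiplet_mb}MB")
print(f"Total L3:       {total_l3_mb}MB")
print(f"L3 per core:    {l3_per_core_mb:.0f}MB")
print(f"L3 per thread:  {l3_per_thread_mb:.0f}MB")
```

Swapping in a different core count or stack size shows how quickly the per-core figure changes.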
Absent an applicable CPU to refer to, we’ll have to take AMD’s word on some of these numbers. The company compared a standard 5900X (32MB of L3 cache per chiplet, 64MB total) to a modified 5900X (96MB of L3 cache per chiplet, 192MB total) in Gears of War 5 (+12 percent, DX12), DOTA 2 (+18 percent, Vulkan), Monster Hunter World (+25 percent, DX11), League of Legends (+4 percent, DX11), and Fortnite (+17 percent, DX12). If we set LoL aside as an outlier, that’s an 18 percent average increase. If we include it, it’s a 15.2 percent average uplift. Both CPUs were locked at 4GHz for this comparison. The GPU was not disclosed.
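The two averages are simple arithmetic means of AMD's claimed per-game gains. A quick sketch, using only the percentages quoted above (the dictionary and helper names are ours):

```python
# AMD's claimed uplift per game, in percent, with both CPUs locked at 4GHz.
uplift_percent = {
    "Gears of War 5": 12,
    "DOTA 2": 18,
    "Monster Hunter World": 25,
    "League of Legends": 4,
    "Fortnite": 17,
}

def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)

# Average across all five titles.
avg_all = mean(uplift_percent.values())

# Average with League of Legends set aside as an outlier.
avg_without_lol = mean(
    gain for game, gain in uplift_percent.items() if game != "League of Legends"
)

print(f"All five games: {avg_all:.1f} percent")
print(f"Excluding LoL:  {avg_without_lol:.1f} percent")
```

A five-game sample chosen by the vendor is thin evidence either way, so treat both figures as AMD's best case rather than a typical result.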
That uplift is nearly as large as the median generational improvement AMD has been turning in over the past few years. The more interesting question, however, is what kind of impact this approach has on power consumption.
AMD Has Large Caches on the Brain
It’s obvious that AMD has been doing some work around the idea of slapping huge caches on chips. The large “Infinity Cache” on RDNA2 GPUs is a central component of the design. We’ve heard about a Milan-X that might theoretically deploy this kind of approach along with on-package HBM.
One way to look at news of a 15 percent performance improvement is that it would allow AMD to pull CPU clocks from a top clock of, say, 4.5GHz down to around 4GHz at equal performance. CPU power consumption increases more quickly than frequency does, especially as clocks approach 5GHz. Improvements that allow AMD (or Intel) to hit the same performance at a lower frequency could be useful for improving x86’s power consumption at a given performance level.
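To put rough numbers on that trade-off, here is a back-of-the-envelope sketch. It rests on two assumptions that are ours, not AMD's: that performance scales linearly with clock speed, and that dynamic power scales with roughly the cube of frequency (the textbook P ≈ C·V²·f relation, with voltage rising in step with frequency). Real chips will deviate from both, so read this as a toy model only.

```python
# Toy model only. Assumptions (ours, not AMD's):
#   1. Performance scales linearly with clock speed.
#   2. Dynamic power scales with frequency cubed, because P ~ C * V^2 * f
#      and voltage is assumed to rise roughly in step with frequency.

def equal_performance_clock(base_clock_ghz, uplift):
    """Clock at which the faster design matches the old design's performance."""
    return base_clock_ghz / (1.0 + uplift)

def relative_dynamic_power(new_clock_ghz, old_clock_ghz, exponent=3.0):
    """Dynamic power at the new clock as a fraction of power at the old clock."""
    return (new_clock_ghz / old_clock_ghz) ** exponent

top_clock_ghz = 4.5   # the illustrative top clock used in the text
uplift = 0.15         # the roughly 15 percent gain AMD claims

matched_clock_ghz = equal_performance_clock(top_clock_ghz, uplift)
power_ratio = relative_dynamic_power(matched_clock_ghz, top_clock_ghz)

print(f"Equal-performance clock: {matched_clock_ghz:.2f}GHz")
print(f"Relative dynamic power:  {power_ratio:.2f}x")
```

Under these assumptions the matched clock lands a little over 3.9GHz, in line with the "around 4GHz" figure above, and dynamic power falls by about a third. The model ignores leakage and the power drawn by the extra cache die itself, which is exactly the open question about this approach.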
About six weeks ago, we covered a leaked AMD roadmap. At the time, I speculated that AMD’s rumored Ryzen 7000 family might integrate an RDNA2 compute unit into each chiplet, and that this chiplet-level integration might be the reason why RDNA2 is listed in green for Raphael but orange for the hypothetical Phoenix.
What I’m about to say is speculation stacked on top of speculation and should be treated as such:
For years, we’ve waited and hoped that AMD would bring an HBM-equipped APU to desktop or mobile. So far, we’ve been disappointed. A chiplet with a 3D-mounted L3 stack tied to both the CPU and GPU could offer a nifty alternative to this concept. While we still don’t know how large the GPU core would be, boosting the performance of an integrated GPU with onboard cache is a tried-and-true way of doing things. It’s helped Intel improve performance on various SKUs since Haswell.
The bit above, as I said, is pure speculation, but AMD has now acknowledged working extensively with large L3 caches on both CPUs (via 3D stacking) and GPUs (via Infinity Cache). It’s not crazy to think the company’s future APUs will continue the trend in one form or another.