Like CPUs, modern GPUs have evolved to use complex, multi-level cache hierarchies. Integrated GPUs are no exception. In fact, they’re a special case because they share a memory bus with the CPU cores. The iGPU has to contend with the CPUs for limited memory bandwidth, making caching even more important than it is on dedicated GPUs.
At the same time, that integration opens up a lot of interesting cache design options. We’re going to take a look at the paths taken by AMD, Intel, and Apple.
Global Memory Latency
GPUs are given a lot of explicit parallelism, so memory latency isn’t as critical as it is for CPUs. Still, latency can play a role. GPUs often don’t run at full occupancy – that is, the amount of parallel work they’re tracking isn’t maximized. We have more on that in another article, so we’ll go right to the data.
Testing latency is also a good way of probing the cache setup. Doing so with bandwidth isn’t as straightforward because requests can be combined at various levels in the memory hierarchy, and defeating that to get clean breaks between cache levels can be surprisingly difficult.
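To illustrate the approach, here’s a minimal sketch of a pointer-chasing latency kernel in OpenCL C. This is a hypothetical kernel, not the exact one behind the results below: the host fills a buffer with a randomized index chain sized to the footprint being tested, and a single work-item walks it so that every load depends on the previous one.

```c
// Hypothetical pointer-chasing latency kernel (sketch). The host pre-fills
// `chain` with a randomized index chain covering the test footprint.
// Dependent loads can't be overlapped, so total time / iterations
// approximates load-to-use latency at that footprint.
__kernel void chase(__global const uint *chain, uint iterations, __global uint *result)
{
    uint idx = 0;
    for (uint i = 0; i < iterations; i++) {
        idx = chain[idx];        // each load depends on the previous one
    }
    result[0] = idx;             // store the result so the loop isn't optimized out
}
```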
The Ryzen 4800H’s cache hierarchy is exactly what you’d expect from AMD’s well-known GCN graphics architecture. Each of the 4800H’s seven GCN-based CUs has a fast 16 KB L1 cache. Then, a larger 1 MB L2 is shared by all of the CUs. AMD’s strategy for dealing with memory bus constraints appears quite simple: use a higher L2-capacity-to-compute ratio than that of discrete GPUs. A fully enabled Renoir iGPU has 8 CUs, giving 128 KB of L2 per CU. Contrast this with AMD’s Vega 64, where 4 MB of L2 gives it 64 KB per CU.

Apple’s cache setup is similar, with a fast L1 followed by a large 1 MB L2. Apple’s L1 is half of AMD’s size at 8 KB, but has similar latency. This low latency suggests it’s placed within the iGPU cores, though we don’t have a test to directly verify this. Compared to AMD, Apple’s L2 has slightly lower latency, which should help make up for the smaller L1. We also expect to see an 8 MB SLC, but it doesn’t clearly show up in the latency test. It could correspond to the region of somewhat lower latency extending out to 32 MB.
Then, we have Intel. Compared to AMD and Apple, Intel tends to use a less conventional cache setup. Right off the bat, we’re hitting a large cache shared by all of the GPU’s cores. It’s at least 1.5 MB in size, making it bigger than AMD and Apple’s GPU-level caches. In terms of latency, it sits somewhere between AMD and Apple’s L2 caches. That’s not particularly good for a first stop, and Intel should have smaller, presumably faster caches in front of this large shared iGPU-level cache, but we weren’t able to see them through testing. Still, its large size should help Intel keep more memory traffic within the iGPU block.
Like Apple, Intel has a large, shared chip-level cache that’s very hard to spot on a latency plot. This is strange – our latency test clearly shows the shared L3 on prior generations of Intel integrated graphics.
From this first glance at latency, we can already get a good idea of how each manufacturer approaches caching. Let’s move on to bandwidth now.
Global Memory Bandwidth
Bandwidth is more important to GPUs than to CPUs. Usually, CPUs only see high bandwidth usage in heavily vectorized workloads. For GPUs though, all workloads are vectorized by nature. And bandwidth limitations can show up even when cache hitrates are high.
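As a rough illustration of how cache bandwidth is typically measured (a sketch, not the exact kernel behind the numbers below): launch enough work-items to fill the GPU and have each one stream vectorized reads from a buffer sized to the cache level of interest, accumulating the results so the loads can’t be optimized away.

```c
// Hypothetical bandwidth test kernel (sketch). Many work-items perform
// strided, vectorized (float4) reads over a buffer sized to the cache level
// being tested; the accumulator keeps the loads from being optimized out.
__kernel void bw_read(__global const float4 *data, uint count, __global float *out)
{
    float4 acc = (float4)(0.0f);
    uint stride = get_global_size(0);
    for (uint i = get_global_id(0); i < count; i += stride)
        acc += data[i];                               // strided, vectorized reads
    out[get_global_id(0)] = acc.x + acc.y + acc.z + acc.w;
}
```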
AMD and Apple’s iGPU private caches have roughly comparable bandwidth. Intel’s is much lower. Part of that is because Alder Lake’s integrated graphics have somewhat different goals. Comparing the GPU configurations makes this quite obvious:
| | FP32 ALUs | Clock Speed | FP32 FMA Vector Throughput |
|---|---|---|---|
| AMD Ryzen 4800H, Vega 7 | 448 (out of 512 possible) | 1.6 GHz | 1433.6 GFLOPS |
| Intel Core i5-12600K, Xe GT1 | 256 | 1.45 GHz | 742.4 GFLOPS |
| Apple M1, 7 Core iGPU | 896 (out of 1024 possible) | 1.278 GHz? | 2290.2 GFLOPS |
AMD’s Renoir and Apple’s M1 are designed to provide low-end gaming capability in thin and light laptops, where a separate GPU can be hard to fit. But desktop Alder Lake definitely expects to be paired with a discrete GPU for gaming. Understandably, that means Intel’s iGPU is pretty far down the priority list when it comes to power and die area allocation. Smaller iGPUs will have less cache bandwidth, so let’s try to level out the comparison by using vector FP32 throughput to normalize for GPU size.
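The normalization itself is simple: divide the measured bandwidth at each cache level by the GPU’s peak FP32 throughput from the table above. A trivial sketch, where the bandwidth argument is a placeholder rather than a measured figure:

```c
// Bytes per FLOP = measured bandwidth (GB/s) / peak FP32 throughput (GFLOPS).
// The denominators come from the table above; the bandwidth argument is
// whatever the test measured at a given cache level.
double bytes_per_flop(double bandwidth_gbps, double peak_gflops)
{
    return bandwidth_gbps / peak_gflops;   // GB/s over GFLOP/s leaves bytes per FLOP
}
```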

Intel’s cache bandwidth now looks better, at least if we compare from L2 onward. Bytes per FLOP is roughly comparable to that of other iGPUs. Its shared chip-level L3 also looks excellent, mostly because its bandwidth is over-provisioned for such a small GPU.
As far as caches are concerned, AMD is the star of the show. Renoir’s Vega iGPU enjoys higher cache bandwidth to compute ratios than Intel or Apple. But its performance will likely be dependent on cache hitrate. L2 misses go directly to memory, because AMD doesn’t have another cache behind it. And Renoir has the weakest memory setup of all the iGPUs here. DDR4 may be flexible and economical, but it’s not winning any bandwidth contests. Apple and Intel both have a stronger memory setup, augmented by a big on-chip cache.
Local Memory Latency
GPU memory access is more complicated than on CPUs, where programs access a single pool of memory. On GPUs, there’s global memory that works like CPU memory. There’s constant memory, which is read only. And there’s local memory, which acts as a fast scratchpad shared by a small group of threads. Everyone has a different name for this scratchpad memory. Intel calls it SLM (Shared Local Memory), Nvidia calls it Shared Memory, and AMD calls it LDS (Local Data Share). Apple calls it Tile Memory. To keep things simple, we’re going to use OpenCL terminology, and just call it local memory.
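To make the concept concrete, here’s a minimal OpenCL C sketch (illustrative names and sizes, not tied to any particular GPU) of how a work-group stages data through local memory:

```c
// Hypothetical kernel showing local memory as a work-group scratchpad.
// Assumes the work-group size equals TILE so every slot gets filled.
#define TILE 256
__kernel void tile_example(__global const float *in, __global float *out)
{
    __local float tile[TILE];            // scratchpad shared by the work-group
    uint lid = get_local_id(0);
    uint gid = get_global_id(0);

    tile[lid] = in[gid];                 // stage a tile from global memory
    barrier(CLK_LOCAL_MEM_FENCE);        // wait until the whole tile is loaded

    // Work-items can now read each other's staged data at scratchpad speed.
    out[gid] = tile[lid] + tile[(lid + 1) % TILE];
}
```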
AMD and Apple take about as long to access local memory as they do to hit their first level caches. Of course, latency isn’t the whole story here. Each of AMD’s GCN CUs has 64 KB of LDS – four times the capacity of its L1D cache. Bandwidth from local memory is likely higher too, though we currently don’t have a test for that. Clinfo on M1 shows 32 KB of local memory, so M1 has at least that much available. That figure likely only indicates the maximum local memory allocation by a group of threads, so the hardware value could be higher.
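For reference, the number clinfo reports comes from a standard OpenCL device query. A minimal host-side sketch, assuming a valid cl_device_id is already in hand:

```c
// Query the reported local memory size. As noted above, this reflects the
// per-work-group allocation limit, not necessarily the physical capacity.
#include <stdio.h>
#include <CL/cl.h>   // on macOS the header is <OpenCL/opencl.h>

void print_local_mem_size(cl_device_id device)
{
    cl_ulong local_mem = 0;
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(local_mem), &local_mem, NULL);
    printf("Local memory size: %llu KB\n", (unsigned long long)(local_mem / 1024));
}
```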
Intel meanwhile enjoys very fast access to local memory, as does Nvidia, which is here for perspective. Intel’s story is an interesting one too. Prior to Gen11, Intel put their SLM alongside the iGPU’s L3, outside the subslices (Intel’s closest equivalent to GPU cores on Apple and CUs on AMD). For a long time, that meant Intel iGPUs had unimpressive local memory latency.

Starting with Gen11, Intel thankfully moved the SLM into the subslice, making the local memory configuration similar to AMD and Nvidia’s. Apple likely does the same (putting “tile memory” within iGPU cores) since local memory latency on Apple’s iGPU is also quite low.
CPU to GPU Copy Bandwidth
A shared, chip-level cache can bring other benefits. In theory, transfers between CPU and GPU memory spaces can go through the shared cache, basically providing a very high bandwidth link between the CPU and GPU. Due to time and resource constraints, slightly different devices are tested here. But Renoir and Cezanne should be similar, and Intel’s behavior is unlikely to regress from Skylake’s.
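A copy bandwidth test along these lines can be as simple as timing a blocking host-to-device write across a range of buffer sizes. A minimal sketch, assuming an existing OpenCL queue and device buffer (not the exact methodology used here):

```c
// Time a blocking host-to-device copy and convert to GB/s. Sweeping `size`
// shows whether small copies stay within a shared on-chip cache.
#include <time.h>
#include <CL/cl.h>

double copy_bandwidth_gbps(cl_command_queue queue, cl_mem dst,
                           const void *src, size_t size)
{
    struct timespec start, end;                 // POSIX timing
    clock_gettime(CLOCK_MONOTONIC, &start);
    clEnqueueWriteBuffer(queue, dst, CL_TRUE, 0, size, src, 0, NULL, NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);
    double seconds = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
    return (size / seconds) / 1e9;              // bytes per second -> GB/s
}
```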

Only Intel is able to take advantage of the shared cache to accelerate data movement across different blocks. As long as buffer sizes fit in L3, Skylake handles copies completely within the chip, with performance counters showing very little memory traffic. Larger copies are still limited by memory bandwidth.