Most programmers have an intimate understanding of CPUs and sequential programming because they grew up writing code for the CPU, but many are less familiar with the inner workings of GPUs and what makes them so special. Over the past decade, GPUs have become incredibly important because of their pervasive use in deep learning. Today, it is essential for every software engineer to possess a basic understanding of how they work. My goal with this article is to give you that background.
Much of this article is based on the book “Programming Massively Parallel Processors”, 4th edition by Hwu et al. As the book covers Nvidia GPUs, I will also be talking about Nvidia GPUs and using Nvidia-specific terminology. However, the fundamental concepts and approach to GPU programming apply to other vendors as well.
We will start by comparing CPUs and GPUs, which will give us a better vantage point from which to survey the GPU landscape. This is a big topic of its own, however, and we cannot possibly squeeze everything into one section, so we will stick to a few key points.
The major difference between CPUs and GPUs is in their design goals. CPUs were designed to execute sequential instructions. To improve their sequential execution performance, many features have been introduced in CPU design over the years. The emphasis has been on reducing instruction execution latency so that CPUs can execute a sequence of instructions as fast as possible. This includes features like instruction pipelining, out-of-order execution, speculative execution and multilevel caches (just to list a few).
GPUs on the other hand have been designed for massive levels of parallelism and high throughput, at the cost of medium to high instruction latency. This design direction has been influenced by their use in video games, graphics, numerical computing, and now deep learning. All of these applications need to perform a ton of linear algebra and numerical computations at a very fast rate, because of which a lot of attention has gone into improving the throughput of these devices.
Let’s consider a concrete example. A CPU can add two numbers much faster than a GPU because of its low instruction latency, and it can execute a short sequence of such computations faster than a GPU can. However, when it comes to doing millions or billions of such computations, a GPU will finish them much, much faster than a CPU because of its sheer massive parallelism.
If you like numbers, let’s talk about numbers. The performance of hardware for numerical computations is measured in terms of how many floating point operations it can perform per second (FLOPS). The Nvidia Ampere A100 offers a throughput of 19.5 TFLOPS for 32-bit precision. In comparison, the throughput of an Intel 24-core processor is 0.66 TFLOPS for 32-bit precision (these numbers are from 2021). And this gap in throughput performance between GPUs and CPUs has been growing wider with each passing year.
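If you are curious where a figure like 19.5 TFLOPS comes from, peak FP32 throughput is roughly the number of FP32 cores multiplied by the clock rate multiplied by two, since a fused multiply-add counts as two floating point operations. Plugging in the A100’s published specs (6,912 FP32 CUDA cores at a boost clock of about 1.41 GHz; these figures come from Nvidia’s datasheet, not from anything derived here):

```latex
\text{peak FP32} \approx 6912 \times 1.41 \times 10^{9}\,\text{Hz} \times 2 \approx 19.5\ \text{TFLOPS}
```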
The following figure compares the architectures of CPUs and GPUs.

As you can see, CPUs dedicate a significant amount of chip area to features that reduce instruction latency, such as large caches and complex control units, which leaves room for relatively few ALUs. In contrast, GPUs use a large number of ALUs to maximize their computation power and throughput, and spend only a very small amount of chip area on caches and control units, the things that reduce latency for CPUs.
You might wonder how GPUs tolerate high latencies and yet provide high performance. We can understand this with the help of Little’s law from queuing theory. It states that the average number of requests in the system (Qd, for queue depth) is equal to the average arrival rate of requests (throughput T) multiplied by the average amount of time to serve a request (latency L).
In the context of GPUs, this basically means that one can tolerate a given level of latency in the system and still achieve a target throughput by maintaining a deep enough queue of instructions which are either in execution or waiting. The large number of compute units in the GPU and efficient thread scheduling enable the GPU to maintain this queue over the kernel’s execution time and achieve a high throughput despite long instruction and memory latencies.
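Written as a formula, Little’s law is simply Qd = T × L. A quick back-of-the-envelope example with made-up numbers (they are illustrative, not the specs of any particular GPU): if a memory access takes about 400 cycles and we want to sustain one access per cycle, we need roughly 400 requests in flight at any moment:

```latex
Q_d = T \times L = 1\ \tfrac{\text{request}}{\text{cycle}} \times 400\ \text{cycles} = 400\ \text{requests in flight}
```

The GPU reaches such queue depths by keeping a very large number of threads resident and switching to another thread whenever one stalls on a long-latency operation.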
So we understand that GPUs favor high throughput, but what does the architecture that enables this look like? Let’s discuss that in this section.
A GPU consists of an array of streaming multiprocessors (SMs). Each of these SMs in turn consists of several streaming processors, also called cores. For instance, the Nvidia H100 GPU has 132 SMs with 64 cores per SM, totalling a whopping 8448 cores.
Each SM has a limited amount of on-chip memory, often referred to as shared memory or a scratchpad, which is shared among all the cores. Likewise, the control unit resources on the SM are shared by all the cores. Additionally, each SM is equipped with hardware-based thread schedulers for executing threads.
Apart from these, each SM also has several functional units and other specialized compute units, such as tensor cores or ray tracing units, to serve the specific compute demands of the workloads that the GPU caters to.
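If you want to see these figures for the GPU on your own machine, the CUDA runtime lets you query them. Here is a minimal sketch (error handling omitted for brevity) that prints the SM count and a few per-SM limits:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0

    printf("Device name:          %s\n", prop.name);
    printf("Number of SMs:        %d\n", prop.multiProcessorCount);
    printf("Shared memory per SM: %zu bytes\n", prop.sharedMemPerMultiprocessor);
    printf("Registers per SM:     %d\n", prop.regsPerMultiprocessor);
    printf("Global memory:        %zu bytes\n", prop.totalGlobalMem);
    return 0;
}
```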
Next, let’s look inside the GPU memory.
The GPU has several layers of different kinds of memory, each with its own specific use case. The following figure shows the memory hierarchy for one SM in the GPU.
Let’s break it down.
- Registers: We will start with the registers. Each SM in the GPU has a large number of registers. For instance, the Nvidia A100 and H100 models have 65,536 registers per SM. These registers are shared between the cores and are allocated to threads dynamically based on their requirements. During execution, the registers allocated to a thread are private to it, i.e., other threads cannot read or write those registers.
- Constant Caches: Next, we have constant caches on the chip. These are used to cache constant data used by the code executing on the SM. To utilize these caches, programmers have to explicitly declare objects as constants in the code so that the GPU can cache them and keep them in the constant cache.
- Shared Memory: Each SM also has a shared memory or scratchpad, which is a small amount of fast, low-latency, on-chip programmable SRAM. It is designed to be shared by a block of threads running on the SM. The idea behind shared memory is that if multiple threads need to work with the same piece of data, only one of them should load it from global memory, while the others share it. Careful use of shared memory can cut down redundant load operations from global memory and improve kernel execution performance. Shared memory is also used as a communication mechanism between threads executing within a block, in combination with block-level synchronization barriers (we will see a short kernel sketch that uses shared and constant memory right after this list).
- L1 Cache: Each SM also has an L1 cache which can cache frequently accessed data from the L2 cache.
- L2 Cache: There is an L2 cache which is shared by all SMs. It caches the frequently accessed data from the global memory to cut down the latency. Note that both L1 and L2 caches are transparent to the SM, i.e., the SM doesn't know it is getting data from L1 or L2. As far as the SM is concerned, it is getting data from the global memory. This is similar to how L1/L2/L3 caches work in CPUs.
- Global Memory: The GPU also has an off-chip global memory, which is a high-capacity and high-bandwidth DRAM. For instance, the Nvidia H100 has 80 GB of high bandwidth memory (HBM) with a bandwidth of 3000 GB/second. Due to being far away from the SMs, the latency of global memory is quite high. However, the several additional layers of on-chip memory and the high number of compute units help hide this latency (see the Little's law discussion in the CPU vs GPU section).
Now that we know about the key components of the GPU hardware, let's go one step deeper and understand how these components come into the picture when executing code.
To understand how the GPU executes a kernel, we first need to understand what a kernel is and what its configurations are. Let’s start there.
CUDA is the programming interface provided by Nvidia for writing programs for their GPUs. In CUDA, you express the computation you want to run on the GPU as a function that looks much like a C/C++ function; this function is called a kernel.
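Before we get into the details of kernel configuration, here is a minimal, illustrative sketch of a kernel that adds two vectors, together with the host code that allocates global memory, copies the data over, and launches the kernel. The names (`vec_add`, `da`, `db`, `dc`) are made up for this example:

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Kernel: a C++-like function that runs on the GPU.
// Each thread computes one element of the output.
__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    std::vector<float> ha(n, 1.0f), hb(n, 2.0f), hc(n);

    // Allocate global memory on the GPU and copy the inputs over.
    float *da, *db, *dc;
    cudaMalloc(&da, bytes);
    cudaMalloc(&db, bytes);
    cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb.data(), bytes, cudaMemcpyHostToDevice);

    // Launch configuration: enough blocks of 256 threads to cover n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vec_add<<<blocks, threads>>>(da, db, dc, n);

    cudaMemcpy(hc.data(), dc, bytes, cudaMemcpyDeviceToHost);
    printf("hc[0] = %f\n", hc[0]);  // expect 3.0

    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}
```

The triple angle bracket syntax `<<<blocks, threads>>>` is the launch configuration: it tells the GPU how many blocks of how many threads to run, which is exactly the kind of configuration we look at next.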