Matrix-vector multiplication is a foundational operation in linear algebra, where a matrix transforms an input vector into an output vector. This operation shows up across numerous fields, including computer science and deep learning. Optimizing matrix-vector multiplication, especially in the context of GPU programming with CUDA, can teach us a lot about how GPUs behave.
In this worklog, we will start by benchmarking cuBLAS's matrix-vector multiplication performance and then iteratively optimize our own CUDA implementation to see how close we can get to cuBLAS. The intention is not to replace it, but to learn from it. The NVIDIA GPU used for this worklog is a GTX 1050Ti (that's all I have got right now). By the end of this worklog, we will achieve what the following figure shows:
The full code is available on the GitHub repository: Optimizing SGEMV in CUDA
If you want to work with me on deep learning models, inference, training, software development, custom CUDA kernels, or something else, you can shoot me a direct message (DM) here: me on X (formerly Twitter)
Let’s start!
Some background first
From now on, we will refer to matrix-vector multiplication as SGEMV, which stands for Single-Precision General Matrix-Vector multiplication. The breakdown of this term is:
- S: Indicates single-precision (32-bit floating-point numbers).
- GE: Refers to a general matrix, meaning the matrix can have any shape or content (not restricted to special forms like symmetric or diagonal matrices).
- MV: Stands for matrix-vector multiplication, the core operation the function performs.
In essence, given a matrix $A$ of shape $M \times N$ and an input vector $x$ of shape $N \times 1$, SGEMV computes an output vector $y$ of shape $M \times 1$ given as:

$$y = \alpha A x + \beta y$$

Here the terms $\alpha$ and $\beta$ are scalar coefficients (floating-point numbers). In this worklog, we will assume the following for simplicity:

- The shape of the matrix $A$ will be $M \times N$
- The shape of the vector $x$ will be $N \times 1$ (so the output $y$ has shape $M \times 1$)
- The scalars $\alpha = 1$ and $\beta = 0$

And that leaves us with:

$$y = A x$$
The figure above and the pseudocode below show this computation. Note that each row of the matrix $A$ performs a dot product with the entire input vector $x$ to compute one element of the output vector $y$.
```
function sgemv_basic(A, x, y, M, N) {
    // Initialize output vector
    for (i = 0; i < M; i++) {
        y[i] = 0;
    }

    // Perform the computation
    for (i = 0; i < M; i++) {
        for (j = 0; j < N; j++) {
            y[i] += A[i][j] * x[j];
        }
    }
}
```
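For reference, here is a self-contained C-style version of the same routine (the function name and the tiny `main` driver are my own additions, not the worklog's code); it is handy as a CPU baseline to verify GPU results against:

```cpp
#include <cstdio>

// CPU reference: y = A * x, with A stored row-major as a flat array of size M*N
void sgemv_cpu(const float *A, const float *x, float *y, int M, int N) {
    for (int i = 0; i < M; i++) {
        float sum = 0.0f;
        for (int j = 0; j < N; j++) {
            sum += A[i * N + j] * x[j];
        }
        y[i] = sum;
    }
}

int main() {
    const int M = 4, N = 3;
    float A[] = {1, 2, 3,
                 4, 5, 6,
                 7, 8, 9,
                 1, 0, 1};
    float x[] = {1, 1, 1};
    float y[M];

    sgemv_cpu(A, x, y, M, N);
    for (int i = 0; i < M; i++) {
        printf("y[%d] = %f\n", i, y[i]);  // expected: 6, 15, 24, 2
    }
    return 0;
}
```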
Memory-bound tasks and FLOPS
This section is important for comparing the performance characteristics of operations like matrix-vector multiplication (SGEMV) and matrix-matrix multiplication (SGEMM). First, we need to define the terms FLOPs (notice the small 's') and FLOPS.
Basically:
- FLOPs stands for the total number of floating-point operations performed in a computation. The operation can be anything like addition, subtraction, multiplication, and so on.
- FLOPS measures the rate of floating-point operations that a system can perform in one second. If the system performs $F$ FLOPs in $t$ seconds then FLOPS is given by $F / t$. Also, 1 GFLOPS = $10^9$ FLOPS (see the small example after this list).
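As a small worked example (my own, with made-up numbers, not from the worklog's benchmarks): if a kernel performs `F` floating-point operations in `t` seconds, its throughput is `F / t` FLOPS. In code:

```cpp
#include <cstdio>

int main() {
    // Hypothetical numbers purely for illustration.
    double flops = 67.0e6;      // total floating-point operations performed
    double elapsed_s = 1.5e-3;  // measured execution time in seconds

    double flops_per_s = flops / elapsed_s;  // FLOPS
    double gflops = flops_per_s / 1e9;       // 1 GFLOPS = 10^9 FLOPS

    printf("Throughput: %.2f GFLOPS\n", gflops);
    return 0;
}
```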
Now, even though matrix-vector multiplication can be thought of as a special case of matrix-matrix multiplication, there are some differences when it comes to measuring performance of both the operations in CUDA.
SGEMV is a memory-bound operation whereas SGEMM is a compute-bound operation. Let's calculate the FLOPs required for both of these operations:
Matrix-vector (SGEMV)
- Multiplies a matrix $A$ (shape $M \times N$) with a vector $x$ (shape $N \times 1$), resulting in a vector $y$ (shape $M \times 1$).
- Memory accesses:
  - Reads $A$ ($M \times N$ elements)
  - Reads $x$ ($N$ elements)
  - Writes $y$ ($M$ elements)
- Computations:
  - Each row of $A$ is multiplied with $x$, resulting in $M$ dot products.
  - Each dot product consists of $N$ multiplications and $N - 1$ additions.
- FLOPs: $M(N + N - 1) \approx 2MN$
- 1 floating-point number is 4 bytes.
- Bytes transferred: $4(MN + N + M)$
Matrix-matrix (SGEMM)
- Multiplies two matrices $A$ (shape $M \times N$) and $B$ (shape $N \times K$), resulting in a matrix $C$ (shape $M \times K$).
- Memory accesses:
  - Reads $A$ ($M \times N$ elements)
  - Reads $B$ ($N \times K$ elements)
  - Writes $C$ ($M \times K$ elements)
- Computations:
  - Dot product of each row of $A$ with each column of $B$, resulting in $M \times K$ dot products.
  - Each dot product consists of $N$ multiplications and $N - 1$ additions.
- FLOPs: $MK(N + N - 1) \approx 2MNK$
- 1 floating-point number is 4 bytes.
- Bytes transferred: $4(MN + NK + MK)$
We divide the FLOPs by the total bytes transferred to calculate the computational intensity of the operation.
Considering $M$, $N$, and $K$ to all be equal to some large value $n$, for SGEMV we get:

$$\text{Intensity}_{\text{SGEMV}} = \frac{2MN}{4(MN + N + M)} = \frac{2n^2}{4(n^2 + 2n)} \approx 0.5 \ \text{FLOPs/byte}$$

and for SGEMM, we get:

$$\text{Intensity}_{\text{SGEMM}} = \frac{2MNK}{4(MN + NK + MK)} = \frac{2n^3}{12n^2} = \frac{n}{6} \ \text{FLOPs/byte}$$
As we can see, the computational intensity of SGEMV is very low compared to SGEMM i.e. more time is spent transferring data from (and to) the global memory of the GPU than doing the actual computation. Conversely, for SGEMM more time is spent doing the actual computation than transferring data from (and to) global memory.
Thus, SGEMV is a memory-bound operation. So, we need to make sure that the memory bandwidth our CUDA kernel achieves is as close as possible to the maximum memory bandwidth of the GPU (about 112 GB/s in our case).
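Since SGEMV moves roughly $4(MN + M + N)$ bytes (as computed above), the peak bandwidth also gives a quick lower bound on how fast any kernel can possibly run. A back-of-the-envelope sketch (my own, with illustrative sizes; the worklog's actual dimensions are set in its benchmark code):

```cpp
#include <cstdio>

int main() {
    // Illustrative sizes, not necessarily the worklog's actual M and N.
    double M = 4096, N = 4096;
    double peak_bw = 112.128e9;  // bytes/s, GTX 1050Ti peak bandwidth from the benchmark output

    double bytes = 4.0 * (M * N + M + N);        // A + x + y, 4 bytes per float
    double min_time_ms = bytes / peak_bw * 1e3;  // best case: kernel limited only by bandwidth

    printf("Lower bound on SGEMV execution time: %f ms\n", min_time_ms);
    return 0;
}
```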
Benchmark – cuBLAS implementation
Let's benchmark the SGEMV implementation that cuBLAS provides. To do this, we simply use the cublasSgemv function. Below is the corresponding code snippet that does this:
```cpp
cublasHandle_t handle;
cublasCreate(&handle);

float alpha = 1.0f, beta = 0.0f;
cublasSgemv(
    handle, CUBLAS_OP_T,
    N, M,
    &alpha, matd, N,
    vecd, 1,
    &beta, resd, 1
);
```
The on-device matrix is defined as `matd`, the input vector as `vecd`, and the resulting vector as `resd`. The matrix `matd` is stored in row-major layout in memory and we will use linear indices to access its elements. Since cuBLAS assumes column-major storage, we pass `CUBLAS_OP_T` and swap the dimensions to `N, M`: a row-major `M x N` matrix has exactly the same memory layout as the transpose of a column-major `N x M` matrix. When we run this we get:
```
>> GPU allocation time: 5.698560 ms
>> Host to device transfer time: 23.588863 ms
------- cuBLAS sgmev kernel ---------
>> Execution time: 1.532928 ms
>> Achieved (GFLOPS): 43.778225
>> Theoretical max (GFLOPS): 1911.040039
>> Maximum memory bandwidth: 112.127998 GB/s
>> Achieves 2.290806 % of peak GFLOPS
>> Achieves 78.114754 % of peak Memory Bandwidth
---------------------------
>> Device to host transfer time: 0.042816 ms
```
As expected, we see that the FLOPS achieved is about 43.8 GFLOPS i.e. much less than the theoretical maximum of roughly 1911 GFLOPS which the GPU can achieve. But cuBLAS achieves around 78% of the peak memory bandwidth, which is great for SGEMV since it is a memory-bound operation, as seen above.
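For context, kernel execution times like the ones above are commonly measured with CUDA events. A minimal sketch (this is not necessarily the exact timing code used for these benchmarks):

```cpp
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
// ... launch cublasSgemv (or any kernel) here ...
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```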
Let's first write a naive kernel in CUDA for SGEMV and iteratively improve it.
Kernel 1 – Naive SGEMV
Following the figure above, we can write a naive kernel for SGEMV. In this kernel, each thread in a thread block will compute one output element of the vector `resd`. The index of the current row (and of the corresponding output element) is computed as `row = blockDim.x * blockIdx.x + threadIdx.x`.
The corresponding code snippet for this kernel looks like:
```cpp
int row = blockDim.x * blockIdx.x + threadIdx.x;

if (row < M) {
    float sum = 0.0f;
    for (int col = 0; col < N; col++) {
        sum += matd[row * N + col] * vecd[col];
    }
    resd[row] = sum;
}
```
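For completeness, here is a sketch of how this snippet might sit inside a full kernel and how it could be launched; the kernel name, the 256-thread block size, and the launch code are my assumptions, not necessarily what the worklog uses:

```cpp
__global__ void sgemv_naive(const float *matd, const float *vecd, float *resd, int M, int N) {
    int row = blockDim.x * blockIdx.x + threadIdx.x;

    if (row < M) {
        float sum = 0.0f;
        for (int col = 0; col < N; col++) {
            sum += matd[row * N + col] * vecd[col];
        }
        resd[row] = sum;
    }
}

// Launch: one thread per output element, rounded up to whole blocks.
// int threads = 256;
// int blocks  = (M + threads - 1) / threads;
// sgemv_naive<<<blocks, threads>>>(matd, vecd, resd, M, N);
```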
Running this kernel results in:
```
>> GPU allocation time: 5.820416 ms
>> Host to device transfer time: 24.071072 ms
------- Naive sgmev kernel ---------
>> Execution time: 8.279040 ms
>> Achieved (GFLOPS): 8.105875
>> Theoretical max (GFLOPS): 1911.040039
>> Maximum memory bandwidth: 112.127998 GB/s
>> Achieves 0.424160 % of peak GFLOPS
>> Achieves 14.463547 % of peak Memory Bandwidth
---------------------------
>> Device to host transfer time: 0.048992 ms
```
The naive kernel achieves only about 8.1 GFLOPS, which is around 0.42% of the peak GFLOPS. Apart from that, it achieves only about 14.5% of the peak memory bandwidth. This result is kind of unimpressive.
But no worries, we can improve this!
Kernel 2 – Coalesced access with reductions
One way to improve the naive kernel is to ensure that the memory accesses to both the matrix `matd` and the vector `vecd` are coalesced. Let's understand what that means.
In CUDA, we have a grid of blocks where each block can have a configurable number of threads (up to 1024). The streaming multiprocessors (SMs) on the GPU process each block in 'groups' of 32 threads. Such a group of threads is called a warp. In essence, the SM can execute one instruction at a time for all the threads in a warp. Thus, each block consists of `blockDim.x / 32` warps.
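As a small illustration (my own, not from the worklog), here is a tiny kernel in which each thread works out which warp it belongs to and its position (lane) within that warp:

```cpp
#include <cstdio>

// A block of 256 threads, for example, contains 256 / 32 = 8 warps.
__global__ void warp_info() {
    int warp_id = threadIdx.x / 32;  // which warp within the block this thread belongs to
    int lane_id = threadIdx.x % 32;  // position of the thread within its warp (0-31)

    if (lane_id == 0) {
        printf("Block %d, warp %d starts at thread %d\n", blockIdx.x, warp_id, threadIdx.x);
    }
}

// Launch example:
// warp_info<<<1, 256>>>();
// cudaDeviceSynchronize();
```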
But what does this have to do with memory accesses? Well, let's see what the memory accesses for the matrix `matd` look like in the naive kernel.
Warp 1 (Threads 0-31):

Assuming that we are dealing with the first block where `blockIdx.x = 0`, and that all 32 threads in the first warp of this block execute one instruction in parallel, we have:

Thread 0 executing:

- The value of `row` = `blockDim.x * 0 + 0` = 0
- Thread 0 will enter the for loop now.
- We have `col = 0` for this thread at the start
- Element of `matd` accessed = `matd[0 * N + 0]` = `matd[0]`

Now, at the same time, we have:

Thread 1 executing:

- The value of `row` = `blockDim.x * 0 + 1` = 1
- Thread 1 will enter the for loop now.
- We have `col = 0` for this thread at the start
- Element of `matd` accessed = `matd[1 * N + 0]` = `matd[N]`

Similarly, we can see that for this particular warp the elements of the matrix `matd` accessed by neighboring threads are always separated by N elements (one full row apart). But, this is actually NOT an optimal way to access data residing in the global memory! We need to access data from the global memory in a coalesced manner.
Coalesced memory access occurs when the threads in a warp (32 threads) access consecutive memory addresses, allowing the 'memory controller' to combine these accesses into a single memory transaction. Global memory on GPUs has high latency and limited bandwidth compared to the speed of computation. Coalescing memory accesses minimizes the number of transactions required to fetch data, maximizing the effective bandwidth. Also, the hardware is designed to handle these coalesced accesses efficiently. When the accesses to the global memory are scattered or random, the 'memory controller' is forced to dispatch multiple memory transactions. This results in a slowdown compared to when the memory accesses are coalesced.
The matrix `matd` is stored in row-major format in global memory i.e. the elements of each row are next to each other (in consecutive memory addresses). The figure below shows the difference between coalesced and non-coalesced accesses in the matrix.
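To make the contrast concrete, here is a small host-side illustration (my own) that prints which flat indices of `matd` the first few threads of a warp read on their first loop iteration under each scheme:

```cpp
#include <cstdio>

int main() {
    int N = 4096;  // number of columns (illustrative value)

    printf("Naive kernel (thread t handles row t, starting at col 0):\n");
    for (int t = 0; t < 4; t++) {
        printf("  thread %d -> matd[%d]\n", t, t * N);  // addresses are N floats apart
    }

    printf("Coalesced kernel (block 0 handles row 0, thread t starts at col t):\n");
    for (int t = 0; t < 4; t++) {
        printf("  thread %d -> matd[%d]\n", t, t);      // consecutive addresses
    }
    return 0;
}
```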
In this kernel, each block will operate on one row of the matrix. Also, we will assume that each block contains only 32 threads i.e. 1 warp. Consecutive threads in the warp will load consecutive elements of the row of the matrix `matd` and of the vector `vecd`. We will also have a private variable (private to each thread in a warp) called `partial_sum`, which will hold the partial sum of the elements processed by that particular thread. The code snippet below shows this:
```cpp
int bid = blockIdx.x;
if (bid >= M) return;

int tid = threadIdx.x;

// each thread calculates its own partial output
float partial_sum = 0.f;
for (int col = tid; col < N; col += blockDim.x) {
    partial_sum += matd[bid * N + col] * vecd[col];
}
```
For example, if `bid = 0` and `blockDim.x = 32`, then a thread with index `tid` in this block will process the elements `tid`, `tid + blockDim.x`, `tid + 2 * blockDim.x`, and so on. Thread 0 processes elements 0, 32, 64, etc., thread 1 processes elements 1, 33, 65, etc., and the same goes for the remaining threads i.e. in every iteration consecutive threads load consecutive elements, which results in coalesced global memory access.
But, there's a problem! Each thread will have its own `partial_sum` variable. To get the final dot product, we need to sum the partial sums of all the threads present in the warp. Note that the final value of the dot product of two vectors is a single floating-point number.
This is where reductions can help us. We can essentially 'communicate' values between the threads present in a block/warp using shared memory or warp shuffle intrinsics, because every thread in a block/warp can access the block's shared memory. Have a look at my CUDA softmax worklog, which dives deeper into reductions. The figure below can help you understand reductions better.
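As a preview of what such a reduction can look like, here is a minimal warp-level sum using shuffle intrinsics, continuing from the snippet above and assuming the block is exactly one warp (32 threads) as in this kernel; the worklog's actual reduction code may differ:

```cpp
// Each step, every thread adds the partial_sum held by the thread `offset` lanes
// above it. After the loop, lane 0 of the warp holds the sum of all 32 partial sums.
for (int offset = 16; offset > 0; offset /= 2) {
    partial_sum += __shfl_down_sync(0xffffffff, partial_sum, offset);
}

if (tid == 0) {
    resd[bid] = partial_sum;  // one output element per row/block
}
```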