Introduction
A common misconception is that the orders-of-magnitude speed advantage of BLAS implementations of matrix multiplication over naive implementations can only be achieved with very complex code. This tutorial shows that, using Intel intrinsics (FMA3 and AVX2), BLAS-level speed in dense matrix multiplication can be achieved in only 100 lines of C.
The code can be found here.
It is based on the paper by Smith et al. [1]: "Anatomy of High-Performance Many-Threaded Matrix Multiplication".
Hardware
The CPU used in this tutorial provides 128 GFLOPS peak performance (2 physical cores x 2.00 GHz x 32 FLOps/cycle):
- 2 physical cores, 4 virtual (2.00 GHz)
- Haswell microarchitecture (32 FLOps/cycle: 8-float AVX2 vectors x 2 FMA3 units x 2 FLOps per FMA)
Here FLOps = floating-point operations and FLOPS = floating-point operations per second.
Optimizations
This tutorial implements the GEMM procedure specified in [1], measuring throughput at various levels of optimization. Each level corresponds to a function in compare_blas.cpp:
- Naive implementation
The naive implementation uses three nested for loops, as in the sketch below.
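A minimal sketch of such a kernel (the function name, signature, and row-major layout are illustrative assumptions, not necessarily those used in compare_blas.cpp):

```c
/* Naive GEMM sketch: C += A * B, with A m-by-k, B k-by-n, C m-by-n,
 * all stored row-major. Illustrative only. */
void matmul_naive(int m, int n, int k,
                  const float *A, const float *B, float *C)
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            for (int p = 0; p < k; p++)
                C[i * n + j] += A[i * k + p] * B[p * n + j];
}
```

With this (i, j, p) loop order the B accesses stride across rows by n elements, so the naive version is inefficient in both compute and memory access.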
+ Vectorization to take advantage of SIMD and FMA hardware
Three nested loops are still used, but the innermost loop is vectorized with AVX2 and FMA3 intrinsics, processing 8 floats per instruction.
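A hedged sketch of this idea (again with an illustrative name and row-major layout, and assuming n is divisible by 8):

```c
#include <immintrin.h>

/* Vectorized GEMM sketch: C += A * B using AVX2/FMA3.
 * The (i, p, j) loop order keeps the B and C accesses contiguous,
 * so each inner iteration can load, fuse-multiply-add, and store
 * 8 floats at once. Assumes n % 8 == 0; illustrative only. */
void matmul_avx2(int m, int n, int k,
                 const float *A, const float *B, float *C)
{
    for (int i = 0; i < m; i++) {
        for (int p = 0; p < k; p++) {
            /* Broadcast the scalar A[i][p] to all 8 vector lanes. */
            __m256 a = _mm256_set1_ps(A[i * k + p]);
            for (int j = 0; j < n; j += 8) {
                __m256 b = _mm256_loadu_ps(&B[p * n + j]);
                __m256 c = _mm256_loadu_ps(&C[i * n + j]);
                /* One FMA3 instruction: c = a * b + c
                 * (16 FLOps: 8 multiplies + 8 adds). */
                c = _mm256_fmadd_ps(a, b, c);
                _mm256_storeu_ps(&C[i * n + j], c);
            }
        }
    }
}
```

Such a kernel must be built with AVX2 and FMA enabled, e.g. gcc -O2 -mavx2 -mfma.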