Introduction
A common misconception is that the orders-of-magnitude speed advantage of BLAS implementations of matrix multiplication over naive implementations can only be achieved with very complex code. This tutorial shows that, using Intel intrinsics (FMA3 and AVX2), BLAS-level speed in dense matrix multiplication can be achieved in only 100 lines of C.
The code can be found here.
It is based on the paper by Smith et al. [1]: "Anatomy of High-Performance Many-Threaded Matrix Multiplication".
Hardware
The CPU used in this tutorial provides 128 GFLOPS peak performance (2 physical cores x 2.00 GHz x 32 FLOps/cycle):
- 2 physical cores, 4 virtual (2.00 GHz)
- Haswell microarchitecture (32 FLOps/cycle: 8-float AVX2 vectors x 2 FMA3 units x 2 FLOps per FMA)
Here FLOps = floating-point operations and FLOPS = floating-point operations per second.
Optimizations
This tutorial implements the GEMM procedure specified in [1], measuring throughput at various levels of optimization. Each level corresponds to a function in compare_blas.cpp:
- Naive implementation
The naive implementation uses three nested for loops, as in the sketch below.
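A minimal sketch of such a kernel (the function name, signature, and row-major layout are illustrative assumptions, not necessarily those used in compare_blas.cpp):

```c
/* Naive GEMM sketch: C += A * B, with A m-by-k, B k-by-n, C m-by-n,
 * all stored row-major. Illustrative only. */
void matmul_naive(int m, int n, int k,
                  const float *A, const float *B, float *C)
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            for (int p = 0; p < k; p++)
                C[i * n + j] += A[i * k + p] * B[p * n + j];
}
```

With this (i, j, p) loop order the B accesses stride across rows by n elements, so the naive version is inefficient in both compute and memory access.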
+ Vectorization to take advantage of SIMD and FMA hardware
Three nested loops are still used, but the innermost loop is vectorized with AVX2 and FMA3 intrinsics, processing 8 floats per instruction.
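A hedged sketch of this idea (again with an illustrative name and row-major layout, and assuming n is divisible by 8):

```c
#include <immintrin.h>

/* Vectorized GEMM sketch: C += A * B using AVX2/FMA3.
 * The (i, p, j) loop order keeps the B and C accesses contiguous,
 * so each inner iteration can load, fuse-multiply-add, and store
 * 8 floats at once. Assumes n % 8 == 0; illustrative only. */
void matmul_avx2(int m, int n, int k,
                 const float *A, const float *B, float *C)
{
    for (int i = 0; i < m; i++) {
        for (int p = 0; p < k; p++) {
            /* Broadcast the scalar A[i][p] to all 8 vector lanes. */
            __m256 a = _mm256_set1_ps(A[i * k + p]);
            for (int j = 0; j < n; j += 8) {
                __m256 b = _mm256_loadu_ps(&B[p * n + j]);
                __m256 c = _mm256_loadu_ps(&C[i * n + j]);
                /* One FMA3 instruction: c = a * b + c
                 * (16 FLOps: 8 multiplies + 8 adds). */
                c = _mm256_fmadd_ps(a, b, c);
                _mm256_storeu_ps(&C[i * n + j], c);
            }
        }
    }
}
```

Such a kernel must be built with AVX2 and FMA enabled, e.g. gcc -O2 -mavx2 -mfma.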