A collection of high-performance parallel linear algebra kernels for shared-memory multiprocessors. These kernels are implemented in C++ and rely on OpenMP for parallelization.
Kernels
SpMSpV
Description
Sparse matrix-sparse vector (SpMSpV) multiplication of the form y = Ax is a widely used computational kernel with many applications in machine learning and graph analytics. A sparse input matrix A is multiplied by a sparse input vector x to produce a sparse output vector y.
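For reference, the operation itself can be sketched as follows. This is a minimal serial illustration of y = Ax with A stored in compressed sparse column (CSC) format and x given as (index, value) pairs; the type and function names here are illustrative and do not mirror the repository's data structures.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Minimal CSC representation: column j's nonzeros occupy
// [col_ptr[j], col_ptr[j+1]) in row_idx / val.
struct CscMatrix {
    int64_t nrows = 0, ncols = 0;
    std::vector<int64_t> col_ptr;   // size ncols + 1
    std::vector<int64_t> row_idx;   // size nnz
    std::vector<double>  val;       // size nnz
};

// A sparse vector stored as (index, value) pairs.
using SparseVec = std::vector<std::pair<int64_t, double>>;

// Serial reference SpMSpV: y = A * x.
// Only the columns of A selected by the nonzeros of x are touched,
// and contributions to the same output row are merged in a hash map.
SparseVec spmspv_serial(const CscMatrix& A, const SparseVec& x) {
    std::unordered_map<int64_t, double> acc;
    for (const auto& [j, xj] : x) {
        for (int64_t k = A.col_ptr[j]; k < A.col_ptr[j + 1]; ++k) {
            acc[A.row_idx[k]] += A.val[k] * xj;
        }
    }
    return SparseVec(acc.begin(), acc.end());
}
```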
src/spmspv/ implements the SpMSpV-bucket algorithm recently proposed by Azad and Buluc [1]. Unlike most prior approaches, this algorithm is work-efficient: the total work performed across all processors remains within a constant factor of the work of the fastest known serial algorithm.
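The core idea of the bucketing approach can be sketched with the simplified two-phase code below, which reuses the CscMatrix and SparseVec types from the previous sketch. This is only an illustration of the scatter/gather structure under the assumption of one hash-based accumulator per bucket; it is not the repository's implementation, and the full algorithm in [1] uses more refined data structures.

```cpp
#include <omp.h>
#include <unordered_map>
#include <vector>

// Simplified two-phase bucket SpMSpV (see [1] for the full algorithm):
//   1. Scatter: each thread expands its share of x's nonzeros and appends
//      each contribution (row, value) to the bucket owning that row range,
//      using a private sub-list per (thread, bucket) to avoid contention.
//   2. Gather: each bucket is merged independently with a local accumulator.
SparseVec spmspv_bucket(const CscMatrix& A, const SparseVec& x, int nbuckets) {
    const int nthreads = omp_get_max_threads();
    const int64_t rows_per_bucket = (A.nrows + nbuckets - 1) / nbuckets;

    // sublists[t][b]: contributions produced by thread t for bucket b.
    std::vector<std::vector<SparseVec>> sublists(
        nthreads, std::vector<SparseVec>(nbuckets));

    // Phase 1: scatter the columns of A selected by x into row-range buckets.
    #pragma omp parallel
    {
        const int t = omp_get_thread_num();
        #pragma omp for schedule(dynamic, 64)
        for (size_t i = 0; i < x.size(); ++i) {
            const auto [j, xj] = x[i];
            for (int64_t k = A.col_ptr[j]; k < A.col_ptr[j + 1]; ++k) {
                const int64_t r = A.row_idx[k];
                sublists[t][r / rows_per_bucket].emplace_back(r, A.val[k] * xj);
            }
        }
    }

    // Phase 2: merge each bucket in parallel; buckets are independent.
    std::vector<SparseVec> partial(nbuckets);
    #pragma omp parallel for schedule(dynamic)
    for (int b = 0; b < nbuckets; ++b) {
        std::unordered_map<int64_t, double> acc;
        for (int t = 0; t < nthreads; ++t)
            for (const auto& [r, v] : sublists[t][b]) acc[r] += v;
        partial[b].assign(acc.begin(), acc.end());
    }

    // Concatenate per-bucket results into the sparse output vector y.
    SparseVec y;
    for (const auto& p : partial) y.insert(y.end(), p.begin(), p.end());
    return y;
}
```

Because each thread writes only into its own sub-lists during the scatter phase and each bucket is merged by a single thread during the gather phase, no locks or atomics are needed.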
Experimental Evaluation
To assess the performance and strong scaling of our SpMSpV-bucket implementation, we measure its runtime on several representative matrices from the University of Florida Sparse Matrix Collection [2]. In each experiment, we multiply the input matrix A by a randomly generated sparse vector x, varying both the sparsity of x and the number of threads.
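As an illustration of this setup, a random sparse input vector with a prescribed number of nonzeros can be generated along the following lines. The function name and sampling scheme are placeholders chosen for clarity, not the benchmark driver used in this repository.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

using SparseVec = std::vector<std::pair<int64_t, double>>;

// Draw `nnz` distinct indices from [0, n) and pair each with a random value.
// Illustrative sketch of the benchmark setup only.
SparseVec random_sparse_vector(int64_t n, int64_t nnz, std::mt19937_64& rng) {
    std::vector<int64_t> idx(n);
    std::iota(idx.begin(), idx.end(), 0);
    std::shuffle(idx.begin(), idx.end(), rng);   // pick nnz distinct positions

    std::uniform_real_distribution<double> dist(0.0, 1.0);
    SparseVec x;
    x.reserve(nnz);
    for (int64_t i = 0; i < nnz; ++i) x.emplace_back(idx[i], dist(rng));
    std::sort(x.begin(), x.end());               // keep indices ordered
    return x;
}
```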
Results on ljournal-2008
ljournal-2008 is the adjacency matrix of the LiveJournal social network, a graph with approximately 5.36M vertices and 79M edges. We multiply this matrix by sparse vectors with 25K, 250K, and 2.5M nonzero values and report the runtime (averaged over 1000 iterations) as a function of thread count.
Results on hugetrace-00020
hugetrace-00020 is a high-diameter graph derived from frames of a two-dimensional adaptive numerical simulation; it contains approximately 16M vertices and 24M edges. We follow the same methodology as above and report the average SpMSpV runtime as a function of thread count for several input-vector sparsity levels.
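The per-configuration measurement described above can be sketched with the harness below, which times the kernel over many iterations and reports the mean. It builds on the spmspv_bucket and SparseVec sketches earlier in this document; it is an illustrative harness under those assumptions, not the repository's benchmark driver.

```cpp
#include <omp.h>
#include <cstddef>
#include <cstdio>

// Time `iters` SpMSpV calls for one (matrix, input sparsity, thread count)
// configuration and print the mean runtime.
void benchmark(const CscMatrix& A, const SparseVec& x,
               int nthreads, int nbuckets, int iters = 1000) {
    omp_set_num_threads(nthreads);
    double total = 0.0;
    std::size_t checksum = 0;
    for (int it = 0; it < iters; ++it) {
        const double t0 = omp_get_wtime();
        SparseVec y = spmspv_bucket(A, x, nbuckets);
        total += omp_get_wtime() - t0;
        checksum += y.size();   // keep the result live across iterations
    }
    std::printf("threads=%d nnz(x)=%zu mean=%.6f s (check=%zu)\n",
                nthreads, x.size(), total / iters, checksum);
}
```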
Evaluation platform
Our measurements were conducted on a dual-socket 2.9 GHz Intel Cascade Lake Xeon server with 24 cores per socket and 35.75 MB of L3 cache. Peak memory bandwidth, as reported by Intel's Memory Latency Checker (mlc), is 80 GB/s per node.