WaterLily.jl is a simple and fast fluid simulator written in pure Julia. It solves the unsteady incompressible 2D or 3D Navier-Stokes equations on a Cartesian grid.
The pressure Poisson equation is solved with a geometric multigrid method.
Solid boundaries are modelled using the Boundary Data Immersion Method.
With v1.0, the solver has been ported from serial CPU execution to a backend-agnostic execution, including multi-threaded CPU and GPUs from different vendors (NVIDIA and AMD), thanks to KernelAbstractions.jl (KA).
We have also recently published an extended abstract preprint with benchmarking details regarding this port (see arXiv).
In this post, we review our approach to porting the code, together with the challenges we have faced.
Introducing KernelAbstractions.jl
The main ingredient of this port is the @kernel macro from KA.
This macro takes a function definition and converts it into a kernel specialised for a given backend. KA can work with CUDA, ROCm, oneAPI, and Metal backends.
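In its simplest form, a KA kernel looks like this (a minimal sketch for illustration only, not WaterLily code): the kernel body is written once, and instantiating it with a backend and a workgroup size produces a function that runs on that backend.
using KernelAbstractions: @kernel, @index, get_backend

@kernel function _scale!(a, b)
    i = @index(Global, Linear)  # global linear index of this work item
    a[i] = 2 * b[i]
end

a = zeros(100); b = rand(100)
_scale!(get_backend(a), 64)(a, b, ndrange=length(a))  # CPU() here; GPU arrays would launch on the GPU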
As an example, consider the divergence operator for a 2D vector field $\vec{u}=(u, v)$.
In the finite-volume method (FVM) and using a Cartesian (uniform) grid with unit cells, this is defined as
\[\begin{align}
\sigma=\unicode{x2230}(\nabla\cdot\vec{u})\,\mathrm{d}V = \unicode{x222F}\vec{u}\cdot\hat{n}\,\mathrm{d}S \rightarrow \sigma_{i,j} = (u_{i+1,j} - u_{i,j}) + (v_{i,j+1} - v_{i,j}),
\end{align}\]
where $i$ and $j$ are the indices of the discretised staggered grid.
In WaterLily, we define loops based on CartesianIndex such that I=(i,j,...), which allows us to write an n-dimensional solver in a very straightforward way.
With this, to compute the divergence of a 2D vector field we can use
δ(d,::CartesianIndex{D}) where {D} = CartesianIndex(ntuple(j -> j==d ? 1 : 0, D)) # unit Cartesian step in direction d
@inline ∂(a,I::CartesianIndex{D},u::AbstractArray{T,n}) where {D,T,n} = u[I+δ(a,I),a] - u[I,a] # forward difference of the a-th component of u along direction a
inside(a) = CartesianIndices(ntuple(i-> 2:size(a)[i]-1, ndims(a))) # Cartesian indices excluding the ghost cells
N = (10, 10) # domain size
σ = zeros(N) # scalar field
u = rand(N..., length(N)) # 2D vector field with ghost cells
for d ∈ 1:ndims(σ), I ∈ inside(σ)
    σ[I] += ∂(d, I, u)
end
where a loop over each dimension d and each Cartesian index I is used. The function δ provides a Cartesian step in direction d: for a 2D index I, δ(1, I) returns CartesianIndex(1, 0), and for a 3D index, δ(2, I) returns CartesianIndex(0, 1, 0). This is used in the derivative function ∂ to implement the divergence equation described above. inside(σ) provides the CartesianIndices of σ excluding the ghost elements, i.e. a range of Cartesian indices starting at (2, 2) and ending at (9, 9) when size(σ) == (10, 10).
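A quick check of these helper functions (illustrative; the expected results are shown in the comments):
δ(1, CartesianIndex(1, 1))     # CartesianIndex(1, 0)
δ(2, CartesianIndex(1, 1, 1))  # CartesianIndex(0, 1, 0)
inside(zeros(10, 10))          # CartesianIndices((2:9, 2:9))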
Note that the divergence operation is independent for each I, which is great because it means we can parallelize it! This is where KA comes into play.
To generate the divergence operator using KA, we need to write the divergence kernel, which is just the divergence operator written in KA style. We define the divergence kernel _divergence! and a wrapper divergence! as follows
using KernelAbstractions: get_backend, @index, @kernel
@kernel function _divergence!(σ, u, @Const(I0))
    I = @index(Global, Cartesian) # global Cartesian index of this work item
    I += I0                       # offset to skip the ghost cells
    σ_sum = zero(eltype(σ))
    for d ∈ 1:ndims(σ)
        σ_sum += ∂(d, I, u)
    end
    σ[I] = σ_sum
end
function divergence!(σ, u)
    R = inside(σ)
    _divergence!(get_backend(σ), 64)(σ, u, R[1]-oneunit(R[1]), ndrange=size(R))
end
Note that in the _divergence! kernel we operate again using Cartesian indices, obtained by calling @index(Global, Cartesian) from the KA @index macro. The range of Cartesian indices is given by the ndrange argument in the wrapper function, where we pass the size of the inside(σ) index range, and the backend is inferred with the get_backend method. Also note that we pass an additional (constant) argument I0, which provides the offset to apply to the indices given by @index, since these naturally start at (1,1,...). Using a workgroup size of 64 (the size of the group of threads acting in parallel, see terminology), KA will parallelize over I by multi-threading on the CPU or on the GPU.
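For instance, the wrapper above already runs multi-threaded on the CPU when σ and u are plain Arrays (a sketch; start Julia with several threads, e.g. julia -t auto, to benefit from the CPU parallelism):
N = (10, 10)
σ = zeros(N)              # plain Array, so get_backend(σ) == CPU()
u = rand(N..., length(N))
divergence!(σ, u)         # multi-threaded CPU execution through KA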
To switch backends, we just need to change the array type of σ and u from Array (CPU backend) to CuArray (NVIDIA GPU backend) or ROCArray (AMD GPU backend), and KA will specialise the kernel for the desired backend:
using CUDA: CuArray
N = (10, 10)
σ = zeros(N) |> CuArray
u = rand(N..., length(N)) |> CuArray
divergence!(σ, u)
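As a quick sanity check (a sketch using the arrays defined above), the GPU result can be compared against the serial loop; copying the device array back with Array also synchronizes before the comparison:
σ_ref = zeros(N); u_ref = Array(u)        # bring the data back to the CPU
for d ∈ 1:ndims(σ_ref), I ∈ inside(σ_ref)
    σ_ref[I] += ∂(d, I, u_ref)            # serial reference divergence
end
@assert Array(σ) ≈ σ_ref                  # the KA (GPU) result matches the serial one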
Automatic loop and kernel generation
As a stencil-based CFD solver, WaterLily heavily uses for loops to iterate over the n-dimensional arrays. To automate the generation of such loops, the macro @loop is defined
macro loop(args...)
    ex,_,itr = args
    op,I,R = itr.args
    @assert op ∈ (:(∈),:(in))
    return quote
        for $I ∈ $R
            $ex
        end
    end |> esc
end
This macro takes an expression of the form @loop <expr> over I ∈ R, where R is a CartesianIndices range, and produces the loop for I ∈ R.
For example, the serial divergence operator could now be simply defined using
for d ∈ 1:ndims(σ)
    @loop σ[I] += ∂(d, I, u) over I ∈ inside(σ)
end
which generates the I ∈ inside(σ) loop automatically.
Even though this might seem like a small improvement (if any), the nice thing about writing loops this way is that the computationally-demanding part of the code can be abstracted out of the main workflow. For example, it is easy to add performance macros such as @inbounds and/or @fastmath to each loop by changing the quote block in the @loop macro
macro loop(args...)
    ex,_,itr = args
    op,I,R = itr.args
    @assert op ∈ (:(∈),:(in))
    return quote
        @inbounds for $I ∈ $R
            @fastmath $ex
        end
    end |> esc
end
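A quick way to verify that the performance macros are injected is @macroexpand1, which expands @loop only one level so that the inner macros remain visible; the output looks roughly as follows (line-number comments omitted):
@macroexpand1 @loop σ[I] += ∂(1, I, u) over I ∈ inside(σ)
# quote
#     @inbounds for I = inside(σ)
#         @fastmath σ[I] += ∂(1, I, u)
#     end
# end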
And, even nicer, we can use this approach to automatically generate KA kernels for every loop in the code! To do so, we modify @loop to generate the KA kernel using @kernel, together with the wrapper call that sets the backend and the workgroup size
macro loop(args...)
    ex,_,itr = args
    _,I,R = itr.args; sym = []
    grab!(sym,ex)     # get arguments and replace composites in `ex`
    setdiff!(sym,[I]) # don't want to pass I as an argument
    @gensym kern      # generate unique kernel function name
    return quote
        @kernel function $kern($(rep.(sym)...),@Const(I0)) # replace composite arguments
            $I = @index(Global,Cartesian)
            $I += I0
            $ex
        end
        $kern(get_backend($(sym[1])),64)($(sym...),$R[1]-oneunit($R[1]),ndrange=size($R))
    end |> esc
end
function grab!(sym,ex::Expr)
    ex.head == :. && return union!(sym,[ex])      # grab composite name and return
    start = ex.head==:(call) ? 2 : 1              # don't grab function names
    foreach(a->grab!(sym,a),ex.args[start:end])   # recurse into args
    ex.args[start:end] = rep.(ex.args[start:end]) # replace composites in args
end
grab!(sym,ex::Symbol) = union!(sym,[ex]) # grab symbol name
grab!(sym,ex) = nothing
rep(ex) = ex
rep(ex::Expr) = ex.head == :. ? Symbol(ex.args[2].value) : ex
The helper functions grab! and rep allow us to extract the arguments required by the expression ex, which are then passed to the generated kernel together with the Cartesian index range.
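To see what grab! collects, here is a small illustrative check on the divergence expression (note that the function name ∂ is deliberately not grabbed):
sym = []
ex = :(σ[I] += ∂(1, I, u))
grab!(sym, ex)       # sym is now Any[:σ, :I, :u]
setdiff!(sym, [:I])  # the loop index is removed, leaving Any[:σ, :u]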
The code generated by @loop and @kernel can be explored using @macroexpand. For example, for d=1
@macroexpand @loop σ[I] += ∂(1, I, u) over I ∈ inside(σ)
we can observe that the code for both CPU and GPU kernels is produced:
Generated code
@macroexpand @loopKA σ[I] += ∂(1, I, u) over I ∈ inside(σ)

quote
    begin
        function var"cpu_##kern#339"(__ctx__, σ, u, I0; )
            let I0 = (KernelAbstractions.constify)(I0)
                $(Expr(:aliasscope))
                begin
                    var"##N#341" = length((KernelAbstractions.__workitems_iterspace)(__ctx__))
                    begin
                        for var"##I#340" = (KernelAbstractions.__workitems_iterspace)(__ctx__)
                            (KernelAbstractions.__validindex)(__ctx__, var"##I#340") || continue
                            I = KernelAbstractions.__index_Global_Cartesian(__ctx__, var"##I#340")
                            begin
                                I += I0
                                σ[I] += ∂(1, I, u)
                            end
                        end
                    end
                end
                $(Expr(:popaliasscope))
                return nothing
            end
        end
        function var"gpu_##kern#339"(__ctx__, σ, u, I0; )
            let I0 = (KernelAbstractions.constify)(I0)
                if (KernelAbstractions.__validindex)(__ctx__)
                    begin
                        I = KernelAbstractions.__index_Global_Cartesian(__ctx__)
                        I += I0
                        σ[I] += ∂(1, I, u)
                    end
                end
                return nothing
            end
        end
        begin
            if !($(Expr(:isdefined, Symbol("##kern#339"))))
                begin
                    $(Expr(:meta, :doc))
                    var"##kern#339"(dev) = begin
                        var"##kern#339"(dev, (KernelAbstractions.NDIteration.DynamicSize)(), (KernelAbstractions.NDIteration.DynamicSize)())
                    end
                end
                var"##kern#339"(dev, size) = begin
                    var"##kern#339"(dev, (KernelAbstractions.NDIteration.StaticSize)(size), (KernelAbstractions.NDIteration.DynamicSize)())
                end
                var"##kern#339"(dev, size, range) = begin
                    var"##kern#339"(dev, (KernelAbstractions.NDIteration.StaticSize)(size), (KernelAbstractions.NDIteration.StaticSize)(range))
                end
                function var"##kern#339"(dev::Dev, sz::S, range::NDRange) where {Dev, S <: KernelAbstractions.NDIteration._Size, NDRange <: KernelAbstractions.NDIteration._Size}
                    if (KernelAbstractions.isgpu)(dev)
                        return (KernelAbstractions.construct)(dev, sz, range, var"gpu_##kern#339")
                    else
                        return (KernelAbstractions.construct)(dev, sz, range, var"cpu_##kern#339")
                    end
                end
            end
        end
    end
    (var"##kern#339"(get_backend(σ), 64))(σ, u, (inside(σ))[1] - oneunit((inside(σ))[1]), ndrange = size(inside(σ)))
end
The best feature we get from modifying @loop to produce KA kernels is that the divergence operator is written exactly as before introducing KA
for d ∈ 1:ndims(σ)
    @loop σ[I] += ∂(d, I, u) over I ∈ inside(σ)
end
This exact approach is what has allowed WaterLily to keep the same number of lines of code (LOC) as before the KA port: just around 800!
Benchmarking
Now that we have all the items in place, we can benchmark the speedup achieved by KA compared to the serial execution using BenchmarkTools.jl.
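One detail to keep in mind when timing the KA version is that GPU kernels are launched asynchronously, so the backend should be synchronized inside the timed expression; a minimal sketch with BenchmarkTools:
using BenchmarkTools
using KernelAbstractions: synchronize, get_backend

@btime begin
    divergence!($σ, $u)
    synchronize(get_backend($σ))  # wait for the (possibly asynchronous) kernel to finish
end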
Let’s now gather all the code we have used and create a small benchmarking MWE (see below or download it here).
In this code showcase, we will refer to the serial CPU execution as “serial”, the multi-threaded CPU execution as “CPU”, and the GPU execution as “GPU”:
using KernelAbstractions: get_backend, synchronize, @index, @kernel, @groupsize
using CUDA: CuArray
using BenchmarkTools
δ(d,::CartesianIndex{D}) where {D} = CartesianIndex(ntuple(j -> j==d ? 1 : 0, D))
@inline ∂(a,I::CartesianIndex{D},u::AbstractArray{T,n}) where {D,T,n} = u[I+δ(a,I),a]-u[I,a]
inside(a) = CartesianIndices(ntuple(i-> 2:size(a)[i]-1,ndims(a)))
# serial loop macro
macro loop(args...)
    ex,_,itr = args
    op,I,R = itr.args
    @assert op ∈ (:(∈),:(in))
    return quote
        for $I ∈ $R
            $ex
        end
    end |> esc
end
# KA-adapted loop macro
macro loopKA(args...)
    ex,_,itr = args
    _,I,R = itr.args; sym = []
    grab!(sym,ex)     # get arguments and replace composites in `ex`
    setdiff!(sym,[I]) # don't want to pass I as an argument
    @gensym kern      # generate unique kernel function name