WaterLily.jl is a simple and fast fluid simulator written in pure Julia. It solves the unsteady incompressible 2D or 3D Navier-Stokes equations on a Cartesian grid.
The pressure Poisson equation is solved with a geometric multigrid method.
Solid boundaries are modelled using the Boundary Data Immersion Method.
With v1.0, the solver has been ported from serial CPU execution to a backend-agnostic execution, including multi-threaded CPU and GPUs from different vendors (NVIDIA and AMD), thanks to KernelAbstractions.jl (KA).
We have also recently published an extended abstract preprint with benchmarking details regarding this port (see arXiv).
In this post, we review our approach to porting the code, together with the challenges we have faced.
Introducing KernelAbstractions.jl
The main ingredient of this port is the @kernel macro from KA.
This macro takes a function definition and converts it into a kernel specialised for a given backend. KA can work with CUDA, ROCm, oneAPI, and Metal backends.
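In its simplest form, a KA kernel looks like this (a minimal sketch for illustration only, not WaterLily code): the kernel body is written once, and instantiating it with a backend and a workgroup size produces a function that runs on that backend.
using KernelAbstractions: @kernel, @index, get_backend

@kernel function _scale!(a, b)
    i = @index(Global, Linear)  # global linear index of this work item
    a[i] = 2 * b[i]
end

a = zeros(100); b = rand(100)
_scale!(get_backend(a), 64)(a, b, ndrange=length(a))  # CPU() here; GPU arrays would launch on the GPU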
As an example, consider the divergence operator for a 2D vector field $\vec{u}=(u, v)$.
In the finite-volume method (FVM) and using a Cartesian (uniform) grid with unit cells, this is defined as
\[\begin{align}
\sigma=\unicode{x2230}(\nabla\cdot\vec{u})\,\mathrm{d}V = \unicode{x222F}\vec{u}\cdot\hat{n}\,\mathrm{d}S \rightarrow \sigma_{i,j} = (u_{i+1,j} - u_{i,j}) + (v_{i,j+1} - v_{i,j}),
\end{align}\]
where $i$ and $j$ are the indices of the discretised staggered grid.
In WaterLily, we define loops based on CartesianIndex such that I=(i,j,...), which allows us to write an n-dimensional solver in a very straightforward way.
With this, to compute the divergence of a 2D vector field we can use
δ(d,::CartesianIndex{D}) where {D} = CartesianIndex(ntuple(j -> j==d ? 1 : 0, D)) # unit Cartesian step in direction d
@inline ∂(a,I::CartesianIndex{D},u::AbstractArray{T,n}) where {D,T,n} = u[I+δ(a,I),a] - u[I,a] # forward difference of the a-th component of u along direction a
inside(a) = CartesianIndices(ntuple(i-> 2:size(a)[i]-1, ndims(a))) # Cartesian indices excluding the ghost cells
N = (10, 10) # domain size
σ = zeros(N) # scalar field
u = rand(N..., length(N)) # 2D vector field with ghost cells
for d ∈ 1:ndims(σ), I ∈ inside(σ)
    σ[I] += ∂(d, I, u)
end
where a loop over each dimension d and each Cartesian index I is used. The function δ provides a Cartesian step in direction d: for a 2D index I, δ(1, I) returns CartesianIndex(1, 0), and for a 3D index, δ(2, I) returns CartesianIndex(0, 1, 0). This is used in the derivative function ∂ to implement the divergence equation described above. inside(σ) provides the CartesianIndices of σ excluding the ghost elements, i.e. a range of Cartesian indices starting at (2, 2) and ending at (9, 9) when size(σ) == (10, 10).
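A quick check of these helper functions (illustrative; the expected results are shown in the comments):
δ(1, CartesianIndex(1, 1))     # CartesianIndex(1, 0)
δ(2, CartesianIndex(1, 1, 1))  # CartesianIndex(0, 1, 0)
inside(zeros(10, 10))          # CartesianIndices((2:9, 2:9))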
Note that the divergence operation is independent for each I, which is great because it means we can parallelize it! This is where KA comes into play.
To generate the divergence operator using KA, we need to write the divergence kernel, which is just the divergence operator written in KA style. We define the divergence kernel _divergence! and a wrapper divergence! as follows
using KernelAbstractions: get_backend, @index, @kernel
@kernel function _divergence!(σ, u, @Const(I0))
    I = @index(Global, Cartesian) # global Cartesian index of this work item
    I += I0                       # offset to skip the ghost cells
    σ_sum = zero(eltype(σ))
    for d ∈ 1:ndims(σ)
        σ_sum += ∂(d, I, u)
    end
    σ[I] = σ_sum
end
function divergence!(σ, u)
    R = inside(σ)
    _divergence!(get_backend(σ), 64)(σ, u, R[1]-oneunit(R[1]), ndrange=size(R))
end
Note that in the _divergence! kernel we operate again using Cartesian indices, obtained by calling @index(Global, Cartesian) from the KA @index macro. The range of Cartesian indices is given by the ndrange argument in the wrapper function, where we pass the size of the inside(σ) index range, and the backend is inferred with the get_backend method. Also note that we pass an additional (constant) argument I0, which provides the offset to apply to the indices given by @index, since these naturally start at (1,1,...). Using a workgroup size of 64 (the size of the group of threads acting in parallel, see terminology), KA will parallelize over I by multi-threading on the CPU or on the GPU.
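For instance, the wrapper above already runs multi-threaded on the CPU when σ and u are plain Arrays (a sketch; start Julia with several threads, e.g. julia -t auto, to benefit from the CPU parallelism):
N = (10, 10)
σ = zeros(N)              # plain Array, so get_backend(σ) == CPU()
u = rand(N..., length(N))
divergence!(σ, u)         # multi-threaded CPU execution through KA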
To switch backends, we just need to change the array type of σ and u from Array (CPU backend) to CuArray (NVIDIA GPU backend) or ROCArray (AMD GPU backend), and KA will specialise the kernel for the desired backend:
using CUDA: CuArray
N = (10, 10)
σ = zeros(N) |> CuArray
u = rand(N..., length(N)) |> CuArray
divergence!(σ, u)
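As a quick sanity check (a sketch using the arrays defined above), the GPU result can be compared against the serial loop; copying the device array back with Array also synchronizes before the comparison:
σ_ref = zeros(N); u_ref = Array(u)        # bring the data back to the CPU
for d ∈ 1:ndims(σ_ref), I ∈ inside(σ_ref)
    σ_ref[I] += ∂(d, I, u_ref)            # serial reference divergence
end
@assert Array(σ) ≈ σ_ref                  # the KA (GPU) result matches the serial one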
Automatic loop and kernel generation
As a stencil-based CFD solver, WaterLily heavily uses for loops to iterate over the n-dimensional arrays. To automate the generation of such loops, the macro @loop is defined
macro loop(args...)
    ex,_,itr = args
    op,I,R = itr.args
    @assert op ∈ (:(∈),:(in))
    return quote
        for $I ∈ $R
            $ex
        end
    end |> esc
end
This macro takes an expression of the form @loop <expr> over I ∈ R, where R is a CartesianIndices range, and produces the loop for I ∈ R.
For example, the serial divergence operator could now be simply defined using
for d ∈ 1:ndims(σ)
    @loop σ[I] += ∂(d, I, u) over I ∈ inside(σ)
end
which generates the I ∈ inside(σ) loop automatically.
Even though this might seem like a small improvement (if any), the nice thing about writing loops this way is that the computationally-demanding part of the code can be abstracted out of the main workflow. For example, it is easy to add performance macros such as @inbounds and/or @fastmath to each loop by changing the quote block in the @loop macro
macro loop(args...)
    ex,_,itr = args
    op,I,R = itr.args
    @assert op ∈ (:(∈),:(in))
    return quote
        @inbounds for $I ∈ $R
            @fastmath $ex
        end
    end |> esc
end
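A quick way to verify that the performance macros are injected is @macroexpand1, which expands @loop only one level so that the inner macros remain visible; the output looks roughly as follows (line-number comments omitted):
@macroexpand1 @loop σ[I] += ∂(1, I, u) over I ∈ inside(σ)
# quote
#     @inbounds for I = inside(σ)
#         @fastmath σ[I] += ∂(1, I, u)
#     end
# end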
And, even nicer, we can use this approach to automatically generate KA kernels for every loop in the code! To do so, we modify @loop to generate the KA kernel using @kernel, together with the wrapper call that sets the backend and the workgroup size
macro loop(args...)
    ex,_,itr = args
    _,I,R = itr.args; sym = []
    grab!(sym,ex)     # get arguments and replace composites in `ex`
    setdiff!(sym,[I]) # don't want to pass I as an argument
    @gensym kern      # generate unique kernel function name
    return quote
        @kernel function $kern($(rep.(sym)...),@Const(I0)) # replace composite arguments
            $I = @index(Global,Cartesian)
            $I += I0
            $ex
        end
        $kern(get_backend($(sym[1])),64)($(sym...),$R[1]-oneunit($R[1]),ndrange=size($R))
    end |> esc
end
function grab!(sym,ex::Expr)
    ex.head == :. && return union!(sym,[ex])      # grab composite name and return
    start = ex.head==:(call) ? 2 : 1              # don't grab function names
    foreach(a->grab!(sym,a),ex.args[start:end])   # recurse into args
    ex.args[start:end] = rep.(ex.args[start:end]) # replace composites in args
end
grab!(sym,ex::Symbol) = union!(sym,[ex]) # grab symbol name
grab!(sym,ex) = nothing
rep(ex) = ex
rep(ex::Expr) = ex.head == :. ? Symbol(ex.args[2].value) : ex
The helper functions grab! and rep allow us to extract the arguments required by the expression ex, which are then passed to the generated kernel together with the Cartesian index range.
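To see what grab! collects, here is a small illustrative check on the divergence expression (note that the function name ∂ is deliberately not grabbed):
sym = []
ex = :(σ[I] += ∂(1, I, u))
grab!(sym, ex)       # sym is now Any[:σ, :I, :u]
setdiff!(sym, [:I])  # the loop index is removed, leaving Any[:σ, :u]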
The code generated by @loop and @kernel can be explored using @macroexpand. For example, for d=1
@macroexpand @loop σ[I] += ∂(1, I, u) over I ∈ inside(σ)
we can observe that the code for both CPU and GPU kernels is produced:
Generated code
@macroexpand @loopKA σ[I] += ∂(1, I, u) over I ∈ inside(σ)

quote
    begin
        function var"cpu_##kern#339"(__ctx__, σ, u, I0; )
            let I0 = (KernelAbstractions.constify)(I0)
                $(Expr(:aliasscope))
                begin
                    var"##N#341" = length((KernelAbstractions.__workitems_iterspace)(__ctx__))
                    begin
                        for var"##I#340" = (KernelAbstractions.__workitems_iterspace)(__ctx__)
                            (KernelAbstractions.__validindex)(__ctx__, var"##I#340") || continue
                            I = KernelAbstractions.__index_Global_Cartesian(__ctx__, var"##I#340")
                            begin
                                I += I0
                                σ[I] += ∂(1, I, u)
                            end
                        end
                    end
                end
                $(Expr(:popaliasscope))
                return nothing
            end
        end
        function var"gpu_##kern#339"(__ctx__, σ, u, I0; )
            let I0 = (KernelAbstractions.constify)(I0)
                if (KernelAbstractions.__validindex)(__ctx__)
                    begin
                        I = KernelAbstractions.__index_Global_Cartesian(__ctx__)
                        I += I0
                        σ[I] += ∂(1, I, u)
                    end
                end
                return nothing
            end
        end
        begin
            if !($(Expr(:isdefined, Symbol("##kern#339"))))
                begin
                    $(Expr(:meta, :doc))
                    var"##kern#339"(dev) = begin
                        var"##kern#339"(dev, (KernelAbstractions.NDIteration.DynamicSize)(), (KernelAbstractions.NDIteration.DynamicSize)())
                    end
                end
                var"##kern#339"(dev, size) = begin
                    var"##kern#339"(dev, (KernelAbstractions.NDIteration.StaticSize)(size), (KernelAbstractions.NDIteration.DynamicSize)())
                end
                var"##kern#339"(dev, size, range) = begin
                    var"##kern#339"(dev, (KernelAbstractions.NDIteration.StaticSize)(size), (KernelAbstractions.NDIteration.StaticSize)(range))
                end
                function var"##kern#339"(dev::Dev, sz::S, range::NDRange) where {Dev, S <: KernelAbstractions.NDIteration._Size, NDRange <: KernelAbstractions.NDIteration._Size}
                    if (KernelAbstractions.isgpu)(dev)
                        return (KernelAbstractions.construct)(dev, sz, range, var"gpu_##kern#339")
                    else
                        return (KernelAbstractions.construct)(dev, sz, range, var"cpu_##kern#339")
                    end
                end
            end
        end
    end
    (var"##kern#339"(get_backend(σ), 64))(σ, u, (inside(σ))[1] - oneunit((inside(σ))[1]), ndrange = size(inside(σ)))
end
The best feature we get from modifying @loop to produce KA kernels is that the divergence operator is written exactly as before introducing KA
for d ∈ 1:ndims(σ)
    @loop σ[I] += ∂(d, I, u) over I ∈ inside(σ)
end
This exact approach is what has allowed WaterLily to keep the same number of lines of code (LOC) as before the KA port: just around 800!
Benchmarking
Now that we have all the items in place, we can benchmark the speedup achieved by KA compared to the serial execution using BenchmarkTools.jl.
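One detail to keep in mind when timing the KA version is that GPU kernels are launched asynchronously, so the backend should be synchronized inside the timed expression; a minimal sketch with BenchmarkTools:
using BenchmarkTools
using KernelAbstractions: synchronize, get_backend

@btime begin
    divergence!($σ, $u)
    synchronize(get_backend($σ))  # wait for the (possibly asynchronous) kernel to finish
end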
Let’s now gather all the code we have used and create a small benchmarking MWE (see below or download it here).
In this code showcase, we will refer to the serial CPU execution as “serial”, the multi-threaded CPU execution as “CPU”, and the GPU execution as “GPU”:
using KernelAbstractions: get_backend, synchronize, @index, @kernel, @groupsize
using CUDA: CuArray
using BenchmarkTools
δ(d,::CartesianIndex{D}) where {D} = CartesianIndex(ntuple(j -> j==d ? 1 : 0, D))
@inline ∂(a,I::CartesianIndex{D},u::AbstractArray{T,n}) where {D,T,n} = u[I+δ(a,I),a]-u[I,a]
inside(a) = CartesianIndices(ntuple(i-> 2:size(a)[i]-1,ndims(a)))
# serial loop macro
macro loop(args...)
    ex,_,itr = args
    op,I,R = itr.args
    @assert op ∈ (:(∈),:(in))
    return quote
        for $I ∈ $R
            $ex
        end
    end |> esc
end
# KA-adapted loop macro
macro loopKA(args...)
    ex,_,itr = args
    _,I,R = itr.args; sym = []
    grab!(sym,ex)     # get arguments and replace composites in `ex`
    setdiff!(sym,[I]) # don't want to pass I as an argument
    @gensym kern      # generate unique kernel function name