
An Appeal to Apple from Anukari: one tiny macOS detail to make Anukari fast by humbledrone
TL;DR: To make Anukari’s performance reliable across all Apple silicon macOS devices, I need to talk to someone on the Apple Metal team. It would be great if someone can connect me with the right person inside Apple, or direct them to my feedback request FB17475838 as well as this devlog entry.
This is going to be a VERY LONG HIGHLY TECHNICAL post, so either buckle your seatbelt or leave while you still can.
Background
The Anukari 3D Physics Synthesizer simulates a large spring-mass model in real-time for audio generation. To support a nontrivial number of physics objects, it requires a GPU for the simulation. The physics code is ALU-bound, not memory-bound. All mutable state in the simulation is stored in the GPU’s threadgroup memory, which is roughly equivalent to a manually-allocated L1 cache, so it is extremely fast.
The typical use-case for Anukari is running it as an AudioUnit (AU) or VST3 plugin inside a host application like Pro Tools or Ableton, also called a Digital Audio Workstation (DAW). The DAW invokes Anukari for each audio buffer block, which is a request to generate/process N samples of audio. For each block, Anukari invokes the physics simulation GPU kernel, waits for the result, and returns.
The audio buffer block system is important because GPU kernel scheduling has a certain amount of latency overhead, and for real-time audio we have fixed time constraints. By amortizing the GPU scheduling latency over, say, 512 audio samples, it becomes negligible. But the runtime of the kernel itself is still very important.
Basic Problem
Apple’s macOS is obviously extremely clever about power management, and Apple silicon hardware is built to support the OS in achieving high power efficiency.
As with all modern hardware, the clock rate for Apple silicon chips can be slowed down to reduce power consumption. When the OS detects that the processing demand for a given chip is low (or non-existent), it can decrease the clock rate for that chip. This is awesome.
The problem is that due to the way Anukari runs inside a DAW and interacts with the GPU, the heuristics that macOS uses to determine whether there is sufficient demand upon the GPU to increase its clock rate do not work.
Consider the chart below. The CPU does some preparatory work, there’s a small gap which represents kernel invocation latency, and then the GPU does a large block of work. Finally there’s another small gap representing the real-time headroom.
(An aside: chalkboards are way better than whiteboards, unless you enjoy getting high on noxious fumes. in which case whiteboards are the way to go.)
I don’t have any real knowledge of macOS’s heuristics for deciding when to increase the GPU clock speed, but I might reasonably guess that it relies on something like the load average. In the diagram above, the GPU load average might be only 60%, because between audio buffer blocks it is idle. Perhaps this does not meet the threshold for increasing the GPU clock rate.
But this is terrible for Anukari, because to meet real-time constraints, it needs the absolute lowest latency possible, which requires the highest GPU clock rate. I’m not sure how low the Apple GPU clock rate can go, but it definitely goes low enough to make Anukari unusable.
To be clear, it’s pretty understandable that macOS handles this situation poorly, because the GPU is mostly used for throughput workflows like graphics or ML. Audio on the GPU is really new, and there are only a couple of companies doing it right now.
Are you sure the clock rate is the problem?
Oh yes. Thankfully, Apple’s first-part Instruments tools that come with Xcode have a handy Metal profiler. Among other things, this is how I first learned that Anukari is ALU-bound.
The Metal profiler has an incredibly useful feature: it allows you to choose the Metal “Performance State” while profiling the application. This is not configurable outside of the profiler. This is how I first figured out that the GPU clock rate was the issue: Anukari works perfectly under the Maximum performance state, and abysmally under the Minimum performance state.
Wait, Anukari mostly works great on macOS. How is that possible?
Giv
8 Comments
humbledrone
Some folks may have seen my Show HN post for Anukari here: https://news.ycombinator.com/item?id=43873074
In that thread, the topic of macOS performance came up there. Basically Anukari works great for most people on Apple silicon, including base-model M1 hardware. I've done all my testing on a base M1 and it works wonderfully. The hardware is incredible.
But to make it work, I had to implement an unholy abomination of a workaround to get macOS to increase the GPU clock rate for the audio processing to be fast enough. The normal heuristics that macOS uses for the GPU performance state don't understand the weird Anukari workload.
Anyway, I finally had time to write down the full situation, in terrible detail, so that I could ask for help getting in touch with the right person at Apple, probably someone who works on the Metal API.
Help! :)
krackers
>The Metal profiler has an incredibly useful feature: it allows you to choose the Metal “Performance State” while profiling the application. This is not configurable outside of the profiler.
Seems like there might be a private API for this. Maybe it's easier to go the reverse engineering route? Unless it'll end up requiring some special entitlement that you can't bypass without disabling SIP.
LiamPowell
The problem with exposing an API for this is that far too many developers will force the highest performance state all the time. I don't know if there's really a good way to stop that and have the API at the same time.
Someone
One thing I don’t understand: if latency is important for this use case, why isn’t the CPU busy preparing the next GPU ‘job’ while a GPU ‘job’ is running?
Is that a limitation of the audio plug-in APIs?
threeseed
Best way to do this:
1. Go through WWDC videos and find the engineer who seems the most knowledgable about the issue you're facing.
2. Email them directly with this format: mthomson@apple.com for Michael Thomson.
SOLAR_FIELDS
https://xkcd.com/1172/ feels a lot like the workaround OP describes
sgt
I have zero need for this app but it's so cool. Apps like these bring the "fun" back into computing. I don't mean there's no fun at the moment, but reminds me of the old days with more graphical and experimental programs that floated around, even the demoscene.
charcircuit
>Any MTLCommandQueue managed by an Audio Workgroup thread could be treated as real-time and the GPU clock could be adjusted accordingly.
>The Metal API could simply provide an option on MTLCommandQueue to indicate that it is real-time sensitive, and the clock for the GPU chiplet handling that queue could be adjusted accordingly.
Realtime scheduling on a GPU and what the GPU is clocked to are separate concepts. From the article it sounds like the issue is with the clock speeds and not how the work is being scheduled. It sounds like you need something else for providing a hint for requesting a higher GPU clock.