An Appeal to Apple from Anukari: one tiny macOS detail to make Anukari fast by humbledrone

TL;DR: To make Anukari’s performance reliable across all Apple silicon macOS devices, I need to talk to someone on the Apple Metal team. It would be great if someone can connect me with the right person inside Apple, or direct them to my feedback request FB17475838 as well as this devlog entry.

This is going to be a VERY LONG HIGHLY TECHNICAL post, so either buckle your seatbelt or leave while you still can.

Background

The Anukari 3D Physics Synthesizer simulates a large spring-mass model in real-time for audio generation. To support a nontrivial number of physics objects, it requires a GPU for the simulation. The physics code is ALU-bound, not memory-bound. All mutable state in the simulation is stored in the GPU’s threadgroup memory, which is roughly equivalent to a manually-allocated L1 cache, so it is extremely fast.

The typical use-case for Anukari is running it as an AudioUnit (AU) or VST3 plugin inside a host application like Pro Tools or Ableton, also called a Digital Audio Workstation (DAW). The DAW invokes Anukari for each audio buffer block, which is a request to generate/process N samples of audio. For each block, Anukari invokes the physics simulation GPU kernel, waits for the result, and returns.
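To make the per-block contract concrete, here is a minimal CPU-side sketch, not Anukari's actual code: a single damped mass on a spring, advanced one step per audio sample, with state carried across blocks. The class name, the constants, the 48 kHz sample rate, and the semi-implicit Euler integrator are all assumptions chosen for illustration; the real simulation runs many coupled objects in a GPU kernel.

```python
from dataclasses import dataclass


@dataclass
class MassSpring:
    """One mass on one damped spring; a stand-in for the full model."""
    position: float = 1.0      # displacement from rest (arbitrary units)
    velocity: float = 0.0
    stiffness: float = 4.0e6   # spring constant k (hypothetical value)
    damping: float = 2.0       # velocity damping coefficient (hypothetical)
    mass: float = 1.0


def process_block(state: MassSpring, num_samples: int,
                  sample_rate: float = 48_000.0) -> list[float]:
    """Advance the simulation num_samples steps; return one sample per step."""
    dt = 1.0 / sample_rate
    out = []
    for _ in range(num_samples):
        # Semi-implicit Euler: update velocity first, then position.
        accel = (-state.stiffness * state.position
                 - state.damping * state.velocity) / state.mass
        state.velocity += accel * dt
        state.position += state.velocity * dt
        out.append(state.position)
    return out


# The host asks for one block at a time; state persists between calls.
synth = MassSpring()
block = process_block(synth, 512)
```

Because the state lives outside the function, two consecutive 256-sample calls produce the same audio as one 512-sample call, which is the property a host relies on when it changes buffer sizes.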

The audio buffer block system is important because GPU kernel scheduling has a certain amount of latency overhead, and for real-time audio we have fixed time constraints. By amortizing the GPU scheduling latency over, say, 512 audio samples, it becomes negligible. But the runtime of the kernel itself is still very important.
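The amortization argument can be checked with a little arithmetic. The sketch below assumes a 48 kHz sample rate and a 0.3 ms GPU scheduling latency; both figures are illustrative assumptions, not measurements from Anukari.

```python
def block_budget_ms(block_size: int, sample_rate: float) -> float:
    """Wall-clock time available to produce one block of audio."""
    return 1000.0 * block_size / sample_rate


def overhead_fraction(latency_ms: float, block_size: int,
                      sample_rate: float) -> float:
    """Share of the block's time budget consumed by scheduling latency."""
    return latency_ms / block_budget_ms(block_size, sample_rate)


SAMPLE_RATE = 48_000.0   # assumed sample rate in Hz
LATENCY_MS = 0.3         # assumed per-dispatch scheduling latency

for size in (16, 512):
    budget = block_budget_ms(size, SAMPLE_RATE)
    share = overhead_fraction(LATENCY_MS, size, SAMPLE_RATE)
    print(f"{size:4d} samples: budget {budget:.2f} ms, latency share {share:.1%}")
```

Under these assumptions a 512-sample block has a budget of about 10.67 ms, of which the fixed latency takes under 3%, while a 16-sample block would lose 90% of its budget to the same latency. Whatever remains is what the kernel itself must fit into.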

Basic Problem

Apple’s macOS is obviously extremely clever about power management, and Apple silicon hardware is built to support the OS in achieving high power efficiency.

As with all modern hardware, the clock rate for Apple silicon chips can be slowed down to reduce power consumption. When the OS detects that the processing demand for a given chip is low (or non-existent), it can decrease the clock rate for that chip. This is awesome.

The problem is that due to the way Anukari runs inside a DAW and interacts with the GPU, the heuristics that macOS uses to determine whether there is sufficient demand upon the GPU to increase its clock rate do not work.

Consider the chart below. The CPU does some preparatory work, there’s a small gap which represents kernel invocation latency, and then the GPU does a large block of work. Finally there’s another small gap representing the real-time headroom.

(An aside: chalkboards are way better than whiteboards, unless you enjoy getting high on noxious fumes, in which case whiteboards are the way to go.)

I don’t have any real knowledge of macOS’s heuristics for deciding when to increase the GPU clock speed, but I might reasonably guess that it relies on something like the load average. In the diagram above, the GPU load average might be only 60%, because between audio buffer blocks it is idle. Perhaps this does not meet the threshold for increasing the GPU clock rate.
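The guess above can be written down as a toy model. Everything here is hypothetical: the threshold rule, the 75% threshold, and the function names are invented for illustration and say nothing about how macOS actually decides. The 48 kHz sample rate is likewise an assumption.

```python
def gpu_utilization(busy_ms: float, period_ms: float) -> float:
    """Fraction of each audio block period during which the GPU is busy."""
    return busy_ms / period_ms


def would_raise_clock(utilization: float, threshold: float = 0.75) -> bool:
    """Hypothetical governor rule: boost only at or above the threshold."""
    return utilization >= threshold


period_ms = 1000.0 * 512 / 48_000   # one 512-sample block at an assumed 48 kHz
busy_ms = 0.6 * period_ms           # kernel occupies 60% of the period
load = gpu_utilization(busy_ms, period_ms)
print(f"utilization: {load:.0%}, boost: {would_raise_clock(load)}")
```

In this model the workload never triggers a boost even though every block is deadline-critical, because an averaged load figure carries no information about how close each block comes to its deadline.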

But this is terrible for Anukari, because to meet real-time constraints, it needs the absolute lowest latency possible, which requires the highest GPU clock rate. I’m not sure how low the Apple GPU clock rate can go, but it definitely goes low enough to make Anukari unusable.

To be clear, it’s pretty understandable that macOS handles this situation poorly, because the GPU is mostly used for throughput workflows like graphics or ML. Audio on the GPU is really new, and there are only a couple of companies doing it right now.

Are you sure the clock rate is the problem?

Oh yes. Thankfully, Apple’s first-party Instruments tools that come with Xcode have a handy Metal profiler. Among other things, this is how I first learned that Anukari is ALU-bound.

The Metal profiler has an incredibly useful feature: it allows you to choose the Metal “Performance State” while profiling the application. This is not configurable outside of the profiler. This is how I first figured out that the GPU clock rate was the issue: Anukari works perfectly under the Maximum performance state, and abysmally under the Minimum performance state.

Wait, Anukari mostly works great on macOS. How is that possible?

Giv

humbledrone

Posted May 6, 2025 at 3:40 am

Some folks may have seen my Show HN post for Anukari here: https://news.ycombinator.com/item?id=43873074

In that thread, the topic of macOS performance came up. Basically Anukari works great for most people on Apple silicon, including base-model M1 hardware. I've done all my testing on a base M1 and it works wonderfully. The hardware is incredible.

But to make it work, I had to implement an unholy abomination of a workaround to get macOS to increase the GPU clock rate for the audio processing to be fast enough. The normal heuristics that macOS uses for the GPU performance state don't understand the weird Anukari workload.

Anyway, I finally had time to write down the full situation, in terrible detail, so that I could ask for help getting in touch with the right person at Apple, probably someone who works on the Metal API.

Help! :)

krackers

Posted May 6, 2025 at 8:03 am

>The Metal profiler has an incredibly useful feature: it allows you to choose the Metal “Performance State” while profiling the application. This is not configurable outside of the profiler.

Seems like there might be a private API for this. Maybe it's easier to go the reverse engineering route? Unless it'll end up requiring some special entitlement that you can't bypass without disabling SIP.

LiamPowell

Posted May 6, 2025 at 8:44 am

The problem with exposing an API for this is that far too many developers will force the highest performance state all the time. I don't know if there's really a good way to stop that and have the API at the same time.

Someone

Posted May 6, 2025 at 9:51 am

One thing I don’t understand: if latency is important for this use case, why isn’t the CPU busy preparing the next GPU ‘job’ while a GPU ‘job’ is running?

Is that a limitation of the audio plug-in APIs?

threeseed

Posted May 6, 2025 at 9:58 am

Best way to do this:

1. Go through WWDC videos and find the engineer who seems the most knowledgeable about the issue you're facing.

2. Email them directly with this format: mthomson@apple.com for Michael Thomson.

SOLAR_FIELDS

Posted May 6, 2025 at 11:04 am

https://xkcd.com/1172/ feels a lot like the workaround OP describes

sgt

Posted May 6, 2025 at 11:10 am

I have zero need for this app but it's so cool. Apps like these bring the "fun" back into computing. I don't mean there's no fun at the moment, but it reminds me of the old days, when more graphical and experimental programs floated around, even the demoscene.

charcircuit

Posted May 6, 2025 at 11:33 am

>Any MTLCommandQueue managed by an Audio Workgroup thread could be treated as real-time and the GPU clock could be adjusted accordingly.

>The Metal API could simply provide an option on MTLCommandQueue to indicate that it is real-time sensitive, and the clock for the GPU chiplet handling that queue could be adjusted accordingly.

Realtime scheduling on a GPU and what the GPU is clocked to are separate concepts. From the article it sounds like the issue is with the clock speeds and not how the work is being scheduled. It sounds like you need something else for providing a hint for requesting a higher GPU clock.
