Don’t “optimize” conditional moves in shaders with mix()+step() by romes

Intro

In this article I want to correct a popular misconception that has been making the rounds in computer graphics circles for a long time now. It has to do with branching on GPUs. Unfortunately, a couple of educational websites are spreading misinformation about it, and it would be nice to correct that. I tried contacting the authors without success, so without further ado, here goes my attempt to fix things up:

The issue

So, say I have this code, which I actually published the other day:

vec2 snap45( in vec2 v )
{
    vec2 s = sign(v);
    float x = abs(v.x);
    return x>0.923880?vec2(s.x,0.0):
           x>0.382683?s*sqrt(0.5):
                      vec2(0.0,s.y);
}

The exact details of what it does don’t matter for this discussion. All we care about are the two ternary operations, which, as you know, implement conditional execution: depending on the value of the variable x, the function returns different results. This could also be implemented with regular if statements, and everything I’m going to say would stay the same.
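As a minimal sketch of that last point, here is the same function written with if statements. The name snap45_if is hypothetical and only used to tell it apart from the original; the thresholds and return values are copied unchanged from the code above.

```glsl
// Sketch: snap45() rewritten with if statements instead of ternaries.
// snap45_if is a hypothetical name, not part of the original code.
vec2 snap45_if( in vec2 v )
{
    vec2 s = sign(v);
    float x = abs(v.x);

    // same two comparisons, same three possible results
    if( x>0.923880 ) return vec2(s.x,0.0);
    if( x>0.382683 ) return s*sqrt(0.5);
    return vec2(0.0,s.y);
}
```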

But here’s the problem – when seeing code like this, somebody somewhere will invariably propose the following “optimization”, which replaces what they (erroneously) believe are “conditional branches” with arithmetic operations. They will suggest something like this:

vec2 snap45( in vec2 v )
{
    vec2 s = sign(v);
    float x = abs(v.x);

    float w0 = step(0.92387953,x);
    float w1 = step(0.38268343,x)*(1.0-w0);
    float w2 = 1.0-w0-w1;

    vec2 res0 = vec2(s.x,0.0);
    vec2 res1 = vec2(s.x,s.y)*sqrt(0.5);
    vec2 res2 = vec2(0.0,s.y);

    return w0*res0 + w1*res1 + w2*res2;
}

The second thing wrong with the supposedly optimized version is that it actually runs much slower than the original version. The reason is that the step() function is actually implemented like this:

float step( float edge, float x )
{
    return x < edge ? 0.0 : 1.0;
}
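To make that concrete, here is a sketch of the step()-based version with each step() call expanded by hand, following the GLSL definition of step(edge, x), which returns 0.0 when x < edge and 1.0 otherwise. The name snap45_expanded is hypothetical, and this is only a source-level illustration, not what any particular compiler emits.

```glsl
// Sketch: the step()-based snap45() with step() expanded by hand.
// snap45_expanded is a hypothetical name used for illustration only.
vec2 snap45_expanded( in vec2 v )
{
    vec2 s = sign(v);
    float x = abs(v.x);

    // each step() is itself a conditional select
    float w0 = (x < 0.92387953) ? 0.0 : 1.0;
    float w1 = ((x < 0.38268343) ? 0.0 : 1.0)*(1.0-w0);
    float w2 = 1.0-w0-w1;

    // the selects are then followed by extra multiplications and additions
    return w0*vec2(s.x,0.0) + w1*s*sqrt(0.5) + w2*vec2(0.0,s.y);
}
```

Written this way, the conditionals have not gone away. They are still there, now with arithmetic stacked on top of them.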

The variants of mix where a is genBType select which vector each returned component comes from. For a component of a that is false, the corresponding component of x is returned. For a component of a that is true, the corresponding component of y is returned.
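A minimal sketch of that boolean variant, assuming a GLSL version that provides the mix(genType, genType, genBType) overload. The helper name selectGreater and its parameters are hypothetical.

```glsl
// Sketch: per-component select with the boolean overload of mix().
// Assumes the GLSL version in use provides mix(genType, genType, genBType).
// selectGreater is a hypothetical helper name.
vec2 selectGreater( in vec2 a, in vec2 b, in vec2 x, in vec2 y )
{
    // greaterThan() compares component-wise and yields a bvec2;
    // where a component is true the result takes b, otherwise a
    return mix( a, b, greaterThan( x, y ) );
}
```

Each component is selected independently, so no weights are computed and no blending arithmetic is needed.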

13 Comments

ttoinou

Posted February 9, 2025 at 1:25 pm

Thanks, Inigo!

How are we supposed to know which OpenGL functions are emulated rather than calling GPU primitives?

doctorhandshake

Posted February 9, 2025 at 1:28 pm

I don’t know enough about these implementations to know if this can be interpreted as a blanket ‘conditionals are fine’ or, rather, ‘ternary operations which select between two themselves non-branching expressions are fine’.

Like does this apply if one of the two branches of a conditional is computationally much more expensive? My (very shallow) understanding was that having, eg, a return statement on one branch and a bunch of work on the other would hamstring the GPU’s ability to optimize execution.

toredo1729_2

Posted February 9, 2025 at 1:32 pm

Unrelated, but somehow similar: I really hate it that it's not possible to force gcc to transform things like this into a conditional move:

x > c ? y : 0.;

It annoyed me many times and it still does.

ryao

Posted February 9, 2025 at 1:40 pm

Do shader compilers have optimization passes to undo this mistake and if not, could they be added?

mirsadm

Posted February 9, 2025 at 1:45 pm

I've been caught by this. Even Claude/ChatGPT will suggest it as an optimisation. Every time I've measured a performance drop doing this. Sometimes significant.

londons_explore

Posted February 9, 2025 at 1:51 pm

So why isn't the compiler smart enough to see that the 'optimised' version is the same?

Surely it understands "step()" and can optimize the "step()==0.0" and "step()==1.0" cases separately?

This is presumably always worth it, because you would at least remove one multiplication (usually turning it into a conditional load/store/something else)

flowzai4

Posted February 9, 2025 at 1:56 pm

[dead]

magicalhippo

Posted February 9, 2025 at 1:59 pm

Processors change, compilers change. If you care about such details, best to ship multiple variants and pick the fastest one at runtime.

As I've mentioned here several times before, I've made code significantly faster by removing the hand-rolled assembly and replacing it with plain C or similar. While the assembly might have been faster a decade or two ago, things have changed…

quuxplusone

Posted February 9, 2025 at 2:14 pm

I'm sure TFA's conclusion is right; but its argument would be strengthened by providing the codegen for both versions, instead of just the better version. Quote:

"The second wrong thing with the supposedly optimizer [sic] version is that it actually runs much slower than the original version […] wasting two multiplications and one or two additions. […] But don't take my word for it, let's look at the generated machine code for the relevant part of the shader"

—then proceeds to show only one codegen: the one containing no multiplications or additions. That proves the good version is fine; it doesn't yet prove the bad version is worse.

TinkersW

Posted February 9, 2025 at 2:15 pm

It is weird how long misinformation like this sticks around. The conditional move/select approach has been superior for decades on both CPU & GPU, but somehow some people still write the other approach as an "optimization".

mahkoh

Posted February 9, 2025 at 2:22 pm

    So, if you ever see somebody proposing this

    float a = mix( b, c, step( y, x ) );

The author seems unaware of

    float a = mix( b, c, y > x );

which encodes the desired behavior and also works for vectors:

alkonaut

Posted February 9, 2025 at 2:56 pm

I wish there was a good way of knowing when an if forces an actual branch and when it doesn't. The reason people do potentially more expensive mix/lerps is that, while it might cost a tiny overhead, they are scared of making it a branch.

I do like that the most obvious v = x > y ? a : b; actually works, but it's also concerning that we have syntax where an if is sometimes a branch and sometimes not. In a context where you really can't branch, you'd almost like branch-if and non-branching-if to be different keywords. The non-branching one would fail compilation if the compiler couldn't do it without branching. The branching one would warn if it could be done without branching.

DrNosferatu

Posted February 9, 2025 at 3:28 pm

This should be quantified and generalized for a full set of cases – that way the argument would stand far more clearly.
