The GPU in your computer is about 10 to 100 times more powerful than the CPU, depending on workload. For real-time graphics rendering and machine learning, you are enjoying that power, and doing those workloads on a CPU is not viable. Why aren’t we exploiting that power for other workloads? What prevents a GPU from being a more general purpose computer?
I believe there are two main things holding it back. The first is an impoverished execution model, which makes certain tasks difficult or impossible to do efficiently; GPUs excel at big blocks of data with predictable shape, such as dense matrix multiplication, but struggle when the workload is dynamic. The second is that our languages and tools are inadequate; programming a parallel computer is just a lot harder.
Modern GPUs are also extremely complex, and getting more so rapidly. New features such as mesh shaders and work graphs are two steps forward, one step back; for each new capability there is a basic task that isn’t fully supported.
I believe a simpler, more powerful parallel computer is possible, and that there are signs in the historical record. In a slightly alternate universe, we would have those computers now, and be doing the work of designing algorithms and writing programs to run well on them, for a very broad range of tasks.
Last April, I gave a colloquium (video) at the UCSC CSE program with the same title. This blog is a companion to that.
Memory efficiency of sophisticated GPU programs
I’ve been working on Vello, an advanced 2D vector graphics renderer, for many years. The CPU uploads a scene description in a simplified binary SVG-like format, then the compute shaders take care of the rest, producing a 2D rendered image at the end. The compute shaders parse tree structures, do advanced computational geometry for stroke expansion, and run sorting-like algorithms for binning. Internally, it’s essentially a simple compiler, producing a separate optimized byte-code-like program for each 16×16 pixel tile, then interpreting those programs. What it cannot do, a problem I am increasingly frustrated by, is run in bounded memory. Each stage produces intermediate data structures, and the number and size of these structures depends on the input in an unpredictable way. For example, changing a single transform in the encoded scene can result in profoundly different rendering plans.
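To give a flavor of the per-tile byte-code idea, here is a deliberately simplified sketch (not Vello’s actual command set or types): each tile ends up with a short command list that a later stage interprets.

    // Simplified picture of "a byte-code-like program per tile" (hypothetical
    // commands, not Vello's): each 16x16 tile gets a short command list,
    // and a later stage interprets it to produce pixels.
    #[derive(Clone, Copy)]
    enum TileCmd {
        Fill { color: [f32; 4] },
        Blend { alpha: f32, color: [f32; 4] },
        End,
    }

    // Interpret one tile's command list; the real renderer would do this per
    // pixel, in a compute shader, for every tile in parallel.
    fn run_tile(cmds: &[TileCmd]) -> [f32; 4] {
        let mut px = [0.0f32; 4];
        for cmd in cmds {
            match *cmd {
                TileCmd::Fill { color } => px = color,
                TileCmd::Blend { alpha, color } => {
                    for i in 0..4 {
                        px[i] = px[i] * (1.0 - alpha) + color[i] * alpha;
                    }
                }
                TileCmd::End => break,
            }
        }
        px
    }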
The problem is that the buffers for the intermediate results need to be allocated (under CPU control) before kicking off the pipeline. There are a number of imperfect potential solutions. We could estimate memory requirements on the CPU before starting a render, but that’s expensive and may not be precise, resulting either in failure or waste. We could try a render, detect failure, and retry if buffers were exceeded, but doing readback from GPU to CPU is a big performance problem, and creates a significant architectural burden on other engines we’d interface with.
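As a sketch of the second option, here is roughly what the allocate / dispatch / check / retry loop looks like; the buffer names are hypothetical, and dispatch_pipeline stands in for the real GPU dispatch plus the costly readback of overflow counters.

    // Sketch of "try, detect overflow, retry" under assumed buffer names.
    struct Buffers { binning: usize, segments: usize }
    struct Counters { binning: usize, segments: usize }

    fn dispatch_pipeline(_bufs: &Buffers) -> Counters {
        // Placeholder: pretend the scene needed this much intermediate storage.
        Counters { binning: 4096, segments: 16384 }
    }

    fn render_with_retry(mut bufs: Buffers) -> Buffers {
        loop {
            let needed = dispatch_pipeline(&bufs); // full GPU->CPU round trip here
            if needed.binning <= bufs.binning && needed.segments <= bufs.segments {
                return bufs; // everything fit; the frame is valid
            }
            // Grow whatever overflowed and re-render the whole frame.
            bufs.binning = bufs.binning.max(needed.binning);
            bufs.segments = bufs.segments.max(needed.segments);
        }
    }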
The details of the specific problem are interesting but beyond the scope of this blog post. The interested reader is directed to the Potato design document, which explores the question of how far you can get doing scheduling on CPU, respecting bounded GPU resources, while using the GPU for actual pixel wrangling. It also touches on several more recent extensions to the standard GPU execution model, all of which are complex and non-portable, and none of which quite seem to solve the problem.
Fundamentally, it shouldn’t be necessary to allocate large buffers to store intermediate results. Since they will be consumed by downstream stages, it’s far more efficient to put them in queues, sized large enough to keep enough items in flight to exploit available parallelism. Many GPU operations internally work as queues (the standard vertex shader / fragment shader / rasterop pipeline being the classic example), so it’s a question of exposing that underlying functionality to applications. The GRAMPS paper from 2009 suggests this direction, as did the Brook project, a predecessor to CUDA.
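As a CPU-side analogy (plain threads and a bounded channel, nothing GPU-specific), this is the shape I mean: the queue is sized to keep both stages busy, not to hold the entire intermediate result.

    use std::sync::mpsc::sync_channel;
    use std::thread;

    fn main() {
        // Bounded queue: send blocks when it is full, giving backpressure
        // instead of a worst-case-sized intermediate buffer.
        let (tx, rx) = sync_channel::<u32>(256);
        let producer = thread::spawn(move || {
            for item in 0..1_000_000u32 {
                tx.send(item).unwrap();
            }
        });
        let consumer = thread::spawn(move || rx.iter().map(u64::from).sum::<u64>());
        producer.join().unwrap();
        println!("sum = {}", consumer.join().unwrap());
    }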
There are a lot of potential solutions to running Vello-like algorithms in bounded memory; most have a fatal flaw on hardware today. It’s interesting to speculate about changes that would unlock the capability. It’s worth emphasizing that I’m not held back by the amount of parallelism I can exploit; my approach of breaking the problem into variants of prefix sum easily scales to hundreds of thousands of threads. Rather, it’s the inability to organize the overall computation as stages operating in parallel, connected through queues tuned to use only the amount of buffer memory needed to keep everything running smoothly, as opposed to the compute shader execution model of large dispatches separated by pipeline barriers.
Parallel computers of the past
The lack of a good parallel computer today is especially frustrating because there were some promising designs in the past, which failed to catch on for various complex reasons, leaving us with overly complex and limited GPUs, and extremely limited, though efficient, AI accelerators.
Connection Machine
I’m listing this not because it’s a particularly promising design, but because it expressed the dream of a good parallel computer in the clearest way. The first Connection Machine shipped in 1985 and contained up to 64k processors, connected in a hypercube network. That processor count is large even by today’s standards, though each individual processor was extremely underpowered.
Perhaps more than anything else, the CM spurred tremendous research into parallel algorithms. The pioneering work by Blelloch on prefix sum was largely done on the Connection Machine, and I find the early paper on sorting on the CM-2 quite fascinating.
Connection Machine 1 (1985) at KIT / Informatics / TECO • by KIT TECO • CC0
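To make the prefix sum primitive concrete, here is a sequential sketch of Blelloch’s work-efficient exclusive scan (up-sweep, then down-sweep); on the CM, or on a GPU today, each inner loop runs across many processors at once. It assumes a non-empty, power-of-two input length.

    // Blelloch exclusive prefix sum: the up-sweep builds partial sums in a
    // binary tree, the down-sweep pushes prefixes back down.
    fn exclusive_scan(a: &mut [u32]) {
        let n = a.len();
        let mut d = 1;
        while d < n {
            let mut i = 0;
            while i < n {
                a[i + 2 * d - 1] += a[i + d - 1];
                i += 2 * d;
            }
            d *= 2;
        }
        a[n - 1] = 0;
        let mut d = n / 2;
        while d >= 1 {
            let mut i = 0;
            while i < n {
                let t = a[i + d - 1];
                a[i + d - 1] = a[i + 2 * d - 1];
                a[i + 2 * d - 1] += t;
                i += 2 * d;
            }
            d /= 2;
        }
    }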
Cell
Another important pioneering parallel computer was the Cell, which shipped as part of the PlayStation 3 in 2006. That device shipped in fairly good volume (about 87.4 million units) and had fascinating applications, including high performance computing, but it was a dead end; the PlayStation 4 switched to a fairly vanilla Radeon GPU.
Probably one of the biggest challenges with the Cell was the programming model. In the version shipped on the PS3, there were 8 parallel cores, each with 256 kB of static RAM and a 128-bit-wide SIMD vector unit. The programmer had to manually copy data into local SRAM, where a kernel would then do some computation. There was little or no support for high level programming; people wanting to target this platform had to painstakingly architect and implement parallel algorithms.
All that said, the Cell basically met my requirements for a “good parallel computer.” The individual cores could run effectively arbitrary programs, and there was a global job queue.
The Cell had approximately 200 GFLOPS of total throughput, which was impressive at the time but pales in comparison to modern GPUs or even a modern CPU (an Intel i9-13900K is approximately 850 GFLOPS, and a medium-high-end Ryzen 7 around 3379 GFLOPS).
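For what it’s worth, peak figures like these are just the product of a few machine parameters; here is a toy calculation with made-up numbers, not any specific chip’s real specs.

    // Illustrative only: how peak-FLOPS figures are typically derived.
    // peak = cores * clock (GHz) * SIMD lanes * FLOPs per lane per cycle.
    fn main() {
        let cores = 8.0;          // hypothetical 8-core CPU
        let clock_ghz = 5.0;      // hypothetical sustained clock
        let lanes = 16.0;         // e.g. 512-bit SIMD with f32 lanes
        let flops_per_lane = 4.0; // two FMA units, each counting 2 FLOPs
        let peak_gflops = cores * clock_ghz * lanes * flops_per_lane;
        println!("peak = {peak_gflops} GFLOPS"); // 2560 for these made-up numbers
    }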
Larrabee
Perhaps the most poignant road not taken in the history of GPU design is Larrabee. The 2008 SIGGRAPH paper makes a compelling case, but ultimately the project failed. It’s hard to say exactly why, but I think it’s possible it was simply poor execution on Intel’s part, and that with more persistence and a couple of iterations to improve the shortcomings of the original version, it might well have succeeded. At heart, Larrabee is a standard x86 computer with wide (512-bit) SIMD units and just a bit of special hardware to optimize graphics tasks. Most graphics functions are implemented in software. If it had succeeded, it would easily have fulfilled my wishes.
16 Comments
armchairhacker
> The GPU in your computer is about 10 to 100 times more powerful than the CPU, depending on workload. For real-time graphics rendering and machine learning, you are enjoying that power, and doing those workloads on a CPU is not viable. Why aren’t we exploiting that power for other workloads? What prevents a GPU from being a more general purpose computer?
What other workloads would benefit from a GPU?
Computers are so fast that in practice, many tasks don't need more performance. If a program that runs those tasks is slow, it's because that program's code is particularly bad, and the solution to make the code less bad is simpler than re-writing it for the GPU.
For example, GUIs have been able to respond to user input faster than humans can perceive for over 20 years. If an app's GUI feels sluggish, the problem is that the app's actions and rendering aren't on separate coroutines, or the action's coroutine is blocking (maybe it needs to be on a separate thread). But the rendering part of the GUI doesn't need to be on a GPU (any more than it is today, I admit I don't know much about rendering), because responsive GUIs exist today, some even written in scripting languages.
In some cases, parallelizing a task intrinsically makes it slower, because the sequential operations required for coordination mean there are more forced-sequential operations in total. In other cases, a program spawns 1000+ threads but they only run on 8-16 processors, so the program would be faster if it spawned fewer threads, because it would still use all the processors.
I do think GPU programming should be made much simpler, so this work is probably useful, but mainly to ease the implementation of tasks that already use the GPU: real-time graphics and machine learning.
IshKebab
Having worked for a company that made one of those "hundreds of small CPUs on a single chip" products, I can tell you now that they're all going to fail because the programming model is too weird, and nobody will write software for them.
Whatever comes next will be a GPU with extra capabilities, not a totally new architecture. Probably an nVidia GPU.
svmhdvn
I've always admired the work that the team behind https://www.greenarraychips.com/ does, and the GA144 chip seems like a great parallel computing innovation.
bee_rider
It is odd that he talks about Larrabee so much, but doesn’t mention the Xeon Phis. (Or is it Xeons Phi?).
> As a general trend, CPU designs are diverging into those optimizing single-core performance (performance cores) and those optimizing power efficiency (efficiency cores), with cores of both types commonly present on the same chip. As E-cores become more prevalent, algorithms designed to exploit parallelism at scale may start winning, incentivizing provision of even larger numbers of increasingly efficient cores, even if underpowered for single-threaded tasks.
I’ve always been slightly annoyed by the concept of E cores, because they are so close to what I want, but not quite there… I want, like, throughput cores. Let’s take E cores, give them their AVX-512 back, and give them higher throughput memory. Maybe try and pull the Phi trick of less OoO capabilities but more threads per core. Eventually the goal should be to come up with an AVX unit so big it kills iGPUs, haha.
andrewstuart
AMD Strix Halo APU is a CPU with very powerful integrated GPU.
It’s faster at AI than an Nvidia RTX 4090, because 96GB of the 128GB can be allocated to the GPU memory space. This means it doesn’t have the same swapping/memory thrashing that a discrete GPU experiences when processing large models.
16 CPU cores and 40 GPU compute units sounds pretty parallel to me.
Doesn’t that fit the bill?
grg0
The issue is that programming a discrete GPU feels like programming a printer over a COM port, just with higher bandwidths. It's an entirely moronic programming model to be using in 2025.
– You need to compile shader source/bytecode at runtime; you can't just "run" a program.
– On NUMA/discrete, the GPU cannot just manipulate the data structures the CPU already has; gotta copy the whole thing over. And you better design an algorithm that does not require immediate synchronization between the two.
– You need to synchronize data access between CPU-GPU and GPU workloads.
– You need to deal with bad and confusing APIs because there is no standardization of the underlying hardware.
– You need to deal with a combinatorial turd explosion of configurations. HW vendors want to protect their turd, so drivers and specs are behind fairly tight gates. OS vendors also want to protect their turd and refuse even the software API standard altogether. And then the tooling also sucks.
What I would like is a CPU with a highly parallel array of "worker cores" all addressing the same memory and speaking the same goddamn language that the CPU does. But maybe that is an inherently crappy architecture for reasons that are beyond my basic hardware knowledge.
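To illustrate what I mean, here is a toy sketch with plain threads (obviously nowhere near GPU throughput): every worker runs ordinary code against the same memory the rest of the program uses, with no upload, no shader compile step, and no separate allocator.

    use std::thread;

    fn main() {
        // Shared-memory "worker cores": each thread mutates its slice of the
        // same Vec the rest of the program sees, in the same language.
        let mut data: Vec<f32> = (0..1_048_576).map(|i| i as f32).collect();
        let workers = 16;
        let chunk = data.len() / workers;
        thread::scope(|s| {
            for part in data.chunks_mut(chunk) {
                s.spawn(move || {
                    for x in part.iter_mut() {
                        *x = x.sqrt();
                    }
                });
            }
        });
        println!("{}", data[12345]);
    }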
Retr0id
Something that frustrates me a little is that my system (apple silicon) has unified memory, which in theory should negate the need to shuffle data between CPU and GPU. But, iiuc, the GPU programming APIs at my disposal all require me to pretend the memory is not unified – which makes sense because they want to be portable across different hardware configurations. But it would make my life a lot easier if I could just target the hardware I have, and ignore compatibility concerns.
morphle
I haven't yet read the full blog post, but so far my response is that you can have this good parallel computer. See my previous HN comments over the past months on building an M4 Mac mini supercomputer.
For example, reverse engineering the Apple M3 Ultra GPU and Neural Engine instruction sets, the IOMMU, and the page tables that prevent you from programming all processor cores in the chip (146 cores to over ten thousand, depending on how you delineate what a core is), and making your own Abstract Syntax Tree-to-assembly compiler for these undocumented cores, will unleash at least 50 trillion operations per second. I still have to benchmark this chip and make the roofline graphs for the M4 to be sure; it might be more.
https://en.wikipedia.org/wiki/Roofline_model
Animats
Interesting article.
Other than as an exercise, it's not clear why someone would write a massively parallel 2D renderer that needs a GPU. Modern GPUs are overkill for 2D.
Now, 3D renderers, we need all the help we can get.
In this context, a "renderer" is something that takes in meshes, textures, materials, transforms, and objects, and generates images. It's not an entire game development engine, such as Unreal, Unity, or Bevy. Those have several more upper levels above the renderer. Game engines know what all the objects are and what they are doing. Renderers don't.
Vulkan, incidentally, is a level below the renderer. Vulkan is a cross-hardware API for asking a GPU to do all the things a GPU can do. WGPU for Rust is a wrapper that extends that concept to be cross-platform (Mac, Android, browsers, etc.)
While it seems like you should be able to write a general 3D renderer that works in a wide variety of situations, that does not work out well in practice. I wish Rust had one. I've tried Rend3 (abandoned), and looked at Renderling (in progress), Orbit (abandoned), and Three.rs (abandoned). They all scale up badly as scene complexity increases.
There's a friction point in design here. The renderer needs more info to work efficiently than it needs to just draw in a dumb way. Modern GPUs are good enough that a dumb renderer works pretty well, until the scene complexity hits some limit. Beyond that point, problems such as lighting requiring O(lights * objects) time start to dominate. The CPU driving the GPU maxes out while the GPU is at maybe 40% utilization. The operations that can easily be parallelized have been. Now it gets hard.
In Rust 3D land, everybody seems to write My First Renderer, hit this wall, and quit.
The big game engines (Unreal, etc.) handle this by using the scene graph info of the game to guide the rendering process.
This is visually effective, very complicated, prone to bugs, and takes a huge engine dev team to make work.
Nobody has a good solution to this yet. What does the renderer need to know from its caller? A first step I'm looking at is something where, for each light, the caller provides a lambda which can iterate through the objects in range of the light. That way, the renderer can get some info from the caller's spatial data structures. May or may not be a good idea. Too early to tell.
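In code, the shape I'm experimenting with looks roughly like this (hypothetical names, and returning a Vec instead of a true iterator callback, just to keep the sketch short):

    // The renderer asks the caller which objects are in range of each light,
    // via a closure over the caller's own spatial data structures.
    struct Light { pos: [f32; 3], radius: f32 }
    #[derive(Clone, Copy)]
    struct ObjectId(u32);

    fn shade_lights<F>(lights: &[Light], objects_in_range: F)
    where
        F: Fn(&Light) -> Vec<ObjectId>,
    {
        for light in lights {
            for obj in objects_in_range(light) {
                // Shade only this (light, object) pair, instead of looping
                // over every object for every light.
                let _ = (light.pos, light.radius, obj.0);
            }
        }
    }

    fn main() {
        let lights = vec![Light { pos: [0.0, 0.0, 0.0], radius: 5.0 }];
        // Caller side: a stand-in for walking a real BVH or grid.
        shade_lights(&lights, |_light| (0..3).map(ObjectId).collect());
    }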
dekhn
There are many intertwined issues here. One of the reasons we can't have a good parallel computer is that you need to get a large number of people to adopt your device for development purposes, and they need to have a large community of people who can run their code. Great projects die all the time because a slightly worse, but more ubiquitous technology prevents flowering of new approaches. There are economies of scale that feed back into ever-improving iterations of existing systems.
Simply porting existing successful codes from CPU to GPU can be a major undertaking, and if there aren't any experts who can write something that drives immediate sales, a project can die on the vine.
See, for example, https://en.wikipedia.org/wiki/Cray_MTA. When I was first asked to try this machine, it was pitched as "run a million threads; the system will context switch between threads when they block on memory and run them when the memory is ready". It never really made it on its own as a supercomputer, but lots of the ideas made it to GPUs.
AMD and others have explored the idea of moving the GPU closer to the CPU by placing it directly onto the same memory crossbar. Instead of the GPU connecting to the PCI express controller, it gets dropped into a socket just like a CPU.
I've found the best strategy is to target my development for what the high end consumers are buying in 2 years – this is similar to many games, which launch with terrible performance on the fastest commericially available card, then runs great 2 years later when the next gen of cards arrives ("Can it run crysis?")
deviantbit
"I believe there are two main things holding it back."
He really science’d the heck out of that one. I’m getting tired of seeing opinions dressed up as insight—especially when they’re this detached from how real systems actually work.
I worked on the Cell processor and I can tell you it was a nightmare. It demanded an unrealistic amount of micromanagement and gave developers rope to hang themselves with. There’s a reason it didn’t survive.
What amazes me more is the comment section—full of people waxing nostalgic for architectures they clearly never had to ship stable software on. They forget why we moved on. Modern systems are built with constraints like memory protection, isolation, and stability in mind. You can’t just “flatten address spaces” and ignore the consequences. That’s how you end up with security holes, random crashes, and broken multi-tasking. There's a whole generation of engineers that don't seem to realize why we architected things this way in the first place.
I will take how things are today over how things used to be in a heartbeat. I really believe I need to spend 2 weeks requiring students to write code on an Amiga, with all the programs running at the same time. If any one of them crashes, they all fail my course. A newfound appreciation may flourish.
amelius
Isn't the ONNX standard already going into the direction of programming a GPU using a computation graph? Could it be made more general?
casey2
I think Tim was right, it's 2025, Nvidia just released their 50 series, but I don't see any cards, let alone GPUs.
api
I implemented some evolutionary computation stuff on the Cell BE in college. It was a really interesting machine and could be very fast for its time but it was somewhat painful to program.
The main cores were PPC and the Cell cores were… a weird proprietary architecture. You had to write kernels for them like GPGPU, so in that sense it was similar. You couldn’t use them seamlessly or have mixed work loads easily.
Larrabee and Xeon Phi are closer to what I’d want.
I’ve always wondered about many, many-core CPUs too. How many tiny ARM32 cores could you put on a big modern 5nm die? Give each one local RAM and connect them with an on-die network fabric. That’d be an interesting machine for certain kinds of workloads. It’d be like a 1990s or 2000s era supercomputer on a chip, but with much faster clocks, RAM, and network.
sitkack
This essay needs more work.
Are you arguing for a better software abstraction, a different hardware abstraction or both? Lots of esoteric machines are name dropped, but it isn't clear how that helps your argument.
Why not link to Vello? https://github.com/linebender/vello
I think a stronger essay would at the end give the reader a clear view of what Good means and how to decide if a machine is closer to Good than another machine and why.
SIMD machines can be turned into MIMD machines. Even hardware problems still need a software solution. The hardware is there to offer the right affordances for the kinds of software you want to write.
Lots of words that are in the eye of beholder. We need a checklist or that Good parallel computer won't be built.