I want a good parallel computer by raphlinus

16 Comments

  • armchairhacker
    Posted March 21, 2025 at 9:38 pm

    > The GPU in your computer is about 10 to 100 times more powerful than the CPU, depending on workload. For real-time graphics rendering and machine learning, you are enjoying that power, and doing those workloads on a CPU is not viable. Why aren’t we exploiting that power for other workloads? What prevents a GPU from being a more general purpose computer?

    What other workloads would benefit from a GPU?

    Computers are so fast that, in practice, many tasks don't need more performance. If a program that runs those tasks is slow, it's usually because that program's code is particularly bad, and fixing the bad code is simpler than rewriting it for the GPU.

    For example, GUIs have responded to user input faster than humans can perceive for over 20 years. If an app's GUI feels sluggish, the problem is usually that the app's actions and rendering aren't on separate coroutines, or that the action's coroutine blocks (maybe it needs to be on a separate thread). But the rendering part of the GUI doesn't need to be on a GPU (any more than it already is today, I admit I don't know much about rendering), because responsive GUIs exist today, some even written in scripting languages.

    In some cases, parallelizing a task intrinsically makes it slower, because the coordination overhead means there are more forced-sequential operations in total. In other cases, a program spawns 1000+ threads but they only run on 8-16 processors, so the program would be faster if it spawned fewer threads; it would still saturate all the processors.
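    As a concrete illustration of the second case, a minimal Rust sketch (numbers made up) that sizes the worker count to the hardware instead of spawning a thread per task:

        use std::thread;

        fn main() {
            let items: Vec<u64> = (0..1_000_000).collect();

            // Typically 8-16 on desktops; falls back to 1 if detection fails.
            let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
            let chunk = (items.len() + workers - 1) / workers;

            // One scoped thread per hardware thread, each summing its own chunk.
            let total: u64 = thread::scope(|s| {
                let handles: Vec<_> = items
                    .chunks(chunk)
                    .map(|c| s.spawn(move || c.iter().sum::<u64>()))
                    .collect();
                handles.into_iter().map(|h| h.join().unwrap()).sum()
            });

            println!("sum = {total}");
        }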

    I do think GPU programming should be made much simpler, so this work is probably useful, but mainly to ease the implementation of tasks that already use the GPU: real-time graphics and machine learning.

  • IshKebab
    Posted March 21, 2025 at 9:42 pm

    Having worked for a company that made one of those "hundreds of small CPUs on a single chip" products, I can tell you now that they're all going to fail, because the programming model is too weird and nobody will write software for them.

    Whatever comes next will be a GPU with extra capabilities, not a totally new architecture. Probably an Nvidia GPU.

  • svmhdvn
    Posted March 21, 2025 at 9:47 pm

    I've always admired the work that the team behind https://www.greenarraychips.com/ does, and the GA144 chip seems like a great parallel computing innovation.

  • bee_rider
    Posted March 21, 2025 at 9:55 pm

    It is odd that he talks about Larrabee so much but doesn’t mention the Xeon Phis. (Or is it Xeons Phi?)

    > As a general trend, CPU designs are diverging into those optimizing single-core performance (performance cores) and those optimizing power efficiency (efficiency cores), with cores of both types commonly present on the same chip. As E-cores become more prevalent, algorithms designed to exploit parallelism at scale may start winning, incentivizing provision of even larger numbers of increasingly efficient cores, even if underpowered for single-threaded tasks.

    I’ve always been slightly annoyed by the concept of E-cores, because they are so close to what I want, but not quite there… I want, like, throughput cores. Let’s take E-cores, give them their AVX-512 back, and give them higher-throughput memory. Maybe try to pull the Phi trick of less out-of-order machinery but more threads per core. Eventually the goal should be to come up with an AVX unit so big it kills iGPUs, haha.
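    To be fair, you can already write the kind of kernel such a core would chew through; a hedged sketch of a plain loop that LLVM is free to auto-vectorize into 512-bit instructions when built with, e.g., RUSTFLAGS="-C target-feature=+avx512f" (assuming an AVX-512-capable part):

        // A dependency-free axpy kernel; the equal-length assert and simple
        // indexing make it easy for the compiler's vectorizer to prove safety.
        fn axpy(y: &mut [f32], a: f32, x: &[f32]) {
            assert_eq!(x.len(), y.len());
            for i in 0..y.len() {
                y[i] = a * x[i] + y[i];
            }
        }

        fn main() {
            let x = vec![1.0_f32; 1 << 20];
            let mut y = vec![2.0_f32; 1 << 20];
            axpy(&mut y, 3.0, &x);
            println!("y[0] = {}", y[0]); // 5.0
        }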

  • andrewstuart
    Posted March 21, 2025 at 10:17 pm

    The AMD Strix Halo APU is a CPU with a very powerful integrated GPU.

    It’s faster at running large AI models than an Nvidia RTX 4090, because 96GB of its 128GB of memory can be allocated to the GPU memory space. This means it doesn’t hit the same swapping/memory thrashing that a discrete GPU experiences when processing large models.

    16 CPU cores and 40 GPU compute units sounds pretty parallel to me.

    Doesn’t that fit the bill?

  • grg0
    Posted March 21, 2025 at 10:47 pm

    The issue is that programming a discrete GPU feels like programming a printer over a COM port, just with higher bandwidths. It's an entirely moronic programming model to be using in 2025.

    – You need to compile shader source/bytecode at runtime; you can't just "run" a program.

    – On NUMA/discrete, the GPU cannot just manipulate the data structures the CPU already has; gotta copy the whole thing over. And you better design an algorithm that does not require immediate synchronization between the two.

    – You need to synchronize data access between CPU-GPU and GPU workloads.

    – You need to deal with bad and confusing APIs because there is no standardization of the underlying hardware.

    – You need to deal with a combinatorial turd explosion of configurations. HW vendors want to protect their turd, so drivers and specs are behind fairly tight gates. OS vendors also want to protect their turd, and some refuse the common software API standard altogether. And then the tooling also sucks.

    What I would like is a CPU with a highly parallel array of "worker cores" all addressing the same memory and speaking the same goddamn language that the CPU does. But maybe that is an inherently crappy architecture for reasons that are beyond my basic hardware knowledge.
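    In software terms, that model is roughly what scoped threads already give you on the CPU side, minus the core count: every worker sees the same address space and mutates the CPU's data structures in place, with no upload/download step. A minimal Rust sketch, with made-up sizes:

        use std::thread;

        fn main() {
            let mut pixels = vec![0u8; 4096 * 4096];
            let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(4);
            let chunk = (pixels.len() + workers - 1) / workers;

            thread::scope(|s| {
                // Each worker gets an exclusive slice of the one shared buffer.
                for (i, band) in pixels.chunks_mut(chunk).enumerate() {
                    s.spawn(move || {
                        for p in band.iter_mut() {
                            *p = (i % 256) as u8; // stand-in for real per-pixel work
                        }
                    });
                }
            });

            println!("done, pixels[0] = {}", pixels[0]);
        }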

  • helf
    Posted March 21, 2025 at 11:02 pm

    [dead]

  • Retr0id
    Posted March 21, 2025 at 11:08 pm

    Something that frustrates me a little is that my system (apple silicon) has unified memory, which in theory should negate the need to shuffle data between CPU and GPU. But, iiuc, the GPU programming APIs at my disposal all require me to pretend the memory is not unified – which makes sense because they want to be portable across different hardware configurations. But it would make my life a lot easier if I could just target the hardware I have, and ignore compatibility concerns.

  • morphle
    Posted March 21, 2025 at 11:14 pm

    I haven't yet read the full blog post, but so far my response is: you can have this good parallel computer. See my previous HN comments from the past months on building an M4 Mac mini supercomputer.

    For example, reverse engineering the Apple M3 Ultra GPU and Neural Engine instruction sets, the IOMMU, and the page tables that prevent you from programming all the processor cores in the chip (146 cores to over ten thousand, depending on how you delineate what a core is), and then writing your own Abstract Syntax Tree to assembly compiler for these undocumented cores, will unleash at least 50 trillion operations per second. I still have to benchmark this chip and make the roofline graphs for the M4 to be sure; it might be more.

    https://en.wikipedia.org/wiki/Roofline_model
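    For reference, the roofline bound itself is trivial to compute once you have the peak numbers; a Rust sketch with made-up figures (the real M-series numbers would have to come from the benchmarks mentioned above):

        // Attainable performance = min(compute roof, memory roof).
        fn roofline_gflops(peak_gflops: f64, bandwidth_gbs: f64, flops_per_byte: f64) -> f64 {
            peak_gflops.min(bandwidth_gbs * flops_per_byte)
        }

        fn main() {
            // Hypothetical chip: 50 TFLOP/s peak compute, 800 GB/s memory bandwidth.
            let (peak, bw) = (50_000.0, 800.0);
            for ai in [0.25, 1.0, 8.0, 64.0, 256.0] {
                println!("AI {ai:>6} flop/byte -> {:>8.0} GFLOP/s", roofline_gflops(peak, bw, ai));
            }
        }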

  • Animats
    Posted March 21, 2025 at 11:22 pm

    Interesting article.

    Other than as an exercise, it's not clear why someone would write a massively parallel 2D renderer that needs a GPU. Modern GPUs are overkill for 2D.
    Now, 3D renderers, we need all the help we can get.

    In this context, a "renderer" is something that takes in meshes, textures, materials, transforms, and objects, and generates images. It's not an entire game development engine, such as Unreal, Unity, or Bevy. Those have several more upper levels above the renderer. Game engines know what all the objects are and what they are doing. Renderers don't.

    Vulkan, incidentally, is a level below the renderer. Vulkan is a cross-hardware API for asking a GPU to do all the things a GPU can do. WGPU for Rust is a wrapper that extends that concept across platforms (Mac, Android, browsers, etc.).

    While it seems like you should be able to write a general 3D renderer that works in a wide variety of situations, that does not work well in practice. I wish Rust had one. I've tried Rend3 (abandoned), and looked at Renderling (in progress), Orbit (abandoned), and Three.rs (abandoned). They all scale up badly as scene complexity increases.

    There's a friction point in design here. The renderer needs more info to work efficiently than it needs to just draw in a dumb way. Modern GPUs are good enough that a dumb renderer works pretty well, until the scene complexity hits some limit. Beyond that point, problems such as lighting requiring O(lights * objects) time start to dominate. The CPU driving the GPU maxes out while the GPU is at maybe 40% utilization. The operations that can easily be parallelized have been. Now it gets hard.

    In Rust 3D land, everybody seems to write My First Renderer, hit this wall, and quit.

    The big game engines (Unreal, etc.) handle this by using the scene graph info of the game to guide the rendering process.
    This is visually effective, very complicated, prone to bugs, and takes a huge engine dev team to make work.

    Nobody has a good solution to this yet. What does the renderer need to know from its caller? A first step I'm looking at is something where, for each light, the caller provides a lambda which can iterate through the objects in range of the light. That way, the renderer can get some info from the caller's spatial data structures. May or may not be a good idea. Too early to tell.

    [1] https://github.com/linebender/vello/
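    A hedged sketch of what that per-light lambda could look like in Rust; all the names here are hypothetical, not an actual renderer API:

        struct Light { pos: [f32; 3], radius: f32 }
        struct ObjectId(u32);

        // The renderer never sees the caller's spatial index; it only asks the
        // caller to visit the objects within a light's range.
        fn shade_lights(lights: &[Light], for_each_in_range: &dyn Fn(&Light, &mut dyn FnMut(ObjectId))) {
            for light in lights {
                for_each_in_range(light, &mut |obj| {
                    // Only the objects the caller says are in range, instead of
                    // the brute-force O(lights * objects) loop.
                    let _ = (light.pos, light.radius, obj.0);
                });
            }
        }

        fn main() {
            let lights = vec![Light { pos: [0.0; 3], radius: 10.0 }];
            // Toy "spatial structure": pretend everything is in range.
            let scene = [ObjectId(0), ObjectId(1), ObjectId(2)];
            shade_lights(&lights, &|_light, visit| {
                for obj in &scene {
                    visit(ObjectId(obj.0));
                }
            });
        }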

  • dekhn
    Posted March 21, 2025 at 11:35 pm

    There are many intertwined issues here. One of the reasons we can't have a good parallel computer is that you need to get a large number of people to adopt your device for development purposes, and they need to have a large community of people who can run their code. Great projects die all the time because a slightly worse, but more ubiquitous technology prevents flowering of new approaches. There are economies of scale that feed back into ever-improving iterations of existing systems.

    Simply porting existing successful codes from CPU to GPU can be a major undertaking, and if there aren't any experts who can write something that drives immediate sales, a project can die on the vine.

    See, for example, https://en.wikipedia.org/wiki/Cray_MTA. When I was first asked to try this machine, it was pitched as "run a million threads; the system will context-switch between threads when they block on memory and run them when the memory is ready". It never really made it on its own as a supercomputer, but lots of the ideas made it into GPUs.

    AMD and others have explored the idea of moving the GPU closer to the CPU by placing it directly onto the same memory crossbar. Instead of the GPU connecting to the PCI express controller, it gets dropped into a socket just like a CPU.

    I've found the best strategy is to target my development at what high-end consumers will be buying in 2 years – this is similar to many games, which launch with terrible performance on the fastest commercially available card, then run great 2 years later when the next generation of cards arrives ("Can it run Crysis?").

  • deviantbit
    Posted March 21, 2025 at 11:47 pm

    "I believe there are two main things holding it back."

    He really science’d the heck out of that one. I’m getting tired of seeing opinions dressed up as insight—especially when they’re this detached from how real systems actually work.

    I worked on the Cell processor and I can tell you it was a nightmare. It demanded an unrealistic amount of micromanagement and gave developers rope to hang themselves with. There’s a reason it didn’t survive.

    What amazes me more is the comment section—full of people waxing nostalgic for architectures they clearly never had to ship stable software on. They forget why we moved on. Modern systems are built with constraints like memory protection, isolation, and stability in mind. You can’t just “flatten address spaces” and ignore the consequences. That’s how you end up with security holes, random crashes, and broken multi-tasking. There's a whole generation of engineers that don't seem to realize why we architected things this way in the first place.

    I will take how things are today over how things used to be in a heartbeat. I really believe I need to spend two weeks requiring students to write code on an Amiga, with all of their programs running at the same time. If any one of them crashes, they all fail my course. A newfound appreciation may flourish.

  • amelius
    Posted March 21, 2025 at 11:52 pm

    Isn't the ONNX standard already going into the direction of programming a GPU using a computation graph? Could it be made more general?
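    For what it's worth, the core idea is small; a toy Rust sketch of a computation graph plus one CPU backend. (ONNX is roughly this plus a standardized operator set and file format; whether it generalizes beyond ML workloads is exactly the open question.)

        #[derive(Clone)]
        enum Node {
            Input(usize),              // index into the inputs slice
            Add(Box<Node>, Box<Node>),
            Mul(Box<Node>, Box<Node>),
        }

        // One possible backend: a plain recursive CPU interpreter. A GPU backend
        // would instead compile the same graph into kernels.
        fn eval(node: &Node, inputs: &[f32]) -> f32 {
            match node {
                Node::Input(i) => inputs[*i],
                Node::Add(a, b) => eval(a, inputs) + eval(b, inputs),
                Node::Mul(a, b) => eval(a, inputs) * eval(b, inputs),
            }
        }

        fn main() {
            // graph: (x * y) + x
            let x = Node::Input(0);
            let y = Node::Input(1);
            let graph = Node::Add(
                Box::new(Node::Mul(Box::new(x.clone()), Box::new(y))),
                Box::new(x),
            );
            println!("{}", eval(&graph, &[2.0, 3.0])); // 8
        }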

  • casey2
    Posted March 21, 2025 at 11:54 pm

    I think Tim was right, it's 2025, Nvidia just released their 50 series, but I don't see any cards, let alone GPUs.

  • api
    Posted March 21, 2025 at 11:54 pm

    I implemented some evolutionary computation stuff on the Cell BE in college. It was a really interesting machine and could be very fast for its time but it was somewhat painful to program.

    The main cores were PPC and the Cell's SPE cores were… a weird proprietary architecture. You had to write kernels for them like GPGPU, so in that sense it was similar. You couldn't use them seamlessly or have mixed workloads easily.

    Larrabee and Xeon Phi are closer to what I’d want.

    I’ve always wondered about many-many-core CPUs too. How many tiny ARM32 cores could you put on a big modern 5nm die? Give each one local RAM and connect them with an on-die network fabric. That’d be an interesting machine for certain kinds of workloads. It’d be like a 1990s- or 2000s-era supercomputer on a chip, but with a much faster clock, RAM, and network.
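    As a toy model of that layout (threads standing in for tiny cores, channels standing in for the on-die fabric, sizes made up), in Rust:

        use std::sync::mpsc;
        use std::thread;

        fn main() {
            const CORES: usize = 16; // imagine hundreds or thousands on a real die

            // One channel per hop of a ring network.
            let (senders, receivers): (Vec<_>, Vec<_>) =
                (0..CORES).map(|_| mpsc::channel::<u64>()).unzip();

            thread::scope(|s| {
                for (i, rx) in receivers.into_iter().enumerate() {
                    let tx = senders[(i + 1) % CORES].clone();
                    s.spawn(move || {
                        let local = i as u64; // "local RAM": state private to this core
                        if i == 0 {
                            tx.send(local).unwrap(); // core 0 starts the token
                        }
                        let token = rx.recv().unwrap();
                        if i != 0 {
                            tx.send(token + local).unwrap(); // accumulate and forward
                        } else {
                            println!("ring sum = {token}"); // token came back to core 0
                        }
                    });
                }
            });
        }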

  • sitkack
    Posted March 22, 2025 at 12:25 am

    This essay needs more work.

    Are you arguing for a better software abstraction, a different hardware abstraction or both? Lots of esoteric machines are name dropped, but it isn't clear how that helps your argument.

    Why not link to Vello? https://github.com/linebender/vello

    I think a stronger essay would, at the end, give the reader a clear view of what Good means, how to decide whether one machine is closer to Good than another, and why.

    SIMD machines can be turned into MIMD machines. Even hardware problems still need a software solution. The hardware is there to offer the right affordances for the kinds of software you want to write.
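    A minimal sketch of that first claim, in Rust: every lane runs its own tiny program, but only one opcode executes per step across all lanes, with non-matching lanes masked off, which is roughly how SIMT GPUs handle divergent branches.

        #[derive(Clone, Copy, PartialEq)]
        enum Op { Add, Mul, Halt }

        fn main() {
            // Four "lanes", each with its own program (opcode + per-lane operand).
            let programs: [Vec<(Op, i32)>; 4] = [
                vec![(Op::Add, 1), (Op::Add, 1), (Op::Halt, 0)],
                vec![(Op::Mul, 3), (Op::Halt, 0)],
                vec![(Op::Add, 5), (Op::Mul, 2), (Op::Halt, 0)],
                vec![(Op::Halt, 0)],
            ];
            let mut pc = [0usize; 4];
            let mut acc = [1i32; 4];

            loop {
                // Broadcast one opcode per "cycle": whatever the first unfinished
                // lane wants next.
                let Some(current) = (0..4)
                    .map(|l| programs[l][pc[l]].0)
                    .find(|op| *op != Op::Halt)
                else {
                    break; // every lane has halted
                };
                // Lanes whose next opcode matches run it with their own operand;
                // the rest are masked off and wait (the cost of divergence).
                for l in 0..4 {
                    let (op, k) = programs[l][pc[l]];
                    if op == current {
                        match op {
                            Op::Add => acc[l] += k,
                            Op::Mul => acc[l] *= k,
                            Op::Halt => unreachable!(),
                        }
                        pc[l] += 1;
                    }
                }
            }
            println!("lane results: {:?}", acc); // [3, 3, 12, 1]
        }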

    Lots of the words here are in the eye of the beholder. We need a checklist, or that Good parallel computer won't be built.
