A quick introduction
As smartphones surge in adoption, the videogame market is experiencing unusual growth, driven by low-cost app stores and affordable development licences. With this, one can only wonder how long it will be until kids prefer an iPhone 4 over a Nintendo DSi.
While the answer unfolds, Nintendo conceives a thrilling successor to its triumphant portable system. In it, users will find old, current and unfamiliar technology – much of which can’t be replicated by smartphones.
And so, this new entry in the Architecture of Consoles series will give you an in-depth description of how this console works, both internally and externally.
Recommended reading
If you are new to this article series, I strongly suggest reading the GameCube, Game Boy Advance and Nintendo DS articles beforehand, as they will explain various terms and concepts referenced in this one.
Models and variants
Throughout the lifecycle (and struggles) of this console, Nintendo released numerous revisions in an attempt to re-align its target audience and win back loyal customers.

From the architectural point of view, there were a total of six different models:
- Nintendo 3DS (2011) and Nintendo 3DS XL (2012): The debuting series featuring the original architecture. The only relevant difference between the XL and non-XL models is the screen size.
- Nintendo 2DS (2013): A cheaper alternative to the original Nintendo 3DS by removing the stereoscopic screen and featuring a Game Boy-inspired shape.
- New Nintendo 3DS (2014) and New Nintendo 3DS XL (2015): A re-engineering of the standard 3DS models. The ‘New’ variants exhibit an incremental hardware upgrade, an NFC reader, a larger button set and an improved stereoscopic system.
- New Nintendo 2DS XL (2017): The ‘New’ affordable alternative to the New Nintendo 3DS XL by omitting stereoscopic functionality.
Now, for this article, the focus will be on the original Nintendo 3DS (after all, it’s the lowest common denominator for games). However, since the architectural differences of the ‘New’ series are worth studying, these will receive a dedicated section.
Displays
There’s only one company that keeps altering the standard structure of all my analyses, and that’s Nintendo. This time, I’ll start with the stereoscopic screens (a.k.a. ‘3D without glasses’).
First things first: the Nintendo 3DS, as a successor of the Nintendo DS, includes two LCD screens. The upper screen has a resolution of 800 x 240 pixels and can somehow display images with a sense of depth. When I first read about this, many questions popped into my head:
- What optics principles are they applying?
- How is the screen designed?
- How do games comply with this system?
Well, here are the answers!
Principles
Like it or not, the fundamentals are not so different from the Virtual Boy, which I happened to analyse two years before. To recall, the Virtual Boy displays two images, one to each eye, with objects in each image individually shifted from the centre. When both pictures are viewed at the same time, the brain perceives some objects as being behind others (a sense of depth). This is the basis of Stereoscopic Parallax.


Mario’s Tennis (1995).
Now, the way the Virtual Boy executed this was a bit cumbersome: it required users to place their heads close to the eyepiece and then adjust the focal length and inter-pupil distance. 15 years later, Nintendo rightly said ‘No’ to all of that nuisance, and designed a new system where users could enjoy 3D-looking scenery without considerable intervention.


Luigi’s Mansion (2018).
This brings us to our next question.
The special screen
Take another look at the resolution of the upper LCD screen. On paper, it’s 800 x 240 pixels, which results in a ludicrous aspect ratio (10:3).

In reality, the physical screen is made of half-width pixels and operates in two modes:
- Traditional/2D mode: When the stereoscopic function is disabled, pairs of horizontal pixels are treated as a single pixel.
- Truth be told, the screen can still display a frame of 800 x 240 px, although no commercial game ever used this.
- Stereoscopic/3D mode: All pixels are treated individually, and with it, the screen displays two frames of 400 x 240 pixels at the same time.
Moreover, to perform stereoscopic parallax, this particular LCD houses an extra layer called a Parallax Barrier [2]. These opaque shutters redirect the backlight beamed behind the pixels of the LCD, so each eye receives the light of a different subset of pixels [3]. The half-width pixels will also appear wider, thereby giving the impression they have the traditional aspect ratio.
All in all, this recreates the original effect of the Virtual Boy without requiring controls for adjustment.
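To picture how the two modes relate, here’s a small C sketch (my own illustration, not how the hardware is actually driven) that interleaves two 400 x 240 eye frames into one 800 x 240 physical frame, assuming the panel alternates left-eye and right-eye columns behind the barrier:

```c
#include <stdint.h>

#define EYE_W   400
#define EYE_H   240
#define PANEL_W (EYE_W * 2)   /* 800 half-width columns */

/* Interleave the left/right eye frames column by column. Whether even
   columns end up visible to the left or the right eye is an assumption
   made purely for illustration. */
void interleave_stereo(const uint32_t left[EYE_H][EYE_W],
                       const uint32_t right[EYE_H][EYE_W],
                       uint32_t panel[EYE_H][PANEL_W])
{
    for (int y = 0; y < EYE_H; y++) {
        for (int x = 0; x < EYE_W; x++) {
            panel[y][2 * x]     = left[y][x];   /* even column: left eye  */
            panel[y][2 * x + 1] = right[y][x];  /* odd column: right eye  */
        }
    }
}
```

In 2D mode, both columns of each pair would simply carry the same value, which is why the screen behaves like a regular 400 x 240 display.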
The technology is not perfect, however, as there are a few caveats:
- The parallax barrier requires extra brightness, thereby impacting the battery life.
- The user must not hold the screen at a tilted angle (relative to their eyes). Otherwise, they will end up seeing a confusing mix of the two parallax frames, which can be a disorienting experience. Not to mention the eyes won’t enjoy the extra fatigue.
- Combine the fact that the user must maintain a fixed posture while playing with the extra eye fatigue caused by stereoscopic parallax, and the 3D feature, as a whole, can become an unnecessary hassle for most.

To remedy things, Nintendo added a slider control (called the 3D depth slider) to adjust the level of depth between objects. In doing so, it either increases or decreases the difference between the two frames. This was done to reduce the depth effect for people who didn’t find it enjoyable or found it too fatiguing.
Setting the 3D slider to the max can be disorienting at first. In my experience, my eyes eventually got focused, at which point I perceived the top LCD screen as if I were looking through a window. The main problem is that users will need to continuously shift their eyes to see the bottom screen, and the repeated action can be very straining.
As a side note, one can’t help but find it amusing how the graphics pipeline has gone full circle when rendering stereoscopic frames. During rendering, 3D data is projected into a 2D space, and now with the stereoscopic screen, that 2D space is displayed again as 3D. At this point, let’s just use holograms and skip the 3D projection stage altogether.
A small update

With the advent of the ‘New 3DS’ model, Nintendo revised their stereoscopic system in an effort to make it more enjoyable. In the new model, the console incorporates a face-tracking mechanism to tackle the tilting issue, so users don’t need to worry about keeping a strict head-console posture anymore.
The special games
Now, for this system to work, games must play along (pun intended). Just as they traditionally interact with the GPU to draw frames on the display, they must now output two frames of the scenery, with objects slightly shifted between them.
To make life easier for developers, there are official APIs that assist with this, especially for games with 3D sceneries. These APIs provide routines that construct two projection matrices, which the graphics pipeline then uses to render the two slightly-shifted frames. A simplified sketch of the idea follows below.
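To make the idea concrete, here’s a minimal, self-contained C sketch. It is not the official API: it builds one ordinary perspective projection shared by both eyes plus a per-eye sideways translation (scaled by the 3D slider), whereas real titles use proper off-axis projection matrices so that objects converge at screen depth.

```c
#include <math.h>
#include <string.h>

/* Column-major 4x4 symmetric perspective projection (shared by both eyes). */
static void perspective(float m[16], float fovy_rad, float aspect,
                        float znear, float zfar)
{
    float f = 1.0f / tanf(fovy_rad * 0.5f);
    memset(m, 0, 16 * sizeof(float));
    m[0]  = f / aspect;
    m[5]  = f;
    m[10] = (zfar + znear) / (znear - zfar);
    m[11] = -1.0f;
    m[14] = (2.0f * zfar * znear) / (znear - zfar);
}

/* Per-eye view-space translation: eye = -1 for the left frame, +1 for the
   right one; 'slider' is the 3D depth slider (0 disables the shift) and
   'iod' is the inter-ocular distance in world units. */
static void eye_translation(float m[16], int eye, float iod, float slider)
{
    memset(m, 0, 16 * sizeof(float));
    m[0] = m[5] = m[10] = m[15] = 1.0f;   /* identity               */
    m[12] = -eye * 0.5f * iod * slider;   /* shift along the x axis */
}
```

Each eye’s final transform would then be projection × eye-translation × view, so the same scene is drawn twice with a small horizontal offset between the two frames.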
CPU
Now that we know how the display works, let’s look at the internals of this console. If you get a hold of the motherboard, you’ll see three big chips, one being the CPU CTR. That’s the big System-On-Chip (SoC) that houses the entire system (aside from storage and RAM).

CPU CTR follows the design methodology of Nintendo’s previous portable consoles. That is, squash all your engineering into a single block. In doing so, it reduces the production of counterfeits, protects sensitive components and improves heat dissipation.
In terms of the actual CPU, Nintendo partnered again with their old friend, ARM, to produce their next-generation core. ARM’s licensing model happens to be favourable to Nintendo, as they have always offered synthesisable designs, which Nintendo can mould to their needs (including fitting them into a big SoC). In the end, ARM gave them a relatively antiquated design with substantial upgrades. Their choice was the ARM11 core, a successor of the ARM9 (featured in the Nintendo DS). More specifically, the MPCore variant, ARM’s first homogeneous multi-core solution.
Using ARM’s designs, Nintendo crafted an ARM11 MPCore cluster housing two ARM11 cores [4]. Three years later, with the arrival of the ‘New’ 3DS, the SoC was expanded to contain four ARM11 cores. The effects of this will be explained in due time so, before anything else, let’s analyse what the new CPU cores offered to this console.
An iconic industry
The ARM11 series originates from 2002, as a successor of the popular ARM9 and the short-lived ARM10.

In case you haven’t heard of them before, ARM11s are best known for powering the 2006-2008 generation of smartphones (back when many of them featured a keypad or a clamshell design). If you owned a Nokia N95, a Nokia 5230 or the first iPhone, you’ve used an ARM11. The same applied to many high-end cameras, GPS units and similar gadgets. If you’re wondering, other manufacturers like RIM and Samsung held onto Intel XScale (the continuation of StrongARM, implementing the ARMv5TE instruction set) until 2009, when they made the switch to ARM11 (which is a bit ironic, considering the iPhone’s CPU was supplied by Samsung!). Last but not least, the ARM11 was the CPU of choice for the first model of the Raspberry Pi.
Now, by the time Nintendo adopted the ARM11, its creator had already succeeded it with the Cortex-A series. This is only to be expected, as Nintendo’s model favours cost-effectiveness over avant-garde CPUs. Look at it another way: saving on CPU costs allows them to focus their budget on other aspects of the console, as you’ll soon see.
New dialects…
Along with the shiny new CPUs, a new instruction set arrived: the ARMv6.
From a programmer’s perspective, the ARMv6 ISA innovates with a new set of vector instructions and multi-core support [5]. The new vector set provides SIMD instructions that operate on groups of four 8-bit values or two 16-bit values at the same time (using the existing 32-bit registers) [6]. The new multi-core instructions consist of Store and Load opcodes with special care for synchronisation (crucial for an environment of multiple CPUs using the same memory locations) [7].
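To picture what the new SIMD opcodes buy you, here’s a plain C model of a packed 8-bit addition in the spirit of ARMv6’s UADD8: one 32-bit register is treated as four independent bytes, and the four additions are carried out without carries leaking between lanes. The loop below only models the behaviour; the real instruction does the same work in a single operation.

```c
#include <stdint.h>

/* Add four 8-bit lanes packed into two 32-bit words, lane by lane, with
   each lane wrapping around independently of its neighbours. */
uint32_t packed_add_u8(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint8_t x = (a >> (lane * 8)) & 0xFF;
        uint8_t y = (b >> (lane * 8)) & 0xFF;
        result |= (uint32_t)((uint8_t)(x + y)) << (lane * 8);
    }
    return result;
}
```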
All in all, this may not seem that thriving for a new chip series, but remember that ARM’s CPUs speak many ‘languages’. In the case of an ARM11-based core, you are provided with:
- The main 32-bit ISA, called ARMv6.
- A compressed alternative called Thumb. Its instructions fit in 16-bit words instead. If you’d like to know more, I go over it in the Game Boy Advance article, as it carries significant importance in that console.
- Jazelle, a Java bytecode interpreter, mostly forgotten and left unused. I’ve mentioned a bit of it in the Wii article.
- Any extension bundled into the core. For instance, the MPCore includes a Vector Floating-point Coprocessor with additional instructions to control said coprocessor [8].
To make matters less confusing, ARM tends to package all of these with a single nomenclature. For instance, in the case of the ARM11 MPCore opcodes, ARM refers to them as the ARMv6k instruction set.
… and a fragmented distribution
The adoption of extensions and alternative instruction sets eventually made things very convoluted for developers targeting generic ARM hardware; you only have to look at the countless ARM ports devised for Linux distributions.
Debian, one of the most popular distributions, tried to tackle the disparities by developing two ports in parallel:
- armel: unoptimised, compatible with ARMv4T onwards.
- armhf: accelerated with VFP, but only compatible with ARMv7 onwards.
Yet, with the arrival of the Raspberry Pi (powered by ARMv6 and accelerated with VFP), neither of them was deemed acceptable. Thus, an unofficial port called ‘Raspbian’ was developed to provide a VFP-accelerated version for ARMv6 CPUs [9]. Even so, the trend continued: years later, with the arrival of ARMv8 and AArch64, Debian spawned yet another port, arm64, optimised for the new 64-bit ISA.
I don’t remember seeing this labyrinth with x86, but at least things are now getting more orderly. AArch64 has unified many extensions and dropped alternative modes (farewell, Thumb and Jazelle).
Core functionality
That was a big deviation. Let’s go back to the 3DS CPU, the ARM11, and check what’s inside.
For this study, we can divide the ARM11 MPCore into two areas:
- The MP11 cores that make up the cluster.
- The Advanced eXtensible Interface (AXI) bus, a new invention that interconnects the cores and interfaces with the outside world.
Let’s start with the cores now and then we’ll check the AXI bus.
The original MPCore

The first ARM11 MPCore variant, which debuted with the original 3DS, includes two cores. Each is called MP11 and runs at 268 MHz [10].
Apart from implementing the ARMv6k instruction set, the CPU features an 8-stage pipeline [11]. Furthermore, the core provides two levels of branch prediction, ‘dynamic’ (based on previous executions) and ‘static’ (based on the current instruction alone). Overall, both enhancements will be quickly noticed, considering the 5-stage ARM9 couldn’t predict a thing!
Additionally, since the ARM946E-S CPU, ARM has been fitting a System Control Coprocessor called CP15. This time, it provides Memory-Management (MMU functions) and registers that output information about the MPCore cluster.
Now, there’s no more Tightly-Coupled Memory (TCM). There are, however, 16 KB of instruction cache and 16 KB of data cache; this change of model resembles other systems of the same generation. If you are curious, this L1 cache is 4-way set associative.
Finally, each core houses a co-processor called the Vector Floating-point Coprocessor (also known as ‘VFP11’). This accelerates arithmetic operations with floating-point numbers, both 32-bit single-precision (a.k.a. float) and 64-bit double-precision (a.k.a. double) ones [12]. It’s not a big coprocessor, though, as its register file is composed of 32 32-bit registers, so doubles consume two registers each. In any case, this coprocessor implements the VFPv2 instruction set and follows the IEEE 754 standard. The latter is a welcome decision, considering the architecture of previous generations.
The ‘New’ MPCore

With the arrival of the New 3DS in 2014, a new SoC was included (CPU LGR) and with it, a luxurious CPU upgrade.
The most apparent change is that we now have four MP11 cores instead of two. The consequences of this, however, are not simple to summarise, but we’ll see them in due time.
The second change is that the CPU incorporates 2 MB of L2 cache shared between the four cores. This type of cache is 16-way associative, which anticipates four cores accessing it at the same time. If you’d like to know more, I went over associative caches with the Xbox 360 article.
Moving on, all cores now run at 804 MHz (three times the original speed, which will certainly raise a few eyebrows).
The AXI bus

Whether there are two or four cores, all of these are connected using a specialised bus, proudly authored by ARM, called the Advanced eXtensible Interface (AXI). This protocol is part of the AMBA3 model, a successor of the original AMBA revision that we’ve seen in the Wii and Wii U (both housing an ARM9 CPU).
Generally speaking, the AMBA model provides a set of protocols for connecting components with distinct bandwidth requirements using a bus topology. Compare this to the token-ring model of the PlayStation 3 or the mesh solution made for the Xbox 360. All of these consoles shared the same problem, but each came up with a different solution, none better or worse than the others, just different.
Following AMBA’s methodologies for interconnecting components, there will be a master-slave hierarchy imposed to maintain order. The master components (typically, the CPU cores) will be the ones sending commands to the slaves (i.e. memory and I/O blocks).
Now, as part of the AMBA3 specification, ARM offered the AXI model as a critical ingredient for building Systems-on-Chip (SoC). Instead of using a single bus, AXI uses a dedicated block (called the AXI interconnect) acting as a bus matrix [13], which is connected to every single component using dedicated 64-bit buses [14]. In doing so, AXI overcomes the limitations of high-bandwidth components sharing the same bus (as happened with the PlayStation 2). Moreover, multiple master devices can communicate with slave nodes using separate channels, avoiding having to wait for other masters to finish. Finally, traditional enhancements like burst transactions are implemented, which the MP11 cores take advantage of.
In the case of the 3DS, the AXI interconnect is housed in a bigger block called Snoop Control Unit (SCU) that also takes care of automatically maintaining L1 cache coherency between the MP11 cores.
Any other CPUs?
Up to this moment, I’ve been talking about the MPCore as if it were the only CPU in this system, the reason being that mixing up distinct CPUs in this analysis could turn it into an incomprehensible essay. That is, until now.
The truth is, Nintendo had extra requirements for this console. They wanted a proper security system, but also the possibility of turning the console into a Nintendo DSi or a GBA on the fly. So, for all of that, they ended up bundling three distinct CPU packages – one being the aforementioned ARM11. The other two are well hidden, in the sense that games are completely unaware of them. In fact, 3DS emulators like Citra don’t care about them either [15].

But we do! Here’s the complete list of CPUs this system houses:
- The ARM11 MPCore we’ve just seen.
- An ARM946E-S from the Nintendo DS days. It’s treated as a secret co-processor and it’s only managed by the operating system. Alternatively, it becomes the main processor whenever a DS or DSi game is executed.
- Thanks to its bundled CP15 co-processor, there’s a Memory Protection Unit (MPU) in place. This will protect the CPU from arbitrarily executing code from any location in memory.
- An ARM7TDMI from the Game Boy Advance days. It’s a relatively ignored CPU, unless a DS or DSi game is being played, in which case it acts as a co-processor. However, on the special occasion that a Game Boy Advance game is running, the main execution falls to this CPU.
Unfortunately, or for obvious reasons, the three CPUs are never usable at the same time. Instead, the console has three modes of operation:
- Native 3DS mode: The ARM11 MPCore executes a 3DS game while the ARM946E-S deals with I/O and security. The ARM7, on the other side, is switched off.
- Nintendo DSi mode: The ARM946E-S and ARM7TDMI operate in a multi-processor configuration to execute a Nintendo DS or DSi game. Just like with its predecessor, the ARM7TDMI has greater access to I/O. Meanwhile, the ARM11 MPCore will be working in the background to replicate missing and re-located DS hardware (real-time clock, power management, keypad, GBA/DS PPU display and so forth).
- Game Boy Advance mode: The ARM7TDMI is the only CPU executing instructions (in 99% of cases, that will come from a GBA game). The ARM11 MPCore and ARM9, both still operating within the capacities of ‘Native 3DS mode’, will be working in the background.
If you stop to think about it, the Nintendo 3DS ends up housing four processors in total (two MP11 cores + one ARM9 + one ARM7), or the absurd number of six in the case of the New 3DS. How convoluted is that? Luckily, this system didn’t suffer the complications of the Sega Saturn, and you can thank Nintendo and ARM’s engineering for that. After all, 3DS developers only had to deal with the MPCore.
Since the ARM9 and ARM7 are predominantly for I/O, security and backwards compatibility (neither of which require the developer’s awareness), I discuss them in later sections of this article. But if you’d like to know more about the design of the ARM7 and ARM9, I wrote about them in previous articles (the Game Boy Advance and Nintendo DS ones, respectively).
Multi-core communication
I guess the question now is, how can CPUs and cores talk to each other? Well, the easiest way is to share RAM… but you could also try a more efficient approach, depending on the cores trying to communicate:

- With inter-core ARM11 communication, a core can send interrupts to another core by writing to its Software Interrupt Register [16].
- In the case of ARM11↔ARM9 or ARM9↔ARM7 communication, the same FIFO model from the Nintendo DS is implemented (a rough sketch follows below). Plus, the ARM11↔ARM9 FIFO is also called ‘PXI’ [17].
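To give an idea of what a FIFO-style mailbox between two processors looks like, here’s a hypothetical C sketch. The register addresses and status bits are placeholders invented for illustration; they are not the 3DS’s real PXI registers.

```c
#include <stdint.h>

/* Hypothetical memory-mapped FIFO registers (placeholder addresses, NOT the
   real ones). 'volatile' stops the compiler from caching hardware accesses. */
#define FIFO_SEND   (*(volatile uint32_t *)0x10000000)
#define FIFO_RECV   (*(volatile uint32_t *)0x10000004)
#define FIFO_STATUS (*(volatile uint32_t *)0x10000008)
#define FIFO_FULL   (1u << 0)
#define FIFO_EMPTY  (1u << 1)

/* Wait until the outgoing FIFO has room, then push one word for the other
   CPU to pick up (which would typically be notified with an interrupt). */
void fifo_send(uint32_t word)
{
    while (FIFO_STATUS & FIFO_FULL)
        ;   /* spin until there is space */
    FIFO_SEND = word;
}

/* Wait until a word arrives from the other CPU and return it. */
uint32_t fifo_receive(void)
{
    while (FIFO_STATUS & FIFO_EMPTY)
        ;   /* spin until data is available */
    return FIFO_RECV;
}
```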
Memory available
Having three different CPUs also means the memory layout will not be simple, especially if you care about security.

To make a long story short, we’ve got the following blocks:
- From the developer’s perspective, the system provides 128 MB of FCRAM. The New 3DS increased this to 256 MB. The rest is redundant for games.
- For predominantly security reasons, the ARM11 is also provided with a fast block of 512 KB of SRAM. The ARM9 is also given a block of 1 MB of SRAM (1.5 MB in the case of the New 3DS).
- By inheriting the model of the Nintendo DS, the ARM9 also houses Tightly-Coupled Memory (TCM). Particularly, there’s 32 KB for instructions and 16 KB for data.
A new type of memory spotted
It’s all jolly that the Nintendo 3DS includes 32 times the general-purpose memory of its predecessor, but what about that ‘FCRAM’? Is it any different from the other standards?
Well, Fast Cycle DRAM (FCRAM) is yet another RAM invention, this time authored in 2002 by Fujitsu and Toshiba. Presented as an alternative to DRAM-based technology (i.e. SDRAM, EDO DRAM, RDRAM, etc.), FCRAM excels at non-continuous reads, where it exhibits lower latency than DRAM [18]. This was done to replicate the performance offered by the more expensive SRAM.
FCRAM competes directly with DDR DRAM by offering a revamped design of the memory arrays. In place of adding more circuitry on top of it, arrays are split into smaller subblocks, which are then accessed using a 3-stage pipeline [19]. In doing so, reading and writing on random locations become faster. These changes are still designed with backwards compatibility in mind. Thus, FCRAM is compatible with DDR DRAM controllers (hence, its full name is ‘DDR FCRAM’).
Faster memory transfers
The inventors of the MPCore and the AMBA bus happen to also offer a brand of DMA controllers called CoreLink, with Nintendo being a loyal client. So, it’s no mystery as to why the 3DS bundles multiple blocks of CoreLink DMA-330 into their SoC [20].
These DMAs in particular are attached to an AXI bus and act as master devices. They can transfer data between two slaves interconnected with the AMBA protocol (either AXI or the slower APB) with the following advantages:
- Faster transfer rates compared to either CPU.
- Support of up to eight channels (eight transfers at the same time) [21].
To be precise, Nintendo fitted one CoreLink DMA next to the ARM9; this is referred to as XDMA and provides up to four channels. There’s another DMA next to the ARM11 block, this time called CDMA, which provides up to eight channels. With the arrival of the New 3DS, another CoreLink DMA-330 was fitted next to the ARM11 block (now a quad-core cluster).
Programming
With all that being said, how do you program a system featuring this unorthodox CPU arrangement? To be fair, unusual systems are no strangers to videogame developers. But in this case, 3DS programmers only have access to the ARM11 MPCore. Furthermore, once you reach the ‘Operating System’ section, you’ll learn that what can be done with this cluster is further restricted.
In any case, no matter the console revision, programmers base their algorithms on the multi-threading model: the program groups sequences of instructions into threads, which are then dispatched by the operating system to the physical cores as it deems fit. Once a novelty for Xbox 360 software, this standard provides a layer of abstraction that frees developers from writing software tied to a fixed number and type of CPU cores.
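As a generic illustration of the model (using POSIX threads rather than the 3DS SDK’s own primitives), the following C program declares the work as threads and lets the scheduler decide which physical core runs each of them:

```c
#include <pthread.h>
#include <stdio.h>

/* Each thread receives a chunk of work; which core executes it is left to
   the scheduler, so the same code runs unmodified on two or four cores. */
static void *worker(void *arg)
{
    int chunk = *(int *)arg;
    printf("Processing chunk %d\n", chunk);   /* game logic, audio, etc. */
    return NULL;
}

int main(void)
{
    enum { THREADS = 4 };
    pthread_t threads[THREADS];
    int chunks[THREADS];

    for (int i = 0; i < THREADS; i++) {
        chunks[i] = i;
        pthread_create(&threads[i], NULL, worker, &chunks[i]);
    }
    for (int i = 0; i < THREADS; i++)
        pthread_join(threads[i], NULL);

    return 0;
}
```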
Dealing with the ‘New’ hardware
Since the New 3DS diverges considerably from the original specification, Nintendo set up a thin compatibility layer to enable old 3DS games to work on the new hardware without manual intervention.
In essence, when a game is launched from a New 3DS console, the game’s code specifies if it’s specifically targeting the new models or not [22]. If it is, the operating system will proceed to activate all the novelties (faster clock speed, extra RAM and use of L2 cache) for that game to enjoy. If it’s not, the operating system will keep its exclusive hardware deactivated until the user exits the game, so the game can safely assume it’s running on the old hardware and will do so without issue.
To keep supporting the old 3DS, games can be packaged with two codebases (one for the ‘New’ model and the other for the ‘Old’ one). It’s up to the game studios to decide whether to support the old and new 3DS, or only the new 3DS.
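In code terms, the launch-time decision boils down to something like the hypothetical check below; the function names and values are invented for illustration, as the real information comes from the game’s metadata and the operating system:

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins: the real answers come from the hardware and from
   the game's header, not from hard-coded values like these. */
static bool console_is_new_3ds(void)    { return true;  }
static bool title_targets_new_3ds(void) { return false; }

int main(void)
{
    if (console_is_new_3ds() && title_targets_new_3ds()) {
        puts("Enable the 804 MHz clock, extra RAM and L2 cache for this title");
    } else {
        puts("Keep the exclusive hardware off; the game sees an 'old' 3DS");
    }
    return 0;
}
```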
You may be wondering what happens with the rest of the exclusive hardware the New 3DS houses (i.e. extra ARM11 cores and DMA). Well, to properly understand the rationale, I explain this once you reach the ‘Operating System’ section, but I’m afraid you won’t like the answer!
Graphics
Next to a new CPU is always a modern GPU. So, what kind of Picture Processing Unit did Nintendo build this time? To tell the truth, none. For the first time in their portable line, they resorted to a GPU supplier.
Nevertheless, the requirements of Nintendo haven’t shifted. The company still wanted a chip with acceptable performance… and the intellectual property. This allowed them to embed the GPU into their SoC, in the same way they did with the ARM CPUs.

Meanwhile, a potential candidate had just finished unveiling their new invention at SIGGRAPH 2006 [23]. For some time, Digital Media Professionals Inc. (also known as ‘DMP’) had been building affordable GPUs for the embedded market and, while their chips are nothing out of the ordinary, they guarantee decent OpenGL ES support. Furthermore, their licensing framework offers synthesisable GPUs.
This seemed enough for Nintendo, who happily negotiated a licence for DMP’s latest core, the PICA200, and subsequently bundled it into the CPU CTR (the Nintendo 3DS’ SoC). The GPU runs at 268 MHz.
Architecture of the PICA200
If I had to summarise it in one sentence, the PICA200 is a budget low-power 3D processor that combines a pre-unified architecture with a modernised API. The underlying architecture of the PICA200 is called Maestro 2G [24] and its design is compliant with OpenGL ES 1.1, but extended with elements from OpenGL ES 2.0 [25]. However, the PICA200’s APIs are not limited to either standard.
You see, even though the pipeline is segregated and the pixel stage is fixed-function (à la PlayStation 2), DMP expanded the limited circuitry with a set of Maestro functions that provide capabilities beyond the expectations of the embedded market [26]. This includes fragment lighting, multiple shadowing algorithms, polygon subdivision, bump mapping, procedural textures and many fog effects.
Additionally and in contrast to the Nintendo DS, the PICA200 only works with framebuffers. That’s it. The sprite engine, a popular workaround to tackle unaffordable memory requirements, is now a thing of the past. This also includes scan-line tricks, as contemporary GPUs work way faster than the refresh rate of a CRT.
Organising the content
Now that we know that this console can draw 3D shapes, the question now is: where does it store its materials? There are two locations, the large FCRAM block and the smaller but faster VRAM.
Nintendo only provided 6 MB of VRAM exclusively for the GPU. Ideally, programmers would fit as much as they can there, but since it will fill up pretty quickly, it is expected to be used to store data that needs instant access (i.e. commands, buffers and recurrent textures) while placing the rest on FCRAM. The PICA200 comes with a DMA unit that can transfer data between FCRAM and VRAM. So, at the end of the day, it’s the responsibility of the programmer to come up with an efficient placement to avoid bottlenecks.

During rendering, programmers allocate dedicated render buffers (i.e. frame, stencil, depth, etc.) for many operations. That’s always been the case. With the 3DS, alongside these buffers, programmers are also expected to reserve extra space for Display buffers, which are bound to the physical screens. The 3DS requires allocating three Display buffers (two for the stereoscopic upper screen and one for the bottom one). To give you an idea, the display process works as follows (a simplified sketch follows the list):
- The LCD continuously displays the content of the front (active) Display buffer, as instructed by the value of the buffer index.
- Meanwhile, the GPU finishes rendering geometry in a framebuffer.
- The framebuffer is exported to the back (inactive) Display buffer.
- The GPU swaps the index of the front Display buffer.
- For practical reasons, the index swap should happen at Vertical Sync to avoid tearing of the picture [27]. The official APIs provide synchronisation functions to keep all operations at the correct pace.
- From now on, the LCD will scan the recently updated Display buffer.
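Put into heavily simplified C, one iteration of that loop could look like the sketch below. None of these names come from the official API; they only mirror the steps listed above.

```c
/* Three screens, each with a front buffer (being scanned by the LCD) and a
   back buffer (receiving the exported framebuffer). */
enum { TOP_LEFT, TOP_RIGHT, BOTTOM, SCREENS };

typedef struct {
    void *buffer[2];   /* one front, one back                      */
    int   front;       /* index of the buffer the LCD is reading   */
} display_t;

static void export_frame(void *display_buffer, int screen)
{
    (void)display_buffer; (void)screen;   /* copy the rendered framebuffer here */
}

static void wait_for_vsync(void)
{
    /* block until the vertical sync period to avoid tearing */
}

void present(display_t screens[SCREENS])
{
    for (int s = 0; s < SCREENS; s++) {
        int back = screens[s].front ^ 1;            /* the inactive buffer */
        export_frame(screens[s].buffer[back], s);   /* fill it             */
    }

    wait_for_vsync();

    for (int s = 0; s < SCREENS; s++)
        screens[s].front ^= 1;                      /* swap the index      */
}
```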
Adopting open standards
On an interesting note, just like the ARM11 MPCore adopts ARM’s AXI protocol for interconnecting its cores, DMP adopted a less-proprietary option called Open Core Protocol (OCP) [28]. As its name indicates, the Open Core protocol does not impose any licensing restrictions on its users, something that vendors using the PICA200 may find advantageous. For comparison purposes, AXI was released in 2003 (along with the AMBA 3 specification) while OCP was published in 2001. It does make me wonder what kind of technology Nintendo fitted to adapt the OCP signal coming from the PICA200 into an AXI-compliant signal, so the rest of the SoC understands it. I assume that there’s a bridge between the PICA200 and the AXI bus.
Interestingly enough, the predecessor of the PICA200, the ULTRAY2000, shares many similarities with its successor. The most notable difference, however, is that the data interfaces use the PCI and DDR-SDRAM protocols instead [29].
Constructing the frame
Naturally, the GPU is not aware of the stereoscopic or dual-screen nature of the displays, it will only be tasked with rendering three screens during gameplay:
- Top stereoscopic-left: 400 x 240 pixels wide.
- Top stereoscopic-right: 400 x 240 pixels wide.
- Bottom: 320 x 240 pixels wide.
All of them can display 24-bit RGB colours (8 bits per channel), which equates to up to 16.78 million colours.
Considering players will expect acceptable frame rates on all three screens (especially on the first two), the single PICA200 will be subject to high amounts of workload throughout its operation, an important aspect to remember when judging its performance.
That being said, here is an overview of how data travels to draw a single frame:

… and as customary in this series of articles, we’ll now take a look at what happens at each stage.
Commands

This is Nintendo’s first portable console to finally draw triangles in ‘the usual way’. That is, with the use of commands. This is not surprising, though, as the PICA200 is expected to abide by the teachings of OpenGL ES.
In essence, the PICA200 draws polygons by reading a command buffer [30]. Furthermore, the vertex data can either be embedded within the command or stored in a separate buffer in VRAM, with the latter being the more efficient option.
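Conceptually, a command buffer is little more than an array of ‘write this value into that GPU register’ packets that the CPU appends and the GPU later consumes. The layout below is invented purely for illustration and does not match the PICA200’s real packet format:

```c
#include <stdint.h>
#include <stddef.h>

/* A made-up command packet: which GPU register to write and the value. */
typedef struct {
    uint16_t reg;     /* target GPU register    */
    uint32_t value;   /* value to write into it */
} gpu_cmd_t;

typedef struct {
    gpu_cmd_t cmds[1024];
    size_t    count;
} cmd_buffer_t;

/* The CPU appends packets; once the buffer is submitted, the GPU walks it
   from start to finish, applying each register write in order. */
static void push_cmd(cmd_buffer_t *buf, uint16_t reg, uint32_t value)
{
    if (buf->count < 1024) {
        buf->cmds[buf->count].reg   = reg;
        buf->cmds[buf->count].value = value;
        buf->count++;
    }
}
```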
Vertex

The PICA200 provides four Vertex Processors (VP) that operate in parallel. However, if the geometry shader (the next pipeline stage) is activated, only three processors can be utilised.
Each core computes 96-bit vectors made of four 24-bit floating-point values [31], but unlike the ARM11’s VFP, they don’t comply with IEEE-754 [32]. The vertex processors are programmed using assembly language specific to the PICA200 (reminiscent of the days of the Nvidia NV30) and are operated as follows [33]:
- Developers write the vertex shader using PICA200 assembly. For reference, the instruction set is very similar to Microsoft’s vs_2_0 [34].
- The shader is compiled using a proprietary assembler.
- The 3DS program must copy the compiled binary to memory (either FCRAM or VRAM).
- Then, the 3DS program issues a GPU command to load the binary and connect it with the program.
Once the vertex cores finish processing, they output the results to the Sync Control block, which acts as a vertex cache and buffer. It has a capacity of 384 Bytes, enabling it to hold up to 32 96-bit vectors. Finally, the next stage reads from this block.
Geometry

The geometry stage is a signature feature of 8th-generation consoles, allowing developers to spawn complex geometry out of simple vertex data.
In this case, the PICA200’s geometry stage is implemented by stealing one of the four Vertex Processors. Then, the ‘geometry’ vertex core is loaded with a different vertex shader. Finally, it receives the vertex data from the three other processors.
Even though the geometry shader is programmable, in practice, Nintendo doesn’t allow this. Thus, game developers can only choose from a pre-programmed set of geometry shader programs (found in the SDK). Examples of available geometry shaders include square and line generation (using point primitives), geometry subdivision, silhouette edge rendering and random particle generation.
Rasteriser

At this stage, all primitives are converted into pixels.
The rasteriser unit of the PICA200 is very simple: it just generates triangles out of primitives, then applies culling and clipping to remove unseen triangles (hidden behind others and/or outside the view area, respectively). This is all very similar to OpenGL ES’ modus operandi, albeit developers have to watch out for some coordinate systems that are inverted when working with the PICA200.
Fragment

The fragment stage is made of two areas: the texture units, which fetch textures from memory and process them, and the shading unit, which performs extra operations on the texture data.
The PICA200 contains four texture units [35]; each houses 256 Bytes of L1 cache and all of them share 8 KB of L2 cache. However, the units are not homogeneous. Instead, the range of services varies between them [36]:
- Only three units can process 2D textures.
- Only one unit can perform shadow, cube and projective texture mapping.
- The last unit is more of a noise generator, meaning it only outputs random textures. It uses a combination of a random number generator and a colour lookup table. This is a slender yet efficient way of implementing procedural generation with textures, saving bandwidth along the way.
Afterwards, it’s the job of the shading unit to creatively fiddle with the incoming textures. However – and this is something unexpected considering we’re talking about an 8th-generation console – the PICA200’s unit is not programmable with pixel shaders [37]. Instead, we find six configurable colour combiners; each combiner receives three RGB or Alpha values and performs a logical operation on them. The result is passed to the next combiner and so forth. Each colour combiner can get its input from the previous combiner (except the first), a texture unit or a constant value.
All in all, a modern reflection of the Flipper era (while abiding by the OpenGL specification [38]), but don’t forget developers may also combine this with the aforementioned Maestro functions.
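To get a feel for how such a chain operates, here’s a simplified C model: six stages, each selecting three inputs, applying one operation and feeding the next stage. The operation set and the source selection are deliberately reduced, so this mirrors the general structure rather than the PICA200’s exact feature set.

```c
#include <stdint.h>

typedef struct { uint8_t r, g, b, a; } rgba_t;

typedef enum { SRC_PREVIOUS, SRC_TEXTURE, SRC_CONSTANT } source_t;
typedef enum { OP_REPLACE, OP_MODULATE, OP_ADD } operation_t;   /* reduced set */

typedef struct {
    source_t    input[3];
    operation_t op;
    rgba_t      constant;
} combiner_t;

static uint8_t mul8(uint8_t a, uint8_t b) { return (uint8_t)((a * b) / 255); }
static uint8_t add8(uint8_t a, uint8_t b) { int s = a + b; return (uint8_t)(s > 255 ? 255 : s); }

/* Run the six combiner stages in sequence: each stage picks its inputs,
   applies one operation and hands the result to the next stage. */
rgba_t run_combiners(const combiner_t stages[6], rgba_t texture, rgba_t vertex_colour)
{
    rgba_t prev = vertex_colour;   /* seed (the real first stage cannot read a previous result) */

    for (int i = 0; i < 6; i++) {
        rgba_t in[3];
        for (int j = 0; j < 3; j++) {
            switch (stages[i].input[j]) {
            case SRC_PREVIOUS: in[j] = prev;               break;
            case SRC_TEXTURE:  in[j] = texture;            break;
            case SRC_CONSTANT: in[j] = stages[i].constant; break;
            }
        }
        /* in[2] would feed three-operand modes such as interpolation, omitted here. */
        rgba_t out = in[0];
        if (stages[i].op == OP_MODULATE) {
            out.r = mul8(in[0].r, in[1].r);
            out.g = mul8(in[0].g, in[1].g);
            out.b = mul8(in[0].b, in[1].b);
        } else if (stages[i].op == OP_ADD) {
            out.r = add8(in[0].r, in[1].r);
            out.g = add8(in[0].g, in[1].g);
            out.b = add8(in[0].b, in[1].b);
        }   /* OP_REPLACE: out already equals in[0] */
        prev = out;   /* feed the next combiner */
    }
    return prev;
}
```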
Post-processing

After the frame is processed and ready to be written into the framebuffer (or render targets), it goes through a sequence of final ‘corrections’. This is similar to the OpenGL ES 2.0’s pipeline.
That being said, the frame goes through alpha, stencil and depth testing. Afterwards, the result can be mixed with an existing frame (in the framebuffer) using the colour blender or logical operators (AND, XOR, etc.). Finally, the frame is written into the assigned buffer in memory either as a whole or through a stencil filter (for masking).
For additional smoothing of the edges, the PICA200 can render the frame at twice the selected dimensions and then downsample it by averaging each 2×2 block of pixels. This is an old technique known as supersampling.
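The downsampling step amounts to a 2×2 box filter. A minimal C version, assuming a single 8-bit channel for brevity, would look like this:

```c
#include <stdint.h>

/* Average each 2x2 block of the oversized render (2*w x 2*h) into one pixel
   of the final w x h frame. A real frame holds RGB(A) channels; one 8-bit
   channel is used here to keep the example short. */
void downsample_2x2(const uint8_t *src, uint8_t *dst, int w, int h)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            int sx = x * 2, sy = y * 2, sw = w * 2;
            int sum = src[sy * sw + sx]       + src[sy * sw + sx + 1]
                    + src[(sy + 1) * sw + sx] + src[(sy + 1) * sw + sx + 1];
            dst[y * w + x] = (uint8_t)(sum / 4);
        }
    }
}
```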
Once the framebuffer is ready to be displayed, it must be copied into another block in memory called Display Buffer (whose format is better aligned to the scan-line procedure of the LCD screen) and then transferred to the LCD in the form of scan-lines.
Interactive comparison
Now that you’ve seen how the PICA200 draws its triangles on the screen, it’s time for some practical examples. Here I’ve gathered two Marios from Smash Bros. games, the Wii and 3DS ones. Notice how the level of detail of ‘angry Mario’ hasn’t changed that much, considering we’re comparing a 2006 home console with a 2011 portable.
It’s worth reminding again that, in practice, the PICA200 will be rendering three screens at the same time, something that the Wii’s GPU wasn’t subjected to.
Nostalgic rendering
After all that’s been explained, there’s one question left unanswered: how does the PICA200 render Nintendo DS and Game Boy Advance games? You may remember that the DS and GBA’s GPUs exhibit a completely different modus operandi for rendering and broadcasting frames.

The explanation is that the DS and GBA PPUs are housed in the SoC, and DSi/DS/GBA games operate them as they originally did on previous consoles. The PPUs’ output (scan-lines) is delivered to a block called LgyFB, which may optionally upscale the frame before forwarding it to the framebuffer, where the PICA200 will take care of displaying it. It’s the job of the ARM11 and its DMA to take care of all memory transfers during this process.
Naturally, this arrangement will add some delay (a.k.a. lag), albeit negligible to the user.
Audio
Overall, the SoC houses two audio blocks:
- A proprietary DSP exclusively programmed for sound operations. This is used by 3DS games.
- A variant of the Nintendo DS audio block named CSND. 3DS, DS and GBA games use it.
The 3DS-only hardware
You may know that this same DSP was previously bundled with the Nintendo DSi, but treated as an optional accelerator instead. With the 3DS, it’s become the designated audio processor, so it’s no longer a voluntary component.
The DSP is called CEVA TeakLite II [39] and operates at ~134 MHz [40]. It’s manufactured by ParthusCeva, a company that provides synthesisable cores for audio processing [41], and I guess ‘synthesisable’ was the keyword Nintendo was looking for when they partnered.
Moving on, the DSP outputs stereo samples (2 channels) of up to 32 kHz of sampling rate and 16-bit resolution.
Next to this component, we can find 512 KB of RAM that is used by the DSP as a working area. It’s double-buffered (256 KB per buffer), so both the CPU and DSP can read and write without interruption [42]. Apart from that, the DSP comes with a dedicated DMA that can transfer data in and out of those 512 KB.
Operation
For all intents and purposes, games treat this as an opaque DSP. Thus, only Nintendo knows how to program it.
3DS programs, as a consequence of being developed using the official SDK, bundle a DSP firmware (solely authored by Nintendo) which is then uploaded to the DSP chip at runtime [43]. Afterwards, programs rely on that firmware to execute audio-related routines. Furthermore, the audio services provided by the operating system further abstract the communication between the program and the DSP’s firmware [44].
In any case, while the DSP firmware may change over the years, some capabilities have remained the same. For instance, the DSP can mix ADPCM and PCM samples, with support for up to 24 channels [45]. There’s also functionality for filtering and sequencing, including the generation of PSG-like sounds.
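As a generic illustration of what mixing up to 24 channels into a stereo stream involves (this is not the DSP firmware’s actual code), here’s a plain C mixer that sums 16-bit PCM channels with per-channel volume and panning:

```c
#include <stdint.h>

#define CHANNELS 24

typedef struct {
    const int16_t *samples;   /* decoded PCM data for this channel  */
    float volume;             /* 0.0 .. 1.0                         */
    float pan;                /* 0.0 = full left, 1.0 = full right  */
} channel_t;

static int16_t clamp16(int v)
{
    return (int16_t)(v > 32767 ? 32767 : (v < -32768 ? -32768 : v));
}

/* Mix 'count' sample frames into an interleaved stereo buffer
   (left, right, left, right, ...), the kind of stream a 32 kHz / 16-bit
   output stage expects. */
void mix_block(const channel_t ch[CHANNELS], int16_t *stereo_out, int count)
{
    for (int i = 0; i < count; i++) {
        int left = 0, right = 0;
        for (int c = 0; c < CHANNELS; c++) {
            float s = ch[c].samples[i] * ch[c].volume;
            left  += (int)(s * (1.0f - ch[c].pan));
            right += (int)(s * ch[c].pan);
        }
        stereo_out[2 * i]     = clamp16(left);
        stereo_out[2 * i + 1] = clamp16(right);
    }
}
```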
Interestingly enough, the steps followed for hacking the 3DS (so it can execute homebrew applications) optionally involve extracting the HOME Menu’s DSP firmware, so homebrew may use it to provide audio output [46].
The backwards-compatible block
At the other end of the spectrum, we find the CSND block. 3DS games may use it as an extension of the DSP, while DS/DSi/GBA games rely on it to replicate their original hardware.
In terms of functionality, the CSND features 32 channels [47], which is twice the amount of the Nintendo DS counterpart.
Curiously enough, early homebrew defaulted to this block for providing sounds, while waiting for the DSP to be reverse-engineered.
Pipeline
Both the DSP and the CSND work independently and separately output their audio to the speakers.

As a curious note, the original Nintendo 3DS didn’t play well with the speakers’ capabilities, as Nintendo ended up providing troubleshooting guides for cases of buzzing noises and fluctuations with the 3D slider [48], all caused by the design of the case.
I/O
This section tends to be very rich in technologies, considering Nintendo’s consoles favour generous I/O over state-of-the-art CPUs and GPUs. Let’s see what the Nintendo 3DS offers.
External interfaces and peripherals
The Nintendo DS had tons of modules built in, and the Nintendo DSi added more on top (after removing the GBA Slot). Now we find ourselves with a new console combining interfaces from two decades (the 2000s and 2010s).

To be fair, we still don’t have a standard like USB, but that may be expendable considering the Nintendo 3DS bundles the following:
- A keypad composed of digital buttons and an analogue circle pad.
- 3D and volume sliders.
- A Wi-Fi switch.
- A resistive touch sensor on the bottom screen.
- A gyroscope measuring the console’s rotation changes.
- An accelerometer to measure the console’s motion.
- One front camera and two back cameras, the latter pair allowing users to take stereoscopic pictures.
- An infrared transceiver, used to transfer data between external accessories.
- An SD card slot, serving as external storage.
- A standard 3.5 mm jack socket for headphones.
- A game card reader, from which 3DS, DSi and DS retail games are read.
The ‘New’ enhancements
If that wasn’t enough, the New 3DS came with more modules on top. This includes:
- Two extra digital buttons and an extra analogue circle pad (called ‘C-Stick’).
- An NFC Reader on the bottom screen.
- An infrared LED, reserved for head tracking.
- The SD slot is replaced with a microSD slot.
- The Wi-Fi switch has been removed; Wi-Fi is now controlled only through software.
Now, to prevent leaving ‘old’ users behind, Nintendo provided external accessories to enhance the old models, although most of them relied on the single infrared transceiver to connect. Thus, only one accessory could be connected at the same time.
Not all the exclusive features of the New 3DS can be replicated, however. For instance, the New 3DS’ head tracking mechanism depends on the extra ARM11 core.
Internal interfaces
Now it’s time to check how these interfaces – and others – are internally wired up.
Firstly, a large subset is interconnected with the standard Serial Peripheral Interface (SPI) protocol. There are four SPI buses and all of them are accessed by the ARM9 (which I assume also includes the ARM7). The ARM11 only has access to most of them [49]. In any case, the SPI buses connect the following modules [50]:
- The flash memory found inside 3DS game cards