Tech enthusiasts probably know ARM as a company that develops reasonably performant CPU architectures with a focus on power efficiency. Product lines like the Cortex A7xx and Cortex X series use well balanced, moderately sized out-of-order execution engines to achieve those goals. But ARM covers lower power and performance tiers too. Such cores are arguably just as important for the company. The Cortex A53 is a prominent example. Unlike its bigger cousins, A53 focuses on handling tasks that aren’t sensitive to CPU performance while minimizing power and area. It fills an important spot in ARM’s lineup, because not all applications require a lot of processing power.
If I had to guess, cell phone makers shipped more A53 cores than any other ARM core type from 2014 to 2017. As a “little” core in mobile SoCs, A53 outlasted two generations of its larger companions. Mobile SoCs often shipped as many “little” cores as “big” ones, if not more. Qualcomm’s Snapdragon 835 used four Cortex A73s as “big” cores, while four A53s were set up as a “little” cluster. Some lower end mobile chips were even fitted out exclusively with A53 cores. For example, the Snapdragon 626 used eight A53 cores clocked at 2.2 GHz.

Even after the A55 came out and gradually took over the “little” role in mobile SoCs, A53 continued to appear in new products. Google’s 2018 Pixel Visual Core used an A53 core to manage an array of image processing units. Socionext used 24 A53 cores to create a fanless edge server chip. Beyond that, Roku has used A53 in set-top boxes. A53 made a lot of devices tick, even if it doesn’t dominate spec sheets or get shiny stickers on boxes.

I’m testing the Cortex A53 cores on the Odroid N2+, which uses the Amlogic S922X. That SoC implements a dual core A53 cluster, a quad core A73 cluster, and a Mali G52 iGPU. Besides the Odroid N2+ single board computer, the S922X also appears in a few set-top boxes. Unlike Intel and AMD, ARM sells its core designs to separate chip designers, who then implement the cores. Therefore, some content in this article will be specific to Amlogic’s implementation.
Architecture
ARM’s Cortex A53 is a dual issue, in-order architecture. It’s a bit like Intel’s original Pentium in that regard, but modern process nodes make a big difference. Higher transistor budgets let A53 dual issue a very wide variety of instructions, including floating point ones. A53 also enjoys higher clock speeds, thanks both to process technology and a longer pipeline. Pentium had a five stage pipeline and stopped at around 300 MHz. A53 clocks beyond 2 GHz, with an eight-stage pipeline. All of that is accomplished with very low power draw.
We’ve analyzed quite a few CPUs on this site, including lower power designs. But all of those are out-of-order designs, which still aim for higher performance points than A53 does. A53 should therefore be an interesting look at how an architecture is optimized for extremely low power and performance targets.
Branch Predictor
A CPU’s pipeline starts at the branch predictor, which tells the rest of the CPU where to go. Branch prediction is obviously important for both performance and power, because sending the pipeline down the wrong path hurts both. But a more capable branch predictor costs more power and area, so designers have to work within limits.
A53 is a particularly interesting case both because it targets very low power and area, and because branch prediction accuracy matters less for it. That’s because A53 can’t speculate as far as out-of-order CPUs. A Zen or Skylake core could throw out over a hundred instructions worth of work after a mispredict. With A53, you can count lost instruction slots with your fingers.
So, ARM has designed a tiny branch predictor that emphasizes low power and area over speed and accuracy. The A53 Technical Reference Manual states that the branch predictor has a global history table with 3072 entries. For comparison, AMD’s Athlon 64 from 2003 had a 16384 entry history table. Qualcomm’s Snapdragon 821 is an interesting comparison here, because it’s contemporary with the A53 but uses the same Kryo cores for both clusters in its big.LITTLE setup. Kryo’s branch prediction is much more capable, but Qualcomm’s approach is also quite different because the “little” Kryo cores occupy the same area as the big ones.
In our testing, A53 struggles to recognize a repeating pattern with a period longer than 8 or 12. It also suffers with larger branch footprints, likely because branches start destructively interfering with each other as they clash into the same history table entries.
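To make that concrete, here’s a minimal sketch of how a pattern-length probe can work (not the exact harness behind these results; the iteration count and timing method are arbitrary choices): a single branch whose outcome repeats with a configurable period, timed over many iterations. Once the period exceeds what the history table can capture, time per iteration jumps by roughly the misprediction penalty. The wrap-around branch adds a little noise, but it’s trivially predictable.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// Probe how long a repeating taken/not-taken pattern can get before the
// predictor stops following it. Pass the period on the command line and
// compare per-iteration times as it grows.
int main(int argc, char **argv) {
    int period = (argc > 1) ? atoi(argv[1]) : 8;
    char *pattern = malloc(period);
    srand(42);
    for (int i = 0; i < period; i++) pattern[i] = rand() & 1;  // fixed random pattern

    const uint64_t iterations = 100000000;
    volatile uint64_t sink = 0;
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (uint64_t i = 0, j = 0; i < iterations; i++) {
        if (pattern[j]) sink += 1;              // the branch under test
        if (++j == (uint64_t)period) j = 0;     // walk the pattern
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("period %d: %.2f ns/iteration (sink %llu)\n",
           period, ns / iterations, (unsigned long long)sink);
    free(pattern);
    return 0;
}
```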
Return Prediction
For returns, the A53 has an eight-deep return stack. This is quite shallow compared to other cores. For example, Kryo has a 16-deep return stack. AMD’s Athlon 64’s return stack had 12 entries. But even an eight-deep return stack is better than nothing, and could be adequate for the majority of cases.
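A simple way to see a return stack’s depth is a chain of nested calls (a sketch with arbitrary parameters, not the methodology behind the numbers above): as long as the nesting depth stays at or below the stack depth, every return is predicted; past that, the deepest returns start mispredicting and per-call time rises.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// Nest calls to a configurable depth. While the depth stays within the return
// stack, every return is predicted correctly; past that, the deepest returns
// should start mispredicting.
__attribute__((noinline)) uint64_t nest(int depth) {
    if (depth == 0) return 1;
    return nest(depth - 1) + 1;   // not a tail call, so the return address matters
}

int main(int argc, char **argv) {
    int depth = (argc > 1) ? atoi(argv[1]) : 16;   // try 4, 8, 16, 32
    const long outer = 5000000;
    volatile uint64_t sink = 0;
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < outer; i++) sink += nest(depth);
    clock_gettime(CLOCK_MONOTONIC, &end);
    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("depth %d: %.2f ns per call (sink %llu)\n",
           depth, ns / ((double)outer * depth), (unsigned long long)sink);
    return 0;
}
```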
Indirect Branch Prediction
Indirect branches can go to multiple targets, adding another level of difficulty to branch prediction. According to ARM’s Technical Reference Manual, the A53 has a 256 entry indirect target array. From testing, the A53 can reliably track two targets per branch for up to 64 branches. If a branch goes to more targets, A53 struggles compared to more sophisticated CPUs.
Again, the A53 turns in an underwhelming performance compared to just about any out-of-order CPU we’ve looked at. But some sort of indirect predictor is better than nothing. Object oriented code is extremely popular these days, and calling a method on an object often involves an indirect call.
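As an illustration (a sketch with arbitrary parameters, not the article’s test), an indirect call through a function pointer that alternates between two targets exercises exactly this predictor. Growing the number of targets per call site, or the number of call sites, is how limits like the ones above get found.

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

// One indirect call site that alternates between two targets. A predictor
// that tracks multiple targets per branch keeps this cheap; rotating through
// more targets or call sites eventually exhausts the indirect target array.
__attribute__((noinline)) uint64_t target_a(uint64_t x) { return x + 1; }
__attribute__((noinline)) uint64_t target_b(uint64_t x) { return x + 2; }

int main(void) {
    // volatile pointers stop the compiler from turning this into direct calls
    uint64_t (* volatile targets[2])(uint64_t) = { target_a, target_b };
    const uint64_t iterations = 50000000;
    volatile uint64_t sink = 0;
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (uint64_t i = 0; i < iterations; i++)
        sink += targets[i & 1](i);     // the indirect branch under test
    clock_gettime(CLOCK_MONOTONIC, &end);
    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("%.2f ns per indirect call (sink %llu)\n",
           ns / iterations, (unsigned long long)sink);
    return 0;
}
```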
Branch Predictor Speed
Branch speed is important too. If the predictor takes a few cycles before it can steer the pipeline, that’s lost frontend bandwidth. A lot of high performance CPUs employ branch target buffers to cache branch targets, but A53 doesn’t do that. Most of the time, it has to fetch the branch from L1i, decode it, and calculate the target before it knows where the branch goes. This process takes three cycles.
For very tiny loops, A53 has a single entry Branch Target Instruction Cache (BTIC) that holds instruction bytes for two fetch windows. I imagine each fetch window is 8 bytes, because that would correspond to two ARM instructions. The BTIC would probably be a 16 byte buffer in the branch predictor. From testing taken branch latency, we do see BTIC benefits fall off once branches are spaced by 16 bytes or more.
If we do hit the BTIC, we can get two cycle taken branch latency. The “Branch Per 4B” case doesn’t count for ARM because ARM instructions are 4 bytes, and a situation where every instruction is a branch does not make sense. Zooming out though, the BTIC’s effect is quite limited. It’ll only cover extremely tiny loops without any nested branches.
Once branches spill out of the 32 KB L1 instruction cache, latency heads for the hills. Without a decoupled branch target buffer, the A53 won’t know where a branch goes before the branch’s instruction bytes arrive. If those instruction bytes come from L2 or beyond, the frontend would have to stall for a lot of cycles. If we present the data in terms of branch footprint in KB, we can take a glimpse at L2 and memory latency from the instruction side. It’s pretty cool, because normally this kind of test would simply show BTB latency, with the branch predictor running ahead and largely hiding cache latency.
This also means A53 will suffer hard if code spills out of the instruction cache. There’s very basic prefetching capability, but it’s not driven by the branch predictor and struggles to hide latency once you branch over enough instructions.
Instruction Fetch
Once the branch predictor has generated a fetch target, the CPU’s frontend has to fetch the instruction bytes, decode them to figure out what it has to do, and pass the decoded instructions on to the backend.
To accelerate instruction delivery, the S922X’s A53s have 32 KB, 2-way set associative instruction caches. ARM lets chipmakers configure instruction cache capacity between 8 and 64 KB, so the cache could have different sizes on other chips. The instruction cache is virtually indexed and physically tagged, allowing for lower latency because address translation can be done in parallel with cache indexing.
ARM has heavily prioritized power and area savings with measures that we don’t see in desktop CPUs from Intel and AMD. Parity protection is optional, allowing implementers to save power and area in exchange for lower reliability. If parity protection is selected, an extra bit is stored for every 31 bits to indicate whether there’s an even or odd number of 1s present. On a mismatch, the cache set with the error gets invalidated, and data is reloaded from L2 or beyond. ARM also saves space by storing as little extra metadata as possible. The L1i uses a pseudo-random replacement policy, unlike the more popular LRU scheme. LRU means the least recently used line in a set gets evicted when new data is brought into the cache. Implementing an LRU scheme requires storing additional data to track which line was least recently accessed, essentially using extra area and power to improve hitrates.
To further reduce power and frontend latency, the L1i stores instructions in an intermediate format, moving some decode work to predecode stages before the L1i is filled. That lets ARM simplify the main decode stages, at the cost of using more storage per instruction.
The instruction cache has no problems feeding the two-wide decoders. Qualcomm’s little Kryo has a huge advantage over the A53, because it’s really a big core running at lower clocks with a cut-down L2 cache. Kryo’s frontend can fetch and decode four instructions per cycle, making it quite wide for a low power core in the 2017 timeframe.
However, A53 and Kryo both see a massive drop in instruction throughput when code spills out of the instruction cache. L2 code fetch bandwidth can only sustain 1 IPC on Kryo, and even less on A53. 1 IPC would be four bytes per cycle. ARM’s technical reference manual says the L1i has a 128-bit (16 byte) read interface to L2. The CPU likely has a 128-bit internal interface, but can’t track enough outstanding code fetch requests to saturate that interface. To make things even worse, there’s no L3 cache, and 256 KB of L2 is not a lot of capacity for a last level cache. Instruction throughput is extremely poor when fetching instructions from memory, so both of these low power cores will do poorly with big code footprints.
Execution Engine
Once instructions are brought into the core, they’re sent to the execution engine. Unlike the original Pentium, which had a fairly rigid two-pipe setup, the A53 has two dispatch ports that can flexibly send instructions to several stacks of execution units. Instructions can dual issue as long as dependencies are satisfied and execution units are available. Many execution units have two copies, so dual issue opportunities are likely to be limited by dependencies rather than execution unit throughput.
The execution engine is cleverly designed to get around pipeline hazards that would trip up a basic in-order core. For example, it seems immune to write-after-write hazards. Two instructions that write to the same ISA register can dual issue. I doubt the A53 does true register renaming, which is how out-of-order CPUs get around the problem. Rather, it probably has conflict detection logic that prevents an instruction from writing to a register if it has already been written to by a newer instruction.
Since there’s no register renaming, we also don’t see renamer tricks like move elimination and zeroing idiom recognition. For example, XOR-ing a register with itself, or subtracting a register from itself, will have a false dependency on the previous value of the register (even though the result will always be zero).
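A dependency-chain experiment makes this visible. The AArch64 sketch below (an illustration, not the exact test behind this observation) produces x with a multiply each iteration, zeroes it with an EOR of the register with itself, and immediately consumes it. If the EOR were recognized as a zeroing idiom, the multiply’s latency would drop out of the chain; on A53 it should not.

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

// AArch64-only sketch: does "eor x, x, x" break the dependency on x?
int main(void) {
    const uint64_t iterations = 100000000;
    uint64_t x = 1, i = iterations;
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    asm volatile(
        "1:\n\t"
        "mul  %[x], %[x], %[x]\n\t"   // multi-cycle producer of x
        "eor  %[x], %[x], %[x]\n\t"   // always zero, but does it wait for the mul?
        "add  %[x], %[x], #1\n\t"     // consumer of the eor result
        "subs %[i], %[i], #1\n\t"
        "b.ne 1b\n\t"
        : [x] "+r"(x), [i] "+r"(i)
        :
        : "cc");
    clock_gettime(CLOCK_MONOTONIC, &end);
    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("%.2f ns per iteration (result %llu)\n",
           ns / iterations, (unsigned long long)x);
    return 0;
}
```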
Integer Execution Units
A53 can dual issue the most common operations like integer adds, register to register MOVs, and bitwise operations. Less common operations like integer multiplies and branches can’t be dual issued, but can issue alongside another type of instruction.
| Instruction | IPC | Latency |
|---|---|---|
| Integer Add | 1.77 | 1 cycle |
| Branch (Not Taken) | 0.95 | See branch prediction section above |
| 64-bit Integer Multiply | 0.95 | 4 cycles |
| 64-bit Load from Memory | 1 | See caching section below |
Mixing integer multiplies and branches results in about 0.7 IPC, so the two functional units share a pipe and cannot dual issue. That shouldn’t be a big deal, but it does suggest the A53 organizes the two integer pipes into a basic one and a complex one.
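For context on how numbers like these are usually gathered, here’s a generic sketch (not the exact harness used here; iteration counts are arbitrary): latency comes from a chain where every instruction depends on the previous one, while throughput comes from keeping independent chains in flight. On A53, the independent version should approach two adds per cycle while the dependent one is limited to one. AArch64 only.

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

// Dependent vs independent add chains: the first loop is latency bound, the
// second keeps two chains in flight so the core can dual issue.
static double time_loop(int independent) {
    const uint64_t iterations = 200000000;
    uint64_t a = 1, b = 1, i = iterations;
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    if (independent) {
        asm volatile(
            "1:\n\t"
            "add %[a], %[a], #1\n\t"   // two chains with no dependency
            "add %[b], %[b], #1\n\t"   // between them -> can dual issue
            "subs %[i], %[i], #1\n\t"
            "b.ne 1b\n\t"
            : [a] "+r"(a), [b] "+r"(b), [i] "+r"(i) :: "cc");
    } else {
        asm volatile(
            "1:\n\t"
            "add %[a], %[a], #1\n\t"   // each add depends on the one before,
            "add %[a], %[a], #1\n\t"   // so one per cycle at best
            "subs %[i], %[i], #1\n\t"
            "b.ne 1b\n\t"
            : [a] "+r"(a), [i] "+r"(i) :: "cc");
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    return ns / (iterations * 2);      // ns per add
}

int main(void) {
    printf("dependent adds:   %.2f ns each\n", time_loop(0));
    printf("independent adds: %.2f ns each\n", time_loop(1));
    return 0;
}
```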
Floating Point and Vector Execution
Unlike the original Pentium, which only had one FP pipe, the A53 uses modern transistor budgets to dual issue common floating point instructions like FP adds and multiplies. That’s a nice capability to have because JavaScript’s numbers are all floating point, and A53 may have to deal with a lot of JavaScript in phones. Unfortunately, latency is 4 cycles, which isn’t good considering the A53’s low clock speed. It’s even worse because the A53 doesn’t benefit from a big out-of-order engine, which means the core has to stall to resolve execution dependencies.
| Instruction | IPC | Latency |
|---|---|---|
| FP32 Add | 1.82 | 4 cycles |
| FP32 Multiply | 1.67 | 4 cycles |
| FP32 FMA | 1.19 | 8 cycles |
| 128-bit Vector FP32 Add | 0.91 | 4 cycles |
| 128-bit Vector FP32 Multiply | 0.91 | 4 cycles |
| 128-bit Vector FP32 FMA | 0.95 | 8 cycles |
| 128-bit Vector INT32 Add | 0.95 | 2 cycles |
| 128-bit Vector INT32 Multiply | 0.91 | 4 cycles |
| 128-bit Load from Memory | 0.49 | Not tested |
| 128-bit Store | 1 | N/A |
The same dual issue ability doesn’t apply to vector operations. A53 supports NEON, but can’t dual issue 128-bit instructions. Furthermore, 128-bit instructions can’t issue alongside scalar FP instructions, so a 128-bit operation probably occupies both FP ports regardless of whether it’s operating on FP or integer elements.
Nonblocking Loads
The A53 also has limited reordering capability that gives it some wiggle room around cache misses. Cache misses are the most common long latency instructions, and could easily take dozens to hundreds of cycles depending on whether data comes from L2 or memory. To mitigate that latency, the A53 can move past a cache miss and run ahead for a number of instructions before stalling. That means it has internal buffers that can track state for various types of instructions while they wait to have their results committed.
But unlike an out-of-order CPU, these buffers are very small. They likely consume very little power and area, because they don’t have to check every entry every cycle like a scheduling queue. So, the A53 will very quickly hit a situation that forces it to stall:
- Having eight total instructions in flight, including the cache miss
- Any instruction that uses the cache miss’s result (no scheduler)
- Any memory operation, including stores (no OoO style load/store queues)
- Having four floating point instructions in flight, even if they’re independent
- Any 128-bit vector instruction (NEON)
- More than three branches, taken or not
Because the A53 does in-order execution, a stall stops execution until data comes back from a lower cache level or memory. If a cache miss goes out to DRAM, that can be hundreds of cycles. This sort of nonblocking load capability is really not comparable to even the oldest and most basic out-of-order execution implementations. But it does show A53 has a set of internal buffers aimed at keeping the execution units fed for a short distance after a cache miss. So, A53 is utilizing the transistor budget of newer process nodes to extract more performance without the complexity of out-of-order execution.
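Here’s a rough sketch of how that run-ahead window can be probed (the parameters and the particular independent work are arbitrary choices, not the article’s methodology): chase pointers through a buffer much larger than L2 so most loads miss, and place a little independent integer work after each load. If the core can slide that work under the miss, it comes for free; once one of the limits above is hit, it shows up in the per-chase time.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// Pointer chase through a 32 MB buffer (far beyond the 256 KB L2), with a bit
// of independent ALU work after each dependent load.
#define ELEMS (32 * 1024 * 1024 / sizeof(uint64_t))

int main(void) {
    uint64_t *arr = malloc(ELEMS * sizeof(uint64_t));
    for (uint64_t i = 0; i < ELEMS; i++) arr[i] = i;
    // Sattolo shuffle: turns the identity into a single random cycle, so the
    // chase visits every element and prefetchers can't follow it
    srand(1);
    for (uint64_t i = ELEMS - 1; i > 0; i--) {
        uint64_t j = (uint64_t)rand() % i;
        uint64_t t = arr[i]; arr[i] = arr[j]; arr[j] = t;
    }
    const uint64_t chases = 2000000;
    uint64_t idx = 0, sum = 0;
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (uint64_t i = 0; i < chases; i++) {
        idx = arr[idx];              // dependent load, usually a cache miss
        sum += i * 3 + 1;            // independent work that could hide under the miss
        sum += i ^ 0x5555;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("%.2f ns per chase (checksum %llu %llu)\n", ns / chases,
           (unsigned long long)sum, (unsigned long long)idx);
    free(arr);
    return 0;
}
```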
Memory Execution
The A53 has a single address generation pipeline for handling memory operations. Assuming an L1D hit, memory loads have a latency of 3 cycles if a simple addressing mode is used, or 4 cycles with scaled, indexed addressing. This single AGU pipeline means the A53 cannot sustain more than one memory operation per cycle. It’s a significant weakness compared to even the weakest out-of-order CPUs, which can usually handle two memory operations per cycle.
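The sketch below (AArch64 only, an illustration rather than the original harness) measures that load-to-use latency with the scaled, indexed addressing mode: a small ring of indices fits in L1D, and every load’s address depends on the previous load’s result. Swapping in a pre-computed pointer chase (plain `ldr x, [x]`) would exercise the simpler 3 cycle case instead.

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

// Dependent loads through an 8 KB ring that stays resident in the 32 KB L1D.
#define N 1024

int main(void) {
    static uint64_t ring[N];
    for (uint64_t i = 0; i < N; i++) ring[i] = (i + 1) % N;  // cyclic index chain
    const uint64_t loads = 200000000;
    uint64_t idx = 0, i = loads;
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    asm volatile(
        "1:\n\t"
        // scaled, indexed addressing: base + (idx << 3)
        "ldr  %[idx], [%[base], %[idx], lsl #3]\n\t"
        "subs %[i], %[i], #1\n\t"
        "b.ne 1b\n\t"
        : [idx] "+r"(idx), [i] "+r"(i)
        : [base] "r"(ring)
        : "cc", "memory");
    clock_gettime(CLOCK_MONOTONIC, &end);
    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("%.2f ns per dependent load (idx %llu)\n",
           ns / loads, (unsigned long long)idx);
    return 0;
}
```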
Load/Store Unit
A CPU’s load/store unit has a very complex job, especially on out-of-order CPUs where it has to track a lot of memory operations in flight and make them appear to execute in program order. The A53’s load/store unit is simpler thanks to in-order execution, but handling memory operations still isn’t easy. It still has to handle instructions that address memory with byte granularity and varying sizes, with an L1 data cache that accesses a larger, aligned region under the hood. It still has to handle memory dependencies. Even though the A53 is in-order, pipelining means a load can need data from a store that hasn’t committed yet.
The A53 appears to access its data cache in 8 byte (64-bit) aligned chunks. Accesses that span an 8 byte boundary incur a 1 cycle penalty. Forwarding similarly incurs a 1 cycle penalty, but notably, there is no massively higher cost if a load only partially overlaps a store. In the worst case, forwarding costs 6 cycles when the load partially overlaps the store and the load address is higher than the store address.
Contrast that with Neoverse N1, which incurs a 10-11 cycle penalty if a load isn’t 4-byte aligned within a store. Or Qualcomm’s Kryo, which takes 12-13 cycles when a load is contained within a store, or 14-15 cycles for a partial overlap. This shows a benefit of simpler CPU design, which lets designers dodge complicated problems instead of having to solve them. Maybe A53 doesn’t have a true forwarding mechanism, and simply delays the load until the store commits, after which it can simply read the data normally from the L1 cache.
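For a sense of how forwarding behavior like this can be measured, here’s a sketch (arbitrary parameters; `memcpy` with a constant size stands in for raw 8 byte loads and stores, which compilers lower to single instructions): store 8 bytes, then load 8 bytes at a chosen offset from the store address, with the load feeding the next iteration’s store. The 16 byte aligned buffer makes the offsets meaningful relative to 8 byte boundaries.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

// Store 8 bytes, then load 8 bytes "offset" bytes away. Offset 0 is an exact
// overlap, 1-7 are partial overlaps, 8 removes the overlap entirely. The load
// result feeds the next store, so the store/load pair sits on the critical path.
int main(int argc, char **argv) {
    int offset = (argc > 1) ? atoi(argv[1]) : 0;   // try 0 through 8
    _Alignas(16) uint8_t buf[64] = {0};
    const uint64_t iterations = 100000000;
    uint64_t value = 1;
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (uint64_t i = 0; i < iterations; i++) {
        memcpy(buf + 16, &value, 8);             // 8 byte store, 16 byte aligned
        memcpy(&value, buf + 16 + offset, 8);    // 8 byte load, possibly overlapping
        value += 1;                              // keep the chain dependent
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("offset %d: %.2f ns per store/load pair (value %llu)\n",
           offset, ns / iterations, (unsigned long long)value);
    return 0;
}
```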
With SIMD (NEON) memory accesses, A53 has a harder time. Stores and loads have to be 128-bit (16 byte) aligned, or they’ll incur a penalty cycle. Forwarding is basically free, except for some cases where the load address is higher than the store’s address. But just like with scalar integer-side accesses, getting a dependent load/store pair through every 6 cycles is pretty darn good.
A53 is remarkably resilient to load/store unit penalties in other areas too. There’s no cost to crossing a 4K page boundary – something that higher performance CPUs sometimes struggle with. For example, ARM’s higher performance Neoverse N1 takes a 12 cycle penalty if a store crosses a 4K boundary.
Address Translation
Virtual memory is necessary to run modern operating systems, which give each user program a virtual address space. The operating system sets up mappings between virtual and physical addresses in predefined structures (page tables). However, going through the operating system’s mapping structures to translate an address would basically turn each user memory access into several dependent ones. To avoid this, CPUs keep a cache of address translations in a stupidly named structure called the TLB, or translation lookaside buffer.
A53 has a two-level TLB setup. ARM calls these two levels a micro-TLB and a main TLB. The terminology is appropriate because the main TLB only adds 2 cycles of latency. If there’s a TLB miss, the A53 has a 64 entry, 4-way page walk cache that holds second level paging structures.
For comparison, AMD’s Zen 2’s data TLBs consist of a 64 entry L1 TLB, backed by a 2048 entry, 16-way L2 TLB. Zen 2 gets far better TLB coverage, but a L2 TLB hit takes seven extra cycles. Seven cycles can easily be hidden by a large out-of-order core, but would be sketchy on an in-order one. A53’s TLB setup is thus optimized to deliver address translations at low latency across small memory footprints. Zen 2 can deal with larger memory footprints, through a combination of out-of-order execution and larger, higher latency TLBs.
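To make the tradeoff concrete, here’s a sketch of a TLB reach test (parameters are arbitrary, and cache effects inevitably mix into the numbers): touch one 8 byte slot per 4 KB page, chasing pointers through the pages in a randomized ring. With a handful of pages everything hits the micro-TLB; as the page count grows past the main TLB’s coverage, each access picks up page walk latency, hopefully shortened by the walk cache.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// One dependent access per 4 KB page, pages visited in a shuffled ring.
int main(int argc, char **argv) {
    int pages = (argc > 1) ? atoi(argv[1]) : 64;    // try 8, 64, 512, 4096
    size_t page = 4096;
    uint8_t *buf = aligned_alloc(page, (size_t)pages * page);
    uint64_t **slots = malloc(pages * sizeof(*slots));
    for (int i = 0; i < pages; i++) slots[i] = (uint64_t *)(buf + (size_t)i * page);
    // Shuffle the visit order, then link the slots into one ring
    srand(7);
    for (int i = pages - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        uint64_t *t = slots[i]; slots[i] = slots[j]; slots[j] = t;
    }
    for (int i = 0; i < pages; i++) *slots[i] = (uint64_t)slots[(i + 1) % pages];

    const uint64_t accesses = 50000000;
    uint64_t *p = slots[0];
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (uint64_t i = 0; i < accesses; i++) p = (uint64_t *)*p;  // one access per page
    clock_gettime(CLOCK_MONOTONIC, &end);
    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("%d pages: %.2f ns per access (%p)\n", pages, ns / accesses, (void *)p);
    free(slots); free(buf);
    return 0;
}
```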
In addition to using small TLBs, ARM further optimized for power and area by only supporting 40-bit physical addresses. A53 is therefore limited to 1 TB of physical memory, putting it on par with older CPUs like Intel’s Core 2. Mobile devices won’t care about addressing more than 1 TB of memory, but that limitation could be problematic for servers.
Cache and Memory Access
Each A53 core has a 4-way set-associative L1 data cache (L1D). The Odroid N2+’s A53 cores have 32 KB of L1D capacity each, but implementers can set capacity to 8 KB, 16 KB, 32 KB, or 64 KB to make performance and area tradeoffs. As menti