The Latency Shakeup by snvzz

All gateware components are now in place on the BoxLambda SoC, but the system is not yet behaving as required. A key requirement of BoxLambda is deterministic behavior. The duration of operations such as internal memory or register access must be predictable by design. When analyzing a snippet of assembly code, you should be able to predict exactly how many clock cycles it will take to execute, without relying on statistics. That’s what this post is about.

Recap

Here’s a summary of the current state of BoxLambda:

An Ibex RISC-V core with machine timer and hardware interrupt support.
Wishbone interconnect and Harvard architecture internal memory.
DDR3 external memory access through the Litex memory controller.
OpenOCD-based debug access on FPGA and Verilator.
VERA-based VGA graphics: 2 layers tile or bitmap mode, 2 banks of 64 sprites, 128KB Video RAM, 256 color palette.
Dual YM2149 PSG Audio.
SD Card Controller and FatFs File System.
24-pin GPIO, UART, SPI Flash Controller, I2C Controller.
Real-time Clock and Calendar (RTCC) support.
USB HID Keyboard and Mouse support.
A PicoRV32-based Programmable DMA Controller.
A Picolibc-based standard C environment for software running on the Ibex RISC-V core.
DFX Partial FPGA Reconfiguration support.
A Base and a DFX configuration targeting the Arty-A7-100T FPGA development board.
A suite of test applications covering all SoC components, running on both FPGA and Verilator.
Automated testing on Verilator and CocoTB.
A Linux CMake and Bender-based Software and Gateware build system.

Behavior of a Simple Word Copy Loop Before Adjustments

Let’s look at the behavior of the Ibex CPU’s Instruction Fetch (IF) and Instruction Decode (ID) stages while they execute a simple word copy loop.

This is the loop’s disassembly:

34c:       0004a283                lw      t0,0(s1)
350:       00542023                sw      t0,0(s0)
354:       0411                    addi    s0,s0,4
356:       0491                    addi    s1,s1,4
358:       17fd                    addi    a5,a5,-1
35a:       fbed                    bnez    a5,34c
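
For readers who think in C rather than assembly, the loop corresponds roughly to the sketch below. This is an illustration under assumptions: the function name word_copy and its parameters are hypothetical, and the source that produced the disassembly above is not shown in this post, so a compiler may well choose different registers or a different instruction order.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C equivalent of the disassembled loop: copy n 32-bit words.
 * Like the assembly, it decrements and then tests the counter, so it
 * assumes n > 0 on entry. */
void word_copy(uint32_t *dst, const uint32_t *src, size_t n)
{
    do {
        *dst++ = *src++;   /* lw t0,0(s1) ; sw t0,0(s0) ; addi s0 ; addi s1 */
    } while (--n != 0);    /* addi a5,a5,-1 ; bnez a5,<loop start>          */
}
```

Each pass through the do/while body maps onto one pass through the six instructions listed above.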

Here are some relevant signals from the Ibex IF and ID stages making one pass through the loop:

Before making any system changes, Ibex IF and ID stage executing a word copy loop.

The following signals illustrate the behavior of the Ibex two-stage pipeline:

prefetch_buffer_i.instr_* signals show the Instruction Prefetcher fetching instructions from memory:
- instr_addr_o: instruction address output.
- instr_req_o: instruction request output strobe.
- instr_rvalid_i: instruction return data valid input strobe.
prefetch_buffer.addr_o/valid_o shows the IF stage handing over the next instruction to the ID stage.
id_stage.pc_i shows the ID stage’s program counter, i.e. the address of the instruction being executed.
id_stage.instr_is_compressed indicates if the instruction being executed is compressed.
id_stage.id_in_ready_o indicates when the ID stage is ready to receive the next instruction from the IF stage. If the ID stage stalls (e.g. due to a load operation in the Load-Store unit), id_in_ready_o is deasserted until the ID stage is ready to proceed.
id_stage.instr_executing indicates if the ID stage is currently executing an instruction.

The pc_id_i signal shows how the ID stage progresses through the loop. Notice that similar instructions don’t all take the same time to execute:

The lw (load word) instruction takes longer to execute than the sw (store word) instruction.
The 2nd addi (add immediate) instruction takes longer than the first and third addi instructions.

Additionally, the branch instruction takes a long time to complete.

The goal is to analyze this behavior and adjust the system so that execution timing can be predicted directly from the code, without requiring waveform inspection.

16-bit and 32-bit instructions

There are several factors at play here. To start with, the word copy loop contains a mix of 16-bit and 32-bit instructions, but the instruction Prefetcher always fetches one 32-bit word at a time. A prefetched 32-bit word can contain one of the following:

Two 16-bit instructions.
One 32-bit instruction.
One 16-bit instruction and the first half of a 32-bit instruction.
The 2nd half of a 32-bit instruction and a 16-bit instruction.
The 2nd half of a 32-bit instruction and the 1st half of another 32-bit instruction.
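
The five cases above can be reproduced with a small host-side sketch. It relies on one fact from the RISC-V encoding: an instruction whose lowest two bits are 11 is 32 bits (or longer), and anything else is a 16-bit compressed instruction. All type and function names below are hypothetical helpers, not part of Ibex or BoxLambda, and the sketch assumes the halfword stream begins on an instruction boundary at a word-aligned address.

```c
#include <stddef.h>
#include <stdint.h>

/* What a single 16-bit halfword of the instruction stream holds. */
enum half_kind { HALF_C16, HALF_32_FIRST, HALF_32_SECOND };

/* What one aligned 32-bit fetch word holds (the five cases in the list). */
enum word_kind {
    WORD_TWO_16,                  /* two 16-bit instructions            */
    WORD_ONE_32,                  /* one 32-bit instruction             */
    WORD_16_THEN_32_FIRST,        /* 16-bit + first half of a 32-bit    */
    WORD_32_SECOND_THEN_16,       /* second half of a 32-bit + 16-bit   */
    WORD_32_SECOND_THEN_32_FIRST  /* second half + first half of 32-bit */
};

/* Low two bits != 0b11 means a 16-bit compressed instruction. */
static int is_compressed(uint16_t first_halfword)
{
    return (first_halfword & 0x3u) != 0x3u;
}

/* Walk the stream and label every halfword. halves[0] must start an
 * instruction; halfwords are in little-endian memory order. */
void label_halves(const uint16_t *halves, size_t n, enum half_kind *kinds)
{
    size_t i = 0;
    while (i < n) {
        if (is_compressed(halves[i])) {
            kinds[i++] = HALF_C16;
        } else {
            kinds[i++] = HALF_32_FIRST;
            if (i < n) {
                kinds[i++] = HALF_32_SECOND;
            }
        }
    }
}

/* Classify a fetch word from its lower-address and higher-address halves. */
enum word_kind classify_word(enum half_kind lo, enum half_kind hi)
{
    if (lo == HALF_C16) {
        return (hi == HALF_C16) ? WORD_TWO_16 : WORD_16_THEN_32_FIRST;
    }
    if (lo == HALF_32_FIRST) {
        return WORD_ONE_32;
    }
    return (hi == HALF_C16) ? WORD_32_SECOND_THEN_16
                            : WORD_32_SECOND_THEN_32_FIRST;
}
```

Feeding in the halfwords of the loop shown earlier, starting at the word-aligned address 0x34c (0xa283, 0x0004, 0x2023, 0x0054, 0x0411, 0x0491, 0x17fd, 0xfbed), classifies the four fetch words as one 32-bit instruction, one 32-bit instruction, two 16-bit instructions, and two 16-bit instructions. In that particular layout no instruction straddles a fetch word, but shifting the code by one halfword would change that.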

All of these combinations have an impact on the cycle count of the instructions involved. This impact is predictable based on the instruction sequence. However, I prefer to keep things simple (another core BoxLambda requirement), so I’m going to switch to all uncompressed instructions by setting the compiler’s -march flag to rv32im instead of rv32imc:

41c:       0004a283                lw      t0,0(s1)
420:       00542023                sw      t0,0(s0)
424:       00440413                addi    s0,s0,4
428:       00448493                addi    s1,s1,4
42c:       fff78793                addi    a5,a5,-1
430:       fe0796e3                bnez    a5,41c

The assembly code is the same as before, but all instructions are 32-bit now.

The waveform of the CPU IF and ID stages making one pass through this code now looks like this:

Ibex IF Prefetcher and ID stage executing a word copy loop. All 32-bit instructions.

It’s now easier to match up what’s going on in the ID stage with the IF stage just before. There are still irregularities in the instruction cycle counts, however. Also, the instruction cycle counts are quite high.

Bypassing the Crossbar

Looking at the Architecture Block Diagram at the beginning of the post, you’ll see that instruction fetches and data accesses go through the Wishbone Crossbar. There are two problems with this approach:

The Crossbar prioritizes throughput over latency. It can move a lot of data in systems that support many outstanding transactions. In the case of setups such as BoxLambda, however, where each transaction is blocking, the Crossbar is slow.
There is a one-clock cycle cost for channel switching. An instruction fetch transaction completes faster when executed back-to-back with the previous transaction, without deasserting CYC. However, if the ID stage stalls the IF stage, preventing back-to-back IF transactions, the instruction fetch
