All gateware components are now in place on the BoxLambda SoC, but the system is not yet behaving as required. A key requirement of BoxLambda is deterministic behavior. The duration of operations such as internal memory or register access must be predictable by design. When analyzing a snippet of assembly code, you should be able to predict exactly how many clock cycles it will take to execute, without relying on statistics. That’s what this post is about.
Recap
Here’s a summary of the current state of BoxLambda:
- An Ibex RISC-V core with machine timer and hardware interrupt support.
- Wishbone interconnect and Harvard architecture internal memory.
- DDR3 external memory access through the Litex memory controller.
- OpenOCD-based debug access on FPGA and Verilator.
- VERA-based VGA graphics: 2 layers tile or bitmap mode, 2 banks of 64 sprites, 128KB Video RAM, 256 color palette.
- Dual YM2149 PSG Audio.
- SD Card Controller and FatFs File System.
- 24-pin GPIO, UART, SPI Flash Controller, I2C Controller.
- Real-time Clock and Calendar (RTCC) support.
- USB HID Keyboard and Mouse support.
- A PicoRV32-based Programmable DMA Controller.
- A Picolibc-based standard C environment for software running on the Ibex RISC-V core.
- DFX Partial FPGA Reconfiguration support.
- A Base and a DFX configuration targeting the Arty-A7-100T FPGA development board.
- A suite of test applications covering all SoC components, running on both FPGA and Verilator.
- Automated testing on Verilator and CocoTB.
- A Linux CMake and Bender-based Software and Gateware build system.
Behavior of a Simple Word Copy Loop Before Adjustments
Let’s look at the behavior of the Ibex CPU’s Instruction Fetch (IF) and Instruction Decode (ID) stages in the case of a simple word copy loop.
This is the loop’s disassembly:
34c: 0004a283 lw t0,0(s1)
350: 00542023 sw t0,0(s0)
354: 0411 addi s0,s0,4
356: 0491 addi s1,s1,4
358: 17fd addi a5,a5,-1
35a: fbed bnez a5,34c
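For reference, here is a minimal C sketch of a loop with the same shape as the disassembly above. The original source of the loop isn't shown in this post, so the function name and signature are assumptions. The comments map each statement to the instruction it corresponds to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy num_words 32-bit words from src to dst.
 * Like the disassembled loop, the counter is only tested after the
 * first copy, so num_words must be greater than zero. */
static void word_copy(uint32_t *dst, const uint32_t *src, size_t num_words)
{
    assert(num_words > 0);
    do {
        *dst = *src;   /* lw t0,0(s1) followed by sw t0,0(s0) */
        dst++;         /* addi s0,s0,4 */
        src++;         /* addi s1,s1,4 */
        num_words--;   /* addi a5,a5,-1 */
    } while (num_words != 0); /* bnez a5,<start of loop> */
}
```

A compiler is free to schedule or encode this differently, so treat the mapping as illustrative rather than as the exact build output.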
Here are some relevant signals from the Ibex IF and ID stages making one pass through the loop:
Ibex IF and ID stages executing a word copy loop, before making any system changes.
The following signals illustrate the behavior of the Ibex two-stage pipeline:
- prefetch_buffer_i.instr_* signals show the Instruction Prefetcher fetching instructions from memory:
- instr_addr_o: instruction address output.
- instr_req_o: instruction request output strobe.
- instr_rvalid_i: instruction return data valid input strobe.
- prefetch_buffer.addr_o/valid_o show the IF stage handing over the next instruction to the ID stage.
- id_stage.pc_i shows the ID stage’s program counter, i.e. the address of the instruction being executed.
- id_stage.instr_is_compressed indicates if the instruction being executed is compressed.
- id_stage.id_in_ready_o indicates when the ID stage is ready to receive the next instruction from the IF stage. If the ID stage stalls (e.g. due to a load operation in the Load-Store unit), id_in_ready_o is deasserted until the ID stage is ready to proceed.
- id_stage.instr_executing indicates if the ID stage is currently executing an instruction.
The pc_id_i signal shows how the ID stage progresses through the loop. Notice that similar instructions have different execution times:
- the lw (load word) instruction takes longer to execute than the sw (store word) instruction.
- the 2nd addi (add immediate) instruction takes longer than the first and third addi instructions.
Additionally, the branch instruction takes a long time to complete.
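Reading per-instruction cycle counts off the waveform amounts to counting how many consecutive clock cycles the ID stage's program counter holds the same value. The following C sketch shows that bookkeeping. It assumes the PC has been sampled once per clock cycle into an array; the type and function names are hypothetical and not part of any BoxLambda tooling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One record per run of identical PC values in the trace. */
typedef struct {
    uint32_t pc;      /* address of the instruction in the ID stage */
    unsigned cycles;  /* number of consecutive cycles it stayed there */
} pc_cycles_t;

/* pc_trace[i] is the ID stage PC sampled on clock cycle i.
 * Collapses runs of equal PC values into (pc, cycles) records.
 * Returns the number of records written, at most max_out.
 * Note: an instruction that branches to itself would be merged
 * into a single run by this simple approach. */
static size_t tally_cycles(const uint32_t *pc_trace, size_t n,
                           pc_cycles_t *out, size_t max_out)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (count > 0 && out[count - 1].pc == pc_trace[i]) {
            out[count - 1].cycles++;   /* same instruction, one more cycle */
        } else {
            if (count == max_out) {
                break;                 /* output buffer is full */
            }
            out[count].pc = pc_trace[i];
            out[count].cycles = 1;
            count++;
        }
    }
    return count;
}
```

Fed with a trace exported from the simulator, the resulting records make it easy to compare the cycle counts of the lw, sw and addi instructions side by side.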
The goal is to analyze this behavior and adjust the system so that execution timing can be predicted directly from the code, without requiring waveform inspection.
16-bit and 32-bit instructions
There are several factors at play here. To start with, the word copy loop contains a mix of 16-bit and 32-bit instructions, but the Instruction Prefetcher always fetches one 32-bit word at a time. A prefetched 32-bit word can contain one of the following:
- Two 16-bit instructions.
- One 32-bit instruction.
- One 16-bit instruction and the first half of a 32-bit instruction.
- The 2nd half of a 32-bit instruction and a 16-bit instruction.
- The 2nd half of a 32-bit instruction and the 1st half of another 32-bit instruction.
All of these combinations have an impact on the cycle count of the instructions involved. This impact is predictable based on the instruction sequence. However, I prefer to keep things simple (another core BoxLambda requirement), so I’m going to switch to all uncompressed instructions by setting the -march CFLAG to rv32im instead of rv32imc:
41c: 0004a283 lw t0,0(s1)
420: 00542023 sw t0,0(s0)
424: 00440413 addi s0,s0,4
428: 00448493 addi s1,s1,4
42c: fff78793 addi a5,a5,-1
430: fe0796e3 bnez a5,41c
The assembly code is the same as before, but all instructions are 32-bit now.
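Since the timing analysis now depends on the code being built without compressed instructions, it can be useful to check that setting at compile time. The sketch below relies on the __riscv_compressed predefined macro, which GCC and Clang define when the C extension is enabled in -march. BOXLAMBDA_REQUIRE_UNCOMPRESSED is a hypothetical opt-in macro made up for this example and is not part of the BoxLambda build system.

```c
#include <stdbool.h>

/* Returns true if this file was compiled with the RISC-V compressed (C)
 * extension enabled, e.g. -march=rv32imc, and false otherwise
 * (e.g. -march=rv32im, or a non-RISC-V host build). */
static inline bool built_with_compressed_isa(void)
{
#if defined(__riscv_compressed)
    return true;
#else
    return false;
#endif
}

/* Optional hard guard (hypothetical macro name): stop the build if
 * compressed instructions are enabled where they are not wanted. */
#if defined(BOXLAMBDA_REQUIRE_UNCOMPRESSED) && defined(__riscv_compressed)
#error "Built with the C extension; use -march=rv32im for uncompressed code"
#endif
```

The function form allows a runtime sanity check or log message, while the #error form catches a wrong -march setting before anything gets loaded onto the target.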
The waveform of the CPU IF and ID stages making one pass through this code now looks like this:
Ibex IF Prefetcher and ID stage executing a word copy loop. All 32-bit instructions.
It’s now easier to match up what’s going on in the ID stage with the IF stage just before. There are still irregularities in the instruction cycle counts, however. Also, the instruction cycle counts are quite high.
Bypassing the Crossbar
Looking at the Architecture Block Diagram at the beginning of the post, you’ll see that instruction fetches and data accesses go through the Wishbone Crossbar. There are two problems with this approach:
- The Crossbar prioritizes throughput over latency. It can move a lot of data in systems that support many outstanding transactions. In the case of setups such as BoxLambda, however, where each transaction is blocking, the Crossbar is slow.
- There is a one-clock cycle cost for channel switching. An instruction fetch transaction completes faster when executed back-to-back with the previous transaction, without deasserting CYC. However, if the ID stage stalls the IF stage, preventing back-to-back IF transactions, the instruction fetch