Preamble
I don’t really have much experience with microcontrollers. I’ve played around with some arduinos before and the main entry point for my home network is a Raspberry Pi, but that’s about it for recent experience. I did take a single course on microcontrollers a few years back, and I was hilariously bad at it, barely reaching a passing grade. Nonetheless, I am fascinated by them – they’re low powered devices that we can program to make almost anything happen, as long as we’re a little careful with resource management and don’t shoot ourselves in the foot.
One thing that is always implicitly assumed when talking about julia is the requirement for a runtime and garbage collector. Most of the time, optimizing julia (or any code, really) comes down to two things:
1. minimize the time spent running code you didn’t write
2. have as much code you want to run compiled to the native instructions of where you want to run it
Requirement 1) results more or less in “don’t talk to runtime & GC if you don’t have to” and 2) boils down to “make sure you don’t run unnecessary code, like an interpreter” – i.e. statically compile your code and avoid dynamicness wherever you can.[1]
I’m already used to 1) from regular optimization work when helping people on Slack and Discourse, and with better static compilation support inching ever closer over the past few years (and me procrastinating writing my bachelor’s thesis last week), I thought to myself:
- Julia is based on LLVM and is basically already a compiled language.
- You’ve got some old arduinos lying around.
- You know those take in some AVR blob to run as their code.
- LLVM has an AVR backend.
and the very next thought I had was “that can’t be too difficult to get to work, right?”.
This is the (unexpectedly short) story of how I got julia code to run on an arduino.
[1] Funnily enough, once you’re looking for it, you can find these concepts everywhere. For example, you want to minimize the number of times you talk to the linux kernel on an OS, since context switches are expensive. You also want to call into fast native code as often as possible, as is done in python by calling into C when performance is required.
So, what are we dealing with? Well, even Arduino doesn’t sell these anymore:
This is an Arduino Ethernet R3, a variation on the common Arduino UNO. It’s the third revision, boasting an ATmega328p, an ethernet port, a slot for an SD card, as well as 14 I/O pins, most of which are reserved. It has 32KiB of flash memory, 2KiB SRAM and 1KiB EEPROM. Its clock runs at a measly 16 MHz, there’s a serial interface for an external programmer, and it weighs 28g.
With this documentation, the schematic for the board, the datasheet for the microcontroller and a good amount of “you’ve done harder things before”, I set out to achieve the simplest goal imaginable: let the LED labeled L9 (see the lower left corner of the board in the image above, right above the ON LED above the power connector) blink.
For comparison’s sake, and to have a working implementation to check our arduino with, here’s a C implementation of what we’re trying to do:
#include <avr/io.h>
#include <util/delay.h>

#define MS_DELAY 3000

int main (void) {
    DDRB |= _BV(DDB1);

    while(1) {
        PORTB |= _BV(PORTB1);
        _delay_ms(MS_DELAY);
        PORTB &= ~_BV(PORTB1);
        _delay_ms(MS_DELAY);
    }
}
This short piece of code does a few things. It first configures our LED pin as an output, which we can do by setting pin DDB1[2] in DDRB (which is a contraction of “Data Direction Register Port B” – it controls whether a given I/O pin is interpreted as input or output). After that, it enters an infinite loop, where we first set our pin PORTB1 on PORTB to HIGH (or 1) to instruct our controller to power the LED. We then wait for MS_DELAY milliseconds, or 3 seconds. Then, we unpower the LED by setting the same PORTB1 pin to LOW (or 0). Compiling & flashing this code like so[3]:
avr-gcc -Os -DF_CPU=16000000UL -mmcu=atmega328p -c -o blink_led.o blink_led.c
avr-gcc -mmcu=atmega328p -o blink_led.elf blink_led.o
avr-objcopy -O ihex blink_led.elf blink_led.hex
avrdude -V -c arduino -p ATMEGA328P -P /dev/ttyACM0 -U flash:w:blink_led.hex
results in a nice, blinking LED.
These few shell commands compile our .c source code to an .o object file targeting our microcontroller, link it into an .elf, translate that to the Intel .hex format the controller expects and finally flash it to the controller with the appropriate settings for avrdude. Pretty basic stuff. It shouldn’t be hard to translate this, so where’s the catch?
Well, most of the code above is not even C, but C preprocessor directives tailored to do exactly what we mean to do. We can’t make use of them in julia and we can’t import those .h files, so we’ll have to figure out what they mean. I haven’t checked, but I think not even _delay_ms is a function.
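The good news: there isn’t much magic hiding behind them. _BV(b) essentially just builds a bit mask (1 << b), and the |= / &= ~ dance is ordinary read-modify-write bit twiddling, which julia can express directly. A small sketch of just the bit manipulation part (the helper names here are made up; the actual register access is the interesting bit, and we’ll get to that below):

# what _BV(b) boils down to: a mask with only bit b set
bv(b) = 0x01 << b

# PORTB |= _BV(PORTB1): set bit 1, leaving all other bits alone
set_bit(reg::UInt8, b) = reg | bv(b)

# PORTB &= ~_BV(PORTB1): clear bit 1, leaving all other bits alone
clear_bit(reg::UInt8, b) = reg & ~bv(b)

set_bit(0b00000000, 1)   # == 0b00000010
clear_bit(0b11111111, 1) # == 0b11111101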
On top of this, we don’t have a convenient existing avr-gcc to compile julia to AVR for us. However, if we manage to produce a .o file, we should be able to make the rest of the existing toolchain work for us – after all, avr-gcc can’t tell the difference between a julia-created .o and an avr-gcc-created .o.
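In fact, since we’ll be living in julia anyway, nothing stops us from driving those last toolchain steps from julia itself. A minimal sketch, assuming the avr-gcc toolchain and avrdude are installed and the board is connected on /dev/ttyACM0 (the function name and structure are purely illustrative):

# hypothetical helper: link an object file, convert it to Intel HEX and flash it
function link_and_flash(objfile::String; port = "/dev/ttyACM0")
    elf = replace(objfile, r"\.o$" => ".elf")
    hex = replace(objfile, r"\.o$" => ".hex")
    run(`avr-gcc -mmcu=atmega328p -o $elf $objfile`)                     # link
    run(`avr-objcopy -O ihex $elf $hex`)                                 # translate to .hex
    run(`avrdude -V -c arduino -p ATMEGA328P -P $port -U flash:w:$hex`)  # flash
end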
[2] Finding the right pin & port took a while. The documentation states that the LED is connected to “digital pin 9”, which is supported by the label L9 next to the LED itself. It then goes on to say that on most of the arduino boards, this LED is placed on pin 13, which is used for SPI on mine instead. This is confusing, because the datasheet for our board connects this LED to pin 13 (PB1, port B bit 1) on the controller, which has a split trace leading to pin 9 of the J5 pinout. I mistakenly thought “pin 9” referred to the microcontroller, and tried to control the LED through PD5 (port D, bit 5) for quite some time before I noticed my mistake. The upside was that I now had a known-good piece of code that I could compare to – even on the assembly level.
[3] -DF_CPU=16000000UL is required for _delay_ms to figure out how to translate from milliseconds to “number of cycles required to wait” in our loops. While it’s nice to have, it’s not really required – we only have to wait some visibly distinct amount to notice the blinking, and as such, I’ve skipped implementing this in the julia version.
A first piece of julia pseudocode
So with all that in mind, let’s sketch out what we think our code should look like:
const DDRB = ??
const PORTB = ??

function main()
    set_high(DDRB, DDB1)

    while true
        set_high(PORTB, PORTB1)
        for _ in 1:500000
        end
        set_low(PORTB, PORTB1)
        for _ in 1:500000
        end
    end
end
From a high level, it’s almost exactly the same. Set bits, busy loop, unset bits, loop. I’ve marked all places where we have to do something, though we don’t know exactly what yet, with ??. All of these places are a bit interconnected, so let’s dive in with the first big question: how can we replicate what the C-macros DDRB, DDB1, PORTB and PORTB1 end up doing?
Datasheets & Memory Mapping
To answer this we first have to take a step back, forget that these are defined as macros in C and think back to what they represent. Both DDRB and PORTB reference specific I/O registers in our microcontroller. DDB1 and PORTB1 refer to the (zero-based) 1st bit of the respective register. In theory, we only have to set these bits in the registers above to make the controller blink our little LED. How do you set a bit in a specific register though? This has to be exposed to a high level language like C somehow. In assembly code we’d just access the register natively, but save for inline assembly, we can’t do that in either C or julia.
When we take a look into our microcontroller’s datasheet, we notice that there’s a chapter, 36. Register Summary, from page 621 onwards. This section is a register reference table. It has an entry for each register, specifying an address, a name, the name of each bit, as well as the page in the datasheet where further documentation, such as initial values, can be found. Scrolling to the end, we find what we’ve been looking for:
| Address | Name | Bit 7 | Bit 6 | Bit 5 | Bit 4 | Bit 3 | Bit 2 | Bit 1 | Bit 0 | Page |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0x05 (0x25) | PORTB | PORTB7 | PORTB6 | PORTB5 | PORTB4 | PORTB3 | PORTB2 | PORTB1 | PORTB0 | 100 |
| 0x04 (0x24) | DDRB | DDB7 | DDB6 | DDB5 | DDB4 | DDB3 | DDB2 | DDB1 | DDB0 | 100 |
So PORTB is mapped to addresses 0x05 and 0x25, while DDRB is mapped to addresses 0x04 and 0x24. Which memory are those addresses referring to? We have EEPROM, flash memory as well as SRAM, after all. Once again, the datasheet comes to our help: chapter 8. AVR Memories has a short section on our SRAM, with a very interesting figure:
as well as this explanation:
The first 32 locations [of SRAM] address the Register File, the next 64 locations the standard I/O memory, then 160 locations of Extended I/O memory, and the next 512/1024/1024/2048 locations address the internal data SRAM.
So the addresses we got from the register summary actually correspond 1:1 to SRAM addresses[4]. Neat!
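This also explains the two numbers per register in the summary: the first is the address in the I/O address space, while the one in parentheses is where the same register shows up in the data (SRAM) address space, offset by the 32 bytes of the register file. Since we’ll be accessing the registers through plain loads and stores, the parenthesized addresses are the ones we want. A quick sanity check (the helper name is just for illustration):

# I/O addresses from the register summary are offset by 0x20 (the 32 bytes
# of the register file) when accessed through the data address space
io_to_sram(io_addr) = io_addr + 0x20

io_to_sram(0x04) == 0x24 == 36  # DDRB
io_to_sram(0x05) == 0x25 == 37  # PORTB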
Translating what we’ve learned into code, our prototype now looks like this:
const DDRB = Ptr{UInt8}(36)   # 0x24
const PORTB = Ptr{UInt8}(37)  # 0x25
const DDB1 = 0b00000010
const PORTB1 = 0b00000010

function main_pointers()
    # configure the LED pin as an output
    unsafe_store!(DDRB, DDB1)

    while true
        # set the pin high to power the LED
        pb = unsafe_load(PORTB)
        unsafe_store!(PORTB, pb | PORTB1)
        for _ in 1:500000
        end
        # set the pin low again
        pb = unsafe_load(PORTB)
        unsafe_store!(PORTB, pb & ~PORTB1)
        for _ in 1:500000
        end
    end
end

builddump(main_pointers, Tuple{})
We can write to a register by storing some data at its address, and read from it by loading from that same address.
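If we wanted to keep the set_high / set_low helpers from the earlier pseudocode around, they’d just be thin wrappers over this load/store pattern, a possible (purely illustrative) version being:

# hypothetical helpers matching the earlier pseudocode sketch
set_high(reg::Ptr{UInt8}, bit::UInt8) = unsafe_store!(reg, unsafe_load(reg) | bit)
set_low(reg::Ptr{UInt8}, bit::UInt8) = unsafe_store!(reg, unsafe_load(reg) & ~bit)

# e.g. set_high(PORTB, PORTB1) instead of spelling out the load & store by hand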
In one fell swoop, we got rid of all of our ?? at once! This code now seemingly has everything the C version has, so let’s start on the biggest unknown: how do we compile this?
[4] This is in contrast to more high level systems like an OS kernel, which uses virtual memory and paging to give the illusion of being on the “baremetal” machine and handling raw pointers.
Compiling our code
Julia has for quite some time now run on more than just x86(_64) – it also has support for Linux as well as macOS on ARM. These are, in large part, possible due to LLVM supporting ARM. However, there is one other large space where julia code can run directly: GPUs. For a while now, the package GPUCompiler.jl has done a lot of work to compile julia down to NVPTX and AMDGPU, the NVidia and AMD specific architectures supported by LLVM. Because GPUCompiler.jl interfaces with LLVM directly, we can hook into this same mechanism to have it produce AVR instead – the interface is extensible!
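To give a flavour of what “extensible” means here: GPUCompiler.jl asks a target to answer a handful of questions, chief among them which LLVM target triple to generate code for. Roughly sketched (the struct names are made up and the real setup needs a few more methods than this, but it shows the shape of the interface):

using GPUCompiler

# hypothetical sketch of a custom compilation target for AVR
struct Arduino <: GPUCompiler.AbstractCompilerTarget end
struct ArduinoParams <: GPUCompiler.AbstractCompilerParams end

# the LLVM target triple is what ultimately selects the AVR backend
GPUCompiler.llvm_triple(::Arduino) = "avr-unknown-unknown"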
Configuring LLVM
The default julia install does not come with the AVR backend of LLVM enabled, so we have to build both LLVM and julia ourselves. Be sure to do this on one of the 1.8 betas, like v1.8.0-beta3; more recent commits currently break GPUCompiler.jl, though that should be fixed in the future as well.
Julia luckily already supports building its dependencies, so we just have to make a few changes to two Makefiles, enabling the backend:
@@ -60,7 +60,7 @@ endif
LLVM_LIB_FILE := libLLVMCodeGen.a
# Figure out which targets to build
-LLVM_TARGETS := host;NVPTX;AMDGPU;WebAssembly;BPF
+LLVM_TARGETS := host;NVPTX;AMDGPU;WebAssembly;BPF;AVR
LLVM_EXPERIMENTAL_TARGETS :=
LLVM_CFLAGS :=
and instruct julia not to use the prebuilt LLVM by setting a flag in Make.user:
USE_BINARYBUILDER_LLVM=0
Now, after running make to start the build process, LLVM is downloaded, patched & built from source and made available to our julia code. The whole LLVM compilation took about 40 minutes on