SBCL: The Assembly Code Breadboard by medo-bear

EDIT: Lutz Euler points out that the NEXT sequence (used to) encode
an effective address with an index register but no base. The mistake
doesn’t affect the meaning of the instruction, but forces a wasteful
encoding. The difference in machine code is as follows.

Before (16 bytes):

;       03:       8B043D00000000   MOV EAX, [RDI] ; _5_ useless bytes!
;       0A:       4883C704         ADD RDI, 4
;       0E:       4801F0           ADD RAX, RSI
;       11:       FFE0             JMP RAX

Now (11 bytes):

;       93:       8B07             MOV EAX, [RDI]
;       95:       4883C704         ADD RDI, 4
;       99:       4801F0           ADD RAX, RSI
;       9C:       FFE0             JMP RAX

I fixed the definition of NEXT, but not the disassembly snippets
below; they still show the old machine code.

Earlier this week, I took another look at the
F18. As usual with Chuck Moore’s
work, it’s hard to tell the difference between insanity and mere
brilliance ;) One thing that struck me is how small the stack is: 10
slots, with no fancy overflow/underflow trap. The rationale is that,
if you need more slots, you’re doing it wrong, and that silent
overflow is useful when you know what you’re doing. That certainly
jibes with my experience on the HP-41C and with x87. It also reminds
me of a
post of djb’s decrying our misuse of x87’s rotating stack:
his thesis was that, with careful scheduling, a “free” FXCH makes
the stack equivalent – if not superior – to registers. The post
ends with a (non-pipelined) loop that wastes no cycle on shuffling
data, thanks to the x87’s implicit stack rotation.

That led me to wonder what implementation techniques become available
for stack-based VMs that restrict their stack to, e.g., 8 slots.
Obviously, it would be ideal to keep everything in registers.
However, if we do that naïvely, push and pop become a lot more
complicated; there’s a reason why Forth engines usually cache only the
top 1-2 elements of the stack.

I decided to mimic the x87 and the F18 (EDIT: modulo the latter’s two
TOS cache registers): pushing/popping doesn’t cause any data movement.
Instead, they decrement/increment a modular
counter that points to the top of the stack (TOS). That would still be
slow in software (most ISAs can’t index registers). The key is that
the counter can’t take too many values: only 8 values if there are 8
slots in the stack. Stack VMs already duplicate primops for performance
reasons (e.g., to help the BTB by spreading out execution of the same
primitive between multiple addresses), so it seems reasonable to specialise
primitives for all 8 values the stack counter can take.
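
The modular counter can be modelled in a few lines of portable Common Lisp. This is only a sketch under stated assumptions: a vector stands in for the eight registers, and the names *slots*, *top*, model-push and model-pop are hypothetical, not part of the post's code.

```lisp
;; Hypothetical model: a vector stands in for the eight stack registers.
(defparameter *slots* (make-array 8 :initial-element 0))
(defparameter *top* 0) ; index of the slot that currently holds TOS

(defun model-push (value)
  ;; Growing the stack only moves the counter (downwards, modulo 8);
  ;; no existing slot is copied or shifted.
  (setf *top* (mod (1- *top*) (length *slots*)))
  (setf (aref *slots* *top*) value))

(defun model-pop ()
  ;; Shrinking the stack returns TOS and moves the counter back up.
  (prog1 (aref *slots* *top*)
    (setf *top* (mod (1+ *top*) (length *slots*)))))
```

Pushing a ninth value in this model silently overwrites the oldest slot, which mirrors the trap-free overflow described above.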

In a regular direct threaded VM, most primops would end with a code
sequence that jumps to the next one (NEXT), something like

add rsi, 8 ; increment virtual IP before jumping
jmp [rsi-8] ; jump to the address RSI previously pointed to

where rsi is the virtual instruction pointer, and VM instructions
are simply pointers to the machine code for the relevant primitive.

I’ll make two changes to this sequence. I don’t like hardcoding
addresses in bytecode, and 64 bits per virtual instruction is overly
wasteful. Instead, I’ll encode offsets from the primop code block:

mov eax, [rsi]
add rsi, 4
add rax, rdi
jmp rax

where rdi is the base address for primops.
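
To make the encoding concrete, here is a rough portable model of bytecode stored as 32-bit offsets and of the fetch step that turns one back into a jump target. The function names encode-program and fetch-target are hypothetical, and little-endian packing is assumed because the target is x86-64.

```lisp
(defun encode-program (offsets)
  "Pack OFFSETS (primop offsets from the code base) as little-endian
32-bit words in a byte vector."
  (let ((bytes (make-array (* 4 (length offsets))
                           :element-type '(unsigned-byte 8)))
        (i 0))
    (dolist (offset offsets bytes)
      (dotimes (k 4)
        (setf (aref bytes (+ i k)) (ldb (byte 8 (* 8 k)) offset)))
      (incf i 4))))

(defun fetch-target (bytes ip base)
  "Model of the four-instruction sequence: load a 32-bit offset at IP,
advance IP by 4, and add BASE. Returns the jump target and the new IP."
  (let ((offset 0))
    (dotimes (k 4)
      (setf (ldb (byte 8 (* 8 k)) offset) (aref bytes (+ ip k))))
    (values (+ base offset) (+ ip 4))))
```

With a base of #x1000, fetching the second word of (encode-program '(0 32 64)) yields target 4128 and a new IP of 8. Each virtual instruction costs four bytes instead of eight, and the bytecode stays valid wherever the primop block is loaded.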

I also need to dispatch based on the new value of the implicit stack
counter. I decided to make the dispatch as easy as possible by
storing the variants of each primop at regular intervals (e.g. one
page). I rounded that up to 64 * 67 = 4288 bytes to minimise
aliasing accidents. NEXT becomes something like

mov eax, [rsi]
add rsi, 4
lea rax, [rax + rdi + variant_offset]
jmp rax

The trick is that variant_offset = 4288 * stack_counter, and the
stack counter is (usually) known when the primitive is compiled. If
the stack is left as is, so is the counter; pushing a value decrements
the counter and popping one increments it.
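
The arithmetic behind variant_offset is small enough to check by hand. The sketch below assumes eight slots and the 4288-byte stride from the text; *variant-stride*, variant-offset and next-target are hypothetical names used only for illustration.

```lisp
(defparameter *variant-stride* (* 64 67)) ; 4288 bytes, as in the text

(defun variant-offset (stack-counter &optional (slots 8))
  ;; The counter is taken modulo the number of slots, so a push from
  ;; counter 0 (that is, -1) selects the last variant.
  (* *variant-stride* (mod stack-counter slots)))

(defun next-target (code-base primop-offset stack-counter)
  ;; Address of the variant of a primop specialised for STACK-COUNTER.
  (+ code-base primop-offset (variant-offset stack-counter)))
```

(variant-offset 1) is 4288 and (variant-offset -1) is 30016, the two displacements that show up in the LEA instructions of the disassembly listings further down.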

That seems reasonable enough. Let’s see if we can make it work.

I want to explore a problem for which I’ll emit a lot of repetitive
machine code. SLIME’s REPL and SBCL’s assembler are perfect for the
task! (I hope it’s clear that I’m using unsupported internals; if it
breaks, you keep the pieces.)

The basic design of the VM is:

r8–r15: stack slots (32 bits);
rsi: base address for machine code primitives;
rdi: virtual instruction pointer (points to the next instruction);
rax,rbx,rcx,rdx: scratch registers;
rsp: (virtual) return stack pointer.

(import '(sb-assem:inst sb-vm::make-ea)) ; we'll use these two a lot

;; The backing store for our stack
(defvar *stack* (make-array 8 :initial-contents (list sb-vm::r8d-tn
                                                      sb-vm::r9d-tn
                                                      sb-vm::r10d-tn
                                                      sb-vm::r11d-tn
                                                      sb-vm::r12d-tn
                                                      sb-vm::r13d-tn
                                                      sb-vm::r14d-tn
                                                      sb-vm::r15d-tn)))

;; The _primop-generation-time_ stack pointer
(defvar *stack-pointer*)

;; (@ 0) returns the (current) register for TOS, (@ 1) returns
;; the one just below, etc.
(defun @ (i)
  (aref *stack* (mod (+ i *stack-pointer*) (length *stack*))))

(defvar *code-base* sb-vm::rsi-tn)
(defvar *virtual-ip* sb-vm::rdi-tn)

(defvar *rax* sb-vm::rax-tn)
(defvar *rbx* sb-vm::rbx-tn)
(defvar *rcx* sb-vm::rcx-tn)
(defvar *rdx* sb-vm::rdx-tn)

;; Variants are *primitive-code-offset* bytes apart
(defvar *primitive-code-offset* (* 64 67))

;; Each *stack-pointer* value gets its own code page
(defstruct code-page
  (alloc 0) ; index of the next free byte.
  (code (make-array *primitive-code-offset* :element-type '(unsigned-byte 8))))
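
How @ picks a register is easiest to see without the assembler in the way. The following sketch replaces the register TNs with plain symbols; *model-stack*, *model-stack-pointer* and model-@ are hypothetical stand-ins for the definitions above.

```lisp
;; Symbols stand in for the SBCL register TNs.
(defparameter *model-stack* #(r8d r9d r10d r11d r12d r13d r14d r15d))
(defvar *model-stack-pointer* 0)

(defun model-@ (i)
  ;; Same indexing as @: slot I below TOS, modulo the stack size.
  (aref *model-stack*
        (mod (+ i *model-stack-pointer*) (length *model-stack*))))
```

With the pointer at 0, (model-@ 0) is R8D and (model-@ 1) is R9D. After a push from 0 the pointer is -1, so (model-@ 0) wraps around to R15D and (model-@ 1) is R8D.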

The idea is that we’ll define functions to emit assembly code for each
primitive; these functions will be implicitly parameterised on
*stack-pointer* thanks to @. We can then call them as many times
as needed to cover all values of *stack-pointer*. The only
complication is that code sequences will differ in length, so we must
insert padding to keep everything in sync. That’s what emit-code
does:

(defun emit-code (pages emitter)
  ;; there must be as many code pages as there are stack slots
  (assert (= (length *stack*) (length pages)))
  ;; find the rightmost starting point, and align to 16 bytes
  (let* ((alloc (logandc2 (+ 15 (reduce #'max pages :key #'code-page-alloc))
                          15))
         (bytes (loop for i below (length pages)
                      for page = (elt pages i)
                      collect (let ((segment (sb-assem:make-segment))
                                    (*stack-pointer* i))
                                ;; assemble the variant for this value
                                ;; of *stack-pointer* in a fresh code
                                ;; segment
                                (sb-assem:assemble (segment)
                                  ;; but first, insert padding
                                  (sb-vm::emit-long-nop segment (- alloc (code-page-alloc page)))
                                  (funcall emitter))
                                ;; tidy up any backreference
                                (sb-assem:finalize-segment segment)
                                ;; then get the (position-independent) machine
                                ;; code as a vector of bytes
                                (sb-assem:segment-contents-as-vector segment)))))
    ;; finally, copy each machine code sequence to the right code page
    (map nil (lambda (page bytes)
               (let ((alloc (code-page-alloc page)))
                 (replace (code-page-code page) bytes :start1 alloc)
                 (assert (<= (+ alloc (length bytes)) (length (code-page-code page))))
                 (setf (code-page-alloc page) (+ alloc (length bytes)))))
         pages bytes)
    ;; and return the offset for that code sequence
    alloc))
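
The alignment and padding computation in emit-code can be isolated as plain arithmetic. This is a sketch, not a replacement for the function above: align-up and padding-per-page are hypothetical helpers, and the page fill levels are passed in as a list of integers.

```lisp
(defun align-up (n &optional (alignment 16))
  ;; Round N up to a multiple of ALIGNMENT (assumed to be a power of two).
  (logandc2 (+ n (1- alignment)) (1- alignment)))

(defun padding-per-page (allocs)
  "Given the next free byte of each page, return the common aligned
start offset and the number of padding bytes each page needs."
  (let ((start (align-up (reduce #'max allocs))))
    (values start
            (mapcar (lambda (alloc) (- start alloc)) allocs))))
```

For fill levels of 19 and 24 bytes, (padding-per-page '(19 24)) returns a start of 32 with paddings (13 8). Every page pays for the longest variant, but all variants of a primitive begin at the same offset within their page.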

This function is used by emit-all-code to emit the machine code for
a bunch of primitives, while tracking the start offset for each
primitive.

(defun emit-all-code (&rest emitters)
  (let ((pages (loop repeat (length *stack*)
                     for page = (make-code-page)
                     ;; prefill everything with one-byte NOPs
                     do (fill (code-page-code page) #x90)
                     collect page)))
    (values (mapcar (lambda (emitter)
                      (emit-code pages emitter))
                    emitters)
            pages)))

Now, the pièce de résistance:

(defun next (&optional offset)
  (setf offset (or offset 0)) ; accommodate primops that frob IP
  (let ((rotation (mod *stack-pointer* (length *stack*))))
    (inst movzx *rax* (make-ea :dword :base *virtual-ip*
                                      :disp offset))
    (unless (= -4 offset)
      (inst add *virtual-ip* (+ 4 offset)))
    (if (zerop rotation)
        (inst add *rax* *code-base*)
        (inst lea *rax* (make-ea :qword :base *code-base*
                                        :index *rax*
                                        :disp (* rotation *primitive-code-offset*))))
    (inst jmp *rax*)))

Let’s add a few simple primitives.

(defun swap ()
  (inst xchg (@ 0) (@ 1)) ; exchange top of stack and stack[1]
  (next))

(defun dup ()
  (decf *stack-pointer*) ; grow stack (which grows down)
  (inst mov (@ 0) (@ 1)) ; and overwrite TOS
  (next))

(defun drop (&optional offset)
  (incf *stack-pointer*) ; just shrink the stack
  (next offset))

(defun add ()
  (inst add (@ 1) (@ 0)) ; second element becomes TOS
  (drop))

(defun sub ()
  (inst sub (@ 1) (@ 0))
  (drop))
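
The intended semantics of these primitives can be checked with a small portable simulation. It is only a model under stated assumptions: a vector replaces the registers, the counter is an ordinary run-time variable (in the real VM it is implicit in which code page is executing), arithmetic wraps to 32 bits, and all sim- names are hypothetical.

```lisp
(defparameter *regs* (make-array 8 :initial-element 0)) ; stands in for r8d..r15d
(defparameter *sp* 0)                                    ; run-time view of the counter

(defun reg (i)
  (aref *regs* (mod (+ i *sp*) 8)))

(defun (setf reg) (value i)
  (setf (aref *regs* (mod (+ i *sp*) 8)) value))

(defun sim-swap () (rotatef (reg 0) (reg 1)))            ; exchange TOS and stack[1]
(defun sim-dup  () (decf *sp*) (setf (reg 0) (reg 1)))   ; grow, then copy old TOS
(defun sim-drop () (incf *sp*))                          ; just shrink the stack

(defun sim-add ()
  ;; the second element receives the sum and becomes TOS after the drop
  (setf (reg 1) (ldb (byte 32 0) (+ (reg 1) (reg 0))))
  (sim-drop))

(defun sim-sub ()
  (setf (reg 1) (ldb (byte 32 0) (- (reg 1) (reg 0))))
  (sim-drop))
```

Starting from TOS = 3 over 5, the sequence dup, add, swap, sub leaves 1 on top with the counter at 1; no value ever moves except the ones the primitive itself writes.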

CL-USER> (setf *print-length* 100)
100
CL-USER> (emit-all-code 'swap 'dup 'drop 'add 'sub)
(0 32 64 96 128)
(#S(CODE-PAGE
    :ALLOC 152
    :CODE #(69 135 193 139 4 61 0 0 0 0 72 131 199 4 72 1 240 255 224 102 15 31
            132 0 0 0 0 0 15 31 64 0 69 139 248 139 4 61 0 0 0 0 72 131 199 4
            72 141 132 6 64 117 0 0 255 224 15 31 132 0 0 0 0 0 139 4 61 0 0 0
            0 72 131 199 4 72 141 132 6 192 16 0 0 255 224 102 15 31 132 0 0 0
            0 0 102 144 69 1 193 139 ...))
 ...)
CL-USER> (defparameter *code0* (code-page-code (first (second /))))
*CODE0*
CL-USER> (defparameter *code1* (code-page-code (second (second //))))
*CODE1*

The code for swap lives between bytes 0 and 32. Let’s take a look
at the versions for *stack-pointer* = 0 and *stack-pointer* = 1.

CL-USER> (sb-sys:with-pinned-objects (*code0*)
           (sb-disassem:disassemble-memory (sb-sys:vector-sap *code0*)
                                           32))
; Size: 32 bytes
; 0669C700:       4587C1           XCHG R8D, R9D
;       03:       8B043D00000000   MOV EAX, [RDI]
;       0A:       4883C704         ADD RDI, 4
;       0E:       4801F0           ADD RAX, RSI
;       11:       FFE0             JMP RAX
;       13:       660F1F840000000000 NOP  ; padding NOPs
;       1C:       0F1F4000         NOP
NIL
CL-USER> (sb-sys:with-pinned-objects (*code1*)
           (sb-disassem:disassemble-memory (sb-sys:vector-sap *code1*)
                                           32))
; Size: 32 bytes
; 0669D810:       4587CA           XCHG R9D, R10D
;       13:       8B043D00000000   MOV EAX, [RDI]
;       1A:       4883C704         ADD RDI, 4
;       1E:       488D8406C0100000 LEA RAX, [RSI+RAX+4288]
;       26:       FFE0             JMP RAX
;       28:       0F1F840000000000 NOP
NIL

dup is at 32-64, and sub at 128-152:

CL-USER> (sb-sys:with-pinned-objects (*code0*)
           (sb-disassem:disassemble-memory (sb-sys:sap+ (sb-sys:vector-sap *code0*) 32)
                                           32))
; Size: 32 bytes
; 0669C720:       458BF8           MOV R15D, R8D
;       23:       8B043D00000000   MOV EAX, [RDI]
;       2A:       4883C704         ADD RDI, 4
;       2E:       488D840640750000 LEA RAX, [RSI+RAX+30016]
;       36:       FFE0             JMP RAX
;       38:       0F1F840000000000 NOP
NIL
CL-USER> (sb-sys:with-pinned-objects (*code0*)
           (sb-disassem:disassemble-memory (sb-sys:sap+ (sb-sys:vector-sap *code0*) 128)
                                           24))
; Size: 24 bytes
; 0669C780:       4529C1
