(A Script that Solicits GPT-3 for Comments on Decompiled Code)
In this post, I introduce a new Ghidra script that elicits high-level explanatory comments for decompiled function code from the GPT-3 large language model. This script is called G-3PO. In the first few sections of the post, I discuss the motivation and rationale for building such a tool, in the context of existing automated tooling for software reverse engineering. I look at what many of our tools — disassemblers, decompilers, and so on — have in common, insofar as they can be thought of as automatic paraphrase or translation tools. I spend a bit of time looking at how well (or poorly) GPT-3 handles these various tasks, and then sketch out the design of this new tool.
If you want to just skip the discussion and get yourself set up with the tool, feel free to scroll down to the last section, and then work backwards from there if you like.
The Github repository for G-3PO can be found HERE.
At the present state of things, the domain of reverse engineering seems like a fertile site for applying machine learning techniques. ML tends to excel, after all, at problems where getting the gist of things counts, where the emphasis is on picking out patterns that might otherwise go unnoticed, and where error is either tolerable or can be corrected by other means. This kind of loose and conjectural pattern recognition is where reverse engineering begins. We start by trying to get a feel for a system, a sense of how it hangs together, and then try to tunnel down. Impressions can be deceptive, of course, but this is a field where they’re easily tested, and where suitable abstractions are both sought and mistrusted.
The goal, after all, is to understand (some part of) a system better than its developers do, to piece together its specification and where the specification breaks down.
At many stages along the way, you could say that what the reverse engineer is doing is searching for ways to paraphrase what they’re looking at, or translate it from one language into another.
We might begin, for example, with an opaque binary “blob” (to use a semi-technical term for unanalyzed data) that we dumped off a router’s NAND storage. The first step might be to tease out its file format and, through a process of educated guesses and experiments, find a way to parse it. Maybe it turns out to contain a squashfs file system, containing the router’s firmware. We have various tools, like Binwalk, to help with this stage of things, which we know can’t be trusted entirely but which might provide useful hints, or even get us to the next stage.
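At bottom, what tools like Binwalk automate in this first pass is a scan for known magic bytes. A toy sketch of that idea, looking only for squashfs superblock signatures (the blob and offsets here are invented for illustration):

```python
# Minimal first-pass triage: scan a blob for squashfs superblock magics.
# Real tools like Binwalk check many more signatures and validate headers;
# the sample blob below is fabricated for illustration.

SQUASHFS_MAGICS = {
    b"hsqs": "squashfs (little-endian)",
    b"sqsh": "squashfs (big-endian)",
}

def scan_for_squashfs(blob: bytes) -> list[tuple[int, str]]:
    """Return (offset, description) pairs for each squashfs magic found."""
    hits = []
    for magic, desc in SQUASHFS_MAGICS.items():
        start = 0
        while (idx := blob.find(magic, start)) != -1:
            hits.append((idx, desc))
            start = idx + 1
    return sorted(hits)

# Toy blob: 64 bytes of padding, then a little-endian squashfs magic.
blob = b"\x00" * 64 + b"hsqs" + b"\x00" * 32
print(scan_for_squashfs(blob))  # [(64, 'squashfs (little-endian)')]
```

A hit like this is only a hint, of course; the next step would be to try actually unpacking a filesystem at that offset and see whether the result is coherent.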
Suppose we then unpack the firmware, mount it as a filesystem, and then explore the contents. Maybe we find an interesting-looking application binary, called something like telnetd_startup
. Instead of reading it as an opaque blob of bits, we look for a way to make sense of it, usually beginning by parsing its file structure (let’s say it’s an ELF) and disassembling it — translating the binary file into a sequence, or better, a directed graph of assembly instructions. For this step we might lean on tools like objdump, rizin, IDA Pro (if we have an expense account), or, my personal favourite, Ghidra. There’s room for error here as well. Sometimes even the best tools we have will get off on the wrong foot and parse data as code, or misjudge the offset of a series of instructions and produce a garbled listing. But you learn to recognize the kinds of errors that these sorts of tools are prone to, especially when dealing with unknown file formats, and you pick up various heuristics and rules of thumb to minimize and correct them. Tools that can automate the translation of a binary blob into readable assembly are nevertheless essential — to the extent that if you were faced with a binary that used an unknown instruction set, your first priority as a reverse engineer might well be to write at least a flawed and incomplete disassembler for it.
The disassembly listing of a binary gives us a fine-grained picture of its application logic, and sometimes that’s the furthest that automated tools can take us. But it’s still a far cry from the code that its developer may have been working with — very few programs are written in assembly these days, and it’s easy to get lost in the weeds without a higher-level vantage point. This might be where the reverser begins the patient manual work of discovering interesting components of the binary — components where it’s handling user input, for example — by stepping through the binary with a debugger like GDB (perhaps with the help of an emulator, like QEMU), and then annotating the disassembly listing with comments. In doing so the reverser tries to produce a high-level paraphrase of the program.
Nowadays, however, we often have access to another set of tools called decompilers, which can at least approximately translate the disassembly listing into something that looks like source code, typically something like C (but extended with a few pseudo types, like Ghidra’s undefined and undefined* to indicate missing information). (Other tools, static analysis frameworks like BAP or angr (or, internally, Ghidra or Binary Ninja), for example, might be used to “lift” or translate the binary to an intermediate representation more amenable to further automated analysis, but we’ll leave those aside for now.) Decompilation is a heuristically-driven and inexact art, to a significantly greater extent than disassembly. When source code (in C, for example) is compiled down to x86 or ARM machine code, there’s an irreversible loss of information, and moving back in the other direction involves a bit of guesswork, guided by contextual clues and constraints. When reverse engineers work with decompilers, we take it for granted that the decompiler is probably getting at least a few things wrong. But I doubt anyone would say that they’re unhelpful. We can, and often must, go back to the disassembly listing whenever needed, after all. And when something seems fishy there, we can go back to the binary’s file format, and see if something’s been parsed incorrectly.
In my day-to-day work, this is usually where automated analysis stops and where manual annotation and paraphrase begin. I slowly read through the decompiler’s output and try to figure out, in ordinary language, what the code is “supposed” to be doing, and what it’s actually doing. It’s a long process of conjecture and refutation, often involving the use of debuggers, emulators, and tracers to test interpretations of the code. I might probe the running or emulated binary with various inputs and observe the effects. I might even try to do this in a brute-force way, at scale, “fuzzing” the binary and looking for anomalous behaviour. But a considerable amount of time is spent just adding comments to the binary in Ghidra, correcting misleading type information, and coming up with informative names for the functions and variables in play (especially if the binary’s been stripped and symbols are missing). Let’s call this the process of annotation.
We might notice that many of the automated stages in the reverse engineer’s job — parsing and unpacking the firmware blob, disassembling binary executables, and then decompiling them — can at least loosely be described as processes of translation or paraphrase. And the same can be said for annotation.
This brings us back to machine learning.
If there’s one thing that large language models, like OpenAI’s GPT-3, have shown themselves to be especially good at, it’s paraphrase — whether it’s a matter of translating between one language and another, summarising an existing knowledge base, or rewriting a text in the style of a particular author. Once you notice this, as I did last week while flitting back and forth between a project I was working on in Ghidra and a browser tab opened to ChatGPT, it might seem natural to see how an LLM handles the kinds of “paraphrasing” involved in a typical software reverse engineering workflow.
The example I’ll be working with here, unless otherwise noted, is a function carved from a firmware binary I dumped from a Canon ImageClass MF743Cdw printer.
GPT-3 Makes a Poor Disassembler
Let’s begin with disassembly:
Disassembly seems to fall squarely outside of ChatGPT’s scope, which isn’t surprising. It was trained on “natural language” in the broad sense, after all, and not on binary dumps.
The GPT-3 text-davinci-003 model does no better:
This, again, would be great, if it weren’t entirely wrong. Here’s what capstone (correctly) returns for the same input:
0x44b2d4b0: cmp r2, #3
0x44b2d4b4: bls #0x44b2d564
0x44b2d4b8: ands ip, r0, #3
0x44b2d4bc: beq #0x44b2d4e4
0x44b2d4c0: ldrb r3, [r1], #1
0x44b2d4c4: cmp ip, #2
0x44b2d4c8: add r2, r2, ip
0x44b2d4cc: ldrbls ip, [r1], #1
0x44b2d4d0: strb r3, [r0], #1
0x44b2d4d4: ldrblo r3, [r1], #1
0x44b2d4d8: strbls ip, [r0], #1
0x44b2d4dc: sub r2, r2, #4
0x44b2d4e0: strblo r3, [r0], #1
0x44b2d4e4: ands r3, r1, #3
0x44b2d4e8: beq #0x44b36318
0x44b2d4ec: subs r2, r2, #4
0x44b2d4f0: blo #0x44b2d564
0x44b2d4f4: ldr ip, [r1, -r3]!
0x44b2d4f8: cmp r3, #2
0x44b2d4fc: beq #0x44b2d524
0x44b2d500: bhi #0x44b2d544
0x44b2d504: lsr r3, ip, #8
0x44b2d508: ldr ip, [r1, #4]!
0x44b2d50c: subs r2, r2, #4
0x44b2d510: orr r3, r3, ip, lsl #24
0x44b2d514: str r3, [r0], #4
0x44b2d518: bhs #0x44b2d504
0x44b2d51c: add r1, r1, #1
0x44b2d520: lsr r3, ip, #0x10
0x44b2d524: ldr ip, [r1, #4]!
0x44b2d528: subs r2, r2, #4
0x44b2d52c: orr r3, r3, ip, lsl #16
0x44b2d530: str r3, [r0], #4
0x44b2d534: bhs #0x44b2d520
0x44b2d538: add r1, r1, #2
0x44b2d53c: lsr r3, ip, #0x18
0x44b2d540: ldr ip, [r1, #4]!
0x44b2d544: subs r2, r2, #4
0x44b2d548: orr r3, r3, ip, lsl #8
0x44b2d54c: str r3, [r0], #4
0x44b2d550: bhs #0x44b2d53c
0x44b2d554: add r1, r1, #3
0x44b2d558: lsls r2, r2, #0x1f
0x44b2d55c: ldrbhs r3, [r1], #1
0x44b2d560: ldrbhs ip, [r1], #1
0x44b2d564: ldrbmi r2, [r1], #1
0x44b2d568: strbhs r3, [r0], #1
0x44b2d56c: strbhs ip, [r0], #1
0x44b2d570: strbmi r2, [r0], #1
0x44b2d574: bx lr
0x44b2d578: subs r2, r2, #0x20
0x44b2d57c: blo #0x44b2d5b0
0x44b2d580: cmp r2, #0x80
0x44b2d584: blo #0x44b2d5a0
0x44b2d588: pld [r1, #0x80]
0x44b2d58c: ldm r1!, {r3, r4, r5, r6, r7, r8, ip, lr}
0x44b2d590: sub r2, r2, #0x20
0x44b2d594: stm r0!, {r3, r4, r5, r6, r7, r8, ip, lr}
0x44b2d598: ldm r1!, {r3, r4, r5, r6, r7, r8, ip, lr}
0x44b2d59c: subs r2, r2, #0x20
0x44b2d5a0: stm r0!, {r3, r4, r5, r6, r7, r8, ip, lr}
0x44b2d5a4: bhs #0x44b2d598
0x44b2d5a8: lsls ip, r2, #0x1c
0x44b2d5ac: ldmhs r1!, {r3, r4, ip, lr}
0x44b2d5b0: stmhs r0!, {r3, r4, ip, lr}
0x44b2d5b4: ldmmi r1!, {r3, r4}
0x44b2d5b8: stmmi r0!, {r3, r4}
0x44b2d5bc: lsls ip, r2, #0x1e
0x44b2d5c0: ldrhs r3, [r1], #4
0x44b2d5c4: strhs r3, [r0], #4
0x44b2d5c8: bxeq lr
0x44b2d5cc: lsls r2, r2, #0x1f
0x44b2d5d0: ldrhhs r3, [r1], #2
0x44b2d5d4: ldrbmi r2, [r1], #1
0x44b2d5d8: strhhs r3, [r0], #2
0x44b2d5dc: strbmi r2, [r0], #1
0x44b2d5e0: bx lr
GPT-3 Makes a Pretty Good Decompiler
Things look a lot brighter when we turn to decompilation.
If I first disassemble the binary using Capstone and then ask the LLM to decompile the result, I start to get surprisingly accurate and informative answers.
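The prompt_openai_model helper used below isn’t shown in this post, so here is a guess at what it might look like, using the legacy OpenAI completions API that was current when text-davinci-003 was available (the function body and parameter choices are my assumptions, not the author’s code):

```python
import os

def build_decompile_prompt(asm: str) -> str:
    """Frame a disassembly listing so the model continues with C code."""
    return (
        "Here is some ARM assembly code:\n\n"
        f"```\n{asm}\n```\n\n"
        "It can be decompiled into C code as follows:\n\n```\n"
    )

def prompt_openai_model(prompt: str, model: str = "text-davinci-003") -> str:
    """Hypothetical helper: send a completion request and return the text."""
    import openai  # legacy (pre-1.0) OpenAI client, assumed here

    openai.api_key = os.environ["OPENAI_API_KEY"]
    response = openai.Completion.create(
        model=model,
        prompt=prompt,
        max_tokens=1024,
        temperature=0.2,      # keep the "translation" fairly deterministic
        stop=["```"],         # stop once the model closes the code fence
    )
    return response.choices[0].text
```

The prompt deliberately ends with an open code fence, inviting the model to complete it with C code; the stop sequence then trims the response at the closing fence.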
>>> prompt_openai_model(f"Here is some ARM assembly code:\n\n```\n{s}\n```\n\nIt can be decompiled into C code as follows:\n\n```\n",
...                     model="text-davinci-003")
This gives us the following:
if (r2 == 3) {
if ((r0 & 3) == 0) {
while (r2 > 0) {
r3 = *r1++;
*r0++ = r3;
r2--;
}
} else {
while (r2 > 0) {
if ((r0 & 3) == 2) {
r3 = *r1++;
*r0++ = r3;
r2--;
} else {
ip = *r1++;
r2 -= 4;
if (r3 == 2) {
*r0++ = (ip << 24) | (*r1++);
} else if (r3 < 2) {
*r0++ = (ip << 16) | (*r1++);
r1++;
} else {
*r0++ = (ip << 8) | (*r1++);
r1 += 3;
}
}
}
}
} else {
while (r2 > 0) {
if (r2 >= 32) {
if (r2 >= 128) {
for (int i = 0; i < 4; i++) {
*r0++ = *r1++;
*r0++ = *r1++;
*r0++ = *r1++;
*r0++ = *r1++;
r2 -= 32;
}
} else {
*r0++ = *r1++;
*r0++ = *r1++;
*r0++ = *r1++;
*r0++ = *r1++;
r2 -= 32;
}
} else {
if (r2 >= 16) {
*r0++ = *r1++;
*r0++ = *r1++;
*r0++ = *r1++;
*r0++ = *r1++;
r2 -= 16;
} else {
if (r2 >= 8) {
*r0++ = *r1++;
*r0++ = *r1++;
r2 -= 8;
} else {
if (r2 >= 4) {
*r0++ = *r1++;
r2 -= 4;
} else {
if (r2 >= 2) {
*r0++ = *r1++;
r2 -= 2;
} else {
*r0++ = *r1++;
r2--;
}
}
}
}
}
}
}
Compare this with what Ghidra gives us:
undefined8 FUN_44b2d4b0(uint *param_1,uint *param_2,uint param_3,uint param_4){
uint *puVar1;
uint *puVar2;
byte *pbVar3;
uint *puVar4;
byte bVar5;
undefined2 uVar6;
uint uVar7;
uint uVar8;
uint uVar9;
uint uVar10;
uint uVar11;
uint uVar12;
byte bVar13;
uint in_r12;
uint uVar14;
uint uVar15;
uint uVar16;
bool bVar17;
bool bVar18;
if (3 < param_3) {
uVar14 = param_1 & 3;
in_r12 = uVar14;
if (uVar14 != 0) {
bVar5 = *param_2;
puVar2 = param_2 + 1;
if (uVar14 < 3) {
puVar2 = param_2 + 2;
in_r12 = *(param_2 + 1);
}
*param_1 = bVar5;
param_2 = puVar2;
if (uVar14 < 2) {
param_2 = puVar2 + 1;
bVar5 = *puVar2;
}
puVar2 = param_1 + 1;
if (uVar14 < 3) {
puVar2 = param_1 + 2;
*(param_1 + 1) = in_r12;
}
param_3 = (param_3 + uVar14) - 4;
param_1 = puVar2;
if (uVar14 < 2) {
param_1 = puVar2 + 1;
*puVar2 = bVar5;
}
}
param_4 = param_2 & 3;
if (param_4 == 0) {
uVar14 = param_3 - 0x20;
if (0x1f < param_3) {
for (; 0x7f < uVar14; uVar14 = uVar14 - 0x20) {
HintPreloadData(param_2 + 0x20);
uVar7 = *param_2;
uVar8 = param_2[1];
uVar9 = param_2[2];
uVar10 = param_2[3];
uVar11 = param_2[4];
uVar12 = param_2[5];
uVar15 = param_2[6];
uVar16 = param_2[7];
param_2 = param_2 + 8;
*param_1 = uVar7;
param_1[1] = uVar8;
param_1[2] = uVar9;
param_1[3] = uVar10;
param_1[4] = uVar11;
param_1[5] = uVar12;
param_1[6] = uVar15;
param_1[7] = uVar16;
param_1 = param_1 + 8;
}
do {
param_4 = *param_2;
uVar7 = param_2[1];
uVar8 = param_2[2];
uVar9 = param_2[3];