Assembly code scares people. There’s a good reason for that. For many people, writing code in assembly language seems equivalent to writing code in ancient dwarven runes, or calculating pi in Roman numerals. The fact that RollerCoaster Tycoon was almost completely written in assembly language sounds almost too amazing to be true. Many programmers view assembly language as some combination of ancient, arcane, inscrutable, useless, and complex.
Despite all that, I have a secret to share with you. Reading assembly language is not really that hard. Or at the very least, it’s an order of magnitude easier than writing assembly language. There are a few reasons why that’s true, but before we dive into that, let me first tell you why you should care about assembly language.
For most people who write in languages that compile to native code, assembly language represents the fundamental building blocks of every program that we build and run[1]. If you’ve ever had to troubleshoot something where you absolutely, 100%, MUST understand what a line of code does… reading the assembly for the code is what you should do. Not reading the C++ source, Rust source, or even C source. The reason is that source code in every language will lie to you. Not necessarily due to any fault in the programming language or compiler, but due to the limits of our own comprehension. Complex or unfamiliar language features, undefined behavior, or simply poorly written code can make it difficult to see what’s really going on. But the assembly code will always tell you the truth.
And besides that, there is the typical case where you read assembly language: when you don’t have the source code. Reverse engineering something to understand how it works shouldn’t be seen as an unapproachable skill. It’s something every programmer should understand at some level, especially if you run code on a closed source operating system or use libraries without source code.
But most importantly, understanding assembly language is essential to understanding how things really work, and it can give you better insight whether you’re building systems up or tearing systems down. Reading assembly language is not a replacement for proper reverse engineering tools like Ghidra or IDA, but it is a necessary complementary skill.
One of the hardest parts of assembly language is the sheer number of different instructions. The 8086 instruction set started with 81 different instructions. On a modern Intel CPU, that number is closer to 1000. You could imagine that finding the right instruction for a specific situation would be difficult. In reality, the number of instructions you need to learn to read is quite small. In one binary I looked at, the 10 most frequent instructions accounted for 83% of all instructions used, and many of the top 30 instructions are just slight variations on one another (like AND and OR).
Here’s a chart I made showing the relative frequency of the 30 most common instruction types I saw in one binary. I suspect you would see a similar graph on other architectures, but on x86 you’ll see a particularly long tail due to the large number of instruction types. You can understand a very large chunk of assembly code if you just know the most common instructions.
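If you want to check this on a binary of your own, here is a minimal Python sketch of how such a count could be made. It is not the script behind the chart. It assumes the disassembly is text in the tab-separated layout that GNU objdump prints (address, raw bytes, then the instruction), and the names mnemonic_counts, top_share, and SAMPLE are made up for this illustration. SAMPLE is a hand-written fragment, not output from the binary discussed above.

```python
from collections import Counter


def mnemonic_counts(disassembly: str) -> Counter:
    """Count instruction mnemonics in objdump-style disassembly text."""
    counts = Counter()
    for line in disassembly.splitlines():
        # Assumed layout: "address:<TAB>raw bytes<TAB>mnemonic operands".
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # symbol labels, blank lines, byte-only continuation lines
        tokens = fields[2].split()
        if tokens:
            # Simplification: a prefix such as "rep" or "lock" is counted
            # as if it were the mnemonic.
            counts[tokens[0]] += 1
    return counts


def top_share(counts: Counter, n: int) -> float:
    """Fraction of all counted instructions covered by the n most common."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(count for _, count in counts.most_common(n)) / total


# Hand-written sample in the assumed layout, for illustration only.
SAMPLE = (
    "0000000000001129 <main>:\n"
    "    1129:\t55                   \tpush   rbp\n"
    "    112a:\t48 89 e5             \tmov    rbp,rsp\n"
    "    112d:\tb8 04 00 00 00       \tmov    eax,0x4\n"
    "    1132:\t83 c0 04             \tadd    eax,0x4\n"
    "    1135:\t5d                   \tpop    rbp\n"
    "    1136:\tc3                   \tret\n"
)

if __name__ == "__main__":
    counts = mnemonic_counts(SAMPLE)
    for name, count in counts.most_common():
        print(f"{name:6} {count}")
    print(f"top 2 share: {top_share(counts, 2):.0%}")
```

To try it on a real binary, you could save the disassembler output to a text file and pass the file contents to mnemonic_counts in place of SAMPLE.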
Hopefully I’ve convinced you that learning to read assembly language is important and not as hard as you think. So let me give you a little crash course in x86 assembly.
Two flavors: AT&T and Intel syntax
For historical reasons, there are two “flavors” of disassembly syntax for x86. One is called “Intel” and the other is called “AT&T”. If you live in the Windows world, you may never see AT&T syntax, but some open source tools will default to AT&T syntax so it’s good to recognize when you’re dealing with it.
The biggest difference you’ll see between the two flavors is that the order of operands is reversed! Here’s an example of AT&T syntax:
addl $4, %eax
And here’s an example of Intel syntax:
add eax, 4
Besides the order being swapped, constants also get prefixed with $, and registers are prefixed with %. Some mnemonics also have a letter appended to indicate the size of the operands, such as l for 32-bit operands.
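To make those three differences concrete, here is a toy Python sketch that rewrites a very simple AT&T-style instruction in Intel style. It is an illustration, not a real translator: it only accepts register and immediate operands, it rejects memory operands outright, and the att_to_intel function and its small BASE_MNEMONICS table are hypothetical names invented for this example.

```python
SIZE_SUFFIXES = "bwlq"  # AT&T operand-size suffixes: 8, 16, 32, 64 bits

# Hypothetical, deliberately tiny table of base mnemonics for this sketch.
BASE_MNEMONICS = {"add", "sub", "mov", "and", "or", "xor", "cmp", "push", "pop"}


def att_to_intel(line: str) -> str:
    """Rewrite a simple AT&T-syntax instruction in Intel syntax."""
    parts = line.strip().split(None, 1)
    if not parts:
        raise ValueError("empty instruction")
    mnemonic = parts[0]
    operand_text = parts[1] if len(parts) > 1 else ""

    # 1. Drop the size suffix, e.g. "addl" -> "add". Checking the table
    #    keeps "sub" from being mistaken for "su" plus a "b" suffix.
    if mnemonic[-1] in SIZE_SUFFIXES and mnemonic[:-1] in BASE_MNEMONICS:
        mnemonic = mnemonic[:-1]

    operands = [op.strip() for op in operand_text.split(",") if op.strip()]
    for op in operands:
        # Anything that is not $constant or %register (memory operands,
        # symbols) is outside the scope of this sketch.
        if op[0] not in "$%" or "(" in op:
            raise ValueError(f"unsupported operand: {op}")

    # 2. Strip the "$" and "%" prefixes.
    cleaned = [op[1:] for op in operands]
    # 3. Reverse the order: AT&T is "source, destination",
    #    Intel is "destination, source".
    cleaned.reverse()
    return f"{mnemonic} {', '.join(cleaned)}".strip()


if __name__ == "__main__":
    print(att_to_intel("addl $4, %eax"))  # prints "add eax, 4"
    print(att_to_intel("pushq %rbp"))     # prints "push rbp"
```

Memory operands are left out on purpose, since the two syntaxes write them quite differently and handling them properly is a job for a real disassembler.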
Depending on the tools you are using, you may not have a choice of which syntax to use. WinDbg only supports Intel syntax, for instance. Many open source tools will default to AT&T syntax, but will have an option to enable Intel syntax. For objdump, you can use -M intel. For instance:
objdump -d -M intel ./a.out
While I’m sure someone has an argument in favor of the AT&T syntax, I’d suggest avoiding it for one simple reason: The Intel manual (SDM) uses Intel syntax, and is a crucial resource to understa