July 24, 2023 – 26 min – 5427 words
Code obfuscation is the technique of applying transformations to a code to make any static analysis (automatic or otherwise) more difficult and more complex. Static analysis groups a series of operations that an analyst can perform on an executable to try to obtain information such as high-level structures and original variables.
Among the techniques of static analysis of software, we can find reverse engineering, also called reverse engineering. This technique makes it possible to recover some of the information that allows an analyst to better understand how software works, how it interacts with data, and what algorithms it uses. Obtaining the original code often presents several challenges and pitfalls since machine code does not allow for the expression of loops and high-level data structures.
For some programming languages (such as Java), it is possible to recover the initial code without any problems. It is very likely that by analyzing any Java or .NET-based application, you will be able to find most of the original logic. This is given by the expressiveness of their machine language, such as in Java Bytecode, in which more information can be stored than in the primitive machine language used by the CPU (Assembly).
All reverse engineering tools such as IDA, Ghidra, and Binary Ninja base their operation on certain components such as disassemblers (which allow translation from a raw stream of bytes to machine language instructions) and decompilers (which allow translation from machine language to a high-level C-like language, including information such as strings and loops). These tools have become very powerful over time, and with luck, much of the program logic can be recovered. Decompilers are still approximate; they certainly cannot generate recompilable code; however, through “decompiled” code, it is possible to recreate software.
An example? 3D Pinball Space Cadet is the attempted reconstruction of the popular 3D Pinball game for Windows XP. Analysts were able, through Ghidra and IDA, to reconstruct most of the original data structures, populate them with the correct values, and resurrect the game to port it to different platforms (Nintendo 3DS, AmigaOS, Android). How did they do this? They used a process that mixed reverse engineering and debugging of the original application to be able to figure out how collision management, score updating, and sound management worked.
Reverse engineering allows analysts to go deep with respect to how the software works, what data it relies on, and potentially any weaknesses in the software. A huge Pandora’s box opens here. Some of the information that is included within the software might be restricted to a specific user or might conceal some details that we don’t want to give away to the first person running strings software_enterprise
. If we then think about corporate assets, the first listed companies are software producers, and paradoxically, software (and algorithms) is their flagship product; they have to protect it.
Imagine that we find within a piece of software a specific sequence of instructions that make the software faster, more performant, or simply better for customers. As a competitor, I might be interested in replicating the same instructions or ideas in my own software to narrow the gap between my software and the competitor’s software. The software might also contain some secrets, such as API keys or internal or private methods, that the originating company might want to hide, which I am interested in as the main adversary.
Very thinly veiled, I introduced some points to motivate the choice of obfuscating the code. Intellectual property protection is one of the main reasons for wanting to obfuscate the code and make static analysis more complex. The algorithms that the company develops to make an application perform better or to make certain choices that have a socio-economic impact are algorithms that constitute a very valuable resource for the attacker. To make analysis more difficult, it is necessary to complicate the information on which decompilers rely: instructions and data. The technique for making any program more complicated is called obfuscation and allows a given program P to generate a program P’ that is more difficult to analyze.
In this article, we will introduce POCKET, a program that allows you to have a first approach to obfuscation. Pocket is a command-line program that accepts an MBA (Mixed Boolean Arithmetic) expression, and by applying some transformation rules we are able to make the expression more complex. To better understand how it works and what advantages it has, let us introduce a real-life example.
The task of decompilers
Once the disassembler has translated the raw byte sequence into assembly microinstructions, the result can look something like this:
inc eax
and eax, ebx
and eax, ecx
The decompiler can then begin to translate the microinstructions into a simpler language. At first the decompiler works by “pattern matching”: it notes in fact an increment in eax, an and between eax and ebx, and another and between eax and ebx. The commented instructions become thus:
inc eax ; eax = eax + 1
and eax, ebx ; eax = eax & ebx
and eax, ecx ; eax = eax & ecx
The decompiler notices that in all three expressions eax
is the left register used for assignment; the decompiler then rewrites the expressions with a single expression: eax = ((eax + 1) & ebx) & ecx
. Assume further that we assign the value 2
to eax, the value 5 to ebx
, and the value 3
to ecx. The end result of this expression is 3
. Within a program this value might represent the index of an element, the number of times to iterate within a loop. You might think of it as a nothing value.
But instead of very simple values why can’t we assign a value like 0x80021002
to eax? The result is 0x80021002
, however we only need to add a single instruction to make things more interesting. We add jmp %eax
. This way, the execution of the next instruction is dynamic: it depends on the value of eax. Thus, as mentioned, we do not worry about the possible values a data item can take.
We are concerned instead with the readability of the expression for a decompiler. The expression eax = ((eax + 1) & ebx) & ecx
is very readable: it is 2 simple and operations and a sum. Imagining that it knows the values of eax, ebx, and ecx, the decompiler could even replace eax with the computed value:
eax = 0x80021002
eax = ((eax + 1) & ebx) & ecx
// result: 0x80021003
The expression we have just analyzed contains algebraic operations (such as sum) and logical operations (AND): therefore, the expression is named Mixed Boolean Arithmetic expression. It is very simple to compute as an expression and requires no special care. How can we make it more inaccessible? For now, let us consider only transformations on the expression.
For example, we know that any sum operation between X and Y X+Y
can be transcribed as (X & Y) + (X | Y)
. Let us try applying this transformation: in our case X will be eax while Y an immediate value, namely 1. The result of this transformation is (eax & 1) + (eax | 1)
and rewriting the expression eax = (((eax & 1) + (eax | 1) & ebx) & ecx
. We can then move on to the second transformation involving X & Y
. The AND operation between two integers is equivalent to (X + Y) - (X | Y)
, in this case X = (eax & 1) + (eax | 1)
while Y = ebx
: the final result is (((eax & 1) + (eax | 1)) + ebx) - (((eax & 1) + (eax | 1)) | ebx)
.
We apply this rule one last time to get the final obfuscated expression:
((((((eax & 1) + (eax | 1)) + ebx) - (((eax & 1) + (eax | 1)) | ebx)) + ecx) - (((((eax & 1) + (eax | 1)) + ebx) - (((eax & 1) + (eax | 1)) | ebx)) | ecx))
We then understood with a real example the potential of MBA expressions. Obfuscation of these expressions consists of iteratively applying rewriting rules to increase the complexity of these expressions while preserving their semantics. In fact, the expression rewriting rules must not change the result of the expression (because otherwise the obfuscation would be invalid).
The transformation rules are based on certain operations that when performed in a certain order turn out to be equivalent to the original operations.
Introducing POCKET
POCKET is a tool for automatically applying these transformation rules and was written in Rust with the grammar based on Pest. The development of the project was not particularly difficult since I allowed much of the complexity of the parser to be delegated to Pest. The difficulty was more trying to understand Rust’s method of operation, figuring out how to get around some borrowing issues, and an annoying problem with a stack overflow. (Spoiler: I was recursively calling the same operation).
By compiling the project with cargo compile
, the pocket
binary is generated. We can start it without any problems; it will stand by waiting for an input expression. To resume the expression from before, we can consider: (A + 1) & B & C
. POCKET constructs a syntactic tree from the expression that will be traversed to apply the transformation rules. A syntactic tree for the expression in the example is:
.
|-- &
|------ +
|--------- A
|--------- 1
|------ &
|-------- B
|-------- C
Each node can be one of 4 different types:
- Immediate: represents an immediate value such as ‘1’
- Cont: represents a SET consisting of a letter of the ASCII alphabet and its value (A, B, C… or a, b, c…)
- BinaryOperation: represents a binary operation. It consists of the contents of the operation (AND, OR, SUM, SUBTract..) and a left and right child. The children are other nodes that define other operations or other terminals.
- UnaryOperation: represents a unary operation. Composed of the symbol before the SET.
Immediate and cont are terminals as they have no children, while BinaryOperation has two children and UnaryOperator one child. Once the tree has been constructed, the tree is visited in its entirety in order to apply the transformation rules: imagine a switch constructed based on the binary operation to be replaced.
- if the operation is
X & Y
then apply the substitution(X + Y) - (X | Y)
- if the operation is
X ^ Y
then apply the substitution(X | Y) - (X & Y)
- if the operation is
X - Y
then apply the substitution(X ^ -Y) + 2*(X & -Y)
- if the operation is
X + Y
then apply the substitution(X & Y) + (X | Y)
- if the operation is
X | Y
then apply the substitutionX + Y + 1 + (~X | ~Y)
The values of X and Y are replaced each time by the left and right children of the operations, so I can perform the substitution without losing the context I am in. I am working on being able to add additional transformations on the immediate values. If the transformations for binary operations are limited, the transformations for immediate values are potentially infinite: I can write 2 as the difference between 48 and 46, or between 54 and 52 or the difference between 52 and 54 changed sign.
Equivalence transformations result from studies of the compatibility of operators in binary arithmetic. Suppose we have an expression of the type 2 * (x ^ y)
: this is equivalent to (2x ^ 2y)
. The less attentive reader may have confused these two formulas with the simple distributive property: I multiply x and y by two and keep the XOR operation.
It doesn’t always work that way!!! Already with 3 * (x ^ y) != (3x ^ 3y)
the distributive property disappears. Compatibility between formulas and operators is based on the type of set we are considering: the set based on binary arithmetic!
Returning to the example, the result of blurring results as follows:
((((((A & 1) + (A | 1)) + B) - (((A & 1) + (A | 1)) | B)) + C) - (((((A & 1) + (A | 1)) + B) - (((A & 1) + (A | 1)) | B)) | C))
While if you wanted to have another level of blurring starting from the expression from before, here is the result:
((((((((((((A + 1) - (A | 1)) & (A + (1 + (1 + ((~A) | (~1)))))) + (((A + 1) - (A | 1)) | (A + (1 +
(1 + ((~A) | (~1))))))) & B) + (((((A + 1) - (A | 1)) & (A + (1 + (1 + ((~A) | (~1)))))) + (((A + 1)
- (A | 1)) | (A + (1 + (1 + ((~A) | (~1))))))) | B)) ^ (-(((((A + 1) - (A | 1)) & (A + (1 + (1 + ((~A
) | (~1)))))) + (((A + 1) - (A | 1)) | (A + (1 + (1 + ((~A) | (~1))))))) + (B + (1 + ((~((((A + 1) -
(A | 1)) & (A + (1 + (1 + ((~A) | (~1)))))) + (((A + 1) - (A | 1)) | (A + (1 + (1 + ((~A) | (~1))))))
)) | (~B))))))) + (2 * (((((((A + 1) - (A | 1)) & (A + (1 + (1 + ((~A) | (~1)))))) + (((A + 1) - (A |
1)) | (A + (1 + (1 + ((~A) | (~1))))))) & B) + (((((A + 1) - (A | 1)) & (A + (1 + (1 + ((~A) | (~1)
))))) + (((A + 1) - (A | 1)) | (A + (1 + (1 + ((~A) | (~1))))))) | B)) & (-(((((A + 1) - (A | 1)) &
A + (1 + (1 + ((~A) | (~1)))))) + (((A + 1) - (A | 1)) | (A + (1 + (1 + ((~A) | (~1))))))) + (B + (1
+ ((~((((A + 1) - (A | 1)) & (A + (1 + (1 + ((~A) | (~1)))))) + (((A + 1) - (A | 1)) | (A + (1 + (1
+ ((~A) | (~1)))))))) | (~B))))))))) & C) + (((((((((A + 1) - (A | 1)) & (A + (1 + (1 + ((~A) | (~1)
)))) + (((A + 1) - (A | 1)) | (A + (1 + (1 + ((~A) | (~1))))))) & B) + (((((A + 1) - (A | 1)) & (A +
(1 + (1 + ((~A) | (~1)))))) + (((A + 1) - (A | 1)) | (A + (1 + (1 + ((~A) | (~1))))))) | B)) ^ (-(((
(A + 1) - (A | 1)) & (A + (1 + (1 + ((~A) | (~1)))))) + (((A + 1) - (A | 1)) | (A + (1 + (1 + ((~A) |
(~1))))))) + (B + (1 + ((~((((A + 1) - (A | 1)) & (A + (1 + (1 + ((~A) | (~1)))))) + (((A + 1) -
(A | 1)) | (A + (1 + (1 + ((~A) | (~1)))))))) | (~B))))))) + (2 * (((((((A + 1) - (A | 1)) & (A +
(1 + (1 + ((~A) | (~1)))))) + (((A + 1) - (A | 1)) | (A + (1 + (1 + ((~A) | (~1))))))) & B) + ((
(((A + 1) - (A | 1)) & (A + (1 + (1 + ((~A) | (~1)))))) + (((A + 1) - (A | 1)) | (A + (1 + (1 + (
(~A) | (~1))))))) | B)) & (-(((((A + 1) - (A | 1)) & (A + (1 + (1 + ((~A) | (~1)))))) + (((A
+ 1) - (A | 1)) | (A + (1 + (1 + ((~A) | (~1))))))) + (B + (1 + ((~((((A + 1) - (A | 1)) & (A + (
1 + (1 + ((~A) | (~1)))))) + (((A + 1) - (A | 1)) | (A + (1 + (1 + ((~A) | (~1)))))))) | (~B)
)))))))) | C)) ^ (-(((((((((A + 1) - (A | 1)) & (A + (1 + (1 + ((~A) | (~1)))))) + (((A + 1) - (
A | 1)) | (A + (1 + (1 + ((~A) | (~1))))))) & B) + (((((A + 1) - (A | 1)) & (A + (1 + (1 + ((~A)
| (~1)))))) + (((A + 1) - (A | 1)) | (A + (1 + (1 + ((~A) | (~1))))))) | B)) ^ (-(((((A + 1)
- (A | 1)) & (A + (1 + (1 + ((~A) | (~1)))))) + (((A + 1) - (A | 1)) | (A + (1 + (1 + ((~A) | (~1
))))))) + (B + (1 + ((~((((A + 1) - (A | 1)) & (A + (1 + (1 + ((~A) | (~1)))))) + (((A + 1) -
(A | 1)) | (A + (1 + (1 + ((~A) | (~1)))))))) | (~B))))))) + (2 * (((((((A + 1) - (A | 1)) & (A + (1
+ (1 + ((~A) | (~1)))))) + (((A + 1) - (A | 1)) | (A + (1 + (1 + ((~A) | (~1))))))) & B) + (((((A + 1
) - (A | 1)) & (A + (1 + (1 + ((~A) | (~1)))))) + (((A + 1) - (A | 1)) | (A + (1 + (1 + ((~A) | (~1)
)))))) | B)) & (-(((((A + 1) - (A | 1)) & (A + (1 + (1 + ((~A) | (~1)))))) + (((A + 1) - (A | 1)) |
(A + (1 + (1 + ((~A) | (~1))))))) + (B + (1 + ((~((((A + 1) - (A | 1)) & (A + (1 + (1 + ((~A) | (~1)
)))) + (((A + 1) - (A | 1)) | (A + (1 + (1 + ((~A) | (~1)))))))) | (~B))))))))) + (C + (1 + ((~(((((
((A + 1) - (A | 1)) & (A + (1 + (1 + ((~A) | (~1)))))) + (((A + 1) - (A | 1)) | (A + (1 + (1 + ((~A)
| (~1))))))) & B) + (((((A + 1) - (A | 1)) & (A + (1 + (1 + ((~A) | (~1)))))) + (((A + 1) - (A | 1))
| (A + (1 + (1 + ((~A) | (~1))))))) | B)) ^ (-(((((A + 1) - (A | 1)) & (A + (1 + (1 + ((~A) | (~1)))
)) + (((A + 1) - (A | 1)) | (A + (1 + (1 + ((~A) | (~1))))))) + (B + (1 + ((~((((A + 1) - (A | 1)) &
(A + (1 + (1 + ((~A) | (~1)))))) + (((A + 1) - (A | 1)) | (A + (1 + (1 + ((~A) | (~1)))))))) | (~B))
)))) + (2 * (((((((A + 1) - (A