Building a binary
Binary Analysis 101

Math guy who's into cryptography, iOS/macOS development, and, obviously, hacking/pentesting. Writing stuff in C, C++, Objective-C, Swift, Python and assembly.
Abstract
In this article, we discuss the process of compilation. We need to understand how something is built before we can reverse-engineer it.
We will work with clang on Debian, a fairly common setup. We will also look at the same process on macOS.
The process of Compilation
In this article, we look at how a C program is compiled. With the appropriate changes, this exercise can be replicated with other programming languages. We focus on C because its build process splits cleanly into separate phases and is therefore easy to analyse.
Here we work with clang; the steps with GCC are identical. For practical reasons (we are running Debian in a VM hosted on macOS), we will mostly give the macOS examples, highlighting the differences where they exist.
We will work with the following program:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define MESSAGE1 "Initialising random number generator"
#define NL "\n"
#define ENDL printf("%s",NL)
#define MAXNUM 73
int mySubroutine(){
    int value;
    // Initialising random number generator
    printf("%s", MESSAGE1);
    ENDL;
    srand((unsigned) time(0));
    value = ((int)rand()) % MAXNUM;
    return value;
}

int main(int argc, char const *argv[])
{
    int myNum = mySubroutine();
    printf("the random number is %d", myNum);
    ENDL;
    return 0;
}
The program is very straightforward. In the first three lines, some headers of the C standard library are included. These lines are called include directives and do what you expect them to do: they paste the contents of the named header files into the current program, supplying the declarations of the functions we want to use.
The #define directives define macros, which act as placeholders. During preprocessing, every occurrence of a macro is replaced with the specified text. To fix ideas, take the NL macro: it is only used inside another macro, ENDL.
The preprocessor replaces macros recursively, hence ENDL becomes printf("%s","\n"), and every occurrence of ENDL is replaced with this text.
The rest of the file consists of one subroutine, namely mySubroutine, and the main program.
Observe that this program is quite unusual: everything is in the same file, which is not the best practice for C programming.
In short, the process of building a binary goes through four phases (preprocessing, compilation, assembly and linking), as follows:

Preprocessing
The very first thing required to create an executable (binary) consistent with the intentions of the developer is the actual inclusion of all required files and the replacement of the macros.
This phase is called preprocessing. During preprocessing, the headers are included, and all the occurrences of the macros are replaced with the corresponding definitions. We can see the output of this phase by running the command clang -E filename (or, if you prefer using GCC, gcc -E filename).
The result in the Debian environment contains some interesting chunks of code. For instance, it shows the contents of all included headers (below we show parts of the stdlib.h header):

More interestingly, the result of the preprocessing on our program:
int mySubroutine(){
    int value;
    printf("%s", "Initialising random number generator");
    printf("%s","\n");
    srand((unsigned) time(0));
    value = ((int)rand()) % 73;
    return value;
}

int main(int argc, char const *argv[])
{
    int myNum = mySubroutine();
    printf("the random number is %d", myNum);
    printf("%s","\n");
    return 0;
}
All the occurrences of the macros have been replaced, and the comment has been stripped. A similar behaviour can be observed on the macOS machine. Here we show the initial inclusions:

Observe that the result of the preprocessing phase is still a C program: all the headers have been included, and all macros have been replaced. In fact, the only difference is that the code is more verbose and nothing is left that isn't explicitly declared. In the next step, the actual compilation, the output of the previous phase is taken and transformed into assembly code.
Compilation
Once all headers have been included and all macros replaced, the code can be compiled. During compilation, the preprocessed code is translated from C to assembly. We can stop clang (or gcc) after the compilation phase with the -S flag, which produces an assembly source file with the .s extension.
On our macOS setup, this gives the following:
gbiondo@tripleX BA % clang -S main.c
gbiondo@tripleX BA % head -30 main.s
.section __TEXT,__text,regular,pure_instructions
.build_version macos, 12, 0 sdk_version 12, 1
.globl _mySubroutine ## -- Begin function mySubroutine
.p2align 4, 0x90
_mySubroutine: ## @mySubroutine
.cfi_startproc
## %bb.0:
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
subq $16, %rsp
leaq L_.str(%rip), %rdi
leaq L_.str.1(%rip), %rsi
movb $0, %al
callq _printf
leaq L_.str(%rip), %rdi
leaq L_.str.2(%rip), %rsi
movb $0, %al
callq _printf
xorl %eax, %eax
movl %eax, %edi
callq _time
movl %eax, %edi
callq _srand
callq _rand
cltd
movl $73, %ecx
idivl %ecx
It is immediately noticeable that the listing uses the AT&T syntax. To switch to the more familiar Intel syntax, we can use the -masm=intel switch. To me, it's just a matter of habit: I am more used to the Intel syntax, but the contents don't really change.
Putting it all together, we obtain:
gbiondo@tripleX BA % clang -S -masm=intel main.c
gbiondo@tripleX BA % cat main.s
.section __TEXT,__text,regular,pure_instructions
.build_version macos, 12, 0 sdk_version 12, 1
.intel_syntax noprefix
.globl _mySubroutine ## -- Begin function mySubroutine
.p2align 4, 0x90
_mySubroutine: ## @mySubroutine
.cfi_startproc
## %bb.0:
push rbp
.cfi_def_cfa_offset 16
.cfi_offset rbp, -16
mov rbp, rsp
.cfi_def_cfa_register rbp
sub rsp, 16
lea rdi, [rip + L_.str]
lea rsi, [rip + L_.str.1]
mov al, 0
call _printf
lea rdi, [rip + L_.str]
lea rsi, [rip + L_.str.2]
mov al, 0
call _printf
xor eax, eax
mov edi, eax
call _time
mov edi, eax
call _srand
call _rand
cdq
mov ecx, 73
idiv ecx
mov dword ptr [rbp - 4], edx
mov eax, dword ptr [rbp - 4]
add rsp, 16
pop rbp
ret
.cfi_endproc
## -- End function
.globl _main ## -- Begin function main
.p2align 4, 0x90
_main: ## @main
.cfi_startproc
## %bb.0:
push rbp
.cfi_def_cfa_offset 16
.cfi_offset rbp, -16
mov rbp, rsp
.cfi_def_cfa_register rbp
sub rsp, 32
mov dword ptr [rbp - 4], 0
mov dword ptr [rbp - 8], edi
mov qword ptr [rbp - 16], rsi
call _mySubroutine
mov dword ptr [rbp - 20], eax
mov esi, dword ptr [rbp - 20]
lea rdi, [rip + L_.str.3]
mov al, 0
call _printf
lea rdi, [rip + L_.str]
lea rsi, [rip + L_.str.2]
mov al, 0
call _printf
xor eax, eax
add rsp, 32
pop rbp
ret
.cfi_endproc
## -- End function
.section __TEXT,__cstring,cstring_literals
L_.str: ## @.str
.asciz "%s"
L_.str.1: ## @.str.1
.asciz "Initialising random number generator"
L_.str.2: ## @.str.2
.asciz "\n"
L_.str.3: ## @.str.3
.asciz "the random number is %d"
.subsections_via_symbols
Interestingly, we can see the code for the two subroutines (mySubroutine and main) well delineated, as well as the definitions of the C strings we have used.
The result of compilation is still something one can read (as long as one understands assembly code, of course), but not yet something a machine can run. In fact we have:
gbiondo@tripleX BA % file main.s
main.s: assembler source text, ASCII text
So the file is still a text file, not a binary!
Assembly
The next stage is the so-called assembly phase, in which the assembly code produced in the previous stage is converted into machine code. The output of this phase is an object (.o) file, which can be obtained by running clang (or gcc) with the -c switch.
We have:
gbiondo@tripleX BA % clang -c main.c
gbiondo@tripleX BA % file main.o
main.o: Mach-O 64-bit object x86_64
and, as expected, in the Debian environment we have:
DebianShellcode% gcc -c main.c
DebianShellcode% file main.o
main.o: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), not stripped
Analysing the object file is a bit more complex. One way is to dump its contents byte by byte with hexdump -c main.o. Part of the result is reported in the image below:

Observe that, apart from some strings (like the highlighted ones), the format is not human-readable. Here the difference between ELF and Mach-O files is a bit more evident. Consider the last lines produced by hexdump on a macOS system:
00004a0 \0 \0 \0 \0 \0 \0 \0 \0 \0 _ m a i n \0 _
00004b0 p r i n t f \0 _ m y S u b r o u
00004c0 t i n e \0 _ t i m e \0 _ s r a n
00004d0 d \0 _ r a n d \0
These lines contain a list of null-terminated symbol names: the subroutines the program defines (_main, _mySubroutine) and the ones it calls (_printf, _time, _srand, _rand). The file size is also different, but once again we are talking about two different object file formats, so this should be expected.
We can now produce an executable file.
Linking
The result of the previous phases is usually a collection of object files. During the linking phase, they are all combined into a single executable file. Library code may be copied into the executable (static linking) or left in shared libraries that are resolved when the program is loaded (dynamic linking). This topic is outside the scope of this article; more information can be found, for instance, in BUFFER OVERFLOW 4.
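The executable can then be obtained as follows. Note that invoking clang on main.c runs all four phases in one go; passing main.o instead would only link the object file produced earlier.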
gbiondo@tripleX BA % clang main.c -o main
gbiondo@tripleX BA % ls -al
total 128
drwxr-xr-x 6 gbiondo staff 192 22 Mar 10:37 .
drwxr-xr-x 33 gbiondo staff 1056 22 Mar 09:12 ..
-rwxr-xr-x 1 gbiondo staff 49600 22 Mar 10:37 main
-rw-r--r-- 1 gbiondo staff 508 21 Mar 14:34 main.c
-rw-r--r-- 1 gbiondo staff 1240 22 Mar 10:13 main.o
-rw-r--r-- 1 gbiondo staff 1892 22 Mar 10:01 main.s
gbiondo@tripleX BA % file main
main: Mach-O 64-bit executable x86_64
On the Debian machine we have:
DebianShellcode% gcc main.c -o main
DebianShellcode% file main
main: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[sha1]=a88cde6d224edbea15116fd89ea89ed50b32703e, not stripped
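Note: the Linux file command gives more detailed output, including the fact that the executable is dynamically linked and which interpreter (dynamic loader) it requests.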
Conclusions
This article is just a foundation for future developments. It may sound obvious, but one cannot reverse a process (in this case, binary creation) without knowing the process itself. I didn't write anything new here, but this is a vast and fascinating field, and I hope I gave you another view on one of the most basic processes of software development.




