# Building a binary

### Abstract
*In this article, we discuss the process of compilation. We need to understand how something is built, prior to its reverse-engineering.*

*We will work with ```clang``` on Debian, a quite common setup. We will also take a look at the same process on MacOS.*

# The process of Compilation

In this article, we walk through the process of C compilation. With the appropriate changes, this exercise can be replicated with other programming languages. We focus on C because its build process breaks down into clearly separate steps and is therefore easy to analyse.

Here we work with ```clang```; the steps with GCC are identical. For practical reasons (we are running Debian on a VM hosted on MacOS), we will mostly show the MacOS examples, highlighting the differences where there are any.

We will work with the following program:

```
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define		MESSAGE1		"Initialising random number generator"
#define		NL			  "\n"
#define		ENDL			printf("%s",NL)

#define		MAXNUM			73

int mySubroutine(){
	int value;
	// Initialising random number generator
	printf("%s", MESSAGE1);
	ENDL;
	srand((unsigned) time(0));
	value = ((int)rand()) % MAXNUM;

	return value;
}

int main(int argc, char const *argv[])
{
	int myNum = mySubroutine();
	printf("the random number is %d", myNum);
	ENDL;
	return 0;
}
```
The program is very straightforward.
In the first three lines, some headers of the C standard library are included. These instructions are usually called **include directives** and do what you expect them to do: they insert the contents of header files into the current program, so that the declarations of the library functions become available.

The ```#define``` directives define **macros**, which act as placeholders. Early in the compilation process, all instances of the macros are replaced with the specified text. To make this concrete, take the ```NL``` macro: it is only used inside another macro, ```ENDL```.

The text preprocessor recursively replaces all the macros, hence ```ENDL``` becomes ```printf("%s","\n")```, and all its occurrences are replaced with this value.


The remaining part consists of one subroutine, namely ```mySubroutine```, and the ```main``` function.

*Observe that this program is quite unusual: everything is in the same file, which is not actually the best practice for C programming.*

In short, the process of building a binary works as follows:

![CompilationProcess.png](https://cdn.hashnode.com/res/hashnode/image/upload/v1647945820549/f0oZ2rGG9.png)

## Preprocessing
The very first thing that is required to create an executable (binary) consistent with the intentions of the developer is the actual inclusion of all required files and the replacement of the macros. 

This phase is called **preprocessing**. During this phase, the contents of the included headers are pasted in, and all the occurrences of the macros are replaced with the corresponding definitions. We can see the output of this phase by running the command ```clang -E filename``` (or, if you prefer using GCC, ```gcc -E filename```).

The result in the Debian environment contains some interesting chunks of code. For instance, it shows the contents of all included headers (below we show part of the ```stdlib.h``` header):

![Screenshot 2022-03-21 at 14.49.19.png](https://cdn.hashnode.com/res/hashnode/image/upload/v1647874176151/eGyH-3cPu.png)

More interestingly, here is the result of the preprocessing on our program:

```
int mySubroutine(){
 int value;

 printf("%s", "Initialising random number generator");
 printf("%s","\n");
 srand((unsigned) time(0));
 value = ((int)rand()) % 73;

 return value;
}

int main(int argc, char const *argv[])
{
 int myNum = mySubroutine();
 printf("the random number is %d", myNum);
 printf("%s","\n");
 return 0;
}
```

All the occurrences of macros have been successfully replaced. A similar behavior can be observed on the MacOS machine. Here we show the initial inclusions:

![Screenshot 2022-03-21 at 14.59.54.png](https://cdn.hashnode.com/res/hashnode/image/upload/v1647874816463/ykYiHXTej.png)

Observe that the result of the preprocessing phase is still a C program: all the headers have been included, and all macros have been replaced. In fact, the only difference is that the code is more verbose and nothing is left that isn't explicitly declared. In the next step, the actual compilation, the output of this phase is taken and transformed into assembly code.

## Compilation
Once all headers have been included and the macros replaced, the code can be compiled. With the process of **compilation**, the preprocessed code is translated from C to assembly. We can stop ```clang``` (or ```gcc```) after the compilation phase using the ```-S``` flag, which generates an assembly source file with the ```.s``` extension (```.s``` stands for 'source').

On our MacOS setup, this gives the following:

```
gbiondo@tripleX BA % clang -S main.c
gbiondo@tripleX BA % head -30 main.s
	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 12, 0	sdk_version 12, 1
	.globl	_mySubroutine                   ## -- Begin function mySubroutine
	.p2align	4, 0x90
_mySubroutine:                          ## @mySubroutine
	.cfi_startproc
## %bb.0:
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register %rbp
	subq	$16, %rsp
	leaq	L_.str(%rip), %rdi
	leaq	L_.str.1(%rip), %rsi
	movb	$0, %al
	callq	_printf
	leaq	L_.str(%rip), %rdi
	leaq	L_.str.2(%rip), %rsi
	movb	$0, %al
	callq	_printf
	xorl	%eax, %eax
	movl	%eax, %edi
	callq	_time
	movl	%eax, %edi
	callq	_srand
	callq	_rand
	cltd
	movl	$73, %ecx
	idivl	%ecx
```
It is immediately noticeable that the syntax used here is the AT&T one. To switch to the Intel syntax, we can use the switch ```-masm=intel```. To me, it's just a matter of habit: I am more used to the Intel syntax, but the contents don't really change.

Putting it all together, we obtain:

```
gbiondo@tripleX BA % clang -S -masm=intel main.c
gbiondo@tripleX BA % cat main.s            
	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 12, 0	sdk_version 12, 1
	.intel_syntax noprefix
	.globl	_mySubroutine                   ## -- Begin function mySubroutine
	.p2align	4, 0x90
_mySubroutine:                          ## @mySubroutine
	.cfi_startproc
## %bb.0:
	push	rbp
	.cfi_def_cfa_offset 16
	.cfi_offset rbp, -16
	mov	rbp, rsp
	.cfi_def_cfa_register rbp
	sub	rsp, 16
	lea	rdi, [rip + L_.str]
	lea	rsi, [rip + L_.str.1]
	mov	al, 0
	call	_printf
	lea	rdi, [rip + L_.str]
	lea	rsi, [rip + L_.str.2]
	mov	al, 0
	call	_printf
	xor	eax, eax
	mov	edi, eax
	call	_time
	mov	edi, eax
	call	_srand
	call	_rand
	cdq
	mov	ecx, 73
	idiv	ecx
	mov	dword ptr [rbp - 4], edx
	mov	eax, dword ptr [rbp - 4]
	add	rsp, 16
	pop	rbp
	ret
	.cfi_endproc
                                        ## -- End function
	.globl	_main                           ## -- Begin function main
	.p2align	4, 0x90
_main:                                  ## @main
	.cfi_startproc
## %bb.0:
	push	rbp
	.cfi_def_cfa_offset 16
	.cfi_offset rbp, -16
	mov	rbp, rsp
	.cfi_def_cfa_register rbp
	sub	rsp, 32
	mov	dword ptr [rbp - 4], 0
	mov	dword ptr [rbp - 8], edi
	mov	qword ptr [rbp - 16], rsi
	call	_mySubroutine
	mov	dword ptr [rbp - 20], eax
	mov	esi, dword ptr [rbp - 20]
	lea	rdi, [rip + L_.str.3]
	mov	al, 0
	call	_printf
	lea	rdi, [rip + L_.str]
	lea	rsi, [rip + L_.str.2]
	mov	al, 0
	call	_printf
	xor	eax, eax
	add	rsp, 32
	pop	rbp
	ret
	.cfi_endproc
                                        ## -- End function
	.section	__TEXT,__cstring,cstring_literals
L_.str:                                 ## @.str
	.asciz	"%s"

L_.str.1:                               ## @.str.1
	.asciz	"Initialising random number generator"

L_.str.2:                               ## @.str.2
	.asciz	"\n"

L_.str.3:                               ## @.str.3
	.asciz	"the random number is %d"

.subsections_via_symbols
```

Interestingly, we can see the code for the two subroutines (```mySubroutine``` and ```main```) well delineated, as well as the definitions of the C strings we have used.

The result of compilation is still something one can understand (as long as one can read assembly code, of course), but not yet something a machine can run. In fact we have:

```
gbiondo@tripleX BA % file main.s
main.s: assembler source text, ASCII text
```
So the file is still a text file, not a binary!

## Assembly
The next stage is the so-called **assembly** phase, in which the assembly code that has been produced in the previous stage is now converted into opcodes. The output of this phase is an object (```.o```) file, which can be obtained by running ```clang``` (or ```gcc```) with the ```-c``` switch. 

We have:
```
gbiondo@tripleX BA % clang -c main.c 
gbiondo@tripleX BA % file main.o
main.o: Mach-O 64-bit object x86_64
```
and - obviously! - in the Debian environment, we'll have:
```
DebianShellcode% gcc -c main.c 
DebianShellcode% file main.o 
main.o: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), not stripped
```
Analysing this file is a bit more complex. One way could be dumping its contents byte by byte with ```hexdump -c main.o```, which prints each byte as a character where possible. Part of the result is reported in the image below:

![Screenshot 2022-03-22 at 10.20.01.png](https://cdn.hashnode.com/res/hashnode/image/upload/v1647944473264/iTkP09pQb.png)

Observe that apart from some strings (like the highlighted ones), the format is not human-readable. Here the difference between ELF and Mach-O files is a bit more evident. Consider the last lines produced by the hexdump on a MacOS system:

```
00004a0  \0  \0  \0  \0  \0  \0  \0  \0  \0   _   m   a   i   n  \0   _
00004b0   p   r   i   n   t   f  \0   _   m   y   S   u   b   r   o   u
00004c0   t   i   n   e  \0   _   t   i   m   e  \0   _   s   r   a   n
00004d0   d  \0   _   r   a   n   d  \0      
``` 
These lines contain a NUL-terminated list of the names of all subroutines defined or invoked in the program. The file size is also different, but once again, we are talking about two different file formats, so this should be expected.

We can now produce an executable file. 

## Linking
The result of the previous phases is usually a collection of object files. During the **linking** phase, they are all combined into a single executable file. Library code may be copied into the executable (static linking) or left in shared libraries that are resolved when the program is loaded (dynamic linking). This topic is outside the scope of this article; more information can be found, for instance, in [BUFFER OVERFLOW 4](https://www.tenouk.com/Bufferoverflowc/Bufferoverflow1c.html).

Linking can then be performed as follows (when given a ```.c``` file, this command runs all the previous phases as well):

```
gbiondo@tripleX BA % clang main.c -o main
gbiondo@tripleX BA % ls -al
total 128
drwxr-xr-x   6 gbiondo  staff    192 22 Mar 10:37 .
drwxr-xr-x  33 gbiondo  staff   1056 22 Mar 09:12 ..
-rwxr-xr-x   1 gbiondo  staff  49600 22 Mar 10:37 main
-rw-r--r--   1 gbiondo  staff    508 21 Mar 14:34 main.c
-rw-r--r--   1 gbiondo  staff   1240 22 Mar 10:13 main.o
-rw-r--r--   1 gbiondo  staff   1892 22 Mar 10:01 main.s
gbiondo@tripleX BA % file main 
main: Mach-O 64-bit executable x86_64
```
Obviously, on the Debian machine we'll have:

```
DebianShellcode% gcc main.c -o main   
DebianShellcode% file main
main: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[sha1]=a88cde6d224edbea15116fd89ea89ed50b32703e, not stripped
```
*Note: the Linux ```file``` command gives a more interesting output.*

# Conclusions
This article is just a foundation for future developments. Actually, it's a bit counterintuitive, but one cannot reverse a process (in this case, binary creation) without knowing the process itself.
In theory, I didn't write anything new, but this is a vast and fascinating field. I just hope I gave you another view on one of the most basic processes of software development.
