Building a binary

Binary Analysis 101

Math guy who's into Cryptography, into iOS/MacOS development, and obviously into hacking/pentesting. Writing stuff in C/C++/ObjectiveC/Swift/Python/Assembly.

Abstract

In this article, we discuss the process of compilation: before reverse-engineering something, we need to understand how it is built.

We will work with clang on Debian, a fairly common setup, and also look at the same process on MacOS.

The process of Compilation

In this article, we walk through the compilation of a C program. With the appropriate changes, the exercise can be replicated with other programming languages. We focus on C because its build process splits cleanly into separate stages, each of which can be inspected on its own.

Here we work with clang; the steps with GCC are identical. For practical reasons (we are running Debian in a VM hosted on MacOS), we mostly show the MacOS examples and highlight the differences where there are any.

We will work with the following program:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define        MESSAGE1        "Initialising random number generator"
#define        NL              "\n"
#define        ENDL            printf("%s",NL)

#define        MAXNUM            73

int mySubroutine(){
    int value;
    // Initialising random number generator
    printf("%s", MESSAGE1);
    ENDL;
    srand((unsigned) time(0));
    value = ((int)rand()) % MAXNUM;

    return value;
}

int main(int argc, char const *argv[])
{
    int myNum = mySubroutine();
    printf("the random number is %d", myNum);
    ENDL;
    return 0;
}

The program is very straightforward. The first three lines are include directives, and they do what you would expect: they paste the contents of header files (here, headers of the C standard library) into the current source file, supplying the declarations of the functions the program uses.

The #define directives create macros, which act as placeholders. During preprocessing, every occurrence of a macro is replaced with the specified text. As a concrete example, take the NL macro: it is only used inside another macro, ENDL.

The text preprocessor recursively replaces all the macros, hence ENDL becomes printf("%s","\n"), and all its occurrences are replaced with this value.
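
To watch this recursive replacement without running the compiler by hand, one can ask the preprocessor to turn an expansion into a string literal. The sketch below reuses the macros from the program above; STR, XSTR, and the two functions are hypothetical helper names that are not part of the original program.

```c
/* Macros copied from the program above. */
#define NL      "\n"
#define ENDL    printf("%s",NL)
#define MAXNUM  73

/* Hypothetical helpers: XSTR expands its argument first, then STR
 * turns the resulting tokens into a string literal. */
#define STR(x)  #x
#define XSTR(x) STR(x)

/* Text that MAXNUM is replaced with. */
const char *expansion_of_maxnum(void) {
    return XSTR(MAXNUM);
}

/* Text that ENDL is replaced with, after NL has been replaced too. */
const char *expansion_of_endl(void) {
    return XSTR(ENDL);
}
```

Printing the string returned by expansion_of_endl() should show the printf call with NL already replaced, which is the same substitution the preprocessor performs on every occurrence of ENDL.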

The rest of the file consists of one subroutine, mySubroutine, and the main function.

Observe that this program is quite unusual - everything is in the same file, which is not actually the best practice for C programming.

In short, the process of building a binary works as follows:

CompilationProcess.png

Preprocessing

The first step in creating an executable (binary) that matches the developer's intentions is to include all the required files and to replace the macros.

This phase is called preprocessing. During it, the header files are included, and every occurrence of a macro is replaced with the corresponding definition. We can see the output of this phase by running the command clang -E filename (or, if you prefer using GCC, gcc -E filename).

The result in the Debian environment contains some interesting chunks of code. For instance, it shows the contents of all included headers (below we show part of the stdlib.h header):

Screenshot 2022-03-21 at 14.49.19.png

More interestingly, here is the result of preprocessing our own code:

int mySubroutine(){
 int value;

 printf("%s", "Initialising random number generator");
 printf("%s","\n");
 srand((unsigned) time(0));
 value = ((int)rand()) % 73;

 return value;
}

int main(int argc, char const *argv[])
{
 int myNum = mySubroutine();
 printf("the random number is %d", myNum);
 printf("%s","\n");
 return 0;
}

All the occurrences of macros have been successfully replaced. A similar behavior can be observed on the MacOS machine. Here we show the initial inclusions:

Screenshot 2022-03-21 at 14.59.54.png

Observe that the result of the preprocessing phase is still a C program: all the headers have been included, and all macros have been replaced. The only difference is that the code is more verbose and every declaration it relies on is now spelled out explicitly. In the next step, the actual compilation, this output is translated into assembly code.
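
One way to convince oneself that a header contributes nothing but text is to skip the #include and write the needed declaration by hand. The following is a minimal sketch, assuming a hosted C environment in which the standard library is linked by default; say_hello is a hypothetical name used only for illustration.

```c
/* Declaration written by hand instead of pulling in <stdio.h>.
 * It matches the standard prototype of puts. */
int puts(const char *s);

/* Hypothetical helper: prints a line and reports whether puts succeeded. */
int say_hello(void) {
    /* puts returns a non-negative value on success. */
    return puts("hello from a hand-declared puts");
}
```

The file compiles without any #include because the declaration is all the compiler needs at this stage; the definition of puts is only supplied later, when the program is linked.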

Compilation

Once all headers have been included and all macros replaced, the code can be compiled. During compilation, the preprocessed code is translated from C to assembly. We can stop clang (or gcc) after the compilation phase with the -S flag, which generates an assembly source (.s) file.

On our MacOS setup, this gives the following:

gbiondo@tripleX BA % clang -S main.c
gbiondo@tripleX BA % head -30 main.s
    .section    __TEXT,__text,regular,pure_instructions
    .build_version macos, 12, 0    sdk_version 12, 1
    .globl    _mySubroutine                   ## -- Begin function mySubroutine
    .p2align    4, 0x90
_mySubroutine:                          ## @mySubroutine
    .cfi_startproc
## %bb.0:
    pushq    %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset %rbp, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register %rbp
    subq    $16, %rsp
    leaq    L_.str(%rip), %rdi
    leaq    L_.str.1(%rip), %rsi
    movb    $0, %al
    callq    _printf
    leaq    L_.str(%rip), %rdi
    leaq    L_.str.2(%rip), %rsi
    movb    $0, %al
    callq    _printf
    xorl    %eax, %eax
    movl    %eax, %edi
    callq    _time
    movl    %eax, %edi
    callq    _srand
    callq    _rand
    cltd
    movl    $73, %ecx
    idivl    %ecx

It is immediately noticeable that the syntax used here is the AT&T one. To switch to Intel syntax, we can use the -masm=intel switch. It is just a matter of habit: I am more used to the Intel syntax, but the contents don't really change.

Putting it all together, we obtain:

gbiondo@tripleX BA % clang -S -masm=intel main.c
gbiondo@tripleX BA % cat main.s            
    .section    __TEXT,__text,regular,pure_instructions
    .build_version macos, 12, 0    sdk_version 12, 1
    .intel_syntax noprefix
    .globl    _mySubroutine                   ## -- Begin function mySubroutine
    .p2align    4, 0x90
_mySubroutine:                          ## @mySubroutine
    .cfi_startproc
## %bb.0:
    push    rbp
    .cfi_def_cfa_offset 16
    .cfi_offset rbp, -16
    mov    rbp, rsp
    .cfi_def_cfa_register rbp
    sub    rsp, 16
    lea    rdi, [rip + L_.str]
    lea    rsi, [rip + L_.str.1]
    mov    al, 0
    call    _printf
    lea    rdi, [rip + L_.str]
    lea    rsi, [rip + L_.str.2]
    mov    al, 0
    call    _printf
    xor    eax, eax
    mov    edi, eax
    call    _time
    mov    edi, eax
    call    _srand
    call    _rand
    cdq
    mov    ecx, 73
    idiv    ecx
    mov    dword ptr [rbp - 4], edx
    mov    eax, dword ptr [rbp - 4]
    add    rsp, 16
    pop    rbp
    ret
    .cfi_endproc
                                        ## -- End function
    .globl    _main                           ## -- Begin function main
    .p2align    4, 0x90
_main:                                  ## @main
    .cfi_startproc
## %bb.0:
    push    rbp
    .cfi_def_cfa_offset 16
    .cfi_offset rbp, -16
    mov    rbp, rsp
    .cfi_def_cfa_register rbp
    sub    rsp, 32
    mov    dword ptr [rbp - 4], 0
    mov    dword ptr [rbp - 8], edi
    mov    qword ptr [rbp - 16], rsi
    call    _mySubroutine
    mov    dword ptr [rbp - 20], eax
    mov    esi, dword ptr [rbp - 20]
    lea    rdi, [rip + L_.str.3]
    mov    al, 0
    call    _printf
    lea    rdi, [rip + L_.str]
    lea    rsi, [rip + L_.str.2]
    mov    al, 0
    call    _printf
    xor    eax, eax
    add    rsp, 32
    pop    rbp
    ret
    .cfi_endproc
                                        ## -- End function
    .section    __TEXT,__cstring,cstring_literals
L_.str:                                 ## @.str
    .asciz    "%s"

L_.str.1:                               ## @.str.1
    .asciz    "Initialising random number generator"

L_.str.2:                               ## @.str.2
    .asciz    "\n"

L_.str.3:                               ## @.str.3
    .asciz    "the random number is %d"

.subsections_via_symbols

Interestingly, we can see the code for the two subroutines (mySubroutine and main) clearly delineated, as well as the definitions of the C strings we have used.
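
The cdq and idiv ecx instructions near the end of mySubroutine implement the % MAXNUM operation: idiv leaves the quotient in eax and the remainder in edx, and it is edx that gets stored in value. As a rough C-level illustration, the standard div function returns the same quotient and remainder pair. The name remainder_like_idiv is hypothetical and is not something the compiler emits.

```c
#include <stdlib.h>

#define MAXNUM 73

/* Returns raw % MAXNUM by way of div(), which computes the quotient and
 * the remainder together, much like a single idiv instruction does. */
int remainder_like_idiv(int raw) {
    div_t qr = div(raw, MAXNUM);   /* qr.quot plays the role of eax, qr.rem of edx */
    return qr.rem;
}
```

For any int argument the function returns the same value as raw % MAXNUM, since both are defined in terms of division that truncates toward zero.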

The result of compilation is still something a human can understand (as long as they can read assembly code), but not yet something a machine can run. In fact we have:

gbiondo@tripleX BA % file main.s
main.s: assembler source text, ASCII text

So the file is still a text file, not a binary!

Assembly

The next stage is the assembly phase, in which the assembly code produced in the previous stage is converted into machine code. The output of this phase is an object (.o) file, which can be obtained by running clang (or gcc) with the -c switch.

We have:

gbiondo@tripleX BA % clang -c main.c 
gbiondo@tripleX BA % file main.o
main.o: Mach-O 64-bit object x86_64

and, as expected, in the Debian environment we have:

DebianShellcode% gcc -c main.c 
DebianShellcode% file main.o 
main.o: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), not stripped
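
The file command tells the two formats apart largely by looking at the first bytes of the file, the so-called magic number. The sketch below mimics that check for the two cases seen here; classify_magic is a hypothetical helper, and it deliberately ignores every other format that the real file command knows about.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: guesses the object format from the first four bytes. */
const char *classify_magic(const unsigned char *buf, size_t len) {
    /* ELF files start with 0x7f followed by the letters "ELF". */
    static const unsigned char elf[4] = {0x7f, 'E', 'L', 'F'};
    /* 64-bit Mach-O magic 0xfeedfacf as stored by a little-endian machine. */
    static const unsigned char macho64[4] = {0xcf, 0xfa, 0xed, 0xfe};

    if (buf == NULL || len < 4)
        return "unknown";
    if (memcmp(buf, elf, sizeof elf) == 0)
        return "ELF";
    if (memcmp(buf, macho64, sizeof macho64) == 0)
        return "Mach-O 64-bit";
    return "unknown";
}
```

Feeding it the first four bytes of each main.o (read with fread, for example) should separate the Debian object file from the MacOS one.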

Analysing this file is a bit more complex. One way could be dumping its hex representation with hexdump -c main.o. Part of the result is reported in the image below:

Screenshot 2022-03-22 at 10.20.01.png

Observe that, apart from some strings (like the highlighted ones), the content is not human readable. Here the difference between ELF and Mach-O files is a bit more evident. Consider the last lines produced by hexdump on a MacOS system:

00004a0  \0  \0  \0  \0  \0  \0  \0  \0  \0   _   m   a   i   n  \0   _
00004b0   p   r   i   n   t   f  \0   _   m   y   S   u   b   r   o   u
00004c0   t   i   n   e  \0   _   t   i   m   e  \0   _   s   r   a   n
00004d0   d  \0   _   r   a   n   d  \0

These lines contain a null-terminated list of the names of the symbols that the object file defines or references. The file size is also different, but once again this should be expected: we are dealing with two different object file formats.
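
The tail of the dump above is simply a sequence of names separated by null bytes, so it can be walked with a few lines of C. This is a minimal sketch that treats the bytes as an opaque buffer rather than parsing the Mach-O structures that normally point into it; collect_names is a hypothetical helper.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: stores a pointer to each null-terminated name found
 * in table[0..len) into out (at most max_out entries) and returns how many
 * names were stored. Runs of null bytes are skipped. */
size_t collect_names(const char *table, size_t len,
                     const char **out, size_t max_out) {
    size_t count = 0;
    size_t i = 0;

    while (i < len && count < max_out) {
        if (table[i] == '\0') {        /* padding or separator */
            i++;
            continue;
        }
        const char *end = memchr(table + i, '\0', len - i);
        if (end == NULL)               /* unterminated tail: stop here */
            break;
        out[count++] = table + i;
        i = (size_t)(end - table) + 1; /* jump past the terminator */
    }
    return count;
}
```

Applied to the bytes shown in the dump, the function finds six names, from _main to _rand.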

We can now produce an executable file.

Linking

The result of the previous phases is usually a collection of object files. During the linking phase, they are all combined into a single executable file. Library code may be copied into the executable (static linking) or left in shared libraries that are resolved when the program is loaded (dynamic linking). This topic is outside the scope of this article; more information can be found, for instance, in BUFFER OVERFLOW 4.

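The linked executable can then be obtained as follows (invoked on main.c, the compiler driver runs all the previous phases and finishes with linking):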

gbiondo@tripleX BA % clang main.c -o main
gbiondo@tripleX BA % ls -al
total 128
drwxr-xr-x   6 gbiondo  staff    192 22 Mar 10:37 .
drwxr-xr-x  33 gbiondo  staff   1056 22 Mar 09:12 ..
-rwxr-xr-x   1 gbiondo  staff  49600 22 Mar 10:37 main
-rw-r--r--   1 gbiondo  staff    508 21 Mar 14:34 main.c
-rw-r--r--   1 gbiondo  staff   1240 22 Mar 10:13 main.o
-rw-r--r--   1 gbiondo  staff   1892 22 Mar 10:01 main.s
gbiondo@tripleX BA % file main 
main: Mach-O 64-bit executable x86_64

On the Debian machine we have:

DebianShellcode% gcc main.c -o main   
DebianShellcode% file main
main: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[sha1]=a88cde6d224edbea15116fd89ea89ed50b32703e, not stripped

Note: the Linux file command gives more detailed output.
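
Part of that extra detail comes straight from the ELF header: the word relocatable for main.o and the executable classification for main are derived from the e_type field, stored in the two bytes that follow the 16-byte identification block. The sketch below decodes that field; elf_type_name is a hypothetical helper, and the real file command consults more than this single field (for instance, to decide that a file is a PIE executable rather than a shared library).

```c
#include <stddef.h>

/* Hypothetical helper: names the e_type of an ELF header held in hdr. */
const char *elf_type_name(const unsigned char *hdr, size_t len) {
    unsigned type;

    if (hdr == NULL || len < 18 ||
        hdr[0] != 0x7f || hdr[1] != 'E' || hdr[2] != 'L' || hdr[3] != 'F')
        return "not ELF";

    /* e_ident[5] (EI_DATA) gives the byte order: 1 = LSB, 2 = MSB. */
    if (hdr[5] == 1)
        type = (unsigned)hdr[16] | ((unsigned)hdr[17] << 8);
    else if (hdr[5] == 2)
        type = ((unsigned)hdr[16] << 8) | (unsigned)hdr[17];
    else
        return "unknown byte order";

    switch (type) {
    case 1:  return "relocatable";            /* ET_REL: an object file  */
    case 2:  return "executable";             /* ET_EXEC: fixed address  */
    case 3:  return "shared object or PIE";   /* ET_DYN                  */
    case 4:  return "core";                   /* ET_CORE                 */
    default: return "other";
    }
}
```

Given the first 18 bytes of the Debian main.o, the function should report relocatable, while the linked main should fall into the ET_DYN case, in line with the pie executable wording above.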

Conclusions

This article is just a foundation for future developments. It may seem counterintuitive, but one cannot reverse a process (in this case, the creation of a binary) without knowing the process itself. I didn't write anything new here, but this is a vast and fascinating field, and I hope I gave you another view of one of the most basic processes in software development.