09 August 2020/ miscellaneous

building and running c++ programs

Given a problem to solve, we software developers ultimately have to write code. We can write it in any text editor of our choice, but in the end code is just text. We can understand this source code, but for the computer to execute it, the source code has to be transformed into machine code. The software that performs this transformation is called a compiler, and the transformation process is called compilation. Compiling a C++ program is a sequence of complex tasks that results in machine code generation.

Typically, a C++ compiler parses and analyzes the source code, generates intermediate code, optimizes it, and finally generates machine code in a file called an object file. Object files have the extension .o on Linux/Unix and .obj on Windows. A project usually consists of several source files, and compiling each source file results in a separate object file. These object files are then linked together by another program, called the linker, to form a single executable file that can be run on a computer. Object files usually contain some additional information alongside the machine code; the linker uses this metadata to generate the executable.

The above diagram is a pictorial representation of the program-building phase in C++.

The C++ application-building process consists of three major steps: pre-processing, compiling, and linking. All of these steps are done using different tools, but modern compilers encapsulate them in a single tool, thereby providing a single and more straightforward interface for programmers.

The generated executable file persists on the hard drive of the computer. To run it, its contents have to be copied into main memory (RAM). The copying is done by another tool, named the loader. The loader is a part of the operating system that knows what should be copied from the executable file and where it should go. Loading the executable file into main memory does not delete the original file from the hard drive.

The loading and running of a program is done by the operating system (OS). The OS manages the execution of the program, schedules it alongside other programs, unloads it when it’s done, and so on. The running copy of the program is called a process. A process is a running instance of an executable file.

Pre-Processing

A pre-processor processes source files to make them ready for compilation. It works with pre-processor directives (#include, #define, and so on). Directives don’t represent program statements; they are commands for the pre-processor, telling it what to do with the text of the source file. The compiler does not understand or recognize these directives, so whenever we use pre-processor directives in our code, the pre-processor resolves them before the actual compilation begins. For example, the following code will be changed before the compiler starts to compile it:

#define NUMBER 41

int main(){
    int a = NUMBER + 1;
    return 0;
}

Everything that is defined using the #define directive is called a macro. After pre-processing, the compiler gets the transformed source code in the form:

int main(){
    int a = 41 + 1;
    return 0;
}

Even though this pre-processor directive seems kind of cool, it should be used with proper care. We should always keep one thing in mind: the pre-processor just substitutes text without recognizing language rules or grammar. The following example is valid according to the pre-processor even though logically it is not correct:

#define NUMBER 41

struct T{};
T t = NUMBER; // this will be successfully pre-processed, but it will result in a compile error.

Here’s another example where the macro is syntactically correct, but will result in logical errors.

#include <iostream>

// The macro multiplies its argument by itself, so it squares the value.
#define SQUARE_IT(arg) (arg * arg)
int main(){
    int st = SQUARE_IT(4);
    std::cout << st; // 16 which is correct

    int bad_result = SQUARE_IT(4 + 1);
    std::cout << bad_result; // 9, the result of 4 + 1 * 4 + 1, not the expected 25
}
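Two common ways around this pitfall are to parenthesize every use of the macro argument, or to replace the macro with a constexpr function. The following is a minimal sketch; the names SQUARE_UNSAFE, SQUARE_SAFE, and square are made up for this illustration:

```cpp
// Same body as the macro in the text: the argument is pasted in as raw text.
#define SQUARE_UNSAFE(arg) (arg * arg)

// Parenthesizing each use of the argument keeps it together as one operand.
#define SQUARE_SAFE(arg) ((arg) * (arg))

// A constexpr function evaluates its argument once and follows the normal
// language rules (types, scope, operator precedence).
constexpr int square(int value) { return value * value; }

// SQUARE_UNSAFE(4 + 1) expands to (4 + 1 * 4 + 1), which is 9.
static_assert(SQUARE_UNSAFE(4 + 1) == 9, "textual substitution breaks precedence");
// SQUARE_SAFE(4 + 1) expands to ((4 + 1) * (4 + 1)), which is 25.
static_assert(SQUARE_SAFE(4 + 1) == 25, "parenthesized macro gives the square");
static_assert(square(4 + 1) == 25, "function call gives the square");
```

Note that even the parenthesized macro still pastes its argument twice, so an argument with side effects (for example i++) would be evaluated twice; the function does not have this problem.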

Compilation

As depicted in the above diagram, compilation in turn consists of multiple intermediate steps. Let’s discuss each of them in detail.

Tokenization

The analysis phase of the compiler starts by splitting the source code into small units called tokens. A token may be a word or just a single symbol, such as = (the equals sign). A token is the smallest unit of the source code that carries meaningful value for the compiler. For example, the expression int a = 42; will be divided into the tokens int, a, =, 42, and ;. The expression isn’t just split by spaces, because the following expression is split into the same tokens (though it is advisable not to forget the spaces between operands):

int a=42;

The rules for splitting the source code into tokens are usually described with regular expressions. The process is known as lexical analysis, or tokenization (dividing into tokens). For compilers, tokenized input is a better starting point for constructing the internal data structures used to analyze the syntax of the code. Let’s see how.
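Before moving on, here is a deliberately simplified tokenizer sketch. It is an assumption-laden illustration, not how g++ or any real compiler does it: the function name tokenize is made up, it knows nothing about multi-character operators, string literals, or comments, and it scans characters by hand instead of using regular expressions.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Splits source text into tokens: runs of letters, digits and '_' form one
// token, whitespace is skipped, and any other character is a token by itself.
std::vector<std::string> tokenize(const std::string& source) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < source.size()) {
        const unsigned char c = static_cast<unsigned char>(source[i]);
        if (std::isspace(c)) {
            ++i;  // whitespace only separates tokens
        } else if (std::isalnum(c) || c == '_') {
            const std::size_t start = i;
            while (i < source.size() &&
                   (std::isalnum(static_cast<unsigned char>(source[i])) ||
                    source[i] == '_')) {
                ++i;
            }
            tokens.push_back(source.substr(start, i - start));
        } else {
            tokens.push_back(std::string(1, source[i]));  // e.g. '=' or ';'
            ++i;
        }
    }
    return tokens;
}
```

With this sketch, tokenize("int a = 42;") and tokenize("int a=42;") both yield the five tokens int, a, =, 42, and ;.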

Syntax analysis

When speaking about programming language compilation, we usually differentiate two terms: syntax and semantics. Syntax is the structure of the code; it defines the rules by which tokens are combined so that they make structural sense. For example, day nice is made of two valid English words (tokens), yet the phrase breaks the word-order rules of English and should be corrected to nice day. Semantics, on the other hand, concerns the actual meaning of the code: a phrase can follow every structural rule and still make no sense.

Syntax analysis is a crucial part of source analysis: it checks whether the sequence of tokens conforms to the grammar rules of the language. Take the following, for example:

int b = a + 0;

It may not make sense to us, because adding zero to the variable won’t change its value, but the compiler does not look at the logical meaning here; it checks the syntactic correctness of the code (a missing semicolon, a missing closing parenthesis, and more). Checking the syntactic correctness of the code is done in the syntax analysis phase of compilation. Lexical analysis divides the code into tokens; syntax analysis checks for syntactic correctness, which means that the aforementioned expression will produce a syntax error if we leave out the semicolon:

int b = a + 0

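The compiler will complain with an error such as expected ’;’ at end of declaration (this is Clang’s wording, which is also what g++ prints on systems where g++ is an alias for Clang; other compilers phrase the same diagnostic differently).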

Semantic analysis

If the previous expression was something like it b = a + 0;, the compiler would divide it into the tokens it, b, =, and others. We can already see that it is something unknown, but for the tokenizer, it is fine at this point. Later in the analysis, this leads to a compilation error such as unknown type name “it” (again Clang’s wording; GCC reports the same problem with a different message). Finding the meaning behind expressions is the task of semantic analysis.
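The difference between the two kinds of checks can be sketched in a few lines. This is only a toy under strong assumptions: the names Diagnosis and check_declaration are invented here, the input is assumed to be an already tokenized declaration of the shape type name = expression ;, and the only semantic rule is a lookup in a small hard-coded set of type names. Real compilers use grammar-driven parsers and symbol tables instead.

```cpp
#include <set>
#include <string>
#include <vector>

enum class Diagnosis { Ok, SyntaxError, SemanticError };

// Checks a token sequence assumed to have the shape: <type> <name> = ... ;
Diagnosis check_declaration(const std::vector<std::string>& tokens) {
    // Syntax: only the arrangement of the tokens matters here.
    if (tokens.size() < 5 || tokens[2] != "=" || tokens.back() != ";") {
        return Diagnosis::SyntaxError;
    }
    // Semantics: the first token has to name a type that is known.
    static const std::set<std::string> known_types = {"int", "double", "char", "bool"};
    if (known_types.count(tokens[0]) == 0) {
        return Diagnosis::SemanticError;
    }
    return Diagnosis::Ok;
}
```

In this sketch, the tokens of int b = a + 0 (without the semicolon) are rejected by the first check, while the tokens of it b = a + 0; pass the structural check and are rejected only when the type name is looked up.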

Intermediate Code Generation

After all the analysis is completed, the compiler generates intermediate code, which can be thought of as a lighter version of C++ that is mostly C. A simple example would be the following:

class A { 
public:
  int get_member() { return mem_; }
private: 
  int mem_; 
};

After analyzing the code, intermediate code will be generated (this is an abstract example meant to show the idea of the intermediate code generation; compilers may differ in implementation):

struct A { 
  int mem_; 
};

int A_get_member(A* this) { return this->mem_; }
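The abstract example above uses this as a parameter name, which real C++ does not allow. The same idea can be written as compilable C++ by hand. In the sketch below, A_lowered, A_lowered_get_member, and same_result are hypothetical names chosen for the illustration, and the constructor is added only so that the class can be given a value:

```cpp
// The class as a programmer would write it.
class A {
public:
    constexpr explicit A(int mem) : mem_(mem) {}
    constexpr int get_member() const { return mem_; }
private:
    int mem_;
};

// A hand-written "lowered" equivalent: plain data plus a free function
// that receives the object explicitly (the role played by `this`).
struct A_lowered {
    int mem_;
};

constexpr int A_lowered_get_member(const A_lowered* self) { return self->mem_; }

// Runs both forms with the same value and compares the results.
constexpr bool same_result(int value) {
    const A original(value);
    const A_lowered lowered{value};
    return original.get_member() == A_lowered_get_member(&lowered);
}

static_assert(same_result(42), "member call and lowered call agree");
```

This only mirrors the idea from the text; it says nothing about the intermediate representation a particular compiler actually uses.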

Optimization

Generating intermediate code helps the compiler to optimize the code. Compilers optimize aggressively, and the optimizations are done in more than one pass. For example, take the following code:

int a = 41; 
int b = a + 1;

During compilation, this will first be optimized into the following:

int a = 41; 
int b = 41 + 1;

This again will be optimized into the following:

int a = 41; 
int b = 42;
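One way to observe a compile-time evaluation like this from within the language is constexpr. This is a related but distinct mechanism: constexpr requires the compiler to compute the value, whereas the folding described above is an optimization the compiler is free to apply or skip. The function name runtime_sum is made up for this sketch:

```cpp
constexpr int a = 41;
constexpr int b = a + 1;  // has to be computed by the compiler itself

// static_assert is checked during compilation, so this line only compiles
// because the compiler has already worked out that b equals 42.
static_assert(b == 42, "a + 1 is evaluated at compile time");

// An ordinary version of the same code. Whether the addition is folded
// into the constant 42 here is up to the optimizer and its settings.
inline int runtime_sum() {
    int x = 41;
    int y = x + 1;
    return y;
}
```

Either way the observable result is the same; optimizations change how the value is produced, not what it is.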

Machine code Generation

Compiler optimizations are done on both the intermediate code and the generated machine code. So what does it look like when we compile a project? Assume a simple project structure containing several source files, including two headers, rect.h and square.h, each with its .cpp file, and main.cpp, which contains the program entry point (the main() function). After pre-processing, the following units are left as input for the compiler: main.cpp, rect.cpp, and square.cpp, as depicted in the following diagram:

The compiler will compile each unit separately. Compilation units (the pre-processed source files) are independent of each other. When the compiler compiles main.cpp, which has a call to the get_area() function of Rect, it does not include the get_area() implementation in main.cpp. Instead, it simply assumes that the function is implemented somewhere in the project. When the compiler gets to rect.cpp, it does not know that the get_area() function is used somewhere.

Here’s what the compiler gets after main.cpp passes the pre-processing phase:

// contents of the iostream 
struct Rect {
private:
  double side1_;
  double side2_;
public:
  Rect(double s1, double s2);
  const double get_area() const;
};

struct Square : Rect {
  Square(double s);
};

int main() {
  Rect r(3.1, 4.05);
  std::cout << r.get_area() << std::endl;
  return 0;
}

After analyzing main.cpp, the compiler generates the following intermediate code (many details are omitted to simply express the idea behind compilation):

struct Rect { 
  double side1_; 
  double side2_; 
};
void _Rect_init_(Rect* this, double s1, double s2); 
double _Rect_get_area_(Rect* this); 

struct Square { 
  Rect _subobject_; 
};
void _Square_init_(Square* this, double s); 

int main() {
  Rect r;
  _Rect_init_(&r, 3.1, 4.05); 
  printf("%f\n", _Rect_get_area_(&r)); 
  // we've intentionally replaced cout with printf for brevity,
  // supposing the compiler generates C intermediate code
  return 0;
}

The compiler will remove the Square struct with its constructor function (we named it _Square_init_) while optimizing the code because it was never used in the source code.

At this point, the compiler operates with main.cpp only, so it sees that we called the _Rect_init_ and _Rect_get_area_ functions but did not provide their implementation in the same file. However, as we did provide their declarations beforehand, the compiler trusts us and assumes that those functions are implemented in other compilation units. Based on this trust and the minimum information regarding the function signature (its return type, name, and the number and types of its parameters), the compiler generates an object file that contains the working code in main.cpp and marks the functions that have no implementation yet but are trusted to be resolved later. The resolving is done by the linker.
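The same trust can be seen in ordinary C++: a declaration alone is enough for a call to compile. The sketch below uses the hypothetical free functions rect_area and area_for_report, and keeps both halves in one file only so that it links on its own; in a real project the definition at the bottom would live in a separate file such as rect.cpp.

```cpp
// --- what main.cpp sees: a declaration only ---
// The signature (return type, name, parameter types) is all the compiler
// needs to type-check a call and emit it into the object file.
double rect_area(double side1, double side2);

double area_for_report() {
    // This compiles although no body of rect_area has been seen yet;
    // the call is recorded as a symbol for the linker to resolve.
    return rect_area(3.1, 4.05);
}

// --- what would normally live in another file ---
// It is placed here only so that this sketch links as a single file.
double rect_area(double side1, double side2) { return side1 * side2; }
```

If the definition were missing from every object file, compilation would still succeed and the failure would surface at link time as an unresolved symbol.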

In the following example, we have the simplified variant of the generated object file, which contains two sections—code and information. The code section has addresses for each instruction (the hexadecimal values):

code: 
  0x00 main
  0x01 Rect r; 
  0x02 _Rect_init_(&r, 3.1, 4.05); 
  0x03 printf("%f\n", _Rect_get_area_(&r)); 
information:
  main: 0x00
  _Rect_init_: ????
  printf: ????
  _Rect_get_area_: ????

Take a look at the information section. The compiler marks all the functions used in the code section that were not found in the same compilation unit with ????. These question marks will be replaced by the actual addresses of the functions found in other units by the linker. Finishing with main.cpp, the compiler starts to compile the rect.cpp file:

// file: rect.cpp 
struct Rect {
  // #include "rect.h" replaced with the contents  
  // of the rect.h file in the preprocessing phase 
  // code omitted for brevity 
};
Rect::Rect(double s1, double s2) 
  : side1_(s1), side2_(s2)
{}
const double Rect::get_area() const { 
  return side1_ * side2_;
}

Following the same logic here, the compilation of this unit produces the following output (don’t forget, I’m still providing abstract examples):

code:  
  0x00 _Rect_init_ 
  0x01 side1_ = s1 
  0x02 side2_ = s2 
  0x03 return 
  0x04 _Rect_get_area_ 
  0x05 register = side1_ 
  0x06 reg_multiply side2_ 
  0x07 return 
information: 
  _Rect_init_: 0x00
  _Rect_get_area_: 0x04

This output has all the addresses of the functions in it, so there is no need to wait for some functions to be resolved later.

Linking

The compiler outputs an object file for each compilation unit. In the previous example, we had three .cpp files and the compiler produced three object files. The task of the linker is to combine these object files into a single executable file. Combining files results in relative address changes; for example, if the linker puts the rect.o file after main.o, the starting address of rect.o becomes 0x04 instead of the previous value of 0x00:

code: 
  0x00 main
  0x01 Rect r; 
  0x02 _Rect_init_(&r, 3.1, 4.05); 
  0x03 printf("%f\n", _Rect_get_area_(&r)); 
  0x04 _Rect_init_ 
  0x05 side1_ = s1 
  0x06 side2_ = s2 
  0x07 return 
  0x08 _Rect_get_area_ 
  0x09 register = side1_ 
  0x0A reg_multiply side2_ 
  0x0B return 
information (symbol table):
  main: 0x00
  _Rect_init_: 0x04
  printf: ????
  _Rect_get_area_: 0x08 
  _Rect_init_: 0x04
  _Rect_get_area_: 0x08

The linker correspondingly updates the symbol table addresses (the information: section in our example). As mentioned previously, each object file has its symbol table, which maps the string name of the symbol to its relative location (address) in the file. The next step of linking is to resolve all the unresolved symbols in the object file.

Now that the linker has combined main.o and rect.o, it knows the relative location of the previously unresolved symbols because they are now located in the same file. The printf symbol will be resolved the same way, except this time the linker will link the object files with the standard library. After all the object files are combined (we omitted the linking of square.o for brevity), all addresses are updated, and all the symbols are resolved, the linker outputs one final file, the executable, that can be run by the operating system. As mentioned earlier, the OS uses a tool called the loader to load the contents of the executable file into memory.
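The address bookkeeping described in this section can be mimicked with a small simulation. It is a toy model built on the abstract object files from the text, not a description of how a real linker or object file format works; ObjectFile, LinkResult, and link_objects are names invented for the sketch.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// A toy object file: its code size (in "instructions"), the symbols it
// defines with addresses relative to its own start, and the symbols it
// uses without defining them.
struct ObjectFile {
    std::size_t code_size = 0;
    std::map<std::string, std::size_t> defined;
    std::vector<std::string> referenced;
};

struct LinkResult {
    std::map<std::string, std::size_t> symbols;  // symbol -> final address
    std::vector<std::string> unresolved;         // still missing after linking
};

LinkResult link_objects(const std::vector<ObjectFile>& objects) {
    LinkResult result;
    std::size_t base = 0;  // where the current object file starts in the output
    for (const ObjectFile& object : objects) {
        for (const auto& entry : object.defined) {
            result.symbols[entry.first] = base + entry.second;  // shift by base
        }
        base += object.code_size;
    }
    // A referenced symbol is resolved if some object file defined it.
    for (const ObjectFile& object : objects) {
        for (const std::string& name : object.referenced) {
            if (result.symbols.count(name) == 0) {
                result.unresolved.push_back(name);
            }
        }
    }
    return result;
}
```

Feeding it a four-instruction main.o followed by an eight-instruction rect.o gives _Rect_init_ the address 0x04 and _Rect_get_area_ the address 0x08, and leaves printf unresolved until a library that defines it is added to the list.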

Linking Libraries

A library is similar to an executable file, with one major difference: it does not have a main() function, which means that it cannot be invoked as a regular program. Libraries are used to package code that might be reused by more than one program. We have already linked our programs with the standard library, for example when we included one of its headers to use std::cout.

Libraries can be linked with the executable file either as static or dynamic libraries. When we link them as a static library, they become a part of the final executable file. A dynamically linked library has to be loaded into memory by the OS as well, so that our program can call its functions. Let’s suppose we want to find the square root of a number:

int main() {
  double result = sqrt(49.0);
}

The C++ standard library provides the sqrt() function, which returns the square root of its argument. If you compile the preceding example, it will produce an error insisting that the sqrt function has not been declared. We know that to use the standard library function, we should include the corresponding header. But the header file does not contain the implementation of the function; it just declares the function (in the std namespace), which is then included in our source file:

#include <cmath>
int main() {
  double result = std::sqrt(49.0);
}

The compiler marks the address of the sqrt symbol as unknown, and the linker should resolve it in the linking stage. The linker will fail to resolve it if the program is not linked with the standard library implementation (the object file containing the library functions).

The final executable file generated by the linker will consist of both our program and the standard library if the linking was static. On the other hand, if the linking is dynamic, the linker marks the sqrt symbol to be found at runtime.

Now when we run the program, the loader also loads the library that was dynamically linked to our program. It loads the contents of the standard library into the memory as well and then resolves the actual location of the sqrt() function in memory. The same library that is already loaded into the memory can be used by other programs as well.
