The most important part of a good computer is good software. Without good software that is specifically optimized for its hardware, the full computational power cannot be utilized. In this book, I will show you how to build a supercomputer cluster that gains its high-speed computational power by distributing certain tasks to other nodes over the network. For this purpose, special software is required and has to be compiled from source code. How this works and what nodes are will be explained in Chapter 2, Building a Beowulf Cluster, and Chapter 3, Operating System Setup and Configuration.
The open source philosophy
Although many helpful software packages already exist, it is very important to understand that Linux is an open source operating system written for an open source community. Windows users are often frustrated when they first come into contact with open source software, because they are used to software that is already compiled and easy to install. A major disadvantage of such precompiled software is that the packages are built for a standard platform and might not be optimized for the specific computer they are installed on. Another problem is that if software components of the operating system are updated while the user software still requires older versions, instabilities might arise, or changed interfaces might break the software altogether.
Software modularity and dependencies
Linux is a highly modular operating system. The whole system is built on the philosophy of open software, which means that every part of the operating system can be compiled from freely available source code. This source code is written in standard programming languages such as C, C++, FORTRAN, Assembler, and others, and is compiled in order to build binary code specifically optimized for certain hardware. The technique by which software is built does not differ much between Windows and Linux operating systems. For the beginner, however, it can be hard to produce a working program from source code, because there are usually many software dependencies, such as libraries or other programs that the code builds upon and that may be missing from the system. It can be hard to find all the required libraries, especially when only newer versions are available whose interfaces, such as function definitions, have changed. In addition, one dependency can quickly lead to several others, so that the search for all the required libraries keeps growing and takes a lot of time.
Also, on certain hardware, some well-established compiler parameters do not work and have to be modified, or bug fixes have to be found. This can turn the simple task of "just compiling software" into a seemingly unsolvable problem for beginners. Thanks to the growing community of hobby programmers and Linux enthusiasts, there are a lot of online forums that can be searched for such problems. Often a solution has already been posted, and if not, there may at least be hints that point us in the right direction.
The following sections will explain the basics of creating software on Linux operating systems with standard programming environments. They are written for hobby enthusiasts who might or might not have already written their own software. Although existing knowledge is very helpful, it is not required in order to understand the following explanations.
The source code and programming languages
Each computer program consists of binary code, that is, a sequence of values that can each take one of two states, usually written as zero and one. A single such value is called a bit. Four of these bits make up a so-called nibble, and eight make up a byte. Several bytes can be grouped into a word, a double word, or a quad word. The following table gives you a small summary of the most important data sizes:
A central processor does nothing but interpret bit sequences as commands. These commands are called instructions and tell the CPU what to do. This is the lowest level of programming, the so-called machine language. Machine language is, except to a few very dedicated people, not human-readable. For example, it is not obvious that the binary code 1011 0100 0100 1100 1100 1101 0010 0001 terminates an MS-DOS program.
Low-level programming means programming directly in machine language. Of course, this has to happen in a human-readable way. One way to simplify 1011 0100 0100 1100 is to use another number system, such as the hexadecimal system, resulting in 0xB4 0x4C. This is better, but it is still not easy for humans to read. The final simplification is the invention of so-called mnemonics. For example, on Intel x86 platforms, 0xB4 0x4C means mov ah, 0x4C in this mnemonic language. Now, one can understand that this code sets the CPU register named ah to the value 0x4C. This language, as well as the software that translates it back into bits, is called Assembler.
Assembler has its advantages and disadvantages. One big advantage is that the resulting software does exactly what you programmed. This means that there is no optimization that modifies your code, and you can program very efficiently in terms of size and speed. One big disadvantage, however, is that each CPU has its own instruction set. This means that our Sitara CPU will not understand the preceding example, because it has no ah register. For any real problem we want to solve using computers, we are primarily interested in the nature of the problem and not in the nature of the CPU used in the computer. To make programming independent of the computer platform in use, there are so-called high-level programming languages.
High-level programming languages consist of keywords, syntax, and grammar, as with every spoken language. The keywords define the vocabulary that can be used, whereas syntax and grammar define how and in which order these keywords are used. To understand this, we should have a look at how a simple loop looks in a low-level language compared to a high-level language.
The following example shows us that a simple loop already needs four different instructions in Assembler, while it can be realized by a relatively simple for keyword in the high-level language C++:
A low-level language compared to a high-level language
While C++ is a general-purpose high-level programming language, there are also languages that are specialized for particular kinds of problems. One example is FORTRAN, which is mainly used for mathematical problems due to its ability to define matrices and other mathematical structures very easily.
Once the required code has been written and is ready to be translated from its human-readable form into machine language, there is a certain sequence of tools that have to be used. This toolset is called the compiler toolchain:
- Firstly, the code is processed by the compiler itself. The compiler translates the high-level language into a low-level language, mostly Assembler.
- The Assembler code is then translated into object files. Usually, these two steps are performed internally by a single compiler program.
- The object files now contain binary machine code, but they cannot be run as they are. The OS must know a few things in order to execute programs correctly. It must be told where in the main memory the program has to be loaded, how much memory it uses, which libraries it needs, and so on. To fulfill these requirements, we need to link the program.
- The so-called linker combines one or more object files and adds information that is specific to the OS in use.
- The final program then consists of a special executable format generated by the linker and incorporating the object files produced by the compiler. In Linux, this executable format is called Executable and Linking Format (ELF).
The whole process is depicted in the following diagram:
Another important feature of the linker is its capability to embed the required libraries into the executable file. This is called static linking. The program can then be used on other computers that do not have that specific library installed. The opposite of static linking is dynamic linking. In this case, only a stub of the library is linked into the program, which tells the OS which library to provide. Dynamically linked programs are smaller in size, but they need their libraries to be installed on every system they run on.
In all examples, the main focus will be on C++. Some of the modules are only available as FORTRAN code; however, once compiled, their functions can also be accessed from C++ programs.