Compilers and Self-hosting

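It can seem paradoxical that a compiler or assembler, which is itself a program, has to exist before programs can be built with it. In practice, the creation of an assembler, or of any software development tool, follows a logical sequence in which each tool is built with the tools that came before it. Here's a breakdown of how this works: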

Historical Context: Bootstrap Compilers and Assemblers

  1. Manual Encoding:

    • In the early days of computing, the first programs, including assemblers and compilers, were written in machine code by hand. Programmers would write the binary instructions directly, or they would use switches on the front panel of the computer to enter machine code instructions.
  2. First Assembler:

    • Once the first assembler was created using machine code, it simplified the process of writing further software. This assembler could then translate assembly language into machine code, which made programming easier.
  3. Higher-Level Languages:

    • With an assembler available, programmers could write more sophisticated programs, including those that would be used to develop higher-level languages like C.
    • The first compilers for these higher-level languages were often written in assembly language. Once a compiler existed, it could compile higher-level language code into machine code.

Modern Development: Bootstrapping

  1. Writing Assemblers in Higher-Level Languages:

    • Modern assemblers are often written in higher-level languages like C for ease of development, maintainability, and portability.
    • A C compiler is itself a complex piece of software, but the same principle applies: it is written in an existing language and built with an existing compiler.
  2. Bootstrapping Process:

    • Initial Compiler/Assembler: The initial compiler or assembler for a new language can be written in assembly or an already established higher-level language.
    • Cross-Compilation: A new compiler or assembler is often developed on an existing system and built with an existing compiler. When the resulting tool runs on one system but generates code for a different one, it is called a cross-compiler. This is a common way to bring up tools for a machine that has none yet.
  3. Self-Hosting:

    • Once a compiler for a higher-level language like C is written, it can be compiled using an existing compiler.
    • After it’s compiled and functional, subsequent versions of the compiler can be written in its own language. For example, a C compiler can be written in C. This is known as self-hosting.

More on self-hosting later.

Example: GCC Compiler

  • Initial Development: GCC, originally the GNU C Compiler and now the GNU Compiler Collection, was initially written in C (its implementation language has since moved to C++).
  • Bootstrapping: The first version of GCC was compiled using an existing C compiler. Once GCC was working, subsequent versions of GCC could be compiled using GCC itself.
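The bootstrapping step above can be sketched as a toy model. GCC's own build offers a multi-stage bootstrap along these lines, in which the compiler is rebuilt with itself and the later stages are compared. The Python sketch below is only a model under stated assumptions: Binary, compile_compiler, and the name vendor-cc are hypothetical, and a real compiler binary is far more than two labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Binary:
    """A compiler executable in this toy model (hypothetical type)."""
    implements: str  # which compiler source it was built from
    built_by: str    # which compiler's code generator produced it


def compile_compiler(compiler: Binary, source: str) -> Binary:
    """Build `source` with `compiler`.

    The result behaves like `source`, but its machine code was
    generated by whatever `compiler` implements.
    """
    return Binary(implements=source, built_by=compiler.implements)


# Assumption: "vendor-cc" stands in for any pre-existing C compiler.
vendor_cc = Binary(implements="vendor-cc", built_by="unknown")

stage1 = compile_compiler(vendor_cc, "gcc")  # GCC built by the existing compiler
stage2 = compile_compiler(stage1, "gcc")     # GCC built by GCC
stage3 = compile_compiler(stage2, "gcc")     # rebuilt once more as a check

print(stage1)
print(stage2)
print("stage2 == stage3:", stage2 == stage3)
```

In this model, stage 1 differs from stage 2 because a different code generator produced it, while stage 2 and stage 3 come out identical. That is why comparing the later stages is a useful sanity check: a mismatch would suggest the compiler miscompiles itself.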

Conceptual Process

  1. Manual Bootstrapping: The very first tools were written manually in machine code.
  2. First Assemblers/Compilers: These tools enabled the creation of slightly more complex tools.
  3. Iterative Improvement: With each step, new tools enabled the creation of even more sophisticated software.
  4. Self-Hosting Compilers: Eventually, languages could be used to write their own compilers, creating a virtuous cycle of improvement.

 

Self-Hosting in Simple Terms

  1. Initial Compiler Development:

    • When creating a new programming language like Amber, the first step is to write a compiler (or interpreter) for that language.
    • This initial compiler is typically written in an existing, well-established language like C, C++, or even another high-level language such as Python or Java.
    • Let's say we choose to write the first version of the Amber compiler in C.
  2. Bootstrap Compilation:

    • Once the initial version of the Amber compiler is written in C, you compile it using a C compiler that already exists on your computer. This produces an executable or binary file that is the Amber compiler itself.
  3. Using the Amber Compiler:

    • Now that you have the Amber compiler executable, you can use it to compile programs written in Amber into executables or other forms of output.
    • This compiler is capable of understanding and processing Amber language syntax and semantics because it was specifically designed to do so.
  4. Self-Hosting Aspect:

    • Continued Development: As you continue to develop the Amber language, you might want to introduce new features, improve performance, fix bugs, etc.
    • Writing in Amber: Future versions of the Amber compiler can be written in Amber itself. The existing Amber compiler is then used to compile the newer version of the Amber compiler.
    • Advantages: Writing the compiler in Amber allows you to fully leverage the capabilities and features of the Amber language. It also ensures that the compiler remains consistent with the language's evolving syntax and semantics.
  5. Iterative Improvement:

    • Each new version of the Amber compiler can be written in Amber, compiled by the previous version of the Amber compiler. This iterative process is what we refer to as self-hosting.

Why Self-Hosting?

  • Language Consistency: Ensures that the compiler and the language it compiles are closely aligned and can evolve together.
  • Leveraging Language Features: Allows you to use Amber's features to implement more advanced optimizations and features in the compiler itself.
  • Community Contribution: Makes it easier for others in the Amber community to contribute to the development of the compiler since they can work directly in Amber.

Practical Example

  • Initially, you might write Amber compiler v1.0 in C.
  • Using this v1.0 compiler, you compile programs written in Amber.
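  • When developing Amber v2.0, you write the compiler in Amber and compile it using v1.0. The v2.0 compiler's own source can therefore use only features that v1.0 already understands.
  • The resulting v2.0 compiler understands and can process the new features introduced in Amber v2.0, so later versions of the compiler can use those features in their own source.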
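The version chain above can be modeled in a few lines of Python. This is a sketch, not a real Amber toolchain: the Compiler and CompilerSource types, the build function, and the feature names ("functions", "structs", "generics") are all hypothetical and chosen only to show which compiler can build which source.

```python
from dataclasses import dataclass


class CompileError(Exception):
    """Raised when source uses a feature the compiler does not know."""


@dataclass(frozen=True)
class Compiler:
    version: str
    understands: frozenset  # language features this compiler accepts


@dataclass(frozen=True)
class CompilerSource:
    version: str
    written_with: frozenset  # features used in the compiler's own code
    implements: frozenset    # features the resulting compiler will accept


def build(compiler: Compiler, source: CompilerSource) -> Compiler:
    """Compile a compiler's source with an existing compiler."""
    missing = source.written_with - compiler.understands
    if missing:
        raise CompileError(
            f"{compiler.version} cannot build {source.version}: "
            f"unknown features {sorted(missing)}"
        )
    return Compiler(source.version, source.implements)


# Hypothetical feature sets, for illustration only.
CORE = frozenset({"functions", "structs"})

# v1.0 is written in C and built by a C compiler, so it is modeled directly.
v1 = Compiler("v1.0", CORE)

# v2.0 adds generics to the language but is written using v1.0 features only.
v2_src = CompilerSource("v2.0", written_with=CORE,
                        implements=CORE | {"generics"})
v2 = build(v1, v2_src)

# v2.1 uses generics in its own source, so it needs v2.0 to build it.
v2_1_src = CompilerSource("v2.1", written_with=CORE | {"generics"},
                          implements=CORE | {"generics"})
v2_1 = build(v2, v2_1_src)

try:
    build(v1, v2_1_src)  # v1.0 has never heard of generics
except CompileError as err:
    print(err)
```

The point of the model is the ordering constraint: a new feature first has to be implemented using the old language, and only the release after that can rely on it inside the compiler.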

In summary, self-hosting means writing a compiler for a programming language in the same language it is intended to compile. This approach allows for a more integrated and consistent development process as the language evolves over time. It’s a practical way to ensure that the compiler stays up-to-date with the latest language features and improvements.

Assembly

Assembly code is written in an assembly language that corresponds to a particular CPU architecture. Each type of CPU architecture (e.g., x86, ARM, MIPS) has its own instruction set and, with it, its own assembly language. Here’s a closer look at what this entails:

Assembly Language


Assembly language is a low-level programming language that uses mnemonic codes and symbols to represent machine-level instructions.
It is specific to a given CPU architecture, meaning that the assembly language for an x86 processor differs from the assembly language for an ARM processor.
 
Assembler

The assembler is the tool that translates the written assembly language code into machine code (binary code) that the CPU can execute.
Each assembler is designed to handle the syntax and instruction set of a specific assembly language.
Unlike a compiler for a higher-level language, an assembler performs a mostly one-to-one translation: each assembly instruction typically corresponds to a single machine instruction. The assembler's output is usually an object file containing machine code, which a linker then combines into an executable.
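To make the translation step concrete, here is a minimal sketch of an assembler in Python. The instruction set is invented for illustration: the mnemonics LOAD, ADD, and HALT and their opcode bytes match no real CPU. The sketch also emits raw bytes only and leaves out labels, symbols, and object-file formats.

```python
# Hypothetical toy instruction set: mnemonic -> (opcode byte, operand count).
# These encodings are invented for illustration and match no real CPU.
OPCODES = {
    "LOAD": (0x01, 1),  # LOAD n : put the constant n in the accumulator
    "ADD": (0x02, 1),   # ADD n  : add the constant n to the accumulator
    "HALT": (0xFF, 0),  # HALT   : stop execution
}


def assemble(source: str) -> bytes:
    """Translate toy assembly text into machine-code bytes."""
    output = bytearray()
    for line_no, line in enumerate(source.splitlines(), start=1):
        line = line.split(";", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        mnemonic, *operands = line.split()
        key = mnemonic.upper()
        if key not in OPCODES:
            raise ValueError(f"line {line_no}: unknown mnemonic {mnemonic!r}")
        opcode, arity = OPCODES[key]
        if len(operands) != arity:
            raise ValueError(
                f"line {line_no}: {key} takes {arity} operand(s)"
            )
        output.append(opcode)
        for operand in operands:
            value = int(operand, 0)  # accepts 10, 0x0A, 0b1010
            if not 0 <= value <= 0xFF:
                raise ValueError(
                    f"line {line_no}: operand {operand} out of range"
                )
            output.append(value)
    return bytes(output)


program = """
    LOAD 2      ; accumulator = 2
    ADD 0x28    ; accumulator = 2 + 40
    HALT
"""
machine_code = assemble(program)
print(machine_code.hex(" "))  # prints: 01 02 02 28 ff
```

Each source line maps to one opcode byte plus its operands, which is the one-to-one character of assembly. Supporting a different architecture would mean a different table and different encoding rules, which is why an assembler is tied to one instruction set.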