Code Generation #

As I said earlier, there is no way that you can write a proper compiler on your own, due to the amount of work involved.

But new languages are invented by individual people all the time. How is this possible? The answer is, they use a backend framework or library for this part.

This is the part where the Intermediate Representation (IR) comes it. It is determined by the library being used, which means the language itself can be abstracted away. Just convert any syntax you’re doing to that IR.

Fortunately, most of these libraries use similar concepts and representations, because they make the backends easy to implement and programs easy to optimize. In order to know what we’re looking at, let’s go through basic concepts first.

Basic Building Blocks: names, instructions, and blocks #

A common representation of a program initially is a control-flow graph. Each node is called a block, and represents a single “chunk” of instructions with one entry point and multiple exits.

For example, consider this psudeocode:

1
2
3
4
5
6


while n < 10
    x = read integer from file
    array[n] = x
    n = n + 1
end while
print array[0]

This would be broken into 3 blocks. Note, all the logic in the loop can be thought of as a single block, as the operations are consecutive without any conditional logic (as long as me assume “read integer from file” can’t fail)

graph LR A("Start") --> B[n < 10] C --> D(End) B -->|true| E subgraph Loop E["x = read integer from file;"] E --> F["array[n] = x;"] F --> G["n = n + 1;"] end G --> B B -->|false| C["print array[0]"] style A fill:#9F9 style D fill:#F99 style Loop color:#fff,fill:#333

You can perhaps see how this is useful to the compiler: when it wants to do things like eliminating dead code, rearranging conditionals, or inlining a function to its caller, it just has to move some arrows around independent of the contents.

As for the blocks themselves, what do they contain? Surprise surprise, each statement turns out to be something like a tree! The root is called an r-value, that is the result of an expression. Exactly which operations depend on the specific backend, but walking the tree is an easy way to build these.

Finally, we have functions. A function is basically a block with a name – usually a symbol by the time the compiler passes the code off to the linker. In particular, these names can be imported or exported.

For example, if we want our language to print something, we need to import the print function from something that can interact with the rest of the system. Simliarly, if we want to (spoiler alert) support subroutines at some point, that means we need to export the name of our subroutine to be called by other code.

Remember that by the time we get to this phase, we already have our program as a giant tree, with operations as parent nodes and integers as leaf nodes.

It should be pretty easy to make one big block that contains a single rvalue each time the user presses enter. So let’s do it!

Setting up libgccjit #

While LLVM is popular, we’ll use libgccgit. It is a library that ship with GCC to create an API for a just-in-time compiler, which is what we’re really building here.

The reason we’re picking it is not because it’s more popular – in fact, being GPL, a lot of projects don’t like it. However, it does have one good feature: a C API that it is possible to bind to Python. To achieve this, we will use an FFI translation tool called SWIG.

First, verify your installation of GCC, and libgccjit. There may be different packages depending on your distro. If you’re on Arch, the package libgccjit is available in the core repo. The docs have a hello world test example you can run to verify this.

Once it works, find the path to the header file that the example used. On my system running Arch, it’s /usr/include/libgccjit.h. But it likely depends on your distro.

SWIG is based on “interface files”, which describes some properties about the library being bound. There is no point going deep into the syntax, because the library “just works” – almost.

Just a few extra definitions are needed. Copy-paste this and trust me, just this once.

gcccjit.i

DOWNLOAD

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92


%module gccjit

%{
#define SWIG_FILE_WITH_INIT
#include <libgccjit.h>
%}

// This tells SWIG to treat char ** as a special case
%typemap(in) char ** {
  /* Check if is a list */
  if (PyList_Check($input)) {
    int size = PyList_Size($input);
    int i = 0;
    $1 = (char **) malloc((size+1)*sizeof(char *));
    for (i = 0; i < size; i++) {
      PyObject *o = PyList_GetItem($input, i);
      if (PyString_Check(o)) {
        $1[i] = PyString_AsString(PyList_GetItem($input, i));
      } else {
        free($1);
        PyErr_SetString(PyExc_TypeError, "list must contain strings");
        SWIG_fail;
      }
    }
    $1[i] = 0;
  } else {
    PyErr_SetString(PyExc_TypeError, "not a list");
    SWIG_fail;
  }
}

// This cleans up the char ** array we malloc'd before the function call
// Copied from the SWIG documentation, so beware the license!
%typemap(freearg) char ** {
  free((char *) $1);
}

// This tells SWIG to treat gcc_jit_param ** as a special case
%typemap(in) gcc_jit_param ** {
  /* Check if is a list */
  if (PyList_Check($input)) {
    int size = PyList_Size($input);
    int i = 0;
    $1 = (gcc_jit_param **) malloc((size+1)*sizeof(gcc_jit_param *));
    for (i = 0; i < size; i++) {
      if(SWIG_ConvertPtr(PyList_GetItem($input, i), (void **)&$1[i], $descriptor(gcc_jit_param *), 0) < 0) {
        PyErr_SetString(PyExc_TypeError, "in function '$symname', expecting type gcc_jit_param");
        SWIG_fail;
      }
    }
    $1[i] = 0;
  } else {
    PyErr_SetString(PyExc_TypeError, "not a list");
    SWIG_fail;
  }
}

// This cleans up the gcc_jit_param ** array we malloc'd before the function call
// Copied from the SWIG documentation, so beware the license!
%typemap(freearg) gcc_jit_param ** {
  free((gcc_jit_param *) $1);
}

// This tells SWIG to treat gcc_jit_rvalue ** as a special case
// Derived from the SWIG documentation, so beware the license!
%typemap(in) gcc_jit_rvalue ** {
  /* Check if is a list */
  if (PyList_Check($input)) {
    int size = PyList_Size($input);
    int i = 0;
    $1 = (gcc_jit_rvalue **) malloc((size+1)*sizeof(gcc_jit_rvalue *));
    for (i = 0; i < size; i++) {
      if(SWIG_ConvertPtr(PyList_GetItem($input, i), (void **)&$1[i], $descriptor(gcc_jit_rvalue *), 0) < 0) {
        PyErr_SetString(PyExc_TypeError, "in function '$symname', expecting type gcc_jit_rvalue");
        SWIG_fail;
      }
    }
    $1[i] = 0;
  } else {
    PyErr_SetString(PyExc_TypeError, "not a list");
    SWIG_fail;
  }
}

// This cleans up the gcc_jit_rvalue ** array we malloc'd before the function call
// Derived from the SWIG documentation, so beware the license!
%typemap(freearg) gcc_jit_rvalue ** {
  free((gcc_jit_rvalue *) $1);
}


%include <libgccjit.h>

Between the example and the interface file, it’s best to just create a GNU Makefile which will build the Python library:

Makefile

DOWNLOAD

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19


# This line may need changed depending on your distribution.
GCC_INCLUDE_DIR ?= /usr/include
GCC = gcc-10

all: _gccjit.so
.PHONY: all

clean:
	rm -f _gccjit.so gccjit_wrap.c example gccjit.py
.PHONY: clean

example: example.c
	$(GCC) -o $@ $< -lgccjit

_gccjit.so: gccjit_wrap.c
	$(GCC) -fPIC -shared -o $@ -I/usr/include/python3.8 -I$(GCC_INCLUDE_DIR) gccjit_wrap.c -lgccjit

gccjit_wrap.c gccjit.py: gccjit.i
	swig -python -outcurrentdir -I$(GCC_INCLUDE_DIR) gccjit.i

After running make, it should be possible to enter Python and type:

1

import gccjit

This should then load successfully!

Use of libgccjit #

In brief, here is the approach we should take according to the documentation. In response to each line being parsed:

Create a “context” for compilation.
Create a block which evaluates the tree of integer operations.
Give that block a name, and export it as a function.
Call that new function.
Put the result on the top of our stack.

Since we have relied on SWIG to be mostly automatic, it gives us a “raw” C API, so it’s not pretty. But, it’s functional, and hopefully easy to follow.

[TODO finish this section]