Parsing

Syntax analysis #

Our syntax analysis phase, or “parsing”, is basically a no-op in the current version. This is possible because, thanks to our stack having a “guard zero” at the bottom, there is really no such thing as a program with only valid tokens that is invalid syntax.

This makes it a very rare language indeed.

Since we are doing math in a specific order of operations, it would be easy to make this into a tree. Each leaf is a number, and each middle node is an operator.

This would then allow separating the evaluation from the actual parsing and determining order of operations. In this simple language it is extra work, but real compilers will mostly focus on manipulating and reading the structure of the Abstract Syntax Tree (Wikipedia) (AST) when creating their intermediate representation.

In order to save time and keep the code simple, let’s use the binarytree python library. I’m using version 6.5.1 for this exercise.

Once it’s installed, add this to the top:

1

from binarytree import Node

And then, re-write the main parsing loop to fold nodes together in a tree, rather than just doing the calculation:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22


# requires package: binarytree==6.5.1
from binarytree import Node, get_parent

def parse_tokens(tokens):
    # parse, but now build a tree
    nodes = []
    for token in tokens:
        # decide what the token of the string is. there are several options
        # 1. is it a number? append in on the stack
        if token['kind'] == 'int':
            nodes.append(Node(token['value']))
        # 2. is it an operator? then put the last children into a tree with it
        elif token['kind'] == "op":
            left = nodes.pop()
            right = nodes.pop()
            nodes.append(Node(token['value'], left=left, right=right))
        # 3. otherwise it should be whitespace. there is no other token type
        elif token['kind'] != 'space':
            print(" " * token['pos'] + "^")
            print("internal error: unknown token")
            break
    return nodes

At this point, doing print(nodes[-1]) – the last generated expression – should print the entire tree (assuming the user correctly completed it).

> 2 3 +

  +
 / \
3   2

> 2 3 4 + *

    __*
   /   \
  +     2
 / \
4   3

Thus, we have completed the parsing stage.

From this point on, our program can stop worrying about tokens, or syntax errors in the file, and instead, focus on semantics – what does the program mean, and what instructions should we generate to execute it.