Other personal projects, of both a software and a familial nature, have taken priority lately over my project to implement a BBC BASIC-a-like compiler for the .NET CLR, now called OWL BASIC in honour of the owl logo of the BBC’s Computer Literacy Project. Now I have some spare capacity to devote to the project, and my brother Ian – who is part-way through his Computing Science degree and is another long-time BBC/RISC OS/Acorn computing enthusiast – has joined the project; he’s starting work on the optional front-end detokenizer. We have opted to handle tokenized source files by detokenizing into text and then retokenizing with our lexer, the reason being that tokenized BBC BASIC files do not contain all of the tokens we need for our grammar, and detokenizing even very large programs is minimal overhead compared to the parsing and compilation.
In this post, I want to take a look at the structure of the PLY parser and the classes defining storage for the Abstract Syntax Tree that is the product of streaming the lexer tokens through the parser.
IronPython with PLY
PLY requires that the grammar rules are placed in Python docstrings within functions that follow a specific naming convention; these docstrings are then parsed during parser initialisation to build the required parsing tables. These tables are then cached in files for quicker start-up on subsequent runs. PLY stores its cached tables as large files of generated Python code containing some very long expressions. Unfortunately, these large expressions triggered pathological behaviour in the IronPython parser, including stack overflows (although they were fine with regular CPython), and so some modifications to the PLY source were required to use Python pickling, rather than code generation, for storage and retrieval of the cached parsing tables. Performance is now much better, with compiler start-up time reduced from many minutes to a few seconds.
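The idea behind the change can be sketched as follows – the function names here are illustrative, not PLY's actual internals: rather than emitting the tables as generated Python source which must be re-parsed, the tables are serialized with the pickle module and reloaded on subsequent runs.

```python
import os
import pickle

def save_tables(tables, cache_path):
    """Serialize the parsing tables with pickle rather than
    generating Python source code."""
    with open(cache_path, "wb") as f:
        pickle.dump(tables, f, pickle.HIGHEST_PROTOCOL)

def load_tables(cache_path):
    """Return the cached tables, or None if no cache exists yet."""
    if not os.path.exists(cache_path):
        return None
    with open(cache_path, "rb") as f:
        return pickle.load(f)
```

Loading a pickled data structure avoids running the table file through the IronPython parser entirely, which is what sidesteps the pathological behaviour.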
BBC BASIC grammar structure
I’ll start by pointing out that our current grammar definition is not perfect; even after the suitably caffeinated debugging session recommended in the PLY manual we still have 14 shift/reduce conflicts and 12 reduce/reduce conflicts. The former group are not serious, but the latter need to be addressed. There is no guarantee that it’s possible to parse BBC BASIC with a look-ahead LR (LALR) grammar, owing to the lack of a formal grammar definition for the language. The issues are resolvable, but may require stepping outside of the facilities provided by PLY – as we have done already for the caching of the parsing tables.
The main structural part of our grammar is as follows:
program : statement_list;
statement_list : compound_statement
| statement_list compound_statement;
compound_statement : lone_stmt_body stmt_terminator
| multi_statement stmt_terminator;
multi_statement : statement statements_tail;
statements_tail : statement_separator statement statements_tail
statement_separator : COLON;
stmt_terminator : EOL;
statement : stmt_body
lone_stmt_body : case_stmt;
stmt_body : beats_stmt
We have two recursively defined rules here:
statement_list is defined in terms of itself, as is
statements_tail. The former is used for collecting lists of statements or compound statements separated by line-breaks. The latter is used for accumulating the statements forming compound statements separated by colons. The correct grammar for this was awkward to determine because many published grammars are for C-style languages which use statement terminators (e.g. the semicolon), whereas BASIC uses Pascal-style statement separators, which can run in contiguous sequences with empty statements. Some inspiration from a Pascal grammar and experimentation with UltraGram led to a workable solution. A minority of statements in BBC BASIC, such as
CASE, must occupy a line of their own, which must be handled in the grammar as
lone_stmt_body – the less fussy majority of statements make up stmt_body.
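A minimal sketch of how such a left-recursive list rule looks when encoded for PLY – using plain Python lists as the accumulated value rather than OWL BASIC's actual AST node classes – is:

```python
def p_statement_list(p):
    '''statement_list : compound_statement
                      | statement_list compound_statement'''
    if len(p) == 2:
        # Base case: a single compound statement starts the list
        p[0] = [p[1]]
    else:
        # Recursive case: append the new statement to the accumulated list
        p[0] = p[1] + [p[2]]
```

PLY discovers this function by its p_ prefix and parses the docstring to build its tables; when the rule is reduced, p[1] and p[2] hold the values produced by the sub-rules, and the rule communicates its own result by assigning to p[0].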
Abstract Syntax Tree construction
As the parser matches each rule, an Abstract Syntax Tree (AST) node is generated. I’ll use the
BPUT rule as a simple example which shows many features:
def p_bput_stmt(p):
    '''bput_stmt : BPUT channel COMMA expr
                 | BPUT channel COMMA expr SEMICOLON'''
    if len(p) == 5:
        p[0] = Bput(channel = p[2], data = p[4], newline=True)
    elif len(p) == 6:
        p[0] = Bput(channel = p[2], data = p[4], newline=False)
    p[0].lineNum = p.lineno(1)
Grammar rule encoding
PLY encodes grammar rules in the docstrings of module-level functions, each of which accepts p, a sequence containing the values of the productions of other rules referenced, such as
channel, and of terminals (i.e. tokens) such as BPUT. The
BPUT token represents the string “BPUT” in the source, without the following hash. The way the manual is written seems to suggest that the hash is part of the
BPUT keyword, but simple experimentation with the BASIC interpreter, by inserting a space as in
BPUT #f%, 65, shows that the hash is a prefix to the integer representing a channel number.
The BPUT command has two syntaxes: the first form accepts a channel number (a single value prefixed by a hash) and an expression; the second form additionally accepts a trailing semicolon after the expression. There is no mention here of the type of the expressions – type-checking is a phase distinct from parsing which comes later. These two forms of BPUT are not one-for-one equivalent with the two forms given in the BBC BASIC guide,
since our first rule matches syntaxes 1 and 2 (without the semicolon) and our second rule matches only syntax 2 (with the semicolon), illustrating how the syntax information in the manual must be converted into a YACC grammar suitable for PLY.
Creating the AST node
Crucially, our two syntax rules are of different lengths, allowing us to use the length to determine which rule was matched. By PLY convention, the rule must place its resulting AST node in
p[0]. For either match we create a
Bput AST node – a Python object representing
BPUT – and configure it with keyword parameters for the channel number, the data to be output, and a flag indicating whether a newline should be output following the data. (N.b. this also depends on the type of
data – we only output a newline for string data – but we account for that later in the type-checker once we have determined the type.) Each of these items will itself be the root of an AST subtree of nodes representing anything from a fairly mundane literal integer to a complex expression. As a final step in configuring our new AST node, we attach line number information, which is only available from terminal tokens – in this case we use the line number from the
BPUT token itself.
Declaring the AST node classes
Our requirements for our AST node classes are as follows:
- uniform interface to facilitate AST traversal for the later compiler stages
- convenient syntax for declaring the structure of the many different types of AST nodes the language requires
- convenient syntax for initializing new AST nodes as they are constructed by the grammar rules
- attach formal type information and other information such as line numbers to the nodes
Each AST node class started out being written longhand, but it was clear that this was going to mean a lot of boilerplate code for each class, in functions which could not easily be factored out into base classes, such as
__init__ methods. What was needed was a code generator for producing the classes based upon a simple specification, and this is where Python metaclasses come into their own.
Our specification for
BPUT is something like this,
- A sub-node expression evaluating to a channel integer
- A sub-node expression evaluating to a string or a number (i.e. a non-array type – a scalar)
- A flag indicating whether a newline should be appended after the data
which we encode in Python using a declarative style as:
class AstNode(object):
    "AstNode - base class for all AST nodes"
    __metaclass__ = AstMeta
    formal_type = TypeOption()
    actual_type = TypeOption()
    line_num = IntegerOption()
    # ... Code omitted here ...

class AstStatement(AstNode):
    "AstStatement - base class for all statements"
    formal_type = TypeOption(VoidType) # All statements have a type of 'void'
    actual_type = formal_type

class Bput(AstStatement):
    channel = Node(formalType=ChannelType)
    data = Node(formalType=IntegerType)
    newline = BoolOption(False)
The work in transforming this declarative style into the boilerplate code we need is done by the
AstMeta metaclass which introspects each class as it is constructed and converts our declarative style class attributes above, such as
Bput.channel, into the getters, setters, properties, initializers and storage we need in the resulting class. Our declared
Nodes can accept AST subtrees, whereas our
Option types store simple integer, string, or bool properties attached to the node.
Thus, the contents of the
Bput class that is generated are quite different to what is shown above. When the Bput class definition is executed, its metaclass actually produces something which behaves like this:
class AstNode(object):
    "AstNode - base class for all AST nodes"
    def __init__(self, lineNum = None):
        self.lineNum = lineNum
    # Line Number
    def _getLineNum(self): return self._line_num
    def _setLineNum(self, line_num): self._line_num = line_num
    lineNum = property(_getLineNum, _setLineNum)

class AstStatement(AstNode):
    "AstStatement - base class for all statements"
    def __init__(self, formalType = None, actualType = None, **kwargs):
        super(AstStatement, self).__init__(**kwargs)
        self.formalType = formalType
        self.actualType = actualType
    # Formal Type
    def _getFormalType(self): return self._formal_type
    def _setFormalType(self, formal_type): self._formal_type = formal_type
    formalType = property(_getFormalType, _setFormalType)
    # Actual Type
    def _getActualType(self): return self._actual_type
    def _setActualType(self, actual_type): self._actual_type = actual_type
    actualType = property(_getActualType, _setActualType)

class Bput(AstStatement):
    def _getChannel(self): return self._channel
    def _setChannel(self, channel): self._channel = channel
    channel = property(_getChannel, _setChannel)
    def _getData(self): return self._data
    def _setData(self, data): self._data = data
    data = property(_getData, _setData)
    def _getNewline(self): return self._newline
    def _setNewline(self, newline): self._newline = newline
    newline = property(_getNewline, _setNewline)
This provides a uniform programming interface for the Bput object, allowing not only simple manipulation of the Bput object itself, but also a standard interface to its child nodes, allowing simple traversal by the various visitors which transform and analyse the AST in the later stages of the compiler.
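A much-simplified sketch of the kind of transformation such a metaclass can perform follows – this is not OWL BASIC's actual AstMeta (which also handles type options and richer initializer generation), and it uses Python 3 metaclass syntax rather than the __metaclass__ attribute shown above:

```python
class Option(object):
    "Declarative marker: a simple attribute with a default value."
    def __init__(self, default=None):
        self.default = default

def _make_property(name):
    "Build a property backed by a private attribute."
    private = "_" + name
    def getter(self):
        return getattr(self, private)
    def setter(self, value):
        setattr(self, private, value)
    return property(getter, setter)

class AstMeta(type):
    "Replace each Option class attribute with a generated property."
    def __new__(mcs, name, bases, namespace):
        options = {k: v for k, v in namespace.items() if isinstance(v, Option)}
        for attr in options:
            namespace[attr] = _make_property(attr)
        cls = type.__new__(mcs, name, bases, namespace)
        # Merge defaults from base classes so subclasses inherit them
        defaults = {}
        for base in bases:
            defaults.update(getattr(base, "_defaults", {}))
        defaults.update((k, v.default) for k, v in options.items())
        cls._defaults = defaults
        return cls

class AstNode(object, metaclass=AstMeta):
    line_num = Option()
    def __init__(self, **kwargs):
        # Initialize every declared option from its default or a keyword argument
        for attr, default in self._defaults.items():
            setattr(self, attr, kwargs.pop(attr, default))

class Bput(AstNode):
    newline = Option(False)
```

The key point is that the declarative class body is inspected once, at class-creation time, and rewritten into properties and initializer data, so each concrete node class needs no hand-written boilerplate.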
Visiting the AST
Once the AST has been created by the parser, it can be analysed or manipulated by a visitor. One such visitor, which is an aid to debugging, transcribes each AST node into an XML representation. This is useful since it allows the AST to be conveniently examined using standard XML-capable tools (e.g. a web browser).
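As a sketch of the idea – assuming a simplified node interface of a tag name, a dictionary of simple attributes and a list of children, rather than OWL BASIC's real node classes – such a visitor might look like this:

```python
class Node(object):
    "A stand-in AST node: a tag name, simple attributes and child subtrees."
    def __init__(self, name, attributes=None, children=None):
        self.name = name
        self.attributes = attributes or {}
        self.children = children or []

def to_xml(node, indent=0):
    "Recursively render an AST subtree as indented XML."
    pad = "  " * indent
    attrs = "".join(' %s="%s"' % (k, v) for k, v in sorted(node.attributes.items()))
    if not node.children:
        return '%s<%s%s />' % (pad, node.name, attrs)
    inner = "\n".join(to_xml(child, indent + 1) for child in node.children)
    return '%s<%s%s>\n%s\n%s</%s>' % (pad, node.name, attrs, inner, pad, node.name)
```

In the real compiler the same traversal machinery drives the simplification and type-checking passes; the XML renderer is simply the easiest visitor to inspect by eye.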
The following code, taken from line 374 of Acornsoft’s Sphinx Adventure,
374 IF ABS(L-26.5)<1 AND O?34=26 THEN PROCR(27): GOTO378
results in the following AST when rendered as XML, after various other visitor passes have performed some simplification and type-checking:
<If line_num="374" actual_type="Void" formal_type="Void">
  <CallProcedure line_num="374" actual_type="Void" formal_type="Void" name="R">
    <LiteralInteger line_num="374" value="27" actual_type="Integer" />
  </CallProcedure>
  <Goto line_num="374" actual_type="Void" formal_type="Void">
    <LiteralInteger line_num="374" value="378" actual_type="Integer" />
  </Goto>
  <And line_num="374" actual_type="Integer" formal_type="Integer">
    <LessThan line_num="374" actual_type="Integer" formal_type="Integer">
      <Minus line_num="374" actual_type="Float">
        <Variable line_num="374" actual_type="Float" identifier="L" />
        <LiteralFloat line_num="374" value="26.5" actual_type="Float" />
      </Minus>
      <LiteralInteger line_num="374" value="1" actual_type="Integer" />
    </LessThan>
    <Equal line_num="374" actual_type="Integer" formal_type="Integer">
      <DyadicByteIndirection line_num="374" actual_type="Byte" formal_type="Byte">
        <Cast line_num="374" actual_type="Address" source_type="Float" target_type="Address" formal_type="Address">
          <Variable line_num="374" actual_type="Float" identifier="O" />
        </Cast>
        <LiteralInteger line_num="374" value="34" actual_type="Integer" />
      </DyadicByteIndirection>
      <LiteralInteger line_num="374" value="26" actual_type="Integer" />
    </Equal>
  </And>
</If>